General guidelines for quality assurance of open data
Avoid non-processable data formats
Share data in an open, machine-processable format that can be reused without manual conversion.
Use a standardised character encoding
Use a widely supported international character encoding, such as UTF-8.
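As a minimal sketch of this recommendation, the Python snippet below writes a CSV file explicitly as UTF-8 and checks whether an existing file decodes as UTF-8. The function names write_utf8_csv and is_valid_utf8 are illustrative assumptions, not part of any standard.

```python
import csv
from pathlib import Path


def write_utf8_csv(path, header, rows):
    """Write a header and rows to a CSV file encoded as UTF-8."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Stating the encoding explicitly avoids depending on the platform default, which can differ between operating systems.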
Proper naming of columns
Use only lower-case characters. Do not use special characters, accents or punctuation marks, and replace spaces with hyphens. List every field and its specification in the data dictionary.
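The naming rules could be applied automatically along these lines. This is a sketch in Python; normalise_column_name is a hypothetical helper, and dropping (rather than transliterating) characters outside a-z, 0-9 and the hyphen is an assumption.

```python
import re
import unicodedata


def normalise_column_name(name):
    """Lower-case a column name, strip accents and symbols, hyphenate spaces."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    lowered = without_accents.strip().lower()
    # Replace each run of whitespace with a single hyphen.
    hyphenated = re.sub(r"\s+", "-", lowered)
    # Drop anything that is not an ASCII letter, digit or hyphen.
    cleaned = re.sub(r"[^a-z0-9-]", "", hyphenated)
    # Remove hyphens left dangling at either end.
    return cleaned.strip("-")
```

For example, normalise_column_name("Fecha de Publicación") returns "fecha-de-publicacion".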
Avoid missing values
To avoid confusion, the publisher should clearly mark missing values as null values (N/A).
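One way to apply this consistently is to map every ambiguous placeholder to a single marker before publishing. The Python sketch below assumes rows are dictionaries; the placeholder set and the name mark_missing are hypothetical and should be adapted to the dataset.

```python
MISSING_MARKER = "N/A"

# Hypothetical placeholders that publishers sometimes use for "no value".
AMBIGUOUS_PLACEHOLDERS = {"", "-", "?", "null", "none", "n/a", "na"}


def mark_missing(row):
    """Return a copy of the row with missing values set to MISSING_MARKER."""
    cleaned = {}
    for field, value in row.items():
        text = "" if value is None else str(value).strip().lower()
        if text in AMBIGUOUS_PLACEHOLDERS:
            cleaned[field] = MISSING_MARKER
        else:
            # Real values, including 0, are kept unchanged.
            cleaned[field] = value
    return cleaned
```

Note that a legitimate zero is preserved: only the listed placeholders and None are treated as missing.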
Avoid duplication of records
Standardise data collection and storage, centralising the process in a single information system, so that duplications are easily detectable and can be automatically eliminated.
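Once records sit in one place, duplicate detection can be as simple as the Python sketch below. It assumes records are dictionaries and that a set of key fields identifies a record; drop_duplicates and the whitespace and case normalisation of keys are illustrative choices, not a prescribed method.

```python
def drop_duplicates(records, key_fields):
    """Keep the first record for each distinct combination of key fields."""
    seen = set()
    unique = []
    for record in records:
        # Compare keys after trimming whitespace and ignoring case.
        key = tuple(str(record[f]).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```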
Standardising data values
To standardise the structure and values of fields, it is advisable to use reference vocabularies. The structure should be documented in the data dictionary.
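A check against a reference vocabulary might look like the following Python sketch. The field name and vocabulary in the usage note are invented for illustration; in practice the allowed values would come from the chosen reference vocabulary.

```python
def find_invalid_values(records, field, vocabulary):
    """Return (row index, value) pairs whose value is not in the vocabulary."""
    return [
        (index, record[field])
        for index, record in enumerate(records)
        if record[field] not in vocabulary
    ]
```

With a hypothetical vocabulary {"open", "closed"} for a "status" field, a record holding "Open " would be reported, which flags the value for correction before publication.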
Provide an adequate amount of data to facilitate analysis
Publishers should ensure that a reasonable amount of data is published so that there is sufficient context.
Formatting of date
Dates and times must always be encoded using the ISO 8601 standard, i.e. yyyy-mm-dd for dates and hh:mm:ss for times.
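Converting a locally formatted timestamp to this form can be sketched with Python's standard datetime module. The helper name to_iso is hypothetical, and the source format must be supplied by the publisher because it cannot be guessed reliably.

```python
from datetime import datetime


def to_iso(value, source_format):
    """Parse a timestamp and return (date, time) as ISO 8601 strings."""
    parsed = datetime.strptime(value, source_format)
    # isoformat() yields yyyy-mm-dd for dates and hh:mm:ss for times
    # when there are no fractional seconds.
    return parsed.date().isoformat(), parsed.time().isoformat()
```

For example, to_iso("31/12/2023 18:05", "%d/%m/%Y %H:%M") returns ("2023-12-31", "18:05:00").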
Formatting of numeric data
Use the point as the decimal separator (internationalisation). Avoid thousands separators. Mark negative values with a leading minus sign (-). In columns with integer values, do not use decimal separators, and do not mix text with numeric values.
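As a sketch of these rules, the Python function below rewrites a locally formatted number with a decimal point, no thousands separator and an ASCII minus sign. The name normalise_number and the default separators (comma for decimals, point for thousands) are assumptions about the source data.

```python
from decimal import Decimal


def normalise_number(text, decimal_sep=",", thousands_sep="."):
    """Return the number as a string using '.' for decimals and no grouping."""
    cleaned = text.strip().replace(thousands_sep, "")
    cleaned = cleaned.replace(decimal_sep, ".")
    # Accept the Unicode minus sign as well as the ASCII hyphen-minus.
    cleaned = cleaned.replace("\u2212", "-")
    # Decimal raises decimal.InvalidOperation if text is mixed with digits.
    return format(Decimal(cleaned), "f")
```

For example, normalise_number("1.234,56") returns "1234.56", while a value such as "12 kg" raises an error instead of being published silently.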
Avoid mixing numerical scales
Keep the scale unchanged over time. If a change of scale is unavoidable, provide the data in both scales and document the change.
Avoid mixing ranges in the same dataset
Publish data at the highest level of disaggregation. If this is not possible, maintain consistency across all values of the variable.
Incorporate variables with geographic information
Publish data with geographical coordinates in two independent columns: «latitude» and «longitude».
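Where a source stores both coordinates in one field, they could be separated as in the Python sketch below. It assumes a "latitude, longitude" order in decimal degrees; split_coordinates is a hypothetical helper, and the range check is an added safeguard rather than part of the guideline.

```python
def split_coordinates(value):
    """Split 'lat, lon' text into separate latitude and longitude values."""
    lat_text, lon_text = value.split(",")
    latitude, longitude = float(lat_text), float(lon_text)
    # Reject values outside the valid ranges for decimal degrees.
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    return {"latitude": latitude, "longitude": longitude}
```

For example, split_coordinates("40.4168, -3.7038") returns {"latitude": 40.4168, "longitude": -3.7038}.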
Avoid incorporating subtotals, totals or groupings
Present the highest possible level of disaggregation of the data.
Avoid fragmented and hard-to-locate data
Improve the organisation and labelling of content, and establish connections between related datasets.