Preparing datasets before making them available

Datasets must be released in a machine readable, reusable and open format, and personal, health and/or confidential information must be de-identified and aggregated.

This chapter contains the following key actions:

  1. Datasets must be released in a machine readable, reusable and open format.
  2. Personal, health and/or confidential information must be de-identified and aggregated.

5.1 Selecting a format

Datasets must be made available via the web in a machine readable open format.

Machine readable refers to a medium that stores data in a format that is readable by an electronic device and is therefore more usable, able to be manipulated and complies with accessibility standards. A machine readable file has a structure which allows easy interrogation of the contents. Machine readable open data can be used with spreadsheet software, statistics software, and custom written code. By releasing raw datasets in open standard formats, more people will be able to use and reuse them without having to invest in special software.

Extracting data from unstructured documents (for example, PDFs) is enormously time consuming, can be the most laborious part of a data driven research project, and can introduce unnecessary errors.

A dataset may also be made available in its original format (for example, PDF) where that format makes the dataset easier to understand or reuse. It is often useful to have a well formatted document to cross reference and sense check the machine readable data. The original document should be referenced within the metadata description of the machine readable file. The original document can also be referenced in the description of the dataset on the data directory.

5.1.1 Open standard formats

An open format is a specification for storing and manipulating content that any developer may use, that is usually maintained by a standards organisation, and that is not locked into a proprietary product.

The preferred open formats for datasets to be made available are:

  • CSV (comma separated values) for simple spreadsheets and simple databases. Note: CSV files can be previewed within the Data Directory without the need to download the file, allowing end users to decide if the file is suitable for their purposes.
  • XML (extensible markup language) – a general purpose markup language for complex datasets, standardised by the World Wide Web Consortium (W3C), the main international standards organisation for the World Wide Web.
  • XBRL (extensible business reporting language) – a freely available, global, standards based way to communicate and exchange business information between business systems.

Note: Excel and most databases allow users to export files to CSV or XML.
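
As an illustration of the two general purpose formats above, the following minimal Python sketch writes the same records as CSV and as XML using only the standard library. The records, field names and element names (`dataset`, `record`) are hypothetical and chosen for illustration only; they are not prescribed by the Policy.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical records, for illustration only.
RECORDS = [
    {"station": "A01", "year": "2019", "rainfall_mm": "512.4"},
    {"station": "B07", "year": "2019", "rainfall_mm": "640.0"},
]


def to_csv(records):
    """Return the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0]), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_xml(records):
    """Return the records as XML text, one <record> element per row."""
    root = ET.Element("dataset")
    for row in records:
        record = ET.SubElement(root, "record")
        for name, value in row.items():
            ET.SubElement(record, name).text = value
    return ET.tostring(root, encoding="unicode")


csv_text = to_csv(RECORDS)
xml_text = to_xml(RECORDS)
```

Either output can be read back with the same standard library modules (`csv.DictReader`, `ET.fromstring`), which is what makes these formats straightforward to reuse without special software.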

Open standards for spatial data are maintained by the Open Geospatial Consortium and include:

  • KML (formerly Keyhole Markup Language) – an XML language focused on geographic visualisation, including annotation of maps and images
  • WMS (Web Map Service) – a protocol that allows georeferenced map images to be served over the web
  • WFS (Web Feature Service) – allows requests for geographical features to be made across the web
  • WCS (Web Coverage Service Interface Standard) – provides access to coverage data in forms that are useful for client side rendering, as input into scientific models, and for other clients
  • ESRI Shapefile – a geospatial vector data format for geographic information systems software (an openly published specification maintained by Esri rather than by the Open Geospatial Consortium)
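
To illustrate how a map image is requested from a WMS service, the sketch below builds a WMS 1.3.0 GetMap request URL with the Python standard library. The endpoint `https://example.org/wms` and the layer name `example_layer` are hypothetical placeholders; the query parameters are the standard GetMap parameters, and `CRS:84` is used here because it takes coordinates in longitude, latitude order.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the address of a real WMS service.
WMS_ENDPOINT = "https://example.org/wms"


def build_getmap_url(layer, bbox, width=800, height=600):
    """Build a WMS 1.3.0 GetMap URL for a PNG image of one layer.

    bbox is (min_lon, min_lat, max_lon, max_lat) in CRS:84.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value requests the default style
        "CRS": "CRS:84",
        "BBOX": ",".join(str(value) for value in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return WMS_ENDPOINT + "?" + urlencode(params)


# Hypothetical layer name and a bounding box chosen for illustration.
url = build_getmap_url("example_layer", (144.5, -38.5, 145.5, -37.5))
```

The layers and coordinate reference systems a given service actually supports are listed in its GetCapabilities response, so real values should be taken from there.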

5.1.2 Proprietary formats

A proprietary format is typically a file format that is restricted for use by a company or organisation. The restrictions can include control of the specification of the encoding format, or licences that only the company or its licensees may use. Proprietary formats are usually discouraged for release and every effort should be made to release data in an open format. If, however, the main users require specific proprietary formats, then those formats will need to be considered for release.

The following formats have already been considered and deemed acceptable for release under the Policy due to their common usage. An open format should always be considered for release as well.

  • XLS and XLSX (Excel Workbook) – the main spreadsheet format, which holds data in worksheets, charts and macros
  • GTFS (General Transit Feed Specification) defines a common format for public transportation schedules and associated geographic information

5.1.3 Application Programming Interface

An Application Programming Interface (API) is a set of procedures, protocols and tools used in developing software applications. An API makes it easier for the programmer to develop software by providing all the building blocks necessary to connect to and interrogate the dataset. For developers who build upon web services, an API makes it easier to take advantage of external services and data to enhance their own offering.

APIs are particularly suitable for real time or regularly updated information.
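
As a rough sketch of what interrogating a dataset through an API can look like, the Python example below builds a request URL, fetches a JSON response and extracts the records. The endpoint `https://example.org/api/records`, the `dataset` and `limit` parameters, and the `{"records": [...]}` response shape are all assumptions made for illustration; a real API will document its own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical API endpoint; not a real service.
API_URL = "https://example.org/api/records"


def build_query_url(dataset_id, limit=100):
    """Build the request URL for up to `limit` records of a dataset."""
    return API_URL + "?" + urlencode({"dataset": dataset_id, "limit": limit})


def parse_records(body):
    """Extract the list of records from a JSON response body.

    Assumes a response shaped like {"records": [...]}.
    """
    return json.loads(body)["records"]


def fetch_records(dataset_id, limit=100):
    """Request records over HTTP and return them as a list of dicts."""
    with urlopen(build_query_url(dataset_id, limit), timeout=30) as response:
        return parse_records(response.read().decode("utf-8"))


# Parsing is demonstrated on a sample body so no network access is needed.
sample_body = '{"records": [{"id": 1, "value": 3.5}, {"id": 2, "value": 4.0}]}'
sample_records = parse_records(sample_body)
```

Because each call returns the data as it stands at the time of the request, a consumer can repeat `fetch_records` on a schedule rather than downloading a new file for every update.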

5.2 De-identifying and aggregating data

De-identified datasets are datasets where any information linking the data to an individual or business is deleted or modified to remove the capacity for identification.

The Public Record Office Victoria (PROV) provides guidance on personal and private records, as named in section 9 of the Public Records Act 1973. The relevant PROV fact sheet notes that information described as personal or private records ‘covers such material as personnel records, medical records, police and prison records and case records concerning students, welfare recipients, children in government care or compensation claimants’.25

Agencies must document and adhere to a formal procedure to ensure that datasets containing personal26, health27 and/or confidential information are correctly and consistently de-identified and/or aggregated before they are made available under the Policy.

For simple datasets, the documented procedure could be in the form of a checklist to be completed when each dataset is made available.

This procedure must not be shared with third parties or released with the datasets, as knowledge of the procedure could potentially allow re-identification.

An example of identified and de-identified datasets can be seen below. The difference between Table 3 and Table 4 is the removal of specific age and location detail.

Table 3: Identifiable data
Age (Years)   Gender   Location   Diagnosis
16            Male     3844       Broken Arm
20            Female   3170       Diabetes
44            Male     3166       Heart Disease
93            Female   3666       Arthritis

Table 4: De-identified data based on generalisation of the information
Age (Years)   Gender   Location       Diagnosis
Under 30      Male     Regional       Broken Arm
Under 30      Female   Metropolitan   Diabetes
30-59         Male     Metropolitan   Heart Disease
60+           Female   Regional       Arthritis

Note: The method used to de-identify the table above is not applicable to all circumstances. It is used as an illustration only.
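
The generalisation from Table 3 to Table 4 can be sketched in Python as follows. This is an illustration only, not a recommended procedure: it treats the Location values in Table 3 as postcodes (an assumption), and the `METROPOLITAN_POSTCODES` lookup is a hypothetical stand-in that covers only the four rows shown.

```python
# Rows from Table 3: (age in years, gender, location, diagnosis).
IDENTIFIABLE = [
    (16, "Male", "3844", "Broken Arm"),
    (20, "Female", "3170", "Diabetes"),
    (44, "Male", "3166", "Heart Disease"),
    (93, "Female", "3666", "Arthritis"),
]

# Hypothetical lookup covering only the locations in Table 3. A real
# procedure would need an authoritative location-to-region classification.
METROPOLITAN_POSTCODES = {"3170", "3166"}


def age_band(age):
    """Replace an exact age with the broad bands used in Table 4."""
    if age < 30:
        return "Under 30"
    if age < 60:
        return "30-59"
    return "60+"


def region(postcode):
    """Replace a specific location with a broad region."""
    if postcode in METROPOLITAN_POSTCODES:
        return "Metropolitan"
    return "Regional"


def generalise(rows):
    """Apply both generalisations to every row."""
    return [
        (age_band(age), gender, region(postcode), diagnosis)
        for age, gender, postcode, diagnosis in rows
    ]


deidentified = generalise(IDENTIFIABLE)
```

Applied to the rows of Table 3, `generalise` produces the rows of Table 4. Whether bands this wide are sufficient depends on the dataset and on what other information is available about the same individuals.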

5.2.1 How do I de-identify a dataset?

For detailed information on de-identification, agencies can refer to a number of expert sources, including the Australian Bureau of Statistics (ABS), National ICT Australia (NICTA) and the Commonwealth Scientific and Industrial Research Organisation (CSIRO).

ABS provides guidance on confidentialised unit record files (CURFs).28 CURFs are confidentialised by removing name and address information, by controlling and limiting the amount of detail available, and by very slightly modifying or deleting data where it is likely to enable identification of individuals or businesses.

The ABS and the Australian Government have a number of fact sheets to support de-identification, in particular from the National Statistical Service. These include:

  • National Statistical Service – How to confidentialise data: the basic principles.29
  • National Statistical Service – 11. Confidentiality and Privacy.30
  • Frequently Asked Questions – About CURFs – How is CURF data confidentialised?31

5.2.2 How do I avoid the re-identification of data?

As well as de-identifying information, effort should be made to ensure the risk of potential re-identification is low. This will include:

  • dealing with small cell sizes and points at which data is aggregated
  • ensuring that the information made available cannot be linked to other publicly available data in a way that increases the risk of potential re-identification
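
One common way of dealing with small cell sizes is to suppress any aggregated count that falls below a minimum threshold. The Python sketch below shows the idea; the threshold of 5 and the category values are hypothetical, and the appropriate threshold is a matter for each agency's documented procedure.

```python
from collections import Counter

# Hypothetical minimum cell size; not a value set by the Policy.
MIN_CELL_SIZE = 5


def aggregate_with_suppression(rows, min_cell_size=MIN_CELL_SIZE):
    """Count rows per category combination, suppressing small cells.

    Returns a dict mapping each combination to its count, or to None
    where the count is below the minimum cell size.
    """
    counts = Counter(rows)
    return {
        cell: (count if count >= min_cell_size else None)
        for cell, count in counts.items()
    }


# Illustrative rows of (age band, region).
rows = [("Under 30", "Metropolitan")] * 7 + [("60+", "Regional")] * 2
table = aggregate_with_suppression(rows)
```

Here the cell with two records is suppressed while the cell with seven is published. Suppression on its own may not be enough where row or column totals are also released, because a suppressed value can sometimes be recovered by subtraction.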

5.3 Preparing a data quality statement

When choosing to use a dataset, users need to assess if it is fit for purpose. A key factor in this decision is often the quality of the data. A data quality statement is an effective mechanism for communicating information about how data can be used, and can act as a qualifier or disclaimer for a dataset.

Data quality statements are an important risk management tool against datasets being misrepresented or misused. The Policy therefore recommends that one is created.

A data quality statement can also be used to define key terms and identify any data standards which have been used.

5.3.1 How do I create a data quality statement?

It is recommended that agencies use the online data quality tool provided by the ABS, through its National Statistical Service (NSS), to create a data quality statement.

The data quality statement tool can be located on the ABS website.

The tool outlines the seven dimensions that demonstrate fitness for purpose and provides users with a step by step guide to assessing the quality of their data. Additional information on these dimensions can be found at 1520.0 – ABS Data Quality Framework, May 2009.32
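
As a small illustration of working through the seven dimensions, the Python sketch below checks a draft statement for dimensions that have not yet been addressed. The dimension names follow the ABS Data Quality Framework; the dictionary layout, the `missing_dimensions` helper and the draft text are hypothetical and are not part of the ABS tool.

```python
# The seven dimensions named in the ABS Data Quality Framework.
DIMENSIONS = (
    "Institutional environment",
    "Relevance",
    "Timeliness",
    "Accuracy",
    "Coherence",
    "Interpretability",
    "Accessibility",
)


def missing_dimensions(statement):
    """Return the dimensions that have no text in a draft statement."""
    return [
        name for name in DIMENSIONS
        if not statement.get(name, "").strip()
    ]


# Hypothetical draft with only two dimensions completed.
draft = {
    "Relevance": "Counts of service requests by region.",
    "Timeliness": "Updated quarterly.",
    "Accuracy": "   ",  # whitespace only, so treated as not addressed
}
gaps = missing_dimensions(draft)
```

A check of this kind only confirms that every dimension has been addressed; the quality of what is written under each heading still needs to be assessed by someone who knows the dataset.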

A template data quality statement produced by this tool is included at Appendix 2: Data quality statement.

It is important to note that other data quality statement processes are available for public sector use. Holders of spatial datasets may wish to use the ISO Standards approach to data quality statements, outlined in the Victorian Spatial Council’s Spatial Information Data Quality Guidelines33. This document provides general principles for data quality, rather than a tool for agencies to apply to datasets. The Department of Environment, Land, Water and Planning has expertise in data quality for spatial datasets and can be contacted for information on this.

25. Public Record Office Victoria, Fact Sheet: Closure of Public Records under Section 9 of the Public Records Act 1973.

26. Personal information as defined by the Privacy and Data Protection Act 2014.

27. Health information as defined by the Victorian Health Records Act 2001.







Reviewed 29 December 2019
