Characteristics of Open Data CSV Files
Abstract: This work analyzes an Open Data corpus containing 200K tabular resources with a total file size of 413 GB from a data consumer perspective. Our study shows that ∼10 % of the resources in Open Data portals are labelled as tabular data, of which only 50 % can be considered CSV files. The study inspects the general shape of these tabular data, reports on column and row distributions, and analyzes the availability of (multiple) header rows and whether a file contains multiple tables. In addition, we inspect and analyze the table column types, detect missing values, and report on the distribution of the values.
Authors: Johann Mitlöhner, Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres, Vienna University of Economics and Business, Vienna, Austria
Conference: The 2nd International Conference on Open and Big Data (OBD 2016), 22-24 August 2016, Vienna, Austria