This document is a checklist for talking to partners about sharing data. It is simply a list of reasons to share. As we gain more experience we may reorder the points below from the most to the least popular reason.
Sharing data does not necessarily mean sharing with the general public. The audience could be a special interest group, a unit within an organisation, legally joined entities, project partners (including a consortium), or other groups. Please bear this in mind when reading the motivations below.
1. Farm out potential problems
Probably the most common motivation for externalising a dataset. “More eyes on the data” hopefully leads to new ideas, takes and perspectives on a dataset, simply through the sheer number of analysts. One example is a ‘challenge’ such as Data for Development (D4D) by Orange in Ivory Coast, which released data for tracking human mobility. The KDD conference releases data for competitions (via Kaggle), which come with the prestige of presenting findings and winning awards. Recommendation systems seem to benefit from large, open datasets, and competitions have featured datasets from Spotify. Smaller datasets are useful for practising and for building toolchains, experience and confidence, as well as for meeting common problems in ML such as dividing data into training and deployment sets. MNIST, a set of images of handwritten digits, has been widely used in this way for practising character recognition algorithms.
2. Improve internal data practices
Sharing encourages an organisation to produce high-quality, correct data. One example is database schemas that contain redundant features; removing such redundancy is called, in database parlance, [normalisation]. Another motivation might be for an entity to (re)organise its data internally so that it is suitable for sharing, e.g. into two groups: i) exportable for analytics, ii) internal use only.
Exporting higher-quality data that is better suited to data scientists is coupled to the idea of “Data Readiness”, coined by Neil Lawrence at Sheffield University, UK [readiness]. Essentially, this is a technology readiness level (TRL) for data. Typically, cleaning (or wrangling) data consumes 80% of the project time whilst analytics takes only 20% (a Pareto rule). RI.SE has a version of data readiness levels augmented by examples from industry [bada].
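The split into an exportable and an internal group can be sketched as below. This is a minimal illustration only: the field names and the `EXPORTABLE_FIELDS` set are hypothetical, and a real organisation would derive the classification from its own data policy.

```python
# Hypothetical classification: which fields may leave the organisation.
EXPORTABLE_FIELDS = {"timestamp", "region", "duration_s"}


def split_record(record, exportable=EXPORTABLE_FIELDS):
    """Return (exportable_part, internal_part) of a dict record."""
    shared = {k: v for k, v in record.items() if k in exportable}
    internal = {k: v for k, v in record.items() if k not in exportable}
    return shared, internal


# Invented example record, not real data.
record = {
    "timestamp": "2019-05-01T12:00:00",
    "region": "north",
    "duration_s": 42,
    "customer_id": "C-1001",
    "billing_ref": "B-77",
}
shared, internal = split_record(record)
```

Using an explicit allow-list (rather than a block-list) means that a newly added field stays internal until someone decides it can be exported.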
3. Productivity-based cleaning
Cleaning increases the value of the data for external parties. Removing, annotating and reformatting fields saves a large amount of storage, transmission and processing. Although seemingly trivial, reducing the size and the number of features can save precious resources: not just disk space but also infrastructure such as multi-rack systems and GPUs, plus a whole new set of practices, for example big-data processing with Spark versus processing on a single machine. Deciding on a human-readable (e.g. text, CSV) or machine-readable (binary, Parquet) format can have a significant impact on an ML pipeline, including on the people in an organisation’s workforce.
Fields that are self-annotated, or otherwise suitable for analysts, can often make the data much more valuable. An example from the Swedish road traffic authority is converting an internally used road position marker to GPS coordinates (and vice versa). Not only does this allow lookups in traditional GIS systems, it also facilitates the inclusion of other data, for example road and weather conditions, as a “data as a service”, and makes the original data source much more valuable as a resource. A company that usefully aggregates data is recordedfuture.com.
ML methods, e.g. PCA, work best with ‘clean’ data.
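The effect of the format choice on size can be illustrated with a small sketch. It uses only the Python standard library, so a fixed-width binary layout built with `struct` stands in for a real binary format such as Parquet (which needs a third-party library and also adds compression and a schema). The rows are invented, and the actual savings depend entirely on the data.

```python
import csv
import io
import struct


def as_csv_bytes(rows):
    """Serialise (sensor_id, reading) rows as human-readable CSV."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["sensor_id", "reading"])
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def as_binary_bytes(rows):
    """Serialise the same rows as fixed-width binary records.

    "<Id" = little-endian unsigned 32-bit int + 64-bit float,
    i.e. 12 bytes per row with no padding.
    """
    return b"".join(struct.pack("<Id", sid, val) for sid, val in rows)


# Invented readings, for illustration only.
rows = [(i, i * 0.123456789) for i in range(1000)]
csv_size = len(as_csv_bytes(rows))
bin_size = len(as_binary_bytes(rows))
print(f"CSV: {csv_size} bytes, binary: {bin_size} bytes")
```

The text version stays readable in any editor, which matters to people; the binary version is smaller for this data and faster to parse, which matters to pipelines.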
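A conversion from a road position marker to GPS could look roughly like the sketch below. Everything here is an assumption made for illustration: the marker is taken to be a distance in metres along a road, the reference table and its coordinates are invented, and positions between reference points are linearly interpolated, which is only an approximation on short, fairly straight segments. It is not the road authority's actual method.

```python
from bisect import bisect_right

# Hypothetical reference points for one road:
# (distance along the road in metres, latitude, longitude).
ROAD_REFERENCE = [
    (0.0, 59.0000, 18.0000),
    (1000.0, 59.0090, 18.0000),
    (2000.0, 59.0180, 18.0100),
]


def marker_to_gps(distance_m, reference=ROAD_REFERENCE):
    """Interpolate a (lat, lon) position for a distance marker."""
    distances = [d for d, _, _ in reference]
    if not distances[0] <= distance_m <= distances[-1]:
        raise ValueError("marker lies outside the reference table")
    # Index of the segment start; clamp so the last point uses the last segment.
    i = min(bisect_right(distances, distance_m) - 1, len(reference) - 2)
    d0, lat0, lon0 = reference[i]
    d1, lat1, lon1 = reference[i + 1]
    t = (distance_m - d0) / (d1 - d0)
    return lat0 + t * (lat1 - lat0), lon0 + t * (lon1 - lon0)


lat, lon = marker_to_gps(500.0)
```

Once positions are expressed as coordinates, joining with other sources (weather, road conditions) becomes a standard spatial lookup rather than something that needs knowledge of the internal marker scheme.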
4. Leverage open-source tools
Open-source processing tools are widely available. Many tools for analytics exist in a variety of languages, and quite often they cover different areas.
Machine learning in Python uses TensorFlow with Keras (popular in industry) and PyTorch (popular in academia). Classic statistics uses R, and Stan is used for Bayesian statistics. Programming for reasoning and search uses Prolog, constraint programming and optimisation tools (ILOPS). Remember that machine learning is not only deep neural network coding, although this is popular in 2019.
Handling large amounts of data with more ‘serious’ programming languages brings in C++, Fortran (numerical), Scala (a functional language well suited to big data), Rust (a type-safe(r) language) and the industry-popular Java.
5. Attract talented people
Sharing attracts new actors, i.e. young, idealistic people, to new sectors. One example is the innovative use of data in Data 4 Development: Orange Labs recruited 20 new people following the release of its telecom data from the Ivory Coast. Data challenges, for example for the best-performing ML algorithm, attract many people. The same holds for tech meetups with open data: it is more attractive for new data scientists to work with companies that have interesting [datasets].
6. Conform to open standards
Open standards are very important for many organisations. Sharing works best using open formats. This includes opening up access to data via APIs, network databases and even open interfaces. So one has to regularise the data and change its format so that it can be shared with coders and data analysts who know certain formats and tools. SQL in the database world is probably the best analogy.
7. EU initiatives
Within the EU, new laws make sharing data essential, at least in a regulatory sense. Any requests to see data must be open. GDPR is already implemented, but the formats used in AI are less clear and still need to be established.
8. Be competitive against others
Sharing helps smaller actors compete against large players. For example, phone operators cannot trace single users without reason (police permission is needed), whilst mobile phone apps access GPS positions, giving them a ‘data advantage’ (and they have many more users), not to mention access to the camera, photos and contacts. Incumbents are therefore not on a level playing field as far as data is concerned.
9. Stimulate new ideas
Sharing stimulates innovation, particularly in traditional industries. New or additional data practices that arise through sharing make industries, especially traditional ones, think in new ways. Predictive maintenance is a good example: CNC machines can gather data, and factories can select fields, make (some of) them anonymous and make them available, often through the cloud, for finding errors in the machines. Preparing data in advance of processing is therefore applicable in predictive maintenance.
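Selecting fields and hiding identifiers before release might be sketched as follows. The field names, the record and the salt are hypothetical. Note also that a salted hash is pseudonymisation rather than full anonymisation: the same machine always maps to the same token, which keeps per-machine analysis possible but is a weaker guarantee.

```python
import hashlib

# Hypothetical choice of fields a factory is willing to share.
SHARED_FIELDS = ("machine_id", "spindle_temp_c", "vibration_mm_s", "error_code")
FIELDS_TO_PSEUDONYMISE = {"machine_id"}


def pseudonymise(value, salt):
    """Replace a value with a short, stable token derived from a salted hash."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:12]


def prepare_for_sharing(record, salt):
    """Keep only the shared fields and pseudonymise the identifying ones."""
    out = {}
    for field in SHARED_FIELDS:
        value = record[field]
        if field in FIELDS_TO_PSEUDONYMISE:
            value = pseudonymise(value, salt)
        out[field] = value
    return out


# Invented CNC reading, for illustration only.
reading = {
    "machine_id": "CNC-0042",
    "operator_name": "A. Person",
    "spindle_temp_c": 61.5,
    "vibration_mm_s": 2.3,
    "error_code": 0,
}
shareable = prepare_for_sharing(reading, salt="keep-this-secret")
```

The salt must stay inside the organisation; anyone who knows it can recompute the tokens for a list of candidate machine identifiers.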
10. Leverage tools
Scientific tools can also be applied to shared data: differential privacy and multi-tenancy cloud processing, as well as correlation analysis between datasets and recommendation systems.
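As a rough illustration of differential privacy, the sketch below applies the Laplace mechanism to a counting query. It assumes a count has sensitivity 1 (adding or removing one person changes it by at most 1), so noise with scale 1/epsilon is added. The function names and the data are invented, and production use would call for a vetted library rather than this sketch.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the values that satisfy predicate.

    Assumes sensitivity 1, so the noise scale is 1 / epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Invented ages, for illustration only.
ages = [23, 35, 41, 52, 67, 29, 70]
noisy = private_count(ages, lambda a: a >= 50, epsilon=1.0)
```

Each released answer consumes part of a privacy budget, so repeated queries over the same data need a smaller epsilon per query.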
11. Build an Ecosystem
Examples of ecosystems include Trafiklab, City as a platform, Transport and Internationaldataspaces.org. The ecosystem within which the data will be generated or consumed is important, and may be specific, i.e. not transferable between application areas or use cases.
12. Incremental development
Sharing enables each partner to work on the data. Progress by one partner can motivate the others to work on it and improve it.