Data sharing motivation

This document is a checklist for talking to partners about sharing data. It is simply a list of reasons to share. It is a living document; the most up-to-date version can be found at ianmarsh.org/sharing-data (this page). As we gain more experience we may reorder the points below from the most to the least popular reasons to share data.

Sharing data does not necessarily mean sharing with the general public. The audience could be a special interest group, teams within an organisation, legally joined entities, or project partners (including a consortium), or indeed the public.

1. Farm out potential problems

Probably the most common motivation for externalising a dataset. “More eyes on the data” hopefully leads to new ideas, takes and perspectives on a dataset, as well as simply more analysts. ‘Challenges’ are one vehicle, such as Data for Development (D4D) by Orange in Ivory Coast for tracking human mobility. The KDD conference (via Kaggle) releases data for competition, with the prestige of presenting findings and awards; recommendation-system research also favours open and large datasets, and has featured datasets from Spotify. Smaller datasets are used for practising, building toolchains, and gaining experience and confidence with the standard steps in ML, such as splitting data into training and test sets; MNIST, a set of handwritten digit images, is the classic example for practising character recognition algorithms (see the sketch below).
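
As a minimal, hedged sketch (assuming TensorFlow/Keras is installed), MNIST already ships with a train/test split that can be used for practising exactly this step:

```python
# Minimal sketch: load MNIST and use the train/test split that keras.datasets ships with.
from tensorflow import keras

# x_* are 28x28 greyscale digit images, y_* are the digit labels 0-9.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale pixel values to [0, 1] before feeding them to a model.
x_train = x_train / 255.0
x_test = x_test / 255.0

print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
```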

2. Improve internal data practices

Preparing data for others pushes an organisation towards sharing high-quality, correct data. One example is database schemas that contain redundant features; removing such redundancy is called [normalisation] in database parlance (a small sketch follows below). Another motivation might be for an entity to reorganise data internally in a way suitable for sharing, e.g. into two groups: i) exportable for analytics, ii) internal use only.
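
A hedged illustration of that normalisation idea (the tables and column names are invented): split a denormalised export so the repeated address is stored only once.

```python
import pandas as pd

# Hypothetical denormalised export: the customer address is repeated on every order row.
orders = pd.DataFrame({
    "order_id":   [1, 2, 3],
    "customer":   ["ACME", "ACME", "Globex"],
    "address":    ["Kista 1", "Kista 1", "Lund 7"],
    "amount_sek": [100, 250, 80],
})

# Normalise: keep the address once per customer, reference it from the orders.
customers = orders[["customer", "address"]].drop_duplicates().reset_index(drop=True)
orders_norm = orders[["order_id", "customer", "amount_sek"]]

print(customers)
print(orders_norm)
```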

Exporting higher-quality data, better suited to data scientists, is coupled to the idea of “Data Readiness” coined by Neil Lawrence at Sheffield University, UK [readiness]. Essentially, this is a TRL (technology readiness level) for data. Typically, cleaning (or wrangling) data consumes 80% of project time while analytics takes only 20% (a Pareto rule). RI.SE has a version of the data readiness levels augmented by examples from industry [bada].

3. Cleaning for productivity

Cleaning increases the value of data for external parties. Removing fields, annotating, and changing formats saves a large amount of storage, transmission and processing. Although seemingly trivial, reducing the size and number of features can save precious resources: not just disk space but infrastructure, multi-rack systems and GPUs, plus a whole new set of practices (big-data processing, e.g. Spark, versus a single machine, and so on). Deciding on a human-readable (e.g. text, CSV) or machine-readable (binary, Parquet) format can have a significant impact on an ML pipeline, including the people in an organisation’s workforce (see the sketch below).
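
A small, hedged sketch of that format decision (file and column names are placeholders, and pyarrow or fastparquet is assumed to be installed for Parquet support):

```python
import pandas as pd

# Hypothetical raw export in a human-readable format.
df = pd.read_csv("export.csv")

# Drop columns the external party does not need before sharing.
shared = df.drop(columns=["internal_id", "operator_notes"], errors="ignore")

# Columnar, compressed, machine-readable; typically much smaller and faster to load.
shared.to_parquet("export.parquet", compression="snappy")
```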

Often, self-annotated data or fields tailored for analysts make the data much more valuable. An example from the Swedish road traffic authority is converting an internally used road position marker to GPS coordinates (and vice versa). Not only does this allow lookups in traditional GIS systems, it also facilitates the inclusion of other data, for example road and weather conditions, turning the dataset into a “data as a service” and making the original data source much more valuable as a resource. A company which usefully aggregates data in this way is recordedfuture.com.

ML also works best with ‘clean’, well-scaled data; PCA, for example, is dominated by whichever feature has the largest raw variance unless the data are standardised first (sketch below).
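
A hedged sketch with synthetic numbers only, showing PCA before and after standardising features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1.0, 10.0, 100.0]  # wildly different scales

# Unscaled: the large-variance column dominates the first component.
print(PCA(n_components=2).fit(X).explained_variance_ratio_)

# Standardised: variance is spread far more evenly across components.
print(PCA(n_components=2).fit(StandardScaler().fit_transform(X)).explained_variance_ratio_)
```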

4. Leverage open source tools

Many open-source processing and analytics tools exist, in a variety of languages, and quite often they cover different areas.

Machine learning in Python: TensorFlow+Keras (popular in industry) and PyTorch (academia); classical statistics in R; and Stan for Bayesian statistics (MCMC). Programming for reasoning and search: Prolog, constraint programming, and optimisation (ILOPS). Remember that machine learning is not only deep neural network coding, however popular that is in 2019.

Handling large amounts of data with more ‘serious’ programming languages brings in C++, Fortran (numerical), Scala (a functional language well suited to big data), Rust (a type-safe(r) language) and the industry-popular Java. A token example of how little code a popular toolkit needs is sketched below.
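
A hedged sketch only (layer sizes, optimiser and loss are arbitrary choices, not a recommendation) of how compact a small model is in Keras:

```python
from tensorflow import keras

# Tiny illustrative regression model; sizes, optimiser and loss are arbitrary.
model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```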

5. Attract talented people

Open data attracts new actors, i.e. young, idealistic people, to new sectors. One example is the innovative Data 4 Development challenge by Orange Labs, which recruited 20 new people as a result of releasing their telecom data from the Ivory Coast. Data challenges, for example finding the best-performing ML algorithm, attract many people. Likewise at meetups: with open data it is more attractive for new data scientists to work with companies that have interesting [datasets].

6. Conform to open standards + EU initiatives

Conforming to open standards is very important: it opens up (or closes) access to data, APIs, network databases and even open interfaces (quite common in courses for trying examples), e.g. http://www.xyz/8080 (see the sketch below).
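
A hedged sketch only (the endpoint is the placeholder from the text, not a real service): trying such an open interface from Python takes a few lines with requests.

```python
import requests

# Placeholder endpoint from the example above; substitute a real open API/interface.
resp = requests.get("http://www.xyz/8080", timeout=10)
resp.raise_for_status()
print(resp.text[:200])  # first part of the response body
```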

7. Be competitive against others

Shared data helps smaller players compete against large ones. For example, phone operators cannot trace individual users without cause (police permission is needed), while mobile phone apps access GPS positions, giving them a ‘data advantage’ (and, furthermore, many more users), not to mention the camera, photos and contacts. Incumbents are therefore not on a level playing field as far as data is concerned.

8. Stimulate new ideas

Sharing stimulates innovation, particularly in traditional industries. New or additional data practices introduced through sharing make industry, especially traditional sectors, think in new ways. Predictive maintenance is a good example: CNC machines can gather data, and factories can select fields, anonymise some of them and make the result available (often cloud-based) for finding errors in the machines. Preparing data in advance of processing applies directly to predictive maintenance; a small sketch follows below.
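
A hedged sketch of that preparation step (the column names and the hashing choice are assumptions, not a vetted anonymisation scheme): select the fields worth sharing and pseudonymise the machine identifier before export.

```python
import hashlib
import pandas as pd

# Hypothetical CNC sensor log.
log = pd.DataFrame({
    "machine_id":  ["cnc-017", "cnc-017", "cnc-042"],
    "operator":    ["alice", "bob", "alice"],   # internal only, do not export
    "spindle_rpm": [11980, 12110, 8450],
    "temp_c":      [61.2, 74.8, 58.9],
    "error_code":  [None, "E21", None],
})

# Keep only the fields useful for fault finding, and pseudonymise the machine id.
shared = log[["machine_id", "spindle_rpm", "temp_c", "error_code"]].copy()
shared["machine_id"] = shared["machine_id"].apply(
    lambda s: hashlib.sha256(s.encode()).hexdigest()[:12]
)
shared.to_csv("cnc_shared.csv", index=False)
```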

9. Leverage tools

Sharing also lets partners leverage scientific tools: differential privacy and multi-tenancy cloud processing, correlation analysis between datasets, and recommendation systems. A minimal differential-privacy sketch follows below.
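
A minimal sketch of the differential-privacy idea only (the epsilon value and the query are illustrative, not a vetted implementation): the Laplace mechanism adds calibrated noise to a count before it is released.

```python
import numpy as np

def noisy_count(values, epsilon=1.0, rng=np.random.default_rng()):
    """Release a count with Laplace noise; the sensitivity of a count query is 1."""
    true_count = len(values)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)  # scale = sensitivity / epsilon
    return true_count + noise

print(noisy_count(range(1000), epsilon=0.5))
```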

10. Build an Ecosystem

Ecosystems matter (Trafiklab, City as a Platform, Transport, Internationaldataspaces.org). The ecosystem within which the data will be generated/consumed is important, and may be specific, i.e. not transferable between application areas or use cases.