Data Readiness


Band C Accessibility

[C1] Verified that data exists, can be accessed and compared, and no privacy or confidentiality issues are show stoppers.

Band B Correctness and coherence

[B1] Missing values identified and handled; privacy issues resolved; collection process well understood; coherent use of units; data source merges verified correct; limitations and uncorrectable defects identified; preliminary modelling helps to identify problems.

Band A Suitability w.r.t. a specific set or class of queries/tasks

[A1] Verified suitable and complete for a given task; qualifying may e.g. require manual annotation/labelling. To verify A1 status, a model suitable to the task must be more or less determined.
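To fix the notation used throughout this report (C1, B4, etc.), the sketch below encodes the bands and levels as a small Python enum. The encoding, including the band summaries and the 1-4 level range, is our own illustration and not part of Lawrence's proposal.

```python
# A minimal sketch (our own encoding, not from [1]) of the band/level
# notation used in this report. Within a band, level 1 is the verified
# level and level 4 the lowest (e.g. C4 for hearsay data, see below).
from enum import Enum

class Band(Enum):
    C = "accessibility"              # data exists and can be accessed
    B = "correctness and coherence"  # data is faithful and coherent
    A = "task suitability"           # data suits a specific task

def drl(band: Band, level: int) -> str:
    """Render a readiness level such as 'C1' or 'B4'."""
    assert 1 <= level <= 4, "assumed level range; 1 = verified"
    return f"{band.name}{level}"

print(drl(Band.C, 1), drl(Band.B, 4))  # -> C1 B4
```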


There is a definite overlap between the analysis of the preprocessing process described earlier and the proposal in [1]. While [1] considers some aspects, e.g. privacy and legality concerns, it ignores the iterative character of some of the preprocessing steps, and it is incomplete in terms of classification criteria. On the other hand, for managing data analysis projects, the proposed ratings should be very useful, provided that explicit and general criteria for each rating can be determined.

In the work with the traffic data project we have dealt with three further aspects of data readiness levels. These are seemingly not covered by the present version of Lawrence’s concept. The first issue (which is very common and which we have encountered in other projects too) is the timing of delivery of data. Delay in delivery leaves less time for data preprocessing and less time for analysis. However, a more serious concern with delays in delivery is that the most experienced data scientist may have to leave the project for more pressing tasks. This is related to the time dimension in information logistics.

A second issue, related to the first one, is that the analysis may require skills that are not known in advance, due to initial uncertainty about the quality and information content of the data. Thus, an experienced researcher may wait for data and subsequently preprocess it, only to find out that another skill set would be more useful for the analysis task. Both the first and the second issues lead to inefficiency in the project; timeliness is always a consideration. Note that interaction with the information provider, concerning questions and clarifications, also leads to timing issues.

A third aspect concerns the role of the data scientists receiving the data to be analyzed. While academic research often requires clean and reliable data, industrially motivated data analysis research has always been more tolerant of noisy and imperfect data. The success of a data analysis project lies not only in data quality. Rather, researchers with the right experience should be able to draw useful conclusions by selecting analysis methods according to the quality of the data. It could thus be relevant for the assessment of data readiness to take into account the level of preparedness for dealing with imperfect data at the research organization.

Further, the suitability of the data is relative to both the problem and the chosen analysis methods. Thus, for data analysis to succeed, it is required that the data scientist can select the models and algorithms that make the best use of the data at hand. (With the wrong methods, data analysis can fail no matter the availability, quality, or suitability of the data.) In the traffic flow project we have on occasion had to reconsider the selection of analysis methods as the quality and information content of the data have been revealed.


Data readiness level examples

We first describe our method for assessing the data readiness levels in BADA. Then we give an account of our assessments of the data readiness levels in the two use cases Traffic Safety and Hazard Warnings. We also include an assessment of the data readiness levels in a further project from the BADA program.

Method for assessing data readiness

In the BADA project, we started making data readiness assessments explicitly towards the end of the project, after most of the activities described below had ended. Such after-the-fact assessments may have little value for the project at hand. Rather, their benefit lies in that encountered data readiness issues can be evaluated with respect to the impact they had on the project results. For example, in Anobada (see below), the high quality of the initially given data may initially have led the project to downgrade the significance of the availability of ground truth data. It was not until the end of the project that the non-availability of ground truth was recognised as an issue.

Data readiness can be assessed before, during, and after a project. Making such assessments before a project may clearly be beneficial, allowing better resource allocation and setting realistic expectations of the pace and outcome of the project. However, the true (or verified) data readiness level seems to be accessible only after the data has been thoroughly inspected (i.e. experimented with, wrangled, cleaned, transformed, etc.). It is only after such inspection that we can know the quality of the data and potentially start to evaluate its suitability for the problem. Making assessments during the project is likely a better compromise. Explicit assessment of the readiness levels of given data during the course of a project should allow for early mitigation of detected needs for more data or for alternative data sources.

Our method for assessing data readiness levels is based on posing a short list of questions to the data scientists involved in the activity at hand. The list of questions is given in the box in the figure below. The internal boxes serve to capture the project motivation, problem formulation, data readiness expectations, detected issues, and conclusions based on the findings, respectively.

The first question serves to ensure that the problem is well-formulated and that the expectations on the data analysis are well-conceived. The second and third questions seek to establish that the chosen methods fit well with both the data sources and the expectations on the analysis. Further, the third question also serves to set well-conceived expectations on the data quality.

Questions 4-6 capture our assumptions about the data and contrast them with the verified state of the data. It is very common that data readiness levels are overestimated. For example, in one of the use cases in BADA (not included below and intended to be about air quality), it was assumed that useful data was available and could be classified as C1. It turned out that the data had never been stored. This was thus a clear case of hearsay data, and accordingly the verified data readiness level for the air quality use case was in fact C4.

Answers to questions 7-9 should give a comprehensive account of the status of the data readiness. Finally, based on those answers, question 10 should lead either to a final assessment of the data readiness or to actions for improving the data readiness levels in the project.

Note that projects with several data sources may well have one data readiness assessment for each data source, as in the sketch below.
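To make the questionnaire concrete, the sketch below encodes a per-data-source assessment as a simple Python record. The field names and the example values are our own illustration, not a prescribed schema.

```python
# A minimal sketch (illustrative field names, not a prescribed schema)
# of one data readiness assessment per data source.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DataReadinessAssessment:
    source: str                    # e.g. "STRADA", "SMHI"
    motivation: str                # question 1: problem and expectations
    main_methods: str              # questions 2-3: chosen analysis methods
    assumed_level: str             # questions 4-6: e.g. "C1"
    verified_level: Optional[str] = None  # set after inspecting the data
    availability_issues: list[str] = field(default_factory=list)
    band_c_issues: list[str] = field(default_factory=list)  # access
    band_b_issues: list[str] = field(default_factory=list)  # correctness
    band_a_issues: list[str] = field(default_factory=list)  # suitability
    conclusions: str = ""          # question 10: assessment or actions

# Example record for the traffic safety use case described next.
strada = DataReadinessAssessment(
    source="STRADA",
    motivation="Explain root causes for traffic accidents",
    main_methods="Correlation analysis with external factors",
    assumed_level="C1",
)
strada.verified_level = "C1"
strada.band_c_issues.append("4-6 month delay of initial extract")
```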

Data readiness I: Traffic safety

Motivation and goal

The motivation and goal was to combine incident data (the STRADA incident database) with other data sources in order to better explain root causes for traffic accidents. The intention is to enable the use of such knowledge in a real-time warnings application.

Main methods

Correlation analysis of incidents with external factors such as geographic location, time of day, road type, road condition, and local weather available from public databases
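As an illustration of this method, the sketch below joins incidents to the nearest-in-time weather readings and inspects correlations. It assumes pandas; the file and column names (strada_extract.csv, precipitation_mm, etc.) are hypothetical stand-ins for the real extracts.

```python
# A minimal sketch of correlating incidents with weather; file and
# column names are hypothetical stand-ins for the real extracts.
import pandas as pd

incidents = pd.read_csv("strada_extract.csv", parse_dates=["timestamp"])
weather = pd.read_csv("smhi_readings.csv", parse_dates=["timestamp"])

# Join each incident to the nearest-in-time weather reading.
merged = pd.merge_asof(
    incidents.sort_values("timestamp"),
    weather.sort_values("timestamp"),
    on="timestamp",
    direction="nearest",
)

# Count incidents per hour of day and road type, then inspect the
# correlation between counts and a weather factor such as precipitation.
merged["hour"] = merged["timestamp"].dt.hour
counts = merged.groupby(["hour", "road_type"]).agg(
    n_incidents=("incident_id", "count"),
    precipitation=("precipitation_mm", "mean"),
).reset_index()
print(counts[["n_incidents", "precipitation"]].corr())
```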

Data sources

[STRADA] Incident database owned by Transportstyrelsen

Assumed initial DR level?

[STRADA] C1
[SMHI] C1

Verified initial DR level?

[STRADA] C1, but privacy issues for data with high temporal resolution
[SMHI] B4: Low spatial resolution of weather data. Meteorological expertise required for interpolating readings to incident position (a naive sketch is given below)
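As a toy illustration of the interpolation problem, the sketch below applies naive inverse-distance weighting to station readings. Real use would require the meteorological expertise noted above, and the station positions and readings are made up.

```python
# A minimal sketch of naive inverse-distance-weighted interpolation of
# station readings to an incident position; not meteorologically sound.
import math

def idw(stations, target, power=2.0):
    """Interpolate a reading at `target` (lat, lon) from
    `stations` = [((lat, lon), value), ...] by inverse distance weighting."""
    num = den = 0.0
    for (lat, lon), value in stations:
        d = math.hypot(lat - target[0], lon - target[1])  # crude planar distance
        if d < 1e-9:            # incident at a station: use its reading
            return value
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Made-up temperature readings at three stations near an incident.
stations = [((59.33, 18.06), 2.1), ((59.40, 17.95), 1.4), ((59.20, 18.20), 2.8)]
print(idw(stations, target=(59.35, 18.00)))
```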

Unexpected availability issues

The contact person at Transportstyrelsen went on parental leave and no replacement was appointed. Therefore, a request for additional data was delayed by several months (in addition to the delay mentioned in item (7) below).

Band C issues?

[STRADA] Delay of 4-6 months for delivery of the initial STRADA extract, and low temporal resolution in the first iteration

Band B issues?

[STRADA] “VIN no.” (vehicle type) was considered very sensitive for business reasons and was never made available for all vehicles. Low temporal resolution in the initial STRADA extract due to privacy issues. Higher temporal resolution was negotiated by eliminating non-essential sensitive fields in the extract for a subset of vehicles. Parts of the incident reports were entered in natural language, with an incomplete encoding of many parameters.

Information crucial for understanding the cause and severity of an incident is represented as images in PDF format.

[SMHI] Problems with the low spatial resolution of weather data remain unresolved so far.

Band A Issues?

None. Data quality issues are being compensated for by the selection of models and algorithms.

Conclusions from the safety use case

Work with the use case is still ongoing. The low spatial resolution of weather data, the incomplete extract from STRADA, and the difficulty in interpreting bitmapped data in the STRADA database are factors that may limit the relevance of the outcome of the analysis. However, improving the availability and resolution of the data sources will likely improve the usefulness of the analysis.


Data readiness II: Hazard warning lights

The motivation of this use case is to disambiguate what an activation of the hazard warning lights could mean. Simply put, a driver pressing the hazard warning button in the city could mean a breakdown, parking, queue accumulation, or waiting for a pickup. In a rural setting, it is much more likely to mean that some form of help is needed. The aim is to separate out these scenarios using external data sources, in this case the position of the vehicle.

Motivation

Essentially this is a technology showcase, and as such it should show how technology can be used to help in social settings. A hazard warning signal (both rear indicators blinking) is context-, culture-, and location-sensitive. Drivers may press the warning button, often a large red triangle on the dashboard, to indicate i) parking or waiting to park (“move on, it’s my place”), ii) I am at the end of a queue, take care and slow down, iii) I have broken down, or iv) even celebrations such as weddings, which are associated with vehicles sending out audio and visual signals. Other uses of the hazard warnings are surely possible in other cultures.

Goal

The goal is to clarify the use of the hazard warning based on the location of the press. Classifying and enriching the hazard warning events, correlated with a position, helps disambiguate the reasons behind a press. As a first step, separating urban from rural presses will give a central office some indication of what to do. Naturally, the urban presses could be disambiguated with further information, for example from the vehicle itself (to separate breakdowns from parking). As an external information source we used a geo-information system, in this case OpenStreetMaps.

The main method we make use of is data correlation between one stream (presses in XML format) and one database lookup in OpenStreetMaps. The issue is that a press, given as a GPS location, might match more than one road (e.g. at an intersection or a crossing), or the GPS location might not be present in OpenStreetMaps at all, since it does not cover all GPS coordinate pairs. Therefore, an expanding-radius approach is needed, where the search area is broadened to find the closest location where a vehicle might be stationary; a sketch of such a lookup is given below. Recall too that OpenStreetMaps is largely user-contributed.
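The sketch below illustrates such an expanding-radius lookup. The in-memory roads list and the radius values are stand-ins for a real OpenStreetMaps query, which we do not show.

```python
# A minimal sketch of an expanding-radius nearest-road lookup; the
# `roads` list is a stand-in for a real OpenStreetMaps query.
import math

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))

def nearest_road(press, roads, start_radius_m=25, max_radius_m=800):
    """Expand the search radius until at least one road node is found,
    then return the closest one (None if OSM has no coverage here)."""
    radius = start_radius_m
    while radius <= max_radius_m:
        hits = [(haversine_m(press, pos), name)
                for name, pos in roads if haversine_m(press, pos) <= radius]
        if hits:
            return min(hits)[1]
        radius *= 2  # broaden the search area
    return None

# Hypothetical road nodes near a press position.
roads = [("Sveavägen", (59.3434, 18.0551)), ("E4", (59.3600, 17.9900))]
print(nearest_road((59.3430, 18.0560), roads))
```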

In this case, we actually generated GPS data. This is not as uncommon as one might imagine, since data processing pipelines often need to be built before the data sources are available. Occasionally privacy issues are encountered; systems can nevertheless be built and then used in a company’s internal systems. Open source software tends to follow this pattern: code is developed in the public domain and then used, possibly modified, internally. We used the generated GPS locations to simulate that the hazard warning button was pressed, then used OpenStreetMaps to check whether each location was rural or urban, and used that to classify the event as hazard or parking.
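A toy version of this simulation and classification step is sketched below; the urban bounding box and the rural/urban rule are deliberately crude stand-ins for a real OpenStreetMaps land-use lookup.

```python
# A minimal sketch of simulating presses and classifying them as
# hazard (rural) or parking (urban); the bounding box is made up.
import random

URBAN_BOX = (59.30, 59.38, 17.95, 18.15)  # hypothetical city bounds

def is_urban(lat, lon):
    lat_min, lat_max, lon_min, lon_max = URBAN_BOX
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

def simulate_press():
    """Generate a random press position in a region around the city."""
    return random.uniform(59.2, 59.5), random.uniform(17.8, 18.3)

for _ in range(5):
    lat, lon = simulate_press()
    label = "parking" if is_urban(lat, lon) else "hazard"
    print(f"press at ({lat:.4f}, {lon:.4f}) -> {label}")
```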

Data sources

[Openstreetmaps] Open source map system
[Hazard warning stream] Format: AMQ messages
[GPS] Vehicle positions (had to be simulated, see above)

Assumed readiness levels

[Openstreetmaps] A
[Hazard warning] C1
[GPS] C1

Verified readiness levels

[Openstreetmaps] A
[Hazard warning] C1
[GPS] C4

Unexpected availability issues: None

Band C issues. GPS data was not available and had to be simulated.
Band B issues. The geographical resolution of hazard warnings differed between car manufacturers, which made mapping a geolocation to the likely road a problem. Not enough metadata was available, e.g. the driving direction.
Band A issues. No evaluation of suitability was done. The lack of GPS data for vehicles had a negative effect on the applicability of the analysis.

Conclusions from this use case

The data was too sparse and of too low resolution for the analysis. However, by simulating data we were able to fulfill the main goal of the use case, which was to illustrate through implementation how a big data infrastructure would be set up for dealing with massive automotive data streams.


Data readiness III: Vehicle operational data

Anobada was a one-year project funded by FFI within the BADA program during 2016. The goal of Anobada was to enable detection of deviations in vehicle operations based on operational data. The project aimed at finding interesting anomalies in existing data sets. The intended application of the results is to improve monitoring of vehicle fleets.

The main methods used in Anobada come from statistical data analysis. Through data analysis, the project built a number of statistical models based on the given data, including Principal Component Analysis, Gaussian Mixture Models, and Markov Fields. Anomalies were defined in terms of the likelihood of data points with respect to such cluster models.
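As an illustration of this likelihood-based definition, the sketch below fits a Gaussian Mixture Model and flags the least likely points as anomalies. It assumes scikit-learn; the data is synthetic and the percentile threshold is our own choice, not the project's.

```python
# A minimal sketch of likelihood-based anomaly detection with a GMM,
# assuming scikit-learn; data and threshold are illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
normal = rng.normal(size=(1000, 5))            # stand-in sensor data
gmm = GaussianMixture(n_components=3, random_state=0).fit(normal)

# Score new points by log-likelihood under the fitted mixture and flag
# the least likely ones as anomalies.
new = np.vstack([rng.normal(size=(95, 5)),
                 rng.normal(loc=6.0, size=(5, 5))])  # 5 injected outliers
log_lik = gmm.score_samples(new)
threshold = np.percentile(gmm.score_samples(normal), 1)
anomalies = np.where(log_lik < threshold)[0]
print(f"{len(anomalies)} anomalous points:", anomalies)
```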

Data came from vehicle sensors and was collected at maintenance cycles. It was delivered as a CSV file.

The assumed initial DR level was C1. The data was known to exist and non-disclosure agreements were signed early in the process. We expected some noise in the data as well as missing data. The relevance of the data with respect to finding interesting anomalies was unknown.

The verified DR level was C1, alternatively B4. It turned out that the data had been collected irregularly and with varying volumes from time to time. This made the analysis task harder, prompting a rethinking of the choice of analysis methods. On the other hand, the data was not too noisy, and it was easy to clean and correct. Industrial applications accept some noise in the data, hence the B4 classification.

Unexpectedly, the project received an additional set of data to aid in the classification of the anomalies. The data arrived towards the end of the project, and was therefore not analysed thoroughly enough to be of use.

The main issue in Band C was that ground truth was missing; its importance was initially underestimated. There were no issues in Band B. Despite some noise, irregularities, and missing data, the data analysis was performed by experienced researchers who were able to compensate for and deal with unclean data.

In Band A there were issues. The analysis helped in finding interesting statistical anomalies. However, in order to make use of the data, the anomalies needed to be categorized in terms of vehicle operations to make industrial sense. Put another way, the project did not have access to ground truth data. The project relied on having access to experts in the vehicle domain for interpreting the anomalies. Such expertise turned out to be hard to access, and therefore the relevance of the found anomalies is still to be assessed.

Conclusions. The project was successful in analysing the initially given data, which also had a sufficiently high data readiness level. However, the missing ground truth made it hard to make practical use of the otherwise successful analysis.


Data readiness IV: Weather data (SMHI)

Data sources

[SMHI] Publicly available weather data incl. temperature, pressure, and precipitation at the time and position of an incident (see www.smhi.se)

Assumed initial DR level?

[SMHI] C1


Data readiness V: Map data

Data sources


Future Work

Time series jumps

Spare parts

Jumps or interventions

Conclusions

We have assessed the data readiness levels in BADA with respect to the classification suggested by Lawrence. In that work, we have also identified some factors that can potentially be used for refining Lawrence’s levels.

The goal of assessments of data readiness levels, whether they are made before, during, or after a project, should be to gain experience that enables early and precise decisions in future data analysis projects. By making such assessments explicit in writing, we believe that such experiences can more easily be shared and disseminated.

References

[1] Neil Lawrence, Data Readiness Levels, paper.

[2] Zane Selvans, Automated data wrangling, link.

[3] HoloClean: Holistic Data Repairs with Probabilistic Inference, paper, github.

[4] Björn Bjurling, Per Krueger and Ian Marsh, Data Readiness for BADA, SICS Technical Report T2017:08, report.

[5] Nordic cooperation on data to boost the development of solutions with artificial intelligence, Nordic Council of Ministers, link.

[6] The FAIR (Findable, Accessible, Interoperable, Reusable) guiding Principles for scientific data management and stewardship, link.
