Data Quality

Introduction

One main motivation for using data readiness levels is to gain an understanding of the requirements on data quality for (big) data analysis. This report shares experiences gained from dealing with automotive and traffic data in our use case. We present these experiences in terms of the concept of data readiness levels as introduced by Prof. Neil Lawrence, UK [1].

TRL levels (reference BMC Health Services Research)

In analogy with the concept of technical readiness levels (TRL), data readiness levels (DRL) were introduced as a tool for better planning and resource allocation in data analysis projects, by characterizing various levels of data availability, quality, and suitability.

As an example, a major part of the work in a data analysis project is spent on preprocessing the data. It is also not unusual that a considerable amount of time is spent waiting and asking for data that is more relevant than what was initially made available. A formalization and appreciation of the data analysis process could therefore benefit the planning of data analysis projects. Lawrence’s contribution is outlined below. In the area of information logistics, one can find several related contributions dealing with the importance of getting the right data to the right data scientist at the right time.

Lawrence argues that his contribution should be seen as a first step towards normalizing the full process of data analysis. He suggests a basic order of readiness levels, and invites contributions that can be used to refine characterization of the levels of readiness.

Similar TRL and DRLs.

| Level | Technical Readiness Level (TRL) | Data Readiness Level (DRL) |
|---|---|---|
| 9 | System proven in operational environment | Data being used in an operational environment |
| 8 | System complete and qualified | Data complete and correct |
| 7 | Integrated pilot system demonstrated | Out data makes sense in pilot system |
| 6 | Prototype system verified | Prototype system verified |
| 5 | Laboratory testing of integrated system | Laboratory tested of integrated system |
| 4 | Laboratory testing of prototype component or process | Laboratory verification of dataset or processing |
| 3 | Critical function, proof of concept established | Critical dataset(s), proof of concept established |
| 2 | Technology concept and/or application formulated | Dataset concept formulated |
| 1 | Basic principles are observed and reported | Basic data gathered and stored |
Technical and data readiness levels (TRL) version I

DRLs closer to AI data processing

| Level | Technical Readiness Level (TRL) | Data Readiness Level (DRL) |
|---|---|---|
| 9 | System proven in operational environment | Data providing value in Machine Learning (ML) applications |
| 8 | System complete and qualified | Data, system and algorithms qualified |
| 7 | Integrated pilot system demonstrated | Algorithms+data+system from various partners producing insights |
| 6 | Prototype system verified | ML Pipelines (test data and algorithms) tests passed |
| 5 | Laboratory testing of integrated system | Each partner test cases correct |
| 4 | Laboratory testing of prototype component or process | |
| 3 | Critical function, proof of concept established | |
| 2 | Technology concept and/or application formulated | |
| 1 | Basic principles are observed and reported | |
Technical and data readiness levels (TRL) version II

The experience from working with data in our projects can be well described in terms of Lawrence’s initial levels of readiness. However, we have also encountered issues that are seemingly not covered by Lawrence’s initial account. With further research, these issues may qualify as extensions of Lawrence’s work on data readiness levels (more below).

The report is structured as follows. In the section on the data analysis process, we review and analyse workflow methods used in the preprocessing of data in preparation for the application of analytic methods. We also highlight some aspects of the modelling process that are relevant for what Lawrence has termed “data readiness”.

Lawrence’s concept of data readiness is described in the section on Lawrence’s data readiness levels. It contains an account of the data readiness levels in the use cases we have worked with. These accounts should illustrate the usefulness of the proposed concept as well as some potential pitfalls in data analysis projects.

The data analysis process

Data analysis and machine learning rely on a wide variety of methods and techniques, but share a need for (usually large amounts of) well-structured and well-understood data. Many, perhaps most, analysis projects spend a majority of their resources on collection, selection, correction, collation and processing of the data. This is the case regardless of the type of problem considered, whether it involves detection, diagnosis or autonomous control.

For this process to be as efficient as possible, there is a need for methods for assessing the availability, quality and suitability of the data, as well as a systematic workflow for processing the data and preparing it for use by analytic techniques. These three aspects are also the main focus of Lawrence’s account of data readiness:

  1. Availability
  2. Data quality (e.g. correctness and completeness)
  3. Suitability for a given purpose.

For each aspect, Lawrence proposes introducing a rating, which would make it easier to plan and allocate resources to a project with a given set of goals. We shall reflect these levels also in this review of the work process.

Data sources are almost universally organized as tables, normally with rows representing several aspects of a single observation, e.g. the time and place of an event or reading. For the purpose of data analysis, we refer to the observations as data points, or entries. Table columns represent several aspects or properties of the observations, and we refer to these as parameters. An observation is thus represented as one data entry with values given for some or each of the parameters. One set or sequence of data is referred to as a data source.

In order to prepare data for the use of analytic methods, or automation, we generally need to select, process and compare entries and parameters from several sources, which is an error-prone and often exploratory process. In principle, this work should precede the development of the models used by the method, but in practice it is often interleaved with at least preliminary modeling activities.

The data processing pipeline(s)

Selection, correction, processing and collation of data sources typically occur in repeated cycles where each step is based on a preliminary assessment or analysis of earlier data versions derived from the same sources. Preparation of the data is therefore interleaved with successive interpretation and preliminary modelling. The preprocessing is incremental in the sense that a preliminary analysis step often results in a refined specification of how to interpret, correct or rearrange the data in later steps.

Challenges

In many applications, data occur in the form of logs of e.g. sensor readings, operations performed, events or transactions. Often, the data is distributed over several databases, and entries are often only loosely coupled, e.g. by the co-occurrence of similar identifiers, numbers or text strings, or by overlapping time series with different resolution and data given in different units or formats.

Errors in the data, plus errors therefrom

Values of parameters in the sources are often missing or erroneous. Sometimes errors in the observation process are encoded as particular values outside the normal range of the parameter. In other cases, the fact that a certain value is missing or invalid can be valuable for the analysis, as an indication of e.g. an error in the underlying process. For example, a missing timestamp can be interpreted as an indication that a process step has not been performed, or has failed. For sensor data there are also systematic errors and drift which, when known and understood, can be used to reconstruct the correct parameter values. To reduce lead times, such errors and omissions should be identified as early as possible.
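
As a minimal sketch of how encoded errors and missing values can be kept as information rather than silently dropped, the following Python snippet separates sentinel codes from valid readings. The sentinel values (-999.0 and 9999.0), the error labels and the example readings are assumptions made for illustration and are not taken from our use case.

```python
from typing import Optional, Tuple

# Hypothetical sentinel codes; each real data source defines its own.
SENTINELS = {-999.0: "sensor_fault", 9999.0: "out_of_range"}


def decode_reading(raw: Optional[float]) -> Tuple[Optional[float], Optional[str]]:
    """Split a raw reading into (value, error_state).

    A missing or sentinel reading yields the value None, but the reason
    is kept so that it can be used later as a signal in its own right.
    """
    if raw is None:
        return None, "missing"
    if raw in SENTINELS:
        return None, SENTINELS[raw]
    return raw, None


readings = [21.5, -999.0, None, 22.1, 9999.0]
decoded = [decode_reading(r) for r in readings]
# Error states in order of occurrence, available for later analysis.
error_states = [state for _, state in decoded if state is not None]
```

Keeping the error state next to the value makes it possible to decide later whether an entry should be reconstructed, discounted, or treated as an observation of a failure.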

Deviations and anomalies

In the general case, detecting anomalous data (e.g. outliers, artifacts) is a difficult analytic problem in its own right, but it is often needed during data preprocessing and can include analysis steps that are as complex as the ones used for modelling and interpretation. Failing to take this into account can easily lead to misleading results.
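
To illustrate the simplest end of this spectrum, the sketch below flags outliers in a single parameter using the modified z-score based on the median absolute deviation (MAD). This is one common heuristic chosen here for illustration, not a method prescribed by this report. The threshold of 3.5 is a conventional default, and the speed values are invented.

```python
from statistics import median
from typing import List


def mad_outliers(values: List[float], threshold: float = 3.5) -> List[int]:
    """Return the indices of values whose modified z-score exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread: the score is undefined, so flag nothing
    # 0.6745 makes the MAD comparable to a standard deviation
    # for normally distributed data.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


speeds = [52.0, 49.5, 51.0, 50.5, 180.0, 48.0, 50.0]
outlier_indices = mad_outliers(speeds)  # only the 180.0 reading is flagged
```

A per-parameter test like this cannot detect entries that are anomalous only in combination with other parameters, which is where the more complex analysis steps mentioned above become necessary.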

Data preprocessing steps and dependencies

The figure shows a selection of common preprocessing steps, their interdependencies, and possible iterations in the process of bringing the data into a shape suitable for the application of analytic methods. The following paragraphs elaborate on and exemplify the most crucial ones.

Problem formulation, unit and format analysis

The object of the analysis, e.g. anomaly detection, diagnosis, classification, or automation, clearly constrains the choice of data sources for the project. It is rarely feasible to collect and prepare “all” data without at least a general idea of how it will be used. The first step in any analysis project is therefore to formulate the object of the analysis as concretely as possible. Its exact formulation is normally revised and refined as the examination of the available data progresses. Early in the process this can influence which data sources should be included in the analysis, and how to interpret their parameters.

This step in the preprocessing pipeline can include versions of at least the following steps:

  1. Specification of problem formulation and selection of available data sources for the given problem.
  2. Unification of units and identifiers in separate data sources
  3. Determination of what constitutes an outcome relevant to the analysis object, in each data source
  4. Format analysis
    1. Determination of parameter value range (for e.g. detection of outliers and unexpected encodings)
    2. Determination of the data type of the parameters in each data source
  5. Rudimentary interpretation and re-encoding of free text parameters
  6. Visualisation, inspection and validation of reformulated data
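
The format analysis step in the list above can be sketched as a small profiling function that determines the data type, value range and number of missing values for one parameter. The function name `profile_parameter`, the column names and the sample values are hypothetical, and the sketch assumes that the raw values arrive as text, as they would from a CSV export.

```python
from typing import Dict, List, Optional


def profile_parameter(raw_values: List[Optional[str]]) -> Dict[str, object]:
    """Summarise one column of raw text values: type, range, missing count."""
    present = [v.strip() for v in raw_values if v is not None and v.strip()]
    numbers = []
    for v in present:
        try:
            numbers.append(float(v))
        except ValueError:
            break  # one non-numeric value makes the whole column text
    is_numeric = bool(present) and len(numbers) == len(present)
    return {
        "type": "numeric" if is_numeric else "text",
        "missing": len(raw_values) - len(present),
        "min": min(numbers) if is_numeric else None,
        "max": max(numbers) if is_numeric else None,
    }


speed_kmh = ["50", "62.5", "", None, "-1", "48"]
road_type = ["urban", "rural", "urban", None, "urban", "urban"]
speed_profile = profile_parameter(speed_kmh)
road_profile = profile_parameter(road_type)
```

In this invented example the minimum speed of -1 lies outside the physically meaningful range, which is the kind of unexpected encoding that the range determination is meant to reveal.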

Reconstruction and interpretation

At least in cases where values for the parameters in a data source are generated by sensors, or humans, the data entries are often incomplete and/or inconsistent. Detecting and correcting or discounting the contribution of such entries from the analysis of each parameter is crucial to obtain reliable results. This can involve examining one or more parameters in related data entries in the same, or in other data sources, and/or reinterpretation and recoding.

  1. Reconstruction of missing and obviously erroneous parameter values, where possible
  2. Identification, classification and (re-)encoding of error states in the generation process that gave rise to missing and/or anomalous data
  3. Visualisation, inspection and validation of reconstructed data
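
As one possible instance of the reconstruction step above, the following sketch fills interior gaps in a series by linear interpolation between the nearest valid neighbours. It assumes evenly spaced samples and that linear behaviour between readings is plausible for the parameter. Both are assumptions that need to be checked against domain knowledge, and the temperature values are invented.

```python
from typing import List, Optional


def interpolate_gaps(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps by linear interpolation between valid neighbours.

    Leading and trailing gaps are left as None, since there is no
    neighbour on one side to interpolate from.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


temperature = [None, 10.0, None, None, 16.0, 17.0, None]
reconstructed = interpolate_gaps(temperature)
# -> [None, 10.0, 12.0, 14.0, 16.0, 17.0, None]
```

Values that cannot be reconstructed remain marked as missing, so that the corresponding entries can be filtered or discounted in the selection step.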

Selection and sorting

While not all data analytic methods are sensitive to a large number of parameters, many are. Selecting the parameters most relevant to the object of the analysis is therefore generally necessary, and requires preliminary analysis of the correlation between parameters within and between different data sources. Some parameters may also have to be skipped if an insufficient proportion of the entries have valid values. Orthogonally, we may have to skip entries for which valid values could not be reconstructed. Finally, data from several sources may have to be compared or merged into a format suitable for the modelling and interpretation of the analysis objective. This sometimes involves resampling, parameter fitting, derivation, trend analysis etc., and often involves making crucial modeling choices.

Selection of parameters generally depends on:

  1. The type of analysis
  2. Domain knowledge
  3. Correlation analysis between candidate parameters and against the variables of the object model
  4. Identification and sub-selection of parameters representing the same or very similar information

Selection and sorting of entries:

  1. Filtering or discounting of entries for which values of the selected parameters are missing or invalid
  2. Filtering or discounting of entries with uncorrectable or anomalous values
  3. Sorting and merging data sources by e.g. summing, averaging, resampling, gradient detection, etc.
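
A minimal sketch of the entry selection and merging steps above: entries without a valid value are filtered out, each source is resampled to a common time resolution by averaging, and the sources are then merged on the time bins they share. The 60-second bin width, the tuple layout and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Entry = Tuple[int, Optional[float]]  # (timestamp in seconds, value)


def resample_mean(entries: List[Entry], bin_seconds: int) -> Dict[int, float]:
    """Drop entries without a value and average the rest per time bin."""
    bins: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in entries:
        if value is None:
            continue  # filter entries with missing values
        bins[timestamp - timestamp % bin_seconds].append(value)
    return {start: sum(vals) / len(vals)
            for start, vals in sorted(bins.items())}


def merge_sources(a: Dict[int, float],
                  b: Dict[int, float]) -> List[Tuple[int, float, float]]:
    """Keep only the time bins for which both sources have a value."""
    return [(t, a[t], b[t]) for t in sorted(a.keys() & b.keys())]


speed = [(0, 50.0), (20, 54.0), (40, None), (60, 60.0), (80, 62.0), (130, 70.0)]
flow = [(0, 12.0), (60, 15.0), (120, None)]
merged = merge_sources(resample_mean(speed, 60), resample_mean(flow, 60))
# -> [(0, 52.0, 12.0), (60, 61.0, 15.0)]
```

Choosing the bin width and the aggregation (mean, sum, last value) is itself a modelling choice, since it determines which variations in the data remain visible to the later analysis.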

Modelling

Depending on modelling choices related to the analysis objective, further analysis may be facilitated by the introduction of derived parameters or further specifications:

  1. Derived parameters such as sums, differences, derivatives, or rescalings
  2. Specification through development and evaluation of a model for the analysis objective, e.g. detection, diagnosis, prediction, or automation.
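
The derived parameters named in the first item above can be sketched as follows. The function names and the odometer example are hypothetical. The derivative is a simple finite difference, and the rescaling assumes that the values are not all equal.

```python
from typing import List, Tuple


def differences(values: List[float]) -> List[float]:
    """First differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


def derivative(samples: List[Tuple[float, float]]) -> List[float]:
    """Finite-difference rate of change for (time, value) samples."""
    return [(v1 - v0) / (t1 - t0)
            for (t0, v0), (t1, v1) in zip(samples, samples[1:])]


def rescale(values: List[float]) -> List[float]:
    """Min-max rescaling to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


odometer_km = [(0.0, 100.0), (0.5, 130.0), (1.0, 180.0)]  # (hours, km)
speed_kmh = derivative(odometer_km)   # -> [60.0, 100.0]
scaled = rescale([10.0, 15.0, 20.0])  # -> [0.0, 0.5, 1.0]
```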

The result of this activity frequently requires that we return to an earlier data preparation step.

Lawrence’s data readiness levels

Based on an original version of Data Readiness for Big Data Analytics, 2017 by Björn Bjurling, Per Kruger and Ian Marsh (paper)

Application of models to data is fraught. Data-generating collaborators often only have a very basic understanding of the complications of collating, processing and curating data. Challenges include: poor data collection practices, missing values, inconvenient storage mechanisms, intellectual property, security and privacy. All these aspects obstruct the sharing and interconnection of data, and the eventual interpretation of data through machine learning or other approaches. In project reporting, a major challenge is in encapsulating these problems and enabling goals to be built around the processing of data. Project overruns can occur due to failure to account for the amount of time required to curate and collate. But to understand these failures we need to have a common language for assessing the readiness of a particular data set. This position paper proposes the use of data readiness levels: it gives a rough outline of three stages of data preparedness and speculates on how formalisation of these levels into a common language for data readiness could facilitate project management.

In [1], the state of the data before and during this process is analysed as a means to improve the planning of data analysis projects. The author divides projects broadly into three phases, or “bands”, depending on the knowledge and understanding of the available data and its usefulness for a given objective. Each phase generally presupposes answers to the questions addressed in the earlier phases, but as noted above, preprocessing of the data is often an iterative process, which influences how to classify the complete set of data sources deemed necessary to achieve the analysis objective.

The first phase, band C, involves assessing the availability and accessibility of parameters and entries in existing data sources. Lawrence does not explicitly pose necessary and sufficient conditions for a given rating in any of the phases, but appears to propose ratings running from 1 to 4 within each of them, where a rating of 1 means that the project is ready to proceed to the next phase in the preparation and processing of data.
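
The band and rating scheme described above can be written down as a small data structure, which may help when tracking the readiness of each data source in a project plan. This is a sketch under assumptions: the band names B and A for the second and third phases follow Lawrence’s paper, the 1 to 4 scale follows our reading above, and the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

BANDS = ["C", "B", "A"]  # order in which a data source passes through the bands


@dataclass(frozen=True)
class Readiness:
    band: str    # "C", "B" or "A"
    rating: int  # 4 (least ready) down to 1 (ready to leave the band)

    def __post_init__(self) -> None:
        if self.band not in BANDS or not 1 <= self.rating <= 4:
            raise ValueError(f"invalid readiness level: {self.band}{self.rating}")

    def next_level(self) -> Optional["Readiness"]:
        """Next level to aim for, or None when A1 has been reached."""
        if self.rating > 1:
            return Readiness(self.band, self.rating - 1)
        position = BANDS.index(self.band)
        if position + 1 < len(BANDS):
            return Readiness(BANDS[position + 1], 4)
        return None


level = Readiness("C", 1)
following = level.next_level()  # a source rated C1 moves on to band B
```

Since preprocessing is iterative, a data source may in practice also move back to a lower level when a later step reveals new problems. The sketch only models the forward direction.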

Ideally, there should be a finite number of validation criteria for each data source for the project to be able to rate it at a given level, but that is not yet the case for the proposal as stated in [1]. As it stands, it proposes broad criteria for only a subset of the phases, in summary: