Data Cleaning and Validation

Data Cleaning is the process of detecting and correcting erroneous data
- Also called data cleansing
- Typically updates or changes/transforms faulty data to be correct
Data Validation is the process of checking that data is correct or has been properly corrected
- Typically rejects faulty data
The two can be combined
Goal of data cleaning and data validation is to increase data quality
Thereby increasing the overall usefulness of the data

Data Quality

Data quality: Measure of how useful or usable a dataset is
Typically an aggregate over all data in a dataset
Different aspects contribute to good and poor data quality
- Validity: Conformity to knowledge/business rules/constraints
- Accuracy: Conformity to truth/predefined standard
- Completeness: How much of required data is known/contained in dataset
- Consistency: Equivalence/agreement across/within systems/datasets
- Uniformity: Conformity of formats, units of measure, etc.
Data quality also used to evaluate need for data cleaning/validation
Note: Quality is specific to use!
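
Two of the aspects above can be turned into simple numbers. The following is a minimal Python sketch, not a standard definition: the sample rows, the allowed positions and the rule "salary must be positive" are assumptions made for illustration.

```python
# Hypothetical sample rows; None marks a missing value.
rows = [
    {"name": "Kari Borgen", "position": "lecturer", "salary": 545230},
    {"name": "Bo Belle", "position": "profesor", "salary": -701930},
    {"name": "Ye Shin", "position": "lecturer", "salary": None},
]

ALLOWED_POSITIONS = {"researcher", "lecturer", "professor"}


def completeness(rows, column):
    """Fraction of rows where the column has a value."""
    return sum(r[column] is not None for r in rows) / len(rows)


def validity(rows, column, rule):
    """Fraction of the known values in the column that satisfy the rule."""
    known = [r[column] for r in rows if r[column] is not None]
    return sum(rule(v) for v in known) / len(known)


salary_completeness = completeness(rows, "salary")
position_validity = validity(rows, "position", lambda p: p in ALLOWED_POSITIONS)
salary_validity = validity(rows, "salary", lambda s: s > 0)
```

Which rules are used, and how the numbers are weighted, depends on what the data is used for.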

Data Cleaning

Process with input messy data and output clean data
Messy data typically contains different types of errors and incoherencies
These need to be fixed
Result of fixing these is the clean output
Note, not all errors and incoherencies are possible to fix in general!

Data Cleaning vs Data Integration

Can view differences in two datasets as errors that need to be corrected (i.e. cleaned)
Thus, integration and cleaning are sometimes quite similar
However, data cleaning is focused on increasing data quality
Integration always focuses on harmonizing two or more datasets
Note: Data cleaning sometimes done after integration to remove errors/duplicates/etc. from integration

Sources of Error

Human input
- Human data input errors (e.g. name = "Lief Harrald Karlsn")
- Ambiguities in input form (e.g. height = '1.64' (meters), '164' (centimeters), '5 4.6' (feet/inches))
- Wrong input format (e.g. birthdate = '1956-03-11', '11. March 1956', '1956')
Hardware
- Faulty sensors
- Corruption of files and streams
Programs
- Faulty data transformations
- Bugs

Types of Error

Syntactic errors (violations of formats, types, etc.)
- Lexical violation (schema: person(name, age, height), value: ('Lisa', '180'))
- Domain format violation (format: <surname>, <firstname>, value: Leif Harald Karlsen)
- Irregularities (e.g. different units used (meters, centimeters and feet/inches))

Semantic errors (violations of constraints, knowledge, business rules, etc.)
- Integrity constraint violations (e.g. age >= 0, or: if hasSupervisor(x, y) then supervisor(y))
- Contradictions (age != now() - date_of_birth, or salary > 0 AND unemployed, etc.)
- Duplicates ({name: 'Ole', height: '165.4'}, {name: 'Ole', height: '165.399'})
- Invalid tuples (Tuple not possible in reality, but does not violate any (specified) syntactic or semantic constraint)
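
The split between syntactic and semantic errors can be sketched as two separate checks. This is an illustrative Python sketch only: the record layout follows the person(name, age, height) example above, all values are assumed to arrive as text, and the function names are hypothetical.

```python
def syntactic_errors(record):
    """Format/type checks: is each value present and parseable at all?"""
    errors = []
    for field, parse in (("age", int), ("height", float)):
        try:
            parse(record[field])
        except (KeyError, ValueError):
            errors.append(f"{field}: missing or not parseable as {parse.__name__}")
    return errors


def semantic_errors(record):
    """Constraint checks; only meaningful once the syntax is fine."""
    errors = []
    if int(record["age"]) < 0:
        errors.append("age: must be >= 0")
    return errors


def check(record):
    # Report syntactic errors first; run semantic checks only if there are none.
    return syntactic_errors(record) or semantic_errors(record)
```

For example, check({"name": "Lisa", "age": "180"}) reports the missing height, while a record with age "-3" parses fine but violates the constraint.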

Cleaning Techniques

Data cleaning is complex and time-consuming
We will look at simple techniques that fit with our tools
- Mostly by using SQL
- However, similarly for SPARQL or other query languages
- Techniques also applicable in normal programming languages
However, cleaning really messy data might require a proper cleaning framework
- See e.g. "Problems, Methods and Challenges in Comprehensive Data Cleansing" for examples

Example: Setting

Assume we have a database containing a list of employees:

CREATE TABLE emp.employee (
    name text,
    position text, -- 'researcher', 'lecturer' or 'professor'
    employed date,
    salary int
);

Data looks like this (download here):

      name       |  position  |  employed  | salary 
-----------------+------------+------------+--------
 Karla Loe       | researcher | 2020-01-18 | 501990
 Legor Ivani     | professor  | 2021-02-18 | 632950
 Martin Schwartz | lecturer   | 2007-11-01 | 562900
 Ye Shin         | lecturer   | 2014-10-18 | 511400
 Guri Quin       | professor  | 2010-10-01 | 699070
 Kari Borgen     | lecturer   | 2010-11-08 | 545230
 Ole Nilsen      | lecturer   | 2001-03-04 | 513190
 Hannah Stern    | professor  | 2019-05-17 | 670500
 Karl Hansen     | professor  | 2016-09-10 | 651735
 Bo Belle        | professor  | 2011-08-18 | 701930
 Ove Bole        | researcher | 2014-02-11 | 517480
 Vera Louise     | researcher | 2006-09-10 | 529100
 Ida Persson     | researcher | 2016-12-10 | 520800

Example: Problem

Unfortunately, parts of the database got damaged and lost half of the records!
We now only have these records:

      name       |  position  |  employed  | salary 
-----------------+------------+------------+--------
 Martin Schwartz | lecturer   | 2007-11-01 | 562900
 Kari Borgen     | lecturer   | 2010-11-08 | 545230
 Guri Quin       | professor  | 2010-10-01 | 699070
 Hannah Stern    | professor  | 2019-05-17 | 670500
 Karl Hansen     | professor  | 2016-09-10 | 651735
 Mary Smith      | researcher | 2020-01-18 | 501990
 Vera Louise     | researcher | 2006-09-10 | 529100
(7 rows)

Example: Solution

Solution: Make all employees fill out physical forms with info, scan and automatically parse to CSV:

Legor Ivani;professor;18.02.2021;632950
Ye Shin;lecturer;2014-10-18;56254
Kari Bargen;lectorer;2010-11-08;545230
Hanna Stern;proffessor;2019-05-17;670500.78
Ida Persson;researcher;10.12.2016;520800
Mary Smith;researcher;2020-01-18;501990.0
Vera Louise;resercher;2006-09-10;529100
Martin Schwartz;lecturer;2007-11-01;61919
Bo Belle;profesor;2011-08-18;-701930
Ove Bole;senior researcher;2014-02-11;717480
Guri Ouin;professor;2010-10-01;699070
Karl Hansen;professor in mathematics;2016-09-10;651734.99
Ola Nilsen;lecturer;2001-03-04;513190
Karla Loe;researcher;2020-01-18;501990

CSV result can be downloaded here
Problems: Human typos, different formats and units, small errors in scanning, duplicates
Need to load it into a table with all columns as text

CREATE TABLE emp.employee_form (
    name text,
    position text,
    employed text,
    salary text
);

cat employee_form.csv | psql <flags> -c "COPY emp.employee_form FROM STDIN DELIMITER ';';"
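
The same idea (load everything as text first, parse later) can be sketched in Python. The sample lines are copied from the form data above; a real run would open employee_form.csv instead, and the helper name load_form is hypothetical.

```python
import csv
import io

# Three lines copied from the form data; stands in for the real file.
sample = """Legor Ivani;professor;18.02.2021;632950
Ye Shin;lecturer;2014-10-18;56254
Hanna Stern;proffessor;2019-05-17;670500.78
"""

COLUMNS = ["name", "position", "employed", "salary"]


def load_form(stream):
    """Read semicolon-separated rows, keeping every value as a string."""
    reader = csv.reader(stream, delimiter=";")
    return [dict(zip(COLUMNS, row)) for row in reader]


form_rows = load_form(io.StringIO(sample))
```

Keeping all columns as text means loading never fails on a malformed date or salary; those are handled in the cleaning steps.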

Harmonizing Units with CASE

We can see that the salaries are sometimes off
One is negative, and some are an order of magnitude below what they should be
However, it seems the latter gave their salaries in EUR instead of NOK
Thus, need to convert them, and will here use a simple CASE-expression:

CREATE VIEW emp.fix_salary AS
WITH
  sal AS (
    SELECT name, position, employed,
        abs(round(salary::float))::int AS salary -- parse to nearest int and remove minus
    FROM emp.employee_form
  )
SELECT name, position, employed,
    (CASE
      WHEN salary < 100000 THEN salary * 9 
      ELSE salary
     END) AS salary
FROM sal;
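
For comparison, a minimal Python sketch of the same rule. The threshold and the rough rate of 9 are taken from the view above; the constant and function names are assumptions for illustration.

```python
EUR_TO_NOK = 9            # rough rate, as in the view above
MIN_NOK_SALARY = 100_000  # below this we assume the amount was given in EUR


def fix_salary(raw):
    """Parse a salary given as text, drop the sign, convert suspected EUR to NOK."""
    amount = abs(round(float(raw)))  # nearest int, minus sign removed
    return amount * EUR_TO_NOK if amount < MIN_NOK_SALARY else amount
```

With the form data, fix_salary('-701930') gives 701930 and fix_salary('56254') gives 506286.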

Converting Formats with Regular Expressions

Some of the dates are in the wrong format (e.g. 18.02.2021)
Can fix this by simply using regular expressions, splitting, etc.:

CREATE VIEW emp.fix_employed AS
WITH
  err_dates AS (
    SELECT name, position, employed, salary, regexp_split_to_array(employed, '\.') AS darr
    FROM emp.fix_salary
    WHERE employed ~ '\d\d\.\d\d\.\d\d\d\d'
  )
SELECT name, position, concat_ws('-', darr[3], darr[2], darr[1])::date AS employed, salary
FROM err_dates
UNION ALL
SELECT name, position, employed::date, salary
FROM emp.fix_salary
WHERE NOT employed ~ '\d\d\.\d\d\.\d\d\d\d';
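
A minimal Python sketch of the same conversion. Note one deliberate difference, which is an assumption of this sketch: the pattern must match the whole string, whereas the regular expression in the view matches anywhere in the value.

```python
import re
from datetime import date

# DD.MM.YYYY, matched against the whole string
DOTTED = re.compile(r"(\d\d)\.(\d\d)\.(\d\d\d\d)")


def fix_employed(raw):
    """Return a date for either DD.MM.YYYY or ISO (YYYY-MM-DD) input."""
    m = DOTTED.fullmatch(raw)
    if m:
        day, month, year = m.groups()
        return date(int(year), int(month), int(day))
    return date.fromisoformat(raw)  # raises ValueError for anything else
```

Values in neither format (e.g. '1956') raise an error instead of being silently guessed.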

Fixing Typos with Similarity Measures

We know that position should be one of 'researcher', 'lecturer' or 'professor'
Thus, should pick the one that is most similar
Many similarity measures for text
Here we will use one based on trigrams
- PostgreSQL extension pg_trgm
- Requires the extension to be installed (CREATE EXTENSION pg_trgm;)
So to fix positions, we can do:

CREATE VIEW emp.fix_position AS
WITH
  -- Following uses the trigrams-based similarity measure to find similarity
  -- between position in form and list of positions below
  sim AS ( 
    SELECT e.name, p.position, similarity(e.position, p.position) AS similarity
    FROM emp.fix_employed AS e, -- cross join
         (VALUES ('researcher'), ('lecturer'), ('professor')) AS p(position)
  )
SELECT e.name,
      (SELECT s.position
       FROM sim AS s
       WHERE s.name = e.name
       ORDER BY similarity DESC
       LIMIT 1) AS position, -- pick most similar position
      e.employed,
      e.salary
FROM emp.fix_employed AS e;
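
To see what a trigram-based similarity does, here is a small Python sketch. It approximates the idea behind pg_trgm (lowercase, pad each word, compare sets of three-character substrings) but is an assumption-laden re-implementation and is not guaranteed to reproduce the extension's exact scores.

```python
import re

POSITIONS = ["researcher", "lecturer", "professor"]


def trigrams(text):
    """Set of 3-character substrings of each word, padded with blanks."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two blanks in front, one behind
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def fix_position(raw):
    """Pick the allowed position most similar to the raw input."""
    return max(POSITIONS, key=lambda p: similarity(raw, p))
```

With this sketch, 'lectorer' maps to 'lecturer' and 'senior researcher' maps to 'researcher'. Note that every input gets mapped to some position, even one that resembles none of them.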

Duplicate Elimination: Problem

Detect and merge duplicate statements (tuples, triples, etc.)
Duplicates are not always completely equal (inaccuracies, rounding errors, typos, etc.)
Can use similarity measures to detect similarities between statements

However, similarity is domain and use-case dependent
- E.g. these registrations for an event are probably the same statement:
  - {name: 'Kari Nilsen', registered: '2001-03-04 10:31:28'}
  - {name: 'Kari Nilsen', registered: '2001-03-04 10:31:29'}
- However, these measurements from sensors are probably not the same statement
  - {sensor: 203, value: 22.1, time_measured: '2001-03-04 10:31:28'}
  - {sensor: 203, value: 22.1, time_measured: '2001-03-04 10:31:29'}

Need to decide which attributes might vary, and when they should be considered the same
Note: Duplicate elimination/merging (and entity resolution) important also in data integration
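
The two examples above can be sketched as two different equality rules. This Python sketch is illustrative only: the two-second tolerance and the function names are assumptions, and a real use case would have to pick its own.

```python
from datetime import datetime, timedelta


def same_registration(a, b, tolerance=timedelta(seconds=2)):
    """Registrations: same name and nearly the same time count as one statement."""
    return (a["name"] == b["name"]
            and abs(a["registered"] - b["registered"]) <= tolerance)


def same_measurement(a, b):
    """Sensor readings: only exactly equal statements count as duplicates."""
    return a == b


r1 = {"name": "Kari Nilsen", "registered": datetime(2001, 3, 4, 10, 31, 28)}
r2 = {"name": "Kari Nilsen", "registered": datetime(2001, 3, 4, 10, 31, 29)}
m1 = {"sensor": 203, "value": 22.1, "time_measured": datetime(2001, 3, 4, 10, 31, 28)}
m2 = {"sensor": 203, "value": 22.1, "time_measured": datetime(2001, 3, 4, 10, 31, 29)}

registrations_match = same_registration(r1, r2)  # True: within tolerance
measurements_match = same_measurement(m1, m2)    # False: times differ
```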

Remove Duplicates with Similarity Measures

We will assume that rows with similar name and equal position denote the same employee
Will assume that differences in names in duplicates are only typos
We will therefore use the Levenshtein distance
- levenshtein function from PostgreSQL extension fuzzystrmatch
- Requires the extension to be installed (CREATE EXTENSION fuzzystrmatch;)
Levenshtein distance is an int denoting how many single-character edits (insertions, deletions, substitutions) one needs to make to go from one string to the other
- levenshtein('hello', 'hallo') = 1
- levenshtein('hello', 'halo') = 2
- levenshtein('hello', 'halo der') = 6
Thus, typos would typically give a distance of 1 or 2

CREATE VIEW emp.fixed AS 
WITH
  dup AS ( -- Find duplicates to remove
    SELECT f.name, f.position, f.employed, f.salary
    FROM emp.fix_position AS f 
         JOIN emp.employee AS e USING (position)
    WHERE levenshtein(f.name, e.name) <= 2
  )
SELECT *
FROM emp.employee
UNION ALL (
  SELECT * FROM emp.fix_position
  EXCEPT 
  SELECT * FROM dup
);
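
The edit-distance computation itself is short. Below is a minimal Python sketch of the standard dynamic-programming formulation with unit costs, plus a filter in the spirit of the view above. The helper name drop_known and the dict-based rows are assumptions for illustration.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def drop_known(form_rows, known_rows, max_dist=2):
    """Keep only form rows that do not match an already known employee."""
    return [f for f in form_rows
            if not any(f["position"] == k["position"]
                       and levenshtein(f["name"], k["name"]) <= max_dist
                       for k in known_rows)]
```

For example, 'Kari Bargen' is dropped against a known 'Kari Borgen' with the same position, while 'Ola Nilsen' is kept when no similar name is known.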

Perfect Cleaning?

Cleaning script can be downloaded here
Our result was not perfect
I.e. we did not get back the original result
See e.g.:

SELECT * FROM emp.fixed
EXCEPT 
SELECT * FROM emp.employee_orig ;

The reason is typos not fixable with our (general) domain knowledge
Will fix one more of these with validation (in a bit)

Real World vs. Our Use-Case

This use-case showed some simple data cleaning techniques in practice
The main point was to see what we need to do and how we should think
Note how we constantly used domain knowledge and made assumptions about the data, e.g.:
- Salaries are always above 100 000
- Positions are one of 'researcher', 'professor', 'lecturer'
- Dates use either ISO format or DD.MM.YYYY
- No two people have almost the same names
This knowledge and these assumptions are important to keep track of!
Thus, capturing this knowledge and these assumptions in queries/views is perhaps not best practice
- Difficult to reuse on other datasets
- Difficult to maintain
- Difficult to read/understand
If cleaning done with queries/views, perhaps capture essential knowledge and assumptions into functions
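
One possible way to keep the assumptions in one place, sketched in Python. The class, its field names and the second dataset are hypothetical and only illustrate the idea of passing assumptions to the cleaning functions explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumptions:
    """Domain knowledge used by the cleaning steps, gathered in one place."""
    min_salary: int = 100_000  # salaries below this are taken to be in EUR
    positions: tuple = ("researcher", "lecturer", "professor")
    max_name_typos: int = 2    # max edit distance for names of the same person


def looks_like_foreign_currency(salary, assumptions):
    """True if the salary is too low to be plausible in the local currency."""
    return 0 <= salary < assumptions.min_salary


DEFAULT = Assumptions()
STUDENT_JOBS = Assumptions(min_salary=20_000)  # hypothetical second dataset
```

The same function can then be reused on another dataset by swapping the assumptions, e.g. 56254 looks like a foreign-currency amount under DEFAULT but not under STUDENT_JOBS.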

More Complex Use Case

Data cleaning can be much more complex!
Imagine if the data looked like this:

Legor Ivani;professor;beginning of March 2021;632950
lecturer;Ye Shin;2014-10-18;56254
Kari Bargen;lecturer 50% and 50% student;2010-11-08;545230
Hanna Stern;professor;2019-05-17;64
Ida Persson;became researcher on 10.12.2016;see previous comment;520800
Mary Smith;2020-01-18;501990.0

Data Validation

After cleaning the data, it is important to validate the result
There might still be errors not detected and fixed by our data cleaning
Validation typically checks:
- Types, ranges and formats
- Cross references and codes (correct wrt. external sources)
- Structure (correct references, properties of whole datasets, etc.)
- Consistency (no contradictions within data)
Checks are created based on (more) domain knowledge
Thus, different from data verification, which checks equality between datasets, hash-sums, etc.
- Typically done after data migration

Validation of Our Data Cleaning

Can simply add sufficient constraints to our table
Constraints should encode what must hold according to our knowledge of the domain

CREATE TABLE emp.employee_validated (
    name text,
    position text CHECK (position IN ('professor', 'lecturer', 'researcher')),
    employed date CHECK (employed < now()),
    salary float CHECK ((position != 'professor' AND salary >= 400000 AND salary < 600000)
                        OR (position = 'professor' AND salary >= 600000 AND salary <= 800000))   
);

INSERT INTO emp.employee_validated
SELECT *
FROM emp.fixed; -- Fails due to wrong salary for Ove Bole

Errors during validation can either be rejected, or treated further by program or human
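
Collecting errors instead of failing on the first one can be sketched in Python as follows. The rules mirror the CHECK constraints above; the function name, the error messages and the row layout are assumptions for illustration.

```python
from datetime import date


def validate(row, today=None):
    """Return a list of violated rules; an empty list means the row is accepted."""
    today = today or date.today()
    errors = []
    if row["position"] not in ("professor", "lecturer", "researcher"):
        errors.append("unknown position")
    if not row["employed"] < today:
        errors.append("employed date is not in the past")
    if row["position"] == "professor":
        salary_ok = 600_000 <= row["salary"] <= 800_000
    else:
        salary_ok = 400_000 <= row["salary"] < 600_000
    if not salary_ok:
        errors.append("salary outside range for position")
    return errors


rows = [
    {"name": "Kari Borgen", "position": "lecturer",
     "employed": date(2010, 11, 8), "salary": 545230},
    {"name": "Ove Bole", "position": "researcher",
     "employed": date(2014, 2, 11), "salary": 717480},
]
accepted = [r for r in rows if not validate(r)]
rejected = [(r["name"], validate(r)) for r in rows if validate(r)]
```

Here the row for Ove Bole ends up in rejected with the salary rule as the reason, and can be passed on to a program or a human.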

IN5800 – Data Cleaning and Validation