IN5800 – Data Cleaning and Validation

Leif Harald Karlsen

Data Cleaning and Validation

Data Quality

Data Cleaning

Data Cleaning vs Data Integration

Sources of Error

Types of Error

Cleaning Techniques

Example: Setting

CREATE TABLE emp.employee (
    name text,
    position text, -- 'researcher', 'lecturer' or 'professor'
    employed date,
    salary int
);
      name       |  position  |  employed  | salary 
-----------------+------------+------------+--------
 Karla Loe       | researcher | 2020-01-18 | 501990
 Legor Ivani     | professor  | 2021-02-18 | 632950
 Martin Schwartz | lecturer   | 2007-11-01 | 562900
 Ye Shin         | lecturer   | 2014-10-18 | 511400
 Guri Quin       | professor  | 2010-10-01 | 699070
 Kari Borgen     | lecturer   | 2010-11-08 | 545230
 Ole Nilsen      | lecturer   | 2001-03-04 | 513190
 Hannah Stern    | professor  | 2019-05-17 | 670500
 Karl Hansen     | professor  | 2016-09-10 | 651735
 Bo Belle        | professor  | 2011-08-18 | 701930
 Ove Bole        | researcher | 2014-02-11 | 517480
 Vera Louise     | researcher | 2006-09-10 | 529100
 Ida Persson     | researcher | 2016-12-10 | 520800

Example: Problem

      name       |  position  |  employed  | salary 
-----------------+------------+------------+--------
 Martin Schwartz | lecturer   | 2007-11-01 | 562900
 Kari Borgen     | lecturer   | 2010-11-08 | 545230
 Guri Quin       | professor  | 2010-10-01 | 699070
 Hannah Stern    | professor  | 2019-05-17 | 670500
 Karl Hansen     | professor  | 2016-09-10 | 651735
 Mary Smith      | researcher | 2020-01-18 | 501990
 Vera Louise     | researcher | 2006-09-10 | 529100
(7 rows)

Example: Solution

Legor Ivani;professor;18.02.2021;632950
Ye Shin;lecturer;2014-10-18;56254
Kari Bargen;lectorer;2010-11-08;545230
Hanna Stern;proffessor;2019-05-17;670500.78
Ida Persson;researcher;10.12.2016;520800
Mary Smith;researcher;2020-01-18;501990.0
Vera Louise;resercher;2006-09-10;529100
Martin Schwartz;lecturer;2007-11-01;61919
Bo Belle;profesor;2011-08-18;-701930
Ove Bole;senior researcher;2014-02-11;717480
Guri Ouin;professor;2010-10-01;699070
Karl Hansen;professor in mathematics;2016-09-10;651734.99
Ola Nilsen;lecturer;2001-03-04;513190
Karla Loe;researcher;2020-01-18;501990
CREATE TABLE emp.employee_form (
    name text,
    position text,
    employed text,
    salary text
);
cat employee_form.csv | psql <flags> -c "COPY emp.employee_form FROM STDIN DELIMITER ';';"

Harmonizing Units with CASE

CREATE VIEW emp.fix_salary AS
WITH
  sal AS (
    SELECT name, position, employed,
        abs(round(salary::float))::int AS salary -- parse to nearest int and remove minus
    FROM emp.employee_form
  )
SELECT name, position, employed,
    (CASE
      WHEN salary < 100000 THEN salary * 9 
      ELSE salary
     END) AS salary
FROM sal;

Converting Formats with Regular Expressions

CREATE VIEW emp.fix_employed AS
WITH
  err_dates AS (
    SELECT name, position, employed, salary, regexp_split_to_array(employed, '\.') AS darr
    FROM emp.fix_salary
    WHERE employed ~ '\d\d\.\d\d\.\d\d\d\d'
  )
SELECT name, position, concat_ws('-', darr[3], darr[2], darr[1])::date AS employed, salary
FROM err_dates
UNION ALL
SELECT name, position, employed::date, salary
FROM emp.fix_salary
WHERE NOT employed ~ '\d\d\.\d\d\.\d\d\d\d';

Fixing Typos with Similarity Measures

CREATE VIEW emp.fix_position AS
WITH
  -- Following uses the trigrams-based similarity measure to find similarity
  -- between position in form an list of positions below
  sim AS ( 
    SELECT e.name, p.position, similarity(e.position, p.position) AS similarity
    FROM emp.fix_employed AS e, -- cross join
         (VALUES ('researcher'), ('lecturer'), ('professor')) AS p(position)
  )
SELECT e.name,
      (SELECT s.position
       FROM sim AS s
       WHERE s.name = e.name
       ORDER BY similarity DESC
       LIMIT 1) AS position, -- pick most similar position
      e.employed,
      e.salary
FROM emp.fix_employed AS e;

Duplicate Elimination: Problem

Remove Diplicates with Similarity Measures

CREATE VIEW emp.fixed AS 
WITH
  dup AS ( -- Find duplicates to remove
    SELECT f.name, f.position, f.employed, f.salary
    FROM emp.fix_position AS f 
         JOIN emp.employee AS e USING (position)
    WHERE levenshtein(f.name, e.name) <= 2
  )
SELECT *
FROM emp.employee
UNION ALL (
  SELECT * FROM emp.fix_position
  EXCEPT 
  SELECT * FROM dup
);

Perfect Cleaning?

SELECT * FROM emp.fixed
EXCEPT 
SELECT * FROM emp.employee_orig ;

Real World vs. Our Use-Case

More Complex Use Case

Legor Ivani;professor;beginning of March 2021;632950
lecturer;Ye Shin;2014-10-18;56254
Kari Bargen;lecturer 50% and 50% student;2010-11-08;545230
Hanna Stern;professor;2019-05-17;64
Ida Persson;became researcher on 10.12.2016;see previous comment;520800
Mary Smith;2020-01-18;501990.0

Data Validation

Validation of Our Data Cleaning

CREATE TABLE emp.employee_validated (
    name text,
    position text CHECK (position IN ('professor', 'lecturer', 'researcher')),
    employed date CHECK (employed < now()),
    salary float CHECK ((position != 'professor' AND salary >= 400000 AND salary < 600000)
                        OR (position = 'professor' AND salary >= 600000 AND salary <= 800000))   
);

INSERT INTO emp.employee_validated
SELECT *
FROM emp.fixed; -- Fails due to wrong salary for Ove Bole