Data Cleaning
Data cleaning (also called data cleansing) is the process of detecting and correcting erroneous data, typically by updating or transforming faulty values into correct ones. Erroneous data can also be viewed as data of poor quality, where data quality is a measure of how useful or usable a dataset is. Data quality is typically an aggregate over all data in a dataset, and several aspects contribute to good or poor data quality, such as:
- Validity: Conformity to knowledge/business rules/constraints
- Accuracy: Conformity to truth/predefined standard
- Completeness: How much of the required data is known/contained in the dataset
- Consistency: Equivalence/agreement across/within systems/datasets
- Uniformity: Conformity of formats, units of measure, etc.
Note that data quality depends on the intended use: a dataset that is good enough for one purpose may be inadequate for another.
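As a rough illustration of how some of these aspects can be checked and repaired in practice, here is a minimal Python sketch using only the standard library. The records, the field names, the age range of 0 to 130, and the two accepted date formats are all assumptions made up for this example; real rules come from the domain and from how the data will be used.

```python
from datetime import datetime

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"id": 1, "age": 34, "height": "180 cm", "signup": "2021-03-05"},
    {"id": 2, "age": -5, "height": "1.75 m", "signup": "05/03/2021"},
    {"id": 3, "age": None, "height": "168 cm", "signup": "2021-04-17"},
]

def is_valid_age(age):
    """Validity: an assumed business rule that ages lie in 0..130."""
    return 0 <= age <= 130

def completeness(rows, field):
    """Completeness: fraction of rows in which the field is known."""
    return sum(r[field] is not None for r in rows) / len(rows)

def height_in_cm(text):
    """Uniformity: convert '180 cm' or '1.75 m' to centimetres."""
    value, unit = text.split()
    factor = {"cm": 1.0, "m": 100.0}[unit]
    return round(float(value) * factor, 1)

def iso_date(text):
    """Uniformity: normalise the two assumed date formats to ISO 8601."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")

# Detection: known ages that break the validity rule.
# Missing ages are counted under completeness instead.
invalid_ids = [
    r["id"] for r in records
    if r["age"] is not None and not is_valid_age(r["age"])
]
age_completeness = completeness(records, "age")

# Correction: add a uniform height column and rewrite dates in one format.
cleaned = [
    dict(r, height_cm=height_in_cm(r["height"]), signup=iso_date(r["signup"]))
    for r in records
]
```

Detection and correction are kept separate here: the validity and completeness checks only report problems, while the uniformity helpers transform values. Accuracy and consistency are not covered, since checking them needs a trusted reference or a second dataset to compare against.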
Relevant Articles and Tutorials
- Wikipedia’s page on Data Cleaning
- Problems, Methods and Challenges in Comprehensive Data Cleansing
  - Somewhat outdated (2003), but gives a good overview of approaches and challenges
- Cleaning and Transforming Data with SQL