Chapter 10 – Data Preparation

Contents

10.1  Overview
10.2  Stages of Data Preparation
10.3  Creating a Dataset
10.3.1  Extraction and Gathering of Data
10.3.2  Entity Reconciliation and Linking
10.3.3  Exception Sets
10.4  Manipulation and Transformation of Data
10.4.1  Types of Data Value
10.4.2  Transforming to the Right Kind of Data
10.5  Numerical Transformations
10.5.1  Information
10.5.2  Normalising Data
10.5.3  Missing Values -- Filling the Gaps
10.5.4  Outliers -- Dealing with Extremes
10.6  Non-numeric Transformations
10.6.1  Media Data
10.6.2  Text
10.6.3  Structure Transformation
10.7  Automation and Documentation
10.8  Summary

Glossary items referenced in this chapter

arithmetic mean, auto-associative memory, bag of words, bootstrapping, cardinality of set, clustering, confidence in output, cosine similarity, crowdsourcing, data cleaning, data documentation, data reduction, data validation, data wrangling, database, database identifier, decision tree, delta, domain-specific knowledge, ECG, entity recognition, entropy, exception sets, genetic algorithm, heterogeneous sources, heuristic evaluation function, human computation, human intelligence, infinite impulse response, information, information processing, internet of things, ISBN, Jaccard similarity, JSON, keyword matching, least squares estimate, linear regression, linked data, Llanfairpwll, logarithmic transform, long-tail distribution, machine learning, moving average, National Insurance number, natural language processing, neural network, Normal distribution, normalisation, OCR (optical character recognition), optimal, optional values, outliers, pattern matching, pre-whitening, principal components analysis, privacy, probability, provenance, pseudonymisation, Python, recommender systems, regular expression, relational database, Restricted Boltzmann Machine, Robert Wadlow, SAIL databank, sampling bias, sanity checks, screen scraping, script, semantic integrity, semi-structured data, set theory, sigmoid function, similarity measure, social media, standard deviation, stop words, threshold, time stamp, uniform distribution, unique identifier, unsupervised learning, validation rules, variance, web scraping, XML