Contents
- 10.1 Overview
- 10.2 Stages of Data Preparation
- 10.3 Creating a Dataset
  - 10.3.1 Extraction and Gathering of Data
  - 10.3.2 Entity Reconciliation and Linking
  - 10.3.3 Exception Sets
- 10.4 Manipulation and Transformation of Data
  - 10.4.1 Types of Data Value
  - 10.4.2 Transforming to the Right Kind of Data
- 10.5 Numerical Transformations
  - 10.5.1 Information
  - 10.5.2 Normalising Data
  - 10.5.3 Missing Values -- Filling the Gaps
  - 10.5.4 Outliers -- Dealing with Extremes
- 10.6 Non-numeric Transformations
  - 10.6.1 Media Data
  - 10.6.2 Text
  - 10.6.3 Structure Transformation
- 10.7 Automation and Documentation
- 10.8 Summary
Glossary items referenced in this chapter
arithmetic mean, auto-associative memory, bag of words, bootstrapping, cardinality of set, clustering, confidence in output, cosine similarity, crowdsourcing, data cleaning, data documentation, data reduction, data validation, data wrangling, database, database identifier, decision tree, delta, domain-specific knowledge, ECG, entity recognition, entropy, exception sets, genetic algorithm, heterogeneous sources, heuristic evaluation function, human computation, human intelligence, infinite impulse response, information, information processing, internet of things, ISBN, Jaccard similarity, JSON, keyword matching, least squares estimate, linear regression, linked data, Llanfairpwll, logarithmic transform, long-tail distribution, machine learning, moving average, National Insurance number, natural language processing, neural network, Normal distribution, normalisation, OCR (optical character recognition), optimal, optional values, outliers, pattern matching, pre-whitening, principal components analysis, privacy, probability, provenance, pseudonymisation, Python, recommender systems, regular expression, relational database, Restricted Boltzmann Machine, Robert Wadlow, SAIL databank, sampling bias, sanity checks, screen scraping, script, semantic integrity, semi-structured data, set theory, sigmoid function, similarity measure, social media, standard deviation, stop words, threshold, time stamp, uniform distribution, unique identifier, unsupervised learning, validation rules, variance, web scraping, XML