{"id":247,"date":"2023-12-31T19:03:07","date_gmt":"2023-12-31T19:03:07","guid":{"rendered":"https:\/\/alandix.com\/aibook\/?page_id=247"},"modified":"2023-12-31T19:03:07","modified_gmt":"2023-12-31T19:03:07","slug":"chap10","status":"publish","type":"page","link":"https:\/\/alandix.com\/aibook\/second-edition\/toc2e\/chap10\/","title":{"rendered":"Chapter 10 \u2013 Data Preparation"},"content":{"rendered":"<div class=\"embedurl\" data-url=\"https:\/\/alandix.com\/books\/aibook\/content\/chaps\/chap10.html\" ><!--  Chapter 10 Data Preparation  -->\n\n<script>\nvar chapnos = 10;\nvar json_url = \"https:\\\/\\\/alandix.com\\\/books\\\/aibook\\\/content\\\/chaps\\\/chap10.json\";\n<\/script>\n\n\n\n\n\t<object style=\"width:100%; aspect-ratio: 10 \/ 7;\" type=\"application\/pdf\" data=\"https:\/\/alandix.com\/books\/aibook\/content\/slides-pdf\/AI-chap-10.pdf\"><\/object>\n\t<p> Download <a href=\"https:\/\/alandix.com\/books\/aibook\/content\/slides-pptx\/AI-chap-10.pptx\" download>chapter slides<\/a><\/p>\n\n\n<h3> Contents <\/h3>\n<div class=\"toc\">\n<dl>\n<dt>10.1&nbsp;&nbsp;Overview<\/dt>\n<dt>10.2&nbsp;&nbsp;Stages of Data Preparation<\/dt>\n<dt>10.3&nbsp;&nbsp;Creating a Dataset<\/dt><dd><dl>\n<dt>10.3.1&nbsp;&nbsp;Extraction and Gathering of Data<\/dt>\n<dt>10.3.2&nbsp;&nbsp;Entity Reconciliation and Linking<\/dt>\n<dt>10.3.3&nbsp;&nbsp;Exception Sets<\/dt>\n<\/dl><\/dd>\n<dt>10.4&nbsp;&nbsp;Manipulation and Transformation of Data<\/dt><dd><dl>\n<dt>10.4.1&nbsp;&nbsp;Types of Data Value<\/dt>\n<dt>10.4.2&nbsp;&nbsp;Transforming to the Right Kind of Data<\/dt>\n<\/dl><\/dd>\n<dt>10.5&nbsp;&nbsp;Numerical Transformations<\/dt><dd><dl>\n<dt>10.5.1&nbsp;&nbsp;Information<\/dt>\n<dt>10.5.2&nbsp;&nbsp;Normalising Data<\/dt>\n<dt>10.5.3&nbsp;&nbsp;Missing Values -- Filling the Gaps<\/dt>\n<dt>10.5.4&nbsp;&nbsp;Outliers -- Dealing with Extremes<\/dt>\n<\/dl><\/dd>\n<dt>10.6&nbsp;&nbsp;Non-numeric Transformations<\/dt><dd><dl>\n<dt>10.6.1&nbsp;&nbsp;Media 
Data<\/dt>\n<dt>10.6.2&nbsp;&nbsp;Text<\/dt>\n<dt>10.6.3&nbsp;&nbsp;Structure Transformation<\/dt>\n<\/dl><\/dd>\n<dt>10.7&nbsp;&nbsp;Automation and Documentation<\/dt>\n<dt>10.8&nbsp;&nbsp;Summary<\/dt>\n<\/dl><\/div>\n\n\n<h3> Glossary items referenced in this chapter <\/h3>\n<div class=\"toc\">\n<a href=\"https:\/\/alandix.com\/glossary\/aibook\/arithmetic%20mean\">arithmetic mean<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/auto-associative%20memory\">auto-associative memory<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/bag%20of%20words\">bag of words<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/bootstrapping\">bootstrapping<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/cardinality%20of%20set\">cardinality of set<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/clustering\">clustering<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/confidence%20in%20output\">confidence in output<\/a>, <strong><a href=\"https:\/\/alandix.com\/glossary\/aibook\/cosine%20similarity\">cosine similarity<\/a><\/strong>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/crowdsourcing\">crowdsourcing<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/data%20cleaning\">data cleaning<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/data%20documentation\">data documentation<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/data%20reduction\">data reduction<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/data%20validation\">data validation<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/data%20wrangling\">data wrangling<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/database\">database<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/database%20identifier\">database identifier<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/decision%20tree\">decision tree<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/delta\">delta<\/a>, <a 
href=\"https:\/\/alandix.com\/glossary\/aibook\/domain-specific%20knowledge\">domain-specific knowledge<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/ecg\">ECG<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/entity%20recognition\">entity recognition<\/a>, <strong><a href=\"https:\/\/alandix.com\/glossary\/aibook\/entropy\">entropy<\/a><\/strong>, <strong><a href=\"https:\/\/alandix.com\/glossary\/aibook\/exception%20sets\">exception sets<\/a><\/strong>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/genetic%20algorithm\">genetic algorithm<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/geographic%20information%20system\">geographic information system<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/heterogeneous%20sources\">heterogeneous sources<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/heuristic%20evaluation%20function\">heuristic evaluation function<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/human%20computation\">human computation<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/human%20intelligence\">human intelligence<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/infinite%20impulse%20response\">infinite impulse response<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/information\">information<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/information%20processing\">information processing<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/internet%20of%20things\">internet of things<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/isbn\">ISBN<\/a>, <strong><a href=\"https:\/\/alandix.com\/glossary\/aibook\/jaccard%20similarity\">Jaccard similarity<\/a><\/strong>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/json\">JSON<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/keyword%20matching\">keyword matching<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/least%20squares\">least squares<\/a>, <a 
href=\"https:\/\/alandix.com\/glossary\/aibook\/linear%20regression\">linear regression<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/linked%20data\">linked data<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/llanfairpwll\">Llanfairpwll<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/logarithmic%20transform\">logarithmic transform<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/long-tail%20distribution\">long-tail distribution<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/machine%20learning\">machine learning<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/moving%20average\">moving average<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/national%20insurance%20number\">National Insurance number<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/natural%20language%20processing\">natural language processing<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/neural%20network\">neural network<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/normal%20distribution\">Normal distribution<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/normalisation\">normalisation<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/ocr\">OCR (optical character recognition)<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/optimal\">optimal<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/optional%20values\">optional values<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/outliers\">outliers<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/pattern%20matching\">pattern matching<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/pre-whitening\">pre-whitening<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/principal%20components%20analysis\">principal components analysis<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/privacy\">privacy<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/probability\">probability<\/a>, <a 
href=\"https:\/\/alandix.com\/glossary\/aibook\/provenance\">provenance<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/pseudonymisation\">pseudonymisation<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/python\">Python<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/recommender%20systems\">recommender systems<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/regular%20expression\">regular expression<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/relational%20database\">relational database<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/restricted%20boltzmann%20machine\">restricted Boltzmann machine<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/robert%20wadlow\">Robert Wadlow<\/a>, <strong><a href=\"https:\/\/alandix.com\/glossary\/aibook\/sail%20databank\">SAIL databank<\/a><\/strong>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/sampling%20bias\">sampling bias<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/sanity%20checks\">sanity checks<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/screen%20scraping\">screen scraping<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/script\">script<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/semantic%20integrity\">semantic integrity<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/semi-structured%20data\">semi-structured data<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/set%20theory\">set theory<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/sigmoid%20function\">sigmoid function<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/similarity%20measure\">similarity measure<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/social%20media\">social media<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/standard%20deviation\">standard deviation<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/stop%20words\">stop words<\/a>, <a 
href=\"https:\/\/alandix.com\/glossary\/aibook\/threshold\">threshold<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/time%20stamp\">time stamp<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/uniform%20distribution\">uniform distribution<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/unique%20identifier\">unique identifier<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/unsupervised%20learning\">unsupervised learning<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/validation%20rules\">validation rules<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/variance\">variance<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/web%20scraping\">web scraping<\/a>, <a href=\"https:\/\/alandix.com\/glossary\/aibook\/xml\">XML<\/a><\/div>\n\n\n\n\n<\/div>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":2,"featured_media":0,"parent":221,"menu_order":10,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_themeisle_gutenberg_block_has_review":false,"footnotes":""},"class_list":["post-247","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/alandix.com\/aibook\/wp-json\/wp\/v2\/pages\/247","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/alandix.com\/aibook\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/alandix.com\/aibook\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/alandix.com\/aibook\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/alandix.com\/aibook\/wp-json\/wp\/v2\/comments?post=247"}],"version-history":[{"count":3,"href":"https:\/\/alandix.com\/aibook\/wp-json\/wp\/v2\/pages\/247\/revisions"}],"predecessor-version":[{"id":300,"href":"https:\/\/alandix.com\/aibook\/wp-json\/wp\/v2\/pages\/247\/revisions\/300"}],"up":[{"embeddable":true,"href":"https:\/\/alandix.com\/aibook\/wp-json\/wp\/v2\/pages\/221"}],"wp:attachment":[{"href":"https:\/\/alandix.com
\/aibook\/wp-json\/wp\/v2\/media?parent=247"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}