O'Reilly - Clean Data: Tips, Tricks, and Techniques
by Tomasz Lelek | Released October 2018 | ISBN: 9781789808902
Use Python to check your data for consistency and get rid of any missing or duplicate data.

About This Video
- Sift through your data to identify issues such as outliers, missing values, and duplicate rows.
- Deal with unstructured data in the most effective ways and hone your skills in transforming and combining your data.
- Use Python to check your data for consistency and get rid of any missing or duplicated data.

In Detail
"Give me six hours to chop down a tree and I will spend the first four sharpening the axe." Do you apply the same principle to Data Science? Effective data cleaning is one of the most important aspects of good Data Science: acquiring raw data and preparing it for analysis. Done poorly, it will not give you the accuracy or results you are looking for, no matter how good your algorithm is. Data cleaning is the hardest part of big data and machine learning. This course equips you with the skills you need to clean your data in Python using tried and tested techniques, along with a plethora of tips and tricks that will help you get the job done in a smart, easy, and efficient way.

All the code and supporting files for this course are available on GitHub at https://github.com/PacktPublishing/Clean-Data-Tips-Tricks-and-Techniques
- Chapter 1 : Identifying the Most Important Data Issues
- The Course Overview 00:02:41
- Setting Up the Work Environment 00:01:57
- Finding Outliers in the Input Data 00:04:19
- Reconcile Missing Values to Give Data More Meaning 00:03:26
- Implementing and Testing the IQR Method 00:06:37
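Chapter 1 closes with the IQR (interquartile range) method for outlier detection: values falling outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] are flagged. A minimal sketch of the idea (the function name and sample data are illustrative, not taken from the course):

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 95, 11, 10, -40]
print(iqr_outliers(data))  # [95, -40]
```

The multiplier `k = 1.5` is the conventional default; raising it makes the filter more tolerant of extreme values.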
- Chapter 2 : Cleaning Text Data
- Tokenizing Input Data 00:04:58
- Cleaning Stop Words 00:04:49
- Removing Data-Specific Words That Have a Negative Impact 00:04:38
- Handling White Spaces and Language-Agnostic Phrases 00:04:53
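The text-cleaning steps in Chapter 2 (tokenizing, dropping stop words, and normalizing whitespace) can be sketched with only the standard library; the course may well use a dedicated NLP library instead, and the stop-word set below is a tiny illustrative sample, not a real list:

```python
import re

# Illustrative sample only; production stop-word lists are much larger.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to"}

def clean_tokens(text):
    """Lowercase, split on runs of non-word characters, drop stop words."""
    tokens = re.split(r"\W+", text.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]

print(clean_tokens("The axe is sharp,  and ready to   chop."))
# ['axe', 'sharp', 'ready', 'chop']
```

Splitting on `\W+` handles both punctuation and repeated whitespace in one pass, which is why no separate whitespace-stripping step is needed here.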
- Chapter 3 : Dealing with Unstructured Data (Text)
- Analyzing Unstructured Text Input Data 00:01:56
- Extracting Features from Data and Transforming Text into Vectors 00:03:51
- Bag-Of-Words 00:05:54
- Reducing Noise in Data by Using Skip-Gram 00:07:04
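Chapter 3's bag-of-words step maps each document to a term-frequency vector over a shared vocabulary. A minimal sketch of that transformation (function name and documents are made up; skip-gram, which models word-context pairs, is beyond this snippet):

```python
from collections import Counter

def bag_of_words(docs):
    """Return a shared vocabulary and one term-count vector per document."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

vocab, vecs = bag_of_words(["clean data", "clean clean code"])
print(vocab)  # ['clean', 'code', 'data']
print(vecs)   # [[1, 0, 1], [2, 1, 0]]
```

The vectors deliberately discard word order; that loss of context is exactly what skip-gram-style models are designed to recover.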
- Chapter 4 : Duplicates
- Analyzing Rows – Finding Duplicate Columns 00:06:25
- Finding Global Row Duplicates 00:04:25
- Handling Duplicates by Implementing Idempotent Processing 00:03:19
- Duplicates That Have Meaning 00:04:05
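Finding global row duplicates, the centerpiece of Chapter 4, is commonly done with pandas; a minimal sketch, assuming pandas and an invented DataFrame (not data from the course):

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["alice", "bob", "alice", "carol"],
    "score": [10, 20, 10, 30],
})

# duplicated() marks rows whose every column matches an earlier row.
print(df.duplicated().tolist())  # [False, False, True, False]

# drop_duplicates() keeps the first occurrence of each row.
deduped = df.drop_duplicates()
print(len(deduped))  # 3
```

Because `drop_duplicates()` on an already-deduplicated frame returns the same rows, repeating the step is harmless, which is the essence of the idempotent processing the chapter advocates.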
- Chapter 5 : Reasoning about Types and Defaults
- Interpreting Not a Number – Cleaning for Numeric Data 00:05:43
- Replacing NaN with Scalar Data 00:03:49
- Backward Fill and Forward Fill 00:03:04
- Replacing Generic Values 00:03:17
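The NaN-handling strategies listed in Chapter 5 map directly onto standard pandas calls; a minimal sketch with an invented series (the values are illustrative only):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Replace NaN with a scalar.
print(s.fillna(0).tolist())   # [1.0, 0.0, 0.0, 4.0, 0.0]

# Forward fill: carry the last valid value ahead.
print(s.ffill().tolist())     # [1.0, 1.0, 1.0, 4.0, 4.0]

# Backward fill: pull the next valid value back (trailing NaN stays NaN).
print(s.bfill().head(4).tolist())  # [1.0, 4.0, 4.0, 4.0]

# Replace a generic (non-NaN) value.
print(s.replace(4.0, 40.0).dropna().tolist())  # [1.0, 40.0]
```

Which fill to choose depends on the data's direction of causality: forward fill suits time series where the last reading stays valid, while a scalar fill suits counts where missing means zero.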