Data is collected from various sources such as databases, flat files, web scraping, and so on. The collected data may have many flaws and may be unorganized, and bad data can lead to bad results. If the data is organized properly, however, it will yield meaningful information. Cleaning, structuring, and reformatting data into the desired format is known as data wrangling. Data wrangling is a very important part of the data science lifecycle, as the validity of the whole lifecycle depends on how good the data is.
Types of errors
Before we dive into the process of cleaning the data, let us understand how errors get into the data in the first place.
There are many ways a human can enter wrong information, whether due to miscalculation, typing errors, or misinterpretation. For fields where free text is entered, there is a chance that texts with identical meaning ("No", "Nope", "Nah") are treated as different values.
When sensors break down, they are likely to produce values outside the expected range. Sometimes sensors produce valid data, but that data is corrupted on its way to the collection site. Proprietary data formats may not be readable by other programs. Interference during transmission over the open web may cause dropped packets and, consequently, incomplete data.
There is a high chance of duplicates when data is collected from multiple sources, especially when the dataset is a combination of several sources. Primary key checks sometimes fail because the same records coming in from different sources are treated as distinct. Duplicates are also created by manual entry errors.
Lack of standardization
When using multiple data sources, lack of standardization is a common issue. To obtain genuine results, all data that is equivalent in the real world must be represented identically in the input. This is self-evident, but it is not always clear how to accomplish it.
Even within a single data source, standardization issues can still arise, particularly with human input. Some people may have different spelling or capitalization habits, and people on different teams within an organization may even use different names for the same topic or item!
Identifying the errors
To fix the errors, it is important to identify them first. In this section we will discuss various methods used to identify errors.
String data entered as "Male" and "M" in a column 'sex' is identified as two different values. A date entered as '03/Aug/2019' or '03/08/2019' is treated as two different formats. You can identify these issues, at least for categorical data with few classes, by manually checking the list of classes with the unique function. Finding such errors in numeric data may require some creativity; range constraints are a good start.
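As a minimal sketch in Python (the DataFrame df, its columns, and the values in it are all hypothetical), the unique function and a simple range constraint might be used like this:

    import pandas as pd

    # Hypothetical data containing the inconsistencies described above
    df = pd.DataFrame({
        "sex": ["Male", "M", "Female", "F"],
        "dob": ["03/Aug/2019", "03/08/2019", "04/Aug/2019", "04/08/2019"],
        "age": [25, 32, 290, 41],
    })

    # Manually inspect the distinct classes of a categorical column
    print(df["sex"].unique())    # ['Male' 'M' 'Female' 'F']

    # A range constraint flags implausible numeric values
    print(df[(df["age"] < 0) | (df["age"] > 120)])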
Categorical variables can have constraints so that they accept only certain values. For example, marital status can be 'Married', 'Unmarried', 'Engaged', or 'Divorced'; gender can be 'Male' or 'Female'; a TV is either 'On' or 'Off'. We can check these categorical constraints in the data using Python. If a column can take a maximum of 5 categories, then running a unique command should show at most 5 values. If the number of categories is larger, manual inspection becomes impractical and you will need another method, such as the constraint check sketched below.
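A sketch of such a constraint check, assuming a hypothetical 'marital_status' column and the allowed categories listed above:

    import pandas as pd

    df = pd.DataFrame({"marital_status":
        ["Married", "Unmarried", "Engaged", "Divorced", "Marrried"]})

    allowed = {"Married", "Unmarried", "Engaged", "Divorced"}

    # More unique values than allowed categories signals bad entries
    print(df["marital_status"].unique())

    # Rows that violate the categorical constraint
    print(df[~df["marital_status"].isin(allowed)])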
Visualizations are a good way to easily discover outliers, strange distributions, and other issues. If you believe a variable follows a normal distribution but it is actually bimodal, you should revise your starting assumptions. Visualization methods such as box-and-whisker plots, histograms, and scatterplots can be immensely helpful in spotting issues quickly.
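A minimal sketch using matplotlib (the data here is synthetic, generated purely for illustration) that makes a hidden bimodal shape and its outliers visible:

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic column that looks normal on paper but is actually bimodal
    values = np.concatenate([np.random.normal(50, 5, 500),
                             np.random.normal(80, 5, 500)])

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(values, bins=40)     # two peaks reveal the bimodal shape
    ax1.set_title("Histogram")
    ax2.boxplot(values)           # lone points outside the whiskers are outliers
    ax2.set_title("Box-and-whisker")
    plt.show()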
Missing values are probably the most common type of data issue you will have to look into. Values may be missing because you combined two or more datasets from different sources, an observation was removed somewhere during data entry, or a value was accidentally deleted.
A couple of missing observations is most likely not a problem, but if you notice a high density of missing values, you should investigate the cause.
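In pandas, a quick way to measure that density per column (the DataFrame below is a hypothetical example) is:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "income": [52000, np.nan, 61000, np.nan, np.nan],
        "city": ["Pune", "Delhi", None, "Mumbai", "Delhi"],
    })

    print(df.isna().sum())    # count of missing values per column
    print(df.isna().mean())   # proportion missing; a high ratio warrants investigation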
Data Cleaning Techniques
Once the data errors are identified, there are various techniques to fix them. Let us look at a few of them.
Removing the affected observations or columns is the most frowned-upon technique. For missing values, it is smarter to investigate the cause rather than simply eliminating the observations or columns that contain them. This is not always avoidable, however. If an entire attribute has 85% missing values and you cannot find another data source, you will most likely be unable to use that column.
Before removing many data points, it is important to get input from experts in the field. This is especially true if you want to remove an entire column.
Consistency correction is considered a much better way of dealing with errors than removing observations or columns. Consistency issues can be fixed easily once they are properly identified.
For string consistency corrections in smaller categorical sets, it can be trivial to run a unique-values search and then write a couple of if-statements to replace the errors. For something like city names, explicit if-statements may be difficult, and you may want to use a fuzzy search to make the corrections instead.
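A sketch of both approaches, assuming hypothetical 'response' and 'city' columns; the fuzzy search here uses Python's standard-library difflib rather than a dedicated fuzzy-matching package:

    import pandas as pd
    from difflib import get_close_matches

    df = pd.DataFrame({"response": ["No", "Nope", "Nah", "Yes"],
                       "city": ["Mumbai", "Mumbay", "Delhi", "Dehli"]})

    # Small category set: an explicit mapping is enough
    df["response"] = df["response"].replace({"Nope": "No", "Nah": "No"})

    # Larger, messier set: snap each value to its closest known name
    known_cities = ["Mumbai", "Delhi", "Bengaluru"]
    def fix_city(name):
        match = get_close_matches(name, known_cities, n=1, cutoff=0.8)
        return match[0] if match else name

    df["city"] = df["city"].apply(fix_city)
    print(df)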
Numerical consistency errors, such as order-of-magnitude mismatches, are simple to fix by multiplication or division. Binary consistency issues can be corrected if you can accurately assign the non-binary input to one of the two binary categories.
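For instance, a hypothetical 'weight' column where a few rows were entered in grams instead of kilograms could be rescaled like this (the 500 kg cutoff is an assumption for illustration):

    import pandas as pd

    df = pd.DataFrame({"weight": [72.5, 68100.0, 81.0, 70250.0]})

    # Values far above a plausible kilogram range are assumed to be grams
    in_grams = df["weight"] > 500
    df.loc[in_grams, "weight"] /= 1000
    print(df)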
Imputation, itself a part of data wrangling, is much the same strategy as filling in missing values, but it can also be used for incorrect values, particularly when a direct correction cannot be made.
Imputation is a fancy way to say guess. However, since we are in the field of data science, it will be a data-driven guess, not just a random one. You can impute values using statistical indicators (such as the mean, median, or mode), hot-decking, stratification, and other methods.
One method is to fill in the values with a statistical indicator, but the problem is that we might erase a pattern already known to exist in that column and end up distorting its distribution.
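A minimal sketch of statistical imputation with pandas (hypothetical 'age' column); the median is used here, but the mean or mode work the same way:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 32, np.nan, 41]})

    # Replace every missing age with the column median
    df["age"] = df["age"].fillna(df["age"].median())
    print(df)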
Hot-decking fills in missing values by randomly selecting a value from the set of already-known values. However, even this method may overlook pattern information.
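A sketch of simple hot-decking on the same hypothetical 'age' column, drawing each missing value at random from the observed ones:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 32, np.nan, 41]})

    observed = df["age"].dropna()
    missing = df["age"].isna()

    # Randomly sample a donor value for each missing entry
    df.loc[missing, "age"] = np.random.choice(observed, size=missing.sum())
    print(df)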
Data wrangling involves a set of processes and is one of the most important tasks in data science. There are many other ways to deal with missing values; the right approach for the right dataset is a must, or we may end up making wrong imputations. That is why data science is a combination of mathematics, technology, domain knowledge, and, above all, common sense.