Exploratory Data Analysis (EDA) is the most critical initial step for data scientists analyzing a new dataset. This guide describes simple and advanced techniques using Python.
Exploratory Data Analysis, also known as Data Exploration, is a step in the data analysis process in which several techniques are used to better understand the dataset.
Understanding the dataset can refer to a few things including but not limited to…
- Extracting important variables and leaving behind useless variables.
- Identifying outliers, missing values, or human error.
- Understanding the relationship, or lack thereof, between variables.
- Ultimately, maximizing insights into a dataset and minimizing potential errors later in the process.
Components of Exploratory Data Analysis:
The main components of exploring data are:
- Understanding your variables:
The best way to explain this step is with an example, such as the exploratory data provided with a datasheet. The goal is to gain some insight into what each variable represents. Where two interpretations are possible, compare them and keep the one that the analysis best supports.
Some pandas DataFrame attributes and methods help with understanding the data more deeply…
- .shape – returns the number of rows by the number of columns for the dataset.
- .head() – returns the first five rows of the dataset.
- .columns – returns the names of all the columns in the dataset.
Once every variable in the set is known, the following methods give a better understanding.
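As a minimal sketch of these three calls, the snippet below builds a small toy DataFrame. The data and the price and year column names are invented for illustration; only condition is named in this guide.

```python
import pandas as pd

# Hypothetical toy dataset; values and most column names are illustrative only.
df = pd.DataFrame({
    "price": [4500, 12000, 7800, 3000, 15500, 9900],
    "year": [2008, 2015, 2012, 2005, 2018, 2014],
    "condition": ["good", "excellent", "good", "fair", "like new", "excellent"],
})

n_rows, n_cols = df.shape        # tuple of (number of rows, number of columns)
first_rows = df.head()           # first five rows by default
column_names = list(df.columns)  # column labels as a plain Python list

print(n_rows, n_cols)
print(first_rows)
print(column_names)
```

Passing a number, as in df.head(10), returns that many rows instead of the default five.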
- .nunique(axis=0) – returns the number of unique values of each variable.
- .describe() – summarizes the count, mean, standard deviation, min, quartiles, and max for numeric variables.
- .unique() – returns the distinct values of a discrete variable, such as ‘condition’.
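The three methods above could be tried out as follows. This is a sketch on an invented toy DataFrame: the values and the price and year columns are assumptions, while condition is the discrete variable mentioned in the list.

```python
import pandas as pd

# Hypothetical toy dataset; values and most column names are illustrative only.
df = pd.DataFrame({
    "price": [4500, 12000, 7800, 3000, 15500, 9900],
    "year": [2008, 2015, 2012, 2005, 2018, 2014],
    "condition": ["good", "excellent", "good", "fair", "like new", "excellent"],
})

# Number of distinct values in each column (axis=0 counts down each column).
unique_counts = df.nunique(axis=0)

# Summary statistics; by default only numeric columns are included.
summary = df.describe()

# Distinct values of one discrete column, in order of first appearance.
conditions = df["condition"].unique()

print(unique_counts)
print(summary)
print(conditions)
```

Note that .unique() is called on a single column (a Series), whereas .nunique() and .describe() work on the whole DataFrame.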
When describing the variables, ignore synonyms and use only the names that are given in the dataset.
Cleaning the dataset:
At this point, discrete data can be reclassified if needed, but a few things still need to be looked at.
- Removing Redundant variables
Redundant variables can be eliminated. Here, that includes url, image_url, and city_url.
- Variable selection
This step gets rid of any columns that have too many null values. The threshold can vary.
- Removing Outliers
This step revisits the outlier issue identified earlier. Several methods exist for determining the optimal boundaries.
- Removing rows with null values
Lastly, .dropna(axis=0) is used to remove any rows with null values.
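The four cleaning steps above might look like the sketch below. The toy data, the price, odometer, and size columns, and the 40% null threshold are all assumptions made for illustration. The outlier step uses the common 1.5 × IQR rule, which is only one possible way to choose boundaries and is not prescribed by this guide.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; only url, image_url and city_url are named in the text.
df = pd.DataFrame({
    "url": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "image_url": ["i1", "i2", "i3", "i4", "i5", "i6"],
    "city_url": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "price": [4500, 12000, 7800, 3000, 9_999_999, 9900],
    "odometer": [120000, 45000, np.nan, 150000, 30000, 60000],
    "size": [np.nan, np.nan, "mid-size", np.nan, np.nan, np.nan],
})

# 1. Remove redundant variables.
df_cleaned = df.drop(columns=["url", "image_url", "city_url"])

# 2. Variable selection: drop columns whose share of nulls exceeds a threshold.
NULL_THRESHOLD = 0.4  # assumed value; tune it for the dataset at hand
null_share = df_cleaned.isna().mean()
df_cleaned = df_cleaned.loc[:, null_share <= NULL_THRESHOLD]

# 3. Remove outliers in price using the 1.5 * IQR rule (an assumed choice).
q1, q3 = df_cleaned["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df_cleaned["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_cleaned = df_cleaned[in_range]

# 4. Remove any remaining rows that contain null values.
df_cleaned = df_cleaned.dropna(axis=0)

print(df_cleaned)
```

The order matters here: dropping sparse columns before calling .dropna(axis=0) avoids discarding rows only because a mostly empty column is null.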
Covariation of Categorical variables:
To visualize the covariation between categorical variables, one needs to count the number of observations for each combination. One way to do that is to rely on a built-in count plot, in which the size of each circle displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific values of X and Y.
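The counting step can be sketched in pandas as shown below. The toy data and the fuel column are assumptions for illustration, and the plotting itself is left out. The long-form counts could be passed as marker sizes to a scatter plot to get the circle-per-combination view.

```python
import pandas as pd

# Hypothetical toy dataset with two categorical variables.
df = pd.DataFrame({
    "condition": ["good", "excellent", "good", "fair", "good", "excellent"],
    "fuel": ["gas", "gas", "diesel", "gas", "gas", "gas"],
})

# Contingency table: rows are condition values, columns are fuel values,
# and each cell holds the number of observations for that combination.
counts_table = pd.crosstab(df["condition"], df["fuel"])

# Long form: one row per observed combination, with its count in column n.
counts_long = (
    df.groupby(["condition", "fuel"])
    .size()
    .reset_index(name="n")
)

print(counts_table)
print(counts_long)
```

The table form is convenient for reading, while the long form is the shape most plotting functions expect.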
Exploratory data analysis is one of the key competencies of a data scientist at a startup. One should be able to dig into a new data set and determine how to improve the product based on results.