Exploratory Data Analysis in 4 Overall Steps IN SHORT

Sandipan Paul
3 min read · Nov 3, 2022

It is good practice to understand the data first and gather as many insights from it as possible. EDA is all about making sense of the data in hand before feeding it into complex models.

Overall, EDA has 4 steps:

Step 1 is the Univariate Analysis

Step 2 is the Bivariate Analysis

Step 3 is the Missing Value and Outlier Detection

Step 4 is the Feature Engineering

Step 1 is the Univariate Analysis

Uni means one, so univariate analysis looks at a single variable at a time.

A variable can be described with different measures. They fall mainly into three categories.

The first is the Measure of Central Tendency. For example, Mean, Median and Mode.

The second is the Measure of Data Spread. For example, Percentile, Quartile, IQR, Variance and Standard Deviation, often visualized with a Boxplot.

The third is the Measure of Distribution (shape). For example, Skewness and Kurtosis.
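As a minimal sketch of these three groups of measures, the snippet below uses pandas on a small made-up sample. The variable names and the age values are assumptions for illustration only, not data from this article.

```python
import pandas as pd

# Hypothetical sample: ages of ten customers (made-up values).
ages = pd.Series([22, 25, 25, 28, 30, 31, 35, 38, 41, 60], name="age")

# Measures of central tendency
mean_age = ages.mean()
median_age = ages.median()
mode_age = ages.mode().iloc[0]  # mode() returns a Series; take the first mode

# Measures of spread
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1            # interquartile range
variance = ages.var()    # sample variance (ddof=1)
std_dev = ages.std()     # sample standard deviation (ddof=1)

# Measures of distribution shape
skewness = ages.skew()   # > 0 means a longer right tail
kurtosis = ages.kurt()   # excess kurtosis (a normal distribution is near 0)

print(f"mean={mean_age}, median={median_age}, mode={mode_age}")
print(f"IQR={iqr}, variance={variance:.2f}, std={std_dev:.2f}")
print(f"skewness={skewness:.2f}, kurtosis={kurtosis:.2f}")
```

Here the single high value (60) pulls the mean above the median, which is why the skewness comes out positive.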

For an in-depth reference, see the article Univariate Analysis in Python IN SHORT.

Step 2 is the Bivariate Analysis

Bi means two, so bivariate analysis looks at the relationship between two variables.

Variables are mainly classified into Continuous variables and Categorical variables.

A continuous variable is a variable which takes numeric values within a range. For example, Age. A categorical variable is a variable which takes one of a fixed set of categories. For example, Gender (Male/Female).

So, for bivariate analysis, there are 3 possible combinations.

First, Continuous vs Continuous. Second, Continuous vs Categorical. Third, Categorical vs Categorical.

Continuous variable vs Continuous variable

Both variables are Continuous. For example, Experience vs Salary. How does the average salary of a Data Scientist change with years of experience in the industry?

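Methods we can use to analyze Continuous vs Continuous are Covariance, Correlation and Variance Inflation Factor (VIF). VIF is mainly used when there are several continuous predictors, to measure how strongly one of them is explained by the others (multicollinearity).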
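A small sketch of covariance and correlation for the Experience vs Salary example, using pandas. The column names and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical data: years of experience and salary (in thousands).
df = pd.DataFrame({
    "experience_years": [1, 2, 3, 5, 7, 10],
    "salary_k": [40, 45, 52, 65, 80, 100],
})

# Covariance: the sign shows the direction of the relationship,
# but the magnitude depends on the units of both variables.
covariance = df["experience_years"].cov(df["salary_k"])

# Pearson correlation: unit-free, always between -1 and 1.
correlation = df["experience_years"].corr(df["salary_k"])

print(f"covariance={covariance:.2f}, correlation={correlation:.4f}")
```

Correlation is usually easier to interpret than covariance because it does not change when the units change (for example, salary in thousands vs salary in full currency units).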

Continuous variable vs Categorical variable

One variable is Continuous and the other is Categorical. For example, Degree (Graduate/Master) vs Salary. What is the average salary of a postgraduate Data Scientist?

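Methods we can use to analyze Continuous vs Categorical are the T-Test, the Z-Test and ANOVA. The T-Test and Z-Test compare the means of two groups, while ANOVA compares the means of three or more groups.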
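The following is a rough sketch of a two-sample T-Test and a one-way ANOVA with SciPy. The salary figures are made up, and the third group (PhD) is an assumption added only so that ANOVA has more than two groups to compare.

```python
from scipy import stats

# Hypothetical salaries (in thousands) per degree group; values are made up.
graduate_salary = [48, 52, 50, 47, 55, 51]
master_salary = [58, 62, 60, 65, 59, 63]
phd_salary = [70, 72, 68, 75, 71, 74]  # assumed extra group, for ANOVA only

# Welch's t-test: compares the means of two groups without
# assuming that both groups have the same variance.
t_stat, t_p_value = stats.ttest_ind(graduate_salary, master_salary, equal_var=False)

# One-way ANOVA: tests whether at least one group mean differs.
f_stat, anova_p_value = stats.f_oneway(graduate_salary, master_salary, phd_salary)

print(f"t-test: t={t_stat:.2f}, p={t_p_value:.4f}")
print(f"ANOVA:  F={f_stat:.2f}, p={anova_p_value:.4f}")
```

A small p-value (commonly below 0.05) suggests the difference in average salary between the groups is unlikely to be due to chance alone.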

Categorical variable vs Categorical variable

Both variables are Categorical. For example, Degree (Graduate/Master) vs Marital Status (Married: Yes/No). How many postgraduates are married?

The method we can use to analyze Categorical vs Categorical is the Chi-Square Test.
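As an illustrative sketch, the Degree vs Marital Status question can be answered with a contingency table, and the Chi-Square Test can then check whether the two variables are related. The counts and column names below are made up.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey of 100 people; counts are made up.
df = pd.DataFrame({
    "degree": ["Graduate"] * 50 + ["Master"] * 50,
    "married": ["Yes"] * 30 + ["No"] * 20 + ["Yes"] * 15 + ["No"] * 35,
})

# Contingency table: rows are degrees, columns are marital status.
table = pd.crosstab(df["degree"], df["married"])
print(table)

# "How many postgraduates are married?"
married_masters = table.loc["Master", "Yes"]

# Chi-square test of independence between the two categorical variables.
chi2, p_value, dof, expected = chi2_contingency(table.to_numpy())
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```

Note that for a 2x2 table SciPy applies Yates' continuity correction by default, so the statistic is slightly smaller than the uncorrected hand calculation.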

Step 3 is the Missing Value and Outlier Detection

Missing Value

There is no perfect way to handle missing values, because we can never know exactly what the missing value was. But there are several techniques we can use that give decent results.

For example, if the variable is continuous, use Mean or Median imputation. If the variable is categorical, use Mode imputation. For a more complex approach, use distance-based imputation such as KNN imputation, which fills a missing value using the most similar rows.
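A minimal sketch of median and mode imputation with pandas is shown below. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "age": [25, 30, np.nan, 40, 35],
    "gender": ["Male", "Female", "Female", None, "Female"],
})

imputed = df.copy()

# Continuous variable: fill with the median (less sensitive to outliers than the mean).
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

# Categorical variable: fill with the mode (the most frequent category).
imputed["gender"] = imputed["gender"].fillna(imputed["gender"].mode().iloc[0])

print(imputed)
```

For KNN imputation, scikit-learn provides `KNNImputer` in `sklearn.impute`; it works on numeric columns, so categorical columns would need to be encoded first.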

Outlier Detection

Outliers are data points that don’t fit the pattern of the rest of the numbers. They are the extremely high or extremely low values in the data set.

For example, suppose the Age of a customer is 110. An age of 110 is possible, but it is very high compared to the rest of the population.

Methods to detect Outliers are Percentile, Box Plot and Z Score.
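The sketch below applies two of these methods, the Box Plot rule (1.5 times the IQR) and the Z Score, to a made-up list of ages that includes the 110 from the example above. The thresholds 1.5 and 3 are common conventions, not fixed rules.

```python
import pandas as pd

# Hypothetical customer ages; the last value is the suspicious one.
ages = pd.Series([22, 24, 25, 27, 28, 30, 31, 33, 34, 35,
                  36, 38, 40, 41, 43, 45, 47, 50, 52, 110])

# Box plot (IQR) rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

# Z score rule: flag points more than 3 standard deviations from the mean.
z_scores = (ages - ages.mean()) / ages.std()
z_outliers = ages[z_scores.abs() > 3]

print("IQR outliers:", iqr_outliers.tolist())
print("Z score outliers:", z_outliers.tolist())
```

On very small samples the Z Score rule can miss outliers, because the outlier itself inflates the standard deviation; the IQR rule is less affected by this.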

For an in-depth reference, see the article Missing Value and Outlier Detection in Python IN SHORT.

Step 4 is the Feature Engineering

Feature engineering is the process of selecting, manipulating, and transforming raw data into features.

The first is Variable Transformation. For example, the Log transformation or the Square Root transformation. Scaling is also a type of Variable Transformation. For example, Min-Max scaling (often called Normalization) and Standardization. The second is Feature Construction. For example, Binning.
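As a rough sketch, these transformations can be written with NumPy and pandas as follows. The column names, values, bin edges and labels are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a right-skewed income column and an age column.
df = pd.DataFrame({
    "income": [20, 35, 50, 80, 200],
    "age": [22, 35, 47, 58, 71],
})

# Variable transformation: compress large values to reduce skew.
df["income_log"] = np.log1p(df["income"])   # log(1 + x), safe when x is 0
df["income_sqrt"] = np.sqrt(df["income"])

# Min-Max scaling (normalization): rescale to the range [0, 1].
income_range = df["income"].max() - df["income"].min()
df["income_minmax"] = (df["income"] - df["income"].min()) / income_range

# Standardization: rescale to mean 0 and standard deviation 1.
df["income_standard"] = (df["income"] - df["income"].mean()) / df["income"].std(ddof=0)

# Feature construction by binning: turn a continuous age into age groups.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                         labels=["young", "middle", "senior"])

print(df)
```

With the default settings of `pd.cut`, each bin includes its right edge, so an age of exactly 30 would fall into the "young" group.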

For an in-depth reference, see the article Feature Engineering in Python IN SHORT.

If this was helpful, please give it a thumbs up. Thank you, and please follow:

Medium: https://medium.com/@sandipanpaul

GitHub: https://github.com/sandipanpaul21
