# Exploratory Data Analysis (EDA)
## Summary Statistics
It is important to know how to extract information from descriptive statistics.
```py
# get the data info
df.info()
# summary stats
df.describe()
# inspect the data
df.head()
df.value_counts()
# Get value counts for our target variable
df['income'].value_counts()
# include string and categorical features
df.describe(include=['int', 'float', 'object', 'category'])
```
## Python Graphing Libraries
Here are some useful Python graphing libraries [17]:
Altair: Declarative Visualization Made Simple
Altair is a declarative statistical visualization library focusing on simplicity and expressiveness that minimizes boilerplate code and emphasizes interactive charts.
DuckDB: High-Performance SQL OLAP
DuckDB is an in-process SQL OLAP database optimized for analytical workload which allows seamless integration with Python tools like Pandas and Jupyter.
FlashText: Efficient Text Search and Replacement
FlashText is a lightweight library for keyword extraction and replacement, outperforming regex in speed and simplicity for many use cases.
Missingno: Visualizing Missing Data
Missingno provides quick and intuitive visualizations for missing data, helping identify patterns and correlations.
NetworkX: Analyzing Graph Data
NetworkX is a versatile library for analyzing and visualizing graph structures from social networks to transportation systems.
Ydata Profiling: Automated Data Insights
Ydata Profiling automates dataset exploration by generating detailed HTML reports that highlight distributions, correlations, and data quality.
## Exploratory Data Analysis
Exploratory Data Analysis (EDA) is one of the first steps of the data science process which involves learning as much as possible about the data without spending too much time.
The basic steps are the following (see [Data Preparation](./data_prep.md)):
1. Basic Exploration
2. Check Data Types
3. Handle Missing Values
4. Handle Duplicate Values
5. Handle Outliers
6. Visualize the Data
We can get an instinctive as well as a high-level practical understanding of the data including a general idea of the structure of the data set, some cleaning ideas, the target variable and possible modeling techniques.
We can display the summary statistics and create histograms for the numeric variables of the dataset.
```py
# check the shape of the dataframe
df.shape
# Print a concise summary of a DataFrame.
# print information about a DataFrame including the index dtype
# and columns, non-null values, and memory usage.
df.info()
# summary statistics fir numeric data
df.describe()
# get categorical data
cat_data = df.select_dtypes(include=['object'])
# show counts values of each categorical variable
for colname in cat_data.columns:
print (colname)
print (cat_data[colname].value_counts(), '\n')
# check data types
df.dtypes
df.size
# get column list
df.columns.tolist()
# check the distribution of categorical columns
df["product_group"].value_counts()
# find percent share of each value by using the normalize parameter
df["product_group"].value_counts(normalize=True)
# check the average price of products for each product group
df.groupby("product_group", as_index=False).agg(
avg_price = ("price","mean")
)
# change column data type
df[['age', 'weight']] = df[['age', 'weight']].astype(float)
# Set new column value based on multiple criteria
# Use bitwise operators instead of AND and OR
df.loc[(df.AvgProduction> 1000000) & (df.Age > 5), 'Category'] = 'Priority 1'
# Check for missing timestamps or rows
time_range = pd.date_range(startdate , enddate, freq='1min')
ts = pd.DataFrame(time_range)
ts.rename(columns = {ts.columns[0]:'timestamp'}, inplace = True)
## now complete a merge to join the sets together
# Filter data based on string match
df1 = df[df[‘Flag’].str.contains(“CHECK ME NOW”)]
# Aggregate across columns
df['StateAverage'] = df_mo[['school1', 'school2','school3', 'school4']].mean(axis=1)
pd.DataFrame({"values":{col:df[col].unique() for col in df},
'type':{col:df[col].dtype for col in df},
'unique values':{col:len(df[col].unique()) for col in df},
'NA values':{col:str(round(sum(df[col].isna())/len(df),2))+'%' for col in df},
'Duplicated Values':{col:sum(df[col].duplicated()) for col in df}
})
```
```py
numeric_variables = list(df.select_dtypes(include=['int64', 'float64'])) #select the numeric variables
df[numeric_variables].describe().apply(lambda x:round(x,2)).T #apply describe method
# create the histograms
histograms = df[numeric_variables].hist(bins =10,
xlabelsize=10,
ylabelsize=10,
grid=False,
sharey= True,
figsize = (15,15))
```
```py
# numerical features
df.describe()
# include string and categorical features
df.describe(include=['int', 'float', 'object', 'category'])
# unique = number of unique categories
# top = dominant category
# freq = count of dominant category
```
### Visualize the Data
We can use univariate plots or plots such as histograms and boxplots on single variables. Then, we can use bivariate (or multi-variate) plots across different variables.
- Univariate Plots
- Multivariate Plots
- Correlation Matrix or Heatmap
- Pair Plots
```py
plt.figure(figsize=(7,6))
sns.boxplot(x="y", y="duration", data=df, showfliers=True)
sns.histplot(x='age', data=df, bins=20, kde=True)
```
```py
# correlation matrix shows the correlation
# between all the variables in the dataset.
corr = df.corr()
f, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(corr, annot=True, square=False, ax=ax, linewidth = 1)
plt.title('Pearson Correlation of Features')
```
Pair Plots is to a correlation matrix, but gives a scatterplot for each of the X and Y pairs.
```py
g.fig.set_size_inches(12,12)
g=sns.pairplot(df, diag_kind = 'auto', hue="y")
```
## Guide to EDA and Data Preparation
Here is a guide to the fundamental pre-processing techniques and how to address some common problems [15].
```py
# Load the dataset
df = pd.read_csv("file.csv")
# Check the data type of the dataset
type(df)
# Check the shape of the dataset
df.shape
# Display column names
df.columns
# Display the first few rows of dataset
df.head()
# Display random sample of 10 rows from dataset
df.sample(10)
# Display dataset information
df.info()
# Describe non-numeric data
df.describe(include=object)
# Describing numeric data
df.describe()
# Extract numeric columns using list comprehension
selected_columns = [list(df.columns.values)[i] for i in [0, 1, 4]]
# Age Distribution
plt.figure(figsize=(8, 6))
sns.histplot(df['Age'], kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()
# Salary Distribution
plt.figure(figsize=(8, 6))
sns.histplot(df['Salary'], kde=True)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Count')
plt.show()
# Gender Distribution (categorical)
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='Gender')
plt.title('Gender Distribution')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
# Compute correlation matrix
correlation_matrix = df[['Age', 'Salary', 'Exam_Score']].corr()
# Visualize the correlation matrix with heatmap
plt.figure(figsize=(8, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation between Quantitative Variables')
plt.show()
# Compute contingency table
contingency_table = pd.crosstab(df['Gender'], df['Education_Level'])
# Duplicate Values
duplicates = df.duplicated()
# Display the duplicate rows
df[duplicates]
# Check for negative values in Salary
df[df['Salary'] < 0]
# Replace negative values with NaN (missing values)
df['Salary'] = df['Salary'].apply(lambda x: x if x >= 0 else None)
# Using isna() method to check for missing values in each column
missing_values = df.isna().sum()
print(missing_values)
# Calculate the percentage of missing values in each column
missing_percentage = (df.isna().mean() * 100).round(2)
# Display the percentage of missing values
print(missing_percentage)
```
- For quantitative variables, we use correlation or Pearson correlation.
- For qualitative variables, we study association.
We can use methods such as ANOVA (Analysis of Variance) To study the relationship between a quantitative and a qualitative variable.
Association Between Qualitative Variables — Chi-Squared Test, Cramer’s V Test
## Essential Code Blocks
1. Shape (dimensions) of the DataFrame
2. Data types of the various columns
3. Display a few rows
We may observe that our dataset has a combination of categorical (object) and numeric (float and int) features.
What to look for:
- Can you understand the column names? Do they make sense? (Check the variable definitions if needed)
- Do the values in the columns make sense? Numeric features that should be categorical and vice versa.
- Are there significant missing values (NaN)?
- What types of classes do the categorical features have?
### Distribution
Distribution refers to how the values in a feature are distributed or how often they occur.
For numeric features, we see how many times groups of numbers appear in a particular column.
For categorical features, we view the classes for each column and their frequency.
We will use both graphs and actual summary statistics.
The graphs enable us to get an overall idea of the distributions while the statistics give us factual numbers.
Both graphs and statistics are recommended since they complement each other.
### Summary statistics of numerical features
```py
print(df.describe())
# count null values
df.isnull().sum()
# count of students whose physics marks are greater than 11
df[df['Physics'] > 11]['Name'].count())
# count students whose physics marks are greater than 10
# and math marks are greater than 9.
df[(df['Physics'] > 10 ) & (df['Math'] > 9)]
# Multi-column frequency count
count = df.groupby(col_name).count()
```
We can see for each numeric feature, the count of valueS, the mean value, std or standard deviation, minimum value, the 25th percentile, the 50th percentile or median, the 75th percentile, and the maximum value.
What to look for:
- Missing values: their count is not equal to the total number of rows of the dataset.
- Minimum or maximum values they do not make sense.
- Large range in values (min/max)
### Summary statistics of categorical features
```py
df.describe(include=['category'])
```
### Plot of numeric features
```py
df.hist(figsize=(14,14), xrot=45)
plt.show()
```
What to look for:
- Possible outliers that cannot be explained or might be measurement errors.
- Numeric features that should be categorical such as Gender represented by 1 and 0.
- Boundaries that do not make sense such as percentage values > 100.
### Plot of categorical features
```py
for column in df.select_dtypes(include='object'):
if df[column].nunique() < 10:
sns.countplot(y=column, data=df)
plt.show()
```
What to look for:
- Sparse classes which have the potential to affect a model’s performance.
- Mistakes in labeling of the classes, for example 2 exact classes with minor spelling differences.
### Grouping and segmentation
Segmentation allows us to cut the data and observe the relationship between categorical and numeric features.
#### Segment the target variable by categorical features
Compare the _target_ feature (Price) between the various classes of our main categorical features (Type, Method, and Regionname) and see how the target changes with the classes.
```py
# Plot boxplot of each categorical feature with Price.
for column in data.select_dtypes(include=’object’):
if data[column].nunique() < 10:
sns.boxplot(y=column, x=’Price’, data=data)
plt.show()
```
What to look for: Classes that most affect the target variable(s).
#### Group numeric features by each categorical feature
See how all the other numeric features (not just target feature) change with each categorical feature by summarizing the numeric features across the classes.
```py
# For the 3 categorical features with less than 10 classes,
# we group the data, then calculate the mean across the numeric features.
for column in df.select_dtypes(include='object'):
if df[column].nunique() < 10:
display(df.groupby(column).mean())
```
### Relationships between numeric features
#### Correlation matrix for numerical features
A _correlation_ is a value between -1 and 1 that amounts to how closely values of two separate features move simultaneously.
A _positive_ correlation means that as one feature increases the other one also increases while a _negative_ correlation means one feature increases as the other decreases.
Correlations close to 0 indicate a _weak_ relationship while closer to -1 or 1 signifies a _strong_ relationship.
```py
corrs = df.corr()
print(corrs)
```
This might not mean much now, so we can plot a **heatmap** to visualize the correlations.
#### Heatmap of the correlations
```py
# Plot the grid as a rectangular color-coded matrix using seaborn heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corrs, cmap='RdBu_r', annot=True)
plt.show()
```
What to look for:
- Strongly correlated features: either dark red (positive) or dark blue(negative).
- Target variable: If it has strong positive or negative relationship with other features.
----------
## How to Identify Outliers
Many machine learning algorithms are sensitive to the range and distribution of the input data [7].
Outliers in input data can affect the training process of ML algorithms abd result in less accurate models.
Keep in mind that some suspicious points are often wrongly reduced to outliers [8].
In a regression analysis, these terms are related besides they have different meanings:
- **Influential point:** a subset of the data having an influence on the estimated model. Removing it appears to modify significantly the results of the estimated coefficient.
- **Outlier:** an observation that falls outside the general pattern of the rest of the data. In a regression analysis, these points remain out of the confidence bands due to high residual.
- **High Leverage:** an observation that has the potential to influence predominantly the predicted response. It can be assessed through ∂y_hat/∂y: how much the predicted response differs if we shift slightly the observed response. Generally, they are extreme x-values.
### Extreme Value Analysis
Start with a simple extreme value analysis:
- Focus on univariate methods
- Visualize the data using scatterplots, histograms, and box and whisker plots to identify extreme values
- Assume a normal distribution (Gaussian) and look for values more than 2 or 3 standard deviations from the mean or 1.5 times from the first or third quartile
- Filter out outliers from training set and evaluate model performance
### Proximity Methods
Next, consider trying proximity-based methods:
- Use clustering methods to identify clusters in the data (such as k-Means and DBSCAN)
- Identify and mark the cluster centroids
- Identify data instances that are a fixed distance or percentage distance from the cluster centroids
- Filter out outliers from training set and evaluate model performance
### Projection Methods
Finally, projection methods can be used to identify outliers:
- Use projection methods to summarize your data to two dimensions (such as PCA, MDS, and t-SNE)
- Visualize the mapping and identify outliers
- Use proximity measures from projected values or vectors to identify outliers
- Filter out outliers from training set and evaluate model performance
### Methods Robust to Outliers
An alternative approach is to try models that are robust to outliers.
There are robust forms of regression that minimize the median least square errors rather than mean called robust regression but they are more computationally intensive.
There are also models such as decision trees that are robust to outliers.
Spot check some methods that are robust to outliers to see if there is a significant improvement in model performance metrics.
## Plotly Express
The `plotly.express` module contains functions that can create entire figures at once.
Plotly Express is a built-in part of the `plotly` library and is the recommended starting point for creating most common figures.
We can use `plotly.express` scatter or line charts to quickly run a linear regression between 1 variable and 1 target [16].
### Time series data without Trendline
Time series data typically comes with a timestamp column (eg Datetime) and the value you want to plot (listed below as ‘y’) [16].
```py
# Line plot
px.line(df, x='Datetime', y='y')
# Scatter plot
px.scatter(df, x='Datetime', y='y')
```
### Correlated variables with Trendline
Using the trendline keyword in `px.scatter` draws a linear regression trendline with your x and y variables using OLS (Ordinary Least Squares) regression algorithm which is basically just a standard linear regression [16].
Here is an example using the Wine Quality dataset from the UCI Machine Learning Repository (CC by 4.0 license) where we have plotted 2 features against each other to see the relationship between them [16].
The OLS trendline provides you with the y=mx+b linear equation as well as the R².
```py
# Scatter plot with trendline
px.scatter(df, x='fixed_acidity',y='density',trendline='ols')
```
### Statistical Distribution
#### Mean
With the mean value, you are trying to get a sense of what an average data point looks like.
#### Standard Deviation
Standard deviation is a measure of variation/dispersion of data points with respect to the mean.
Smaller STD indicates that the data are mostly centered around the mean whereas a higher STD value indicates the data points are rather dispersed.
#### Median (50%)
The 50th percentile (the 50% column) is also known as the median. Like mean, it’s another measure of central tendency.
Median is a preferred metric rather than mean if there are outliers or high variability in the data.
If the difference between mean and median is _small_, you can infer that the data is symmetrically distributed.
If the median is higher than the mean, data is likely _left-skewed_ in distribution.
#### Min and Max
Min and max values represent the lower and upper limit of a variable in the dataset.
### Anomalies
You can get a sense of outliers, anomalies, and other points of interest in the dataset using descriptive statistics.
#### Outliers
A large difference between the 75th percentile and the maximum value indicates the presence of potential outliers.
Likewise, a large difference between the minimum value and the 25th percentile indicates the presence of potential outliers.
To confirm outliers you can create a boxplot for visual inspection:
```py
sns.boxplot(y=df['total_bill']);
```
#### Red flags
Sometimes descriptive statistics can raise red flags.
Places with unexpected minimum values (0 or negative) or absolutely unacceptible maximum values (such as someone’s age 120 years!).
These are obvious indications that there are issues in the data and need further investigation.
## References
[1]: [Reading and interpreting summary statistics](https://towardsdatascience.com/reading-and-interpreting-summary-statistics-df34f4e69ba6)
[2]: [Python Cheat Sheet for Data Science](https://chipnetics.com/tutorials/python-cheat-sheet-for-data-science/)
[3]: [Data Cleaning for Visualization versus Machine Learning](https://medium.com/geekculture/data-cleaning-for-visualization-versus-machine-learning-e66126ebb65c)
[4]: [11 Essential Code Blocks for EDA Regression Task](https://towardsdatascience.com/11-simple-code-blocks-for-complete-exploratory-data-analysis-eda-67c2817f56cd)
[5]: [6 Pandas Functions for a Quick Exploratory Data Analysis](https://sonery.medium.com/6-pandas-functions-for-a-quick-exploratory-data-analysis-ff9ece0867d7)
[6]: [My Top 10 Pandas Functions for Preparing Data](https://betterprogramming.pub/my-top-10-pandas-functions-for-preparing-data-3ec7a1451a84)
[7]: [How to Identify Outliers in your Data](https://machinelearningmastery.com/how-to-identify-outliers-in-your-data/)
[8]: [Do Not Dismiss Unusual Data Points as Outliers](https://towardsdatascience.com/do-not-dismiss-unusual-data-points-as-outliers-5132380f2e67)
[9]: [A recipe to empirically answer any question quickly](https://towardsdatascience.com/a-recipe-to-empirically-answer-any-question-quickly-22e48c867dd5)
[10]: [Major Problems of Machine Learning Datasets: Part 2](https://heartbeat.comet.ml/major-problems-of-machine-learning-datasets-part-2-ba82e551fee2)
[11]: [Explore and Validate Datasets with TensorFlow Extended](https://pub.towardsai.net/explore-and-validate-datasets-with-tensorflow-extended-fc52cc5e582)
[12]: [Advanced EDA Made Simple Using Pandas Profiling](https://pub.towardsai.net/advanced-eda-made-simple-using-pandas-profiling-35f83027061a)
[13]: [Detecting Outliers Using Python](https://towardsdatascience.com/detecting-outliers-using-python-66b25fc66e67)
[14]: [Detecting Outliers with Simple and Advanced Techniques](https://towardsdatascience.com/detecting-outliers-with-simple-and-advanced-techniques-cb3b2db60d03)
[15]: [Practical Guide to Data Analysis and Preprocessing](https://towardsdatascience.com/practical-guide-to-data-analysis-and-preprocessing-080815548173)
[16]: [5 Python One-Liners to Kick Off Your Data Exploration](https://towardsdatascience.com/5-python-one-liners-to-kick-off-your-data-exploration-d6221f94291e)
[17]: [10 Little-Known Python Libraries That Will Make You Feel Like a Data Wizard](https://www.kdnuggets.com/10-little-known-python-libraries-data-wizard)
[18]: [Understanding Outliers in Machine Learning](https://blog.gopenai.com/understanding-outliers-in-machine-learning-732e43566763)
[Data Analytics: The Four Approaches to Analyzing Data and How To Use Them Effectively](https://www.kdnuggets.com/2023/04/data-analytics-four-approaches-analyzing-data-effectively.html)
[How to build a Machine Learning (ML) Predictive System](https://towardsdatascience.com/machine-learning-ml-based-predictive-system-to-predict-the-satisfaction-level-of-airlines-f0780dbdbc87?source=rss----7f60cf5620c9---4)
[Huber and Ridge Regressions in Python: Dealing with Outliers](https://towardsdatascience.com/huber-and-ridge-regressions-in-python-dealing-with-outliers-dc4fc0ac32e4?gi=c292a23ceab7)
[Use k-medians clustering method if you have many outliers](https://towardsdatascience.com/use-this-clustering-method-if-you-have-many-outliers-5c99b4cd380d?gi=5c97a562ff04)
[Sensitivity Analysis of Dataset Size vs Model Performance](https://machinelearningmastery.com/sensitivity-analysis-of-dataset-size-vs-model-performance/)
[Analysing Fairness in Machine Learning (with Python)](https://towardsdatascience.com/analysing-fairness-in-machine-learning-with-python-96a9ab0d0705)