Exploratory Data Analysis

Definition:

In statistics, Exploratory Data Analysis (EDA) is an approach to analysing data sets to summarise their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis testing task. (Source: Wikipedia)

It is part of Descriptive Analytics, where we try to discover patterns in data. Through techniques like correlation and distribution analysis, we can understand "what has happened so far".

Why we do EDA:

Generally we do EDA for one or more of the reasons below:

  • To understand our data better
  • To clean the data (data imputation, outlier treatment)
  • To tell stories
  • To prepare data for machine learning exercises
  • To identify relationships between variables

Process

It is a cyclic process. One has to repeat the following steps many times.

  1. Understand your data
  2. Classify variables in data
  3. Derive new variables using existing data
  4. Detect outliers and missing values
  5. Univariate analysis
  6. Bi-variate analysis
  7. Multi-variate analysis
  8. Validate your insights using statistical hypothesis tests
  9. Report your analysis through visuals

1. Understand your data

  • Granularity: 

    What each row in your data represents defines the granularity of your data. For example, in a grocery shop, the data will be stored at item level. For each customer purchase, there could be five or ten rows in your data, depending on the different types of vegetables they have bought.

[Image: a sample grocery bill, with one row per item purchased]

Instead, if you store only the total purchase amount of the customer, then the data is at customer purchase level instead of item level. The customer purchase level is less granular than the item level. As you reduce the detail, the granularity comes down.

Pros of highly granular data: More information is captured. The required information alone can be filtered out whenever needed.

Cons of highly granular data: Data volume increases, and for each analysis the data has to be aggregated.
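
As a small illustration of granularity and aggregation, the sketch below rolls hypothetical item-level purchase rows up to customer-purchase level; the data frame and its column names are made up for this example, not taken from the data set used later.

import pandas as pd

# Hypothetical item-level data: one row per item bought by a customer
items = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2],
    'item': ['tomato', 'onion', 'milk', 'rice', 'milk'],
    'amount': [30, 20, 45, 60, 45]
})

# Aggregating to customer purchase level reduces the granularity:
# one row per customer, and only the total amount survives
customer_level = items.groupby('customer_id', as_index=False)['amount'].sum()
print(customer_level)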

It is important to know your data's granularity; all grouping operations depend on it. For many data sets, it is easy to understand the granularity just by printing the first few rows. Below we read a cricket data set directly from a URL.

The data set contains the runs scored by each player in all international ODI matches till 2011. This is match-to-player level data.

import pandas as pd

odi = pd.read_csv('https://bit.ly/2EN4qrx')

odi.head()

[Output: first few rows of the odi data set]

  • Summary

Info: Summarise each column in your data to get a high-level understanding. Using the info() function in pandas, get to know the number of non-null values and the type of each column, and how many numerical and categorical variables you have.

In the following example, we can easily see that only the Runs and ScoreRate columns have missing values.

odi.info()

[Output: odi.info() column summary]
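
If you want the missing-value counts directly rather than reading them off the info() output, a minimal alternative is isnull().sum() (the exact counts depend on the version of the file you download):

# Count missing values per column; non-zero counts flag columns needing imputation or treatment
odi.isnull().sum()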

Describe: Use the describe() function to get some important statistics about each column.

For categorical columns (object columns) we get the following statistics (a quick cross-check sketch follows the list):

  • Count of non null values
  • Unique: Total number of distinct categories. Ex: In the Country column we have 22 unique countries in total
  • Top: Most frequent category. Ex: The most frequent player is Sachin R Tendulkar
  • Freq: Frequency of the top category. Ex: Sachin appears 442 times in the Player column (i.e. he has played a total of 442 ODI matches till 2011. Granularity…!!!)
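
These categorical summaries can be cross-checked column by column. A minimal sketch, assuming the Country and Player column names shown in the output further below:

# Distinct categories in the Country column (Unique)
odi['Country'].nunique()

# Most frequent value in the Player column and its frequency (Top and Freq)
odi['Player'].value_counts().head(1)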

For numerical columns, along with the count, we get the following statistics:

  • min (0th percentile): Minimum run scored is 0
  • 25th percentile
  • 50th percentile (median): Median run scored is 13
  • mean: Average runs scored across all players is 22. This is your first crude prediction if you are asked to predict a player’s score before he enters the ground. The mean is greater than the median, so the distribution may be right-skewed (checked in the sketch after the output below).
  • 75th percentile
  • max: Maximum runs scored is 200. (Yes. You are right. It was Sachin…!!!)

odi.describe(include='all')

[Output: odi.describe(include='all') summary statistics]
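
The right-skew hunch from comparing the mean and the median can be checked numerically. A minimal sketch, assuming the Runs column can be coerced to numeric:

# Coerce Runs to numeric in case it is stored as text; non-numeric entries become NaN
runs = pd.to_numeric(odi['Runs'], errors='coerce')

print(runs.mean(), runs.median())   # mean noticeably above the median hints at right skew
print(runs.skew())                  # a positive skewness value supports that reading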

2. Classify variables

Typically, tools like Python or R classify each column as either numerical or non-numerical. This classification alone is not enough. Below is a better classification of data variables.

  • Numerics / KPI / Quantitative columns
  • Categorical / Groups / Dimension / Qualitative
    • Names, places, departments, products etc
  • Dates
  • Geographical / Location columns
    • City, State, Pincodes, District etc
  • Text columns
    • Tweets, reviews, product description, complaints etc
  • Miscellaneous
    • Phone numbers, email IDs, primary IDs, URLs etc

This is just one way to classify data variables, not the only way. The inclusion or exclusion of many analyses depends on the variable types. For example, if you have text columns, you can perform text analytics. On the other hand, if you do not have any date columns, you cannot perform trend analysis.
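
As a starting point for this classification, the automatic numerical/non-numerical split from pandas can be listed and then refined by hand. A minimal sketch; the finer groupings (dates, geography, text, miscellaneous) are manual, and the date column name shown is hypothetical:

# Automatic split: numerical vs non-numerical columns
numeric_cols = odi.select_dtypes(include='number').columns.tolist()
other_cols = odi.select_dtypes(exclude='number').columns.tolist()
print(numeric_cols)
print(other_cols)

# Dates usually arrive as plain text and have to be converted explicitly, e.g.:
# odi['MatchDate'] = pd.to_datetime(odi['MatchDate'])   # hypothetical column name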