Definition:
In statistics, Exploratory Data Analysis (EDA) is an approach to analysing data sets to summarise their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis testing task (Source: Wikipedia)
It is part of Descriptive Analytics, where we try to discover patterns in the data. Through various techniques like correlation and distribution analysis, we can understand "what happened so far".
Why we do EDA:
Generally, we do EDA for any one or all of the reasons below:
 To understand our data better
 To clean the data (data imputation, outlier treatment)
 To tell stories
 Prepare data for machine learning exercise
 Identify relationship between variables
Process
It is a cyclic process; one has to repeat the following steps many times:
 Understand your data
 Classify variables in data
 Derive new variables using existing data
 Detect outliers and missing values
 Univariate analysis
 Bivariate analysis
 Multivariate analysis
 Validate your insights using statistical hypothesis test
 Report your analysis through visuals
1. Understand your data

Granularity:
What each row in your data represents defines the granularity of your data. For example, in a grocery shop, the data will be stored at the item level. For each purchase by a customer, there could be five or ten rows in your data, depending on the different types of vegetables they have bought.
If instead you store only the total purchase amount of the customer, then the data is at the customer purchase level instead of the item level. Customer total purchase level is less granular compared to item level. As you reduce the detail, the granularity comes down.
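The move from item level to customer purchase level can be sketched with a groupby aggregation. The grocery data below is hypothetical, just to illustrate the two granularities:

```python
import pandas as pd

# Hypothetical item-level data: one row per item in each customer's purchase
items = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "item": ["tomato", "onion", "carrot", "tomato", "potato"],
    "amount": [20, 15, 10, 25, 30],
})

# Reducing granularity: aggregate from item level down to customer purchase level
purchases = items.groupby("customer_id", as_index=False)["amount"].sum()
print(purchases)
```

Once aggregated, the per-item detail is gone for good, which is exactly the trade-off described above.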
Pros of highly granular data: More information is captured. The required information alone can be filtered out whenever needed.
Cons of highly granular data: Data volume increases. For each analysis, the data has to be aggregated.
It is important to know your data's granularity. All grouping operations depend on it. For many data sets, it is easy to understand the granularity just by printing the first few rows. Below we read a cricket data set directly from a URL.
The data set contains the runs scored by each player in all international ODI matches till 2011. This is match-to-player level data.
import pandas as pd
odi = pd.read_csv('https://bit.ly/2EN4qrx')
odi.head()

Summary
Info: Summarise each column in your data to get a high-level understanding. For each column, the info() function in pandas tells you the number of non-null values and the column type, so you know how many numerical and categorical variables you have.
In the following example, we can easily identify that only the Runs and ScoreRate columns have missing values.
odi.info()
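Alongside info(), the missing counts can be pulled out directly with isnull().sum(). The small frame below mimics the relevant odi columns with made-up values, just to show the idea:

```python
import pandas as pd
import numpy as np

# Hypothetical frame mimicking the odi columns, with some missing values
df = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Runs": [10, np.nan, 30],
    "ScoreRate": [50.0, np.nan, 75.0],
})

df.info()                 # non-null counts and dtypes per column
print(df.isnull().sum())  # missing-value count per column
```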
Describe: Use the describe() function to get some important statistics about each column.
For categorical columns (object columns) we get the following statistics:
 Count: Number of non-null values
 Unique: Total number of distinct categories. Ex: the Country column has a total of 22 unique countries
 Top: Most frequently occurring category. Ex: the most repeating player is Sachin R Tendulkar
 Freq: Frequency of the top category. Ex: Sachin appears 442 times in the Player column (i.e. he has played a total of 442 ODI matches till 2011. Granularity…!!!)
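Each of these four statistics can also be computed directly on a column. The toy Player series below is hypothetical; only the idea of count/unique/top/freq is from the text:

```python
import pandas as pd

# Hypothetical Player column
players = pd.Series(["Sachin", "Sachin", "Dravid", "Sachin", "Ganguly"],
                    name="Player")

print(players.count())                 # count: non-null values
print(players.nunique())               # unique: distinct categories
print(players.mode()[0])               # top: most frequent category
print(players.value_counts().iloc[0])  # freq: occurrences of the top category
```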
For numerical columns, along with the count, we get the following statistics:
 min (0th percentile): Minimum run scored is 0
 25th percentile
 50th percentile (median): Median run scored is 13
 mean: The average runs scored by all players is 22. This is your first crude prediction if asked to predict a player's score before he enters the ground. The mean is greater than the median, which suggests a right-skewed distribution
 75th percentile
 max: Maximum runs scored is 200. (Yes. You are right. It was Sachin…!!!)
odi.describe(include='all')
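The same numeric statistics can be computed one by one, which also makes the mean-versus-median skewness check explicit. The run totals below are hypothetical:

```python
import pandas as pd

# Hypothetical run totals for a handful of innings
runs = pd.Series([0, 5, 13, 22, 50, 200])

print(runs.min())             # 0th percentile
print(runs.quantile(0.25))    # 25th percentile
print(runs.median())          # 50th percentile
print(runs.mean())            # mean; here it exceeds the median
print(runs.quantile(0.75))    # 75th percentile
print(runs.max())             # maximum
```

When a few large scores pull the mean above the median, as here, that is the classic signature of a right-skewed distribution.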
2. Classify variables
Typically, tools like Python or R classify each column as either numerical or non-numerical. This classification alone is not enough. Below is a better classification of data variables:
 Numerics / KPI / Quantitative columns
 Categorical / Groups / Dimension / Qualitative
 Names, places, departments, products etc
 Dates
 Geographical / Location columns
 City, State, Pincodes, District etc
 Text columns
 Tweets, reviews, product description, complaints etc
 Miscellaneous
 Phone numbers, email ids, Primary ids, URLs etc
This is just one way to classify data variables, not the only way. Whether many analyses apply at all depends on the data types. For example, if you have text columns, we can perform text analytics. On the other hand, if we do not have any date columns, we cannot perform trend analysis.
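A rough first pass at this classification can be automated from the pandas dtypes; the finer distinctions (geographical, text, miscellaneous columns) still need human judgement. The frame below is a hypothetical illustration:

```python
import pandas as pd

# Hypothetical frame with one column of each basic kind
df = pd.DataFrame({
    "Runs": [10, 20],                  # numeric / KPI
    "Country": ["India", "Australia"], # categorical / dimension
    "MatchDate": pd.to_datetime(["2011-01-01", "2011-02-01"]),  # date
})

# Split columns by dtype as a starting point for the classification above
numeric_cols = df.select_dtypes(include="number").columns.tolist()
date_cols = df.select_dtypes(include="datetime").columns.tolist()
categorical_cols = df.select_dtypes(include="object").columns.tolist()

print(numeric_cols, categorical_cols, date_cols)
```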