Importance of Data Visualization

Data visualization has become part of our day-to-day life. For example, when you book a cab on Uber or Ola, you get a summary report of your travel built around a map. Postpaid users get an itemized bill that analyzes their calling patterns and summarizes the proportions of local, STD and ISD calls. Fitness tracking apps now show dashboards, and for decades most of us have been used to charts of stock prices.

But recently data visualization has gained traction in almost all industries. Every organization has started building its own data science team, in which data visualization plays a vital role. In this article I highlight a few reasons why data visualization is important; there are, of course, many others.

1. Consuming big data

Data across the world is growing every microsecond. In fact, more data has been created in the past few years than in the entire history of humankind, and it is estimated that by 2020 about 1.7 MB of new information will be created every second for every human being. But how are we going to make use of this big data? We need to increase our consumption rate drastically, and visualization is one of the most important techniques for consuming big data.


2. Communicating faster and better

The atmosphere is highly competitive in any field you take, and startups with novel ideas disrupt established technologies and tools easily and quickly. Unlike in the past, organizations need to change their market strategies as quickly as possible; you might have noticed that FMCG companies and network service providers come up with new offers every day. Hence there is a need to understand the real picture of the market, financials, resources and so on as quickly as possible. Stakeholders might not have time to sit and interpret numbers in traditional tables. New visualizations are emerging that encode large amounts of information in a simpler form, and the development time for operational dashboards has dropped drastically. Stakeholders need not wait for reports from the business intelligence team; they can log in to a portal directly to understand the current state of affairs.


3. Telling compelling stories

From the picture below, almost everyone can immediately identify the story. We easily remember the interesting stories we learnt during childhood, but how many of us can recall the definition of “light year”? Stories stay in memory far longer than boring numbers. Instead of saying “I have 32 million dollars”, it is easier to remember “I have enough money to buy 100 cars”, and it also stimulates the reader to visualize 100 cars parked together.

[Figure: the Cinderella story]

4. Detecting patterns

It is a well-known fact that human beings are pattern-seeking animals; most of us, for example, have related cloud formations to human faces. We identify patterns far more easily from visuals than from plain numbers, and detecting patterns is a crucial step in telling compelling stories from data.

5. Identifying outliers

Treating outliers is one of the essential steps in building any machine learning model. Though it is always debatable whether one should treat outliers or not, plots like box plots, histograms and scatter plots are commonly used to detect outliers in data.
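
As a minimal sketch in R (using the built-in mtcars dataset purely as an illustration), a box plot immediately exposes points beyond the whiskers as candidate outliers:

# Box plot of horsepower from R's built-in mtcars dataset; points drawn
# beyond the whiskers are candidate outliers, and boxplot() returns them
outliers = boxplot(mtcars$hp, ylab='Horsepower')$out
print(outliers)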

6. Analyzing trends

Before settling on any approach, it is always useful to learn from the past, which helps us spot trends. Organizations often face a dilemma over whether to choose a commercial product like Tableau or an open-source library like D3 for data visualization. The chart below shows the interest level for both D3 and Tableau in India over the past five years; the blue line represents D3 and the red one Tableau. One can clearly observe that five years ago D3 was far more popular in India than Tableau, but interest in Tableau has been gradually increasing over the years, and currently both share almost the same average interest level. Spotting these kinds of trends is very easy using simple charts.

[Chart: Google Trends interest for D3 (blue) and Tableau (red) in India]

Source: Google Trends (https://trends.google.co.in/trends/explore?geo=IN&q=d3,tableau)

7. Deriving actionable insights

The ultimate goal of collecting, interpreting and consuming data is to take appropriate action. Modern visualization techniques help business users easily identify the areas they need to focus on and trace root causes. Moreover, with modern tools it is easy to add interactivity, whether in the form of data drill-downs, knocking off certain elements, or collaborating with team members.

Finally, “a picture is worth a thousand numbers”. As mentioned earlier, these are not the only reasons why we need data visualization. In the near future, data visualization will change from an option into a necessity.

Predicting mutual fund data

Recently I participated in Data Hackathon I (an internal hack fest at Gramener) and won the competition. I did basic predictive analytics and visualization on mutual fund data (which can be freely downloaded from http://www.amfiindia.com/nav-history-download). I randomly selected “Birla Sun Life Mutual Fund (Open Ended Scheme)” for the past two years.

PS: I assume that the reader has basic knowledge of time series modelling.

Data processing using pandas:

There were various sub-categories within the selected mutual fund data, of which I filtered only “Birla Sun Life Buy India Fund-Plan A(Divivdend)”, since it had enough data points for two years. Data processing was carried out using Python’s pandas library.

import pandas as pd

# Load the raw NAV history (semicolon separated) and keep one scheme
data = pd.read_csv('data.txt', sep=';')
data = data[data['Scheme Name'] == 'Birla Sun Life Buy India Fund-Plan A(Divivdend)']

# Index by date, resample to one value per day and forward-fill gaps
# (resample(how='mean') is the old pandas API; .mean() is the current form)
data['Date'] = pd.to_datetime(data['Date'])
data.index = data['Date']
data = data.resample('1D').mean(numeric_only=True).ffill()
data.to_csv('data_processed.csv')

Data analysis using R:

Time series plot: The processed data is loaded into R using the read.csv function; the Sale Price column is what we use for prediction. The first step in any time series analysis is to plot the series. Just by looking at the plot, one can try to answer the following questions:

  • Is the time series stationary? If not, is there a trend?
  • Is there any seasonal component in the time series?
  • Are there any outliers?

By looking at our mutual fund data (shown below in Figure 1), let’s try answering the above questions:

  • Strictly speaking, the time series is not stationary, but we will assume that it is approximately stationary.
  • There seems to be no seasonal component and no outliers in the time series.

# Load the processed data and keep only the Sale Price column
data = read.csv('path/to/the/csvfile/data_processed.csv')$Sale.Price
plot(data)  # Figure 1: the raw time series
acf(data)   # Figure 2: autocorrelation at various lags
Figure 1: Time series

Scope for prediction: In order to determine whether there is any scope for prediction (using linear models), one can look at the autocorrelation (ACF) values at various lags of the time series (shown in Figure 2).

Figure 2: ACF plot

This definitely does not look like the ACF plot of a white noise sequence. Almost all lag values are significant, and the ACF appears to decay exponentially, which is the signature of autoregressive models. So I decided to fit an AR model to the time series.
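
For reference, a stationary AR(1) process x[t] = a*x[t-1] + e[t] has theoretical autocorrelation a^k at lag k, so an ACF that decays slowly but steadily, as here, is consistent with an autoregressive model whose coefficient is close to 1.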

Fitting an AR model: The first step in fitting an AR model is to identify its order, for which one can look at the partial autocorrelation (PACF) values, shown in Figure 3. The PACF plot shows only one significant value (at lag 1); all the rest lie within the confidence interval lines. Hence I decided to go with AR(1), an autoregressive model of order one.
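
The PACF plot itself comes from R’s built-in pacf function (the call is not shown in the snippets above, so here it is for completeness):

# Partial autocorrelation; the AR order is suggested by the
# last lag with a significant PACF value
pacf(data)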

Figure 3: PACF plot of the time series

In R, ar.yw is a function that estimates the coefficients of an AR model using the Yule-Walker equations. One can cap the order explicitly using the order.max argument, but I left the choice to the function in order to cross-check my assumption of an AR(1) model.

# Fit an AR model on the first 250 samples; ar.yw picks the order by AIC
ar_model = ar.yw(data[1:250])
ar_model
Call:
ar.yw.default(x = data[1:250])

Coefficients:
 1
0.9763

Order selected 1 sigma^2 estimated as 0.1309

The ar.yw() function has fitted an AR(1) model (hence the assumption is verified) with a coefficient of 0.9763. Please note that we have used only the first 250 samples for fitting the model; the remaining samples can be used for testing it.
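
In other words, writing m for the series mean, the fitted model is x[t] - m = 0.9763 * (x[t-1] - m) + e[t], where e[t] is white noise with variance 0.1309 (R’s ar functions fit to the demeaned series by default).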

Testing the model: R has a very useful and powerful function called predict, which forecasts future values from a fitted model. All one needs to do is pass the model and specify the number of data points to predict.

# Forecast the next 200 points and overlay them on the original series
predicted_values = predict(ar_model, n.ahead=200)
plot(data)
lines(predicted_values$pred, col='red')
Figure 4: Original (black circles) vs predicted (red line) values

Results:
The original time series and the predicted values are plotted in Figure 4. The model captures the decaying trend of the time series, but it fails to capture the later increasing trend. This is probably down to the non-stationarity of the time series.
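
One standard way to handle such non-stationarity (a possible next step, not part of the analysis above) is to difference the series once to remove the trend and model the differences instead, for example:

# Difference the series once to remove the trend, then inspect and refit;
# equivalently one could fit arima(data[1:250], order=c(1,1,0))
diff_data = diff(data[1:250])
plot(diff_data)  # should look much closer to stationary
acf(diff_data)   # check how much linear structure remains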

PS: I am still working on the model and will keep updating this post. Please leave a comment if there is something wrong in what I am doing.