Recently, data visualization has gained traction across almost all industries. Every organization has started building its own data science team, in which data visualization plays a vital role. In this article I highlight a few reasons why data visualization is important, though there are certainly others.
Data across the world is growing every microsecond. More data has been created in the past few years than in the entire previous history of humankind, and it is estimated that by 2020 about 1.7 MB of new information will be created every second for every human being. But how are we going to make use of this big data? We need to increase our consumption rate drastically, and visualization is one of the key techniques for consuming big data.
The atmosphere in any field you take is highly competitive. Startups with novel ideas disrupt existing technologies and tools easily and quickly. Unlike in the past, organizations need to keep changing their market strategies as quickly as possible; you might have noticed that FMCG companies and network service providers come up with new offers every day. Hence there is a need to understand the real picture of the market, financials, resources, etc., as quickly as possible. Stakeholders might not have time to sit and interpret numbers in traditional tables. New visualizations are emerging that encode large amounts of information in simpler forms, and the development time for creating operational dashboards has drastically reduced. Stakeholders need not wait for reports from the business intelligence team; they can log in to a portal directly to understand the current state of affairs.
From the picture below, almost everyone can easily identify the story. We easily remember the interesting stories we learnt in childhood, but how many of us can recall the definition of “light year”? Stories are easier to remember over a long time than dry numbers. Instead of saying “I have 32 million dollars”, it is easier to remember “I have enough money to buy 100 cars”. It also stimulates the reader to visualize 100 cars parked together.
It is a well-known fact that human beings are pattern-seeking animals. For example, most of us have seen human faces in cloud formations. We identify patterns more easily from visuals than from plain numbers. Detecting patterns is a crucial step in telling compelling stories from data.
Treating outliers is one of the essential steps in building any machine learning model. Though it is always debatable whether one should treat outliers or not, plots such as box plots, histograms and scatter plots are commonly used to detect outliers in data.
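The rule a box plot uses to flag outliers can also be applied directly in code. The sketch below is a minimal, illustrative Python version of the IQR rule (the data values and the function name are made up for illustration, not taken from the article's dataset):

```python
# Sketch: flagging outliers with the 1.5*IQR rule that underlies a box plot.
# The data values here are hypothetical, chosen only for illustration.

def iqr_outliers(values):
    """Return the values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # Linear interpolation between the closest ranks.
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        frac = pos - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < lower or x > upper]

data = [10, 12, 11, 13, 12, 11, 95, 10, 13, 12]  # 95 is an obvious outlier
print(iqr_outliers(data))  # → [95]
```

Whether the flagged points should then be dropped, capped or kept is exactly the debatable part mentioned above.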
Before taking any approach, it is always useful to learn from the past, which helps us spot trends. Organizations often face a dilemma over whether to choose a commercial product like Tableau or an open-source library like D3 for data visualization. The chart below shows the interest level for both D3 and Tableau in India over the past five years; the blue line represents D3 and the red one Tableau. One can clearly observe that five years ago D3 was far more popular in India than Tableau, but interest in Tableau has gradually increased over the years, and currently both share almost the same average interest level. Spotting these kinds of trends is very easy with simple charts.
Source: Google Trends (https://trends.google.co.in/trends/explore?geo=IN&q=d3,tableau)
The ultimate goal of collecting, interpreting and consuming data is essentially to take proper action. Modern visualization techniques help business users easily identify the areas they need to focus on and find root causes. Moreover, with modern tools it is easy to add interactivity, whether in the form of data drill-down, knocking out certain elements, or collaborating with team members.
Finally, “a picture is worth a thousand numbers”. As mentioned earlier, these are not the only reasons why we need data visualization. In the near future, data visualization will change from an option to a necessity.
PS: I assume that the reader has basic knowledge of time series modelling.
Data processing using pandas:
There were various sub-categories within the selected mutual fund data, of which I filtered only “Birla Sun Life Buy India Fund-Plan A(Divivdend)”, since it had enough data points for two years. Data processing was carried out using Python’s pandas library.
import pandas as pd

data = pd.read_csv('data.txt', sep=';')
data = data[data['Scheme Name'] == 'Birla Sun Life Buy India Fund-Plan A(Divivdend)']
data['Date'] = pd.to_datetime(data['Date'])
data.index = data['Date']
# Resample to a daily frequency, averaging duplicates and forward-filling gaps
data = data.resample('1D').mean().ffill()
data.to_csv('data_processed.csv')
Data analysis using R:
Time series plot: The processed data is loaded into R using the read.csv function; the Sale Price column is what is used for prediction. The first step in any time series analysis is to plot the series. Just by looking at the visual, one can try to answer questions such as: Is there a trend? Is there seasonality? Are there outliers or sudden level shifts? Does the variance stay constant over time?
Looking at our mutual fund data (shown below in Figure 1), let’s try answering the above questions:
data = read.csv('path/to/the/csvfile/data_processed.csv')$Sale.Price
plot(data)
acf(data)
Scope for prediction: In order to determine whether there is any scope for prediction (using linear models), one can look at the autocorrelation (ACF) values at various lags for the time series (shown in Figure 2).
This definitely does not look like the ACF plot of a white noise sequence. Almost all lag values are significant, and the ACF seems to decay exponentially, which is characteristic of autoregressive models. So I decided to fit an AR model to the time series.
Fitting an AR model: The first step in fitting an AR model is to identify its order, for which one can look at the partial autocorrelation (PACF) values, shown in Figure 3. The PACF plot shows only one significant value (at lag 1); the rest lie within the confidence interval lines. Hence I decided to go with AR(1), an autoregressive model of order one.
In R, ar.yw is a function that calculates the coefficients of an AR model. One can specify the maximum order explicitly using order.max, but I leave the choice to the function in order to cross-check my assumption of an AR(1) model.
ar_model = ar.yw(data[1:250])
ar_model

Call:
ar.yw.default(x = data[1:250])

Coefficients:
     1
0.9763

Order selected 1  sigma^2 estimated as 0.1309
The function ar.yw() has fitted an AR(1) model (so the assumption is verified) with coefficient 0.9763. Note that only the first 250 samples were used for fitting; the rest can be used for testing the model.
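For an AR(1) model, the Yule-Walker estimate that ar.yw computes reduces to the lag-1 sample autocorrelation: the lag-1 autocovariance divided by the variance. A minimal Python sketch of that calculation (the toy series here is hypothetical, not the mutual fund data):

```python
def ar1_yule_walker(x):
    """Yule-Walker coefficient estimate for an AR(1) model:
    phi_hat = lag-1 autocovariance / variance (both about the sample mean)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    cov1 = sum((x[t] - mean) * (x[t + 1] - mean) for t in range(n - 1)) / n
    return cov1 / var

# A toy series where each value stays close to the previous one.
x = [1.0, 1.1, 1.2, 1.15, 1.3, 1.25, 1.4, 1.35, 1.5, 1.45]
phi_hat = ar1_yule_walker(x)
print(round(phi_hat, 3))  # positive, reflecting the series' persistence
```

A slowly wandering series like the mutual fund prices pushes this estimate toward 1, which is consistent with the 0.9763 reported above.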
Testing the model: R has a very useful and powerful function called predict for forecasting future values from a fitted model. All one needs to do is pass the model and specify the number of data points to predict.
predicted_values = predict(ar_model, n.ahead=200)
plot(data)
lines(predicted_values$pred)
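Under the hood, an AR(1) forecast just iterates the recursion x[t+1] = m + phi * (x[t] - m), so the predictions decay geometrically toward the series mean. The Python sketch below reproduces that behaviour using the fitted coefficient 0.9763; the mean and last observed value are illustrative placeholders, not values from the actual data:

```python
def ar1_forecast(last_value, phi, mean, n_ahead):
    """Iterate the AR(1) recursion: each forecast shrinks the previous
    one toward the series mean by a factor of phi."""
    forecasts = []
    current = last_value
    for _ in range(n_ahead):
        current = mean + phi * (current - mean)
        forecasts.append(current)
    return forecasts

# phi comes from the fitted model; mean and last_value are hypothetical.
preds = ar1_forecast(last_value=12.0, phi=0.9763, mean=10.0, n_ahead=200)
print(preds[0], preds[-1])  # forecasts decay from near 12 toward the mean 10
```

This geometric decay toward the mean is also why a pure AR(1) forecast cannot produce the upswings discussed in the results below.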
Results:
The original time series and the predicted values are plotted in Figure 4. The model captures the decaying trend of the time series but not the increasing trend, which might have something to do with the non-stationarity of the series.
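One common remedy for non-stationarity, sketched below in Python as an illustration (in R this would correspond to applying diff() before fitting), is to difference the series until the trend is removed and fit the AR model to the differences:

```python
def difference(x, d=1):
    """Apply first-order differencing d times: y[t] = x[t] - x[t-1]."""
    for _ in range(d):
        x = [x[t] - x[t - 1] for t in range(1, len(x))]
    return x

# A series with a linear trend is non-stationary; one round of
# differencing removes the trend and leaves a constant sequence.
trended = [2 * t + 5 for t in range(10)]      # 5, 7, 9, ...
print(difference(trended))  # → [2, 2, 2, 2, 2, 2, 2, 2, 2]
```

Forecasts made on the differenced series are then cumulatively summed back to the original scale, which is essentially what an ARIMA model with d > 0 does.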
PS: I am still working on the model and will update this post as I go. Please leave a comment if there is something wrong in what I am doing.