I recently participated in Data Hackathon I (an internal hack fest at Gramener) and won the competition. I did basic predictive analytics and visualization on mutual fund data (which can be freely downloaded from http://www.amfiindia.com/nav-history-download). I picked “Birla Sun Life Mutual Fund (Open Ended Scheme)” at random and used its data for the past two years.
PS: I assume that the reader has a basic knowledge of time series modelling.
Data processing using pandas:
There were various sub-categories within the selected mutual fund data, of which I kept only “Birla Sun Life Buy India Fund-Plan A(Divivdend)”, since it had enough data points for the two years. Data processing was carried out using Python’s pandas library.
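import pandas as pd

data = pd.read_csv('data.txt', sep=';')
# Keep only the selected scheme
data = data[data['Scheme Name'] == 'Birla Sun Life Buy India Fund-Plan A(Divivdend)']
data['Date'] = pd.to_datetime(data['Date'])
data = data.set_index('Date')
# One row per calendar day: average the numeric columns, then forward-fill the missing days
# (the older resample('1D', how='mean') form is not accepted by recent pandas versions)
data = data.resample('1D').mean(numeric_only=True).ffill()
data.to_csv('data_processed.csv')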
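To make the resample-and-forward-fill step concrete, here is a minimal sketch in plain Python (no pandas) of what it does to a series with gaps. The function name and the NAV values are made up for illustration, and the sketch assumes at most one value per date, so the daily mean is just that value.

```python
from datetime import date, timedelta


def resample_daily_ffill(observations):
    """Return one (day, value) pair per calendar day, carrying the last value forward.

    `observations` is a list of (date, value) pairs sorted by date,
    with at most one value per date (an assumption of this sketch).
    """
    if not observations:
        return []
    known = dict(observations)
    day, last_day = observations[0][0], observations[-1][0]
    filled, last_value = [], None
    while day <= last_day:
        # Use the observed value if there is one, otherwise repeat the previous one.
        last_value = known.get(day, last_value)
        filled.append((day, last_value))
        day += timedelta(days=1)
    return filled


# Hypothetical NAV values with a weekend gap (5th and 6th January 2013 are missing).
navs = [(date(2013, 1, 3), 10.0), (date(2013, 1, 4), 10.2), (date(2013, 1, 7), 10.1)]
daily = resample_daily_ffill(navs)
for day, value in daily:
    print(day, value)
```

The two missing days simply repeat the last published value, which is how the gaps end up filled in data_processed.csv.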
Data analysis using R:
Time series plot: The processed data is loaded in R using the read.csv function. The sale price column is what is used for prediction. The first step in any time series analysis is to plot the series. Just by looking at the plot, one can try to answer the following questions:
- Is the time series stationary? If not, is there a trend?
- Is there any seasonal component in the time series?
- Are there any outliers?
Looking at our mutual fund data (shown below in Figure 1), let’s try answering the above questions:
- The time series is not strictly stationary, but as a first approximation we can assume that it is.
- There seem to be no seasonal component and no outliers in the time series.
data = read.csv('path/to/the/csvfile/data_processed.csv')$Sale.Price
plot(data)
acf(data)
Scope for prediction: In order to determine whether there is any scope for prediction (using linear models), one can look at the autocorrelation (ACF) values of the time series at various lags (shown in Figure 2).
This definitely doesn’t look like the ACF plot of a white noise sequence. Almost all lag values are significant, and they seem to decay exponentially, which is characteristic of autoregressive (AR) models. So I decided to fit an AR model to the time series.
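The check behind an ACF plot can be sketched in a few lines of plain Python. This is only an illustration on a simulated series, not the mutual fund data: the series, its coefficient of 0.9 and the function names are assumptions made for the example. The band of ±1.96/sqrt(N) is the usual approximate 95% bound under the white noise hypothesis.

```python
import math
import random


def sample_acf(x, max_lag):
    """Sample autocorrelations r(0..max_lag), using the biased (divide-by-n) estimator."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev) / n
    acf = []
    for k in range(max_lag + 1):
        ck = sum(dev[t] * dev[t + k] for t in range(n - k)) / n
        acf.append(ck / c0)
    return acf


def white_noise_bound(n):
    """Approximate 95% bound for the sample ACF of white noise of length n."""
    return 1.96 / math.sqrt(n)


# Simulated AR(1) series (hypothetical coefficient 0.9), standing in for real data.
random.seed(0)
series = [0.0]
for _ in range(499):
    series.append(0.9 * series[-1] + random.gauss(0, 1))

acf_values = sample_acf(series, 5)
bound = white_noise_bound(len(series))
for lag, r in enumerate(acf_values):
    print(lag, round(r, 3), "significant" if abs(r) > bound else "within bound")
```

For white noise, nearly all lags beyond zero would fall inside the bound. A slowly decaying run of significant lags is what suggests an autoregressive structure.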
Fitting AR Model: The first step in fitting a model is to identify its order. For this, one can look at the partial autocorrelation (PACF) values, shown in Figure 3. The PACF plot shows only one significant value (at lag 1); all the rest lie within the confidence bounds. Hence I decided to go with AR(1), an autoregressive model of order one.
In R, ar.yw is a function that estimates the coefficients of an AR model using the Yule-Walker equations. One can cap the order explicitly using the order.max argument, but I leave the choice to the function (by default it picks the order using AIC), just to cross-check my assumption of an AR(1) model.
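To see why a single significant PACF value points to order one, here is a small sketch of the Durbin-Levinson recursion, which turns autocorrelations into partial autocorrelations. The function name and the input are illustrative assumptions: the input is the theoretical ACF of an AR(1) process with coefficient 0.5, not values from the mutual fund series.

```python
def pacf_from_acf(rho):
    """Partial autocorrelations for lags 1..len(rho)-1 via the Durbin-Levinson recursion.

    `rho` is a list of autocorrelations with rho[0] == 1.
    """
    pacf = []
    phi_prev = []  # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1} from the previous step
    for k in range(1, len(rho)):
        if k == 1:
            phi_kk = rho[1]
            phi_curr = [phi_kk]
        else:
            num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = num / den
            # Update the lower-order coefficients, then append the new last one.
            phi_curr = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
            phi_curr.append(phi_kk)
        pacf.append(phi_kk)
        phi_prev = phi_curr
    return pacf


# Theoretical ACF of an AR(1) process with coefficient 0.5: rho(k) = 0.5 ** k.
rho = [0.5 ** k for k in range(5)]
pacf_values = pacf_from_acf(rho)
print(pacf_values)
```

For an AR(1) process the partial autocorrelation equals the coefficient at lag 1 and vanishes at every higher lag, which is the pattern the PACF plot is being checked for.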
ar_model = ar.yw(data[1:250])
ar_model

Call:
ar.yw.default(x = data[1:250])

Coefficients:
     1
0.9763

Order selected 1  sigma^2 estimated as  0.1309
The function ar.yw() has fitted an AR(1) model (so the assumption is verified) with a coefficient of 0.9763. Please note that we have used only the first 250 samples for fitting the model. The rest of the samples can be used for testing it.
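For the order-one case, the Yule-Walker estimate is simple enough to write out by hand: the coefficient is the lag-1 sample autocorrelation, and the innovation variance follows from it. The sketch below is a plain-Python illustration on a simulated series with a hypothetical coefficient of 0.9. The function name is made up, and R may apply a small-sample adjustment to the variance, so the numbers need not match ar.yw exactly.

```python
import random


def fit_ar1_yule_walker(x):
    """Yule-Walker fit of x[t] - mean = phi * (x[t-1] - mean) + noise.

    Returns (mean, phi, sigma2) using divide-by-n sample autocovariances.
    """
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev) / n
    c1 = sum(dev[t] * dev[t + 1] for t in range(n - 1)) / n
    phi = c1 / c0                   # lag-1 autocorrelation
    sigma2 = c0 * (1.0 - phi * phi)  # innovation variance implied by the AR(1) model
    return mean, phi, sigma2


# Simulated AR(1) series (hypothetical coefficient 0.9), standing in for real data.
random.seed(1)
series = [0.0]
for _ in range(999):
    series.append(0.9 * series[-1] + random.gauss(0, 1))

mean, phi, sigma2 = fit_ar1_yule_walker(series)
print(round(phi, 3), round(sigma2, 3))
```

Because the estimate is a sample autocorrelation, it always lies strictly between -1 and 1, so the fitted model is stationary even when the data is not.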
Testing the model: R has a very useful and powerful function called “predict” to predict future values from a given model. All one needs to do is pass the model and specify the number of data points to predict.
predicted_values = predict(ar_model, n.ahead=200)
plot(data)
lines(predicted_values$pred)
The original time series and the predicted values are plotted in Figure 4. The model is able to capture the decaying trend of the time series, but it is not able to capture the increasing trend. This might have something to do with the non-stationarity of the time series.
PS: I am still working on the model and will update this post as I go. Please leave a comment if there is something wrong in what I am doing.
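One possible reading of this behaviour: the multi-step forecast of an AR(1) model with a coefficient between 0 and 1 moves steadily from the last observed value towards the mean of the fitted series, so it can follow a decay but cannot produce a rise away from that mean. The sketch below illustrates this in plain Python. The coefficient 0.9763 is the one fitted above, while the mean, the last value and the function name are hypothetical.

```python
def ar1_forecast(mean, phi, last_value, n_ahead):
    """Forecasts for 1..n_ahead steps: mean + phi**h * (last_value - mean)."""
    forecasts = []
    deviation = last_value - mean
    for _ in range(n_ahead):
        deviation *= phi  # each step shrinks the distance to the mean by a factor phi
        forecasts.append(mean + deviation)
    return forecasts


phi = 0.9763        # coefficient reported by ar.yw above
mean = 20.0         # hypothetical mean of the fitted samples
last_value = 22.0   # hypothetical last observed value

preds = ar1_forecast(mean, phi, last_value, 200)
print(round(preds[0], 4), round(preds[-1], 4))
```

Starting above the mean, every forecast is a little lower than the previous one. Starting below the mean, the forecasts would rise, but only as far as the mean.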