The objective of this assignment is to analyse a dataset concerning bike rentals. The dataset is based on real data from Capital Bikeshare, a company that maintains a bike rental network in Washington DC. The dataset has one row for each hour of each day in 2011 and 2012, for a total of 17,379 rows. It contains features of the day (workday, holiday) as well as weather parameters such as temperature and humidity. The range of hourly bike rentals is from 1 to 977. The bike usage is stored in the field ‘cnt’. Our task is to develop a prediction model for the number of bike rentals so that Capital Bikeshare can predict bike usage in advance.
Data Summary, Pre-processing and Visualisation
The data set contains 17,379 rows and 17 columns (features). There are 12 integer features, 4 numeric features and one POSIXct feature, which represents a date-time value. The summary module, used to produce descriptive statistics of the data set, showed that there are no NAs or missing values.
Features such as season, year, month, hour, holiday, weekday, working day and weathersit are marked as numeric, but they can be changed to categorical since they have a finite number of values.
The above plots show that the most rentals occur in weathersit 1 (clear days), while the fewest occur in weathersit 4 (heavy rain days). More bicycles are also rented during fall, and the fewest in spring. This seems odd, as one would assume that more bicycles would be rented in summer.
Wind speed also seems to be correlated with the number of bike rentals: the lower the wind speed, the more bikes are rented, and the number gradually decreases as wind speed increases.
Using the Permutation Feature Importance module, scores were produced for each feature based on its importance in a trained model evaluated on a test set. This module calculates the scores by randomly shuffling the values of each feature in order to measure how much they affect the predictions, essentially how sensitive the predictions are to that feature. Registered and casual had very high scores compared to the other variables, so it was decided that they would be the only features used to train the models. This decision was also supported by the Filter Based Feature Selection module, which identifies the features in a data set with high predictive power based on the correlation between each variable and the target value. Using Spearman correlation with this module highlighted that registered and casual had an extremely high correlation with the target compared to the other features.
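The shuffling idea behind permutation importance can be illustrated with a minimal Python sketch. This is not the Azure module's implementation: the data is synthetic, the function names are hypothetical, and the "trained model" is assumed to be a stand-in that simply adds casual and registered.

```python
import random

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def permutation_importance(predict, rows, target, feature, rng):
    """Increase in MAE after randomly shuffling one feature column."""
    baseline = mae(target, [predict(r) for r in rows])
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    permuted_rows = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
    return mae(target, [predict(r) for r in permuted_rows]) - baseline

# Synthetic rows for illustration only (not the real bike rental data).
rows = [{"casual": i % 13, "registered": (i * 7) % 101, "hum": (i % 10) / 10}
        for i in range(200)]
target = [r["casual"] + r["registered"] for r in rows]

def predict(row):
    # Assumed stand-in for a trained model.
    return row["casual"] + row["registered"]

rng = random.Random(0)
scores = {f: permutation_importance(predict, rows, target, f, rng)
          for f in ("casual", "registered", "hum")}
print(scores)
```

In this sketch a feature the model ignores (hum) scores zero, while shuffling a feature the model relies on increases the error.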
Comparison of Algorithms
Mean Absolute Error (MAE)
In this section, the Mean Absolute Error (MAE) is used as the metric to compare the performance of the algorithms. Wesner (2016) defines MAE as “the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight.”
Mean Absolute Error measures how closely the predictions of the model match the actual outcomes. It is calculated by summing the absolute differences between the predicted values and the actual outcomes over all predictions, and then dividing by the number of predictions. Brownlee (2016a) mentions that MAE gives an idea of the magnitude of the error but not of its direction, that is, whether the model is over- or under-predicting. This is a disadvantage of the method, and it arises because the absolute value of the difference between the prediction and the actual outcome is taken.
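As a small worked illustration of the calculation described above, the following Python sketch computes MAE on made-up numbers. The values and the function name are hypothetical; the two cases show that over-prediction and under-prediction of the same size give the same score.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all predictions."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical hourly rental counts, for illustration only.
actual = [10, 20, 30, 40]
over = [15, 25, 35, 45]   # every prediction is 5 too high
under = [5, 15, 25, 35]   # every prediction is 5 too low

mae_over = mean_absolute_error(actual, over)
mae_under = mean_absolute_error(actual, under)
print(mae_over, mae_under)  # both are 5.0
```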
The Filter Based Feature Selection module, configured to use Spearman correlation, was used to select the two most important features, registered and casual.
The data was then split into two, a training set and a test set. Using the Split Data module with a fraction of 0.75, the data set was split into a 75% training set and a 25% test set. The training set is used for training the model, and the 25% test set is used later to test and score the trained model.
Three different branches were then created for training, each with a different type of regression module:
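A 75/25 split of this kind can be sketched in plain Python as follows. This is only an illustration: the function name and the seed are hypothetical, and it assumes the rows are shuffled before being cut, which may differ from how the Split Data module was configured.

```python
import random

def split_rows(rows, train_fraction, seed):
    """Shuffle a copy of the rows, then cut it at the requested fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Row indices stand in for the 17,379 hourly records.
rows = list(range(17379))
train, test = split_rows(rows, 0.75, seed=42)
print(len(train), len(test))  # 13034 4345
```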
- Boosted Decision Tree Regression
- Linear Regression
- Decision Forest Regression
Each of the three branches contained a separate Train Model module, connected to one of the training algorithms and to the training data set. The Score Model module was then used to generate predictions using the trained model and the test data set. The Score Model module outputs an additional column containing the predicted value for each test row. Its results are then connected to an Evaluate Model module to evaluate the model's performance.
The initial trained models resulted in the following MAE scores:
- Boosted Decision Tree Regression: 0.026107
- Linear Regression: 0
- Decision Forest Regression: 0.00751
The linear regression model had an impressive performance, predicting all the data correctly with an MAE of zero, as shown in the screenshot above and in its error histogram.
Increasing the number of features did not make a difference to the performance, as shown by the Permutation Feature Importance module scores. Zhao (2016) states that the Permutation Feature Importance module computes importance scores for feature variables by determining the sensitivity of the model to random permutations of the values of those features. Permuting the values of more important features results in a significant reduction of the model’s performance compared to the effect of less important features. The above table, which shows the scores of several features used for training, shows that registered and casual had a significant impact on the model, whereas other variables such as hr, atemp, hum, temp and season had no impact on the model’s performance. To summarise, because of the predictive power of the casual and registered features, adding or removing any other features will not make a difference to the model’s performance while those two are used.
A polynomial feature expansion of order 2 was also conducted on the data set: the features were squared and added to the data set for the training of the polynomial model. The MAE score of this polynomial regression was still zero, so there was no impact on performance.
Looking at the feature importance scores, the squared features were not as important and did not make a difference to the model.
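The squaring step described above can be sketched in Python as follows. The function name and the "_sq" column suffix are hypothetical, and only squared terms are added because that is what the text describes; a full order-2 expansion would usually include cross terms as well.

```python
def add_squared_features(row, features):
    """Return a copy of the row with a squared column for each named feature."""
    expanded = dict(row)
    for name in features:
        expanded[f"{name}_sq"] = row[name] ** 2
    return expanded

# Hypothetical single row, for illustration only.
row = {"casual": 3, "registered": 13}
expanded = add_squared_features(row, ["casual", "registered"])
print(expanded)
```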
Boosted Decision Tree Regression and Decision Forest
A boosted decision tree regression was trained with the default parameters; its MAE score was 0.024677, worse than the linear regression model. A decision forest model scored 0.01505, which was better than the boosted decision tree. Boosted decision trees and decision forests are similar in that both combine multiple decision trees grown from the features of the data set. The difference lies in how the trees are built and combined: boosting grows trees one after another, with each new tree correcting the errors of the previous ones, whereas a decision forest grows its trees independently on subsets of the data and combines their individual predictions. The decision forest performed better than the boosted decision tree, so the choice of ensemble method does make a difference to the predictive power of the models.
The tree models were more sensitive to the change in features, whereas the linear model had the same performance. This may be because the linear model has low variance, so it can be less prone to over-fitting or to a change in features. The additional features could also have added a lot of noise to the tree models, which trees do not handle well.
Linear regression had the best performance. A likely reason is that, as a linear model, it excels in situations where the data has a linear shape. The total number of rentals is the sum of the casual and registered counts, so there was a very strong linear relationship between these two features and the target.
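The effect of this exact linear relationship can be sketched in Python: when the target is the sum of two features, a least-squares fit recovers a weight of 1 for each and the error vanishes. The sketch below uses made-up values, a hypothetical function name, and a no-intercept fit solved through the normal equations; it is not the Azure Linear Regression module.

```python
def fit_two_feature_least_squares(x1, x2, y):
    """Solve the 2x2 normal equations for y ~ w1*x1 + w2*x2 (no intercept)."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("features are collinear")
    w1 = (t1 * s22 - t2 * s12) / det
    w2 = (s11 * t2 - s12 * t1) / det
    return w1, w2

# Hypothetical hourly values, for illustration only.
casual = [3, 0, 12, 7, 25]
registered = [13, 40, 56, 9, 110]
cnt = [c + r for c, r in zip(casual, registered)]

w1, w2 = fit_two_feature_least_squares(casual, registered, cnt)
predictions = [w1 * c + w2 * r for c, r in zip(casual, registered)]
fit_mae = sum(abs(a - p) for a, p in zip(cnt, predictions)) / len(cnt)
print(w1, w2, fit_mae)
```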
In order to understand the parameters of the boosted decision tree module, the parameter range option was used along with the Tune Model Hyperparameters module.
The Parameter Range option was selected in the decision tree module, and the minimum number of samples per leaf node was set to [1,2,3,4,6,8,10,15,20,40,80]. The Tune Model Hyperparameters module was configured to use the entire grid, with the label column selected and MAE as the performance metric.
Two Tune Model Hyperparameters modules were used for training, one with only training data and the second with test data as well. This made it possible to show the performance of the models on both training and test data in the chart below.
It looks like a minimum of 4 samples had the best performance on both the test and training data, with the lowest error. A value of 80 is an example of a model that is underfitting, as both the test and training errors are very high. Overall, there seems to be underfitting for minimum samples above 4 and overfitting for values below 4, which perform excellently on the training data. 4 will be chosen due to its lowest error scores as well as its position in the middle between over- and underfitting. Brownlee (2016b) supports this decision, stating that “The sweet spot is the point just before the error on the test dataset starts to increase where the model has good skill on both the training dataset and the unseen test dataset.” A value of 4 has a great score on both test and training data, and it sits at the point just before the error on the test set starts to increase.
The training was repeated using the Partition and Sample module to split the data into ten folds and train again using cross validation. The cross-validation results are added to the original chart as green points, as shown below.
Looking at this chart, the cross-validation performance is extremely close to that of the training set, so the chart on the right was added with the testing data removed in order to visualize the difference between the training and cross-validation data more clearly. A minimum samples value of 4, which originally seemed like the best, performed worse during cross validation than on the training data.
Cross validation splits the data set into subsets: part of the data is used for training and the rest is used for testing to estimate the accuracy of the model. In the cross validation performed here, which used 10 folds as configured in the Partition and Sample module, 90% of the data was used for training and 10% for testing in each fold. In the training and test set method (“hold-out validation”), by contrast, the data is split in two once: 75% for training only, and the other 25% held out for testing after training. The cross-validation results will be used because, compared to hold-out validation, cross validation exposes training to more data in general. More data is available for training, which exposes the model to more possible occurrences, and the model is also tested with more test data and fine-tuned based on the results of each fold.
The main disadvantage of using cross validation over hold-out is its additional computation time, but given the size of our data set and the computational resources available, it has no noticeable effect.
With the addition of the cross-validation results, 2 will be chosen as the best-performing minimum samples value instead of 4, which was chosen in the earlier section, mainly because 2 has the best average performance across both methods.
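The fold structure described above can be sketched in Python as follows. This is an illustration only: the function name is hypothetical and the folds are contiguous blocks of row indices, which is an assumption, since the Partition and Sample module may assign rows to folds differently.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, test_indices) for each fold, in row order."""
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size

# With 100 rows and 10 folds, each fold trains on 90 rows and tests on 10.
folds = list(k_fold_indices(100, 10))
for train, test in folds:
    print(len(train), len(test))
```

Every row appears in exactly one test fold, so across the ten folds the model is evaluated on the whole data set.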
Time Series Modelling not done
There are several reasons why the algorithm would work significantly worse on 2013 data. The model was not exposed to any data from 2013 during training, so conditions could be completely different. If some data from 2013 had been available for training, the model may have performed better.
Using the Split Data module in relative expression mode, the data set was split into two using the relational expression \"yr" == 0. The Filter Based Feature Selection module was used to select the most important features. The 2011 data in the first branch was used for training all the models. The 2012 data in the right branch was used for testing, connected to the Score Model modules. The Evaluate Model module was used to show the performance of each of the models:
- Boosted Decision Tree Regression: 0.072506
- Linear Regression: 0
- Decision Forest Regression: 0.071729
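The year-based split that produced these scores can be sketched in Python as follows. The rows are made up and the function name is hypothetical; the only detail taken from the text is that rows with yr equal to 0 form the training set and the remaining rows form the test set.

```python
def split_by_year(rows):
    """Rows with yr == 0 go to training; all other rows go to testing."""
    train = [r for r in rows if r["yr"] == 0]
    test = [r for r in rows if r["yr"] != 0]
    return train, test

# Hypothetical rows, for illustration only.
rows = [
    {"yr": 0, "cnt": 16},
    {"yr": 0, "cnt": 40},
    {"yr": 1, "cnt": 48},
    {"yr": 1, "cnt": 93},
    {"yr": 1, "cnt": 110},
]
train, test = split_by_year(rows)
print(len(train), len(test))  # 2 3
```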
12 hours Modelling
Using an R Script module, additional features were added to the data set, each feature adding bike usage from 1 to 12 hours before the current row using the following code.
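As a rough illustration only, and not the R script used in the experiment, the same lag construction could be sketched in Python as below. The function name and the cnt_lag1, cnt_lag2, ... column names are hypothetical, and rows without enough history are given None in place of NA.

```python
def add_lag_features(counts, n_lags):
    """For each row, attach the counts from 1..n_lags rows earlier."""
    rows = []
    for i, current in enumerate(counts):
        row = {"cnt": current}
        for lag in range(1, n_lags + 1):
            # None marks a lag that reaches before the start of the data.
            row[f"cnt_lag{lag}"] = counts[i - lag] if i - lag >= 0 else None
        rows.append(row)
    return rows

# Hypothetical hourly counts, for illustration only.
counts = [5, 8, 13, 21]
lagged = add_lag_features(counts, n_lags=2)
for row in lagged:
    print(row)
```

Calling the function with n_lags=12 would give the twelve previous-hour columns described above.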
Using cross validation along with the Partition and Sample module to split the data into ten folds, the mean absolute errors are as follows:
- Boosted Decision Tree Regression: 0.142499
- Linear Regression: 0.300381
- Decision Forest Regression: 0.141181
Using the same process as the 12-hour model, the number of hours in the R script was modified from 12 to 12*48 to produce the following scores for the 12-day model:
- Boosted Decision Tree Regression: 0.141352
- Linear Regression: 0.192412
- Decision Forest Regression: 0.204128
Using 12 days produced better performance than using only 12 hours for the boosted decision tree and linear regression models, although the decision forest performed worse. The improvement could be due to several reasons, such as the fact that more data was available to train the model. The 12-day model was exposed to more occurrences than the 12-hour model. Brownlee (2016c) agrees, stating “More data is often more helpful, offering greater opportunity for exploratory data analysis, model testing and tuning, and model fidelity.” As well as having more data, the 12-day model is also exposed to more variations and trends that can be used for more accurate predictions. The 12-day model also has access to the previous week’s value at that specific time; since it would be the same day of the week, that value is likely to be similar to the one being predicted.
Time Series Prediction
The same script used in the previous section was used to add the previous 12 hours as features. An additional section was added to the code to do the opposite: adding the values of the next 2, 3, 4 and 5 hours to every current row.
The bottom 4 rows were also removed, as there was not enough data available to add their next 2, 3, 4 and 5 hours and they would have contained NAs.
The data was then normalized, and the Partition and Sample module was used to divide the data into ten folds.
Each Select Columns in Dataset module removed the hours that were not being trained in that branch. For example, the first branch, responsible for predicting the second hour, removed cnt, cnt3, cnt4 and cnt5, with cnt2 selected as the label column in the cross-validation module. The second branch, for predicting the third hour, had cnt2, cnt4 and cnt5 excluded from the data set, with cnt3 chosen as the label column for cross validation, and so forth for the 4th and 5th hours. The labels not being trained in each branch were removed from the data set because they would interfere with the results. The results from the four cross-validation modules were joined and passed through a script to visualize the mean absolute error of each training run.
This plot shows the prediction horizon on the x-axis and the mean absolute error on the y-axis. The results show that performance gradually decreases the further into the future the prediction is made.
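The construction of these future-value columns can be sketched in Python as follows. This is not the script used in the experiment: the function name is hypothetical, and cnt2 is assumed to mean the value two rows ahead, which the text does not specify. Trailing rows that lack any of the future values are dropped, mirroring the removal of the bottom rows described above.

```python
def add_lead_features(counts, offsets):
    """Attach future counts as label columns; drop rows lacking any of them."""
    max_offset = max(offsets)
    rows = []
    for i in range(len(counts) - max_offset):
        row = {"cnt": counts[i]}
        for k in offsets:
            # Assumption: cnt<k> holds the count k rows after the current one.
            row[f"cnt{k}"] = counts[i + k]
        rows.append(row)
    return rows

# Hypothetical hourly counts, for illustration only.
counts = [4, 9, 16, 25, 36, 49, 64, 81]
lead_rows = add_lead_features(counts, offsets=(2, 3, 4, 5))
for row in lead_rows:
    print(row)
```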
Brownlee, J. (2016a) Metrics To Evaluate Machine Learning Algorithms in Python – Machine Learning Mastery. Available from http://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python/ [accessed 4/22/2017].
Brownlee, J. (2016b) Overfitting and Underfitting With Machine Learning Algorithms – Machine Learning Mastery. Available from http://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/ [accessed 5/1/2017].
Brownlee, J. (2016c) What Is Time Series Forecasting? – Machine Learning Mastery. Available from http://machinelearningmastery.com/time-series-forecasting/ [accessed 4/27/2017].
Wesner, J. (2016) MAE and RMSE — Which Metric is Better? Available from https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d [accessed 4/22/2017].
Zhao, K. (2016) Permutation Feature Importance | Cortana Intelligence Gallery. Available from https://gallery.cortanaintelligence.com/Experiment/Permutation-Feature-Importance-5 [accessed 4/22/2017].