Data Programming Tools
For this project, R is used as the core programming tool, with RStudio as the IDE. R is a statistical programming language based on the S programming language. It was initially released in 1995 by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. According to Piatetsky (2015), R is the most widely used tool for predictive modelling, with a 38% share of users compared to only 31% for the next competitor, RapidMiner, in 2014. R is used in a variety of areas, including data mining and data analysis, and offers a wide range of tools that can be used to create this solution.
Although R ships with numerous built-in features, it also allows the user to extend its capabilities. Mayor (2015, p.1) states that “one of the most important strengths of R is the degree to which its functionalities can be extended by installing packages made by users of the community”. Because R is open source, a wide range of packages can be installed to extend or add functionality, so any function needed for this solution that is not built in can be installed. Mayor (2015, p.1) also states that there are currently over 4500 packages available for R, with more being added; a further advantage is that most of these are free.
Another advantage of R is that it simplifies machine learning. The user does not need to implement each algorithm step by step from the underlying theory, because prebuilt functions exist for almost all machine learning algorithms. This simplifies the development process greatly. R also provides strong graphical capabilities: data can be visualised in numerous, highly customisable ways such as charts, bar graphs and scatter plots. This will be useful when analysing the data set, both to make the information easy to consume and to convey results and findings graphically.
Python is an alternative to R. Since Python is a general-purpose programming language, its main advantage over R is that it is easier to integrate data analysis tasks with other applications, such as web applications. R has considerably more packages available than Python, although Python is getting closer. R was designed to make data analysis simpler, but it tends to use a large amount of computing resources, which can make it extremely slow in certain situations. In this case, however, the data being processed is not extremely large, so the software's capabilities should be more than enough. Theuwissen (2015) compares the capabilities of Python and R, pointing out that both tools have a number of excellent libraries available for visualisation but that visualisations in R “are not always pleasing to the eye”. Theuwissen (2015) also points out several situations where each tool succeeds. In this case, because the solution required is purely statistical and machine learning based, with no need for integration with other applications, R is the best tool for solving this problem.
The first step was to explore the raw data to understand its structure. The dim() function revealed that the data set contains 371 variables and 38010 observations. The names() function provided the names of the variables, such as imp_amort_var18_ult1. The str() function displayed the structure of the data set, showing the different types of variables included.
Combining sapply() with the class() function showed that the variables in the data set are all of type integer or numeric, as shown in figure 1. The is.na() function showed that there were no missing values in the data set.
A closer look at the target variable using the summary() and unique() functions showed that it has only two possible outcomes, 0 and 1, so this prediction requires a classifier rather than a regression model.
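The exploration steps described above can be sketched on a small stand-in data frame. This is only an illustration: the data frame `train` and the columns `var_a` and `var_b` are made up, and only the function calls mirror those named in the text.

```r
# Toy stand-in for the training data (column names other than TARGET are made up)
train <- data.frame(
  var_a  = c(0L, 3L, 0L, 1L, 0L, 2L),
  var_b  = c(10.5, 0, 3.2, 0, 7.1, 0),
  TARGET = c(0L, 0L, 1L, 0L, 0L, 1L)
)

dim(train)                                   # number of rows and columns
names(train)                                 # variable names
str(train)                                   # type and first values of each column
col_types <- sapply(train, class)            # data type of every column
table(col_types)                             # how many columns of each type
n_missing <- sum(is.na(train))               # total number of missing values
summary(train$TARGET)                        # distribution summary of the target
target_values <- sort(unique(train$TARGET))  # distinct outcomes of the target
```

A target with only the two distinct values 0 and 1 is what signals a binary classification problem.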
Training Data Pre-processing and Analysis
Looking at the proportions of the target variable revealed that the data set was very imbalanced: the proportion of outcome 0 was 0.96 and the proportion of outcome 1 only 0.039. Brownlee (2015) states that a data set is imbalanced when the ratio between two classes is 80:20 or more, and that this can cause problems such as the accuracy paradox. Laux (2015) adds more information about the accuracy paradox: “The accuracy paradox for predictive analytics states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy.” and “It may seem obvious that the ratio of correct predictions to cases should be a key metric. A predictive model may have high accuracy, but be useless.” In this data set, a model that always predicted 0 would be correct about 96% of the time while never identifying an outcome of 1. Brownlee (2015) presents several methods for preventing this issue, such as collecting more data, which is not possible in this case, or attempting to “under or over sample” the data. The decision on how to fix the imbalance was based on the work of Manish (2016), who demonstrates several sampling solutions using the ROSE package: under-sampling, over-sampling, and synthetic sampling, which generates additional data synthetically. All of the sampling techniques were tried, and combined over- and under-sampling provided the best AUC score, resulting in a very even distribution of 0 = 0.5016216 and 1 = 0.4983784.
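As a rough illustration of combined over- and under-sampling, the sketch below uses base R only. It is a stand-in for the idea, not the ROSE call used in the project: the toy data and the exact 50/50 draw are assumptions (ROSE samples the classes probabilistically, which is why the reported proportions are close to, but not exactly, 0.5).

```r
set.seed(1)
# Toy imbalanced data: 4% positives, similar to the proportions reported above
toy <- data.frame(x = rnorm(1000), TARGET = rep(c(0, 1), times = c(960, 40)))
prop.table(table(toy$TARGET))          # 0.96 vs 0.04

# Keep the total number of rows, drawing half from each class:
# the majority class is under-sampled (without replacement),
# the minority class is over-sampled (with replacement)
n_total   <- nrow(toy)
idx_major <- which(toy$TARGET == 0)
idx_minor <- which(toy$TARGET == 1)
balanced  <- toy[c(sample(idx_major, n_total / 2),
                   sample(idx_minor, n_total / 2, replace = TRUE)), ]
balanced_props <- prop.table(table(balanced$TARGET))
```

Over-sampling with replacement duplicates minority rows, so the balanced data contains repeated observations of class 1.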
It is tempting to assume that the more variables there are, the better the model will be, but this is not the case. Parker (2013) highlights that adding variables that are not useful, or that are redundant because the information already exists elsewhere, will lead to a bad model and can also increase the time it takes to train. The pre-processing task was therefore not only to clean the data but also to remove data that would not be useful. Checking the number of distinct values in every variable revealed that 85 of them contained just a single value in all rows. These were removed, as they provide no meaningful information for classification.
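A minimal sketch of the constant-column check, on a made-up data frame `df`:

```r
# Toy data: columns b and d hold a single value in every row
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c(0, 0, 0, 0),
                 c = c(5, 5, 6, 5),
                 d = rep(9L, 4))

# Count the distinct values in each column
n_distinct <- sapply(df, function(col) length(unique(col)))

constant_cols <- names(n_distinct)[n_distinct == 1]   # columns carrying no information
df_reduced    <- df[, n_distinct > 1, drop = FALSE]   # keep only the informative columns
```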
Outliers in a data set can reduce the accuracy of the classification. Stevens (1984) agrees, saying that the results of a regression analysis can be quite sensitive to outliers and that it is important to be able to detect such outliers and influential data points. Outliers are usually incomplete, inconsistent or noisy values that can be caused by data transmission or collection problems.
Mahalanobis distance is a method for finding outliers in multidimensional data. It works by first computing the mean of every variable across all rows of the data set, giving a central point. The distance of each row from this central mean, scaled by the covariance of the variables, is then measured, giving the Mahalanobis distance. Stevens (1984) states that “A large distance indicates an observation that is an outlier in the space defined by the predictors”. Rows too far away from the central mean can be described as outliers.
The next step was to determine a threshold: rows with a Mahalanobis distance above the threshold are flagged as outliers. The threshold was determined using the chi-squared distribution; all values above the red line in figure 2 are outliers.
de Jonge and van der Loo (2013) make the important point that outliers are not always errors in the data and should not always be removed; their inclusion in the analysis is a statistical decision. There are several ways to deal with these outliers, such as replacing the outlying values with the mean or median of the variable, but this risks skewing the entire data set, so the decision was made to remove all rows containing outliers. One could also argue that, given the large number of rows in the data set, outliers would not have a major effect on the results. In total, 94 rows were identified as outliers and removed from the training data set.
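The outlier step can be sketched with the base R functions mahalanobis() and qchisq(). The toy matrix `X` and the 0.999 quantile are assumptions made for illustration; the level actually used for the cut-off is not stated in this report.

```r
set.seed(7)
# Toy numeric data: 300 ordinary rows followed by 3 extreme rows
X <- rbind(matrix(rnorm(300 * 3), ncol = 3),
           matrix(c(9, 9, 9,  -8, 10, -9,  12, -11, 10), ncol = 3, byrow = TRUE))

# Squared Mahalanobis distance of every row from the vector of column means
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Cut-off from the chi-squared distribution with df = number of variables
# (the 0.999 quantile is an assumed choice)
cutoff <- qchisq(0.999, df = ncol(X))

is_outlier <- d2 > cutoff
X_clean    <- X[!is_outlier, , drop = FALSE]   # keep only rows under the cut-off
```

Note that mahalanobis() returns squared distances, which is why they are compared directly against a chi-squared quantile.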
The next step was to analyse the relationships between the predictor variables and the target variable, in order to choose which variables to use for the prediction. To find these relationships, the covariance matrix was calculated.
Yu-Wei (2015) describes that a positive covariance shows that two variables are positively linearly related, whereas a negative covariance shows that they are negatively related. Yu-Wei (2015) also states that covariance cannot be used to directly compare the strength of the relationship between one pair of variables with that of a different pair measured in different units. Mayor (2015) agrees, adding: “The problem with covariance is that it is not a standardized measure of association that is, the value of the measure depends upon the unit in which the attributes are measured.” For example, it is not possible to compare the covariance between a measurement in inches and the target with the covariance between a measurement in kilograms and the same target, as they are in different units of measurement. To allow all the different variables to be compared, the data must be normalized, which is what a correlation coefficient does.
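The unit dependence of covariance can be seen in a small worked example. The height and weight values below are invented purely for illustration.

```r
# The same heights expressed in two units, paired with the same weights
height_in <- c(60, 62, 65, 68, 70, 72)
weight_kg <- c(52, 55, 61, 68, 72, 80)
height_cm <- height_in * 2.54

cov_in <- cov(height_in, weight_kg)
cov_cm <- cov(height_cm, weight_kg)   # 2.54 times larger: covariance depends on the unit
cor_in <- cor(height_in, weight_kg)
cor_cm <- cor(height_cm, weight_kg)   # unchanged: correlation is unit-free

# Correlation is the covariance of the standardised (z-scored) variables
cov_z <- cov(scale(height_in), scale(weight_kg))[1, 1]
```

Because correlation is always between -1 and 1, variables measured in different units can be ranked by their correlation with the target.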
Two correlation methods, Pearson and Spearman, were tested to find the correlation between every variable and the target variable; the results of both are shown as histograms in figure 3. Both methods produced a similar number of variables with a correlation above 0.5 or below -0.5 and a large number of variables with a low correlation, so the choice of method was based on the findings of Eisinga et al. (2013) and Hauke and Kossowski (2011), who both concluded that Pearson is not a reliable measure of correlation between two variables; Spearman was judged the most appropriate for this case. The 18 variables with a Spearman correlation greater than 0.5 or less than -0.5 were selected and the rest were removed from the data set.
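A sketch of this selection step using cor() with method = "spearman". The predictors `strong_pos`, `strong_neg` and `noise` are simulated for illustration; only the 0.5 threshold comes from the text.

```r
set.seed(3)
n <- 200
target <- rbinom(n, 1, 0.5)

# Simulated predictors: two related to the target, one unrelated
predictors <- data.frame(
  strong_pos = target * 2 + rnorm(n, sd = 0.5),
  strong_neg = -target * 2 + rnorm(n, sd = 0.5),
  noise      = rnorm(n)
)

# Spearman correlation of every predictor with the target
rho <- sapply(predictors, function(col) cor(col, target, method = "spearman"))

# Keep predictors whose correlation is above 0.5 or below -0.5
keep     <- names(rho)[abs(rho) > 0.5]
selected <- predictors[, keep, drop = FALSE]
```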
To begin classification, several methods were considered:
- Stochastic Gradient Boosting (gbm)
- Support Vector Machines with Radial kernel (svmRadial)
- Naive Bayes (nb)
- Bagged CART (treebag)
- Random Forest (rf)
Stochastic Gradient Boosting
Stochastic Gradient Boosting uses boosting as its classification methodology. Friedman et al. (2000) describe the process as follows: “Boosting works by sequentially applying a classification algorithm to the re-weighted versions of the training data and then taking a weighted majority vote of the sequence of classifiers”. Boosting uses a sequence of learners that are run one after another. Learners at the beginning fit simple models to the data, which are then analysed for errors. Later learners focus on the cases that the earlier learners, the “weak models”, got wrong. At the end, all models are given a weight and combined into an overall model; essentially, boosting converts a sequence of simple models into a more complex model.
Support Vector Machine with Radial
SVM is a supervised machine learning algorithm. Its goal is to take groups of observations and separate them using an optimal hyperplane, chosen so that the margin between the groups is as large as possible.
Figure 4: (Meyer and Wien, 2015, 2) SVM Linear Classification
SVM was mainly developed for binary classification; figure 4 shows an example separation between two classes of data. With a radial basis function (RBF) kernel, the SVM algorithm automatically determines the centres, weights and threshold that minimise an upper bound on the expected test error (Scholkopf et al., 1997).
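For reference, the radial kernel itself is a simple similarity function. The sketch below assumes the common parameterisation exp(-sigma * ||x - y||^2); the function name `rbf_kernel` and the default `sigma = 1` are illustrative choices, not values taken from this project.

```r
# Radial basis function kernel: similarity decays with squared Euclidean distance
rbf_kernel <- function(x, y, sigma = 1) {
  exp(-sigma * sum((x - y)^2))
}

a     <- c(1, 2)
b     <- c(1, 2)
c_far <- c(6, -3)

k_same <- rbf_kernel(a, b)       # identical points: similarity of exactly 1
k_far  <- rbf_kernel(a, c_far)   # distant points: similarity close to 0
```

A larger sigma makes the similarity fall off faster, so each training point influences a smaller neighbourhood.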
Naive Bayes
Bayesian classifiers are simple probabilistic classifiers: statistical classifiers that predict the probability that a given sample belongs to each class. Naïve Bayes classifiers are based on Bayes' theorem together with the assumption that the effect of a predictor on a class is independent of the values of the other predictors; essentially, the value of one predictor tells us nothing about the value of another. This class-conditional independence assumption is made to reduce the amount of computation needed, making Naïve Bayes faster than most other classifiers. For each class, Naïve Bayes combines the prior probability of the class with the likelihood of each feature value given that class, treating the features independently; the likelihood table is built from a frequency table of the training data. The Naïve Bayes equation is then used to calculate the posterior probability of each class, and the class with the highest posterior probability is the outcome.
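The calculation can be illustrated by hand on a tiny categorical data set. Everything here (the features `outlook` and `windy`, the class labels, the new case) is hypothetical and only meant to show how the prior and the likelihood tables combine.

```r
# Toy categorical training data (hypothetical features)
train <- data.frame(
  outlook = c("sunny", "sunny", "rain", "rain", "rain", "sunny", "rain", "sunny"),
  windy   = c("no",    "no",    "no",   "yes",  "yes",  "no",    "yes",  "yes"),
  class   = c("Y",     "N",     "Y",    "Y",    "N",    "Y",     "N",    "N")
)

prior <- prop.table(table(train$class))   # P(class)

# Likelihood tables built from frequency tables: P(feature value | class)
lik_outlook <- prop.table(table(train$class, train$outlook), margin = 1)
lik_windy   <- prop.table(table(train$class, train$windy),   margin = 1)

# Score a new case: prior times the product of the per-feature likelihoods
new_case  <- list(outlook = "sunny", windy = "no")
score     <- prior * lik_outlook[, new_case$outlook] * lik_windy[, new_case$windy]
posterior <- score / sum(score)           # normalise so the classes sum to 1
predicted <- names(posterior)[which.max(posterior)]
```

Here the case is assigned to class Y, because a non-windy day is three times as likely under Y as under N in the toy data.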
Bagged CART
Bagged CART combines a binary tree classifier with bagging. In a binary tree, the data at the root is split into two children using a question about a feature; each of those children is then split into two grandchildren, and so forth. The children are known as nodes, and nodes that are not split any further are known as leaves. For example, the question “is a feature greater than 50?” splits the data into one node where the feature is less than or equal to 50 and another where it is greater than 50. Bagging (bootstrap aggregation) is a technique for creating multiple classifiers, each trained on a bootstrap sample of the data, and then combining them. This helps to reduce errors and prevent overfitting on the data.
In CART (Classification and Regression Trees), not just a single tree but a sequence of trees is created, each of which could potentially be the optimal tree. The best tree is identified by evaluating the performance of every tree on test data. Steinberg and Colla (2009, p.181) further describe CART as a method that “includes (optional) automatic class balancing and automatic missing value handling, and allows for cost-sensitive learning, dynamic feature construction, and probability tree estimation”.
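Bagging can be sketched with the simplest possible tree, a single-split "stump", in place of a full CART tree. This substitution, the simulated data and the helper names `fit_stump` and `predict_stump` are all assumptions made to keep the example short.

```r
set.seed(42)
n <- 200
x <- runif(n, 0, 100)
y <- factor(ifelse(x + rnorm(n, sd = 15) > 50, "Y", "N"), levels = c("N", "Y"))

# A stump is a one-split tree: choose the cut point with the fewest training errors
fit_stump <- function(x, y) {
  candidates <- sort(unique(x))
  errors <- sapply(candidates,
                   function(cut_point) mean(ifelse(x > cut_point, "Y", "N") != y))
  candidates[which.min(errors)]
}
predict_stump <- function(threshold, values) ifelse(values > threshold, "Y", "N")

# Bagging: fit one stump per bootstrap sample (rows drawn with replacement)
n_models   <- 25
thresholds <- replicate(n_models, {
  idx <- sample(n, replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# Aggregate: every stump votes and the majority class wins
votes           <- sapply(thresholds, predict_stump, values = x)  # n rows, one column per stump
bagged_pred     <- ifelse(rowMeans(votes == "Y") > 0.5, "Y", "N")
bagged_accuracy <- mean(bagged_pred == y)
```

An odd number of models avoids tied votes. A random forest would additionally restrict each split to a random subset of the variables.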
Random Forest
Random forest is a tree-based classification method that builds on bagging by adding an additional level of randomness. Liaw and Wiener (2002) state that the difference between a standard tree and a random forest is that a standard tree splits each node using the best split among all variables, while in a random forest each node is split using the best among a “subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines and neural networks, and is robust against overfitting”. The name comes from the fact that multiple randomised trees are created, and a collection of trees is known as a forest. The steps involved are as follows. First, the training data and variables are randomly sampled into a number of subsets. A tree is then fully grown from each subset, so that each tree sees a different set of data and variables. Finally, during classification, every tree makes a prediction based on the input data, and the output is the majority vote of all the trees.
The first part of training was to split the data set into two parts at random using the createDataPartition function: a training set for learning and a separate test set used to evaluate the model.
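The idea behind a stratified 75/25 split can be sketched in base R as below. This is an approximation of what caret's createDataPartition does, not its implementation, and the toy `target` vector is made up.

```r
set.seed(10)
target <- factor(rep(c("N", "Y"), times = c(60, 40)))

# Within each class, pick 75% of the row indices at random,
# so that both parts keep the original class proportions
rows_by_class <- split(seq_along(target), target)
train_idx <- unlist(lapply(rows_by_class,
                           function(rows) sample(rows, size = round(0.75 * length(rows)))))
test_idx  <- setdiff(seq_along(target), train_idx)

table(target[train_idx])   # 45 N, 30 Y
table(target[test_idx])    # 15 N, 10 Y
```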
The trainControl function was used to set the resampling parameters for training, and repeated cross-validation was chosen. Cross-validation is a method for estimating model performance: the data set is split into subsets (folds); in each iteration one fold is held out for testing and the remaining folds are used for training, giving an accuracy estimate for the model. In repeated cross-validation this whole procedure is carried out multiple times with different random splits; the model is evaluated on every fold of every repetition, and the mean of all the resulting accuracies is reported.
Repeated cross-validation was chosen because a single cross-validation estimate can have a high variance, which is reduced by repeating the process and taking the mean. Vanwinckelen and Blockeel (2012) conclude in their research that repeated cross-validation can be a waste of resources, but Kohavi (1995) found several situations in which repeated cross-validation was the best method to use and also recommends ten-fold cross-validation for optimal results. Increasing the number of folds from 3 to 10 improved the models substantially. Figure 5 shows an example of 10-fold cross-validation: the data is split into 10 groups, and in each iteration a different group is used for testing and the rest for training. Each group is used for testing exactly once during the process, which makes sure that all cases are used for both training and testing.
Figure 5: (Mayor, 2015, 264) Representation of training and testing sets in 10-fold cross validation
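The mechanics of repeated 10-fold cross-validation can be sketched without any packages. The simulated data, the three repeats and the trivial threshold "model" are assumptions for illustration; in the project this bookkeeping is handled by caret.

```r
set.seed(5)
n <- 100
x <- rnorm(n)
y <- ifelse(x + rnorm(n, sd = 0.5) > 0, "Y", "N")

k <- 10
repeats <- 3
fold_accuracy <- c()

for (r in seq_len(repeats)) {
  # Assign every row to one of k folds, reshuffled for each repetition
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    test_rows  <- fold_id == f
    train_rows <- !test_rows
    # Toy model: threshold halfway between the class means, fitted on the training folds only
    threshold <- mean(c(mean(x[train_rows & y == "Y"]),
                        mean(x[train_rows & y == "N"])))
    pred <- ifelse(x[test_rows] > threshold, "Y", "N")
    fold_accuracy <- c(fold_accuracy, mean(pred == y[test_rows]))
  }
}

cv_estimate <- mean(fold_accuracy)   # mean over all folds of all repetitions
```

Each row is tested exactly once per repetition, so the estimate averages 30 held-out accuracies.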
The target values were renamed from 0 and 1 to N and Y and converted to a factor. This is because the training function uses the outcome values as class names, and 0 and 1 are not valid names in R. It also prevents the training process from assuming that a regression model is being built because of the numeric target variable.
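A small sketch of the relabelling, assuming (as the later note on the confusion matrix indicates) that 0 becomes N and 1 becomes Y. The vector `target` is a toy example.

```r
target <- c(0, 1, 0, 0, 1)

# "0" and "1" are not valid R names; make.names() has to alter them
altered_names <- make.names(c("0", "1"))

# Relabel the outcomes and convert to a factor so the task is treated as classification
target_f <- factor(target, levels = c(0, 1), labels = c("N", "Y"))
levels(target_f)
```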
Models were trained using the train function of the caret package, changing the model type for each of the chosen models.
m.gbm <- train(x, y, method = "gbm", metric = metric, trControl = control, verbose = FALSE)
Because of the large number of models to be trained and the size of the data set, training took a long time to complete. It was noticed that R used only a single core by default; allowing R access to multiple cores could increase the speed noticeably. The doParallel package allows the number of cores used by subsequent functions to be specified, and specifying 4 cores reduced the training time significantly.
The 5 models were trained using 75% of the training data. The cross-validation set up in the training control was used to estimate the likely future performance of each model, and the results are shown below. The metrics evaluated at this point are the ROC score, sensitivity and specificity.
ROC (receiver operating characteristic) analysis is a method for visualising and comparing the performance of binary classifiers. A ROC chart is a two-dimensional graph that shows the true positive rate on the y-axis and the false positive rate on the x-axis. The true positive rate is the proportion of actual positives that the classifier correctly predicted as positive; the false positive rate is the proportion of actual negatives that the classifier falsely predicted as positive (false alarms). As shown in figure 5, classifier D performs well, as it has a high true positive rate and a very low false positive rate. E is a bad classifier, almost the opposite of D, as it has a very low true positive rate and a high false positive rate. Fawcett (2006) summarises the performance of classifiers on a ROC chart by saying that a point in ROC space is better than another if it is to the north-west of it; classifiers on the left-hand side of the graph, near the x-axis, can be thought of as conservative, as they make positive classifications with only a small number of false positive errors.
Figure 5:(Fawcett, 2006, p.862) A basic ROC graph showing five discrete Classifiers
Although the performance of classifiers can be compared using the false positive and true positive rates on the ROC chart, it is sometimes useful to reduce ROC performance to a single overall value. This can be done by calculating the area under the ROC curve, the AUC score. Fawcett (2006) states that the AUC is a portion of the area of the unit square, so its value will always be between 0 and 1.0; this allows the models to be compared efficiently, as the results of different models are on the same scale. Tape (2012) recommends the following grades for AUC scores.
- .90-1 = Excellent
- .80-.90 = Good
- .70-.80 = Fair
- .60-.70 = Poor
- .50-.60 = Fail
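To make the ROC and AUC definitions concrete, the sketch below computes the ROC points and the AUC by hand for eight invented predictions, then maps the result onto the grades listed above. The labels and scores are made up for illustration.

```r
# True classes (1 = positive) and predicted scores for eight cases
labels <- c(1, 1, 0, 1, 0, 0, 1, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)

# ROC points: lower the decision threshold one case at a time
ord <- order(scores, decreasing = TRUE)
tpr <- c(0, cumsum(labels[ord] == 1) / sum(labels == 1))  # true positive rate
fpr <- c(0, cumsum(labels[ord] == 0) / sum(labels == 0))  # false positive rate

# AUC as the area under the ROC points (trapezoidal rule)
auc_trapezoid <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Equivalent rank-based form: the probability that a randomly chosen
# positive case receives a higher score than a randomly chosen negative case
r     <- rank(scores)
n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)
auc_rank <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Map the AUC onto the grade bands listed above
grade <- cut(auc_rank, breaks = c(0.5, 0.6, 0.7, 0.8, 0.9, 1),
             labels = c("Fail", "Poor", "Fair", "Good", "Excellent"),
             include.lowest = TRUE)
```

Both calculations give 0.75 for this toy example, which falls in the "Fair" band.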
Figure 6: Cross-Validation Results
Figure 6 shows the results of model training using the scores from each of the 10 cross-validation folds. The chart shows the ROC (AUC score), sensitivity and specificity. Sensitivity is the true positive rate of the model, the proportion of positives predicted correctly, while specificity is the true negative rate, the proportion of negatives predicted correctly. The figure shows that Random Forest and Bagged CART performed best: because of their high sensitivity and specificity, they made many more correct predictions, both positive and negative, resulting in a high ROC score. Naïve Bayes and Support Vector Machine (SVM) were the worst performing. Naïve Bayes had a low sensitivity, meaning it predicted most of the instances of the first class as negative, together with a high specificity, which suggests it output the same result for almost all predictions. Based on the cross-validation results, Random Forest and Bagged CART were chosen for further investigation; following the recommendations of Tape (2012), an AUC score greater than 0.90 is taken to indicate an excellent classifier.
After selecting the top two performers from the cross-validation results, the next step was to test the models on data that was not used in training and had therefore not been seen by the models. Before training, a subset of 25% of the training data was set aside for this test.
p.treebag.probs <- predict(m.treebag, newdata = test[, -which(names(test) == "TARGET")], probability = TRUE, type = "prob")
p.rf.probs <- predict(m.rf, newdata = test[, -which(names(test) == "TARGET")], probability = TRUE, type = "prob")
plot(p.rf.perf, col = 2, colorize = TRUE, main = paste("Random Forest AUC:", [email protected]))
plot(p.treebag.perf, col = 2, colorize = TRUE, main = paste("Bagged CART AUC:", [email protected]))
Figure 7: Performance of Random Forest and Bagged Cart
Figure 7 shows the ROC charts based on the actual prediction results of both Random Forest and Bagged CART. Both performed extremely well: Bagged CART had an AUC score of 0.977, while Random Forest was slightly better at 0.988.
Figure 8: Random Forest and Bagged Cart Performance
The performance of the two models was investigated further using a confusion matrix showing both classes. Figure 8 shows the confusion matrices of the predictions; note that before training the target values were renamed from 0 to N and from 1 to Y. Random Forest correctly predicted 4448 N's and 4658 Y's, while falsely predicting 306 Y's and 66 N's. Bagged CART correctly predicted fewer N's, 4418, and the same number of Y's, 4658; it falsely predicted 336 Y's and 66 N's. In comparison, both models predicted the same number of correct Y's, but Random Forest performed better by predicting a larger number of correct N's. In conclusion, Random Forest was chosen as the best model to make the final predictions, as it had the best overall AUC score.
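The counts reported above can be turned into accuracy, sensitivity and specificity. The sketch assumes Y is the positive class and reads the "false Y" counts as N cases predicted as Y (and the "false N" counts as Y cases predicted as N), which is the reading under which both models see the same number of actual N and Y cases. The helper `metrics` is hypothetical.

```r
# Confusion-matrix counts as reported (TN = correct N, TP = correct Y,
# FP = N predicted as Y, FN = Y predicted as N; this mapping is an assumption)
rf   <- c(TN = 4448, FP = 306, FN = 66, TP = 4658)
cart <- c(TN = 4418, FP = 336, FN = 66, TP = 4658)

metrics <- function(m) {
  c(accuracy    = (m[["TN"]] + m[["TP"]]) / sum(m),
    sensitivity = m[["TP"]] / (m[["TP"]] + m[["FN"]]),   # true positive rate
    specificity = m[["TN"]] / (m[["TN"]] + m[["FP"]]))   # true negative rate
}

rf_metrics   <- metrics(rf)
cart_metrics <- metrics(cart)
round(rbind(random_forest = rf_metrics, bagged_cart = cart_metrics), 4)
```

Under this reading the two models have identical sensitivity and differ only in specificity, which matches the comparison drawn in the text.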
References
Brownlee, J. (2014) Feature Selection to Improve Accuracy and Decrease Training Time. Machine Learning Mastery. Available from http://machinelearningmastery.com/feature-selection-to-improve-accuracy-and-decrease-training-time/ [accessed 1/2/2017].
Brownlee, J. (2015) 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset. Machine Learning Mastery. Available from http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/ [accessed 1/3/2017].
Casella, G. and Berger, R.L. Statistical Inference. Pacific Grove, CA: Duxbury.
Chiu, Y.W. (2015) Machine Learning with R Cookbook. Packt Publishing.
de Jonge, E. and van der Loo, M. (2013) An introduction to data cleaning with R. Statistics Netherlands, The Hague, 53.
Eisinga, R., Grotenhuis, M.t. and Pelzer, B. (2013) The reliability of a two-item scale: Pearson, cronbach, or spearman-brown?, International Journal of Public Health, 1-6.
Fawcett, T. (2006) An introduction to ROC analysis. Pattern Recognition Letters, 27(8) 861-874.
Friedman, J., Hastie, T. and Tibshirani, R. (2000) Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2) 337-407.
Hauke, J. and Kossowski, T. (2011) Comparison of values of pearson’s and spearman’s correlation coefficients on the same sets of data. Quaestiones Geographicae, 30(2) 87-93.
Kohavi, R. (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Ijcai, 1137-1145.
Laux, M. (2015) If you did not already know: “Accuracy Paradox”. Data Analytics & R. Available from https://advanceddataanalytics.net/2015/04/19/if-you-did-not-already-know-accuracy-paradox/ [accessed 1/3/2017].
Liaw, A. and Wiener, M. (2002) Classification and regression by randomForest. R News, 2(3) 18-22.
Manish. (2016) Practical Guide to deal with Imbalanced Classification Problems in R. Analytics Vidhya. Available from https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/ [accessed 1/3/2017].
Mayor, E. (2015) Learning Predictive Analytics with R. Packt Publishing.
Meyer, D. and Wien, F.T. (2015) Support vector machines. The Interface to Libsvm in Package e1071.
Nolan, D. and Lang, D.T. (2015) Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press.
Parker, C. (2013) Everything You Wanted to Know About Machine Learning, But Were Too Afraid To Ask (Part One). blog.bigml.com. Available from https://blog.bigml.com/2013/02/15/everything-you-wanted-to-know-about-machine-learning-but-were-too-afraid-to-ask-part-one/ [accessed 2 January 2017].
Piatetsky, G. (2015) R leads RapidMiner, Python catches up, Big Data tools grow, Spark ignites. KDnuggets. Available from http://www.kdnuggets.com/2015/05/poll-r-rapidminer-python-big-data-spark.html [accessed 1/13/2017].
S.S. Alwakeel and H.M. Almansour (2011) Modeling and performance evaluation of message-oriented middleware with priority queuing. Information Technology Journal, (1) 61.
Scholkopf, B., Sung, K., Burges, C.J., Girosi, F., Niyogi, P., Poggio, T. and Vapnik, V. (1997) Comparing support vector machines with gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45(11) 2758-2765.
Steinberg, D. and Colla, P. (2009) CART: Classification and regression trees. The Top Ten Algorithms in Data Mining, 9 179.
Stevens, J.P. (1984) Outliers and influential data points in regression analysis. Psychological Bulletin, 95(2) 334.
Theuwissen, M. (2015) R vs Python for Data Science: The Winner is … : KDnuggets. Available from http://www.kdnuggets.com/2015/05/r-vs-python-data-science.html [accessed 1/13/2017].
Vanwinckelen, G. and Blockeel, H. (2012) On estimating model accuracy with repeated cross-validation. In: BeneLearn 2012: Proceedings of the 21st Belgian-Dutch Conference on Machine Learning, 39-44.
Yu-Wei, C.D.C. (2015) Machine learning with R cookbook. Packt Publishing Ltd.