The goal of this project is to use the supplied datasets (accelerometer data tagged with one of five classes of performance, see Data below) to predict the performance class of 20 new observations. This report outlines an approach using feature selection with CfsSubsetEval in Weka and a cross-validated random forest model, which achieves over 99% cross-validation accuracy and 20/20 correct on the test data.
The data for this report come in the form of CSV training and testing files originally from http://groupware.les.inf.puc-rio.br/har and downloaded on Wed Jun 11 07:27:28 2014. See download_data.R for details.
The class variable represents performing a Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Note that these classes are unordered.
Read more: http://groupware.les.inf.puc-rio.br/har#ixzz34LM1QlJ3
The description above is from the paper Qualitative Activity Recognition of Weight Lifting Exercises (Velloso et al., 2013).
trainPath <- "./data/pml-training.csv"
testPath <- "./data/pml-testing.csv"
train <- read.csv(trainPath)
test <- read.csv(testPath)
nrow(train)
## [1] 19622
nrow(test)
## [1] 20
# Class variable
summary(train$classe)
## A B C D E
## 5580 3797 3422 3216 3607
#summary(train)
#summary(test)
Rather than preprocessing the data with the caret functions nearZeroVar, findCorrelation, and findLinearCombos (or similar techniques), which are shown in this section for informational purposes only, feature selection was done in Weka using CfsSubsetEval, starting from the user_name attribute after removing attributes that were "too informative" (the row number X, the timestamps, and num_window). This is similar to the technique used by the paper authors, who cite M. A. Hall, Correlation-based Feature Subset Selection for Machine Learning, except that they used 17 features rather than 11. See the wekaVars and indVars assignments below for the variables chosen.
Doing feature selection in this fashion sidestepped a number of issues with the data: for example, #DIV/0! strings in the computed fields cause numeric columns to be read as factors, and many attributes contain mostly NAs.
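As a minimal sketch (not used in the analysis above, and assuming the raw files are as downloaded), the #DIV/0! issue could also be handled at read time with na.strings:

# Sketch: treat "#DIV/0!" and empty strings as NA at read time so that the
# computed columns stay numeric instead of being coerced to factors
trainNA <- read.csv(trainPath, na.strings = c("NA", "", "#DIV/0!"))
sum(sapply(trainNA, is.factor))  # far fewer factor columns than with the default read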
After the initial implementation I became aware of the FSelector R package, which offers a similar capability (though it is less flexible; for example, it is not possible to specify a starting feature set). It is used below as an example of how to do this type of feature selection in R (another alternative would be RWeka).
# Check for and remove near-zero-variance variables
library(caret)  # provides nearZeroVar and findCorrelation (also loaded below)
# What is with all the variables having 406 values and 19216 NAs?
# It looks like this has to do with new_window (the derived summary statistics
# are populated only for rows where new_window == "yes")
nzv <- nearZeroVar(train)
nzv
## [1] 6 12 13 14 15 16 17 20 23 26 51 52 53 54 55 56 57
## [18] 58 59 69 70 71 72 73 74 75 78 79 81 82 87 88 89 90
## [35] 91 92 95 98 101 125 126 127 128 129 130 131 133 134 136 137 139
## [52] 142 143 144 145 146 147 148 149 150
#sapply(nzv, function(x) summary(train[x]))
train <- train[,-nzv]
dataCor <- cor(train[sapply(train, is.numeric)], use="pairwise.complete.obs")
highlyCor <- findCorrelation(dataCor, cutoff = .75)
highlyCor
## [1] 9 56 22 11 7 31 16 10 69 12 30 28 19 5 77 8 54 55 61 75 60 59 66
## [24] 78 70 67 64 86 79 2 42 32 63 44 14 51 17 15 46 93 20 40 72 74 13 90
## [47] 23
# require(corrplot) # Nice correlation plot
# corrplot(dataCor)
# combos <- findLinearCombos(train[sapply(train, is.numeric)])
# combos
# Variable selection using CFS subset
library(FSelector)
useVars <- setdiff(colnames(train), c("X", "raw_timestamp_part_1",
"raw_timestamp_part_2", "cvtd_timestamp",
"num_window"))
cfsVars <- cfs(classe ~ ., train[, useVars])
# Weka CfsSubsetEval came up with this list after removing a few, merit 0.234
# 1,3,4,5,40,55,61,109,114,115,118 : 11
wekaVars <- c("user_name", "roll_belt", "pitch_belt", "yaw_belt",
"magnet_belt_z", "gyros_arm_x", "magnet_arm_x",
"gyros_dumbbell_y", "magnet_dumbbell_x", "magnet_dumbbell_y",
"pitch_forearm")
# Using 10-fold CV for CfsSubsetEval consistently gave the same list minus magnet_dumbbell_x
cfsVars
## [1] "yaw_belt" "min_roll_belt" "var_total_accel_belt"
## [4] "avg_roll_belt" "stddev_roll_belt" "var_roll_belt"
## [7] "avg_yaw_belt" "magnet_arm_z" "max_yaw_arm"
## [10] "amplitude_yaw_arm" "var_accel_dumbbell" "avg_roll_dumbbell"
## [13] "var_accel_forearm"
wekaVars
## [1] "user_name" "roll_belt" "pitch_belt"
## [4] "yaw_belt" "magnet_belt_z" "gyros_arm_x"
## [7] "magnet_arm_x" "gyros_dumbbell_y" "magnet_dumbbell_x"
## [10] "magnet_dumbbell_y" "pitch_forearm"
intersect(wekaVars, cfsVars)
## [1] "yaw_belt"
# setdiff(cfsVars, wekaVars)
# setdiff(wekaVars, cfsVars)
# Set dependent and independent variables we will use below
depVar <- "classe"
indVars <- wekaVars
# indVars <- cfsVars # Try this for comparison (results were unsatisfactory)
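A quick sanity check (a sketch, not part of the original workflow) that the selected predictors are present in both data sets:

# Verify the chosen predictors exist in both the training and testing data
stopifnot(all(c(depVar, indVars) %in% colnames(train)),
          all(indVars %in% colnames(test)))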
The variable lists returned by FSelector and Weka are surprisingly different given that they implement a similar algorithm. Since the Weka implementation was written by M. A. Hall (the author of the original paper describing the technique) and is more flexible, I tend to prefer it.
It turned out that FSelector::cfs chose a poor set of variables; most of them are almost entirely NA. Therefore I will continue to use the variables selected by Weka's CfsSubsetEval.
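A quick way to check that claim (a sketch):

# Fraction of NA values in each FSelector-selected variable
round(colMeans(is.na(train[, cfsVars])), 2)
# For comparison, the Weka-selected variables
round(colMeans(is.na(train[, wekaVars])), 2)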
The goal of this project is to predict the manner in which the test subjects did the exercise. This is the "classe" variable in the training set (depVar below). I chose a random forest model to begin with because random forests usually give high accuracy, although there can be runtime costs (parallel processing in caret helps with this) and interpretability costs. The feature selection approach is described above; it was chosen because the paper's authors used it and because I wanted to apply a technique I recently learned in More Data Mining with Weka in a practical setting.
10-fold cross-validation was chosen as the starting model validation technique. Even though there is enough data to make a train/test split possible, I like the averaging effect of cross-validation and have had good experience using it in caret.
For out-of-sample error I would expect something similar to the observed 99% cross-validation accuracy for the same subjects (user_name), and scoring 20/20 on the submission supports that belief. However, I would expect worse results for different subjects. The paper authors performed leave-one-subject-out cross-validation and obtained an accuracy of 78%, which seems like a reasonable estimate for new subjects. I did not feel it was necessary to replicate that technique for this project, but I did so anyway as part of the supplemental analysis, where the LOSOCV accuracy was only 22%.
The initial intent was for this to be a baseline model, but given the performance I see no need to change it.
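The train calls below pass a usingParallel flag to trainControl; that flag and the parallel backend are presumably set up in a chunk not shown here. A minimal sketch of how such a backend might be registered (the cluster size is an assumption):

# Sketch: register a parallel backend so caret can fit resamples in parallel
library(doParallel)
cl <- makeCluster(max(1, parallel::detectCores() - 1))
registerDoParallel(cl)
usingParallel <- TRUE  # consumed by trainControl(allowParallel = usingParallel)
# stopCluster(cl)      # run when the model fits are finished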
# First try a simple random forest using caret
library(caret)
# Note multiclass classification adds some issues
# Since this is a good example of multiclass classification, try out multiClassSummary
# Use modified multiClassSummary to get as much performance info as possible
source("multiClassSummary.R")
trainMetric = "Accuracy" # Classification default, a poor metric if uneven split
trainSummary <- multiClassSummary
fitControl <-
trainControl(
#method = "none", # For testing, requires 1 model (tuneLength or grid)
method = "cv",
number = 10, # 10 is default
repeats = 1,
verboseIter = TRUE, # Debug, seems to be proving helpful
classProbs = TRUE, # Needed for twoClassSummary
summaryFunction = trainSummary,
selectionFunction = "best", # default, see ?best
allowParallel = usingParallel
)
rfGrid <- expand.grid(.mtry = c(5, 10))
set.seed(123)
rfFit <- train(train[,indVars],
train[,depVar],
method = "rf",
metric = trainMetric,
tuneGrid = rfGrid,
#tuneLength = 1, # Run once for debugging
trControl = fitControl,
# Following arguments for rf
importance=TRUE
)
## Warning: There were missing values in resampled performance measures.
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 5 on full training set
# Run time
rfFit$times$everything
## user system elapsed
## 47.15 1.76 2197.90
rfFit
## Random Forest
##
## 19622 samples
## 11 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 17660, 17661, 17659, 17660, 17658, 17660, ...
##
## Resampling results across tuning parameters:
##
## mtry ROC Sensitivity Specificity Accuracy Kappa logLoss
## 5 1 1 1 1 1 0.03
## 10 1 1 1 1 1 0.03
## AccuracyLower AccuracyUpper AccuracyPValue McnemarPValue
## 1 1 0 NaN
## 1 1 0 NaN
## Pos_Pred_Value Neg_Pred_Value Detection_Rate Balanced_Accuracy
## 1 1 0.2 1
## 1 1 0.2 1
## ROC SD Sensitivity SD Specificity SD Accuracy SD Kappa SD
## 9e-05 0.003 8e-04 0.003 0.004
## 2e-04 0.003 8e-04 0.003 0.004
## logLoss SD AccuracyLower SD AccuracyUpper SD AccuracyPValue SD
## 0.001 0.004 0.002 0
## 0.002 0.004 0.003 0
## Pos_Pred_Value SD Neg_Pred_Value SD Detection_Rate SD
## 0.003 8e-04 6e-04
## 0.003 8e-04 6e-04
## Balanced_Accuracy SD
## 0.002
## 0.002
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
rfImp <- varImp(rfFit)
rfImp
## rf variable importance
##
## variables are sorted by average importance across the classes
## A B C D E
## pitch_forearm 100.00 60.32 65.17 82.24 56.81
## pitch_belt 40.74 82.04 48.39 38.86 31.93
## magnet_dumbbell_y 41.08 38.10 69.27 39.45 34.84
## yaw_belt 67.58 66.08 59.32 67.94 29.13
## roll_belt 25.33 50.80 50.10 42.92 62.14
## gyros_dumbbell_y 19.72 26.30 47.24 8.37 15.06
## gyros_arm_x 11.15 44.81 32.50 24.67 19.73
## magnet_arm_x 9.07 18.18 30.59 40.69 5.94
## magnet_dumbbell_x 25.08 30.58 36.24 19.95 36.17
## magnet_belt_z 14.81 35.81 17.18 28.91 32.79
## user_name 0.00 4.72 1.08 10.88 4.98
plot(rfFit)
# Look at performance
getTrainPerf(rfFit)
## TrainROC TrainSensitivity TrainSpecificity TrainAccuracy TrainKappa
## 1 0.9999 0.9904 0.9977 0.991 0.9886
## TrainlogLoss TrainAccuracyLower TrainAccuracyUpper TrainAccuracyPValue
## 1 0.0296 0.9858 0.9946 0
## TrainMcnemarPValue TrainPos_Pred_Value TrainNeg_Pred_Value
## 1 NaN 0.9906 0.9978
## TrainDetection_Rate TrainBalanced_Accuracy method
## 1 0.1982 0.9941 rf
confusionMatrix(rfFit)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction A B C D E
## A 28.3 0.1 0.0 0.1 0.0
## B 0.0 19.0 0.1 0.0 0.0
## C 0.0 0.3 17.3 0.1 0.0
## D 0.0 0.0 0.0 16.2 0.0
## E 0.0 0.0 0.0 0.0 18.3
The cross-validated accuracy is superb at over 99%.
The paper's authors presented a confusion matrix for the leave-one-subject-out test with an overall accuracy of 78.2%. Their reported overall accuracy (on all of the data) was 98.2%.
Now generate predictions on the test set and write to files in the answers directory for submission.
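A sketch of that step (the pml_write_files helper follows the course-provided template; the directory handling is an assumption):

# Predict the 20 test cases with the cross-validated random forest
answers <- as.character(predict(rfFit, test[, indVars]))
# Write one answer file per problem for submission
pml_write_files <- function(x, dir = "answers") {
  if (!dir.exists(dir)) dir.create(dir)
  for (i in seq_along(x)) {
    fname <- file.path(dir, paste0("problem_id_", i, ".txt"))
    write.table(x[i], file = fname, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(answers)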
20 out of 20 answers correct.
This is the end of the official submission. Below this is some supplemental analysis which I thought might be of interest, but should not be included in the grading.
Although accurate, the random forest model above is not very interpretable. Try a decision tree model here to see if we learn anything more about the data.
fitControlMulti <-
trainControl(method = "cv",
number = 10, # 10 is default
repeats = 1,
verboseIter = TRUE, # Debug, seems to be proving helpful
classProbs = TRUE, # Needed for twoClassSummary
summaryFunction = trainSummary,
selectionFunction = "best", # default, see ?best
allowParallel = usingParallel
)
rpartGrid <- expand.grid(.cp = c(0.02, 0.05))
set.seed(123)
rpartFit <- train(train[,indVars],
train[,depVar],
method = "rpart",
metric = trainMetric,
tuneGrid = rpartGrid,
#tuneLength = 1, # Run once for debugging
trControl = fitControlMulti
#trControl = fitControl
# Following arguments for rpart
)
## Warning: There were missing values in resampled performance measures.
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.02 on full training set
## user system elapsed
## 1.92 0.13 43.81
## CART
##
## 19622 samples
## 11 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 17660, 17661, 17659, 17660, 17658, 17660, ...
##
## Resampling results across tuning parameters:
##
## cp ROC Sensitivity Specificity Accuracy Kappa logLoss
## 0.02 0.8 0.6 0.9 0.6 0.5 0.3
## 0.05 0.6 0.3 0.8 0.4 0.1 0.4
## AccuracyLower AccuracyUpper AccuracyPValue McnemarPValue
## 0.6 0.6 2e-141 2e-26
## 0.3 0.4 2e-13 NaN
## Pos_Pred_Value Neg_Pred_Value Detection_Rate Balanced_Accuracy
## 0.6 0.9 0.1 0.7
## NaN 0.9 0.07 0.6
## ROC SD Sensitivity SD Specificity SD Accuracy SD Kappa SD
## 0.02 0.02 0.005 0.02 0.03
## 0.005 0.005 0.001 0.005 0.007
## logLoss SD AccuracyLower SD AccuracyUpper SD AccuracyPValue SD
## 0.01 0.02 0.02 8e-141
## 0.003 0.005 0.005 5e-13
## McnemarPValue SD Pos_Pred_Value SD Neg_Pred_Value SD
## 6e-26 0.02 0.004
## NA NA 0.001
## Detection_Rate SD Balanced_Accuracy SD
## 0.004 0.01
## 0.001 0.003
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02.
## TrainROC TrainSensitivity TrainSpecificity TrainAccuracy TrainKappa
## 1 0.8419 0.5956 0.9005 0.6045 0.5002
## TrainlogLoss TrainAccuracyLower TrainAccuracyUpper TrainAccuracyPValue
## 1 0.3301 0.5825 0.6262 2.393e-141
## TrainMcnemarPValue TrainPos_Pred_Value TrainNeg_Pred_Value
## 1 2.006e-26 0.6361 0.9013
## TrainDetection_Rate TrainBalanced_Accuracy method
## 1 0.1209 0.748 rpart
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction A B C D E
## A 19.7 4.0 2.6 2.8 0.9
## B 1.1 9.0 0.6 0.4 1.1
## C 2.8 2.5 12.4 3.5 2.6
## D 3.7 3.7 1.7 8.5 2.9
## E 1.2 0.1 0.1 1.3 10.9
The accuracy is clearly inferior to the random forest, but for the smaller cp value (0.02) it is not bad at roughly 60%.
The tree was too large to display inline, but a PNG of it is available in the repository as CourseProjectTree.png.
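A sketch of how that PNG could be regenerated from the fitted model (the use of rpart.plot and the plot dimensions are assumptions; any rpart plotting function would do):

# Sketch: save a readable plot of the final rpart tree to a PNG file
library(rpart.plot)
png("CourseProjectTree.png", width = 1600, height = 1000)
prp(rpartFit$finalModel, type = 2, extra = 104, varlen = 0)
dev.off()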
Although mentioned above as outside the scope of this project, leave-one-subject-out cross-validation seemed like a useful technique to learn, so I decided to try implementing it here.
Henceforth called LOSOCV (also abbreviated LOSOXV).
The most straightforward way to do this appears to be via the index argument to trainControl: http://stats.stackexchange.com/questions/93227/how-to-implement-a-hold-out-validation-in-r
It is unclear to me whether or how the LGOCV method in trainControl relates to this.
Related reading: some discussion of leave-one-subject-out cross-validation, a survey of cross-validation procedures for model selection, and LOSOCV at Kaggle.
subject <- train[,"user_name"]
subjectFolds <- length(levels(subject))
# Create folds by subject (one fold per user_name level)
subjectIndexes <- list()
for (i in seq_along(levels(subject))) {
  subjectIndexes[[paste0("Fold", i)]] <- which(subject == levels(subject)[i])
}
str(subjectIndexes)
## List of 6
## $ Fold1: int [1:3892] 694 695 696 697 698 699 700 701 702 703 ...
## $ Fold2: int [1:3112] 1 2 3 4 5 6 7 8 9 10 ...
## $ Fold3: int [1:3536] 902 903 904 905 906 907 908 909 910 911 ...
## $ Fold4: int [1:3070] 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 ...
## $ Fold5: int [1:3402] 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 ...
## $ Fold6: int [1:2610] 166 167 168 169 170 171 172 173 174 175 ...
#summary(subjectIndexes)
#lapply(subjectIndexes, summary)
# Compare to basic createFolds
folds <- createFolds(train[,depVar], k=subjectFolds)
# subjectIndexes <- folds # This works fine
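# Note (not used below): in caret, trainControl(index = ...) lists the rows used
# for *training* in each resample and the complement is held out, so with
# subjectIndexes as built above each resample trains on a single subject and is
# evaluated on the other five (see the per-fold sample sizes in the output below).
# A sketch of the complementary setup, closer to the paper's LOSOCV
# (train on five subjects, hold one out):
losoIndexes <- lapply(subjectIndexes,
                      function(idx) setdiff(seq_len(nrow(train)), idx))
# These could be passed as trainControl(index = losoIndexes, indexOut = subjectIndexes)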
fitControlLOSOCV <-
trainControl(method = "cv",
number = subjectFolds,
repeats = 1,
verboseIter = TRUE, # Debug, seems to be proving helpful
classProbs = TRUE, # Needed for twoClassSummary
summaryFunction = trainSummary,
selectionFunction = "best", # default, see ?best
index = subjectIndexes,
allowParallel = usingParallel
)
rfGridLOSOCV <- expand.grid(.mtry = c(5))
set.seed(123)
rfFitLOSOCV <- train(train[,indVars],
train[,depVar],
method = "rf",
metric = trainMetric,
tuneGrid = rfGridLOSOCV,
#tuneLength = 1, # Run once for debugging
trControl = fitControlLOSOCV,
# Following arguments for rf
importance=TRUE
)
## Warning: There were missing values in resampled performance measures.
## Aggregating results
## Fitting final model on full training set
# Run time
rfFitLOSOCV$times$everything
## user system elapsed
## 45.60 0.67 94.07
rfFitLOSOCV
## Random Forest
##
## 19622 samples
## 11 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (6 fold)
##
## Summary of sample sizes: 3892, 3112, 3536, 3070, 3402, 2610, ...
##
## Resampling results
##
## ROC Sensitivity Specificity Accuracy Kappa logLoss AccuracyLower
## 0.6 0.2 0.8 0.2 0.05 0.8 0.2
## AccuracyUpper AccuracyPValue McnemarPValue Pos_Pred_Value
## 0.2 0.8 0 0.3
## Neg_Pred_Value Detection_Rate Balanced_Accuracy ROC SD
## 0.8 0.05 0.5 0.06
## Sensitivity SD Specificity SD Accuracy SD Kappa SD logLoss SD
## 0.04 0.01 0.05 0.05 0.2
## AccuracyLower SD AccuracyUpper SD AccuracyPValue SD McnemarPValue SD
## 0.05 0.05 0.4 0
## Pos_Pred_Value SD Neg_Pred_Value SD Detection_Rate SD
## 0.02 0.01 0.009
## Balanced_Accuracy SD
## 0.03
##
## Tuning parameter 'mtry' was held constant at a value of 5
##
rfImpLOSOCV <- varImp(rfFitLOSOCV)
rfImpLOSOCV
## rf variable importance
##
## variables are sorted by average importance across the classes
## A B C D E
## pitch_forearm 100.00 64.5 67.87 87.62 62.87
## pitch_belt 49.85 87.7 48.15 41.29 33.19
## magnet_dumbbell_y 41.19 44.4 84.01 41.77 38.89
## yaw_belt 72.74 74.7 63.99 75.29 35.51
## roll_belt 28.95 54.4 54.15 44.11 72.14
## gyros_dumbbell_y 15.68 25.9 52.34 7.26 12.66
## magnet_arm_x 8.93 19.6 33.90 44.24 7.13
## gyros_arm_x 11.42 43.6 34.83 27.99 18.45
## magnet_dumbbell_x 29.38 29.1 39.81 21.14 38.42
## magnet_belt_z 16.81 37.6 18.07 34.16 34.53
## user_name 0.00 6.7 2.56 10.82 6.74
# Look at performance
getTrainPerf(rfFitLOSOCV)
## TrainROC TrainSensitivity TrainSpecificity TrainAccuracy TrainKappa
## 1 0.5889 0.2371 0.8102 0.2344 0.04933
## TrainlogLoss TrainAccuracyLower TrainAccuracyUpper TrainAccuracyPValue
## 1 0.8474 0.228 0.241 0.8333
## TrainMcnemarPValue TrainPos_Pred_Value TrainNeg_Pred_Value
## 1 0 0.2852 0.8146
## TrainDetection_Rate TrainBalanced_Accuracy method
## 1 0.04689 0.5237 rf
confusionMatrix(rfFitLOSOCV)
## Cross-Validated (6 fold) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction A B C D E
## A 5.1 2.2 2.2 1.5 1.4
## B 4.5 4.3 2.5 3.0 2.6
## C 2.8 1.4 1.4 0.7 0.5
## D 2.0 2.1 1.7 1.8 3.0
## E 14.1 9.4 9.7 9.3 10.9
It turns out this model is virtually useless for predicting subjects not included in the training set!
Try a different set of variables (not including user_name) from CfsSubsetEval in Weka.
# Weka CfsSubsetEval came up with this list after removing a few, merit 0.266
# 3,4,5,61,109,115,118 : 7
# This is a subset of the original set so just try removing user_name from the original
wekaVars1 <- c("roll_belt", "pitch_belt", "yaw_belt",
"magnet_belt_z", "gyros_arm_x", "magnet_arm_x",
"gyros_dumbbell_y", "magnet_dumbbell_x", "magnet_dumbbell_y",
"pitch_forearm")
# Using 10-fold CV for CfsSubsetEval gave similar results
set.seed(123)
rfFitLOSOCV1 <- train(train[,wekaVars1],
train[,depVar],
method = "rf",
metric = trainMetric,
tuneGrid = rfGridLOSOCV,
#tuneLength = 1, # Run once for debugging
trControl = fitControlLOSOCV,
# Following arguments for rf
importance=TRUE
)
## Warning: There were missing values in resampled performance measures.
## Aggregating results
## Fitting final model on full training set
# Run time
rfFitLOSOCV1$times$everything
## user system elapsed
## 42.27 1.04 287.15
rfFitLOSOCV1
## Random Forest
##
## 19622 samples
## 10 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (6 fold)
##
## Summary of sample sizes: 3892, 3112, 3536, 3070, 3402, 2610, ...
##
## Resampling results
##
## ROC Sensitivity Specificity Accuracy Kappa logLoss AccuracyLower
## 0.6 0.2 0.8 0.2 0.04 0.9 0.2
## AccuracyUpper AccuracyPValue McnemarPValue Pos_Pred_Value
## 0.2 0.8 0 0.3
## Neg_Pred_Value Detection_Rate Balanced_Accuracy ROC SD
## 0.8 0.05 0.5 0.06
## Sensitivity SD Specificity SD Accuracy SD Kappa SD logLoss SD
## 0.05 0.01 0.06 0.06 0.2
## AccuracyLower SD AccuracyUpper SD AccuracyPValue SD McnemarPValue SD
## 0.06 0.06 0.4 0
## Pos_Pred_Value SD Neg_Pred_Value SD Detection_Rate SD
## 0.06 0.02 0.01
## Balanced_Accuracy SD
## 0.03
##
## Tuning parameter 'mtry' was held constant at a value of 5
##
rfImpLOSOCV1 <- varImp(rfFitLOSOCV1)
rfImpLOSOCV1
## rf variable importance
##
## variables are sorted by average importance across the classes
## A B C D E
## pitch_forearm 100.00 56.0 65.5 88.74 56.89
## pitch_belt 40.08 91.2 52.2 40.34 33.48
## magnet_dumbbell_y 42.13 50.6 88.7 45.84 46.03
## yaw_belt 75.51 79.8 66.0 80.52 36.36
## roll_belt 20.16 52.4 46.3 44.06 71.70
## gyros_arm_x 4.45 44.4 32.1 24.23 9.46
## gyros_dumbbell_y 11.47 24.0 43.1 4.19 9.91
## magnet_arm_x 1.45 11.5 26.8 41.97 0.00
## magnet_dumbbell_x 22.78 25.7 35.1 22.85 36.73
## magnet_belt_z 12.89 35.4 10.4 29.10 35.33
# Look at CV performance
getTrainPerf(rfFitLOSOCV1)
## TrainROC TrainSensitivity TrainSpecificity TrainAccuracy TrainKappa
## 1 0.5888 0.2328 0.8086 0.2269 0.04198
## TrainlogLoss TrainAccuracyLower TrainAccuracyUpper TrainAccuracyPValue
## 1 0.8779 0.2206 0.2334 0.8333
## TrainMcnemarPValue TrainPos_Pred_Value TrainNeg_Pred_Value
## 1 0 0.2545 0.8119
## TrainDetection_Rate TrainBalanced_Accuracy method
## 1 0.04539 0.5207 rf
confusionMatrix(rfFitLOSOCV1)
## Cross-Validated (6 fold) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction A B C D E
## A 4.1 2.0 2.0 1.5 1.2
## B 4.5 4.3 2.6 2.9 2.7
## C 2.8 1.5 1.4 0.7 0.6
## D 2.0 2.1 1.7 1.9 2.9
## E 14.9 9.4 9.8 9.4 11.0
# Look at performance of the final model. This will overfit since it uses the entire
# training set to fit the model.
predLOSOCV1 <- predict(rfFitLOSOCV1)  # predictions on the full training data
confusionMatrix(train$classe, predLOSOCV1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 5580 0 0 0 0
## B 0 3797 0 0 0
## C 0 0 3422 0 0
## D 0 0 0 3216 0
## E 0 0 0 0 3607
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (1, 1)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.000 1.000 1.000 1.000 1.000
## Specificity 1.000 1.000 1.000 1.000 1.000
## Pos Pred Value 1.000 1.000 1.000 1.000 1.000
## Neg Pred Value 1.000 1.000 1.000 1.000 1.000
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.284 0.194 0.174 0.164 0.184
## Detection Prevalence 0.284 0.194 0.174 0.164 0.184
## Balanced Accuracy 1.000 1.000 1.000 1.000 1.000
This model is still virtually useless for predicting subjects outside of the training set. Note the difference between the CV accuracy (usually a good estimate of out-of-sample performance) and the accuracy on the entire training set (grossly overfit).
I am impressed by the 78.2% LOSOCV accuracy quoted in the paper. Working to replicate that result would be worthwhile, but I have already exceeded my time budget for this project. Section 5 (Detection of Mistakes) of the paper states that there were six subjects, so I do not think the authors used more subjects or data than were made available to us.