The goal of this project is to use the supplied datasets (accelerometer data tagged with one of five classes of performance, see Data below) to predict the performance class of 20 new observations. This report outlines an approach using feature selection with CfsSubsetEval in Weka and a cross-validated random forest model, which achieves over 99% cross-validation accuracy and 20/20 correct on the test data.
The data for this report come in the form of CSV training and testing files originally from http://groupware.les.inf.puc-rio.br/har and downloaded on Wed Jun 11 07:27:28 2014. See download_data.R for details.
The class variable represents performing a Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Note that these classes are unordered.
Read more: http://groupware.les.inf.puc-rio.br/har#ixzz34LM1QlJ3
The description above is from the paper Qualitative Activity Recognition of Weight Lifting Exercises (Velloso et al., 2013).
trainPath <- "./data/pml-training.csv"
testPath <- "./data/pml-testing.csv"
train <- read.csv(trainPath)
test <- read.csv(testPath)
nrow(train)
## [1] 19622
nrow(test)
## [1] 20
# Class variable
summary(train$classe)
## A B C D E
## 5580 3797 3422 3216 3607
#summary(train)
#summary(test)
Rather than preprocessing the data with the caret functions nearZeroVar, findCorrelation, and findLinearCombos (or similar techniques), which are shown in this section for informational purposes only, feature selection was done in Weka using CfsSubsetEval, starting from the user_name attribute after removing attributes that were "too informative" (the row number X, the timestamps, and num_window). This is similar to the technique used by the paper authors, who cite M. A. Hall, Correlation-based Feature Subset Selection for Machine Learning, except that they used 17 features rather than 11. See the wekaVars and indVars assignments below for the variables chosen.
Doing feature selection in this fashion sidestepped a number of issues with the data: for example, #DIV/0! strings in the computed fields cause numeric columns to be read as factors, and many attributes contain mostly NAs.
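As a minimal sketch (not used in the analysis above, and assuming the raw files are as downloaded), the #DIV/0! issue could also be handled at read time with na.strings:

# Sketch: treat "#DIV/0!" and empty strings as NA at read time so that the
# computed columns stay numeric instead of being coerced to factors
trainNA <- read.csv(trainPath, na.strings = c("NA", "", "#DIV/0!"))
sum(sapply(trainNA, is.factor))  # far fewer factor columns than with the default read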
After the initial implementation I became aware of the FSelector R package, which offers a similar capability (though it is less flexible; for example, it is not possible to specify a starting feature set). It is used below as an example of how to do this type of feature selection in R (another alternative would be RWeka).
# Check for and remove near-zero-variance variables
library(caret)  # provides nearZeroVar and findCorrelation (also loaded below)
# What is with all the variables having 406 values and 19216 NAs?
# It looks like this has to do with new_window (the derived summary statistics
# are populated only for rows where new_window == "yes")
nzv <- nearZeroVar(train)
nzv
## [1] 6 12 13 14 15 16 17 20 23 26 51 52 53 54 55 56 57
## [18] 58 59 69 70 71 72 73 74 75 78 79 81 82 87 88 89 90
## [35] 91 92 95 98 101 125 126 127 128 129 130 131 133 134 136 137 139
## [52] 142 143 144 145 146 147 148 149 150
#sapply(nzv, function(x) summary(train[x]))
train <- train[,-nzv]
dataCor <- cor(train[sapply(train, is.numeric)], use="pairwise.complete.obs")
highlyCor <- findCorrelation(dataCor, cutoff = .75)
highlyCor
## [1] 9 56 22 11 7 31 16 10 69 12 30 28 19 5 77 8 54 55 61 75 60 59 66
## [24] 78 70 67 64 86 79 2 42 32 63 44 14 51 17 15 46 93 20 40 72 74 13 90
## [47] 23
# require(corrplot) # Nice correlation plot
# corrplot(dataCor)
# combos <- findLinearCombos(train[sapply(train, is.numeric)])
# combos
# Variable selection using CFS subset
library(FSelector)
useVars <- setdiff(colnames(train), c("X", "raw_timestamp_part_1",
"raw_timestamp_part_2", "cvtd_timestamp",
"num_window"))
cfsVars <- cfs(classe ~ ., train[, useVars])
# Weka CfsSubsetEval came up with this list after removing a few, merit 0.234
# 1,3,4,5,40,55,61,109,114,115,118 : 11
wekaVars <- c("user_name", "roll_belt", "pitch_belt", "yaw_belt",
"magnet_belt_z", "gyros_arm_x", "magnet_arm_x",
"gyros_dumbbell_y", "magnet_dumbbell_x", "magnet_dumbbell_y",
"pitch_forearm")
# Using 10-fold CV for CfsSubsetEval consistently gave the same list minus magnet_dumbbell_x
cfsVars
## [1] "yaw_belt" "min_roll_belt" "var_total_accel_belt"
## [4] "avg_roll_belt" "stddev_roll_belt" "var_roll_belt"
## [7] "avg_yaw_belt" "magnet_arm_z" "max_yaw_arm"
## [10] "amplitude_yaw_arm" "var_accel_dumbbell" "avg_roll_dumbbell"
## [13] "var_accel_forearm"
wekaVars
## [1] "user_name" "roll_belt" "pitch_belt"
## [4] "yaw_belt" "magnet_belt_z" "gyros_arm_x"
## [7] "magnet_arm_x" "gyros_dumbbell_y" "magnet_dumbbell_x"
## [10] "magnet_dumbbell_y" "pitch_forearm"
intersect(wekaVars, cfsVars)
## [1] "yaw_belt"
# setdiff(cfsVars, wekaVars)
# setdiff(wekaVars, cfsVars)
# Set dependent and independent variables we will use below
depVar <- "classe"
indVars <- wekaVars
# indVars <- cfsVars # Try this for comparison (results were unsatisfactory)
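A quick sanity check (a sketch, not part of the original workflow) that the selected predictors are present in both data sets:

# Verify the chosen predictors exist in both the training and testing data
stopifnot(all(c(depVar, indVars) %in% colnames(train)),
          all(indVars %in% colnames(test)))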
The variable lists returned by FSelector and Weka are surprisingly different given that they implement a similar algorithm. Since the Weka implementation was written by M. A. Hall (the author of the original paper describing the technique) and is more flexible, I tend to prefer it.
It turned out that FSelector::cfs chose a poor set of variables; most of them are almost entirely NA. Therefore I will continue to use the variables selected by Weka's CfsSubsetEval.
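A quick way to check that claim (a sketch):

# Fraction of NA values in each FSelector-selected variable
round(colMeans(is.na(train[, cfsVars])), 2)
# For comparison, the Weka-selected variables
round(colMeans(is.na(train[, wekaVars])), 2)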
The goal of this project is to predict the manner in which the test subjects did the exercise. This is the "classe" variable in the training set (depVar below). I chose a random forest model to begin with because random forests usually give high accuracy, although there can be runtime costs (parallel processing in caret helps with this) and interpretability costs. The feature selection approach is described above; it was chosen because the paper's authors used it and because I wanted to apply a technique I recently learned in More Data Mining with Weka in a practical setting.
10-fold cross-validation was chosen as the starting model validation technique. Even though there is enough data to make a train/test split possible, I like the averaging effect of cross-validation and have had good experience using it in caret.
For out-of-sample error I would expect something similar to the observed 99% cross-validation accuracy for the same subjects (user_name), and scoring 20/20 on the submission supports that belief. However, I would expect worse results for different subjects. The paper authors performed leave-one-subject-out cross-validation and obtained an accuracy of 78%, which seems like a reasonable estimate for new subjects. I did not feel it was necessary to replicate that technique for this project, but I did so anyway as part of the supplemental analysis, where the LOSOCV accuracy was only 22%.
The initial intent was for this to be a baseline model, but given the performance I see no need to change it.
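The train calls below pass a usingParallel flag to trainControl; that flag and the parallel backend are presumably set up in a chunk not shown here. A minimal sketch of how such a backend might be registered (the cluster size is an assumption):

# Sketch: register a parallel backend so caret can fit resamples in parallel
library(doParallel)
cl <- makeCluster(max(1, parallel::detectCores() - 1))
registerDoParallel(cl)
usingParallel <- TRUE  # consumed by trainControl(allowParallel = usingParallel)
# stopCluster(cl)      # run when the model fits are finished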
# First try a simple random forest using caret
library(caret)
# Note multiclass classification adds some issues
# Since this is a good example of multiclass classification, try out multiClassSummary
# Use modified multiClassSummary to get as much performance info as possible
source("multiClassSummary.R")
trainMetric = "Accuracy" # Classification default, a poor metric if uneven split
trainSummary <- multiClassSummary
fitControl <-
trainControl(
#method = "none", # For testing, requires 1 model (tuneLength or grid)
method = "cv",
number = 10, # 10 is default
repeats = 1,
verboseIter = TRUE, # Debug, seems to be proving helpful
classProbs = TRUE, # Needed for twoClassSummary
summaryFunction = trainSummary,
selectionFunction = "best", # default, see ?best
allowParallel = usingParallel
)
rfGrid <- expand.grid(.mtry = c(5, 10))
set.seed(123)
rfFit <- train(train[,indVars],
train[,depVar],
method = "rf",
metric = trainMetric,
tuneGrid = rfGrid,
#tuneLength = 1, # Run once for debugging
trControl = fitControl,
# Following arguments for rf
importance=TRUE
)
## Warning: There were missing values in resampled performance measures.
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 5 on full training set
# Run time
rfFit$times$everything
## user system elapsed
## 47.15 1.76 2197.90
rfFit
## Random Forest
##
## 19622 samples
## 11 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 17660, 17661, 17659, 17660, 17658, 17660, ...
##
## Resampling results across tuning parameters:
##
## mtry ROC Sensitivity Specificity Accuracy Kappa logLoss
## 5 1 1 1 1 1 0.03
## 10 1 1 1 1 1 0.03
## AccuracyLower AccuracyUpper AccuracyPValue McnemarPValue
## 1 1 0 NaN
## 1 1 0 NaN
## Pos_Pred_Value Neg_Pred_Value Detection_Rate Balanced_Accuracy
## 1 1 0.2 1
## 1 1 0.2 1
## ROC SD Sensitivity SD Specificity SD Accuracy SD Kappa SD
## 9e-05 0.003 8e-04 0.003 0.004
## 2e-04 0.003 8e-04 0.003 0.004
## logLoss SD AccuracyLower SD AccuracyUpper SD AccuracyPValue SD
## 0.001 0.004 0.002 0
## 0.002 0.004 0.003 0
## Pos_Pred_Value SD Neg_Pred_Value SD Detection_Rate SD
## 0.003 8e-04 6e-04
## 0.003 8e-04 6e-04
## Balanced_Accuracy SD
## 0.002
## 0.002
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
rfImp <- varImp(rfFit)
rfImp
## rf variable importance
##
## variables are sorted by average importance across the classes
## A B C D E
## pitch_forearm 100.00 60.32 65.17 82.24 56.81
## pitch_belt 40.74 82.04 48.39 38.86 31.93
## magnet_dumbbell_y 41.08 38.10 69.27 39.45 34.84
## yaw_belt 67.58 66.08 59.32 67.94 29.13
## roll_belt 25.33 50.80 50.10 42.92 62.14
## gyros_dumbbell_y 19.72 26.30 47.24 8.37 15.06
## gyros_arm_x 11.15 44.81 32.50 24.67 19.73
## magnet_arm_x 9.07 18.18 30.59 40.69 5.94
## magnet_dumbbell_x 25.08 30.58 36.24 19.95 36.17
## magnet_belt_z 14.81 35.81 17.18 28.91 32.79
## user_name 0.00 4.72 1.08 10.88 4.98
plot(rfFit)
# Look at performance
getTrainPerf(rfFit)
## TrainROC TrainSensitivity TrainSpecificity TrainAccuracy TrainKappa
## 1 0.9999 0.9904 0.9977 0.991 0.9886
## TrainlogLoss TrainAccuracyLower TrainAccuracyUpper TrainAccuracyPValue
## 1 0.0296 0.9858 0.9946 0
## TrainMcnemarPValue TrainPos_Pred_Value TrainNeg_Pred_Value
## 1 NaN 0.9906 0.9978
## TrainDetection_Rate TrainBalanced_Accuracy method
## 1 0.1982 0.9941 rf
confusionMatrix(rfFit)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction A B C D E
## A 28.3 0.1 0.0 0.1 0.0
## B 0.0 19.0 0.1 0.0 0.0
## C 0.0 0.3 17.3 0.1 0.0
## D 0.0 0.0 0.0 16.2 0.0
## E 0.0 0.0 0.0 0.0 18.3
The cross-validated accuracy is superb at over 99%.
The paper's authors presented a confusion matrix for the leave-one-subject-out test with an overall accuracy of 78.2%. Their reported overall accuracy (on all of the data) was 98.2%.
Now generate predictions on the test set and write to files in the answers directory for submission.
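A sketch of that step (the pml_write_files helper follows the course-provided template; the directory handling is an assumption):

# Predict the 20 test cases with the cross-validated random forest
answers <- as.character(predict(rfFit, test[, indVars]))
# Write one answer file per problem for submission
pml_write_files <- function(x, dir = "answers") {
  if (!dir.exists(dir)) dir.create(dir)
  for (i in seq_along(x)) {
    fname <- file.path(dir, paste0("problem_id_", i, ".txt"))
    write.table(x[i], file = fname, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(answers)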
20 out of 20 answers correct.
This is the end of the official submission. Below this is some supplemental analysis which I thought might be of interest, but should not be included in the grading.
Although accurate, the random forest model above is not very interpretable. Try a decision tree model here to see if we learn anything more about the data.
fitControlMulti <-
trainControl(method = "cv",
number = 10, # 10 is default
repeats = 1,
verboseIter = TRUE, # Debug, seems to be proving helpful
classProbs = TRUE, # Needed for twoClassSummary
summaryFunction = trainSummary,
selectionFunction = "best", # default, see ?best
allowParallel = usingParallel
)
rpartGrid <- expand.grid(.cp = c(0.02, 0.05))
set.seed(123)
rpartFit <- train(train[,indVars],
train[,depVar],
method = "rpart",
metric = trainMetric,
tuneGrid = rpartGrid,
#tuneLength = 1, # Run once for debugging
trControl = fitControlMulti
#trControl = fitControl
# Following arguments for rpart
)
## Warning: There were missing values in resampled performance measures.
## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.02 on full training set
## user system elapsed
## 1.92 0.13 43.81
## CART
##
## 19622 samples
## 11 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 17660, 17661, 17659, 17660, 17658, 17660, ...
##
## Resampling results across tuning parameters:
##
## cp ROC Sensitivity Specificity Accuracy Kappa logLoss
## 0.02 0.8 0.6 0.9 0.6 0.5 0.3
## 0.05 0.6 0.3 0.8 0.4 0.1 0.4
## AccuracyLower AccuracyUpper AccuracyPValue McnemarPValue
## 0.6 0.6 2e-141 2e-26
## 0.3 0.4 2e-13 NaN
## Pos_Pred_Value Neg_Pred_Value Detection_Rate Balanced_Accuracy
## 0.6 0.9 0.1 0.7
## NaN 0.9 0.07 0.6
## ROC SD Sensitivity SD Specificity SD Accuracy SD Kappa SD
## 0.02 0.02 0.005 0.02 0.03
## 0.005 0.005 0.001 0.005 0.007
## logLoss SD AccuracyLower SD AccuracyUpper SD AccuracyPValue SD
## 0.01 0.02 0.02 8e-141
## 0.003 0.005 0.005 5e-13
## McnemarPValue SD Pos_Pred_Value SD Neg_Pred_Value SD
## 6e-26 0.02 0.004
## NA NA 0.001
## Detection_Rate SD Balanced_Accuracy SD
## 0.004 0.01
## 0.001 0.003
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02.
## TrainROC TrainSensitivity TrainSpecificity TrainAccuracy TrainKappa
## 1 0.8419 0.5956 0.9005 0.6045 0.5002
## TrainlogLoss TrainAccuracyLower TrainAccuracyUpper TrainAccuracyPValue
## 1 0.3301 0.5825 0.6262 2.393e-141
## TrainMcnemarPValue TrainPos_Pred_Value TrainNeg_Pred_Value
## 1 2.006e-26 0.6361 0.9013
## TrainDetection_Rate TrainBalanced_Accuracy method
## 1 0.1209 0.748 rpart
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction A B C D E
## A 19.7 4.0 2.6 2.8 0.9
## B 1.1 9.0 0.6 0.4 1.1
## C 2.8 2.5 12.4 3.5 2.6
## D 3.7 3.7 1.7 8.5 2.9
## E 1.2 0.1 0.1 1.3 10.9
The accuracy is clearly inferior to the random forest, but for the smaller cp value (0.02) it is not bad at roughly 60%.
The tree was too large to display inline, but a PNG of it is available in the repository as CourseProjectTree.png.
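A sketch of how that PNG could be regenerated from the fitted model (the use of rpart.plot and the plot dimensions are assumptions; any rpart plotting function would do):

# Sketch: save a readable plot of the final rpart tree to a PNG file
library(rpart.plot)
png("CourseProjectTree.png", width = 1600, height = 1000)
prp(rpartFit$finalModel, type = 2, extra = 104, varlen = 0)
dev.off()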
Although mentioned above as outside the scope of this project, leave-one-subject-out cross-validation seemed like a useful technique to learn, so I decided to try implementing it here.
Henceforth called LOSOCV (also abbreviated LOSOXV).
The most straightforward way to do this appears to be via the index argument to trainControl: http://stats.stackexchange.com/questions/93227/how-to-implement-a-hold-out-validation-in-r
It is unclear to me whether or how the LGOCV method in trainControl relates to this.
Related reading: some discussion of leave-one-subject-out cross-validation, a survey of cross-validation procedures for model selection, and LOSOCV at Kaggle.
subject <- train[,"user_name"]
subjectFolds <- length(levels(subject))
# Create folds by subject (one fold per user_name level)
subjectIndexes <- list()
for (i in seq_along(levels(subject))) {
  subjectIndexes[[paste0("Fold", i)]] <- which(subject == levels(subject)[i])
}
str(subjectIndexes)
## List of 6
## $ Fold1: int [1:3892] 694 695 696 697 698 699 700 701 702 703 ...
## $ Fold2: int [1:3112] 1 2 3 4 5 6 7 8 9 10 ...
## $ Fold3: int [1:3536] 902 903 904 905 906 907 908 909 910 911 ...
## $ Fold4: int [1:3070] 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 ...
## $ Fold5: int [1:3402] 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 ...
## $ Fold6: int [1:2610] 166 167 168 169 170 171 172 173 174 175 ...
#summary(subjectIndexes)
#lapply(subjectIndexes, summary)
# Compare to basic createFolds
folds <- createFolds(train[,depVar], k=subjectFolds)
# subjectIndexes <- folds # This works fine
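# Note (not used below): in caret, trainControl(index = ...) lists the rows used
# for *training* in each resample and the complement is held out, so with
# subjectIndexes as built above each resample trains on a single subject and is
# evaluated on the other five (see the per-fold sample sizes in the output below).
# A sketch of the complementary setup, closer to the paper's LOSOCV
# (train on five subjects, hold one out):
losoIndexes <- lapply(subjectIndexes,
                      function(idx) setdiff(seq_len(nrow(train)), idx))
# These could be passed as trainControl(index = losoIndexes, indexOut = subjectIndexes)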
fitControlLOSOCV <-
trainControl(method = "cv",
number = subjectFolds,
repeats = 1,
verboseIter = TRUE, # Debug, seems to be proving helpful
classProbs = TRUE, # Needed for twoClassSummary
summaryFunction = trainSummary,
selectionFunction = "best", # default, see ?best
index = subjectIndexes,
allowParallel = usingParallel
)
rfGridLOSOCV <- expand.grid(.mtry = c(5))
set.seed(123)
rfFitLOSOCV <- train(train[,indVars],
train[,depVar],
method = "rf",
metric = trainMetric,
tuneGrid = rfGridLOSOCV,
#tuneLength = 1, # Run once for debugging
trControl = fitControlLOSOCV,
# Following arguments for rf
importance=TRUE
)
## Warning: There were missing values in resampled performance measures.
## Aggregating results
## Fitting final model on full training set
# Run time
rfFitLOSOCV$times$everything
## user system elapsed
## 45.60 0.67 94.07
rfFitLOSOCV
## Random Forest
##
## 19622 samples
## 11 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (6 fold)
##
## Summary of sample sizes: 3892, 3112, 3536, 3070, 3402, 2610, ...
##
## Resampling results
##
## ROC Sensitivity Specificity Accuracy Kappa logLoss AccuracyLower
## 0.6 0.2 0.8 0.2 0.05 0.8 0.2
## AccuracyUpper AccuracyPValue McnemarPValue Pos_Pred_Value
## 0.2 0.8 0 0.3
## Neg_Pred_Value Detection_Rate Balanced_Accuracy ROC SD
## 0.8 0.05 0.5 0.06
## Sensitivity SD Specificity SD Accuracy SD Kappa SD logLoss SD
## 0.04 0.01 0.05 0.05 0.2
## AccuracyLower SD AccuracyUpper SD AccuracyPValue SD McnemarPValue SD
## 0.05 0.05 0.4 0
## Pos_Pred_Value SD Neg_Pred_Value SD Detection_Rate SD
## 0.02 0.01 0.009
## Balanced_Accuracy SD
## 0.03
##
## Tuning parameter 'mtry' was held constant at a value of 5
##
rfImpLOSOCV <- varImp(rfFitLOSOCV)
rfImpLOSOCV
## rf variable importance
##
## variables are sorted by average importance across the classes
## A B C D E
## pitch_forearm 100.00 64.5 67.87 87.62 62.87
## pitch_belt 49.85 87.7 48.15 41.29 33.19
## magnet_dumbbell_y 41.19 44.4 84.01 41.77 38.89
## yaw_belt 72.74 74.7 63.99 75.29 35.51
## roll_belt 28.95 54.4 54.15 44.11 72.14
## gyros_dumbbell_y 15.68 25.9 52.34 7.26 12.66
## magnet_arm_x 8.93 19.6 33.90 44.24 7.13
## gyros_arm_x 11.42 43.6 34.83 27.99 18.45
## magnet_dumbbell_x 29.38 29.1 39.81 21.14 38.42
## magnet_belt_z 16.81 37.6 18.07 34.16 34.53
## user_name 0.00 6.7 2.56 10.82 6.74
# Look at performance
getTrainPerf(rfFitLOSOCV)
## TrainROC TrainSensitivity TrainSpecificity TrainAccuracy TrainKappa
## 1 0.5889 0.2371 0.8102 0.2344 0.04933
## TrainlogLoss TrainAccuracyLower TrainAccuracyUpper TrainAccuracyPValue
## 1 0.8474 0.228 0.241 0.8333
## TrainMcnemarPValue TrainPos_Pred_Value TrainNeg_Pred_Value
## 1 0 0.2852 0.8146
## TrainDetection_Rate TrainBalanced_Accuracy method
## 1 0.04689 0.5237 rf
confusionMatrix(rfFitLOSOCV)
## Cross-Validated (6 fold) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction A B C D E
## A 5.1 2.2 2.2 1.5 1.4
## B 4.5 4.3 2.5 3.0 2.6
## C 2.8 1.4 1.4 0.7 0.5
## D 2.0 2.1 1.7 1.8 3.0
## E 14.1 9.4 9.7 9.3 10.9
It turns out this model is virtually useless for predicting subjects not included in the training set!
Try a different set of variables (not including user_name) from CfsSubsetEval in Weka.
# Weka CfsSubsetEval came up with this list after removing a few, merit 0.266
# 3,4,5,61,109,115,118 : 7
# This is a subset of the original set so just try removing user_name from the original
wekaVars1 <- c("roll_belt", "pitch_belt", "yaw_belt",
"magnet_belt_z", "gyros_arm_x", "magnet_arm_x",
"gyros_dumbbell_y", "magnet_dumbbell_x", "magnet_dumbbell_y",
"pitch_forearm")
# Using 10-fold CV for CfsSubsetEval gave similar results
set.seed(123)
rfFitLOSOCV1 <- train(train[,wekaVars1],
train[,depVar],
method = "rf",
metric = trainMetric,
tuneGrid = rfGridLOSOCV,
#tuneLength = 1, # Run once for debugging
trControl = fitControlLOSOCV,
# Following arguments for rf
importance=TRUE
)
## Warning: There were missing values in resampled performance measures.
## Aggregating results
## Fitting final model on full training set
# Run time
rfFitLOSOCV1$times$everything
## user system elapsed
## 42.27 1.04 287.15
rfFitLOSOCV1
## Random Forest
##
## 19622 samples
## 10 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (6 fold)
##
## Summary of sample sizes: 3892, 3112, 3536, 3070, 3402, 2610, ...
##
## Resampling results
##
## ROC Sensitivity Specificity Accuracy Kappa logLoss AccuracyLower
## 0.6 0.2 0.8 0.2 0.04 0.9 0.2
## AccuracyUpper AccuracyPValue McnemarPValue Pos_Pred_Value
## 0.2 0.8 0 0.3
## Neg_Pred_Value Detection_Rate Balanced_Accuracy ROC SD
## 0.8 0.05 0.5 0.06
## Sensitivity SD Specificity SD Accuracy SD Kappa SD logLoss SD
## 0.05 0.01 0.06 0.06 0.2
## AccuracyLower SD AccuracyUpper SD AccuracyPValue SD McnemarPValue SD
## 0.06 0.06 0.4 0
## Pos_Pred_Value SD Neg_Pred_Value SD Detection_Rate SD
## 0.06 0.02 0.01
## Balanced_Accuracy SD
## 0.03
##
## Tuning parameter 'mtry' was held constant at a value of 5
##
rfImpLOSOCV1 <- varImp(rfFitLOSOCV1)
rfImpLOSOCV1
## rf variable importance
##
## variables are sorted by average importance across the classes
## A B C D E
## pitch_forearm 100.00 56.0 65.5 88.74 56.89
## pitch_belt 40.08 91.2 52.2 40.34 33.48
## magnet_dumbbell_y 42.13 50.6 88.7 45.84 46.03
## yaw_belt 75.51 79.8 66.0 80.52 36.36
## roll_belt 20.16 52.4 46.3 44.06 71.70
## gyros_arm_x 4.45 44.4 32.1 24.23 9.46
## gyros_dumbbell_y 11.47 24.0 43.1 4.19 9.91
## magnet_arm_x 1.45 11.5 26.8 41.97 0.00
## magnet_dumbbell_x 22.78 25.7 35.1 22.85 36.73
## magnet_belt_z 12.89 35.4 10.4 29.10 35.33
# Look at CV performance
getTrainPerf(rfFitLOSOCV1)
## TrainROC TrainSensitivity TrainSpecificity TrainAccuracy TrainKappa
## 1 0.5888 0.2328 0.8086 0.2269 0.04198
## TrainlogLoss TrainAccuracyLower TrainAccuracyUpper TrainAccuracyPValue
## 1 0.8779 0.2206 0.2334 0.8333
## TrainMcnemarPValue TrainPos_Pred_Value TrainNeg_Pred_Value
## 1 0 0.2545 0.8119
## TrainDetection_Rate TrainBalanced_Accuracy method
## 1 0.04539 0.5207 rf
confusionMatrix(rfFitLOSOCV1)
## Cross-Validated (6 fold) Confusion Matrix
##
## (entries are percentages of table totals)
##
## Reference
## Prediction A B C D E
## A 4.1 2.0 2.0 1.5 1.2
## B 4.5 4.3 2.6 2.9 2.7
## C 2.8 1.5 1.4 0.7 0.6
## D 2.0 2.1 1.7 1.9 2.9
## E 14.9 9.4 9.8 9.4 11.0
# Look at performance of the final model. This will overfit since it uses the entire
# training set to fit the model.
predLOSOCV1 <- predict(rfFitLOSOCV1)  # predictions on the full training data
confusionMatrix(train$classe, predLOSOCV1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 5580 0 0 0 0
## B 0 3797 0 0 0
## C 0 0 3422 0 0
## D 0 0 0 3216 0
## E 0 0 0 0 3607
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (1, 1)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.000 1.000 1.000 1.000 1.000
## Specificity 1.000 1.000 1.000 1.000 1.000
## Pos Pred Value 1.000 1.000 1.000 1.000 1.000
## Neg Pred Value 1.000 1.000 1.000 1.000 1.000
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.284 0.194 0.174 0.164 0.184
## Detection Prevalence 0.284 0.194 0.174 0.164 0.184
## Balanced Accuracy 1.000 1.000 1.000 1.000 1.000
This model is still virtually useless for predicting subjects outside of the training set. Note the difference between the CV accuracy (usually a good estimate of out-of-sample performance) and the accuracy on the entire training set (grossly overfit).
I am impressed by the 78.2% LOSOCV accuracy quoted in the paper. Working to replicate that result would be worthwhile, but I have already exceeded my time budget for this project. Section 5 (Detection of Mistakes) of the paper states that there were six subjects, so I do not think the authors used more subjects or data than were made available to us.