Before we start the lesson, please download the dataset.
Objective: Use the predictor variables in the HAR dataset to predict the activity of the person.
About the Experiment: The experiments were carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, 3-axial linear acceleration and 3-axial angular velocity were captured at a constant rate of 50Hz. The experiments were video-recorded so that the data could be labeled manually. The obtained dataset was randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% for the test data.
About the data set: For each record, the dataset provides:
- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration.
- Triaxial Angular velocity from the gyroscope.
- A 561-feature vector with time and frequency domain variables.
- Its activity label.
- An identifier of the subject who carried out the experiment.
Let us try to read the data from the files and create pandas dataframes.
Importing the features into an independent dataframe for further use.
import pandas as pd
features = pd.read_table('\\UCI HAR Dataset\\features.txt',
sep = ' ',
header=None,
names=('ID', 'Sensor'))
features.head()
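A small aside before using these names as column headers: features.txt may repeat some sensor names, and some pandas versions refuse duplicate entries in `names`. If pandas complains about duplicate names, one possible workaround is to make the names unique first. The helper below is a minimal sketch; `make_unique` and the sample list are hypothetical and not part of the dataset or of pandas.

```python
from collections import Counter

def make_unique(names):
    """Append a running suffix to repeated names so every entry is distinct."""
    totals = Counter(names)   # how often each name occurs overall
    seen = Counter()          # how often each name has been emitted so far
    unique = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            unique.append(f"{name}_{seen[name]}")
        else:
            unique.append(name)
    return unique

# Hypothetical sample; the real list would be features['Sensor'].tolist()
sample = ['tBodyAcc-mean()-X',
          'fBodyAcc-bandsEnergy()-1,8',
          'fBodyAcc-bandsEnergy()-1,8']
unique_names = make_unique(sample)
print(unique_names)
```

The result could then be passed as `names=make_unique(features['Sensor'].tolist())` when reading the X files.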
Reading X_train and y_train
X_train = pd.read_table('\\UCI HAR Dataset\\train\\X_train.txt',
sep=r'\s+', header=None,
names = features['Sensor'])
y_train = pd.read_table('\\UCI HAR Dataset\\train\\y_train.txt',
sep=' ', header=None,
names=['ActivityID'])
X_train.shape, y_train.shape
Let’s read all files in the test folder
X_test = pd.read_table('\\UCI HAR Dataset\\test\\X_test.txt',
sep=r'\s+', header=None,
names = features['Sensor']) # takes the sensor names from the features dataframe and uses them as column names
# X_test needs a regular expression, r'\s+', as the separator,
# because fields are sometimes separated by more than one blank
y_test = pd.read_table('\\UCI HAR Dataset\\test\\y_test.txt',
sep=' ', header=None,
names=['ActivityID'])
X_test.shape, y_test.shape
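To see why a whitespace regular expression is needed as the separator, here is a minimal sketch on a made-up two-row sample with irregular spacing. The sample text and the column names f1 to f3 are assumptions for illustration only; they do not come from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample with irregular spacing, standing in for X_test.txt
raw = "  0.25  -0.10   0.98\n -0.31   0.44  -0.07\n"

# r'\s+' treats any run of blanks as a single separator
demo = pd.read_csv(io.StringIO(raw), sep=r'\s+', header=None,
                   names=['f1', 'f2', 'f3'])
print(demo.shape)
print(demo)
```

With a single blank as the separator, every extra blank would be read as an empty field, which is why the plain `sep=' '` only suits the tidier files such as y_test.txt.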
We see that the training set has 7352 observations and the test set has 2947 observations. The number of features in both the training and testing datasets is 561. The data is already divided into training and testing sets, with separate feature and label dataframes. So we are good to go and create models.
Modeling the data
As the data is already normalised and scaled, we do not need any preprocessing and can use the data directly to train models. We will train the data on these classification algorithms:
- RandomForestClassifier
- Support Vector Machine – Classifier (SVM)
- Neural Networks
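Before skipping preprocessing, it may be worth confirming that the features really are scaled. The sketch below checks the overall value range of a dataframe; `value_range` and the small demo frame are hypothetical, and on the real data one would call it on X_train and X_test.

```python
import pandas as pd

def value_range(df):
    """Return the overall minimum and maximum across all columns."""
    return float(df.min().min()), float(df.max().max())

# Hypothetical stand-in for X_train
demo = pd.DataFrame({'a': [-1.0, 0.2, 0.9], 'b': [0.0, -0.5, 1.0]})
lo, hi = value_range(demo)
in_bounds = lo >= -1 and hi <= 1   # True when every value lies in [-1, 1]
print(lo, hi, in_bounds)
```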
1. RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
Selecting the model with the parameters
clf1 = RandomForestClassifier(n_estimators=13, max_features='log2')
Training the classifier
clf1.fit(X_train, y_train['ActivityID'])
Training accuracy:
tr_score1 = clf1.score(X_train, y_train)
tr_score1
Test set accuracy:
ts_score1 = clf1.score(X_test, y_test)
ts_score1
The random forest model (13 trees, as set by n_estimators above) gives us an accuracy of around 91-92% on the testing set.
Predicting the classes on training and test set:
tr_predict1 = clf1.predict(X_train) #Prediction on training data
ts_predict1 = clf1.predict(X_test) #Prediction on testing data
Model Evaluation
Creating the confusion matrices:
from sklearn.metrics import confusion_matrix
tr_cm1 = confusion_matrix(y_train, tr_predict1) #confusion matrix for predictions on training
ts_cm1 = confusion_matrix(y_test, ts_predict1) #confusion matrix for predictions on test
Plotting the matrix
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sn
We will convert the matrices into dataframes labeled with the name of each class. Before that, let's take care of the index and columns for the plots that we are about to draw.
labels = {1:'WALKING', 2:'WALKING UPSTAIRS', 3:'WALKING DOWNSTAIRS',
4:'SITTING', 5:'STANDING', 6:'LAYING'}
index = [i for i in labels.values()]
columns = [i for i in labels.values()]
pl_tr_cm1 = pd.DataFrame(tr_cm1, index = index, columns = columns)
pl_ts_cm1 = pd.DataFrame(ts_cm1, index = index, columns = columns)
Code to plot the Confusion matrix:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize= (15,6), sharey=True)
sn.heatmap(pl_tr_cm1, annot=True, fmt = 'g', ax=ax1).set_title('On Training Data') #fmt = 'g' will allow us to plot 3digit numbers
sn.heatmap(pl_ts_cm1, annot=True, fmt = 'g', ax=ax2).set_title('On Testing Data')
The matrix helps us understand that most of the errors made by the model occur when deciding between "Sitting" and "Standing". Another problem cluster is "Walking / Walking Upstairs / Walking Downstairs".
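Reading the heatmap by eye works, but the most confused class pairs can also be pulled out of the matrix directly. Below is a minimal sketch; `top_confusions` and the 3-class demo matrix are hypothetical, and the rows are assumed to be true classes with columns as predictions, as in scikit-learn's confusion_matrix.

```python
import numpy as np

def top_confusions(cm, names, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    off = np.array(cm)                    # copy, so the input is not modified
    np.fill_diagonal(off, 0)              # ignore correct predictions
    order = np.argsort(off, axis=None)[::-1][:k]   # flat indices, largest first
    rows, cols = np.unravel_index(order, off.shape)
    return [(names[r], names[c], int(off[r, c])) for r, c in zip(rows, cols)]

# Hypothetical 3-class matrix; rows are true classes, columns are predictions
demo_cm = [[50, 2, 0],
           [1, 40, 9],
           [0, 12, 38]]
demo_names = ['WALKING', 'SITTING', 'STANDING']
pairs = top_confusions(demo_cm, demo_names, k=2)
print(pairs)
```

On the real matrices one could call `top_confusions(ts_cm1, columns)` to list the worst pairs on the test set.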
2. Support Vector Classifier
from sklearn import svm
clf2 = svm.SVC(decision_function_shape='ovo')
clf2.fit(X_train, y_train['ActivityID'])
The indices of the support vectors and the number of support vectors:
clf2.support_
len(clf2.support_)
Accuracy on the training set:
tr_score2 = clf2.score(X_train, y_train)
tr_score2
Accuracy on the testing set:
ts_score2 = clf2.score(X_test, y_test)
ts_score2
Predicting the classes on training and test set:
tr_predict2 = clf2.predict(X_train) # on training set
ts_predict2 = clf2.predict(X_test) # on testing set
Model Evaluation
Creating confusion matrices for the model’s performance on training and testing data
tr_cm2 = confusion_matrix(y_train, tr_predict2) #confusion matrix for predictions on training
ts_cm2 = confusion_matrix(y_test, ts_predict2) #confusion matrix for predictions on test
Labeling the matrices:
pl_tr_cm2 = pd.DataFrame(tr_cm2, index = index, columns = columns)
pl_ts_cm2 = pd.DataFrame(ts_cm2, index = index, columns = columns)
Code to plot the matrices:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize= (15,6), sharey=True)
sn.heatmap(pl_tr_cm2, annot=True, fmt = 'g', ax=ax1).set_title('On Training Data') #fmt = 'g' will allow us to plot 3digit numbers
sn.heatmap(pl_ts_cm2, annot=True, fmt = 'g', ax=ax2).set_title('On Testing Data')
Again we can see that SVM also makes errors when predicting between "Sitting" and "Standing", and between "Walking / Walking Upstairs / Walking Downstairs", just like RFC.
3. Neural Networks
We will train the data with a feed-forward multilayer perceptron network, using the neurolab library function nl.net.newff.
Importing the modules
import neurolab as nl
import pylab as pl
We need to create a feed-forward network whose layers are randomly initialized:
net = nl.net.newff([[-1, 1]]*561, [50, 1])
Here the first parameter says that there are 561 features, each with values in [-1, 1]. In the second parameter, 50 is the number of hidden nodes and 1 is the single output node.
Preparing the training samples for the network: inp is the input data for the network and tar is the target values.
import numpy as np
inp = np.array(X_train) #converting the dataframe into numpy array
The target should be in [[ ],[ ],[ ]…[ ],[ ]] form (one row per sample). This is how the array can be converted into the desired format:
y = np.array(y_train['ActivityID'])
tar = np.array([y[i:i+1] for i in range(0, len(y), 1)]) # target
Training the network
Training is a bit different for neural networks: 1) the input variables pass through the network and we get an error value for the first iteration. 2) After the network is optimized for the error of the last iteration, the input variables pass through the network again. 3) This process keeps going until the error value is minimized.
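The three steps above can be sketched as a plain loop. This is not neurolab's training algorithm: to stay short it uses a linear model with gradient descent instead of a multilayer perceptron, and `train_linear`, the learning rate `lr` and the toy data are all assumptions made for illustration. It only shows how an epoch limit and an error goal interact.

```python
import numpy as np

def train_linear(inp, tar, epochs=500, goal=0.01, lr=0.1):
    """Gradient descent on a linear model; returns weights and per-epoch error."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=(inp.shape[1], 1))  # random initial weights
    errors = []
    for _ in range(epochs):
        out = inp @ w                        # 1) pass the inputs through the model
        diff = out - tar
        err = float(np.mean(diff ** 2))      #    mean squared error for this epoch
        errors.append(err)
        if err <= goal:                      # 3) stop once the error goal is reached
            break
        grad = 2 * inp.T @ diff / len(inp)   # 2) adjust the weights against the error
        w -= lr * grad
    return w, errors

# Hypothetical toy data: 100 samples, 3 features, exactly linear target
rng = np.random.default_rng(1)
toy_inp = rng.uniform(-1, 1, size=(100, 3))
toy_tar = toy_inp @ np.array([[0.5], [-1.0], [2.0]])
_, toy_errors = train_linear(toy_inp, toy_tar)
print(len(toy_errors), toy_errors[-1])
```

The returned error list plays the same role as the list that gets plotted against the number of epochs.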
The function below will iterate the network up to 500 times (epochs=500); goal is the error value at which training stops.
error = []
error.append(net.train(inp, tar, epochs=500, goal=0.01)) # pass tar, the column-shaped target built above
Memory Error; Memory Error; Memory Error
While training, we can see that training a neural network consumes a lot of memory. To overcome this problem we might need the help of distributed computing.
Below are some next steps for the neural network that can be used to get results once the memory error is resolved:
Plotting epochs vs. error: we can use this plot to choose the number of training epochs and reduce training time.
pl.figure(1)
pl.plot(error[0])
pl.xlabel('Number of epochs')
pl.ylabel('Training error')
pl.grid()
pl.show()
Simulating the network (predicting)
predicted_values = net.sim(np.array(X_test)) # pass a NumPy array, as was done for training
Converting predicted values into classes by using thresholds, starting from predicted_class = predicted_values:
- predicted_class = 1 where 0.5 < predicted_values < 1.5
- predicted_class = 2 where 1.5 < predicted_values < 2.5
- predicted_class = 3 where 2.5 < predicted_values < 3.5
- predicted_class = 4 where 3.5 < predicted_values < 4.5
- predicted_class = 5 where 4.5 < predicted_values < 5.5
- predicted_class = 6 where 5.5 < predicted_values < 6.5
With the network that we have trained, we might not need thresholds to define the classes.
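The thresholds above amount to rounding each output to the nearest class ID. A minimal sketch, assuming the six activity IDs 1 to 6 and a single output column; `to_classes` and the raw outputs are hypothetical. Values outside the range are clipped to the nearest valid class, which is an extra assumption the threshold list does not state.

```python
import numpy as np

def to_classes(predicted_values, n_classes=6):
    """Round continuous outputs to the nearest class ID in 1..n_classes."""
    rounded = np.rint(np.asarray(predicted_values, dtype=float))
    return np.clip(rounded, 1, n_classes).astype(int).ravel()

# Hypothetical raw outputs shaped like a single-output network result
raw_out = np.array([[0.7], [1.6], [3.4], [5.9], [7.2], [-0.3]])
predicted_class = to_classes(raw_out)
print(predicted_class)
```

Note that np.rint sends exact halves such as 2.5 to the nearest even integer, whereas the strict inequalities in the list leave those boundary values undefined.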
#predicted classes
predicted_class
Model validation
Creating confusion matrices
ConfusionMatrix = confusion_matrix(y_test, predicted_class) # confusion_matrix was imported from sklearn.metrics earlier
print(ConfusionMatrix)
Model Accuracy
accuracy = np.trace(ConfusionMatrix) / ConfusionMatrix.sum() # correct predictions of all six classes lie on the diagonal
print(accuracy)
Model Error
error=1-accuracy
print(error)
Due to computational issues with memory we were not able to get the results based on Neural Network Training.
Conclusion:
Out of our three approaches, we were able to create models with the Random Forest and Support Vector Machine techniques.
From the models that we have created, the confusion matrices show that the models are not that good at distinguishing between "Sitting" and "Standing", and between the "Walking / Walking Upstairs / Walking Downstairs" classes.
This means there is a huge opportunity to improve our models.
The accuracy of the models on the training data is quite high. This kind of accuracy is possible in the cases below:
- Overfitting issue.
- The features are highly correlated.
We will try to validate our models with 10-fold cross-validation and bootstrapping.
Validating our models
K-fold cross-validation on the training data
Merge X_train and y_train to create a single dataframe called train.
train = pd.concat([X_train, y_train], axis=1)
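Importing cross-validation from the scikit-learn module and selecting 10 folds:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it.
# The alias keeps the cross_validation name used in the calls that follow.
from sklearn import model_selection as cross_validation
kfold = cross_validation.KFold(n_splits=10)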
Selecting the models for cross validation
model1 = RandomForestClassifier(n_estimators=20) #Random forest model
model2 = svm.SVC(decision_function_shape='ovo')
We will perform 10-fold cross-validation and calculate the mean of the scores over all the folds. kf_score1 is the mean accuracy over the folds for the random forest model, model1. kf_score2 is the mean accuracy over the folds for the Support Vector Machine classification model, model2.
10-fold cross-validation score with Random Forest:
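To make the fold mechanics concrete, here is a minimal sketch of how consecutive, unshuffled folds can be built by hand. `kfold_indices` is a hypothetical helper, not scikit-learn's implementation; each sample lands in exactly one held-out fold, and the remaining samples form the training part for that fold.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=10):
    """Yield (train_idx, test_idx) pairs for consecutive, unshuffled folds."""
    indices = np.arange(n_samples)
    for test_idx in np.array_split(indices, n_folds):
        train_idx = np.setdiff1d(indices, test_idx)   # everything not held out
        yield train_idx, test_idx

# Hypothetical small example: 25 samples split into 10 folds
folds = list(kfold_indices(25, n_folds=10))
held_out = np.concatenate([test for _, test in folds])
print(len(folds), [len(test) for _, test in folds])
```

The cross-validation score is then the mean of the accuracies obtained by training on each train part and scoring on the matching held-out part.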
kf_score1 = cross_validation.cross_val_score(model1, X_train, y_train['ActivityID'],cv=kfold).mean()
kf_score1
Something seems to be different: our Random Forest model was giving a score of nearly 99% on the training data, and the same kind of model gives a score of about 90% under cross-validation. So our suspicion about the Random Forest model overfitting was true. In this kind of situation we can play with the algorithm's parameters and tune them a bit to generalize the model. The parameters are explained in the scikit-learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
10-fold cross-validation score using SVM classifier:
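One way to "play with the parameters" systematically is a small grid search scored by cross-validation. The sketch below uses scikit-learn's GridSearchCV on synthetic data; the data and the grid values are hypothetical stand-ins chosen to run quickly, not tuned recommendations for the HAR dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (not the HAR data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# Hypothetical grid; the values are illustrative only
param_grid = {'n_estimators': [10, 30], 'max_depth': [3, None]}

# Every parameter combination is scored with 5-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Because the score comes from held-out folds, a combination that merely memorizes the training data is not rewarded.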
kf_score2 = cross_validation.cross_val_score(model2, X_train, y_train['ActivityID'],cv=kfold).mean()
kf_score2
The computation took a bit longer than for Random Forest, but it was worth it. We can see with cross-validation that the accuracy of the SVM model is about the same as when using the whole training data, meaning our SVM model is not overfitting.
Bootstrap validation:
Let us look into alternatives (variations, to be more precise) to 10-fold cross-validation. Bootstrapping is a way to quantify the uncertainty in your model, while cross-validation is used for model selection and for measuring predictive accuracy. Bootstrapping means resampling your data randomly, so the samples that we take may not hit all the points.
We will use scikit-learn's ShuffleSplit to approximate bootstrapping, as a direct bootstrapping function is no longer available in the scikit-learn module. Note that ShuffleSplit draws each split without replacement, so this is repeated random sub-sampling rather than a true bootstrap.
from sklearn.model_selection import ShuffleSplit
bootstrap = ShuffleSplit(n_splits=10, test_size=0.1)
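For comparison, a true bootstrap draws the training indices with replacement and evaluates on the points that were never drawn (the out-of-bag points). A minimal NumPy sketch follows; `bootstrap_split` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample; return in-bag and out-of-bag indices."""
    in_bag = rng.integers(0, n_samples, size=n_samples)       # with replacement
    out_of_bag = np.setdiff1d(np.arange(n_samples), in_bag)   # never drawn
    return in_bag, out_of_bag

rng = np.random.default_rng(0)
in_bag, out_of_bag = bootstrap_split(1000, rng)
print(len(np.unique(in_bag)), len(out_of_bag))
```

On average roughly 63% of the points end up in the bag at least once, so the out-of-bag set is larger than the 10% held out by the split used in this post.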
Bootstrap validation using Random Forest:
bs_score1 = cross_validation.cross_val_score(model1, X_train, y_train['ActivityID'],cv = bootstrap).mean()
bs_score1
Bootstrap validation tells us the same thing: the RF model is overfitted!
Bootstrap validation using SVM classifier:
bs_score2 = cross_validation.cross_val_score(model2, X_train, y_train['ActivityID'],cv = bootstrap).mean()
bs_score2
Well, bootstrap validation also indicates that the SVM model should be our preferred model, as long as computation cost and time are not our priority!
Principal Component Analysis:
So far, from the kind of models that we have created, we can see that there is scope for improvement. How do we go about it? Let us think… OK wait, we have 561 features working together to train the models, and assuming that all of them contribute equally doesn't seem fair to me. Can we reduce the dimension of the feature set and keep only the few features that matter more to our model than the others, so that the computation cost is not that high?
Turns out that PCA (Principal Component Analysis) is a thing we can use!
Principal components analysis is a procedure for identifying a smaller number of uncorrelated variables, called ‘principal components’, from a large set of data. The goal of principal components analysis is to explain the maximum amount of variance with the fewest number of principal components.
Boring, wasn't it?
Got something cool for you, hope this makes more sense: http://setosa.io/ev/principal-component-analysis/
You're welcome!
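If you prefer code to definitions, the idea of "explaining the maximum variance with the fewest components" can be sketched with NumPy alone. This is a rough illustration, not scikit-learn's implementation; `n_components_for` and the demo data are hypothetical. The squared singular values of the centered data are proportional to the variance carried by each component.

```python
import numpy as np

def n_components_for(X, variance_pct=0.99):
    """Smallest number of principal components reaching variance_pct."""
    centered = X - X.mean(axis=0)                  # PCA works on centered data
    s = np.linalg.svd(centered, compute_uv=False)  # singular values, descending
    explained = s ** 2 / np.sum(s ** 2)            # variance share per component
    cumulative = np.cumsum(explained)
    return int(np.searchsorted(cumulative, variance_pct) + 1)

# Hypothetical data: 5 columns, but only 2 independent directions
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X_demo = np.hstack([base, base @ rng.normal(size=(2, 3))])
k = n_components_for(X_demo)
print(k)
```

Here three of the five columns are linear combinations of the first two, so at most two components are needed, which is the kind of redundancy PCA removes from correlated features.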
We will perform PCA on the training data and check the accuracy score and confusion matrix to see if PCA is doing something for us. The code below gives us a dataframe 'X_train_pca' containing the PCA-transformed components that together explain 99% of the variance:
from sklearn.decomposition import PCA
# Minimum percentage of variance we want to be described by the resulting transformed components
variance_pct = .99
# Create PCA object & Transform the initial features
X_transformed = PCA(n_components=variance_pct).fit_transform(X_train)
X_train_pca = pd.DataFrame(X_transformed)
print(X_transformed.shape[1], "components describe", f"{variance_pct:.0%}", "of the variance")
X_train_pca.shape, y_train.shape
SVM accuracy after PCA:
clf_pca = svm.SVC(decision_function_shape='ovo')
clf_pca.fit(X_train_pca, y_train['ActivityID'])
tr_score_pca = clf_pca.score(X_train_pca, y_train)
tr_score_pca
Predicting the labels using the model that we have trained on the training set:
tr_pred_pca = clf_pca.predict(X_train_pca) # on training set
Plotting Confusion matrix:
tr_cm_pca = confusion_matrix(y_train, tr_pred_pca) #confusion matrix for predictions on training
pl_tr_cm_pca = pd.DataFrame(tr_cm_pca, index = index, columns = columns)
fig, ax = plt.subplots(figsize=(15, 6)) # plt.plot does not create a sized figure; plt.subplots does
sn.heatmap(pl_tr_cm_pca, annot=True, fmt = 'g', ax=ax).set_title('After PCA SVM on training set')
Good stuff, at least we can see that after PCA the model is able to classify "Walking / Walking Upstairs / Walking Downstairs" on the training set better than our previous models!
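The matrix above is computed on the training set only. To score the test set, the PCA fitted on the training data has to be reused on the test data rather than fitted again. One way to keep that straight is a scikit-learn pipeline. The sketch below runs on synthetic stand-in data, so its scores say nothing about the HAR results.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the HAR features and labels
X_demo, y_demo = make_classification(n_samples=400, n_features=30,
                                     n_informative=8, n_classes=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# PCA is fitted on the training split only, then reused for the test split
pipe = make_pipeline(PCA(n_components=0.99),
                     SVC(decision_function_shape='ovo'))
pipe.fit(X_tr, y_tr)
train_acc = pipe.score(X_tr, y_tr)
test_acc = pipe.score(X_te, y_te)
print(round(train_acc, 3), round(test_acc, 3))
```

With the real data, the same pipeline could be fitted on X_train and scored on X_test to see whether the improvement holds outside the training set.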
Summary:
- We went through a simple pipeline for a small data science problem:
- Identifying and understanding the kind of problem.
- Converting the raw data into a tidy version of the same.
- Working with Random Forest, SVM and Neural Network classifiers in Python.
- Understanding cross-validation.
- Performing dimension reduction using Principal Component Analysis.
Results:
- Random Forest, which is often said to be among the models least prone to overfitting (I am not sure, but I have read it in many places), can still give an overfitted model with default parameters.
- SVM takes a lot more time to compute than Random Forest.
- Neural networks are heavy to train and consume a lot of memory and time.
- For a given kind of problem (classification in this case), not all algorithms perform alike.
- PCA for dimension reduction, which is predominantly used with image data, can also be useful with sensor data.
Final Thoughts:
This case study was probably a bit long! Hopefully it will provide some assistance to people who are getting started with scikit-learn and could use a little guidance on the basics of machine learning.


