Before we start the lesson, please download the dataset.

Objective: Use the predictor variables in the HAR dataset to predict the activity a person is performing.

About the Experiment: The experiments were carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, 3-axial linear acceleration and 3-axial angular velocity were captured at a constant rate of 50Hz. The experiments were video-recorded so that the data could be labelled manually. The resulting dataset was randomly partitioned into two sets: 70% of the volunteers were selected for generating the training data and 30% for the test data.

About the data set: Each record in the dataset provides:

- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration.
- Triaxial angular velocity from the gyroscope.
- A 561-feature vector with time and frequency domain variables.
- Its activity label.
- An identifier of the subject who carried out the experiment.

Let us try to read the data from the files and create pandas dataframes.

Importing the features into an independent dataframe for further use.

In [1]:

import pandas as pd
features = pd.read_table('\\UCI HAR Dataset\\features.txt',
                         sep = ' ',
                         header=None,
                         names=('ID', 'Sensor'))

In [2]:

features.head()

Out[2]:

	ID	Sensor
0	1	tBodyAcc-mean()-X
1	2	tBodyAcc-mean()-Y
2	3	tBodyAcc-mean()-Z
3	4	tBodyAcc-std()-X
4	5	tBodyAcc-std()-Y

Reading X_train and y_train

In [3]:

X_train = pd.read_table('\\UCI HAR Dataset\\train\\X_train.txt',
                      sep=r'\s+', header=None,  # regex: one or more whitespace characters
                      names = features['Sensor'])
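
Depending on the pandas version, passing column names that contain repeats may raise an error, and the HAR feature list is reported to contain some repeated names. If that happens, the names can be made unique before they are handed to the reader. This is a minimal sketch; `make_unique` is a hypothetical helper, not part of pandas or of the original notebook, and the sample names are only illustrative.

```python
from collections import Counter

def make_unique(names):
    """Append a running suffix to repeated names so every name is distinct."""
    seen = Counter()
    unique = []
    for name in names:
        seen[name] += 1
        # the first occurrence keeps its name; later ones get _2, _3, ...
        unique.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return unique

sample_names = ["fBodyAcc-bandsEnergy()-1,8",
                "tBodyAcc-mean()-X",
                "fBodyAcc-bandsEnergy()-1,8"]
print(make_unique(sample_names))
```

With this helper, the call would become names=make_unique(features['Sensor']).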

In [4]:

y_train = pd.read_table('\\UCI HAR Dataset\\train\\y_train.txt',
                      sep=' ', header=None,
                      names=['ActivityID'])

In [5]:

X_train.shape, y_train.shape

Out[5]:

((7352, 561), (7352, 1))

Let’s read all files in the test folder

In [6]:

X_test = pd.read_table('\\UCI HAR Dataset\\test\\X_test.txt',
                      sep=r'\s+', header=None,
                      names = features['Sensor'])  # use the sensor names from the features dataframe as column names
# X_test needs a regular expression, r'\s+', as the separator
# because values are sometimes separated by more than one blank
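
To see why a single blank does not work as a separator here, the sketch below parses two made-up rows written in the same style as the X files (leading blanks and a varying number of blanks between values). The sample text is an assumption for illustration, not real HAR data.

```python
import io
import pandas as pd

# Two made-up rows: leading blanks and one or two blanks between values.
sample = "  2.5e-001 -1.0e-002\n  3.0e-001  4.0e-002\n"

# Splitting the first row on a single blank leaves empty strings behind.
naive_fields = sample.splitlines()[0].split(" ")

# r"\s+" treats any run of whitespace as one separator.
parsed = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

print(naive_fields)
print(parsed.shape)
```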

In [7]:

y_test = pd.read_table('\\UCI HAR Dataset\\test\\y_test.txt',
                      sep=' ', header=None,
                      names=['ActivityID'])

In [8]:

X_test.shape, y_test.shape

Out[8]:

((2947, 561), (2947, 1))

We see that the training set has 7352 observations and the test set has 2947 observations. Both have 561 features. The data is already divided into training and test sets, each with a features dataframe and a label dataframe, so we are good to go and create models.

Modeling the data

As the data is already normalised and scaled, we do not need any preprocessing and can use it directly to train models. We will train these classification algorithms on the data:

RandomForestClassifier
Support Vector Machine – Classifier (SVM)
Neural Networks

1. RandomForestClassifier

In [9]:

from sklearn.ensemble import RandomForestClassifier

Selecting the model with the parameters

In [10]:

clf1 = RandomForestClassifier(n_estimators=13, max_features='log2')

Training the classifier

In [11]:

clf1.fit(X_train, y_train['ActivityID'])

Out[11]:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='log2', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=13, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Training accuracy:

In [12]:

tr_score1 = clf1.score(X_train, y_train)

In [13]:

tr_score1

Out[13]:

0.99972796517954299

Test set accuracy:

In [14]:

ts_score1 = clf1.score(X_test, y_test)

In [15]:

ts_score1

Out[15]:

0.91245334238208342

The Random Forest model with 13 trees gives an accuracy of around 91-92% on the test set.

Predicting the classes on training and test set:

In [16]:

tr_predict1 = clf1.predict(X_train)  #Prediction on training data
ts_predict1 = clf1.predict(X_test)   #Prediction on testing data

Model Evaluation

Creating the confusion matrices:

In [17]:

from sklearn.metrics import confusion_matrix
tr_cm1 = confusion_matrix(y_train, tr_predict1) #confusion matrix for predictions on training
ts_cm1 = confusion_matrix(y_test, ts_predict1) #confusion matrix for predictions on test

Plotting the matrix

In [18]:

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sn

We will convert the matrices into dataframes labelled with the name of each class. Before that, let’s take care of the index and columns for the plots that we are about to draw.

In [19]:

labels = {1:'WALKING', 2:'WALKING UPSTAIRS', 3:'WALKING DOWNSTAIRS',
          4:'SITTING', 5:'STANDING', 6:'LAYING'}
# order the names by ActivityID so they line up with the confusion matrix rows
index = [labels[k] for k in sorted(labels)]
columns = list(index)

In [20]:

pl_tr_cm1 = pd.DataFrame(tr_cm1, index = index, columns = columns)
pl_ts_cm1 = pd.DataFrame(ts_cm1, index = index, columns = columns)

Code to plot the Confusion matrix:

In [21]:

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize= (15,6), sharey=True)
sn.heatmap(pl_tr_cm1, annot=True, fmt = 'g', ax=ax1).set_title('On Training Data')    # fmt='g' prints counts of 3 or more digits in full instead of scientific notation
sn.heatmap(pl_ts_cm1, annot=True, fmt = 'g', ax=ax2).set_title('On Testing Data')

Out[21]:

<matplotlib.text.Text at 0xae8f630>

The matrix shows that most of the model's errors come from confusing “Sitting” with “Standing”. A second cluster of errors is among “Walking”, “Walking Upstairs” and “Walking Downstairs”.
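
The same reading can be done numerically instead of by eye. The sketch below uses a made-up 3-class confusion matrix (not the HAR results) to compute per-class recall and to locate the most frequent confusion. With scikit-learn's layout, rows are true classes and columns are predicted classes.

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predictions.
cm_example = np.array([[50, 2, 0],
                       [5, 40, 5],
                       [0, 10, 40]])

# Recall per class: correct predictions (diagonal) over all true members (row sums).
recall = cm_example.diagonal() / cm_example.sum(axis=1)

# Largest off-diagonal cell: the most frequent confusion.
off_diag = cm_example.copy()
np.fill_diagonal(off_diag, 0)
true_cls, pred_cls = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(recall)
print(true_cls, pred_cls)
```

Applied to ts_cm1, the same steps would point at the row and column of the largest misclassification count.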

2. Support Vector Classifier

In [22]:

from sklearn import svm

In [23]:

clf2 = svm.SVC(decision_function_shape='ovo')
clf2.fit(X_train, y_train['ActivityID'])

Out[23]:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

The indices of the support vectors and the number of support vectors:

In [24]:

clf2.support_

Out[24]:

array([  78,   86,  105, ..., 7063, 7065, 7247])

In [25]:

len(clf2.support_)

Out[25]:

Accuracy on the training set:

In [26]:

tr_score2 = clf2.score(X_train, y_train)
tr_score2

Out[26]:

0.95497823721436348

Accuracy on the testing set:

In [27]:

ts_score2 = clf2.score(X_test, y_test)
ts_score2

Out[27]:

0.94197488971835763

Predicting the classes on training and test set:

In [28]:

tr_predict2 = clf2.predict(X_train) # on training set
ts_predict2 = clf2.predict(X_test) # on testing set

Model Evaluation

Creating confusion matrices for the model’s performance on training and testing data

In [29]:

tr_cm2 = confusion_matrix(y_train, tr_predict2) #confusion matrix for predictions on training
ts_cm2 = confusion_matrix(y_test, ts_predict2) #confusion matrix for predictions on test

Labeling the matrices:

In [30]:

pl_tr_cm2 = pd.DataFrame(tr_cm2, index = index, columns = columns)
pl_ts_cm2 = pd.DataFrame(ts_cm2, index = index, columns = columns)

Code to plot the matrices:

In [31]:

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize= (15,6), sharey=True)
sn.heatmap(pl_tr_cm2, annot=True, fmt = 'g', ax=ax1).set_title('On Training Data')    # fmt='g' prints counts of 3 or more digits in full instead of scientific notation
sn.heatmap(pl_ts_cm2, annot=True, fmt = 'g', ax=ax2).set_title('On Testing Data')

Out[31]:

<matplotlib.text.Text at 0x82d0cf8>

Again we can see that the SVM also makes errors between “Sitting” and “Standing”, and among “Walking”, “Walking Upstairs” and “Walking Downstairs”, just like the Random Forest classifier.

3. Neural Networks

We will train a feed-forward multilayer perceptron network on the data using the neurolab library function nl.net.newff.

Importing the modules

In [32]:

import neurolab as nl
import pylab as pl

We need to create a feed-forward network with one input per feature, a hidden layer and randomly initialized weights:

In [77]:

net = nl.net.newff([[-1, 1]]*561, [50, 1])

Here the first parameter says that there are 561 features, each with a value in [-1, 1]. The second parameter sets the number of hidden nodes to 50, and the 1 represents the single output node.

Preparing the training samples for the network: inp is the input data for the network and y or tar holds the target values.

In [83]:

import numpy as np
inp = np.array(X_train) #converting the dataframe into numpy array

The target should be in [[ ], [ ], … [ ]] form, with one row per sample. This is how the array can be converted to the desired format:

In [92]:

y = np.array(y_train['ActivityID'])
tar = np.array([y[i:i+1] for i in range(0, len(y), 1)]) # target
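
The slicing comprehension above produces a column array with one row per sample. NumPy's reshape gives the same result in one call. The sketch below uses a few made-up activity IDs to show the two forms agree.

```python
import numpy as np

y_demo = np.array([5, 5, 4, 1])  # made-up activity IDs

# Slicing approach used above: one length-1 slice per sample.
tar_slices = np.array([y_demo[i:i+1] for i in range(0, len(y_demo), 1)])

# Equivalent vectorised form: -1 lets NumPy infer the number of rows.
tar_reshaped = y_demo.reshape(-1, 1)

print(tar_reshaped.shape)
print(np.array_equal(tar_slices, tar_reshaped))
```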

Training the network

Training works a bit differently for neural networks: 1) the input variables pass through the network and we get an error value for the first iteration; 2) the network's weights are adjusted to reduce that error, and the input variables pass through the network again; 3) this process keeps going until the error reaches the goal or the epoch limit is hit.

The call below trains the network for up to 500 epochs; goal is the error value at which training is considered done.
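
The three steps can be sketched in a few lines of NumPy. This is a deliberately simplified stand-in: a single linear layer trained by plain gradient descent on made-up data, not the optimiser that neurolab uses. The learning rate, data and error goal are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))        # made-up inputs in [-1, 1]
t = X @ np.array([[1.0], [-2.0], [0.5]])     # made-up targets, shape (200, 1)

w = rng.normal(scale=0.1, size=(3, 1))       # randomly initialised weights
errors = []
for epoch in range(100):
    out = X @ w                              # 1) pass the inputs through the model
    err = out - t
    errors.append(float(np.mean(err ** 2)))  #    and record the error
    if errors[-1] < 1e-4:                    # 3) stop once the error goal is met
        break
    grad = 2 * X.T @ err / len(X)            # 2) gradient of the error w.r.t. the weights
    w -= 0.1 * grad                          #    adjust the weights, then repeat

print(len(errors), errors[0], errors[-1])
```

The list errors plays the same role as the value returned by net.train: one error value per epoch.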

In [89]:

error = []
error.append(net.train(inp, tar, epochs=500, goal=0.01))  # pass the 2-D target array built above

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-89-31a4e99bd1fa> in <module>()
      1 error = []
----> 2 error.append(net.train(inp, y, epochs=500, goal=0.001))

C:\Users\LUCKY\Anaconda3\lib\site-packages\neurolab\core.py in train(self, *args, **kwargs)
    163 
    164         """
--> 165         return self.trainf(self, *args, **kwargs)
    166 
    167     def reset(self):

C:\Users\LUCKY\Anaconda3\lib\site-packages\neurolab\core.py in __call__(self, net, input, target, **kwargs)
    347         self.error = []
    348         try:
--> 349             train(net, *args)
    350         except TrainStop as msg:
    351             if self.params['show']:

C:\Users\LUCKY\Anaconda3\lib\site-packages\neurolab\train\spo.py in __call__(self, net, input, target)
     77 
     78         x = fmin_bfgs(self.fcn, self.x.copy(), fprime=self.grad, callback=self.step,
---> 79                       **self.kwargs)
     80         self.x[:] = x
     81 

C:\Users\LUCKY\Anaconda3\lib\site-packages\scipy\optimize\optimize.py in fmin_bfgs(f, x0, fprime, args, gtol, norm, epsilon, maxiter, full_output, disp, retall, callback)
    791             'return_all': retall}
    792 
--> 793     res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)
    794 
    795     if full_output:

C:\Users\LUCKY\Anaconda3\lib\site-packages\scipy\optimize\optimize.py in _minimize_bfgs(fun, x0, args, jac, callback, gtol, norm, eps, maxiter, disp, return_all, **unknown_options)
    848     k = 0
    849     N = len(x0)
--> 850     I = numpy.eye(N, dtype=int)
    851     Hk = I
    852     old_fval = f(x0)

C:\Users\LUCKY\Anaconda3\lib\site-packages\numpy\lib\twodim_base.py in eye(N, M, k, dtype)
    231     if M is None:
    232         M = N
--> 233     m = zeros((N, M), dtype=dtype)
    234     if k >= M:
    235         return m

MemoryError:

Memory Error; Memory Error; Memory Error

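While training, we can see that this run needs more memory than the machine has. The traceback shows where: the trainer calls BFGS, which starts by allocating an N x N identity matrix (numpy.eye(N)), where N is the length of the network's parameter vector. To overcome this problem we might need a trainer that does not build such a matrix, a smaller network, or the help of distributed computing.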
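
A rough estimate makes the size of the problem concrete. The traceback ends in numpy.eye(N, dtype=int) inside the BFGS routine, with N = len(x0). Assuming x0 holds every weight and bias of the 561-50-1 network, and assuming an integer takes 4 or 8 bytes depending on the platform, the matrix size works out as follows.

```python
# Layer sizes taken from nl.net.newff([[-1, 1]]*561, [50, 1])
n_inputs, n_hidden, n_outputs = 561, 50, 1

# Assumption: one weight per connection plus one bias per neuron.
n_params = (n_inputs * n_hidden + n_hidden) + (n_hidden * n_outputs + n_outputs)

# numpy.eye(N, dtype=int) allocates N * N integers.
for bytes_per_int in (4, 8):
    gib = n_params ** 2 * bytes_per_int / 2 ** 30
    print(f"{n_params} parameters -> {gib:.1f} GiB at {bytes_per_int} bytes each")
```

Under these assumptions the single matrix needs several GiB, which is consistent with the MemoryError above.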

Below are some next steps for the neural network that can be used to get results once the memory error is resolved:

Plotting epochs vs. error: we can use this plot to choose the number of epochs for training and reduce training time.

In [ ]:

pl.figure(1)
pl.plot(error[0])
pl.xlabel('Number of epochs')
pl.ylabel('Training error')
pl.grid()
pl.show()

Simulating the network (predicting)

In [ ]:

predicted_values = net.sim(X_test)

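Converting predicted values into classes by using thresholds:

In [ ]:

predicted_class = predicted_values.copy()
# each comparison is parenthesised and combined with & for element-wise logic
predicted_class[(predicted_values > 0.5) & (predicted_values < 1.5)] = 1
predicted_class[(predicted_values > 1.5) & (predicted_values < 2.5)] = 2
predicted_class[(predicted_values > 2.5) & (predicted_values < 3.5)] = 3
predicted_class[(predicted_values > 3.5) & (predicted_values < 4.5)] = 4
predicted_class[(predicted_values > 4.5) & (predicted_values < 5.5)] = 5
predicted_class[(predicted_values > 5.5) & (predicted_values < 6.5)] = 6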

In the network that we have trained we might not need thresholds to define classes.
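
Because the class boundaries sit halfway between consecutive integers, the thresholding amounts to rounding to the nearest label. A compact sketch with made-up network outputs follows. Clipping to the range 1 to 6 is an added assumption so that out-of-range outputs still map to a valid activity.

```python
import numpy as np

raw = np.array([[0.7], [2.2], [3.49], [5.8], [6.9]])  # made-up network outputs

# Round to the nearest integer, then clip into the valid label range 1..6.
nearest_class = np.clip(np.rint(raw), 1, 6).astype(int).ravel()

print(nearest_class)
```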

In [ ]:

#predicted classes
predicted_class

Model validation

Creating confusion matrices

In [ ]:

ConfusionMatrix = confusion_matrix(y_test, predicted_class)  # confusion_matrix was imported from sklearn.metrics earlier
print(ConfusionMatrix)

Model Accuracy

In [ ]:

accuracy = ConfusionMatrix.trace() / ConfusionMatrix.sum()  # correct predictions (whole diagonal) over all predictions
print(accuracy)
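
With more than two classes, overall accuracy is the sum of the whole diagonal of the confusion matrix divided by the total count. Adding only the first two diagonal cells counts the correct predictions of just two classes. The sketch below shows the difference on a made-up 3-class matrix.

```python
import numpy as np

cm_demo = np.array([[8, 1, 1],
                    [0, 9, 1],
                    [2, 0, 8]])  # made-up 3-class confusion matrix

# All classes: every diagonal cell is a correct prediction.
overall_accuracy = np.trace(cm_demo) / cm_demo.sum()

# Only the first two diagonal cells: the third class is ignored.
first_two_only = (cm_demo[0, 0] + cm_demo[1, 1]) / cm_demo.sum()

print(overall_accuracy, first_two_only)
```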

Model Error

In [ ]:

error=1-accuracy
print(error)

Due to computational issues with memory we were not able to get the results based on Neural Network Training.

Conclusion:

Out of our three approaches, we were able to create models with the Random Forest and Support Vector Machine techniques.

The confusion matrices of these models show that they do not distinguish well between “Sitting” and “Standing”, or among “Walking”, “Walking Upstairs” and “Walking Downstairs”.

Meaning there is a huge opportunity to improve our models.

The accuracy of the models on the training data is quite high. This kind of accuracy is possible in the cases below:

1. Overfitting issue.
2. The features are highly correlated.

We will try to validate our models with 10-fold cross-validation and bootstrapping.

Validating our models

K-fold cross validation on the training data

Merge X_train and y_train to create a single dataframe called train.

In [33]:

train = pd.concat([X_train, y_train], axis=1)

Importing the cross-validation module from scikit-learn and selecting n_folds = 10

In [34]:

from sklearn import cross_validation
kfold = cross_validation.KFold(len(train), n_folds=10)

Selecting the models for cross validation

In [35]:

model1 = RandomForestClassifier(n_estimators=20) #Random forest model
model2 = svm.SVC(decision_function_shape='ovo')

We will perform 10-fold cross validation and calculate the mean of the scores over all the folds. kf_score1 is the mean accuracy across folds for the random forest model, model1. kf_score2 is the mean accuracy across folds for the support vector machine classification model, model2.
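
In scikit-learn releases from 0.20 onward, the cross_validation module is no longer available and its contents live in sklearn.model_selection. The sketch below shows what the same 10-fold procedure might look like with that module. It runs on synthetic data from make_classification as a stand-in for X_train and y_train, and the names ending in _demo are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X_train / y_train so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# n_splits replaces n_folds, and the dataset length is no longer passed in.
kfold_demo = KFold(n_splits=10)
model_demo = RandomForestClassifier(n_estimators=20, random_state=0)

fold_scores = cross_val_score(model_demo, X_demo, y_demo, cv=kfold_demo)
print(fold_scores.mean())
```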

10-fold Crossvalidation score with Random Forest:-

In [36]:

kf_score1 = cross_validation.cross_val_score(model1, X_train, y_train['ActivityID'],cv=kfold).mean()

In [37]:

kf_score1

Out[37]:

0.90791666666666659

Something seems to be different: our Random Forest model scored above 99% on the training data, while a Random Forest scores about 90% under cross validation. So our suspicion regarding overfitting of the Random Forest model was true. In this kind of situation we can play with the algorithm parameters and tune them a bit to generalize the model. The parameters are described in the scikit-learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
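
One way to "play with the parameters" systematically is a small grid search scored by cross validation. The sketch below assumes a recent scikit-learn (sklearn.model_selection) and uses synthetic data. The grid values are hypothetical examples, not tuned recommendations for the HAR data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# Hypothetical grid: more trees and larger leaves both tend to smooth the model.
param_grid = {"n_estimators": [10, 50],
              "min_samples_leaf": [1, 5]}

# Every combination is scored with 5-fold cross validation.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```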

10-fold cross validation score using SVM classifier:-

In [38]:

kf_score2 = cross_validation.cross_val_score(model2, X_train, y_train['ActivityID'],cv=kfold).mean()

In [39]:

kf_score2

Out[39]:

0.91566788671990551

Computation took a bit longer than Random Forest but was worth it. With cross validation, the SVM accuracy (about 91.6%) is within a few points of its accuracy on the whole training data (about 95.5%), a much smaller gap than for Random Forest. Meaning our SVM model is not overfitting nearly as much.

Bootstrap validation:

Let us look into alternatives (variations, to be more precise) to 10-fold cross validation. Bootstrapping is a way to quantify the uncertainty in your model, while cross validation is used for model selection and measuring predictive accuracy. Bootstrapping means resampling your data randomly with replacement, so any one sample may not hit all the points.

We will use scikit-learn’s ShuffleSplit function as a stand-in for bootstrapping, as a direct bootstrapping function is no longer available in the scikit-learn module. Note that ShuffleSplit draws random train/test splits without replacement, so it approximates the idea rather than performing a true bootstrap.

In [40]:

bootstrap = cross_validation.ShuffleSplit(len(train), n_iter=10,test_size=0.1)
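
For comparison, a bootstrap in the strict sense draws row indices with replacement, so some rows repeat and others are never drawn. The sketch below does this with NumPy. The row count is a made-up stand-in for len(train), and using the rows that were never drawn (out-of-bag) as a test set is one common choice, not what the cell above does.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000                                  # stand-in for len(train)

# Bootstrap sample: draw n_rows indices WITH replacement.
in_bag = rng.integers(0, n_rows, size=n_rows)

# Rows never drawn form the out-of-bag set, usable as a test set.
out_of_bag = np.setdiff1d(np.arange(n_rows), in_bag)

unique_fraction = len(np.unique(in_bag)) / n_rows
print(unique_fraction, len(out_of_bag))
```

On average about 63% of the rows (1 - 1/e) appear in a bootstrap sample, which is what "may not hit all the points" refers to.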

Bootstrap cross validation using Random forest:-

In [41]:

bs_score1 = cross_validation.cross_val_score(model1, X_train, y_train['ActivityID'],cv = bootstrap).mean()

In [42]:

bs_score1

Out[42]:

0.97635869565217381

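The shuffled-split score (about 97.6%) is higher than the 10-fold score but still below the training accuracy, so it also tells us that the RF model scores better on data it has already seen!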

Bootstrap cross validation using SVM Classifier:-

In [43]:

bs_score2 = cross_validation.cross_val_score(model2, X_train, y_train['ActivityID'],cv = bootstrap).mean()

In [44]:

bs_score2

Out[44]:

0.9452445652173912

Well, bootstrap validation also indicates that the SVM model behaves consistently (about 94.5% here against about 95.5% on the training data), so it should be our preferred model if computation cost and time are not our priority!

Principal Component Analysis:

So far, from the kind of models that we have created, we can understand that there is scope for improvement. How do we go about it? Let us think… OK wait, we have 561 features working all together to train and create models, and assuming that all the features contribute equally doesn’t seem fair to me. Can we reduce the dimension of the features and only keep the ones that matter more to our model than the others, so that the computation cost is not that high?

Turns out that PCA (Principal Component Analysis) is a thing we can use!

Principal components analysis is a procedure for identifying a smaller number of uncorrelated variables, called ‘principal components’, from a large set of data. The goal of principal components analysis is to explain the maximum amount of variance with the fewest number of principal components.

boring!!!!!!!!!!! wasn’t it??

Got something cool for you, hope this might make some sense: http://setosa.io/ev/principal-component-analysis/

You’re welcome!!!

We will perform PCA on the training data and check the accuracy score and confusion matrix to see if PCA is doing something for us. The code below gives us a dataframe ‘X_train_pca’ containing the PCA-transformed components that together explain 99% of the variance:

In [45]:

from sklearn.decomposition import PCA
# Minimum percentage of variance we want to be described by the resulting transformed components
variance_pct = .99

# Create PCA object & Transform the initial features
X_transformed = PCA(n_components=variance_pct).fit_transform(X_train)

X_train_pca = pd.DataFrame(X_transformed)

print (X_transformed.shape[1], " components describe ", str(variance_pct)[1:], "% of the variance")

145  components describe  .99 % of the variance
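
What a fractional n_components asks for can be reproduced by hand: compute the share of variance carried by each component and keep the smallest number whose cumulative share reaches the target. The sketch below does this with NumPy's SVD on made-up data (10 features driven by 3 hidden factors). It illustrates the idea and is not scikit-learn's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 10 observed features driven by 3 hidden factors plus small noise.
latent = rng.normal(size=(500, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(500, 10))

# PCA via SVD of the mean-centred data.
X_centred = X_demo - X_demo.mean(axis=0)
singular_values = np.linalg.svd(X_centred, compute_uv=False)
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)

# Smallest number of components whose cumulative share reaches 99%.
cumulative = np.cumsum(explained_ratio)
n_components = int(np.searchsorted(cumulative, 0.99) + 1)

print(n_components)
```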

In [46]:

X_train_pca.shape, y_train.shape

Out[46]:

((7352, 145), (7352, 1))

SVM accuracy after PCA:

In [47]:

clf_pca = svm.SVC(decision_function_shape='ovo')
clf_pca.fit(X_train_pca, y_train['ActivityID'])

Out[47]:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [48]:

tr_score_pca = clf_pca.score(X_train_pca, y_train)
tr_score_pca

Out[48]:

0.97687704026115341

Predicting the labels using the model that we have trained on the training set:

In [49]:

tr_pred_pca = clf_pca.predict(X_train_pca) # on training set

Plotting Confusion matrix:

In [50]:

tr_cm_pca = confusion_matrix(y_train, tr_pred_pca) #confusion matrix for predictions on training
pl_tr_cm_pca = pd.DataFrame(tr_cm_pca, index = index, columns = columns)
fig, ax = plt.subplots(figsize=(15, 6))  # subplots creates the figure; plt.plot does not take figsize
sn.heatmap(pl_tr_cm_pca, annot=True, fmt = 'g', ax=ax).set_title('After PCA SVM on training set')

Out[50]:

<matplotlib.text.Text at 0xb55d5c0>

Good stuff, at least we can see that after PCA the model classifies ‘Walking’, ‘Walking Upstairs’ and ‘Walking Downstairs’ better on the training set than our previous models!
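
The comparison above is made on the training set only. To check held-out data, the projection learned from the training rows has to be reused on the test rows rather than refitted. The sketch below shows one way to do that with a scikit-learn pipeline. It assumes a recent scikit-learn and uses synthetic data and an artificial train/test split in place of the HAR files.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=400, n_features=30,
                                     n_informative=8, n_classes=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0)

# The pipeline fits PCA on the training rows only, then applies the same
# projection when scoring the held-out rows.
pca_svm = make_pipeline(PCA(n_components=0.99), SVC())
pca_svm.fit(X_tr, y_tr)

print(pca_svm.score(X_tr, y_tr), pca_svm.score(X_te, y_te))
```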

Summary:

We went through a simple pipeline for a small data science problem:
- Identifying and understanding the kind of problem.
- Converting the raw data into a tidy version of the same.
- Working with Random Forest, SVM and Neural Network classifiers in Python.
- Understanding cross validation.
- Performing dimension reduction using Principal Component Analysis.

Results:

- Random Forest, which is said to be one of the models least prone to overfitting (I am not sure, but I have read it in many places), can still give an overfitted model with default parameters.
- SVM takes a lot more time to compute than Random Forest.
- Neural networks are heavy to train and consume a lot of memory and time.
- For a given kind of problem (classification in this case), not all algorithms perform alike.
- PCA for reducing dimensions, which is predominantly used with image data, can also be useful with sensor data.

Final Thoughts:

This case study was probably a bit long! Hopefully it will provide some assistance to people who are getting started with scikit-learn and could use a little guidance on the basics of machine learning.
