Before starting the lesson, please download the datasets.
Logistic Regression in Detail
In this chapter, we will discuss logistic regression. Earlier we talked about linear regression, also known as regression analysis. Now we will study logistic regression, see how it differs from linear regression, and cover the new concepts that appear in logistic regression but have no counterpart in linear regression.
Contents
- Why do we need logistic regression?
- Building a logistic regression line
- Goodness of fit measures
- Multicollinearity
- Individual Impact of variables
- Model selection
Regression Recap
In regression, the dependent variable is predicted using independent variables. A straight line is fitted to capture the relation in the form of a model. The R-squared/adjusted R-squared value tells us the goodness of fit of the model. Once the line is ready, we can substitute values of x (the predictor) to get predicted values of y (the dependent variable).
LAB: Regression – Recap
- Dataset: Product Sales Data/Product_sales.csv
- What are the variables in the dataset?
- Build a predictive model for Bought vs Age.
- What is the R-squared value?
- If a customer's age is 4, will they buy the product?
- If a customer's age is 105, will they buy the product?
Solution
import pandas as pd

# Load the product sales data
sales = pd.read_csv("Datasets/Product Sales Data/Product_sales.csv")
sales.columns.values

import statsmodels.formula.api as smf

# Ordinary least squares: Bought modeled as a linear function of Age
model = smf.ols(formula='Bought ~ Age', data=sales)
fitted = model.fit()
fitted.summary()
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(sales[["Age"]], sales["Bought"])

# sklearn expects a 2D array of features, hence the [[...]]
age1 = 4
predict1 = lr.predict([[age1]])
predict1

age2 = 105
predict2 = lr.predict([[age2]])
predict2
Something is wrong
- The model that we built above is not right.
- There is an issue with the type of the dependent variable.
- The dependent variable is not continuous; it is binary.
- We can't fit a linear regression line to this data.
Why not linear?
- Consider the product sales data. The dataset has two columns.
- Age – a continuous variable between 6 and 80.
- Bought – a binary variable (0 = No; 1 = Yes).
Real-life examples
- Gaming – Win vs. Loss
- Sales – Buying vs. Not buying
- Marketing – Response vs. No Response
- Credit card & Loans – Default vs. Non Default
- Operations – Attrition vs. Retention
- Websites – Click vs. No click
- Fraud identification – Fraud vs. Non Fraud
- Healthcare – Cure vs. No Cure
Some Nonlinear functions
Looking at the functions shown, there is a polynomial function; will it fit our data? No, it won't. Similarly, the Gaussian, quadratic, exponential, double exponential, and sine functions won't fit our data either. Then there is the logistic function, which we can fit to our data.
The Logistic Function
The logistic function looks like an "S", and if we adjust some parameters we can fit it to our data: its tails are long, its middle portion is steep, and that shape matches our dataset well. We will therefore fit a logistic function rather than a linear function, avoiding many of the errors that come from just fitting a straight line. The linear regression equation was $y = \beta_0 + \beta_1 x$. For logistic regression we need a somewhat different equation: a model that predicts probabilities between 0 and 1 and is "S" shaped. One portion of the data is full of 0s and another portion is full of 1s; people in the 0s region are not buying, and people in the 1s region are buying. That is the pattern we observe: as Age increases, Buying starts to dominate. So what is the logistic regression equation? Instead of the simple linear equation, the logistic regression line is

$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$
Logistic Regression Function
Logistic regression models the logit of the outcome instead of the outcome itself; i.e., instead of winning or losing, we build a model for the log odds of winning: the natural logarithm of the odds of the outcome,

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$

where $p$ is the probability of the outcome and $1 - p$ is the probability of not having the outcome.
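To make the link between log odds and probabilities concrete, here is a minimal sketch; the coefficient values are hypothetical, chosen only to show the S-shape of the curve:

import numpy as np

def logistic(x, b0, b1):
    # Turn the linear predictor (the log odds) into a probability
    log_odds = b0 + b1 * x
    return 1 / (1 + np.exp(-log_odds))

# Hypothetical coefficients, for illustration only
for age in [4, 30, 60, 105]:
    print(age, round(logistic(age, b0=-6.0, b1=0.18), 3))

Low ages give probabilities near 0 and high ages give probabilities near 1, which is exactly the behavior we want for the Bought variable.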
LAB: Logistic Regression
- Dataset: Product Sales Data/Product_sales.csv
- Build a logistic regression line between Age and Bought.
- Will a 4-year-old customer buy the product?
- If a customer's age is 105, will they buy the product?
import pandas as pd
import statsmodels.api as sm

sales = pd.read_csv("Datasets/Product Sales Data/Product_sales.csv")

# Logit needs an explicit intercept column, added here with add_constant
logit = sm.Logit(sales['Bought'], sm.add_constant(sales['Age']))
result = logit.fit()
result.summary()
print(result.conf_int())
from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression()
logistic.fit(sales[["Age"]], sales["Bought"])

# predict expects a 2D array of features
age1 = 4
predict_age1 = logistic.predict([[age1]])
print(predict_age1)

age2 = 105
predict_age2 = logistic.predict([[age2]])
print(predict_age2)
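The predictions above are hard 0/1 labels. If we also want the underlying probabilities, sklearn's predict_proba returns them; the second column is the probability of Bought = 1:

print(logistic.predict_proba([[4]]))
print(logistic.predict_proba([[105]]))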
Multiple Logistic Regression
- The dependent variable is binary.
- Instead of a single independent/predictor variable, we have multiple predictors.
- For example, buying vs. not buying depends on customer attributes such as age, gender, place, income, etc. The model takes the same logit form, now with multiple predictors, as shown below.
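With $k$ predictors, the log odds are simply a linear combination of all of them:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$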
LAB: Multiple Logistic Regression
The Fiberbits dataset comes from an internet service provider. Over the last few years, some customers stayed with this provider and some left. We want to build an attrition model that predicts whether a customer will stay. If we know who is likely to leave and who is likely to stay, we can send the at-risk customers promotional codes and offers to retain them. Active_cust is the variable that tells us whether the customer is still active or has already left the network; the other variables are used as predictors. There are many reasons for customers to stay or leave, and we will try to build a model on the Fiberbits data that predicts which it will be.
- Dataset: Fiberbits/Fiberbits.csv
- Active_cust variable indicates whether the customer is active or already left the network.
- Build a model to predict the chance of attrition for a given customer using all the features.
- How good is your model?
- What are the most impacting variables?
Fiber = pd.read_csv("Datasets/Fiberbits/Fiberbits.csv")
list(Fiber.columns.values)

from sklearn.linear_model import LogisticRegression

# All predictor columns collected in one list
features = ["income", "months_on_network", "Num_complaints", "number_plan_changes",
            "relocated", "monthly_bill", "technical_issues_per_month", "Speed_test_result"]
logistic = LogisticRegression()
logistic.fit(Fiber[features], Fiber["active_cust"])
predict1 = logistic.predict(Fiber[features])
predict1
Goodness of Fit for a Logistic Regression
After building the logistic regression line, the question arises: how good is the model? Someone may challenge the fitted line, or ask what the confidence in the prediction is, or how good the fit of this particular model is. So what are the goodness of fit measures for logistic regression? R-squared is not the right measure here: it is the percentage of variation in Y explained by the predictor variables, but in logistic regression Y takes only two values, 0 and 1, so there is hardly any variance to explain. Hence R-squared is not the right measure of goodness of fit for logistic regression. There are other ways of measuring goodness of fit for logistic regression:
- Classification matrix and accuracy
- AIC and BIC
- ROC and AUC – area under the curve
Classification Table & Accuracy
The confusion matrix, also called the classification table, gives the accuracy of the model. Y takes two values: Yes or No, 1 or 0, -1 or +1, True or False, etc., and our model's prediction will likewise be one of two values. Take a data point that already exists in the data, say one where active_cust is 0, and look at all the variables given for that point. If we substitute those variables into the fitted logistic regression model and the prediction comes out as 0 (or near 0), we are doing well, because the actual value is 0. But if the prediction comes out as 1, we have made a mistake at that point: a wrong prediction. We want as few such wrong predictions as possible. In a good model, 0 should be predicted as 0 and 1 should be predicted as 1; a 0 predicted as 1, or a 1 predicted as 0, is a misclassification. Note that the raw predictions are not 0 and 1; they are probabilities, which lie between 0 and 1. We can put a threshold on them: anything more than 0.5 becomes 1 and anything less than 0.5 becomes 0. So if the actual value is 0 and the predicted probability is near 0, our model is right. Suppose the dataset has 1000 data points, 500 zeros and 500 ones. Most of the 500 zeros should be predicted as zeros (a few may come out as ones), and most of the ones should be predicted as ones; only then can we say the model is a good one.
Actual / Predicted | 0 | 1 |
---|---|---|
0 | True Positive (TP) | False Negative (FN) |
1 | False Positive (FP) | True Negative (TN) |
As the table shows, if the true positives and true negatives are low and the false positives and false negatives are high, then something is wrong with the prediction. For a good model, every 0 should be predicted as 0 and every 1 as 1. This table is called the confusion matrix or classification table. Accuracy is predicting 0 as 0 and 1 as 1, i.e., the diagonal elements of the matrix: the percentage of cases, out of all cases, that were predicted correctly. Accuracy is always derived from the classification table or confusion matrix. Here, 0 is treated as the positive class and 1 as the negative class; the rows are the actual classes and the columns are the predicted classes. When 0 is predicted as 0, it is a true positive: the actual condition is positive and it is truly predicted as positive. When 0 is predicted as 1, it is a false negative: the actual condition is positive but it is falsely predicted as negative. When 1 is predicted as 0, it is a false positive: the actual condition is negative but it is falsely predicted as positive. When 1 is predicted as 1, it is a true negative: the actual condition is negative and it is truly predicted as negative. A good model should have a high number of true positives and true negatives and a low number of false positives and false negatives.
We can also derive specificity and sensitivity from the confusion matrix, as sketched below; for now we concentrate on accuracy for the logistic regression line. By looking at the accuracy and the confusion matrix, we can decide whether the model is good or bad. An accuracy above 80% is generally considered good, though much depends on the data: if the data is really good it tends to yield higher accuracy, and vice versa.
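As a minimal sketch of how sensitivity and specificity fall out of the confusion matrix (using the convention above that 0 is the positive class; the counts are made up for illustration):

import numpy as np

# Illustrative confusion matrix: rows are actual (0 then 1),
# columns are predicted (0 then 1)
cm = np.array([[450, 50],
               [ 60, 440]])
TP, FN = cm[0, 0], cm[0, 1]
FP, TN = cm[1, 0], cm[1, 1]

sensitivity = TP / (TP + FN)   # share of actual positives caught
specificity = TN / (TN + FP)   # share of actual negatives caught
accuracy = (TP + TN) / cm.sum()
print(sensitivity, specificity, accuracy)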
LAB: Confusion Matrix & Accuracy
- Create the confusion matrix for the Fiberbits model:
from sklearn.metrics import confusion_matrix

# Rows are the actual classes, columns are the predicted classes
cm1 = confusion_matrix(Fiber["active_cust"], predict1)
print(cm1)
- Find the accuracy value for the Fiberbits model
total1 = cm1.sum()
accuracy1 = (cm1[0, 0] + cm1[1, 1]) / total1
accuracy1
- Try three different threshold values and note the changes in accuracy for the Fiberbits model; a sketch follows below.
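The labels returned by predict() are already thresholded at 0.5, so to try other thresholds we need the raw probabilities from predict_proba. A minimal sketch, assuming the logistic model and features list defined in the earlier lab (column 1 is the probability of class 1):

probs = logistic.predict_proba(Fiber[features])[:, 1]

for threshold in [0.3, 0.5, 0.7]:
    predicted = (probs > threshold).astype(int)
    acc = (predicted == Fiber["active_cust"]).mean()
    print("threshold", threshold, "accuracy", round(acc, 3))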
Multicollinearity
Multicollinearity is interdependency among the predictor variables: within the predictors, some variables depend on each other. When that happens, the final regression coefficients go for a toss; the variance of the coefficients becomes so high that we cannot trust them. If we analyze the individual impact of variables, say X1 vs. Y and X2 vs. Y, while X1 and X2 are related, the coefficients are wrong and we draw wrong conclusions. That relation between X1 and X2 is multicollinearity. We cannot keep both of them in the model at the same time; it has to be X1 or X2, because one of them carries essentially the complete information of the other.

Note that multicollinearity concerns only the predictor variables; it has nothing to do with Y vs. X. It is entirely about each X vs. the remaining X's: X1 vs. the rest, X2 vs. the rest, and so on. Multicollinearity is therefore an issue in logistic regression too, since the dependent variable hardly enters the picture. If there is multicollinearity, the logistic regression coefficients will go for a toss, so it needs to be treated: logistic regression also involves optimization, and interlinked variables will impact the coefficients and lead to wrong conclusions when we analyze the impact of each variable on Y. The relation between X and Y being nonlinear (which is why we use logistic regression in the first place) does not change this, because multicollinearity is a predictor-side issue. The identification process is also the same as in linear regression: we regress each X on the remaining X's and compute the VIF (variance inflation factor) values. We use the VIF values to identify multicollinearity, and if it is present, we give it the same treatment as before.
Multicollinearity in Python
We take the Fiberbits dataset; for multicollinearity we need to find the VIF of each predictor, where $VIF_i = 1/(1 - R_i^2)$ and $R_i^2$ comes from regressing predictor $i$ on all the other predictors. Any VIF value above 5 is an indication of multicollinearity, but that does not mean we should drop every variable whose VIF exceeds 5. If X1 depends on X2 and X2 depends on X1, then each is redundant in the presence of the other, so both will have VIF above 5; dropping both would lose the underlying information. We drop variables one at a time: compute the VIFs, look at the highest one, and if it is above 5, drop that variable and recompute.
import statsmodels.formula.api as smf

def vif_cal(input_data, dependent_col):
    # For each predictor, regress it on the remaining predictors and
    # compute VIF = 1 / (1 - R-squared)
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        y = x_vars[xvar_names[i]]
        x = x_vars[xvar_names.drop(xvar_names[i])]
        rsq = smf.ols(formula="y~x", data=x_vars).fit().rsquared
        vif = round(1 / (1 - rsq), 2)
        print(xvar_names[i], " VIF = ", vif)

vif_cal(input_data=Fiber, dependent_col="active_cust")
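As an alternative to the hand-rolled function above, statsmodels ships a helper that computes the same quantity. A minimal sketch, assuming all the predictors in Fiberbits are numeric:

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# Intercept column included so the VIFs match the regression setting
X = sm.add_constant(Fiber.drop(["active_cust"], axis=1))
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 2))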
Individual Impact of Variables
Out of all the predictor variables we used to predict buying vs. not buying, or attrition vs. non-attrition, which are the important ones? If we had to choose the top 5 variables, which would they be? When selecting a model, is there a way to drop the variables that have little or no impact and keep only the important ones, so that we do not have to collect or maintain data for the less impactful variables? Why keep variables that have no impact? How do we rank the predictor variables in order of their importance?
Individual Impact – z-values and Wald Chi-square
How do we find the individual impact of the variables? We look at the z-values, or the Wald chi-square values, reported in the output against each variable. The absolute value of z tells us the impact of that variable; we can rank variables by the Wald chi-square, which is simply the square of the z-value, so the variable with the highest squared z-value is the most important. We use absolute values because some variables impact positively and some negatively: if one variable has a z-value of -30 and another has 10, the more important one is the variable with z-value -30, since only the magnitude of the impact matters. In short, rank variables by |z| or, equivalently, by the Wald chi-square (z squared). Let us see what the top impacting variables are, and how we rank the variables, in our Fiberbits data.
LAB: Individual Impact of Variables
- Identify the top impacting and least impacting variables in the Fiberbits model.
- Find the variable importance and order the variables by their impact.
# Same feature list as before, with an explicit intercept
result1 = sm.Logit(Fiber["active_cust"], sm.add_constant(Fiber[features]))
fit1 = result1.fit()
fit1.summary()
fit1.summary2()
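To rank the variables directly, we can square the z-values of the fitted model (statsmodels exposes them as the tvalues attribute, which reports z-scores for a Logit model) and sort them; a short sketch using the fitted model above:

wald_chi2 = fit1.tvalues ** 2

# The variable with the largest Wald chi-square has the biggest impact
print(wald_chi2.drop("const").sort_values(ascending=False))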
Model Selection
While trying to build the best model, we tend to add many variables, derive new variables, and collect more data, and sometimes we try to build an optimal model by dropping variables. There are many ways to improve a model:
- Adding new independent variables
- Deriving new variables
- Transforming variables
- Collecting more data
- Or sometimes dropping a few variables, even at the cost of some accuracy.
Maybe instead of building a model with 200 or 300 variables, we want a model with just 50 or 60 variables, or maybe 20, with the best accuracy possible: we might prefer the top impacting variables with 80% accuracy over 200-300 variables with 85% accuracy. So if we have several models with similar accuracy levels, how do we choose the best, most suitable, or optimal model? Suppose we have models M1, M2, and M3: how do we know which is the apt model for the data? That question is answered by the AIC (Akaike information criterion) and BIC (Bayesian information criterion) values. A standalone AIC value has no real use; like adjusted R-squared in linear regression, it is used for comparison. When we have 2-3 models, the best fit is the one with the least AIC value. AIC is an estimate of the information lost when a given model is used to represent the process that generates the data; simply put, it is the information loss incurred while building the model, and we want to lose the least amount of information. Say model 1 has 84% accuracy with 20 variables and model 2 has 85% accuracy with 30 variables, and we must choose between them: based on AIC, we choose the model with the least information loss. The AIC formula includes the maximized likelihood of the model and the number of parameters:

$$AIC = 2k - 2\ln(L)$$

where $L$ is the maximum value of the likelihood function for the model and $k$ is the number of independent variables. So if we have too many parameters or too many independent variables, the AIC value reflects it. For a given accuracy level, we choose the model with the least AIC; if 2-3 models M1, M2, M3 have roughly the same accuracy, the one with the least information loss according to AIC is the best of them. BIC is simply a substitute for AIC with a slightly different formula; we can use either AIC or BIC (one of them is sufficient, so we can simply fix on AIC) throughout our analysis.
LAB: Logistic Regression Model Selection
- Find the AIC and BIC values for the first Fiberbits model (M1).
- What are the top-2 impacting variables in the Fiberbits model?
- What are the least impacting variables in the Fiberbits model?
- Can we drop any of these variables and build a new model (M2)?
- Can we add any new interaction or polynomial variables to increase the accuracy of the model (M3)?
- We have three models; what is the best accuracy that we can expect on this data?
# M1: all features, with an explicit intercept
m1 = sm.Logit(Fiber["active_cust"], sm.add_constant(Fiber[features]))
fit_m1 = m1.fit()
fit_m1.summary2()
fit_m1.summary()
# M2: drop income and monthly_bill
m2_features = ["months_on_network", "Num_complaints", "number_plan_changes",
               "relocated", "technical_issues_per_month", "Speed_test_result"]
m2 = sm.Logit(Fiber["active_cust"], sm.add_constant(Fiber[m2_features]))
fit_m2 = m2.fit()
fit_m2.summary()
fit_m2.summary2()
# M3: another candidate set of features
m3_features = ["income", "months_on_network", "Num_complaints",
               "number_plan_changes", "monthly_bill", "technical_issues_per_month"]
m3 = sm.Logit(Fiber["active_cust"], sm.add_constant(Fiber[m3_features]))
fit_m3 = m3.fit()
fit_m3.summary()
fit_m3.summary2()
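summary2() prints the AIC and BIC, but the fitted results also expose them directly as aic and bic attributes, which makes comparing the three models easy. A small sketch, assuming the fitted results above are named fit_m1, fit_m2, and fit_m3:

# Lower AIC/BIC means less information loss, i.e., the preferred model
for name, fit in [("M1", fit_m1), ("M2", fit_m2), ("M3", fit_m3)]:
    print(name, "AIC =", round(fit.aic, 1), "BIC =", round(fit.bic, 1))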
Conclusion: Logistic Regression
In this chapter, we learned about logistic regression: why we need it when we already have linear regression, how the two differ, how to fit a logistic regression, how to validate it, how to do model selection, and so on. Logistic regression is a good foundation for many algorithms: a solid understanding of logistic regression and its goodness-of-fit measures really helps in understanding complex machine learning algorithms like neural networks and SVMs, and the further topics will be much easier compared to this one. In fact, neural networks can be seen as built out of logistic regression units. We have to be careful while selecting the model, since all the goodness-of-fit measures here are calculated on the training data; we may have to do some out-of-time validation or cross validation to get an idea of the actual error or actual accuracy of the model. In future topics, we will discuss neural networks, SVMs, etc.