Before starting the lesson, please download the datasets.
Logistic Regression in Detail
In this chapter, we will discuss logistic regression. Earlier we talked about linear regression, also known as regression analysis. Now we will study logistic regression, see how it differs from linear regression, and cover the new concepts that appear in logistic regression but have no counterpart in linear regression.
Contents
- Why do we need logistic regression?
- Building a logistic regression line
- Goodness of fit measures
- Multicollinearity
- Individual Impact of variables
- Model selection
Regression Recap
In regression, the dependent variable is predicted using independent variables. A straight line is fitted to capture the relation in the form of a model. The R-squared/adjusted R-squared value tells us the goodness of fit of the model. Once the line is ready, we can substitute values of x (the predictor) to get predicted values of y (the dependent variable).
LAB: Regression – Recap
- Dataset: Product Sales Data/Product_sales.csv
- What are the variables in the dataset?
- Build a predictive model for Bought vs Age.
- What is the R-squared value?
- If a customer's age is 4, will they buy the product?
- If a customer's age is 105, will they buy the product?
Solution
import pandas as pd

# Load the product sales data
sales = pd.read_csv("Datasets/Product Sales Data/Product_sales.csv")
sales.columns.values

import statsmodels.formula.api as smf

# Ordinary least squares: Bought modeled as a linear function of Age
model = smf.ols(formula='Bought ~ Age', data=sales)
fitted = model.fit()
fitted.summary()
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(sales[["Age"]], sales["Bought"])

# sklearn expects a 2D array of features, hence the [[...]]
age1 = 4
predict1 = lr.predict([[age1]])
predict1

age2 = 105
predict2 = lr.predict([[age2]])
predict2
Something is wrong
- The model that we built above is not right.
- There is an issue with the type of the dependent variable.
- The dependent variable is not continuous; it is binary.
- We can't fit a linear regression line to this data.
Why not linear?
- Consider the product sales data. The dataset has two columns.
- Age – a continuous variable between 6 and 80.
- Bought – a binary variable (0 = No; 1 = Yes).
Real-life examples
- Gaming – Win vs. Loss
- Sales – Buying vs. Not buying
- Marketing – Response vs. No Response
- Credit card & Loans – Default vs. Non Default
- Operations – Attrition vs. Retention
- Websites – Click vs. No click
- Fraud identification – Fraud vs. Non Fraud
- Healthcare – Cure vs. No Cure
Some Nonlinear functions
Looking at the functions shown, there is a polynomial function; will it fit our data? No, it won't. Similarly, the Gaussian, quadratic, exponential, double exponential, and sine functions won't fit our data either. Then there is the logistic function, which we can fit to our data.
The Logistic Function
The logistic function looks like an "S", and if we adjust some parameters we can fit it to our data: its tails are long, its middle portion is steep, and that shape matches our dataset well. We will therefore fit a logistic function rather than a linear function, avoiding many of the errors that come from just fitting a straight line. The linear regression equation was $y = \beta_0 + \beta_1 x$. For logistic regression we need a somewhat different equation: a model that predicts probabilities between 0 and 1 and is "S" shaped. One portion of the data is full of 0s and another portion is full of 1s; people in the 0s region are not buying, and people in the 1s region are buying. That is the pattern we observe: as Age increases, Buying starts to dominate. So what is the logistic regression equation? Instead of the simple linear equation, the logistic regression line is

$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$
Logistic Regression Function
Logistic regression models the logit of the outcome instead of the outcome itself; i.e., instead of winning or losing, we build a model for the log odds of winning: the natural logarithm of the odds of the outcome,

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$

where $p$ is the probability of the outcome and $1 - p$ is the probability of not having the outcome.
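To make the link between log odds and probabilities concrete, here is a minimal sketch; the coefficient values are hypothetical, chosen only to show the S-shape of the curve:

import numpy as np

def logistic(x, b0, b1):
    # Turn the linear predictor (the log odds) into a probability
    log_odds = b0 + b1 * x
    return 1 / (1 + np.exp(-log_odds))

# Hypothetical coefficients, for illustration only
for age in [4, 30, 60, 105]:
    print(age, round(logistic(age, b0=-6.0, b1=0.18), 3))

Low ages give probabilities near 0 and high ages give probabilities near 1, which is exactly the behavior we want for the Bought variable.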
LAB: Logistic Regression
- Dataset: Product Sales Data/Product_sales.csv
- Build a logistic regression line between Age and Bought.
- Will a 4-year-old customer buy the product?
- If a customer's age is 105, will they buy the product?
import pandas as pd
import statsmodels.api as sm

sales = pd.read_csv("Datasets/Product Sales Data/Product_sales.csv")

# Logit needs an explicit intercept column, added here with add_constant
logit = sm.Logit(sales['Bought'], sm.add_constant(sales['Age']))
result = logit.fit()
result.summary()
print(result.conf_int())
from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression()
logistic.fit(sales[["Age"]], sales["Bought"])

# predict expects a 2D array of features
age1 = 4
predict_age1 = logistic.predict([[age1]])
print(predict_age1)

age2 = 105
predict_age2 = logistic.predict([[age2]])
print(predict_age2)
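The predictions above are hard 0/1 labels. If we also want the underlying probabilities, sklearn's predict_proba returns them; the second column is the probability of Bought = 1:

print(logistic.predict_proba([[4]]))
print(logistic.predict_proba([[105]]))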
Multiple Logistic Regression
- The dependent variable is binary.
- Instead of a single independent/predictor variable, we have multiple predictors.
- For example, buying vs. not buying depends on customer attributes such as age, gender, place, income, etc. The model takes the same logit form, now with multiple predictors, as shown below.
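With $k$ predictors, the log odds are simply a linear combination of all of them:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$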
LAB: Multiple Logistic Regression
The Fiberbits dataset comes from an internet service provider. Over the last few years, some customers stayed with this provider and some left. We want to build an attrition model that predicts whether a customer will stay. If we know who is likely to leave and who is likely to stay, we can send the at-risk customers promotional codes and offers to retain them. Active_cust is the variable that tells us whether the customer is still active or has already left the network; the other variables are used as predictors. There are many reasons for customers to stay or leave, and we will try to build a model on the Fiberbits data that predicts which it will be.
- Dataset: Fiberbits/Fiberbits.csv
- Active_cust variable indicates whether the customer is active or already left the network.
- Build a model to predict the chance of attrition for a given customer using all the features.
- How good is your model?
- What are the most impacting variables?
Fiber = pd.read_csv("Datasets/Fiberbits/Fiberbits.csv")
list(Fiber.columns.values)

from sklearn.linear_model import LogisticRegression

# All predictor columns collected in one list
features = ["income", "months_on_network", "Num_complaints", "number_plan_changes",
            "relocated", "monthly_bill", "technical_issues_per_month", "Speed_test_result"]
logistic = LogisticRegression()
logistic.fit(Fiber[features], Fiber["active_cust"])
predict1 = logistic.predict(Fiber[features])
predict1
Goodness of Fit for a Logistic Regression
After building the logistic regression line, the question arises: how good is the model? Someone may challenge the fitted line, or ask what the confidence in the prediction is, or how good the fit of this particular model is. So what are the goodness of fit measures for logistic regression? R-squared is not the right measure here: it is the percentage of variation in Y explained by the predictor variables, but in logistic regression Y takes only two values, 0 and 1, so there is hardly any variance to explain. Hence R-squared is not the right measure of goodness of fit for logistic regression. There are other ways of measuring goodness of fit for logistic regression:
- Classification matrix and accuracy
- AIC and BIC
- ROC and AUC – area under the curve
Classification Table & Accuracy
The confusion matrix, also called the classification table, gives the accuracy of the model. Y takes two values: Yes or No, 1 or 0, -1 or +1, True or False, etc., and our model's prediction will likewise be one of two values. Take a data point that already exists in the data, say one where active_cust is 0, and look at all the variables given for that point. If we substitute those variables into the fitted logistic regression model and the prediction comes out as 0 (or near 0), we are doing well, because the actual value is 0. But if the prediction comes out as 1, we have made a mistake at that point: a wrong prediction. We want as few such wrong predictions as possible. In a good model, 0 should be predicted as 0 and 1 should be predicted as 1; a 0 predicted as 1, or a 1 predicted as 0, is a misclassification. Note that the raw predictions are not 0 and 1; they are probabilities, which lie between 0 and 1. We can put a threshold on them: anything more than 0.5 becomes 1 and anything less than 0.5 becomes 0. So if the actual value is 0 and the predicted probability is near 0, our model is right. Suppose the dataset has 1000 data points, 500 zeros and 500 ones. Most of the 500 zeros should be predicted as zeros (a few may come out as ones), and most of the ones should be predicted as ones; only then can we say the model is a good one.
Actual / Predicted | 0 | 1 |
---|---|---|
0 | True Positive (TP) | False Negative (FN) |
1 | False Positive (FP) | True Negative (TN) |
As the table shows, if the true positives and true negatives are low and the false positives and false negatives are high, then something is wrong with the prediction. For a good model, every 0 should be predicted as 0 and every 1 as 1. This table is called the confusion matrix or classification table. Accuracy is predicting 0 as 0 and 1 as 1, i.e., the diagonal elements of the matrix: the percentage of cases, out of all cases, that were predicted correctly. Accuracy is always derived from the classification table or confusion matrix. Here, 0 is treated as the positive class and 1 as the negative class; the rows are the actual classes and the columns are the predicted classes. When 0 is predicted as 0, it is a true positive: the actual condition is positive and it is truly predicted as positive. When 0 is predicted as 1, it is a false negative: the actual condition is positive but it is falsely predicted as negative. When 1 is predicted as 0, it is a false positive: the actual condition is negative but it is falsely predicted as positive. When 1 is predicted as 1, it is a true negative: the actual condition is negative and it is truly predicted as negative. A good model should have a high number of true positives and true negatives and a low number of false positives and false negatives.
We can also derive specificity and sensitivity from the confusion matrix, as sketched below; for now we concentrate on accuracy for the logistic regression line. By looking at the accuracy and the confusion matrix, we can decide whether the model is good or bad. An accuracy above 80% is generally considered good, though much depends on the data: if the data is really good it tends to yield higher accuracy, and vice versa.
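As a minimal sketch of how sensitivity and specificity fall out of the confusion matrix (using the convention above that 0 is the positive class; the counts are made up for illustration):

import numpy as np

# Illustrative confusion matrix: rows are actual (0 then 1),
# columns are predicted (0 then 1)
cm = np.array([[450, 50],
               [ 60, 440]])
TP, FN = cm[0, 0], cm[0, 1]
FP, TN = cm[1, 0], cm[1, 1]

sensitivity = TP / (TP + FN)   # share of actual positives caught
specificity = TN / (TN + FP)   # share of actual negatives caught
accuracy = (TP + TN) / cm.sum()
print(sensitivity, specificity, accuracy)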
LAB: Confusion Matrix & Accuracy
- Create the confusion matrix for the Fiberbits model:
from sklearn.metrics import confusion_matrix

# Rows are the actual classes, columns are the predicted classes
cm1 = confusion_matrix(Fiber["active_cust"], predict1)
print(cm1)
- Find the accuracy value for the Fiberbits model
total1 = cm1.sum()
accuracy1 = (cm1[0, 0] + cm1[1, 1]) / total1
accuracy1
- Try three different threshold values and note the changes in accuracy for the Fiberbits model; a sketch follows below.
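The labels returned by predict() are already thresholded at 0.5, so to try other thresholds we need the raw probabilities from predict_proba. A minimal sketch, assuming the logistic model and features list defined in the earlier lab (column 1 is the probability of class 1):

probs = logistic.predict_proba(Fiber[features])[:, 1]

for threshold in [0.3, 0.5, 0.7]:
    predicted = (probs > threshold).astype(int)
    acc = (predicted == Fiber["active_cust"]).mean()
    print("threshold", threshold, "accuracy", round(acc, 3))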
Multicollinearity
Multicollinearity is interdependency among the predictor variables: within the predictors, some variables depend on each other. When that happens, the final regression coefficients go for a toss; the variance of the coefficients becomes so high that we cannot trust them. If we analyze the individual impact of variables, say X1 vs. Y and X2 vs. Y, while X1 and X2 are related, the coefficients are wrong and we draw wrong conclusions. That relation between X1 and X2 is multicollinearity. We cannot keep both of them in the model at the same time; it has to be X1 or X2, because one of them carries essentially the complete information of the other.

Note that multicollinearity concerns only the predictor variables; it has nothing to do with Y vs. X. It is entirely about each X vs. the remaining X's: X1 vs. the rest, X2 vs. the rest, and so on. Multicollinearity is therefore an issue in logistic regression too, since the dependent variable hardly enters the picture. If there is multicollinearity, the logistic regression coefficients will go for a toss, so it needs to be treated: logistic regression also involves optimization, and interlinked variables will impact the coefficients and lead to wrong conclusions when we analyze the impact of each variable on Y. The relation between X and Y being nonlinear (which is why we use logistic regression in the first place) does not change this, because multicollinearity is a predictor-side issue. The identification process is also the same as in linear regression: we regress each X on the remaining X's and compute the VIF (variance inflation factor) values. We use the VIF values to identify multicollinearity, and if it is present, we give it the same treatment as before.
Multicollinearity in Python
We take the Fiberbits dataset; for multicollinearity we need to find the VIF of each predictor, where $VIF_i = 1/(1 - R_i^2)$ and $R_i^2$ comes from regressing predictor $i$ on all the other predictors. Any VIF value above 5 is an indication of multicollinearity, but that does not mean we should drop every variable whose VIF exceeds 5. If X1 depends on X2 and X2 depends on X1, then each is redundant in the presence of the other, so both will have VIF above 5; dropping both would lose the underlying information. We drop variables one at a time: compute the VIFs, look at the highest one, and if it is above 5, drop that variable and recompute.
import statsmodels.formula.api as smf

def vif_cal(input_data, dependent_col):
    # For each predictor, regress it on the remaining predictors and
    # compute VIF = 1 / (1 - R-squared)
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        y = x_vars[xvar_names[i]]
        x = x_vars[xvar_names.drop(xvar_names[i])]
        rsq = smf.ols(formula="y~x", data=x_vars).fit().rsquared
        vif = round(1 / (1 - rsq), 2)
        print(xvar_names[i], " VIF = ", vif)

vif_cal(input_data=Fiber, dependent_col="active_cust")
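As an alternative to the hand-rolled function above, statsmodels ships a helper that computes the same quantity. A minimal sketch, assuming all the predictors in Fiberbits are numeric:

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# Intercept column included so the VIFs match the regression setting
X = sm.add_constant(Fiber.drop(["active_cust"], axis=1))
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 2))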
Individual Impact of Variables
Out of all the predictor variables we used to predict buying vs. not buying, or attrition vs. non-attrition, which are the important ones? If we had to choose the top 5 variables, which would they be? When selecting a model, is there a way to drop the variables that have little or no impact and keep only the important ones, so that we do not have to collect or maintain data for the less impactful variables? Why keep variables that have no impact? How do we rank the predictor variables in order of their importance?
Individual Impact – z-values and Wald Chi-square
How do we find the individual impact of the variables? We look at the z-values, or the Wald chi-square values, reported in the output against each variable. The absolute value of z tells us the impact of that variable; we can rank variables by the Wald chi-square, which is simply the square of the z-value, so the variable with the highest squared z-value is the most important. We use absolute values because some variables impact positively and some negatively: if one variable has a z-value of -30 and another has 10, the more important one is the variable with z-value -30, since only the magnitude of the impact matters. In short, rank variables by |z| or, equivalently, by the Wald chi-square (z squared). Let us see what the top impacting variables are, and how we rank the variables, in our Fiberbits data.
LAB: Individual Impact of Variables
- Identify the top impacting and least impacting variables in the Fiberbits model.
- Find the variable importance and order the variables by their impact.
# Same feature list as before, with an explicit intercept
result1 = sm.Logit(Fiber["active_cust"], sm.add_constant(Fiber[features]))
fit1 = result1.fit()
fit1.summary()
fit1.summary2()
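To rank the variables directly, we can square the z-values of the fitted model (statsmodels exposes them as the tvalues attribute, which reports z-scores for a Logit model) and sort them; a short sketch using the fitted model above:

wald_chi2 = fit1.tvalues ** 2

# The variable with the largest Wald chi-square has the biggest impact
print(wald_chi2.drop("const").sort_values(ascending=False))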
Model Selection
While trying to build the best model, we tend to add many variables, derive new variables, and collect more data, and sometimes we try to build an optimal model by dropping variables. There are many ways to improve a model:
- Adding new independent variables
- Deriving new variables
- Transforming variables
- Collecting more data
- Or sometimes dropping a few variables, even at the cost of some accuracy.
Maybe instead of building a model with 200 or 300 variables, we want a model with just 50 or 60 variables, or maybe 20, with the best accuracy possible: we might prefer the top impacting variables with 80% accuracy over 200-300 variables with 85% accuracy. So if we have several models with similar accuracy levels, how do we choose the best, most suitable, or optimal model? Suppose we have models M1, M2, and M3: how do we know which is the apt model for the data? That question is answered by the AIC (Akaike information criterion) and BIC (Bayesian information criterion) values. A standalone AIC value has no real use; like adjusted R-squared in linear regression, it is used for comparison. When we have 2-3 models, the best fit is the one with the least AIC value. AIC is an estimate of the information lost when a given model is used to represent the process that generates the data; simply put, it is the information loss incurred while building the model, and we want to lose the least amount of information. Say model 1 has 84% accuracy with 20 variables and model 2 has 85% accuracy with 30 variables, and we must choose between them: based on AIC, we choose the model with the least information loss. The AIC formula includes the maximized likelihood of the model and the number of parameters:

$$AIC = 2k - 2\ln(L)$$

where $L$ is the maximum value of the likelihood function for the model and $k$ is the number of independent variables. So if we have too many parameters or too many independent variables, the AIC value reflects it. For a given accuracy level, we choose the model with the least AIC; if 2-3 models M1, M2, M3 have roughly the same accuracy, the one with the least information loss according to AIC is the best of them. BIC is simply a substitute for AIC with a slightly different formula; we can use either AIC or BIC (one of them is sufficient, so we can simply fix on AIC) throughout our analysis.
LAB: Logistic Regression Model Selection
- Find the AIC and BIC values for the first Fiberbits model (M1).
- What are the top-2 impacting variables in the Fiberbits model?
- What are the least impacting variables in the Fiberbits model?
- Can we drop any of these variables and build a new model (M2)?
- Can we add any new interaction or polynomial variables to increase the accuracy of the model (M3)?
- We have three models; what is the best accuracy that we can expect on this data?
# M1: all features, with an explicit intercept
m1 = sm.Logit(Fiber["active_cust"], sm.add_constant(Fiber[features]))
fit_m1 = m1.fit()
fit_m1.summary2()
fit_m1.summary()
# M2: drop income and monthly_bill
m2_features = ["months_on_network", "Num_complaints", "number_plan_changes",
               "relocated", "technical_issues_per_month", "Speed_test_result"]
m2 = sm.Logit(Fiber["active_cust"], sm.add_constant(Fiber[m2_features]))
fit_m2 = m2.fit()
fit_m2.summary()
fit_m2.summary2()
# M3: another candidate set of features
m3_features = ["income", "months_on_network", "Num_complaints",
               "number_plan_changes", "monthly_bill", "technical_issues_per_month"]
m3 = sm.Logit(Fiber["active_cust"], sm.add_constant(Fiber[m3_features]))
fit_m3 = m3.fit()
fit_m3.summary()
fit_m3.summary2()
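summary2() prints the AIC and BIC, but the fitted results also expose them directly as aic and bic attributes, which makes comparing the three models easy. A small sketch, assuming the fitted results above are named fit_m1, fit_m2, and fit_m3:

# Lower AIC/BIC means less information loss, i.e., the preferred model
for name, fit in [("M1", fit_m1), ("M2", fit_m2), ("M3", fit_m3)]:
    print(name, "AIC =", round(fit.aic, 1), "BIC =", round(fit.bic, 1))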
Conclusion: Logistic Regression
In this chapter, we learned about logistic regression: why we need it when we already have linear regression, how the two differ, how to fit a logistic regression, how to validate it, how to do model selection, and so on. Logistic regression is a good foundation for many algorithms: a solid understanding of logistic regression and its goodness-of-fit measures really helps in understanding complex machine learning algorithms like neural networks and SVMs, and the further topics will be much easier compared to this one. In fact, neural networks can be seen as built out of logistic regression units. We have to be careful while selecting the model, since all the goodness-of-fit measures here are calculated on the training data; we may have to do some out-of-time validation or cross validation to get an idea of the actual error or actual accuracy of the model. In future topics, we will discuss neural networks, SVMs, etc.