Objective
The objective is to gain insight into the application of logistic regression to a target case: exhibit the "good" conditions under which logistic regression can lead to acceptable results, and see what happens when its assumptions are not respected but the model is applied anyway.
Material
- Kaggle account http://www.kaggle.com
- Explore Kaggle's data catalog and look for the Pima Indians Diabetes Dataset collection using the keyword "diabetes". Add it to your data space.
ToDo
Look for the location of your data
# This Python 3 environment comes with many helpful analytics libraries installed.
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load.

import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import os

# Walk the input directory and print the path of every data file available to the notebook
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
Load your data
# Importing the required packages to load and analyse the pima-indian-diabetes.csv data set
import numpy as np
import pandas as pd

# Load the dataset
pima_df = pd.read_csv("/kaggle/input/diabetes/diabetes.csv")
pima_df.head(10)
Exploratory Data Analysis
# Let us check whether any of the columns has any value other than numeric,
# i.e. whether the data is corrupted (such as a "?" instead of a number),
# and also find if there are any columns with null/missing values
print(pima_df[~pima_df.applymap(np.isreal).all(1)])

null_columns = pima_df.columns[pima_df.isnull().any()]
print(pima_df[pima_df.isnull().any(axis=1)][null_columns].head())
Visualize data
import seaborn as sns

# Let's do some descriptive analysis first
print(pima_df.describe().T)

# Distribution of the Insulin attribute
# (sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is the modern equivalent)
ax = sns.histplot(pima_df['Insulin'], kde=True)
Pairplot Analysis
# Pairplot data visualization: type this code and see the output
sns.pairplot(pima_df, diag_kind='kde')
- Some of the attributes (preg, test, pedi, age) look like they may have an exponential distribution.
- Age would probably be expected to follow a normal distribution, but constraints on the data collection may have led to the skewed distribution observed (a quick numerical check of these impressions is sketched after this list).
- There is no obvious relationship between age and onset of diabetes.
- There is no obvious relationship between pedi function and onset of diabetes.
- The scatter plots for all attribute pairs show hardly any relationship between them, as a mostly cloud-like (unstructured) spread of points is observed.
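To back up these visual impressions with numbers, skewness can be computed directly with pandas. This is a minimal sketch of ours (not part of the original walkthrough) and assumes pima_df is already loaded as above:

# Skewness of each numeric column: values far from 0 indicate strongly
# asymmetric (e.g. exponential-looking) distributions
print(pima_df.skew())

# Skewness of Age within each outcome class, to see whether the skew holds for both groups
print(pima_df.groupby('Outcome')['Age'].skew())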
# Write the given code
print("pima test:", pima_df.groupby(['Outcome']).count())
- Most of the records belong to the non-diabetic class. The ratio is almost 1:2 in favor of class 0 (non-diabetic).
- There are 500 records for the non-diabetic class and 268 for the diabetic class, which will bias our model towards predicting class 0 better than class 1 (diabetic).
- Hence it is recommended to collect more samples, or to weight both classes appropriately, to make the model more effective; one option, sketched below, is to re-weight the classes during training.
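Until more data is available, a commonly used mitigation is class re-weighting. The sketch below is our addition (not part of the original walkthrough) and uses scikit-learn's class_weight='balanced' option:

from sklearn.linear_model import LogisticRegression

# 'balanced' re-weights each class inversely to its frequency, so errors on the
# rarer diabetic class are penalised more heavily during training
weighted_model = LogisticRegression(class_weight='balanced', max_iter=1000)

We do not use this weighted model in the rest of the walkthrough; it is only shown as a possible remedy.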
For the time being, let's proceed to build our logistic model and see how it scores against the given dataframe.
Logistic Model
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

# To split our dataset into training and test data
from sklearn.model_selection import train_test_split

# To calculate accuracy measures and the confusion matrix
from sklearn import metrics
Split Data Into Training & Test Data
# Select all rows and the first 8 columns, which are the independent attributes
X = pima_df.iloc[:, 0:8]

# Select all rows and the column at index 8 (the 9th column), which is the target:
# the diabetes outcome 1/0 (dependent variable)
Y = pima_df.iloc[:, 8]

test_size = 0.30  # taking a 70:30 training and test split
seed = 1          # random number seeding for repeatability of the code

# Splitting the data into train and test sets: 70% will be used for training
# the model and the remaining 30% for testing it
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
Let us build our model
# Use the LogisticRegression class imported from sklearn
Logistic_model = LogisticRegression()

# Let's pass our training data sets, X_train and y_train.
# The fit method below does all the heavy lifting to come up with the sigmoid
# function we discussed earlier, so here we get the optimal surface which will be our model
Logistic_model.fit(X_train, y_train)
Assessment
Let's See How Our Model Labels The Test Data To Make A Classification:
# Let's pass the X_test data to our model and see how it predicts a label
# for each row of the independent test data, as shown below
y_predict = Logistic_model.predict(X_test)
print("Y predict/hat ", y_predict)
# Run the code and you will see the output as given below
You can see that, using the model.predict(X_test) function, our model classified each record of X_test as 0 or 1 as its prediction.
Measure How The Model Has Performed (Scored)
Let's find out the coefficient values of the plane (surface) that our model has found as the best fit, using the code given below:
# The coefficients can be obtained as shown below, making use of the model.coef_ attribute
column_label = list(X_train.columns)  # to label all the coefficients
model_Coeff = pd.DataFrame(Logistic_model.coef_, columns=column_label)
model_Coeff['intercept'] = Logistic_model.intercept_
print("Coefficient Values Of The Surface Are: ", model_Coeff)
This is a kind of linear model, with the coefficients shown above and an intercept of -5.058877.
- These values are nothing but the weights of the linear combination
  z = 0.094·Preg + 0.0255·Plas + ... + (-5.05)
- which gets fed into our sigmoid function
  g(z) = 1 / (1 + e^(-z)).
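To make the link between these coefficients and the predicted probabilities concrete, the sketch below (our addition, assuming the Logistic_model and X_test defined above) computes z by hand and passes it through the sigmoid; the result should match predict_proba:

import numpy as np

# Linear combination z = w.x + intercept for every test row, using the coefficients above
z = np.dot(X_test, Logistic_model.coef_[0]) + Logistic_model.intercept_[0]

# Sigmoid g(z) = 1 / (1 + e^(-z)) gives the probability of class 1 (diabetic)
p_diabetic = 1 / (1 + np.exp(-z))

# These two printouts should agree (predict_proba column 1 is the probability of class 1)
print(p_diabetic[:5])
print(Logistic_model.predict_proba(X_test)[:5, 1])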
Model Score:
Let's see how our best fit model scores against the unseen test data, using the underlying logistic (sigmoid) function we discussed above.
# Pass the test data and see how our best fit model scores against it
logmodel_score = Logistic_model.score(X_test, y_test)
print("This is how our Model Scored:\n\n", logmodel_score)
The model score comes out to be 0.774, which in percentage terms is 77.4%. This is not up to the mark.
Also, it is imperative to point out that, as discussed earlier, the diabetic class is under-represented compared to the non-diabetic class in the sample data, so we should not rely on this overall score alone and should measure performance further using the confusion matrix and class-level metrics (recall, precision, etc.).
Measure Model Performance Using The Confusion Matrix:
# Note that in the confusion matrix
#   the first argument is the true values,
#   the second argument is the predicted values.
# This produces a 2x2 numpy array (matrix)
print(metrics.confusion_matrix(y_test, y_predict))
# Let's run this and see the outcome below:
The metrics.confusion_matrix method produces the square matrix shown above, where the rows correspond to the true labels (y_test) and the columns to the predicted labels (y_predict); a labeled view of the same matrix is sketched after the list below. The confusion matrix shows that our model
- correctly predicted 47 patients to be diabetic (true positives) and 132 to be non-diabetic (true negatives)
- incorrectly predicted 14 patients to be diabetic (false positives) and 38 to be non-diabetic (false negatives)
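To avoid any ambiguity about which cell is which, the raw array can be wrapped in a labeled DataFrame. This is a small convenience sketch of ours, not part of the original code:

import pandas as pd

cm = metrics.confusion_matrix(y_test, y_predict)

# Rows are the true classes, columns are the predicted classes
cm_df = pd.DataFrame(cm,
                     index=['actual non-diabetic (0)', 'actual diabetic (1)'],
                     columns=['predicted non-diabetic (0)', 'predicted diabetic (1)'])
print(cm_df)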
Calculate Recall Value: A Class Level Metric To Measure Model Performance:
Recall:
Recall (for non-diabetic) = TP / (TP + FN)
Here TP = 132 and FN = 14:
- Recall (for non-diabetic) = 132 / (132 + 14) = 132 / 146 = 0.90 = 90%
Recall (for diabetic) = TP / (TP + FN)
Here TP = 47 and FN = 38:
- Recall (for diabetic) = 47 / (47 + 38) = 47 / 85 = 0.55 = 55%
The model performs poorly for the diabetic class, which is largely explained by the lack of sample data available for that class, as we discussed earlier.
Precision:
- Precision (for non-diabetic) = TP / (TP + FP) = 132 / (132 + 38) = 132 / 170 = 0.77 = 77%
- Precision (for diabetic) = TP / (TP + FP) = 47 / (47 + 14) = 47 / 61 = 0.77 = 77%
These values are low, especially given the nature of the problem we are trying to solve (healthcare), where an accuracy of more than 95% is typically expected. A quick scikit-learn check of these class-level metrics is sketched below.
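As a sanity check on the hand-calculated values above, the same class-level metrics can be obtained directly from scikit-learn. This is a minimal sketch of ours, assuming y_test and y_predict from the earlier cells:

from sklearn.metrics import classification_report, precision_score, recall_score

# Recall and precision for the diabetic class (label 1)
print("Recall (diabetic):   ", recall_score(y_test, y_predict, pos_label=1))
print("Precision (diabetic):", precision_score(y_test, y_predict, pos_label=1))

# Full per-class report (precision, recall, f1-score) for both classes
print(classification_report(y_test, y_predict,
                            target_names=['non-diabetic (0)', 'diabetic (1)']))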
This analysis is based on the article at https://towardsdatascience.com/logistic-regression-for-dummies-a-detailed-explanation-9597f76edf46