CUSTOMER BANK CHURN MODELLING WITH MACHINE LEARNING
CONTENT
CHAPTER 1
INTRODUCTION
OBJECTIVE
REASON
BENEFITS
CHAPTER 2
DATA DESCRIPTION AND ANALYSIS (EDA)
2.1 Variable Description
2.2 Dataset Loading and Basic Statistics
2.3 Visualizing Data
2.4 Univariate Analysis
2.5 Bivariate Analysis
2.6 Pair Plot
2.7 Multivariate Analysis
CHAPTER 3
Distribution of Data
3.1 Computing Confidence Interval and Histogram
3.2 Statistics Test (K-S Test)
3.3 Plotting KDE and Q-Q Plot
CHAPTER 4
Data Pre-processing and Standardization
CHAPTER 5
Modelling Data with Machine Learning Algorithms
CHAPTER 6
Model Performance Evaluation
CHAPTER 7
Parameter Tuning and Performance Evaluation
7.1 Logistic Regression Hyper-parameter Tuning
7.2 PCA and Random Forest Implementation
7.3 Random Forest Hyper-parameter Tuning
CHAPTER 8
Neural network Implementation
8.1 Building Models of neural network
8.2 Visualizing the model
8.3 Performance Evaluation in Neural Network
8.4 Visualizing Accuracy and Loss
CHAPTER 1
INTRODUCTION
The objective of this project is to predict which bank customers will churn by means of machine learning modelling techniques.
It is strategically important for companies to manage relationships with their customers in order to increase their revenue. In business, customer relationship management (CRM) therefore aims at ensuring customer satisfaction, and companies apply CRM in order to improve their retention power.
Churn modelling allows a company to identify sufficiently far in advance which clients are likely to leave, so that it can take the necessary measures to prevent churn.
CHAPTER 2
DATA DESCRIPTION AND ANALYSIS (EDA)
2.1 Information about the variables and their types in the data
Surname : The surname of the customer
CreditScore : The credit score of the customer
Geography : The country of the customer (Germany/France/Spain)
Gender : The gender of the customer (Female/Male)
Age : The age of the customer
Tenure : The customer's number of years with the bank
Balance : The customer's account balance
NumOfProducts : The number of bank products that the customer uses
HasCrCard : Does the customer have a credit card? (0 = No, 1 = Yes)
IsActiveMember : Does the customer have an active membership? (0 = No, 1 = Yes)
EstimatedSalary : The estimated salary of the customer
Exited : Churned or not? (0 = No, 1 = Yes)
2.2 Dataset Loading and Basic Statistics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
IMPORT FILES FROM DRIVE INTO GOOGLE-COLAB:
STEP-1: Import Libraries
Code to read a CSV file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
STEP-2: Authenticate E-Mail ID
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
STEP-3: Get File from Drive using file-ID
Get the file
downloaded = drive.CreateFile({'id':'19NVxVEhM_aP_l3wzpoRo-vVG0WDi-e_K'}) # replace the id with the id of the file you want to access
downloaded.GetContentFile('Churn_Modelling.csv')
df = pd.read_csv('Churn_Modelling.csv')
df
df.shape
(10000, 14)
- The data has 10000 rows (sample points) and 14 columns (features).
- All feature column names are shown below.
df.columns
Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited'], dtype='object')
- The summary of the data frame lists all columns with their data types and the number of non-null values in each column.
df.info()
df['Geography'].value_counts()
France 5014
Germany 2509
Spain 2477
Name: Geography, dtype: int64
- So Geography can be treated as a categorical variable for analysis.
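As a small illustration (a sketch working on a copy of df, not part of the original notebook), the two text columns can be cast to pandas categorical dtypes so that later grouping and encoding steps treat them explicitly as categories:
# Sketch: cast the low-cardinality text columns to pandas categoricals on a copy
df_cat = df.copy()
for col in ['Geography', 'Gender']:
    df_cat[col] = df_cat[col].astype('category')
print(df_cat[['Geography', 'Gender']].dtypes)
print(df_cat['Geography'].cat.categories)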
df.describe()
2.3 DATA VISUALIZATION
- Visualizing missing data
df.isnull().sum()
RowNumber 0
CustomerId 0
Surname 0
CreditScore 0
Geography 0
Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
Exited 0
dtype: int64
plt.figure(figsize=(12,8))
sns.heatmap(df.isnull(), cmap='viridis')
The above heatmap shows that there is no missing data present.
Outlier visualization of the different features
2.4 Univariate Analysis
plt.figure(figsize=(15,10))
sns.boxplot(data=df, x='CreditScore', y='Geography', hue='Gender')
- It can be seen that CreditScore has a small number of lower outer-fence outliers.
plt.figure(figsize=(15,10))
sns.boxplot(data=df, x='Age', y='Geography', hue='Gender')
- The Age feature has outer-fence outliers, and there are noticeably more of them.
plt.figure(figsize=(15,10))
sns.boxplot(data=df, x='Tenure', y='Balance', hue='Gender')
plt.figure(figsize=(15,10))
sns.boxplot(data=df, y='IsActiveMember', x='EstimatedSalary', hue='Exited')
plt.figure(figsize=(15,10))
sns.boxplot(data=df, x='EstimatedSalary')
Count plots of churned vs retained customers
sns.countplot(x='Exited', data=df, hue='Gender')
sns.countplot(x='Exited', data=df, hue='Geography')
So the customers of Germany have churned the most, while France has the largest number of retained customers.
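A quick numerical cross-check of this observation (a short sketch, using the df loaded above):
# Churn rate (mean of Exited) per country, sorted descending
print(df.groupby('Geography')['Exited'].mean().sort_values(ascending=False))
# Absolute counts of churned vs retained customers per country
print(pd.crosstab(df['Geography'], df['Exited']))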
2.5 Bi-variate Analysis
import seaborn as sns
sns.scatterplot(data=df, x='Balance', y='Age', hue='Gender')
df.columns
Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited'], dtype='object')
sns.scatterplot(data=df, x='Balance', y='Exited', hue='Gender')
sns.scatterplot(data=df, x='NumOfProducts', y='Exited', hue='Gender')
plt.figure(figsize=(15,10))
sns.heatmap(df.corr().abs(),annot=True)
sns.scatterplot(data=df, x='Age', y='Exited', hue='Gender')
sns.scatterplot(data=df, x='IsActiveMember', y='Exited', hue='Gender')
sns.scatterplot(data=df, x='Balance', y='NumOfProducts', hue='Gender')
2.6 Pair plot
sns.pairplot(data=df, hue='Geography')
2.7 Multivariate Analysis
plt.figure(figsize=(20,20))
import plotly.express as px
fig = px.scatter_3d(df, x='IsActiveMember', y='Age', z='Exited',
                    color='Geography')
fig.show()
CHAPTER 3
Distribution of Data
df1 = df.drop(columns=['RowNumber', 'CustomerId', 'Surname', 'Geography', 'Gender'])
df1
list1=list(df1.columns)
list1
['CreditScore',
'Age',
'Tenure',
'Balance',
'NumOfProducts',
'HasCrCard',
'IsActiveMember',
'EstimatedSalary',
'Exited']
3.1 Computing Confidence Interval and Histogram
import numpy
from sklearn.utils import resample
from matplotlib import pyplot
import seaborn as sns

# Bootstrap the median of every numeric feature in df1
for feature in list1:
    x = df1[feature]
    print(feature)
    # KDE of the raw feature values
    sns.kdeplot(x, shade=True, bw_adjust=100)
    pyplot.show()
    # configure bootstrap
    n_iterations = 1000
    n_size = int(len(x))
    # run bootstrap
    medians = list()
    for _ in range(n_iterations):
        # resample with replacement and record the median of the resample
        s = resample(x, n_samples=n_size)
        medians.append(numpy.median(s))
    # plot the bootstrap distribution of the median
    pyplot.hist(medians)
    pyplot.show()
    # confidence interval from the bootstrap percentiles
    alpha = 0.95
    p = ((1.0 - alpha) / 2.0) * 100
    lower = numpy.percentile(medians, p)
    p = (alpha + ((1.0 - alpha) / 2.0)) * 100
    upper = numpy.percentile(medians, p)
    print('%.1f%% confidence interval: %.1f to %.1f' % (alpha * 100, lower, upper))
3.2 Statistics Test (K-S Test)
import numpy as np
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
for i in list1:
    x = df1[i]
    print(stats.kstest(x, 'norm'))
KstestResult(statistic=1.0, pvalue=0.0)
KstestResult(statistic=1.0, pvalue=0.0)
KstestResult(statistic=0.8324498680518208, pvalue=0.0)
KstestResult(statistic=0.6383, pvalue=0.0)
KstestResult(statistic=0.8413447460685429, pvalue=0.0)
KstestResult(statistic=0.5468447460685429, pvalue=0.0)
KstestResult(statistic=0.5, pvalue=0.0)
KstestResult(statistic=1.0, pvalue=0.0)
KstestResult(statistic=0.5, pvalue=0.0)
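Note that stats.kstest(x, 'norm') compares the raw values against a standard normal with mean 0 and standard deviation 1, so unstandardized features such as EstimatedSalary are rejected almost trivially. A more informative variant (a sketch, not part of the original output) z-scores each feature before the test:
from scipy import stats
for i in list1:
    x = df1[i]
    z = (x - x.mean()) / x.std()      # standardize before comparing to N(0, 1)
    stat, p = stats.kstest(z, 'norm')
    print(f'{i}: statistic={stat:.4f}, p-value={p:.4g}')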
3.3 Plotting KDE and Q-Q plot sequentially
import pylab

stats.probplot(df1['Age'], dist="norm", plot=pylab)
pylab.show()
sns.kdeplot(df1['Age'], shade=True, bw_adjust=100)

stats.probplot(df1['Tenure'], dist="norm", plot=pylab)
pylab.show()
sns.kdeplot(df1['Tenure'], shade=True, bw_adjust=100)

stats.probplot(df1['Balance'], dist="norm", plot=pylab)
pylab.show()
sns.kdeplot(df1['Balance'], shade=True, bw_adjust=100)

stats.probplot(df1['NumOfProducts'], dist="norm", plot=pylab)
pylab.show()
sns.kdeplot(df1['NumOfProducts'], shade=True, bw_adjust=100)

stats.probplot(df1['HasCrCard'], dist="norm", plot=pylab)
pylab.show()
sns.kdeplot(df1['HasCrCard'], shade=True, bw_adjust=100)

stats.probplot(df1['IsActiveMember'], dist="norm", plot=pylab)
pylab.show()
sns.kdeplot(df1['IsActiveMember'], shade=True, bw_adjust=100)

stats.probplot(df1['EstimatedSalary'], dist="norm", plot=pylab)
pylab.show()
sns.kdeplot(df1['EstimatedSalary'], shade=True, bw_adjust=100)
CHAPTER 4
DATA PRE-PROCESSING AND STANDARDIZATION
In the previous chapter we saw that the KDE of every feature looks roughly bell-shaped, resembling a normal distribution, but the Q-Q plots show that the features "Tenure", "Balance" and "NumOfProducts" have widely dispersed outliers. To reduce the impact of these outliers and obtain a more reliable model, we will standardize the data so that every feature has zero mean and unit standard deviation; this is less sensitive to outliers than min-max normalization, which we therefore do not use.
df.info()
The dataset has no missing values. Note that the target classes are not perfectly balanced: roughly 20% of the customers churned, which is why stratified splitting is used later. If there were missing values, we would impute them with the column median, since the median is less affected by outliers.
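For completeness, a minimal sketch of the median-imputation strategy mentioned above (hypothetical here, since this dataset has no missing values; it operates on a copy of df):
# Hypothetical sketch of median imputation for numeric columns with missing values
df_imp = df.copy()
num_cols = df_imp.select_dtypes(include='number').columns
df_imp[num_cols] = df_imp[num_cols].fillna(df_imp[num_cols].median())
print(df_imp[num_cols].isnull().sum().sum())   # 0 missing values remain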
df2 = df.drop(columns=['RowNumber', 'CustomerId', 'Surname'])
df2
Here we have dropped the "RowNumber", "CustomerId" and "Surname" columns, as they carry no useful information for the prediction model.
df3 = df2.loc[:, ['Geography', 'Gender']]
df3
df3.shape
(10000, 2)
pd.get_dummies(df3).shape
(10000, 5)
df6=pd.get_dummies(df3)
df6
df7 = df6.drop('Gender_Male', axis=1)
df7
df8 = df.drop(columns=['RowNumber', 'CustomerId', 'Surname', 'Gender', 'Geography'])
df8
result = pd.concat([df7, df8], ignore_index=False, sort=False,axis=1)
result1=result.iloc[:,:-1]
result1
DATA STANDARDIZATION
from sklearn import preprocessing
# Get column names first
names = result1.columns
# Create the Scaler object
scaler = preprocessing.StandardScaler()
# Fit your data on the scaler object
scaled_df = scaler.fit_transform(result1)
scaled_df = pd.DataFrame(scaled_df, columns=names)
scaled_df
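A quick sanity check (a short sketch using the scaled_df computed above) confirms that each standardized column now has mean close to 0 and standard deviation close to 1:
# Each column should now have mean ~0 and standard deviation ~1
print(scaled_df.mean().round(3))
print(scaled_df.std().round(3))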
X = scaled_df
y = result.iloc[:,-1]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=111,stratify=y)
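Because stratify=y is used, the churn ratio is preserved in both splits; this can be verified with a short check (a sketch, assuming y, y_train and y_test as defined above):
# The proportion of churned customers should match across the full data, train and test splits
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))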
CHAPTER 5
MODELLING DATA WITH MACHINE LEARNING ALGORITHMS
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model, metrics
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=111,stratify=y)
#model.fit(X_train,y_train)
algorithms = [LogisticRegression(), GaussianNB(), RandomForestClassifier()]
models = []
from sklearn.metrics import classification_report
for model in algorithms:
    # fit each classifier on the training split and report on the test split
    model.fit(X_train, y_train)
    b = model.predict(X_test)
    print(model, classification_report(y_test, b))
    # accuracy_score (imported above) replaces the original r2_score, which is not an accuracy measure
    a = accuracy_score(y_test, b)
    models.append([model, a])

df10 = pd.DataFrame(models, columns=['model', 'Accuracy'])
df10
LogisticRegression()
              precision    recall  f1-score   support

           0       0.83      0.96      0.89      2389
           1       0.57      0.22      0.31       611

    accuracy                           0.81      3000
   macro avg       0.70      0.59      0.60      3000
weighted avg       0.77      0.81      0.77      3000

GaussianNB()
              precision    recall  f1-score   support

           0       0.85      0.92      0.88      2389
           1       0.54      0.36      0.43       611

    accuracy                           0.81      3000
   macro avg       0.69      0.64      0.66      3000
weighted avg       0.79      0.81      0.79      3000

RandomForestClassifier()
              precision    recall  f1-score   support

           0       0.88      0.96      0.91      2389
           1       0.74      0.47      0.57       611

    accuracy                           0.86      3000
   macro avg       0.81      0.71      0.74      3000
weighted avg       0.85      0.86      0.84      3000
CHAPTER 6
MODEL PERFORMANCE AND EVALUATION
We plot the ROC curve and its AUC for the different algorithms. The ROC curve plots the true positive rate against the false positive rate as the classification threshold is varied. An algorithm with a larger AUC value performs better than one with a lower AUC value.
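To make the axes concrete, TPR and FPR can be computed from a confusion matrix at a single threshold; the ROC curve simply sweeps this threshold. A short sketch, using the last classifier fitted in the Chapter 5 loop (the random forest, left in the variable model after the loop):
from sklearn.metrics import confusion_matrix

# Illustration at a single threshold; the ROC curve repeats this for every threshold
y_pred = model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
tpr = tp / (tp + fn)   # true positive rate: correctly flagged churners
fpr = fp / (fp + tn)   # false positive rate: retained customers wrongly flagged
print('TPR = %.3f, FPR = %.3f' % (tpr, fpr))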
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model, metrics
from sklearn import svm
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve,auc
sc_x=StandardScaler()
X_train=sc_x.fit_transform(X_train)
X_test=sc_x.transform(X_test)
# SVM CLASSIFIER
model_svc = SVC()
model_svc.fit(X_train,y_train)
y_pred_svm = model_svc.decision_function(X_test)
#GaussianNB
model_GNB=GaussianNB()
model_GNB.fit(X_train,y_train)
y_pred_GNB = model_GNB.predict_proba(X_test)
#RandomForestClassifier
model_RFC=RandomForestClassifier()
model_RFC.fit(X_train,y_train)
y_pred_RFC = model_RFC.predict_proba(X_test)
#logistic classifier
model_logistic=LogisticRegression()
model_logistic.fit(X_train,y_train)
y_pred_logistic = model_logistic.decision_function(X_test)
#plot ROC and compare AUC
from sklearn.metrics import roc_auc_score,auc
log_fpr,log_tpr,threshold=roc_curve(y_test,y_pred_logistic)
GNB_fpr,GNB_tpr,threshold=roc_curve(y_test,y_pred_GNB[:,1])
RFC_fpr,RFC_tpr,threshold=roc_curve(y_test,y_pred_RFC[:,1])
svm_fpr,svm_tpr,threshold=roc_curve(y_test,y_pred_svm)
#AUC
auc_svm=auc(svm_fpr,svm_tpr)
auc_log=auc(log_fpr,log_tpr)
auc_GNB=auc(GNB_fpr,GNB_tpr)
auc_RFC=auc(RFC_fpr,RFC_tpr)
plt.figure(dpi=100)
plt.plot(svm_fpr, svm_tpr, linestyle='-', label='SVM (auc=%0.3f)' % auc_svm)
plt.plot(log_fpr, log_tpr, linestyle='-', label='Logistic (auc=%0.3f)' % auc_log)
plt.plot(GNB_fpr, GNB_tpr, linestyle='-', label='GNB (auc=%0.3f)' % auc_GNB)
plt.plot(RFC_fpr, RFC_tpr, linestyle='-', label='RFC (auc=%0.3f)' % auc_RFC)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
It can be clearly seen that the area under the curve satisfies RFC > SVM > GNB > Logistic. So it can be concluded that, for this data, the random forest classifier misclassifies fewer points than the other algorithms.
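The same ordering can be checked numerically with roc_auc_score (a short sketch, reusing the score arrays computed above):
from sklearn.metrics import roc_auc_score
print('RFC AUC:', roc_auc_score(y_test, y_pred_RFC[:, 1]))
print('SVM AUC:', roc_auc_score(y_test, y_pred_svm))
print('GNB AUC:', roc_auc_score(y_test, y_pred_GNB[:, 1]))
print('LOG AUC:', roc_auc_score(y_test, y_pred_logistic))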
# One-hot encode the target for the neural network's softmax output (Chapter 8);
# the original 0/1 labels in y_train and y_test are kept for the scikit-learn models.
y_train_nn = pd.get_dummies(y_train)
y_test_nn = pd.get_dummies(y_test)
y_train_nn
CHAPTER 7
PARAMETER TUNING AND PERFORMANCE EVALUATION
Here we take logistic regression for hyper-parameter tuning because it has lower accuracy than the other algorithms. We try to improve its accuracy through tuning and thereby improve the performance of the model.
7.1 Logistic Regression Hyper-parameter Tuning
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import warnings
estimators = []
estimators.append(('SC', StandardScaler()))
estimators.append(('LR', LogisticRegression()))
estimators
from sklearn.pipeline import Pipeline
model = Pipeline(estimators)
model.fit(X_train,y_train)
model.score(X_test,y_test)
from sklearn.model_selection import cross_val_score, StratifiedKFold
skf = StratifiedKFold(n_splits=10, shuffle=True,random_state=111)
results = cross_val_score(model, X, y, cv=skf)
print(results.mean())
from sklearn.model_selection import GridSearchCV
model.get_params()
pg = {'LR__C': [0.001, 0.01, 0.1, 1.0],
      'LR__penalty': ['l1', 'l2', 'elasticnet', 'none']}
# Note: with the default lbfgs solver only 'l2' and 'none' are supported;
# 'l1' and 'elasticnet' require solver='saga' (elasticnet also needs l1_ratio),
# so those candidates fail to fit and are scored as NaN by the search.
gs_model = GridSearchCV(model, param_grid=pg, cv=10, verbose=2)
gs_model.fit(X_train,y_train)
gs_model.best_params_
gs_model.best_score_
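The tuned pipeline can then be scored on the held-out test set (a short sketch; GridSearchCV refits the best estimator on the training data automatically):
from sklearn.metrics import classification_report
best_lr = gs_model.best_estimator_                     # pipeline refit with the best C and penalty
print('Test accuracy:', best_lr.score(X_test, y_test))
print(classification_report(y_test, best_lr.predict(X_test)))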
7.2 PCA and RandomForest Implementation
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
steps = []
steps.append(('PCA', PCA(n_components=3)))
steps.append(('RF', RandomForestClassifier()))
model = Pipeline(steps)
model.fit(X_train,y_train)
model.score(X_test,y_test)
0.818
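The variance retained by the three principal components can be inspected from the fitted pipeline (a short sketch using the model defined above):
pca = model.named_steps['PCA']
print(pca.explained_variance_ratio_)          # share of variance captured by each component
print(pca.explained_variance_ratio_.sum())    # total variance retained by the 3 components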
7.3 Random Forest Hyper-parameter Tuning
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestRegressor
# Note: a RandomForestRegressor is used here, following the original run;
# a RandomForestClassifier would match the binary churn target more directly.
rf = RandomForestRegressor(random_state=42)
from sklearn.model_selection import RandomizedSearchCV
skf = StratifiedKFold(n_splits=10, shuffle=True,random_state=111)
results = cross_val_score(model, X, y, cv=skf)
print(results.mean())
from sklearn.model_selection import GridSearchCV
model.get_params()
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(3,13, num =10)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)
rf = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train,y_train)
rf_random.best_params_
0.8219999999999998
{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}
Fitting 3 folds for each of 100 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 37 tasks | elapsed: 6.2min
[Parallel(n_jobs=-1)]: Done 158 tasks | elapsed: 18.6min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 38.1min finished
RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(), n_iter=100, n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]},
                   random_state=42, verbose=2)
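The full cross-validation results of the randomized search can also be inspected as a data frame (a short sketch, assuming rf_random has been fitted as above):
cv_results = pd.DataFrame(rf_random.cv_results_)
cols = ['mean_test_score', 'std_test_score', 'param_n_estimators', 'param_max_depth']
print(cv_results.sort_values('mean_test_score', ascending=False)[cols].head())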
CHAPTER 8
NEURAL NETWORK IMPLEMENTATION
import keras
#pip install keras,tensorflow
from keras.models import Sequential
from keras.layers import Dense
8.1 Building models of neural network
model = Sequential()
#input layer
model.add(Dense(units=12, input_dim=12, activation='relu'))
#activation functions could be: sigmoid, relu, leaky relu, tanh
# 1st hidden layer
model.add(Dense(units=8, activation='relu'))
# 2nd hidden layer
model.add(Dense(units=4, activation='relu'))
# Output Layer
model.add(Dense(units=2, activation='softmax'))
model.summary()
Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_28 (Dense)             (None, 12)                156
_________________________________________________________________
dense_29 (Dense)             (None, 8)                 104
_________________________________________________________________
dense_30 (Dense)             (None, 4)                 36
_________________________________________________________________
dense_31 (Dense)             (None, 2)                 10
=================================================================
Total params: 306
Trainable params: 306
Non-trainable params: 0
_________________________________________________________________
8.2 Visualizing the Model
#Visualizing the model
%pip install ann_visualizer
from ann_visualizer.visualize import ann_viz
from graphviz import Source
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train_nn, epochs=25, batch_size=50)
#Prediction time
dm_pred = model.predict(X_test)
dm_pred.shape
(3000, 2)
dm_pred[0]
array([0.99096984, 0.00903017], dtype=float32)
np.max(dm_pred[0])
0.99096984
ann_viz(model, title='Neural Network Model of churn prediction')
graph_source = Source.from_file('network.gv')
graph_source
8.3 Performance evaluation in Neural Network
test = []
for i in range(len(y_test_nn)):
    test.append(np.argmax(y_test_nn.values[i]))   # convert the one-hot rows back to 0/1 labels
pred = []
for i in range(len(dm_pred)):
    pred.append(np.argmax(dm_pred[i]))            # predicted class = index of the larger softmax output
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(test,pred))
print(classification_report(test,pred))
history = model.fit(X_train, y_train_nn, epochs=25, batch_size=50, validation_data=(X_test, y_test_nn))
8.4 Visualize Accuracy and Loss
#To visualize accuracy and loss
import matplotlib.pyplot as plt
plt.figure(figsize=(10,8))
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(['Train', 'Test'], loc='lower right')
plt.figure(figsize=(10,8))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(['Train', 'Test'], loc='upper right')
So we can see from the above graphs that the training loss decreases as the number of epochs increases; that is, the more epochs the model is trained for, the lower its training loss becomes, while the validation curves show how well the model generalizes to the test data.
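If the validation loss were to flatten or rise while the training loss keeps falling, the model would be starting to overfit; a common safeguard (a sketch, not used in the run above) is Keras early stopping:
from keras.callbacks import EarlyStopping

# Stop training once the validation loss has not improved for 5 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
history = model.fit(X_train, y_train_nn, epochs=100, batch_size=50,
                    validation_data=(X_test, y_test_nn), callbacks=[early_stop])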