
Predicting Loan Repayment


The dataset for this project was retrieved from Kaggle.

The main aim of this project is to predict whether a customer will repay their loan. This is therefore a supervised binary classification problem.

1- Importing Libraries

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import plotly.express as px

2- Getting Data

python
df=pd.read_csv('../input/loan-prediction-problem-dataset/train_u6lujuX_CVtuZ9i.csv')
python
df.head()
python
df.shape
2-1-Renaming columns
python
df.columns=df.columns.str.lower()
python
df.columns=['loan_id', 'gender', 'married', 'dependents', 'education','self_employed', 'applicant_income', 'co-applicant_income', 'loan_amount', 'loan_amount_term', 'credit_history', 'property_area', 'loan_status']
2-2-Checking null values
python
df.isnull().sum()

We impute the missing values in "loan_amount" and "credit_history". For the remaining null values, we drop the affected rows; a column would only be dropped if more than 70-75% of its values were missing. Dropping rows is advisable only when the dataset has enough samples.
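The drop-threshold rule described above can be sketched as follows. This uses a hypothetical two-column frame, not the loan data; with the real data you would compute `df.isnull().mean()` directly.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column 'b' is 75% missing, column 'a' only 25%
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
})

missing_ratio = df.isnull().mean()   # fraction of missing values per column
to_drop = missing_ratio[missing_ratio > 0.70].index.tolist()
print(to_drop)  # ['b']
```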

python
df['loan_amount']=df['loan_amount'].fillna(df['loan_amount'].mean())   
python
df['credit_history']=df['credit_history'].fillna(df['credit_history'].median())   
python
df.dropna(axis=0, inplace=True)
python
df.isnull().sum()
python
df.head()
python
df.shape
python
df.info()
python
df.describe()
2-3-Label Encoder for Dependents
python
type(df['dependents'].iloc[0])
python
df['dependents'].unique()
python
model0=LabelEncoder()
python
model0.fit(df['dependents'])
python
df['dependents']= model0.transform(df['dependents'])
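To see what `LabelEncoder` does here, consider a toy series with the values found in the `dependents` column ('0', '1', '2', '3+'). The encoder sorts the unique labels and maps each to an integer code:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for df['dependents']
dependents = pd.Series(['0', '1', '2', '3+', '0'])

enc = LabelEncoder()
codes = enc.fit_transform(dependents)

print(list(enc.classes_))  # ['0', '1', '2', '3+']
print(list(codes))         # [0, 1, 2, 3, 0]
```

Keeping the fitted encoder around also lets you recover the original labels later via `enc.inverse_transform`.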

3-Exploratory Data Analysis

3-1- Visualization
python
df[df['loan_status']=='Y'].count()['loan_status']
python
df[df['loan_status']=='N'].count()['loan_status']
python
plt.figure(figsize=(8,8))
plt.pie(x=[376,166], labels=['Yes','No'], autopct='%1.0f%%', pctdistance=0.5,labeldistance=0.7,colors=['g','r'])
plt.title('Distribution of Loan Status')

Roughly 69% of applicants repaid the loan and 31% did not.
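These shares can be computed rather than hard-coded into the pie chart. The counts below match the dataset (376 'Y', 166 'N'); with the real frame you would use `df['loan_status'].value_counts(normalize=True)`:

```python
import pandas as pd

# Stand-in series with the same class counts as the dataset
status = pd.Series(['Y'] * 376 + ['N'] * 166)

shares = status.value_counts(normalize=True).round(2)
print(shares.to_dict())  # {'Y': 0.69, 'N': 0.31}
```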

python
plt.figure(figsize=(15,10))

plt.subplot(2,3,1)
sns.countplot(x='gender' ,hue='loan_status', data=df,palette='plasma')

plt.subplot(2,3,2)
sns.countplot(x='married',hue='loan_status',data=df,palette='viridis')
plt.ylabel(' ')
plt.yticks([ ])

plt.subplot(2,3,3)
sns.countplot(x='education',hue='loan_status',data=df,palette='copper')
plt.ylabel(' ')
plt.yticks([ ])

plt.subplot(2,3,4)
sns.countplot(x='credit_history', data=df,hue='loan_status',palette='summer')

plt.subplot(2,3,5)
sns.countplot(x='self_employed',hue='loan_status',data=df,palette='autumn')
plt.ylabel(' ')
plt.yticks([ ])

plt.subplot(2,3,6)
sns.countplot(x='property_area',data=df,hue='loan_status',palette='PuBuGn')
plt.ylabel(' ')
plt.yticks([ ])

Note that the count plots show counts, not rates, so larger groups naturally show more of both outcomes. With that caveat:

Comparing genders, male applicants account for more repaid loans than female applicants.

Married applicants account for more repaid loans than unmarried applicants.

Graduates account for more repaid loans than non-graduates.

Applicants who are not self-employed account for more repaid loans than the self-employed.

Applicants with a credit history are markedly more likely to have repaid their loan than those without one.

Applicants from semi-urban areas account for the most repaid loans.
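To turn the raw counts above into actual repayment rates per group, a normalized crosstab is useful. This is a sketch on a hypothetical mini-sample; with the real data you would run `pd.crosstab(df['credit_history'], df['loan_status'], normalize='index')`:

```python
import pandas as pd

# Hypothetical mini-sample of two columns from the dataset
df = pd.DataFrame({
    'credit_history': [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    'loan_status':    ['Y', 'Y', 'N', 'N', 'N', 'Y'],
})

# Each row sums to 1: the fraction of 'Y' / 'N' within each credit_history group
rates = pd.crosstab(df['credit_history'], df['loan_status'], normalize='index')
print(rates)
```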

python
px.sunburst( data_frame=df,path=['gender','loan_status'], color='loan_amount')
python
plt.figure(figsize=(15,10))

plt.subplot(2,3,1)
sns.violinplot(x='gender', y='loan_amount',hue='loan_status', data=df,palette='plasma')

plt.subplot(2,3,2)
sns.violinplot(x='married',y='loan_amount',hue='loan_status',data=df,palette='viridis')
plt.ylabel(' ')
plt.yticks([ ])

plt.subplot(2,3,3)
sns.violinplot(x='education',y='loan_amount',hue='loan_status',data=df,palette='copper')
plt.ylabel(' ')
plt.yticks([ ])

plt.subplot(2,3,4)
sns.violinplot(x='credit_history',y='loan_amount', data=df,hue='loan_status',palette='summer')

plt.subplot(2,3,5)
sns.violinplot(x='self_employed',y='loan_amount',hue='loan_status',data=df,palette='autumn')
plt.ylabel(' ')
plt.yticks([ ])

plt.subplot(2,3,6)
sns.violinplot(x='property_area', y='loan_amount',data=df,hue='loan_status',palette='PuBuGn')
plt.ylabel(' ')
plt.yticks([ ])
python
plt.figure(figsize=(18,5))

plt.subplot(1,3,1)
# sns.distplot is deprecated in recent seaborn; histplot(kde=True) replaces it
sns.histplot(df['applicant_income'], bins=30, kde=True, color='r', edgecolor='white')
plt.ylabel('frequency')

plt.subplot(1,3,2)
sns.histplot(df['co-applicant_income'], bins=30, kde=True, color='blue', edgecolor='white')

plt.subplot(1,3,3)
sns.histplot(df['loan_amount'], bins=30, kde=True, color='black', edgecolor='white')
python
px.scatter_3d(data_frame=df,x='applicant_income',y='co-applicant_income',z='loan_amount',color='loan_status')
3-2-Encoding
3-2-1-gender
python
model1=LabelEncoder()
python
model1.fit(df['gender'])
python
df['gender']= model1.transform(df['gender'])
3-2-2-married
python
model2=LabelEncoder()
python
model2.fit(df['married'])
python
df['married']= model2.transform(df['married'])
3-2-3-education
python
model3=LabelEncoder()
python
model3.fit(df['education'])
python
df['education']= model3.transform(df['education'])
3-2-4-self_employed
python
model4=LabelEncoder()
python
model4.fit(df['self_employed'])
python
df['self_employed']= model4.transform(df['self_employed'])
3-2-5-property_area
python
model5=LabelEncoder()
python
model5.fit(df['property_area'])
python
df['property_area']= model5.transform(df['property_area'])
3-2-6-loan status
python
model6=LabelEncoder()
python
model6.fit(df['loan_status'])
python
df['loan_status']= model6.transform(df['loan_status'])
python
df.head()
python
plt.figure(figsize=(12,8))

corr = df.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, square=True,annot=True,linewidths=2, cmap='viridis')
plt.title('Correlation Matrix for Loan Status')

From the figure above, Credit_History (an independent variable) has the highest correlation with Loan_Status (the dependent variable), which indicates that Loan_Status depends heavily on Credit_History.
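Reading the strongest predictor off the heatmap can also be done programmatically by ranking the correlation column for the target. This is a sketch on a hypothetical numeric mini-frame; with the real encoded data you would use `df.corr()['loan_status']`:

```python
import pandas as pd

# Hypothetical numeric mini-frame (already label-encoded)
df = pd.DataFrame({
    'credit_history':   [1, 1, 0, 1, 0],
    'applicant_income': [5000, 3000, 4000, 6000, 2500],
    'loan_status':      [1, 1, 0, 1, 0],
})

# Rank features by absolute correlation with the target
ranking = (df.corr()['loan_status']
             .drop('loan_status')
             .abs()
             .sort_values(ascending=False))
print(ranking.index[0])  # the feature most correlated with loan_status
```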

4-Prediction

4-1-LogisticRegression
python
X=df.drop(['loan_id','loan_status'],axis=1)
y=df['loan_status']
python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0)
python
lr=LogisticRegression()
python
lr.fit(X_train, y_train)
python
lr_prediction=lr.predict(X_test)
python
print(confusion_matrix(y_test,lr_prediction))
print('\n')
print(classification_report(y_test,lr_prediction))
print('\n')
print('Logistic Regression accuracy: ', accuracy_score(y_test,lr_prediction))

4-2-Decision Tree

python
dt=DecisionTreeClassifier()
python
dt.fit(X_train, y_train)
python
dt_prediction=dt.predict(X_test)
python
print(confusion_matrix(y_test,dt_prediction))
print('\n')
print(classification_report(y_test,dt_prediction))
print('\n')
print('Decision Tree Accuracy: ', accuracy_score(y_test,dt_prediction))
4-3-Random Forest
python
rf=RandomForestClassifier(n_estimators=200)
python
rf.fit(X_train, y_train)
python
rf_prediction=rf.predict(X_test)
python
print(confusion_matrix(y_test,rf_prediction))
print('\n')
print(classification_report(y_test,rf_prediction))
print('\n')
print('Random Forest Accuracy: ', accuracy_score(y_test,rf_prediction))
4-4-KNearest Neighbors
python
error_rate=[]
for n in range(1,40):
    knn=KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train, y_train)
    knn_prediction=knn.predict(X_test)
    error_rate.append(np.mean(knn_prediction!=y_test))
print(error_rate)
python
plt.figure(figsize=(8,6))
sns.set_style('whitegrid')
plt.plot(list(range(1,40)),error_rate,color='b', marker='o', linewidth=2, markersize=12, markerfacecolor='r', markeredgecolor='r')
plt.xlabel('Number of Neighbors')
plt.ylabel('Error Rate')
plt.title('Elbow Method')
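Instead of eyeballing the elbow plot, the k with the lowest test error can be picked directly from the `error_rate` list. The values below are illustrative, not the notebook's actual errors:

```python
import numpy as np

# Hypothetical error rates for k = 1..8; with the real run you would
# reuse the error_rate list built in the loop above
error_rate = [0.30, 0.26, 0.24, 0.21, 0.22, 0.21, 0.23, 0.25]

best_k = int(np.argmin(error_rate)) + 1  # +1 because k starts at 1
print(best_k)  # 4
```

Note that minimizing test-set error this way is a mild form of tuning on the test set; a cleaner approach would use cross-validation (e.g. `GridSearchCV`, already imported above).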
python
knn=KNeighborsClassifier(n_neighbors=23)
python
knn.fit(X_train, y_train)
python
knn_prediction=knn.predict(X_test)
python
print(confusion_matrix(y_test,knn_prediction))
print('\n')
print(classification_report(y_test,knn_prediction))
print('\n')
print('KNN Accuracy: ', accuracy_score(y_test,knn_prediction))
4-5-SVC
python
svc=SVC()
python
svc.fit(X_train, y_train)
python
svc_prediction=svc.predict(X_test)
python
print(confusion_matrix(y_test,svc_prediction))
print('\n')
print(classification_report(y_test,svc_prediction))
print('\n')
print('SVC Accuracy: ', accuracy_score(y_test,svc_prediction))
python
print('Logistic Regression Accuracy: ', accuracy_score(y_test,lr_prediction))
print('Decision Tree Accuracy: ', accuracy_score(y_test,dt_prediction))
print('Random Forest Accuracy: ', accuracy_score(y_test,rf_prediction))
print('KNN Accuracy: ', accuracy_score(y_test,knn_prediction))
print('SVC Accuracy: ', accuracy_score(y_test,svc_prediction))
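The winner among the five models can also be selected programmatically from the scores. The numbers here are illustrative placeholders, not the notebook's exact results:

```python
# Hypothetical accuracy scores; with the real run you would fill this dict
# with accuracy_score(y_test, <model>_prediction) for each model
scores = {
    'Logistic Regression': 0.80,
    'Decision Tree': 0.71,
    'Random Forest': 0.77,
    'KNN': 0.65,
    'SVC': 0.68,
}

best_model = max(scores, key=scores.get)
print(best_model)  # Logistic Regression
```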

CONCLUSION

The loan status depends heavily on the credit history.

Logistic Regression achieves the highest accuracy (80%) of the five classification algorithms compared.