PART I: German Credit Score Classification Model EDA

By: Krishna J

Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn               as sns
import matplotlib.pyplot     as plt
import shap
import eli5
from sklearn.model_selection import train_test_split
#from sklearn.ensemble        import RandomForestClassifier
#from sklearn.linear_model    import LogisticRegression
from sklearn.preprocessing   import MinMaxScaler, StandardScaler
from sklearn.base            import TransformerMixin
from sklearn.pipeline        import Pipeline, FeatureUnion
from typing                  import List, Union, Dict
# Warnings will be used to silence various model warnings for tidier output
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
np.random.seed(0)

Importing the source dataset

Source:

https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

Professor Dr. Hans Hofmann, Institut für Statistik und Ökonometrie, Universität Hamburg, FB Wirtschaftswissenschaften, Von-Melle-Park 5, 2000 Hamburg 13

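The repository also provides a numeric version of the file (german.data-numeric), which has been edited and had several indicator variables added to make it suitable for algorithms which cannot cope with categorical variables; in that version, several ordered categorical attributes (such as attribute 17) have been coded as integers. This notebook loads the original categorical file, whose values are codes such as A11.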

In [2]:
feature_list = ['CurrentAcc', 'NumMonths', 'CreditHistory', 'Purpose', 'CreditAmount', 
         'Savings', 'EmployDuration', 'PayBackPercent', 'Gender', 'Debtors', 
         'ResidenceDuration', 'Collateral', 'Age', 'OtherPayBackPlan', 'Property', 
         'ExistingCredit', 'Job', 'Dependents', 'Telephone', 'Foreignworker', 'CreditStatus']

german_xai = pd.read_csv('C:/Users/krish/Downloads/german.data.txt',names = feature_list, delimiter=' ')
In [3]:
german_xai.head()
Out[3]:
CurrentAcc NumMonths CreditHistory Purpose CreditAmount Savings EmployDuration PayBackPercent Gender Debtors ... Collateral Age OtherPayBackPlan Property ExistingCredit Job Dependents Telephone Foreignworker CreditStatus
0 A11 6 A34 A43 1169 A65 A75 4 A93 A101 ... A121 67 A143 A152 2 A173 1 A192 A201 1
1 A12 48 A32 A43 5951 A61 A73 2 A92 A101 ... A121 22 A143 A152 1 A173 1 A191 A201 2
2 A14 12 A34 A46 2096 A61 A74 2 A93 A101 ... A121 49 A143 A152 1 A172 2 A191 A201 1
3 A11 42 A32 A42 7882 A61 A74 2 A93 A103 ... A122 45 A143 A153 1 A173 2 A191 A201 1
4 A11 24 A33 A40 4870 A61 A73 3 A93 A101 ... A124 53 A143 A153 2 A173 2 A191 A201 2

5 rows × 21 columns

In [4]:
german_xai.shape
Out[4]:
(1000, 21)

The dataset has 1000 entries with 21 fields.

In [5]:
type(german_xai)
Out[5]:
pandas.core.frame.DataFrame
In [6]:
german_xai.head(10)

german_xai.columns
Out[6]:
CurrentAcc NumMonths CreditHistory Purpose CreditAmount Savings EmployDuration PayBackPercent Gender Debtors ... Collateral Age OtherPayBackPlan Property ExistingCredit Job Dependents Telephone Foreignworker CreditStatus
0 A11 6 A34 A43 1169 A65 A75 4 A93 A101 ... A121 67 A143 A152 2 A173 1 A192 A201 1
1 A12 48 A32 A43 5951 A61 A73 2 A92 A101 ... A121 22 A143 A152 1 A173 1 A191 A201 2
2 A14 12 A34 A46 2096 A61 A74 2 A93 A101 ... A121 49 A143 A152 1 A172 2 A191 A201 1
3 A11 42 A32 A42 7882 A61 A74 2 A93 A103 ... A122 45 A143 A153 1 A173 2 A191 A201 1
4 A11 24 A33 A40 4870 A61 A73 3 A93 A101 ... A124 53 A143 A153 2 A173 2 A191 A201 2
5 A14 36 A32 A46 9055 A65 A73 2 A93 A101 ... A124 35 A143 A153 1 A172 2 A192 A201 1
6 A14 24 A32 A42 2835 A63 A75 3 A93 A101 ... A122 53 A143 A152 1 A173 1 A191 A201 1
7 A12 36 A32 A41 6948 A61 A73 2 A93 A101 ... A123 35 A143 A151 1 A174 1 A192 A201 1
8 A14 12 A32 A43 3059 A64 A74 2 A91 A101 ... A121 61 A143 A152 1 A172 1 A191 A201 1
9 A12 30 A34 A40 5234 A61 A71 4 A94 A101 ... A123 28 A143 A152 2 A174 1 A191 A201 2

10 rows × 21 columns

Out[6]:
Index(['CurrentAcc', 'NumMonths', 'CreditHistory', 'Purpose', 'CreditAmount',
       'Savings', 'EmployDuration', 'PayBackPercent', 'Gender', 'Debtors',
       'ResidenceDuration', 'Collateral', 'Age', 'OtherPayBackPlan',
       'Property', 'ExistingCredit', 'Job', 'Dependents', 'Telephone',
       'Foreignworker', 'CreditStatus'],
      dtype='object')

The fields in the source dataset are listed above.

In [7]:
german_xai.dtypes
Out[7]:
CurrentAcc           object
NumMonths             int64
CreditHistory        object
Purpose              object
CreditAmount          int64
Savings              object
EmployDuration       object
PayBackPercent        int64
Gender               object
Debtors              object
ResidenceDuration     int64
Collateral           object
Age                   int64
OtherPayBackPlan     object
Property             object
ExistingCredit        int64
Job                  object
Dependents            int64
Telephone            object
Foreignworker        object
CreditStatus          int64
dtype: object

The data type of each field is displayed above.

Missing Value Check

In [8]:
import klib
klib.missingval_plot(german_xai)
No missing values found in the dataset.
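
As a numeric complement to the plot, missing values can also be counted per column with plain pandas. The following is a minimal sketch on a small hypothetical frame (`toy_df` and its values are illustrative, not the German credit data); in the notebook the same calls would be applied to `german_xai`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the notebook would pass german_xai instead
toy_df = pd.DataFrame({
    "NumMonths": [6, 48, np.nan],
    "Purpose": ["A43", None, "A46"],
    "Age": [67, 22, 49],
})

# Count missing entries per column; all zeros means there is nothing to impute
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Single boolean answer for the whole frame
has_missing = bool(toy_df.isna().any().any())
print("Any missing values:", has_missing)
```

For the German credit data the counts would be expected to be zero in every column, matching the plot's message.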

Feature Engineering

Encoding categorical fields

1. Mapping to actual description

First, the coded category values of each field are mapped to their actual descriptions, following the attribute documentation in the UCI Machine Learning Repository.

Gender field desc:

  • A91 : male : divorced/separated;
  • A92 : female : divorced/separated/married;
  • A93 : male : single;
  • A94 : male : married/widowed;
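  • A95 : female : single.

Male is encoded as 1 and female as 0. Code A95 does not occur in this dataset.

A separate marital status field was also considered, to study its impact as a protected attribute. That mapping is commented out in the cell below, so only Gender is recoded here.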

In [9]:
german_xai['Gender'].value_counts()
#german_xai.replace({'Marital_Status':{'A93':'Single','A91':'divorced/married/widowed','A92':'divorced/married/widowed','A94':'divorced/married/widowed'},'Gender':{'A91':'1','A93':'1','A94':'1','A92':'0'}},inplace=True)
german_xai.replace({'Gender':{'A91':'1','A93':'1','A94':'1','A92':'0'}},inplace=True)
german_xai['Gender'].value_counts()
Out[9]:
A93    548
A92    310
A94     92
A91     50
Name: Gender, dtype: int64
Out[9]:
1    690
0    310
Name: Gender, dtype: int64
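
The commented-out line hints at deriving a marital status field from the same codes. One possible way to do that is sketched below, under assumptions: the column name `Marital_Status` and its two labels are taken from the commented-out line, the sample codes are hypothetical, and integers are used for Gender where the notebook writes the strings '1' and '0'.

```python
import pandas as pd

# Personal-status codes from the UCI description (A95 is absent from the data)
SEX_MAP = {"A91": 1, "A92": 0, "A93": 1, "A94": 1, "A95": 0}  # 1 = male, 0 = female

# Grouping taken from the commented-out line in the notebook
MARITAL_MAP = {
    "A91": "divorced/married/widowed",
    "A92": "divorced/married/widowed",
    "A93": "Single",
    "A94": "divorced/married/widowed",
}

# Hypothetical sample of raw codes, standing in for german_xai['Gender']
raw = pd.Series(["A93", "A92", "A94", "A91"], name="Gender")

derived = pd.DataFrame({
    "Marital_Status": raw.map(MARITAL_MAP),  # derive before the codes are overwritten
    "Gender": raw.map(SEX_MAP),
})
print(derived)
```

Both columns are derived from the raw codes, so the marital status has to be computed before Gender is overwritten with 0/1.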
In [10]:
#german_xai['Age'].value_counts()
german_xai['Age']=german_xai['Age'].apply(lambda x: int(x >= 26))  # built-in int; np.int is no longer available in recent NumPy releases
german_xai['Age'].value_counts()
Out[10]:
1    810
0    190
Name: Age, dtype: int64

Entries aged 26 years or older are encoded as 1; all others are encoded as 0.
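
The same binarisation can be written without a per-row lambda. A minimal sketch, assuming a hypothetical series of ages chosen to straddle the threshold:

```python
import pandas as pd

# Hypothetical ages around the threshold; the notebook applies this to german_xai['Age']
ages = pd.Series([19, 25, 26, 67], name="Age")

# Vectorised comparison: True/False cast to 1/0
age_flag = (ages >= 26).astype(int)
print(age_flag.tolist())
```

The boundary value 26 falls in group 1 and 25 falls in group 0.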

In [11]:
#Encoding target field
german_xai.CreditStatus.value_counts()
german_xai['CreditStatus'].replace({1:1 , 2: 0}, inplace=True)
german_xai.CreditStatus.value_counts()
Out[11]:
1    700
2    300
Name: CreditStatus, dtype: int64
Out[11]:
1    700
0    300
Name: CreditStatus, dtype: int64

The target field CreditStatus is encoded as 1 = Good, 0 = Bad (positive class); in the raw data, 1 = Good and 2 = Bad. See the AIF360 GermanDataset documentation: https://aif360.readthedocs.io/en/latest/modules/generated/aif360.datasets.GermanDataset.html#aif360.datasets.GermanDataset

Status of checking account desc:

  • A11 : ... < 0 DM;
  • A12 : 0 <= ... < 200 DM;
  • A13 : ... >= 200 DM;
  • A14 : no checking account.
In [12]:
german_xai['CurrentAcc'].replace({'A11':'LT200' , 'A12': 'LT200','A13': 'GE200','A14': 'None'}, inplace=True)
german_xai.CurrentAcc.value_counts()
Out[12]:
LT200    543
None     394
GE200     63
Name: CurrentAcc, dtype: int64

Employment duration desc:

  • A71 : unemployed;
  • A72 : ... < 1 year;
  • A73 : 1 <= ... < 4 years;
  • A74 : 4 <= ... < 7 years;
  • A75 : .. >= 7 years.
In [13]:
german_xai['EmployDuration'].replace({'A71':'unemployed' , 'A72': 'LT1','A73': '1-4','A74': '4-7', 'A75': 'GE7'}, inplace=True)
german_xai.EmployDuration.value_counts()
Out[13]:
1-4           339
GE7           253
4-7           174
LT1           172
unemployed     62
Name: EmployDuration, dtype: int64

Credit History desc:

  • A30 : no credits taken/ all credits paid back duly,
  • A31 : all credits at this bank paid back duly,
  • A32 : existing credits paid back duly till now,
  • A33 : delay in paying off in the past,
  • A34 : critical account/ other credits existing (not at this bank).
In [14]:
german_xai['CreditHistory'].replace({'A30':'none/paid' , 'A31': 'none/paid','A32': 'none/paid','A33': 'Delay', 'A34': 'other'}, inplace=True)
german_xai['CreditHistory'].value_counts()
Out[14]:
none/paid    619
other        293
Delay         88
Name: CreditHistory, dtype: int64

Savings Desc:

  • A61 : ... < 100 DM
  • A62 : 100 <= ... < 500 DM
  • A63 : 500 <= ... < 1000 DM
  • A64 : .. >= 1000 DM
  • A65 : unknown/ no savings account
In [15]:
german_xai['Savings'].replace({'A61':'LT500' , 'A62': 'LT500','A63': 'GT500','A64': 'GT500', 'A65': 'none'}, inplace=True)
german_xai['Savings'].value_counts()
Out[15]:
LT500    706
none     183
GT500    111
Name: Savings, dtype: int64

Debtors desc: Other debtors / guarantors

  • A101 : none
  • A102 : co-applicant
  • A103 : guarantor
In [16]:
german_xai['Debtors'].replace({'A101':'none' , 'A102': 'co-applicant','A103': 'guarantor'}, inplace=True)
german_xai['Debtors'].value_counts()
Out[16]:
none            907
guarantor        52
co-applicant     41
Name: Debtors, dtype: int64

Collateral desc:

  • A121 : real estate
  • A122 : if not A121 : building society savings agreement/ life insurance
  • A123 : if not A121/A122 : car or other, not in attribute 6
  • A124 : unknown / no property
In [17]:
german_xai['Collateral'].replace({'A121':'real_estate' , 'A122': 'savings/life_insurance','A123': 'car/other', 'A124':'unknown/none'}, inplace=True)
german_xai['Collateral'].value_counts()
Out[17]:
car/other                 332
real_estate               282
savings/life_insurance    232
unknown/none              154
Name: Collateral, dtype: int64

Property: Housing

  • A151 : rent
  • A152 : own
  • A153 : for free
In [18]:
german_xai['Property'].replace({'A151':'rent' , 'A152': 'own','A153': 'free'}, inplace=True)
german_xai['Property'].value_counts()
Out[18]:
own     713
rent    179
free    108
Name: Property, dtype: int64

Telephone desc:

  • A191 : none
  • A192 : yes, registered under the customer's name

Foreign worker

  • A201 : yes
  • A202 : no
In [19]:
german_xai['Foreignworker'].replace({'A201':1 , 'A202': 0}, inplace=True)
german_xai['Telephone'].replace({'A191':0 , 'A192': 1}, inplace=True)
german_xai['Telephone'].value_counts()
german_xai['Foreignworker'].value_counts()
Out[19]:
0    596
1    404
Name: Telephone, dtype: int64
Out[19]:
1    963
0     37
Name: Foreignworker, dtype: int64

Purpose desc:

  • A40 : car (new)
  • A41 : car (used)
  • A42 : furniture/equipment
  • A43 : radio/television
  • A44 : domestic appliances
  • A45 : repairs
  • A46 : education
  • A47 : (vacation - does not exist?)
  • A48 : retraining
  • A49 : business
  • A410 : others
In [20]:
german_xai['Purpose'].replace({'A40':'CarNew' , 'A41': 'CarUsed' , 'A42': 'furniture/equip','A43':'radio/tv','A44':'domestic app','A45':'repairs','A46':'education','A47':'vacation','A48':'retraining','A49':'biz','A410':'others'}, inplace=True)
german_xai['Purpose'].value_counts()
Out[20]:
radio/tv           280
CarNew             234
furniture/equip    181
CarUsed            103
biz                 97
education           50
repairs             22
others              12
domestic app        12
retraining           9
Name: Purpose, dtype: int64

Job desc:

  • A171 : unemployed/ unskilled - non-resident
  • A172 : unskilled - resident
  • A173 : skilled employee / official
  • A174 : management/ self-employed/highly qualified employee/ officer
In [21]:
german_xai['Job'].replace({'A171':'unemp/unskilled-non_resident' , 'A172': 'unskilled-resident','A173': 'skilled_employee','A174':'management/self-emp/officer/highly_qualif_emp'}, inplace=True)
german_xai['Job'].value_counts()
Out[21]:
skilled_employee                                 630
unskilled-resident                               200
management/self-emp/officer/highly_qualif_emp    148
unemp/unskilled-non_resident                      22
Name: Job, dtype: int64

Other installment plans desc

  • A141 : bank
  • A142 : stores
  • A143 : none
In [22]:
german_xai['OtherPayBackPlan'].replace({'A141':'bank' , 'A142': 'stores','A143': 'none'}, inplace=True)
german_xai['OtherPayBackPlan'].value_counts()
Out[22]:
none      814
bank      139
stores     47
Name: OtherPayBackPlan, dtype: int64
In [23]:
german_xai.head()
Out[23]:
CurrentAcc NumMonths CreditHistory Purpose CreditAmount Savings EmployDuration PayBackPercent Gender Debtors ... Collateral Age OtherPayBackPlan Property ExistingCredit Job Dependents Telephone Foreignworker CreditStatus
0 LT200 6 other radio/tv 1169 none GE7 4 1 none ... real_estate 1 none own 2 skilled_employee 1 1 1 1
1 LT200 48 none/paid radio/tv 5951 LT500 1-4 2 0 none ... real_estate 0 none own 1 skilled_employee 1 0 1 0
2 None 12 other education 2096 LT500 4-7 2 1 none ... real_estate 1 none own 1 unskilled-resident 2 0 1 1
3 LT200 42 none/paid furniture/equip 7882 LT500 4-7 2 1 guarantor ... savings/life_insurance 1 none free 1 skilled_employee 2 0 1 1
4 LT200 24 Delay CarNew 4870 LT500 1-4 3 1 none ... unknown/none 1 none free 2 skilled_employee 2 0 1 0

5 rows × 21 columns
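
The per-field cells above all follow the same pattern, so the mappings could also be collected in one dictionary and applied in a loop. The sketch below is one way to do that; `CODE_MAPS` and `apply_code_maps` are hypothetical names, only two of the notebook's mappings are shown, and the sample rows are illustrative. Assigning the result back to the column also avoids calling `replace(..., inplace=True)` on a selected column, a pattern that recent pandas versions may warn about.

```python
import pandas as pd

# Two of the notebook's mappings, collected in one place
CODE_MAPS = {
    "Debtors": {"A101": "none", "A102": "co-applicant", "A103": "guarantor"},
    "Property": {"A151": "rent", "A152": "own", "A153": "free"},
}


def apply_code_maps(df: pd.DataFrame, code_maps: dict) -> pd.DataFrame:
    """Return a copy of df with each listed column recoded via its mapping."""
    out = df.copy()
    for column, mapping in code_maps.items():
        unknown = set(out[column].unique()) - set(mapping)
        if unknown:
            # Fail loudly rather than leave raw codes mixed with labels
            raise ValueError(f"{column}: unmapped codes {sorted(unknown)}")
        # Assign the recoded column back to the copy
        out[column] = out[column].map(mapping)
    return out


# Hypothetical sample rows standing in for german_xai
sample = pd.DataFrame({"Debtors": ["A101", "A103"], "Property": ["A152", "A153"]})
mapped = apply_code_maps(sample, CODE_MAPS)
print(mapped)
```

The check for unmapped codes catches typos in the dictionary before they turn into silent missing values.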

In [24]:
german_xai = german_xai.reindex(columns=['CurrentAcc','NumMonths', 'CreditHistory', 'Purpose', 'CreditAmount', 
         'Savings', 'EmployDuration', 'PayBackPercent', 'Gender', 'Debtors', 
         'ResidenceDuration', 'Collateral', 'Age', 'OtherPayBackPlan', 'Property', 
         'ExistingCredit', 'Job', 'Dependents', 'Telephone', 'Foreignworker', 'CreditStatus'])
##german_xai.head()

Writing data to csv file for re-usability

In [25]:
german_xai.to_csv('C:/Users/krish/Downloads/German-mapped_upd.csv', index=False)
In [26]:
German_df = pd.read_csv('C:/Users/krish/Downloads/German-mapped_upd.csv')
print(German_df.shape)
print (German_df.columns)
(1000, 21)
Index(['CurrentAcc', 'NumMonths', 'CreditHistory', 'Purpose', 'CreditAmount',
       'Savings', 'EmployDuration', 'PayBackPercent', 'Gender', 'Debtors',
       'ResidenceDuration', 'Collateral', 'Age', 'OtherPayBackPlan',
       'Property', 'ExistingCredit', 'Job', 'Dependents', 'Telephone',
       'Foreignworker', 'CreditStatus'],
      dtype='object')
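
One thing worth checking after a CSV round trip is that text labels come back as text. Depending on the pandas version, some literal strings are interpreted as missing values by `read_csv`; whether the label 'None' used for CurrentAcc is among them is version-dependent, so the following is a defensive sketch rather than a required fix. The frame is hypothetical and an in-memory buffer replaces the file on disk.

```python
import io

import pandas as pd

# Hypothetical frame using the notebook's 'None' label for "no checking account"
frame = pd.DataFrame({"CurrentAcc": ["LT200", "None", "GE200"], "NumMonths": [6, 48, 12]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# keep_default_na=False stops pandas from turning marker strings into NaN
restored = pd.read_csv(buffer, keep_default_na=False)
print(restored["CurrentAcc"].tolist())
print(restored["CurrentAcc"].isna().sum())
```

This is safe here only because the dataset has no genuinely missing values; with real gaps, an explicit `na_values` list would be the more careful choice.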

Data Analysis

Correlation Analysis

In [27]:
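# Correlate numeric columns only; corr() on text columns raises an error in recent pandas versions
corrMatrix = German_df.select_dtypes(include='number').corr().round(1)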
corrMatrix
Out[27]:
NumMonths CreditAmount PayBackPercent Gender ResidenceDuration Age ExistingCredit Dependents Telephone Foreignworker CreditStatus
NumMonths 1.0 0.6 0.1 0.1 0.0 0.0 -0.0 -0.0 0.2 0.1 -0.2
CreditAmount 0.6 1.0 -0.3 0.1 0.0 0.0 0.0 0.0 0.3 0.1 -0.2
PayBackPercent 0.1 -0.3 1.0 0.1 0.0 0.1 0.0 -0.1 0.0 0.1 -0.1
Gender 0.1 0.1 0.1 1.0 -0.0 0.3 0.1 0.2 0.1 -0.1 0.1
ResidenceDuration 0.0 0.0 0.0 -0.0 1.0 0.0 0.1 0.0 0.1 0.1 -0.0
Age 0.0 0.0 0.1 0.3 0.0 1.0 0.1 0.2 0.2 -0.1 0.1
ExistingCredit -0.0 0.0 0.0 0.1 0.1 0.1 1.0 0.1 0.1 0.0 0.0
Dependents -0.0 0.0 -0.1 0.2 0.0 0.2 0.1 1.0 -0.0 -0.1 0.0
Telephone 0.2 0.3 0.0 0.1 0.1 0.2 0.1 -0.0 1.0 0.1 0.0
Foreignworker 0.1 0.1 0.1 -0.1 0.1 -0.1 0.0 -0.1 0.1 1.0 -0.1
CreditStatus -0.2 -0.2 -0.1 0.1 -0.0 0.1 0.0 0.0 0.0 -0.1 1.0
In [28]:
plt.figure(figsize=(15,15))
sns.heatmap(corrMatrix, annot=True,cmap="Blues")
plt.show()
[Figure: annotated heatmap of the rounded correlation matrix]
In [29]:
klib.corr_plot(German_df,annot=False)
[Figure: klib correlation plot of the numeric fields]

Observation: Credit amount and number of months are positively correlated (about 0.6), the strongest relationship in the matrix.
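
To make clear what the reported coefficient measures, the sketch below computes Pearson's r for two columns by hand and compares it with `DataFrame.corr()`. The data are hypothetical toy values; only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the notebook, values are made up
toy = pd.DataFrame({
    "NumMonths": [6, 12, 24, 36, 48],
    "CreditAmount": [1200, 2100, 4900, 5200, 9000],
    "Purpose": ["radio/tv", "education", "CarNew", "biz", "CarUsed"],
})

# Restrict to numeric columns so text fields do not interfere with corr()
corr = toy.select_dtypes(include="number").corr()

# Pearson r by hand: co-deviation divided by the product of the deviations' norms
x, y = toy["NumMonths"], toy["CreditAmount"]
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)
print(round(corr.loc["NumMonths", "CreditAmount"], 4), round(r_manual, 4))
```

The two printed values agree, which confirms that the matrix holds ordinary Pearson coefficients.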

Correlation with respect to the target field

In [30]:
klib.corr_plot(German_df,target='CreditStatus')
[Figure: klib plot of each numeric field's correlation with CreditStatus]
In [31]:
klib.corr_mat(German_df)
Out[31]:
NumMonths CreditAmount PayBackPercent Gender ResidenceDuration Age ExistingCredit Dependents Telephone Foreignworker CreditStatus
NumMonths 1.00 0.62 0.07 0.08 0.03 0.01 -0.01 -0.02 0.16 0.14 -0.21
CreditAmount 0.62 1.00 -0.27 0.09 0.03 0.05 0.02 0.02 0.28 0.05 -0.15
PayBackPercent 0.07 -0.27 1.00 0.09 0.05 0.06 0.02 -0.07 0.01 0.09 -0.07
Gender 0.08 0.09 0.09 1.00 -0.01 0.25 0.09 0.20 0.08 -0.05 0.08
ResidenceDuration 0.03 0.03 0.05 -0.01 1.00 0.01 0.09 0.04 0.10 0.05 -0.00
Age 0.01 0.05 0.06 0.25 0.01 1.00 0.14 0.17 0.16 -0.05 0.13
ExistingCredit -0.01 0.02 0.02 0.09 0.09 0.14 1.00 0.11 0.07 0.01 0.05
Dependents -0.02 0.02 -0.07 0.20 0.04 0.17 0.11 1.00 -0.01 -0.08 0.00
Telephone 0.16 0.28 0.01 0.08 0.10 0.16 0.07 -0.01 1.00 0.11 0.04
Foreignworker 0.14 0.05 0.09 -0.05 0.05 -0.05 0.01 -0.08 0.11 1.00 -0.08
CreditStatus -0.21 -0.15 -0.07 0.08 -0.00 0.13 0.05 0.00 0.04 -0.08 1.00
In [32]:
klib.cat_plot(German_df)
[Figure: klib categorical plot of the categorical fields]
In [33]:
klib.dist_plot(German_df)
[Figure: klib distribution plot]
In [34]:
import matplotlib.pyplot as plt
import numpy as np

age_count=German_df.Age.value_counts(sort=True)
print(age_count)

plt.figure(figsize=(10,5))
age_count.plot(kind='bar', color='skyblue', rot=0)
plt.ylabel('Frequency',fontsize=12,color='green')
plt.xlabel('Age',fontsize=12,color='green')
plt.suptitle('Distribution of Age field',fontsize=15,color='orange',fontweight='bold')

# Write each count inside its bar; bars follow the value_counts order (1 first, then 0)
plt.annotate(age_count[1],xy=(0,300),verticalalignment="top",horizontalalignment="center")
plt.annotate(age_count[0],xy=(1,100),verticalalignment="top",horizontalalignment="center")

# Age was binarised at 26, so 1 means Age >= 26
LABELS=["1:Age>=26","0:Age<26"]
plt.xticks(range(2),LABELS)
1    810
0    190
Name: Age, dtype: int64
[Figure: bar chart "Distribution of Age field"; x-axis: Age (1: 26 or older, 0: under 26), y-axis: Frequency; bars annotated 810 and 190]

Observation: Most entries (810 of 1000) are aged 26 or older.

In [35]:
plt.figure(figsize=(10,5))

plt.hist(German_df.CreditAmount, color='tomato') 

plt.ylabel('Frequency')

plt.xlabel('Credit Amount')

plt.suptitle('Distribution of Credit Amount field',fontsize=15,color='slategrey',fontweight='bold')
Out[35]:
(array([445., 293.,  97.,  80.,  38.,  19.,  14.,   8.,   5.,   1.]),
 array([  250. ,  2067.4,  3884.8,  5702.2,  7519.6,  9337. , 11154.4,
        12971.8, 14789.2, 16606.6, 18424. ]),
 <a list of 10 Patch objects>)
[Figure: histogram "Distribution of Credit Amount field"; x-axis: Credit Amount, y-axis: Frequency]

Observation: Lower credit amounts dominate: 738 of 1000 entries fall in the two lowest bins (below about 3,885).

In [36]:
plt.figure(figsize=(10,5))

plt.hist(German_df.NumMonths, color='tan') 

plt.ylabel('Frequency')

plt.xlabel('Number of Months')

plt.suptitle('Distribution of NumMonths field',fontsize=15,color='teal',fontweight='bold')
Out[36]:
(array([171., 262., 337.,  57.,  86.,  17.,  54.,   2.,  13.,   1.]),
 array([ 4. , 10.8, 17.6, 24.4, 31.2, 38. , 44.8, 51.6, 58.4, 65.2, 72. ]),
 <a list of 10 Patch objects>)
[Figure: histogram "Distribution of NumMonths field"; x-axis: Number of Months, y-axis: Frequency]

Observation: Shorter durations dominate: 770 of 1000 entries have a duration below about 24 months.

In [37]:
target_count=German_df.CreditStatus.value_counts(sort=True)

print(target_count)

plt.figure(figsize=(10,5))

target_count.plot(kind='bar', color='gold', rot=0) 

plt.ylabel('Frequency',fontsize=12,color='green')

plt.xlabel('Credit Status',fontsize=12,color='green')

plt.suptitle('Distribution of Credit Status field',fontsize=15,color='red',fontweight='bold')

plt.annotate(target_count[1],xy=(0,300),verticalalignment="top",horizontalalignment="center")
plt.annotate(target_count[0],xy=(1,200),verticalalignment="top",horizontalalignment="center")

LABELS=["1:Good credit score","0:Bad credit score"]
plt.xticks(range(2),LABELS)
1    700
0    300
Name: CreditStatus, dtype: int64
[Figure: bar chart "Distribution of Credit Status field"; x-axis: Credit Status (1: Good credit score, 0: Bad credit score), y-axis: Frequency; bars annotated 700 and 300]

Observation: The classes are imbalanced: 700 entries have a good credit status and 300 have a bad one.

In [38]:
German_df['Age'].describe()
Out[38]:
count    1000.000000
mean        0.810000
std         0.392497
min         0.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         1.000000
Name: Age, dtype: float64
In [39]:
German_df.Gender.unique()
Out[39]:
array([1, 0], dtype=int64)
In [40]:
Gender_count=German_df.Gender.value_counts()
print(Gender_count)

plt.figure(figsize=(10,5))

Gender_count.plot(kind='bar', color='pink', rot=0) 

plt.ylabel('Frequency',fontsize=12,color='blue')

plt.xlabel('Gender',fontsize=12,color='blue')

plt.suptitle('Distribution of Gender field',fontsize=15,color='Green',fontweight='bold')

plt.annotate(Gender_count[1],xy=(0,300),verticalalignment="top",horizontalalignment="center")
plt.annotate(Gender_count[0],xy=(1,200),verticalalignment="top",horizontalalignment="center")

LABELS=["1:Male","0:Female"]
plt.xticks(range(2),LABELS)
1    690
0    310
Name: Gender, dtype: int64
[Figure: bar chart "Distribution of Gender field"; x-axis: Gender (1: Male, 0: Female), y-axis: Frequency; bars annotated 690 and 310]

Observation: Male entries (690) outnumber female entries (310).

In [41]:
colour=['blue','pink','orange','green','tan','violet','olive','gold','tomato','skyblue']
# zip stops at the shorter sequence, so only the first 10 columns (one per colour) are plotted
for i,j in zip(German_df.columns,colour):
    field_count=German_df[i].value_counts()
    #print(field_count)

    plt.figure(figsize=(10,5))

    field_count.plot(kind='bar', color=j, rot=0) 

    plt.ylabel('Frequency',fontsize=12,color='black')

    plt.xlabel(i,fontsize=12,color='black')

    plt.suptitle('Distribution of '+ i,fontsize=15,color='Green',fontweight='bold')
[Figures: ten bar charts titled "Distribution of <field>", one each for CurrentAcc, NumMonths, CreditHistory, Purpose, CreditAmount, Savings, EmployDuration, PayBackPercent, Gender and Debtors; x-axis: field value, y-axis: Frequency]
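
The loop above produces ten charts because `zip` ends with the ten-entry colour list. If a chart for every field is wanted, the palette can be repeated with `itertools.cycle`. The sketch below shows only the pairing logic, using the notebook's colour list and column names; it is a possible variation, not what the notebook ran.

```python
from itertools import cycle

# The notebook's palette and its 21 column names
colours = ['blue', 'pink', 'orange', 'green', 'tan',
           'violet', 'olive', 'gold', 'tomato', 'skyblue']
columns = ['CurrentAcc', 'NumMonths', 'CreditHistory', 'Purpose', 'CreditAmount',
           'Savings', 'EmployDuration', 'PayBackPercent', 'Gender', 'Debtors',
           'ResidenceDuration', 'Collateral', 'Age', 'OtherPayBackPlan', 'Property',
           'ExistingCredit', 'Job', 'Dependents', 'Telephone', 'Foreignworker',
           'CreditStatus']

# Plain zip is bounded by the shorter list (10 colours)
truncated = list(zip(columns, colours))

# cycle() repeats the palette, so zip is bounded by the columns instead
pairs = list(zip(columns, cycle(colours)))
print(len(truncated), len(pairs), pairs[10])
```

Each `(column, colour)` pair can then be passed to the same plotting calls as in the loop; for continuous fields such as CreditAmount a histogram is likely to be more readable than a bar chart of value counts.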

Converting categorical fields to numerical fields

In [42]:
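# dtype='uint8' keeps the indicators as 0/1 integers; newer pandas versions default to bool
german_xai=pd.get_dummies(German_df,columns=['CurrentAcc','CreditHistory','Purpose','Savings','EmployDuration','Debtors','Collateral','OtherPayBackPlan','Property','Job'],dtype='uint8')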
german_xai.head()
Out[42]:
NumMonths CreditAmount PayBackPercent Gender ResidenceDuration Age ExistingCredit Dependents Telephone Foreignworker ... OtherPayBackPlan_bank OtherPayBackPlan_none OtherPayBackPlan_stores Property_free Property_own Property_rent Job_management/self-emp/officer/highly_qualif_emp Job_skilled_employee Job_unemp/unskilled-non_resident Job_unskilled-resident
0 6 1169 4 1 4 1 2 1 1 1 ... 0 1 0 0 1 0 0 1 0 0
1 48 5951 2 0 2 0 1 1 0 1 ... 0 1 0 0 1 0 0 1 0 0
2 12 2096 2 1 3 1 1 2 0 1 ... 0 1 0 0 1 0 0 0 0 1
3 42 7882 2 1 4 1 1 2 0 1 ... 0 1 0 1 0 0 0 1 0 0
4 24 4870 3 1 4 1 2 2 0 1 ... 0 1 0 1 0 0 0 1 0 0

5 rows × 52 columns

In [43]:
german_xai.columns
Out[43]:
Index(['NumMonths', 'CreditAmount', 'PayBackPercent', 'Gender',
       'ResidenceDuration', 'Age', 'ExistingCredit', 'Dependents', 'Telephone',
       'Foreignworker', 'CreditStatus', 'CurrentAcc_GE200', 'CurrentAcc_LT200',
       'CurrentAcc_None', 'CreditHistory_Delay', 'CreditHistory_none/paid',
       'CreditHistory_other', 'Purpose_CarNew', 'Purpose_CarUsed',
       'Purpose_biz', 'Purpose_domestic app', 'Purpose_education',
       'Purpose_furniture/equip', 'Purpose_others', 'Purpose_radio/tv',
       'Purpose_repairs', 'Purpose_retraining', 'Savings_GT500',
       'Savings_LT500', 'Savings_none', 'EmployDuration_1-4',
       'EmployDuration_4-7', 'EmployDuration_GE7', 'EmployDuration_LT1',
       'EmployDuration_unemployed', 'Debtors_co-applicant',
       'Debtors_guarantor', 'Debtors_none', 'Collateral_car/other',
       'Collateral_real_estate', 'Collateral_savings/life_insurance',
       'Collateral_unknown/none', 'OtherPayBackPlan_bank',
       'OtherPayBackPlan_none', 'OtherPayBackPlan_stores', 'Property_free',
       'Property_own', 'Property_rent',
       'Job_management/self-emp/officer/highly_qualif_emp',
       'Job_skilled_employee', 'Job_unemp/unskilled-non_resident',
       'Job_unskilled-resident'],
      dtype='object')
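
One-hot encoding creates one indicator column per category level, which is why the 10 categorical fields expand the frame to 52 columns. The sketch below shows the effect on a small hypothetical sample, together with the optional `drop_first=True` setting. The notebook does not use `drop_first`; it is shown only as an assumption about what might suit models that are sensitive to perfectly collinear indicators.

```python
import pandas as pd

# Hypothetical sample with one categorical and one numeric column
sample = pd.DataFrame({
    "Property": ["own", "rent", "free", "own"],
    "NumMonths": [6, 48, 12, 42],
})

# One indicator column per category; dtype fixed so 0/1 integers are returned
full = pd.get_dummies(sample, columns=["Property"], dtype="uint8")

# drop_first=True removes the first level of each field
reduced = pd.get_dummies(sample, columns=["Property"], dtype="uint8", drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

In the full encoding, the indicators of one field always sum to 1 in every row, so any one of them can be derived from the others.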

Reordering columns so that the target field CreditStatus is last

In [44]:
german_xai = german_xai.reindex(columns=['NumMonths', 'CreditAmount', 'PayBackPercent', 'Gender',
       'ResidenceDuration', 'Age', 'ExistingCredit', 'Dependents', 'Telephone',
       'Foreignworker', 'CurrentAcc_GE200',
       'CurrentAcc_LT200', 'CurrentAcc_None', 'CreditHistory_Delay',
       'CreditHistory_none/paid', 'CreditHistory_other', 'Purpose_CarNew',
       'Purpose_CarUsed', 'Purpose_biz', 'Purpose_domestic app',
       'Purpose_education', 'Purpose_furniture/equip', 'Purpose_others',
       'Purpose_radio/tv', 'Purpose_repairs', 'Purpose_retraining',
       'Savings_GT500', 'Savings_LT500', 'Savings_none', 'EmployDuration_1-4',
       'EmployDuration_4-7', 'EmployDuration_GE7', 'EmployDuration_LT1',
       'EmployDuration_unemployed', 'Debtors_co-applicant',
       'Debtors_guarantor', 'Debtors_none', 'Collateral_car/other',
       'Collateral_real_estate', 'Collateral_savings/life_insurance',
       'Collateral_unknown/none', 'OtherPayBackPlan_bank',
       'OtherPayBackPlan_none', 'OtherPayBackPlan_stores', 'Property_free',
       'Property_own', 'Property_rent',
       'Job_management/self-emp/officer/highly_qualif_emp',
       'Job_skilled_employee', 'Job_unemp/unskilled-non_resident',
       'Job_unskilled-resident','CreditStatus'])
german_xai.head()
Out[44]:
NumMonths CreditAmount PayBackPercent Gender ResidenceDuration Age ExistingCredit Dependents Telephone Foreignworker ... OtherPayBackPlan_none OtherPayBackPlan_stores Property_free Property_own Property_rent Job_management/self-emp/officer/highly_qualif_emp Job_skilled_employee Job_unemp/unskilled-non_resident Job_unskilled-resident CreditStatus
0 6 1169 4 1 4 1 2 1 1 1 ... 1 0 0 1 0 0 1 0 0 1
1 48 5951 2 0 2 0 1 1 0 1 ... 1 0 0 1 0 0 1 0 0 0
2 12 2096 2 1 3 1 1 2 0 1 ... 1 0 0 1 0 0 0 0 1 1
3 42 7882 2 1 4 1 1 2 0 1 ... 1 0 1 0 0 0 1 0 0 1
4 24 4870 3 1 4 1 2 2 0 1 ... 1 0 1 0 0 0 1 0 0 0

5 rows × 52 columns

Scaling Credit Amount

In [45]:
from sklearn.preprocessing import MinMaxScaler #since the field is not normally distributed
scaler = MinMaxScaler()
german_xai[['CreditAmount']]=scaler.fit_transform(german_xai[['CreditAmount']])
In [46]:
german_xai.head()
Out[46]:
NumMonths CreditAmount PayBackPercent Gender ResidenceDuration Age ExistingCredit Dependents Telephone Foreignworker ... OtherPayBackPlan_none OtherPayBackPlan_stores Property_free Property_own Property_rent Job_management/self-emp/officer/highly_qualif_emp Job_skilled_employee Job_unemp/unskilled-non_resident Job_unskilled-resident CreditStatus
0 6 0.050567 4 1 4 1 2 1 1 1 ... 1 0 0 1 0 0 1 0 0 1
1 48 0.313690 2 0 2 0 1 1 0 1 ... 1 0 0 1 0 0 1 0 0 0
2 12 0.101574 2 1 3 1 1 2 0 1 ... 1 0 0 1 0 0 0 0 1 1
3 42 0.419941 2 1 4 1 1 2 0 1 ... 1 0 1 0 0 0 1 0 0 1
4 24 0.254209 3 1 4 1 2 2 0 1 ... 1 0 1 0 0 0 1 0 0 0

5 rows × 52 columns
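
The scaler above is fitted on all 1000 rows before the train/test split, so the test rows influence the minimum and maximum used for scaling. A common alternative is to fit on the training rows only and reuse those parameters for the test rows. The following is a minimal sketch of that order of operations on hypothetical random credit amounts; it is a suggestion, not the notebook's procedure.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical credit amounts; the notebook's column is german_xai['CreditAmount']
rng = np.random.default_rng(0)
data = pd.DataFrame({"CreditAmount": rng.integers(250, 18425, size=100)})

train, test = train_test_split(data, test_size=0.2, random_state=40)
train, test = train.copy(), test.copy()

# Learn min and max from the training rows only, then reuse them for the test rows
scaler = MinMaxScaler()
train["CreditAmount"] = scaler.fit_transform(train[["CreditAmount"]]).ravel()
test["CreditAmount"] = scaler.transform(test[["CreditAmount"]]).ravel()

print(train["CreditAmount"].min(), train["CreditAmount"].max())
print(len(train), len(test))
```

With this ordering, scaled test values can fall slightly outside the range 0 to 1, which is expected when a test row is smaller or larger than anything seen in training.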

Writing data to csv file

In [47]:
german_xai.to_csv('C:/Users/krish/Downloads/German-encoded_upd.csv', index=False)

Splitting into train and test data

In [48]:
X = german_xai.iloc[:, :-1]
y = german_xai['CreditStatus']
X.head()
y.head()
X_train,X_test,y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=40,stratify=y)
Out[48]:
NumMonths CreditAmount PayBackPercent Gender ResidenceDuration Age ExistingCredit Dependents Telephone Foreignworker ... OtherPayBackPlan_bank OtherPayBackPlan_none OtherPayBackPlan_stores Property_free Property_own Property_rent Job_management/self-emp/officer/highly_qualif_emp Job_skilled_employee Job_unemp/unskilled-non_resident Job_unskilled-resident
0 6 0.050567 4 1 4 1 2 1 1 1 ... 0 1 0 0 1 0 0 1 0 0
1 48 0.313690 2 0 2 0 1 1 0 1 ... 0 1 0 0 1 0 0 1 0 0
2 12 0.101574 2 1 3 1 1 2 0 1 ... 0 1 0 0 1 0 0 0 0 1
3 42 0.419941 2 1 4 1 1 2 0 1 ... 0 1 0 1 0 0 0 1 0 0
4 24 0.254209 3 1 4 1 2 2 0 1 ... 0 1 0 1 0 0 0 1 0 0

5 rows × 51 columns

Out[48]:
0    1
1    0
2    1
3    1
4    0
Name: CreditStatus, dtype: int64
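
Because the target is imbalanced (700 good, 300 bad), the split uses `stratify=y` so that both partitions keep roughly the same class ratio. The sketch below checks this on a hypothetical stand-in with the same class balance; the single feature column is a placeholder.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same 700/300 class balance as CreditStatus
y = pd.Series([1] * 700 + [0] * 300, name="CreditStatus")
X = pd.DataFrame({"feature": np.arange(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40, stratify=y
)

# With stratify=y both partitions keep roughly the original 70/30 ratio
print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```

Without stratification, a test set of 200 rows could by chance contain a noticeably different share of bad-credit entries, which would make evaluation metrics harder to compare.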
In [49]:
german_xai.dtypes
german_xai.shape
Out[49]:
NumMonths                                              int64
CreditAmount                                         float64
PayBackPercent                                         int64
Gender                                                 int64
ResidenceDuration                                      int64
Age                                                    int64
ExistingCredit                                         int64
Dependents                                             int64
Telephone                                              int64
Foreignworker                                          int64
CurrentAcc_GE200                                       uint8
CurrentAcc_LT200                                       uint8
CurrentAcc_None                                        uint8
CreditHistory_Delay                                    uint8
CreditHistory_none/paid                                uint8
CreditHistory_other                                    uint8
Purpose_CarNew                                         uint8
Purpose_CarUsed                                        uint8
Purpose_biz                                            uint8
Purpose_domestic app                                   uint8
Purpose_education                                      uint8
Purpose_furniture/equip                                uint8
Purpose_others                                         uint8
Purpose_radio/tv                                       uint8
Purpose_repairs                                        uint8
Purpose_retraining                                     uint8
Savings_GT500                                          uint8
Savings_LT500                                          uint8
Savings_none                                           uint8
EmployDuration_1-4                                     uint8
EmployDuration_4-7                                     uint8
EmployDuration_GE7                                     uint8
EmployDuration_LT1                                     uint8
EmployDuration_unemployed                              uint8
Debtors_co-applicant                                   uint8
Debtors_guarantor                                      uint8
Debtors_none                                           uint8
Collateral_car/other                                   uint8
Collateral_real_estate                                 uint8
Collateral_savings/life_insurance                      uint8
Collateral_unknown/none                                uint8
OtherPayBackPlan_bank                                  uint8
OtherPayBackPlan_none                                  uint8
OtherPayBackPlan_stores                                uint8
Property_free                                          uint8
Property_own                                           uint8
Property_rent                                          uint8
Job_management/self-emp/officer/highly_qualif_emp      uint8
Job_skilled_employee                                   uint8
Job_unemp/unskilled-non_resident                       uint8
Job_unskilled-resident                                 uint8
CreditStatus                                           int64
dtype: object
Out[49]:
(1000, 52)
In [50]:
import klib
klib.missingval_plot(X)
klib.missingval_plot(y)
No missing values found in the dataset.
No missing values found in the dataset.

Feature Selection

1. Using mutual information (mutual_info_classif)

In [51]:
from sklearn.feature_selection import mutual_info_classif
mutual_info=mutual_info_classif(X_train, y_train,random_state=40)
mutual_info
Out[51]:
array([0.05678877, 0.02318715, 0.        , 0.00573952, 0.        ,
       0.01872571, 0.0136521 , 0.        , 0.        , 0.0095328 ,
       0.02398797, 0.05296151, 0.06220834, 0.03330972, 0.03190313,
       0.00553199, 0.00157823, 0.01061717, 0.        , 0.        ,
       0.01199827, 0.01585764, 0.01401993, 0.02156122, 0.02378362,
       0.        , 0.0200516 , 0.01483981, 0.0201077 , 0.02075406,
       0.0081356 , 0.00738015, 0.01843426, 0.01352031, 0.01477018,
       0.00101237, 0.00829283, 0.        , 0.00020473, 0.02448042,
       0.        , 0.        , 0.02107073, 0.        , 0.        ,
       0.0046772 , 0.02035891, 0.02148182, 0.        , 0.        ,
       0.        ])

mutual_info_classif estimates the mutual information between each feature and a discrete target variable.

Mutual information (MI) between two random variables is a non-negative value that measures the dependency between them. It is zero if and only if the two variables are independent, and higher values indicate stronger dependency.
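
As a rough illustration of this definition, the sketch below scores one feature that determines the label and one that is pure noise. The data and the names informative, noise, and y_demo are made up for the example. Note that scikit-learn estimates MI with a nearest-neighbour method and replaces negative estimates with 0, so a score of exactly 0 means "no dependency detected" rather than proven independence.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
informative = rng.normal(size=n)   # drives the label
noise = rng.normal(size=n)         # independent of the label
y_demo = (informative > 0).astype(int)

X_demo = np.column_stack([informative, noise])
mi = mutual_info_classif(X_demo, y_demo, random_state=0)

# The first score should be clearly larger than the second.
print(mi)
```

By default the function treats the columns of a dense array as continuous; the discrete_features parameter can be used to flag one-hot or integer-coded columns instead, which may change the estimates.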

In [52]:
X.columns
Out[52]:
Index(['NumMonths', 'CreditAmount', 'PayBackPercent', 'Gender',
       'ResidenceDuration', 'Age', 'ExistingCredit', 'Dependents', 'Telephone',
       'Foreignworker', 'CurrentAcc_GE200', 'CurrentAcc_LT200',
       'CurrentAcc_None', 'CreditHistory_Delay', 'CreditHistory_none/paid',
       'CreditHistory_other', 'Purpose_CarNew', 'Purpose_CarUsed',
       'Purpose_biz', 'Purpose_domestic app', 'Purpose_education',
       'Purpose_furniture/equip', 'Purpose_others', 'Purpose_radio/tv',
       'Purpose_repairs', 'Purpose_retraining', 'Savings_GT500',
       'Savings_LT500', 'Savings_none', 'EmployDuration_1-4',
       'EmployDuration_4-7', 'EmployDuration_GE7', 'EmployDuration_LT1',
       'EmployDuration_unemployed', 'Debtors_co-applicant',
       'Debtors_guarantor', 'Debtors_none', 'Collateral_car/other',
       'Collateral_real_estate', 'Collateral_savings/life_insurance',
       'Collateral_unknown/none', 'OtherPayBackPlan_bank',
       'OtherPayBackPlan_none', 'OtherPayBackPlan_stores', 'Property_free',
       'Property_own', 'Property_rent',
       'Job_management/self-emp/officer/highly_qualif_emp',
       'Job_skilled_employee', 'Job_unemp/unskilled-non_resident',
       'Job_unskilled-resident'],
      dtype='object')
In [53]:
mutual_info=pd.Series(mutual_info)
mutual_info.index=X_train.columns
mutual_info.sort_values(ascending=False)
Out[53]:
CurrentAcc_None                                      0.062208
NumMonths                                            0.056789
CurrentAcc_LT200                                     0.052962
CreditHistory_Delay                                  0.033310
CreditHistory_none/paid                              0.031903
Collateral_savings/life_insurance                    0.024480
CurrentAcc_GE200                                     0.023988
Purpose_repairs                                      0.023784
CreditAmount                                         0.023187
Purpose_radio/tv                                     0.021561
Job_management/self-emp/officer/highly_qualif_emp    0.021482
OtherPayBackPlan_none                                0.021071
EmployDuration_1-4                                   0.020754
Property_rent                                        0.020359
Savings_none                                         0.020108
Savings_GT500                                        0.020052
Age                                                  0.018726
EmployDuration_LT1                                   0.018434
Purpose_furniture/equip                              0.015858
Savings_LT500                                        0.014840
Debtors_co-applicant                                 0.014770
Purpose_others                                       0.014020
ExistingCredit                                       0.013652
EmployDuration_unemployed                            0.013520
Purpose_education                                    0.011998
Purpose_CarUsed                                      0.010617
Foreignworker                                        0.009533
Debtors_none                                         0.008293
EmployDuration_4-7                                   0.008136
EmployDuration_GE7                                   0.007380
Gender                                               0.005740
CreditHistory_other                                  0.005532
Property_own                                         0.004677
Purpose_CarNew                                       0.001578
Debtors_guarantor                                    0.001012
Collateral_real_estate                               0.000205
Job_skilled_employee                                 0.000000
OtherPayBackPlan_bank                                0.000000
Property_free                                        0.000000
OtherPayBackPlan_stores                              0.000000
Job_unemp/unskilled-non_resident                     0.000000
Purpose_retraining                                   0.000000
Collateral_unknown/none                              0.000000
Collateral_car/other                                 0.000000
Purpose_domestic app                                 0.000000
Purpose_biz                                          0.000000
Telephone                                            0.000000
Dependents                                           0.000000
ResidenceDuration                                    0.000000
PayBackPercent                                       0.000000
Job_unskilled-resident                               0.000000
dtype: float64
In [54]:
mutual_info.sort_values(ascending=False).plot.bar(figsize=(15,5))
Out[54]:
[Figure: bar chart of the mutual information of each feature with CreditStatus, sorted in descending order]

Selecting the ten features with the highest mutual information with the target variable CreditStatus, along with the protected variables under consideration (Gender and Age). Together these make up roughly a quarter of the 51 input features.

In [55]:
mutual_info.sort_values(ascending=False)[0:10]
Out[55]:
CurrentAcc_None                      0.062208
NumMonths                            0.056789
CurrentAcc_LT200                     0.052962
CreditHistory_Delay                  0.033310
CreditHistory_none/paid              0.031903
Collateral_savings/life_insurance    0.024480
CurrentAcc_GE200                     0.023988
Purpose_repairs                      0.023784
CreditAmount                         0.023187
Purpose_radio/tv                     0.021561
dtype: float64
In [56]:
# Top 10 features by mutual information, followed by the protected attributes and the target
german_xai_imp = german_xai[['CurrentAcc_None', 'NumMonths', 'CurrentAcc_LT200',
                             'CreditHistory_Delay', 'CreditHistory_none/paid',
                             'Collateral_savings/life_insurance', 'CurrentAcc_GE200',
                             'Purpose_repairs', 'CreditAmount', 'Purpose_radio/tv',
                             'Gender', 'Age', 'CreditStatus']]
german_xai_imp.head()
Out[56]:
CurrentAcc_None NumMonths CurrentAcc_LT200 CreditHistory_Delay CreditHistory_none/paid Collateral_savings/life_insurance CurrentAcc_GE200 Purpose_repairs CreditAmount Purpose_radio/tv Gender Age CreditStatus
0 0 6 1 0 0 0 0 0 0.050567 1 1 1 1
1 0 48 1 0 1 0 0 0 0.313690 1 0 0 0
2 1 12 0 0 0 0 0 0 0.101574 0 1 1 1
3 0 42 1 0 1 1 0 0 0.419941 0 1 1 1
4 0 24 1 1 0 0 0 0 0.254209 0 1 1 0
In [57]:
german_xai_imp.dtypes
Out[57]:
CurrentAcc_None                        uint8
NumMonths                              int64
CurrentAcc_LT200                       uint8
CreditHistory_Delay                    uint8
CreditHistory_none/paid                uint8
Collateral_savings/life_insurance      uint8
CurrentAcc_GE200                       uint8
Purpose_repairs                        uint8
CreditAmount                         float64
Purpose_radio/tv                       uint8
Gender                                 int64
Age                                    int64
CreditStatus                           int64
dtype: object
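
The column list in cell 56 was typed by hand from the ranking. One possible way to build the same kind of list programmatically is sketched below. The helper top_k_plus_protected and the scores in demo_scores are hypothetical and only serve as an illustration.

```python
import pandas as pd

def top_k_plus_protected(scores, k, protected, target):
    """Return the k highest-scoring feature names, then any protected
    columns not already selected, then the target column."""
    top = scores.sort_values(ascending=False).head(k).index.tolist()
    extra = [c for c in protected if c not in top]
    return top + extra + [target]

# Hypothetical scores, for illustration only.
demo_scores = pd.Series(
    {"A": 0.06, "B": 0.01, "C": 0.05, "Gender": 0.005, "Age": 0.02}
)
cols = top_k_plus_protected(
    demo_scores, k=2, protected=["Gender", "Age"], target="CreditStatus"
)
print(cols)
```

In the notebook, the equivalent call would pass the mutual_info Series from cell 53 with k=10 and then index german_xai with the returned list. This keeps the selection in sync if the scores change.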

2. Using correlation

In [58]:
corrMatrix = round(german_xai_imp.corr(),1)
corrMatrix
Out[58]:
CurrentAcc_None NumMonths CurrentAcc_LT200 CreditHistory_Delay CreditHistory_none/paid Collateral_savings/life_insurance CurrentAcc_GE200 Purpose_repairs CreditAmount Purpose_radio/tv Gender Age CreditStatus
CurrentAcc_None 1.0 -0.1 -0.9 0.0 -0.2 -0.0 -0.2 -0.0 -0.0 0.1 0.0 0.1 0.3
NumMonths -0.1 1.0 0.1 0.1 -0.0 -0.1 -0.1 -0.0 0.6 -0.0 0.1 0.0 -0.2
CurrentAcc_LT200 -0.9 0.1 1.0 -0.0 0.2 0.0 -0.3 0.0 0.1 -0.1 -0.0 -0.1 -0.3
CreditHistory_Delay 0.0 0.1 -0.0 1.0 -0.4 0.0 -0.0 0.0 0.1 -0.0 0.1 0.1 -0.0
CreditHistory_none/paid -0.2 -0.0 0.2 -0.4 1.0 -0.0 0.0 -0.0 -0.0 0.0 -0.1 -0.2 -0.2
Collateral_savings/life_insurance -0.0 -0.1 0.0 0.0 -0.0 1.0 -0.0 -0.0 -0.0 -0.1 -0.0 -0.0 -0.0
CurrentAcc_GE200 -0.2 -0.1 -0.3 -0.0 0.0 -0.0 1.0 -0.0 -0.1 0.1 -0.0 0.0 0.0
Purpose_repairs -0.0 -0.0 0.0 0.0 -0.0 -0.0 -0.0 1.0 -0.0 -0.1 0.0 -0.0 -0.0
CreditAmount -0.0 0.6 0.1 0.1 -0.0 -0.0 -0.1 -0.0 1.0 -0.2 0.1 0.0 -0.2
Purpose_radio/tv 0.1 -0.0 -0.1 -0.0 0.0 -0.1 0.1 -0.1 -0.2 1.0 0.0 -0.1 0.1
Gender 0.0 0.1 -0.0 0.1 -0.1 -0.0 -0.0 0.0 0.1 0.0 1.0 0.3 0.1
Age 0.1 0.0 -0.1 0.1 -0.2 -0.0 0.0 -0.0 0.0 -0.1 0.3 1.0 0.1
CreditStatus 0.3 -0.2 -0.3 -0.0 -0.2 -0.0 0.0 -0.0 -0.2 0.1 0.1 0.1 1.0
In [59]:
klib.corr_plot(german_xai_imp,annot=False)
Out[59]:
[Figure: klib correlation plot of german_xai_imp]
In [60]:
corrMatrix1 = round(german_xai_imp.corr(),1)
corrMatrix1
plt.figure(figsize=(15,15))
sns.heatmap(corrMatrix1, annot=True,cmap="Blues")
plt.show()
Out[60]:
CurrentAcc_None NumMonths CurrentAcc_LT200 CreditHistory_Delay CreditHistory_none/paid Collateral_savings/life_insurance CurrentAcc_GE200 Purpose_repairs CreditAmount Purpose_radio/tv Gender Age CreditStatus
CurrentAcc_None 1.0 -0.1 -0.9 0.0 -0.2 -0.0 -0.2 -0.0 -0.0 0.1 0.0 0.1 0.3
NumMonths -0.1 1.0 0.1 0.1 -0.0 -0.1 -0.1 -0.0 0.6 -0.0 0.1 0.0 -0.2
CurrentAcc_LT200 -0.9 0.1 1.0 -0.0 0.2 0.0 -0.3 0.0 0.1 -0.1 -0.0 -0.1 -0.3
CreditHistory_Delay 0.0 0.1 -0.0 1.0 -0.4 0.0 -0.0 0.0 0.1 -0.0 0.1 0.1 -0.0
CreditHistory_none/paid -0.2 -0.0 0.2 -0.4 1.0 -0.0 0.0 -0.0 -0.0 0.0 -0.1 -0.2 -0.2
Collateral_savings/life_insurance -0.0 -0.1 0.0 0.0 -0.0 1.0 -0.0 -0.0 -0.0 -0.1 -0.0 -0.0 -0.0
CurrentAcc_GE200 -0.2 -0.1 -0.3 -0.0 0.0 -0.0 1.0 -0.0 -0.1 0.1 -0.0 0.0 0.0
Purpose_repairs -0.0 -0.0 0.0 0.0 -0.0 -0.0 -0.0 1.0 -0.0 -0.1 0.0 -0.0 -0.0
CreditAmount -0.0 0.6 0.1 0.1 -0.0 -0.0 -0.1 -0.0 1.0 -0.2 0.1 0.0 -0.2
Purpose_radio/tv 0.1 -0.0 -0.1 -0.0 0.0 -0.1 0.1 -0.1 -0.2 1.0 0.0 -0.1 0.1
Gender 0.0 0.1 -0.0 0.1 -0.1 -0.0 -0.0 0.0 0.1 0.0 1.0 0.3 0.1
Age 0.1 0.0 -0.1 0.1 -0.2 -0.0 0.0 -0.0 0.0 -0.1 0.3 1.0 0.1
CreditStatus 0.3 -0.2 -0.3 -0.0 -0.2 -0.0 0.0 -0.0 -0.2 0.1 0.1 0.1 1.0
Out[60]:
[Figure: annotated heatmap of corrMatrix1]
In [61]:
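# Drop one feature from each strongly correlated pair:
# CurrentAcc_LT200 (-0.9 with CurrentAcc_None) and CreditAmount (0.6 with NumMonths)
german_upd = german_xai_imp.drop(['CurrentAcc_LT200', 'CreditAmount'], axis=1)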
german_upd
Out[61]:
CurrentAcc_None NumMonths CreditHistory_Delay CreditHistory_none/paid Collateral_savings/life_insurance CurrentAcc_GE200 Purpose_repairs Purpose_radio/tv Gender Age CreditStatus
0 0 6 0 0 0 0 0 1 1 1 1
1 0 48 0 1 0 0 0 1 0 0 0
2 1 12 0 0 0 0 0 0 1 1 1
3 0 42 0 1 1 0 0 0 1 1 1
4 0 24 1 0 0 0 0 0 1 1 0
... ... ... ... ... ... ... ... ... ... ... ...
995 1 12 0 1 0 0 0 0 0 1 1
996 0 30 0 1 1 0 0 0 1 1 1
997 1 12 0 1 0 0 0 1 1 1 1
998 0 45 0 1 0 0 0 1 1 0 0
999 0 45 0 0 0 0 0 0 1 1 1

1000 rows × 11 columns

In [62]:
corrMatrix2 = round(german_upd.corr(),1)
corrMatrix2
plt.figure(figsize=(15,15))
sns.heatmap(corrMatrix2, annot=True,cmap="Blues")
plt.show()
Out[62]:
CurrentAcc_None NumMonths CreditHistory_Delay CreditHistory_none/paid Collateral_savings/life_insurance CurrentAcc_GE200 Purpose_repairs Purpose_radio/tv Gender Age CreditStatus
CurrentAcc_None 1.0 -0.1 0.0 -0.2 -0.0 -0.2 -0.0 0.1 0.0 0.1 0.3
NumMonths -0.1 1.0 0.1 -0.0 -0.1 -0.1 -0.0 -0.0 0.1 0.0 -0.2
CreditHistory_Delay 0.0 0.1 1.0 -0.4 0.0 -0.0 0.0 -0.0 0.1 0.1 -0.0
CreditHistory_none/paid -0.2 -0.0 -0.4 1.0 -0.0 0.0 -0.0 0.0 -0.1 -0.2 -0.2
Collateral_savings/life_insurance -0.0 -0.1 0.0 -0.0 1.0 -0.0 -0.0 -0.1 -0.0 -0.0 -0.0
CurrentAcc_GE200 -0.2 -0.1 -0.0 0.0 -0.0 1.0 -0.0 0.1 -0.0 0.0 0.0
Purpose_repairs -0.0 -0.0 0.0 -0.0 -0.0 -0.0 1.0 -0.1 0.0 -0.0 -0.0
Purpose_radio/tv 0.1 -0.0 -0.0 0.0 -0.1 0.1 -0.1 1.0 0.0 -0.1 0.1
Gender 0.0 0.1 0.1 -0.1 -0.0 -0.0 0.0 0.0 1.0 0.3 0.1
Age 0.1 0.0 0.1 -0.2 -0.0 0.0 -0.0 -0.1 0.3 1.0 0.1
CreditStatus 0.3 -0.2 -0.0 -0.2 -0.0 0.0 -0.0 0.1 0.1 0.1 1.0
Out[62]:
[Figure: annotated heatmap of corrMatrix2]

After dropping CurrentAcc_LT200 (correlation of -0.9 with CurrentAcc_None) and CreditAmount (0.6 with NumMonths), no strong correlation remains between the input variables, or between the target variable and the input variables. The protected variables Gender and Age are mildly correlated with each other (0.3), but since we are trying to understand the impact of protected variables, we retain both rather than dropping either one.
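
Reading strongly correlated pairs off a heatmap works for a dozen columns but becomes tedious for more. The sketch below lists such pairs directly. The helper high_corr_pairs, the 0.5 threshold, and the toy columns a, b, c are assumptions for the illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.5):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),  # nearly a copy of a
    "c": rng.normal(size=500),            # unrelated to a and b
})
result = high_corr_pairs(demo, threshold=0.5)
print(result)
```

Applied to german_xai_imp without the target column, a helper like this should flag the two pairs that motivated the drop in cell 61. Which member of each pair to drop remains a judgment call.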

Writing data to CSV file

In [63]:
german_upd.to_csv('C:/Users/krish/Downloads/German-reduced_upd.csv', index=False)

List of protected attributes

(https://arxiv.org/pdf/1811.11154.pdf)

In [64]:
from IPython.display  import Image
Image(filename='C:/Users/krish/Desktop/MAIN PJT/list of protected variables.png',width=500,height=30)
Out[64]:

From the list above, two protected fields are present in our dataset:

1. Gender
2. Age

Now, let us identify the privileged class in each protected attribute.

1. Gender

In [68]:
print(german_upd['Gender'].value_counts())
german_upd.groupby(['Gender'])['CreditStatus'].mean()
#https://arxiv.org/pdf/1810.01943.pdf, https://arxiv.org/pdf/2005.12379.pdf
1    690
0    310
Name: Gender, dtype: int64
Out[68]:
Gender
0    0.648387
1    0.723188
Name: CreditStatus, dtype: float64

Males (1) outnumber females (0), and the mean of the target variable CreditStatus is higher for males (0.723) than for females (0.648), meaning the favorable outcome is more frequent among males. Hence male (1) is the privileged class.

2. Age

In [69]:
print(german_upd['Age'].value_counts())
german_upd.groupby(['Age'])['CreditStatus'].mean()
1    810
0    190
Name: Age, dtype: int64
Out[69]:
Age
0    0.578947
1    0.728395
Name: CreditStatus, dtype: float64

Age is encoded as 1 for applicants older than 26 and 0 otherwise. Applicants older than 26 are more numerous, and their mean CreditStatus (0.728) is higher than that of the younger group (0.579), so Age = 1 is the privileged group.
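
The group comparison used for both attributes can be wrapped in a small helper that also reports the ratio of the two favourable-outcome rates, a quantity often called disparate impact in the fairness literature. This is a sketch only: the helper group_rates is hypothetical, and the toy frame is rebuilt from the Gender counts and means printed in Out[68] (690 × 0.723188 ≈ 499 and 310 × 0.648387 ≈ 201), not from the original file.

```python
import pandas as pd

def group_rates(df, group_col, outcome_col, privileged=1):
    """Favourable-outcome rate per group, and each unprivileged
    group's rate divided by the privileged group's rate."""
    rates = df.groupby(group_col)[outcome_col].mean()
    unprivileged = rates.drop(index=privileged)
    ratio = unprivileged / rates[privileged]
    return rates, ratio

# Assumption: counts reconstructed from the printed value counts and group means.
demo = pd.DataFrame({
    "Gender": [1] * 690 + [0] * 310,
    "CreditStatus": [1] * 499 + [0] * 191 + [1] * 201 + [0] * 109,
})
rates, ratio = group_rates(demo, "Gender", "CreditStatus")
print(rates)
print(ratio)
```

A ratio close to 1 would indicate similar favourable rates in both groups; the further it falls below 1, the larger the gap in favour of the privileged group. The same call with group_col="Age" on german_upd would give the corresponding figure for age.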

In [70]:
german_upd.columns
Out[70]:
Index(['CurrentAcc_None', 'NumMonths', 'CreditHistory_Delay',
       'CreditHistory_none/paid', 'Collateral_savings/life_insurance',
       'CurrentAcc_GE200', 'Purpose_repairs', 'Purpose_radio/tv', 'Gender',
       'Age', 'CreditStatus'],
      dtype='object')