By Kyoosik Kim
Home Mortgage Disclosure Act (HMDA) is a federal law in the US that requires financial institutions to provide mortgage data to the public. The data contains basic information regarding mortgage loans such as applicant's income and race, loan amount, regions, and many others. What we focus is the relationship between these factors and the result of evaluation. There are 6 different available results, among which approval and denial are the major counts. Through the analysis, we will find patterns in numerical and categorical features, and eventually apply them to building a predictive model.
- Setup and Query Dataset
- Preparation
- Data Cleaning: Category Integration, Type Conversion
- Data Analysis: Scatterplot, Piechart, Boxplot
- Clustering: Hierarchical, K-means, Density-based
- Modeling: Decision Tree, Naive Bayes, Random Forest, Gradient Boosting, RFE, PCA, Voting Ensemble
- Conclusion
The data is pulled directly from the CFPB API. Although downloading in CSV is also available on Custom Dataset page at the same website, API has a benefit of flexibility like querying data. The data set is HMDA data of Detroit Metropolitan Area where social issues often arise such as racial discrimination or high crime rate. The data format is JSON to be able to fit MongoDB so that we can query the data fast. The purpose of querying is to briefly learn about the features as a preliminary process before detailed analysis. The following are libraries to be used for the entire project.
# file control, api, mongodb
import json
import os
import urllib3
from pymongo import MongoClient
# dataframe
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
# clustering
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans, DBSCAN
import scipy.cluster.hierarchy as shc
# model preparation
from sklearn import preprocessing as ppr
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
# modeling
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import RFE
# model evaluation
from sklearn.model_selection import cross_val_score
from sklearn.metrics import *
# others
import scipy.stats as stats
from pprint import pprint
import warnings
warnings.filterwarnings('ignore')
"""
year, state, county can be given as condition
pull all the features in JSON
save the result as a file for later use
"""
def pull_dataset(file_name, year, msamd, code):
http = urllib3.PoolManager()
# request if file doesn't exist
if not os.path.isfile(file_name):
cfpb_url = ('https://api.consumerfinance.gov/data/hmda/slice/hmda_lar.json?&'
'$where=as_of_year+IN+('+year+')+AND+action_taken+IN+(1,3)+AND+(msamd=%'+msamd+'%'+code+')&'
'$select=tract_to_msamd_income,'
'population,'
'minority_population,'
'number_of_owner_occupied_units,'
'number_of_1_to_4_family_units,'
'loan_amount_000s,'
'applicant_income_000s,'
'loan_type_name,'
'loan_purpose_name,'
'co_applicant_race_name_1,'
'census_tract_number,'
'applicant_sex_name,'
'applicant_race_name_1,'
'agency_abbr,'
'action_taken_name&$limit=0')
# receive data and store in JSON
response = http.request('GET', cfpb_url)
json_data = json.loads(response.data.decode('utf-8'))
results = json_data['results']
# write file
with open(file_name, 'w', encoding='utf-8') as outfile:
json_string = json.dumps(results)
outfile.write(json_string)
outfile.close()
# read file if file already exists
with open(file_name, 'rb') as infile:
data = json.loads(infile.read())
infile.close()
return data
# recent 3 years to local DB
detroit = pull_dataset("detroit.json", "2017,2016,2015", "2219804", "22")
# do not write if db already exists
client = MongoClient('localhost:27017')
if "hmda" not in client.list_database_names():
db = client.hmda
db['detroit'].insert_many(detroit)
db = client.hmda
query_count = db.detroit.count()
query_find_one = db.detroit.find_one()
print("Number of Cases:", query_count)
print("\n<Case Example>")
pprint(query_find_one)
# screen meaningless features
def check_disticnt(keys):
one_value_keys = []
for key in keys:
query_distinct = db.detroit.distinct(key)
if len(query_distinct) == 1:
one_value_keys.append(key)
return one_value_keys
keys = query_find_one
print("Number of Keys:", len(keys))
print("\nKeys to be dropped for having the same value:")
pprint(check_disticnt(keys))
def convert_to_list(query_result, roundup_keys):
doc_list = []
for doc in query_result:
for key in roundup_keys:
try:
doc[key] = round(doc[key], 2)
except:
doc[key] = "NaN"
doc_list.append(doc)
return doc_list
query_action_taken = db.detroit.aggregate([
{"$group":
{
"_id": "$action_taken_name",
"count": {"$sum": 1},
"amount": {"$avg": "$loan_amount_000s"},
"income": {"$avg": "$applicant_income_000s"}
}
},
{"$sort": {"count": -1}}
])
pprint(convert_to_list(query_action_taken, ['amount', 'income']))
As the target is loan evaluation results; approved vs. denied, the query result above help us understand how the classes are distributed. In addition, it is also telling that loan amount and income are different across the loan results. Therefore, these features are suspected to play a major role for analysis.
# query loan accepted
field_val = "Loan originated"
total_count = db.detroit.find({"action_taken_name": field_val}).count()
query_loan_originated = db.detroit.aggregate([
{"$match": {"action_taken_name": field_val}},
{"$group": {"_id": "$applicant_race_name_1",
"count": {"$sum": 1},
"amount": {"$avg": "$loan_amount_000s"},
"income": {"$avg": "$applicant_income_000s"}}
},
{"$sort": {"count": -1}},
{"$addFields": {"case_share": {"$divide": ["$count", total_count]}}}
])
acceptance = convert_to_list(query_loan_originated, ['amount', 'income', 'case_share'])
# query loan denied
field_val = "Application denied by financial institution"
total_count = db.detroit.find({"action_taken_name": field_val}).count()
query_loan_originated = db.detroit.aggregate([
{"$match": {"action_taken_name": field_val}},
{"$group": {"_id": "$applicant_race_name_1",
"count": {"$sum": 1},
"amount": {"$avg": "$loan_amount_000s"},
"income": {"$avg": "$applicant_income_000s"}}
},
{"$sort": {"count": -1}},
{"$addFields": {"case_share": {"$divide": ["$count", total_count]}}}
])
denial = convert_to_list(query_loan_originated, ['amount', 'income', 'case_share'])
pd.concat({'Accepted': pd.DataFrame(acceptance), 'Denied': pd.DataFrame(denial)})
Another query result of racial factors seem to be useful. It is quite clear that there are differences among different races, thus, race-related features should be significant for analysis. Along with these, some other features like loan type, agency, loan purpose will be put together with aforementioned features to find patterns.
The next step becomes turning JSON data on MongoDB into a CSV file so we can make use of analytical tools on Python. Once this step is done, we should handle missing values first before moving forward.
""" Read from Mongo and Store into DataFrame """
def read_mongo(collection):
# Connect to MongoDB
client = MongoClient('localhost:27017')
db = client.hmda
# Make a query to the specific DB and Collection
cursor = db[collection].find({})
# Expand the cursor and construct the DataFrame
df = pd.DataFrame(list(cursor))
df.drop(['_id'], axis=1, inplace=True)
return df
hmda_df = read_mongo("detroit")
hmda_df.shape
hmda_df.head()
hmda_df.info()
Fortunately, most of the features are complete or missing a very few. For those features with missing values, the following can be considered for handling.
While there could be other methods, the above are largely suitable for the situation. To decide which, we need to see if the missing values are systematic or random. The latter is more flexible for both imputation methods while the former case needs further investigation. Since there does not seem to have a pattern in missing values, we can simply drop them when filling with statistics could make it more complicated.
hmda_df.dropna(inplace=True)
hmda_df.shape
The last step before analysis is cleaning the data. Since we have many categorical features, it is necessary to look at them to see if any of them is problematic. This is a safer way because some analysis would later need to use categorical features together with numerical features like boxplot as an instance. It would be a good idea to organize categories first before doing such a thing.
# handle a trivial error
def fix_error_message(df, cols):
error_message = "Information not provided by applicant in mail, Internet, or telephone application"
for col in cols:
df[col] = df[col].apply(lambda x :
"Not available" if x==error_message else x)
return df
# fix error values
cols = list(hmda_df.select_dtypes('object').columns)
hmda_df = fix_error_message(hmda_df, cols)
# long name to shorter version
hmda_df = hmda_df.apply(pd.Series.replace, to_replace="Native Hawaiian or Other Pacific Islander",
value="Islander")
hmda_df = hmda_df.apply(pd.Series.replace, to_replace="American Indian or Alaska Native",
value="Native American")
hmda_df = hmda_df.apply(pd.Series.replace, to_replace="Black or African American",
value="Black")
# horizontal barplot
def value_counts_bar(col, title):
vc = col.value_counts()
plt.figure(figsize=(6, len(vc.keys())/2))
plt.barh(vc.keys(), vc.values)
plt.title(title, fontsize=15)
plt.xticks(rotation='vertical', fontsize=10)
plt.show()
• Loan Result
value_counts_bar(hmda_df.action_taken_name, "Action Taken")
# change class value names
hmda_df['action_taken_name'] = hmda_df['action_taken_name'].apply(lambda x : "Approved"
if x == "Loan originated" else "Denied")
• Agency
value_counts_bar(hmda_df.agency_abbr, "Agency")
hmda_df['agency_abbr'] = hmda_df['agency_abbr'].apply(lambda x : "Others"
if x not in ['CFPB', 'HUD', 'NCUA'] else x)
While the top 3 values (HUD/CFPB/NCUA) account for the most, the rest are only a fraction. Since having too many values for such a small count could lead to overfitting, they can be merged together as Others.
• Applicant Race
value_counts_bar(hmda_df.applicant_race_name_1, "Applicant Race")
# White and Black are major two
hmda_df['applicant_race_name_1'] = hmda_df['applicant_race_name_1'].apply(lambda x : 'Others'
if x not in ['White', 'Black'] else x)
• Applicant Sex
value_counts_bar(hmda_df.applicant_sex_name, "Applicant Sex")
hmda_df['applicant_sex_name'] = hmda_df['applicant_sex_name'].apply(lambda x : "Others"
if x not in ['Male', 'Female'] else x)
• Co-applicant Race
value_counts_bar(hmda_df.co_applicant_race_name_1, "Co-applicant Race")
# with or without co-applicant
hmda_df['co_applicant_race_name_1'] = hmda_df['co_applicant_race_name_1'].apply(lambda x : "With co-applicant"
if x != 'No co-applicant' else x)
hmda_df.rename(columns={'co_applicant_race_name_1': 'co_applicant'}, inplace=True)
• Loan Purpose
value_counts_bar(hmda_df.loan_purpose_name, "Loan Purpose")
• Loan Type
value_counts_bar(hmda_df.loan_type_name, "Loan Type")
hmda_df['loan_type_name'] = hmda_df['loan_type_name'].apply(lambda x : "Non-conventional"
if x != "Conventional" else x)
• Census Tract Number
hmda_df = hmda_df[hmda_df['census_tract_number'] != ""]
print("Number of Census Tract:", len(hmda_df['census_tract_number'].unique()))
Census tract numbers are not numerical data as their zipcode-like numbers do not mean anything. Detroit Census Information can help to understand the basic statistics with a visual support. Census tract numbers are basically deemed as categorical data, but there are over a hundred different values. In fact, we might still use it as categorical data but this could cause overfitting issues, missing general trends. The best way to deal with this is clustering. This technique, though, requires us to know about the dataset first as we have yet to know what based we should cluster them on.
• Type Conversion and Summary
# object to category
object_cols = list(hmda_df.select_dtypes('object').columns)
object_cols.remove('census_tract_number')
hmda_df[object_cols] = hmda_df[object_cols].astype('category')
hmda_df.describe(include=['category']).drop('count').T
As plotting data always comes in handy to feel the data, scatterplot and boxplot will be used as data visualization. Before them, we need to know the distribution for numerical data. If they are overly skewed or they have strange patterns, they should be handled first.
hmda_df.describe().drop('count').T
The statistics are already showing some problems with dollar-related features that the max is too much compared with 75 quantile. This means the observations are highly skewed. The details can be revealed by plotting them.
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
axes[0].hist(hmda_df.applicant_income_000s, bins=1000)
axes[0].axvline(hmda_df.applicant_income_000s.mean(), color='red', linestyle='--')
axes[0].set_xlim([0, 300])
axes[0].set_title("Income", fontsize=15)
axes[0].set_xlabel("Thousand $")
axes[1].hist(hmda_df.loan_amount_000s, bins=300)
axes[1].axvline(hmda_df.loan_amount_000s.mean(), color='red', linestyle='--')
axes[1].set_xlim([0, 500])
axes[1].set_title("Loan Amount", fontsize=15)
axes[1].set_xlabel("Thousand $")
axes[2].hist(hmda_df.tract_to_msamd_income, bins=40)
axes[2].axvline(hmda_df.tract_to_msamd_income.mean(), color='red', linestyle='--')
axes[2].set_xlim([0, 300])
axes[2].set_title("Income by MSA Tract", fontsize=15)
axes[2].set_xlabel("Thousand $")
plt.tight_layout()
plt.show()
As expected from statistics, Income and Loan Amount are positively skewed, having a lot of small dollar amounts. Before considering transformation, we need to note that there are 0 amount cases which doesn't make sense. Especially, there are so many cases of loan amount equal to zero (possibly just less than a $1,000). We could suspect that these are administrative issues such as applications filed for fixing errors or any kind of paper jobs for any reason. Therefore, we would consider excluding these zero amount cases.
hmda_df = hmda_df[(hmda_df['applicant_income_000s'] > 0) & (hmda_df['applicant_income_000s'] < 250)]
hmda_df = hmda_df[(hmda_df['loan_amount_000s'] > 0) & (hmda_df['loan_amount_000s'] < 400)]
hmda_df = hmda_df[(hmda_df['tract_to_msamd_income'] > 0) & (hmda_df['tract_to_msamd_income'] < 280)]
hmda_df.shape
print("Average Income: $", round(hmda_df['applicant_income_000s'].mean(), 2), "K")
print("Average Loan Amount: $", round(hmda_df['loan_amount_000s'].mean(), 2), "K")
Now, we need to decide whether to transform the distributions of the numerical variables. Our goal is classification, which often times requires sparsity among cases throughout features. Although we need more investigation, the skewed distributions suggest the close distance that will make it difficult to classify. Transforming the numerical features will help to spread out the observations across the range.
hmda_df['income_sqrt'] = np.sqrt(hmda_df['applicant_income_000s'])
hmda_df['loan_amount_sqrt'] = np.sqrt(hmda_df['loan_amount_000s'])
fig, axes = plt.subplots(1, 2, figsize=(10,4))
for action in hmda_df['action_taken_name'].unique():
sns.distplot(hmda_df[hmda_df['action_taken_name']==action].income_sqrt,
hist=False, kde=True, ax=axes[0])
sns.distplot(hmda_df[hmda_df['action_taken_name']==action].loan_amount_sqrt,
hist=False, kde=True, label=action, ax=axes[1])
axes[1].legend(bbox_to_anchor=(1,0.6))
plt.show()
dist_0 = hmda_df[hmda_df['action_taken_name']=='Approved']['income_sqrt']
dist_1 = hmda_df[hmda_df['action_taken_name']=='Denied']['income_sqrt']
stat, p_value = stats.ttest_ind(dist_0, dist_1, equal_var=False)
print("Stat/P-value: (", round(stat, 2), ',', p_value, ")")
Even if the t-test is telling the two distributions are significantly different. That is, an application of small income and small loan amount is more likely to be denied than that of large income and large amount. there is a large overlap shared by two classes. Our goal then becomes utilizing other features to separate the observations even more clearly. The following distributions are from the rest of numerical features.
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
axes[0].hist(hmda_df.population, bins=40)
axes[0].axvline(hmda_df.population.mean(), color='red', linestyle='--')
axes[0].set_xlim([0, 8000])
axes[0].set_title("Population per Census Tract", fontsize=12)
axes[0].set_xlabel("Number of Residents")
axes[1].hist(hmda_df.minority_population, bins=40)
axes[1].axvline(hmda_df.minority_population.mean(), color='red', linestyle='--')
axes[1].set_xlim([0, 100])
axes[1].set_title("Minority per Census Tract", fontsize=12)
axes[1].set_xlabel("Ratio (%)")
axes[2].hist(hmda_df.number_of_1_to_4_family_units, bins=40)
axes[2].axvline(hmda_df.number_of_1_to_4_family_units.mean(), color='red', linestyle='--')
axes[2].set_xlim([0, 3200])
axes[2].set_title("1~4 Family Units", fontsize=12)
axes[2].set_xlabel("Units")
axes[3].hist(hmda_df.number_of_owner_occupied_units, bins=40)
axes[3].axvline(hmda_df.number_of_owner_occupied_units.mean(), color='red', linestyle='--')
axes[3].set_xlim([0, 2500])
axes[3].set_title("Number of Owner Occupied Units", fontsize=12)
axes[3].set_xlabel("Units")
plt.tight_layout()
plt.show()
They are roughly bell-shaped except for Minority Population (%). The Minority Population feature indicates that the neighborhoods are severely segregated. For the rest, they don't seem to need transformation.
fig, axes = plt.subplots(1, 3, figsize=(15,4.5))
for action in hmda_df['action_taken_name'].unique():
sns.distplot(hmda_df[hmda_df['action_taken_name']==action].population,
hist=False, kde=True, ax=axes[0])
sns.distplot(hmda_df[hmda_df['action_taken_name']==action].minority_population,
hist=False, kde=True, label=str(action), ax=axes[1])
sns.distplot(hmda_df[hmda_df['action_taken_name']==action].number_of_1_to_4_family_units,
hist=False, kde=True, ax=axes[2])
axes[1].legend(bbox_to_anchor=(0.7,-0.2))
plt.show()
From another view, an application from high minority population rate is more likely to be denied. Again, the rest don't quite seem to have differences by class. However, histogram can't tell everything about the differences between classes until many other features possibly reveal underlying properties.
# add more ratio features
hmda_df['tract_income_to_loan'] = hmda_df['tract_to_msamd_income'] / hmda_df['applicant_income_000s']
hmda_df['income_to_loan'] = hmda_df['loan_amount_000s'] / hmda_df['applicant_income_000s']
corr = hmda_df.corr()
corr
It is natural that sqrt features are highly correlated with their original features. Other than them, we see high correlations between dollar amount features. Another thing to note is that census-based feature are also meaningfully related. This can be easily seen on heatmap.
plt.figure(figsize=(8,8))
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask, square=True, cmap="PiYG", vmin=-0.7, vmax=0.7)
Population is related with Number of 1~4 Family Units as well as Number of Owner Occupied Units, and this is because the number of units are absolute counts. These numbers are supposed to increase as population is bigger. What we can do is turning these into relative numbers by dividing the number of units with population. Also, we need to drop the original features as we transformed them. This is important because we want features to be independent each other for better classification.
# transform to ratio
hmda_df['number_of_1_to_4_family_units'] = hmda_df['number_of_1_to_4_family_units'] / hmda_df['population']
hmda_df['number_of_owner_occupied_units'] = hmda_df['number_of_owner_occupied_units'] / hmda_df['population']
# drop old features
hmda_df.drop(columns=['applicant_income_000s', 'loan_amount_000s', 'tract_to_msamd_income'], inplace=True)
# updated heatmap
corr = hmda_df.corr()
plt.figure(figsize=(8,8))
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask, square=True, cmap="PiYG", vmin=-0.7, vmax=0.7)
# for better visualization or faster calculation
def df_reducer(df, size):
indice = np.random.choice(list(df.index), size, replace=False)
return df.loc[indice]
- Income vs. Loan Amount
reduced_df = df_reducer(hmda_df, 5000)
# income vs. loan amount by sex and race
fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))
sns.scatterplot(x="income_sqrt", y="loan_amount_sqrt",
data=reduced_df, hue="action_taken_name", s=15, ax=axes[0])
axes[0].set_title("Income vs. Loan Amount by Loan Result", fontsize=12)
axes[0].legend(bbox_to_anchor=(0.5, -0.6), loc=8)
sns.scatterplot(x="income_sqrt", y="loan_amount_sqrt",
data=reduced_df, hue="applicant_sex_name", s=15, ax=axes[1])
axes[1].set_title("Income vs. Loan Amount by Sex", fontsize=12)
axes[1].legend(bbox_to_anchor=(0.5, -0.6), loc=8)
sns.scatterplot(x="income_sqrt", y="loan_amount_sqrt",
data=reduced_df, hue="applicant_race_name_1", s=15, ax=axes[2])
axes[2].set_title("Income vs. Loan Amount by Race", fontsize=12)
axes[2].legend(bbox_to_anchor=(0.5, -0.6), loc=8)
plt.show()
Again, larger income and larger loan amount help applicants get their mortgage loans. There is a pattern that women are less likely to get their loans approved than men, even if the visualization is not balanced for categories. Also, black population has weaker chances to get their mortgage loans. However, all these categories are not decisively strong.
- 3D Plot: Income vs. Loan Amount vs. Minority Population
reduced_df = df_reducer(hmda_df, 500)
# comment out for interactive plotting
# %matplotlib notebook
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection='3d')
actions = reduced_df['action_taken_name'].unique()
ranges = [('r', actions[0]), ('b', actions[1])]
for c, action in ranges:
xs = reduced_df[reduced_df['action_taken_name']==action]['minority_population']
ys = reduced_df[reduced_df['action_taken_name']==action]['income_sqrt']
zs = reduced_df[reduced_df['action_taken_name']==action]['loan_amount_sqrt']
ax.scatter(xs, ys, zs, c=c, s=5)
ax.set_title('Mortgage Approval (red) vs. Denial (blue)\n', fontsize=15)
ax.set_xlabel('Minority Population (%)')
ax.set_ylabel('Income')
ax.set_zlabel('Loan Amount')
plt.show()
As for high minority population rate, it is quite clear to group certain classes. For example, an application of high minority population rate/low income/low loan amount is very likely to be denied. The problem is when the minority population rate is low where dots are mixed together. In other words, Loan Amount and Income lose its sharpness to cut the classes into groups when the racial factor becomes mild.
- Minority Ratio by Loan Result
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection='3d')
actions = reduced_df['action_taken_name'].unique()
ranges = [('r', actions[0]), ('b', actions[1])]
for c, action in ranges:
xs = reduced_df[reduced_df['action_taken_name']==action]['minority_population']
ys = reduced_df[reduced_df['action_taken_name']==action]['number_of_1_to_4_family_units']
zs = reduced_df[reduced_df['action_taken_name']==action]['number_of_owner_occupied_units']
ax.scatter(xs, ys, zs, c=c, s=5)
ax.set_xlabel('Minority Population (%)')
ax.set_ylabel('Number of 1~4 Family Units')
ax.set_zlabel('Number of Owner Occupied Units')
plt.show()
Unit related features have the same problem that it becomes unrecognizable as minority population rate is lower. This may or may not be solved by categorical features.
%matplotlib inline
def cate_pie(col):
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
actions = list(hmda_df['action_taken_name'].unique())
indice = np.arange(len(actions))
for action, i in zip(actions, indice):
category = hmda_df[hmda_df['action_taken_name']==action][col]
vc = category.value_counts(sort=False)
axes[i].pie(vc.values, labels=vc.keys(), autopct='%1.1f%%')
axes[i].set_title(action, fontsize=14)
axes[i].axis('equal')
- Applicant Race
cate_pie('applicant_race_name_1')
The pie chart reveals very clearly that white applicants have more chances to earn the approval.
- Applicant Sex
cate_pie("applicant_sex_name")
Sex is not as strong, but still there is discrepany between men and women.
- Co-applicant
cate_pie("co_applicant")
Chances are high that a sole applicant is a single, reversely, applicant with somebody is married. A married applicant would probably be financially more established than a single who is probably younger and early in his/her career, which could explain the two different charts.
- Agency
cate_pie("agency_abbr")
- Loan Purpose
cate_pie('loan_purpose_name')
It seems like mortgage loans are more open to home purchase purpose. Put it another way, loan officers are hard to find reasons to deny home purchase applications or encouraged to give them approval for their benefits. Clearly, home improvement is not always necessary for a living.
- Loan Type
cate_pie('loan_type_name')
This categorical feature doesn't look like it will help classification from the pies. However, boxplot might uncover some differences.
The purpose of boxplot for this project is to see if categorical data can work for classification by interacting with numerical data. Therefore, a categorical value measured differently by numerical data is likely helpful to distinguish classes. We want to go through some of the categorical data that showed certain patterns.
""" draw boxplot by category and loan results """
def category_boxplot(cate, title):
sns.set(style="whitegrid")
plt.figure(figsize=(8, 4))
# income
ax = sns.boxplot(x="income_sqrt", y="action_taken_name", hue=cate,
data=hmda_df)
ax.set_title("Income by " + title, fontsize=15)
ax.set_ylabel("")
ax.legend(bbox_to_anchor=(1.05, 1), loc=2)
plt.show()
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
sns.boxplot(x="action_taken_name", y="loan_amount_sqrt", hue="applicant_sex_name",
data=hmda_df, ax=axes[0])
axes[0].set_title("Loan Amount by Sex", fontsize=15)
axes[0].set_xlabel("")
axes[0].legend(bbox_to_anchor=(0.5, -0.4), loc=8)
sns.boxplot(x="action_taken_name", y="loan_amount_sqrt", hue="agency_abbr",
data=hmda_df, ax=axes[1])
axes[1].set_title("Loan Amount by Agency", fontsize=15)
axes[1].set_xlabel("")
axes[1].legend(bbox_to_anchor=(0.5, -0.4), loc=8)
sns.boxplot(x="action_taken_name", y="loan_amount_sqrt", hue="loan_type_name",
data=hmda_df, ax=axes[2])
axes[2].set_title("Loan Amount by Loan Type", fontsize=15)
axes[2].set_xlabel("")
axes[2].legend(bbox_to_anchor=(0.5, -0.35), loc=8)
None of them are impressively separating classes. But they are still plotting two classes more or less differently.
Now that we have found out some patterns, we can base them for clustering census tract numbers. To visualize about a hundred of them, we can draw yet another scatterplots using numerical features. By doing so, we can decide which clustering methods to use. Since often there is no clear standard when it comes to clustering, it is critical to plot data and visually judge what method is most suitable. But we can always go with trial and error.
# criteria 1 based on loan approval rate
census_minor = hmda_df.groupby(['census_tract_number'])['minority_population'].mean()
census_ownership = hmda_df.groupby(['census_tract_number'])['number_of_owner_occupied_units'].mean()
census_df_1 = pd.DataFrame({'minority': census_minor, 'ownership': census_ownership})
census_df_1.reset_index(inplace=True)
# criteria 2 based on income vs. loan amount
census_income = hmda_df.groupby(['census_tract_number'])['income_sqrt'].mean()
census_loan_amount = hmda_df.groupby(['census_tract_number'])['loan_amount_sqrt'].mean()
census_df_2 = pd.DataFrame({'income': census_income, 'loan_amount': census_loan_amount})
census_df_2.reset_index(inplace=True)
# merge criteria 1 and 2
census_df = pd.merge(census_df_1, census_df_2)
# combination of numerical data
fig, axes = plt.subplots(2, 2, figsize=(12, 11))
sns.scatterplot(x="minority", y="ownership", data=census_df, ax=axes[0,0])
axes[0,0].set_title("Minority vs. Ownership", fontsize=15)
sns.scatterplot(x="minority", y="loan_amount", data=census_df, ax=axes[0,1])
axes[0,1].set_title("Minority vs. Loan Amount", fontsize=15)
sns.scatterplot(x="income", y="loan_amount", data=census_df, ax=axes[1,0])
axes[1,0].set_title("Income vs. Loan Amount", fontsize=15)
sns.scatterplot(x="ownership", y="loan_amount", data=census_df, ax=axes[1,1])
axes[1,1].set_title("Ownership vs. Loan Amount", fontsize=15)
plt.tight_layout()
The utmost advantage of hierarchical clustering is that it gives a general idea of how many groups in the data. Especially, dendrogram is very useful to see it visually. First, we can use the approval rate plot (first from the previous plots) for this method as we focus on distance.
def hier_clustering_labels(X, n):
clustering = AgglomerativeClustering(
n_clusters=n, affinity='euclidean', linkage='ward')
clustering.fit_predict(X)
return clustering.labels_
X = [[x[0], x[1]] for x in zip(census_df['minority'], census_df['loan_amount'])]
X = np.array(X)
sns.set(style='white')
plt.figure(figsize=(12,6))
plt.title("Census Tract Dendrogram")
dend = shc.dendrogram(shc.linkage(X, method='ward'))
It is hard to which census tract number is related to another from the dendrogram, but we can still figure out the number to group them depending on where to cut off. The bset way is plotting them and iterate some numbers for clustering to decide the number of clusters.
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
sns.despine()
axes[0,0].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 2), cmap='rainbow')
axes[0,0].set_title("Cluster n=2", fontsize=15)
axes[0,1].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 3), cmap='rainbow')
axes[0,1].set_title("Cluster n=3", fontsize=15)
axes[0,2].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 4), cmap='rainbow')
axes[0,2].set_title("Cluster n=4", fontsize=15)
axes[1,0].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 5), cmap='rainbow')
axes[1,0].set_title("Cluster n=5", fontsize=15)
axes[1,1].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 6), cmap='rainbow')
axes[1,1].set_title("Cluster n=6", fontsize=15)
axes[1,2].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 7), cmap='rainbow')
axes[1,2].set_title("Cluster n=7", fontsize=15)
plt.tight_layout()
Unfortunately, the clusters are cut vertically, which means that only x-axis may be enough to group them. So, the clustering might be meaningless because an existing feature already can do the job. However, clustering with n=3 seems to make some sense. We can add this feature and always remove it later when modeling.
# create a cluster feature
clusters = dict(zip(census_df['census_tract_number'], hier_clustering_labels(X, 3)))
hmda_df['census_hier'] = hmda_df['census_tract_number'].apply(lambda x : clusters[x])
def kmeans_labels(X, n_clusters):
kmeans = KMeans(n_clusters=n_clusters).fit(X)
return kmeans.labels_
X = [[x[0], x[1]] for x in zip(census_df['income'], census_df['loan_amount'])]
X = np.array(X)
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
sns.despine()
axes[0,0].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 2), cmap='rainbow')
axes[0,0].set_title("Cluster n=2", fontsize=15)
axes[0,1].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 3), cmap='rainbow')
axes[0,1].set_title("Cluster n=3", fontsize=15)
axes[0,2].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 4), cmap='rainbow')
axes[0,2].set_title("Cluster n=4", fontsize=15)
axes[1,0].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 5), cmap='rainbow')
axes[1,0].set_title("Cluster n=5", fontsize=15)
axes[1,1].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 6), cmap='rainbow')
axes[1,1].set_title("Cluster n=6", fontsize=15)
axes[1,2].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 7), cmap='rainbow')
axes[1,2].set_title("Cluster n=7", fontsize=15)
# create a cluster feature
clusters = dict(zip(census_df['census_tract_number'], kmeans_labels(X, 5)))
hmda_df['census_kmeans'] = hmda_df['census_tract_number'].apply(lambda x : clusters[x])
This could be useful because the clusters are naturally separated by income level and loan amount. It could strengthen the pattern we have already found by increasing the sparsity.
DBSCAN takes a different course than the hierarchical clustering does as previously shown. This method is based on density, which means how closely the dots are located from each other. It needs scailing for better processing beforehand, but the rest are very similar steps as done for hierarchical clustering. That is, plotting and iteration are heavily used to find the best number of clusters. Specifically speaking, finding the best epilon and minimum number of samples is our goal.
def db_scan_labels(X, eps, min_samples):
db_scan = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
return db_scan.labels_
- Income vs. Loan Amount: Noise Removal
X = [[x[0], x[1]] for x in zip(census_df['minority'], census_df['ownership'])]
X = np.array(X)
X = ppr.StandardScaler().fit_transform(X)
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
sns.despine()
axes[0,0].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.15, 2), cmap='rainbow')
axes[0,0].set_title("eps=0.15, min_sample=2", fontsize=15)
axes[0,1].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.15, 3), cmap='rainbow')
axes[0,1].set_title("eps=0.15, min_sample=3", fontsize=15)
axes[0,2].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.2, 4), cmap='rainbow')
axes[0,2].set_title("eps=0.2, min_sample=4", fontsize=15)
axes[1,0].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.25, 4), cmap='rainbow')
axes[1,0].set_title("eps=0.25, min_sample=5", fontsize=15)
axes[1,1].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.3, 4), cmap='rainbow')
axes[1,1].set_title("eps=0.3, min_sample=4", fontsize=15)
axes[1,2].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.3, 5), cmap='rainbow')
axes[1,2].set_title("eps=0.3, min_sample=5", fontsize=15)
Aware that purple dots are noises, the result of epsilon equal to 0.15 and minimum samples 3 should probably be the best clustering as the others have a couple or more flaws. However, there are so many unneccesary small clusters which we can get rid of once clustering process is done.
# create a cluster feature
clusters = dict(zip(census_df['census_tract_number'], db_scan_labels(X, 0.15, 3)))
hmda_df['census_dbscan_1'] = hmda_df['census_tract_number'].apply(lambda x : clusters[x])
# 2 clusters and all noises
hmda_df['census_dbscan_1'] = hmda_df['census_dbscan_1'].apply(lambda x : -1
if x not in [0, 10] else x)
- Ownership vs. Loan Amount: Grouping
X = [[x[0], x[1]] for x in zip(census_df['ownership'], census_df['loan_amount'])]
X = np.array(X)
X = ppr.StandardScaler().fit_transform(X)
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
sns.despine()
axes[0,0].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.15, 5), cmap='rainbow')
axes[0,0].set_title("eps=0.15, min_sample=4", fontsize=15)
axes[0,1].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.2, 6), cmap='rainbow')
axes[0,1].set_title("eps=0.2, min_sample=3", fontsize=15)
axes[0,2].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.2, 7), cmap='rainbow')
axes[0,2].set_title("eps=0.2, min_sample=7", fontsize=15)
axes[1,0].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.25, 6), cmap='rainbow')
axes[1,0].set_title("eps=0.25, min_sample=6", fontsize=15)
axes[1,1].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.3, 6), cmap='rainbow')
axes[1,1].set_title("eps=0.3, min_sample=6", fontsize=15)
axes[1,2].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.35, 5), cmap='rainbow')
axes[1,2].set_title("eps=0.35, min_sample=5", fontsize=15)
It should be reasonable to have only one large cluster and tag all the rest as noise.
# create a cluster feature
clusters = dict(zip(census_df['census_tract_number'], db_scan_labels(X, 0.2, 7)))
hmda_df['census_dbscan_2'] = hmda_df['census_tract_number'].apply(lambda x : clusters[x])
# 1 cluster and all noises
hmda_df['census_dbscan_2'] = hmda_df['census_dbscan_2'].apply(lambda x : x if x == 0 else -1)
# drop census tract number
hmda_df.drop(['census_tract_number'], axis=1, inplace=True)
Decision Tree is simple, thus, it is often used as a baseline model. We will build a Decision Tree model with all features included, and then test other models like Logistics, Random Forest (bagging), Naive Bayes (independence), Gradient Boosting (boosting). Also, we will manipulate a feature set using Recursive Feature Elimination and Pinciple Component Analysis.
# reduce the most class
second_large_supp = len(hmda_df[hmda_df['action_taken_name']=="Denied"])
size_drop = len(hmda_df[hmda_df['action_taken_name']=="Approved"]) - second_large_supp
drop_indice = np.random.choice(list(hmda_df[hmda_df['action_taken_name']=="Approved"].index),
size_drop, replace=False)
hmda_df = hmda_df.loc[~hmda_df.index.isin(drop_indice)]
def split_X_y(df, features, y):
cate_cols = df.select_dtypes(include='category').columns
for col in cate_cols:
le = ppr.LabelEncoder()
le.fit(df[col])
df[col] = le.transform(df[col])
X = np.array(df[features])
y = np.array(df[y])
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33)
# normalize for faster calculation
X_train = ppr.normalize(X_train)
X_test = ppr.normalize(X_test)
return X_train, X_test, y_train, y_test
# all features
features = list(hmda_df.columns)
features.remove('action_taken_name')
# X, y
X_train, X_test, y_train, y_test = split_X_y(hmda_df, features, 'action_taken_name')
# train and show scores
def train_evaluate(clf, X_train, X_test, y_train, y_test):
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("5-fold Cross Validation Accuracy:", round(cross_val_score(clf, X_train, y_train, cv=5).mean(), 2))
print("\nConfusion Matrix\n", confusion_matrix(y_test, pred))
print("\n", classification_report(y_test, pred))
return clf
clf_dt = train_evaluate(tree.DecisionTreeClassifier(), X_train, X_test, y_train, y_test)
The result is not so impressive, however, the errors are uniformly distributed at least.
clf_lr = train_evaluate(LogisticRegression(), X_train, X_test, y_train, y_test)
Unlike the Decision Tree result, the errors are concentrated on Approval cases.
clf_rf = train_evaluate(RandomForestClassifier(max_depth=20), X_train, X_test, y_train, y_test)
The result is better than the Decision Tree, but it is still so good.
clf_nb = train_evaluate(GaussianNB(), X_train, X_test, y_train, y_test)
Like the Logistic Regression, it is biased to predict Denied only.
Gradient Boosting is a very sophisticated model to configurate parameters for the best performance. The following will seek the optimal parameters based on AUC scores.
def roc_auc(clf, X_train, X_test, y_train, y_test):
clf.fit(X_train, y_train)
train_pred = clf.predict(X_train)
fpr, tpr, thrd = roc_curve(y_train, train_pred)
roc_auc_train = auc(fpr, tpr)
test_pred = clf.predict(X_test)
fpr, tpr, thrd = roc_curve(y_test, test_pred)
roc_auc_test = auc(fpr, tpr)
return roc_auc_train, roc_auc_test
# learning rate
learning_rates = [1, 0.5, 0.25, 0.1, 0.05, 0.01]
lr_train = []
lr_test = []
for lr in learning_rates:
clf_gb = GradientBoostingClassifier(learning_rate=lr)
roc_auc_train, roc_auc_test = roc_auc(
clf_gb, X_train, X_test, y_train, y_test)
lr_train.append(roc_auc_train)
lr_test.append(roc_auc_test)
# number of estimators
n_estimators = [1, 2, 4, 8, 16, 32, 64, 100, 150, 200]
est_train = []
est_test = []
for n in n_estimators:
clf_gb = GradientBoostingClassifier(n_estimators=n)
roc_auc_train, roc_auc_test = roc_auc(
clf_gb, X_train, X_test, y_train, y_test)
est_train.append(roc_auc_train)
est_test.append(roc_auc_test)
# max depth
max_depths = np.arange(1, 16)
md_train = []
md_test = []
for d in max_depths:
clf_gb = GradientBoostingClassifier(max_depth=d)
roc_auc_train, roc_auc_test = roc_auc(
clf_gb, X_train, X_test, y_train, y_test)
md_train.append(roc_auc_train)
md_test.append(roc_auc_test)
# max_features
max_features = list(range(1, len(features)))
mf_train = []
mf_test = []
for f in max_features:
clf_gb = GradientBoostingClassifier(max_features=f)
roc_auc_train, roc_auc_test = roc_auc(
clf_gb, X_train, X_test, y_train, y_test)
mf_train.append(roc_auc_train)
mf_test.append(roc_auc_test)
fig, axes = plt.subplots(2, 2, figsize=(9, 9))
axes[0,0].plot(learning_rates, lr_train, 'b')
axes[0,0].plot(learning_rates, lr_test, 'r')
axes[0,0].set_title("Learning Rate")
axes[0,1].plot(n_estimators, est_train, 'b')
axes[0,1].plot(n_estimators, est_test, 'r')
axes[0,1].set_title("N Estimators")
axes[1,0].plot(max_depths, md_train, 'b')
axes[1,0].plot(max_depths, md_test, 'r')
axes[1,0].set_title("Max Depth")
axes[1,1].plot(max_features, mf_train, 'b')
axes[1,1].plot(max_features, mf_test, 'r')
axes[1,1].set_title("Max Features")
clf_gb = train_evaluate(GradientBoostingClassifier(
learning_rate=0.2,
n_estimators=150,
max_depth=7,
max_features=5
), X_train, X_test, y_train, y_test)
Similar with Gradient Boosting, Neural Network need a certain set of parameters for tuning. This can be done using the Grid Search method.
params = {
'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
'activation': ['tanh', 'relu'],
'solver': ['sgd', 'adam'],
'alpha': [0.0001, 0.001, 0.01],
'learning_rate': ['constant','adaptive'],
}
clf = GridSearchCV(MLPClassifier(), params, n_jobs=-1, cv=3)
clf.fit(X_train, y_train)
print('Best parameters found:', clf.best_params_)
clf_mlp = train_evaluate(MLPClassifier(
alpha=0.001,
activation='relu',
solver='adam',
hidden_layer_sizes=(50,100,50),
learning_rate='constant'),
X_train, X_test, y_train, y_test)
model = LogisticRegression()
rfe = RFE(model, 9).fit(X_train, y_train)
feature_rank = []
features_rfe = []
for feature, rank in zip(features, rfe.ranking_):
feature_rank.append((feature, rank))
if rank == 1:
features_rfe.append(feature)
X_train_rfe, X_test_rfe, y_train_rfe, y_test_rfe = split_X_y(
hmda_df, features_rfe, 'action_taken_name')
sorted(feature_rank, key=lambda tup : tup[1])
clf_lr_rfe = train_evaluate(LogisticRegression(), X_train_rfe, X_test_rfe, y_train_rfe, y_test_rfe)
Selecting 9 of half the features, the f1-score of Logistic Regression is improved from 0.59. This indicates that the features had too much interaction to interfere with our target classes. Now, we can take this into consideration to select features in a heuristic way. The rule is to include features that care less correlated and use the Naive Bayes model which assumes independence among features.
features_comb = ['agency_abbr', 'applicant_sex_name', 'minority_population', 'income_to_loan',
'loan_purpose_name', 'loan_type_name', 'co_applicant',
'census_kmeans', 'census_dbscan_1', 'census_hier',
'number_of_1_to_4_family_units', 'number_of_owner_occupied_units']
X_train_comb, X_test_comb, y_train_comb, y_test_comb = split_X_y(
hmda_df, features_comb, 'action_taken_name')
clf_nb_comb = train_evaluate(GaussianNB(), X_train_comb, X_test_comb, y_train_comb, y_test_comb)
PCA takes a totally differnt look at features in that RFE and other feature selection methods simply include/exclude features. For that reason, we can hardly say which feature makes how much contributions.
reduced_df = df_reducer(hmda_df, 10000)
X_train_pca, X_test_pca, y_train_pca, y_test_pca = split_X_y(
reduced_df, features_rfe, 'action_taken_name')
pca = PCA(n_components=5)
pca.fit(X_train_pca)
X_train_pca = pca.transform(X_train_pca)
X_test_pca = pca.transform(X_test_pca)
clf_pca = SVC(C=5, gamma=10)
clf_pca.fit(X_train_pca, y_train_pca)
print(pca.explained_variance_ratio_)
train_evaluate(clf_pca, X_train_pca, X_test_pca, y_train_pca, y_test_pca)
So far, we have found out that Gradient Boosting works the best followed by Random Forest. What we can do with these is mixing them together to make them vote. Also, we can add Naive Bayes because the model is fundamentally different. Voting Ensemble is useful when a certain level of diversity is secured.
estimators = []
estimators.append(('nb', clf_nb))
estimators.append(('rf', clf_rf))
estimators.append(('gb', clf_gb))
ensemble = VotingClassifier(estimators)
clf_ens = train_evaluate(ensemble, X_train, X_test, y_train, y_test)
Through the data analysis, we learnt the following patterns.
However, these patterns are not so strong that it was difficult to clearly cut the groups. There were such interactions among features that classifying observations in one way would be no longer the same if in another way. Figuratively speaking, observations were tangled even in a finely separated group. For example, two cases having almost or exact the same condition would be classified differently; Approved and Denied, repectively.
To solve this, we used sophisticated models like Gradient Boosting and MLP. Also, we tried a different feature set as well as PCA. However, the entanglement was at a serious level making the best model performance around 0.7 accuracy and f1-score.