HMDA Data Analysis and Prediction Model

By Kyoosik Kim

Home Mortgage Disclosure Act (HMDA) is a federal law in the US that requires financial institutions to provide mortgage data to the public. The data contains basic information regarding mortgage loans such as applicant's income and race, loan amount, regions, and many others. What we focus is the relationship between these factors and the result of evaluation. There are 6 different available results, among which approval and denial are the major counts. Through the analysis, we will find patterns in numerical and categorical features, and eventually apply them to building a predictive model.

  1. Setup and Query Dataset
  2. Preparation
  3. Data Cleaning: Category Integration, Type Conversion
  4. Data Analysis: Scatterplot, Piechart, Boxplot
  5. Clustering: Hierarchical, K-means, Density-based
  6. Modeling: Decision Tree, Naive Bayes, Random Forest, Gradient Boosting, RFE, PCA, Voting Ensemble
  7. Conclusion

1. Setup and Query Dataset

The data is pulled directly from the CFPB API. Although downloading in CSV is also available on Custom Dataset page at the same website, API has a benefit of flexibility like querying data. The data set is HMDA data of Detroit Metropolitan Area where social issues often arise such as racial discrimination or high crime rate. The data format is JSON to be able to fit MongoDB so that we can query the data fast. The purpose of querying is to briefly learn about the features as a preliminary process before detailed analysis. The following are libraries to be used for the entire project.

In [1]:
# file control, api, mongodb
import json
import os
import urllib3
from pymongo import MongoClient

# dataframe
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D

# clustering
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans, DBSCAN
import scipy.cluster.hierarchy as shc

# model preparation
from sklearn import preprocessing as ppr
from sklearn.utils import resample
from sklearn.model_selection import train_test_split

# modeling
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import RFE

# model evaluation
from sklearn.model_selection import cross_val_score
from sklearn.metrics import *

# others
import scipy.stats as stats
from pprint import pprint
import warnings
warnings.filterwarnings('ignore')
In [2]:
""" 
year, state, county can be given as condition
pull all the features in JSON
save the result as a file for later use
"""
def pull_dataset(file_name, year, msamd, code):
    http = urllib3.PoolManager()
    # request if file doesn't exist
    if not os.path.isfile(file_name):
        cfpb_url = ('https://api.consumerfinance.gov/data/hmda/slice/hmda_lar.json?&'
            '$where=as_of_year+IN+('+year+')+AND+action_taken+IN+(1,3)+AND+(msamd=%'+msamd+'%'+code+')&'
            '$select=tract_to_msamd_income,'
            'population,'
            'minority_population,'
            'number_of_owner_occupied_units,'
            'number_of_1_to_4_family_units,'
            'loan_amount_000s,'
            'applicant_income_000s,'
            'loan_type_name,'
            'loan_purpose_name,'
            'co_applicant_race_name_1,'
            'census_tract_number,'
            'applicant_sex_name,'
            'applicant_race_name_1,'
            'agency_abbr,'
            'action_taken_name&$limit=0')

        # receive data and store in JSON
        response = http.request('GET', cfpb_url)
        json_data = json.loads(response.data.decode('utf-8'))
        results = json_data['results']
        # write file
        with open(file_name, 'w', encoding='utf-8') as outfile:
            json_string = json.dumps(results)
            outfile.write(json_string)
        outfile.close()

    # read file if file already exists
    with open(file_name, 'rb') as infile:
        data = json.loads(infile.read())
    infile.close()

    return data
In [3]:
# recent 3 years to local DB
detroit = pull_dataset("detroit.json", "2017,2016,2015", "2219804", "22")

# do not write if db already exists
client = MongoClient('localhost:27017')
if "hmda" not in client.list_database_names():
    db = client.hmda
    db['detroit'].insert_many(detroit)
In [4]:
db = client.hmda
query_count = db.detroit.count()
query_find_one = db.detroit.find_one()

print("Number of Cases:", query_count)
print("\n<Case Example>")
pprint(query_find_one)
Number of Cases: 119794

<Case Example>
{'_id': ObjectId('5c38f906d9973504dc7db2e4'),
 'action_taken_name': 'Loan originated',
 'agency_abbr': 'HUD',
 'applicant_income_000s': 43,
 'applicant_race_name_1': 'White',
 'applicant_sex_name': 'Female',
 'census_tract_number': '5833.00',
 'co_applicant_race_name_1': 'No co-applicant',
 'loan_amount_000s': 66,
 'loan_purpose_name': 'Refinancing',
 'loan_type_name': 'FHA-insured',
 'minority_population': 11.329999923706055,
 'number_of_1_to_4_family_units': 1031,
 'number_of_owner_occupied_units': 862,
 'population': 2550,
 'tract_to_msamd_income': 96.79000091552734}
In [5]:
# screen meaningless features
def check_disticnt(keys):
    one_value_keys = []
    for key in keys:
        query_distinct = db.detroit.distinct(key)
        if len(query_distinct) == 1:
            one_value_keys.append(key)

    return one_value_keys

keys = query_find_one
print("Number of Keys:", len(keys))
print("\nKeys to be dropped for having the same value:")
pprint(check_disticnt(keys))
Number of Keys: 16

Keys to be dropped for having the same value:
[]
In [6]:
def convert_to_list(query_result, roundup_keys):
    doc_list = []
    for doc in query_result:
        for key in roundup_keys:
            try:
                doc[key] = round(doc[key], 2)
            except:
                doc[key] = "NaN"
        doc_list.append(doc)

    return doc_list
In [7]:
query_action_taken = db.detroit.aggregate([
    {"$group":
     {
        "_id": "$action_taken_name",
        "count": {"$sum": 1},
        "amount": {"$avg": "$loan_amount_000s"},
        "income": {"$avg": "$applicant_income_000s"}
     }
    },
    {"$sort": {"count": -1}}
])

pprint(convert_to_list(query_action_taken, ['amount', 'income']))
[{'_id': 'Loan originated', 'amount': 159.8, 'count': 88522, 'income': 90.57},
 {'_id': 'Application denied by financial institution',
  'amount': 109.23,
  'count': 31272,
  'income': 70.96}]

As the target is loan evaluation results; approved vs. denied, the query result above help us understand how the classes are distributed. In addition, it is also telling that loan amount and income are different across the loan results. Therefore, these features are suspected to play a major role for analysis.

In [8]:
# query loan accepted
field_val = "Loan originated"
total_count = db.detroit.find({"action_taken_name": field_val}).count()

query_loan_originated = db.detroit.aggregate([
    {"$match": {"action_taken_name": field_val}},
    {"$group": {"_id": "$applicant_race_name_1",
                "count": {"$sum": 1},
                "amount": {"$avg": "$loan_amount_000s"},
                "income": {"$avg": "$applicant_income_000s"}}
    },
    {"$sort": {"count": -1}},
    {"$addFields": {"case_share": {"$divide": ["$count", total_count]}}}
])

acceptance = convert_to_list(query_loan_originated, ['amount', 'income', 'case_share'])
In [9]:
# query loan denied
field_val = "Application denied by financial institution"
total_count = db.detroit.find({"action_taken_name": field_val}).count()

query_loan_originated = db.detroit.aggregate([
    {"$match": {"action_taken_name": field_val}},
    {"$group": {"_id": "$applicant_race_name_1",
                "count": {"$sum": 1},
                "amount": {"$avg": "$loan_amount_000s"},
                "income": {"$avg": "$applicant_income_000s"}}
    },
    {"$sort": {"count": -1}},
    {"$addFields": {"case_share": {"$divide": ["$count", total_count]}}}
])

denial = convert_to_list(query_loan_originated, ['amount', 'income', 'case_share'])
In [10]:
pd.concat({'Accepted': pd.DataFrame(acceptance), 'Denied': pd.DataFrame(denial)})
Out[10]:
_id amount case_share count income
Accepted 0 White 153.18 0.72 64065 92.32
1 Information not provided by applicant in mail,... 142.64 0.12 10753 89.92
2 Black or African American 115.17 0.10 8973 69.56
3 Asian 230.07 0.04 3497 112.08
4 Not applicable 1179.68 0.01 758 119.50
5 American Indian or Alaska Native 122.13 0.00 327 78.69
6 Native Hawaiian or Other Pacific Islander 178.23 0.00 149 91.58
Denied 0 White 125.41 0.51 15893 76.92
1 Black or African American 73.23 0.24 7607 57.51
2 Information not provided by applicant in mail,... 91.27 0.21 6417 70.92
3 Asian 168.86 0.03 877 84.43
4 American Indian or Alaska Native 90.10 0.01 296 56.45
5 Not applicable 925.04 0.00 105 43.75
6 Native Hawaiian or Other Pacific Islander 102.83 0.00 77 78.18

Another query result of racial factors seem to be useful. It is quite clear that there are differences among different races, thus, race-related features should be significant for analysis. Along with these, some other features like loan type, agency, loan purpose will be put together with aforementioned features to find patterns.


2. Preparation

The next step becomes turning JSON data on MongoDB into a CSV file so we can make use of analytical tools on Python. Once this step is done, we should handle missing values first before moving forward.

In [11]:
""" Read from Mongo and Store into DataFrame """
def read_mongo(collection):
    # Connect to MongoDB
    client = MongoClient('localhost:27017')
    db = client.hmda
    # Make a query to the specific DB and Collection
    cursor = db[collection].find({})
    # Expand the cursor and construct the DataFrame
    df =  pd.DataFrame(list(cursor))
    df.drop(['_id'], axis=1, inplace=True)

    return df
In [12]:
hmda_df = read_mongo("detroit")
hmda_df.shape
Out[12]:
(119794, 15)
In [13]:
hmda_df.head()
Out[13]:
action_taken_name agency_abbr applicant_income_000s applicant_race_name_1 applicant_sex_name census_tract_number co_applicant_race_name_1 loan_amount_000s loan_purpose_name loan_type_name minority_population number_of_1_to_4_family_units number_of_owner_occupied_units population tract_to_msamd_income
0 Loan originated HUD 43.0 White Female 5833.00 No co-applicant 66 Refinancing FHA-insured 11.330000 1031.0 862.0 2550 96.790001
1 Loan originated FDIC 32.0 White Female 5837.00 No co-applicant 100 Home purchase Conventional 14.890000 1417.0 1216.0 3721 99.059998
2 Loan originated HUD 147.0 White Male 5970.00 No co-applicant 250 Refinancing Conventional 7.350000 1995.0 1540.0 4656 132.679993
3 Loan originated CFPB 57.0 White Male 5761.00 White 71 Home purchase VA-guaranteed 16.049999 2463.0 2141.0 6082 111.169998
4 Loan originated NCUA 104.0 White Female 5645.01 White 219 Refinancing Conventional 26.900000 1186.0 1116.0 4246 244.990005
In [14]:
hmda_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119794 entries, 0 to 119793
Data columns (total 15 columns):
action_taken_name                 119794 non-null object
agency_abbr                       119794 non-null object
applicant_income_000s             113106 non-null float64
applicant_race_name_1             119794 non-null object
applicant_sex_name                119794 non-null object
census_tract_number               119794 non-null object
co_applicant_race_name_1          119794 non-null object
loan_amount_000s                  119794 non-null int64
loan_purpose_name                 119794 non-null object
loan_type_name                    119794 non-null object
minority_population               119787 non-null float64
number_of_1_to_4_family_units     119787 non-null float64
number_of_owner_occupied_units    119784 non-null float64
population                        119794 non-null int64
tract_to_msamd_income             119787 non-null float64
dtypes: float64(5), int64(2), object(8)
memory usage: 13.7+ MB

Fortunately, most of the features are complete or missing a very few. For those features with missing values, the following can be considered for handling.

  1. Drop the cases
  2. Fill with statistics such as mean/median (numeric), or mode (categorical)
  3. Estimate from other features

While there could be other methods, the above are largely suitable for the situation. To decide which, we need to see if the missing values are systematic or random. The latter is more flexible for both imputation methods while the former case needs further investigation. Since there does not seem to have a pattern in missing values, we can simply drop them when filling with statistics could make it more complicated.

In [15]:
hmda_df.dropna(inplace=True)
hmda_df.shape
Out[15]:
(113101, 15)

3. Data Cleaning: Categorical Data

The last step before analysis is cleaning the data. Since we have many categorical features, it is necessary to look at them to see if any of them is problematic. This is a safer way because some analysis would later need to use categorical features together with numerical features like boxplot as an instance. It would be a good idea to organize categories first before doing such a thing.

In [16]:
# handle a trivial error
def fix_error_message(df, cols):
    error_message = "Information not provided by applicant in mail, Internet, or telephone application"
    for col in cols:
        df[col] = df[col].apply(lambda x :
                    "Not available" if x==error_message else x)
    return df
In [17]:
# fix error values
cols = list(hmda_df.select_dtypes('object').columns)
hmda_df = fix_error_message(hmda_df, cols)

# long name to shorter version
hmda_df = hmda_df.apply(pd.Series.replace, to_replace="Native Hawaiian or Other Pacific Islander",
                        value="Islander")
hmda_df = hmda_df.apply(pd.Series.replace, to_replace="American Indian or Alaska Native",
                        value="Native American")
hmda_df = hmda_df.apply(pd.Series.replace, to_replace="Black or African American",
                        value="Black")
In [18]:
# horizontal barplot
def value_counts_bar(col, title):
    vc = col.value_counts()
    plt.figure(figsize=(6, len(vc.keys())/2))
    plt.barh(vc.keys(), vc.values)
    plt.title(title, fontsize=15)
    plt.xticks(rotation='vertical', fontsize=10)
    plt.show()


• Loan Result

In [19]:
value_counts_bar(hmda_df.action_taken_name, "Action Taken")
In [20]:
# change class value names
hmda_df['action_taken_name'] = hmda_df['action_taken_name'].apply(lambda x : "Approved"
                                    if x == "Loan originated" else "Denied")


• Agency

In [21]:
value_counts_bar(hmda_df.agency_abbr, "Agency")
In [22]:
hmda_df['agency_abbr'] = hmda_df['agency_abbr'].apply(lambda x : "Others"
                                if x not in ['CFPB', 'HUD', 'NCUA'] else x)

While the top 3 values (HUD/CFPB/NCUA) account for the most, the rest are only a fraction. Since having too many values for such a small count could lead to overfitting, they can be merged together as Others.


• Applicant Race

In [23]:
value_counts_bar(hmda_df.applicant_race_name_1, "Applicant Race")
In [24]:
# White and Black are major two
hmda_df['applicant_race_name_1'] = hmda_df['applicant_race_name_1'].apply(lambda x : 'Others'
                                        if x not in ['White', 'Black'] else x)


• Applicant Sex

In [25]:
value_counts_bar(hmda_df.applicant_sex_name, "Applicant Sex")
In [26]:
hmda_df['applicant_sex_name'] = hmda_df['applicant_sex_name'].apply(lambda x : "Others"
                                        if x not in ['Male', 'Female'] else x)


• Co-applicant Race

In [27]:
value_counts_bar(hmda_df.co_applicant_race_name_1, "Co-applicant Race")
In [28]:
# with or without co-applicant
hmda_df['co_applicant_race_name_1'] = hmda_df['co_applicant_race_name_1'].apply(lambda x : "With co-applicant"
                                            if x != 'No co-applicant' else x)
hmda_df.rename(columns={'co_applicant_race_name_1': 'co_applicant'}, inplace=True)


• Loan Purpose

In [29]:
value_counts_bar(hmda_df.loan_purpose_name, "Loan Purpose")


• Loan Type

In [30]:
value_counts_bar(hmda_df.loan_type_name, "Loan Type")
In [31]:
hmda_df['loan_type_name'] = hmda_df['loan_type_name'].apply(lambda x : "Non-conventional"
                                    if x != "Conventional" else x)


• Census Tract Number

In [32]:
hmda_df = hmda_df[hmda_df['census_tract_number'] != ""]
print("Number of Census Tract:", len(hmda_df['census_tract_number'].unique()))
Number of Census Tract: 598

Census tract numbers are not numerical data as their zipcode-like numbers do not mean anything. Detroit Census Information can help to understand the basic statistics with a visual support. Census tract numbers are basically deemed as categorical data, but there are over a hundred different values. In fact, we might still use it as categorical data but this could cause overfitting issues, missing general trends. The best way to deal with this is clustering. This technique, though, requires us to know about the dataset first as we have yet to know what based we should cluster them on.


• Type Conversion and Summary

In [33]:
# object to category
object_cols = list(hmda_df.select_dtypes('object').columns)
object_cols.remove('census_tract_number')
hmda_df[object_cols] = hmda_df[object_cols].astype('category')
In [34]:
hmda_df.describe(include=['category']).drop('count').T
Out[34]:
unique top freq
action_taken_name 2 Approved 83400
agency_abbr 4 HUD 58077
applicant_race_name_1 3 White 76333
applicant_sex_name 3 Male 67355
co_applicant 2 No co-applicant 77015
loan_purpose_name 3 Home purchase 52227
loan_type_name 2 Conventional 84550

3. Data Analysis

As plotting data always comes in handy to feel the data, scatterplot and boxplot will be used as data visualization. Before them, we need to know the distribution for numerical data. If they are overly skewed or they have strange patterns, they should be handled first.

In [35]:
hmda_df.describe().drop('count').T
Out[35]:
mean std min 25% 50% 75% max
applicant_income_000s 85.417123 98.959746 1.00 43.000000 66.000000 102.000000 8998.000000
loan_amount_000s 139.464921 112.363212 1.00 69.000000 118.000000 184.000000 3934.000000
minority_population 27.860051 26.912933 1.74 9.760000 17.049999 33.970001 100.000000
number_of_1_to_4_family_units 1413.824246 475.310410 30.00 1088.000000 1388.000000 1686.000000 2638.000000
number_of_owner_occupied_units 1152.064173 464.993325 6.00 823.000000 1115.000000 1450.000000 2551.000000
population 3957.019911 1383.486978 436.00 2929.000000 3794.000000 4781.000000 7968.000000
tract_to_msamd_income 143.221244 58.357710 0.00 106.239998 132.679993 176.369995 351.420013

The statistics are already showing some problems with dollar-related features that the max is too much compared with 75 quantile. This means the observations are highly skewed. The details can be revealed by plotting them.


• Histogram

In [36]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

axes[0].hist(hmda_df.applicant_income_000s, bins=1000)
axes[0].axvline(hmda_df.applicant_income_000s.mean(), color='red', linestyle='--')
axes[0].set_xlim([0, 300])
axes[0].set_title("Income", fontsize=15)
axes[0].set_xlabel("Thousand $")

axes[1].hist(hmda_df.loan_amount_000s, bins=300)
axes[1].axvline(hmda_df.loan_amount_000s.mean(), color='red', linestyle='--')
axes[1].set_xlim([0, 500])
axes[1].set_title("Loan Amount", fontsize=15)
axes[1].set_xlabel("Thousand $")

axes[2].hist(hmda_df.tract_to_msamd_income, bins=40)
axes[2].axvline(hmda_df.tract_to_msamd_income.mean(), color='red', linestyle='--')
axes[2].set_xlim([0, 300])
axes[2].set_title("Income by MSA Tract", fontsize=15)
axes[2].set_xlabel("Thousand $")

plt.tight_layout()
plt.show()

As expected from statistics, Income and Loan Amount are positively skewed, having a lot of small dollar amounts. Before considering transformation, we need to note that there are 0 amount cases which doesn't make sense. Especially, there are so many cases of loan amount equal to zero (possibly just less than a $1,000). We could suspect that these are administrative issues such as applications filed for fixing errors or any kind of paper jobs for any reason. Therefore, we would consider excluding these zero amount cases.

In [37]:
hmda_df = hmda_df[(hmda_df['applicant_income_000s'] > 0) & (hmda_df['applicant_income_000s'] < 250)]
hmda_df = hmda_df[(hmda_df['loan_amount_000s'] > 0) & (hmda_df['loan_amount_000s'] < 400)]
hmda_df = hmda_df[(hmda_df['tract_to_msamd_income'] > 0) & (hmda_df['tract_to_msamd_income'] < 280)]

hmda_df.shape
Out[37]:
(105712, 15)
In [38]:
print("Average Income: $", round(hmda_df['applicant_income_000s'].mean(), 2), "K")
print("Average Loan Amount: $", round(hmda_df['loan_amount_000s'].mean(), 2), "K")
Average Income: $ 73.27 K
Average Loan Amount: $ 123.88 K

Now, we need to decide whether to transform the distributions of the numerical variables. Our goal is classification, which often times requires sparsity among cases throughout features. Although we need more investigation, the skewed distributions suggest the close distance that will make it difficult to classify. Transforming the numerical features will help to spread out the observations across the range.

In [39]:
hmda_df['income_sqrt'] = np.sqrt(hmda_df['applicant_income_000s'])
hmda_df['loan_amount_sqrt'] = np.sqrt(hmda_df['loan_amount_000s'])
In [40]:
fig, axes = plt.subplots(1, 2, figsize=(10,4))

for action in hmda_df['action_taken_name'].unique():
    sns.distplot(hmda_df[hmda_df['action_taken_name']==action].income_sqrt,
                 hist=False, kde=True, ax=axes[0])
    sns.distplot(hmda_df[hmda_df['action_taken_name']==action].loan_amount_sqrt,
                            hist=False, kde=True, label=action, ax=axes[1])

axes[1].legend(bbox_to_anchor=(1,0.6))
plt.show()
In [41]:
dist_0 = hmda_df[hmda_df['action_taken_name']=='Approved']['income_sqrt']
dist_1 = hmda_df[hmda_df['action_taken_name']=='Denied']['income_sqrt']

stat, p_value = stats.ttest_ind(dist_0, dist_1, equal_var=False)
print("Stat/P-value: (", round(stat, 2), ',', p_value, ")")
Stat/P-value: ( 61.47 , 0.0 )

Even if the t-test is telling the two distributions are significantly different. That is, an application of small income and small loan amount is more likely to be denied than that of large income and large amount. there is a large overlap shared by two classes. Our goal then becomes utilizing other features to separate the observations even more clearly. The following distributions are from the rest of numerical features.

In [42]:
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

axes[0].hist(hmda_df.population, bins=40)
axes[0].axvline(hmda_df.population.mean(), color='red', linestyle='--')
axes[0].set_xlim([0, 8000])
axes[0].set_title("Population per Census Tract", fontsize=12)
axes[0].set_xlabel("Number of Residents")

axes[1].hist(hmda_df.minority_population, bins=40)
axes[1].axvline(hmda_df.minority_population.mean(), color='red', linestyle='--')
axes[1].set_xlim([0, 100])
axes[1].set_title("Minority per Census Tract", fontsize=12)
axes[1].set_xlabel("Ratio (%)")

axes[2].hist(hmda_df.number_of_1_to_4_family_units, bins=40)
axes[2].axvline(hmda_df.number_of_1_to_4_family_units.mean(), color='red', linestyle='--')
axes[2].set_xlim([0, 3200])
axes[2].set_title("1~4 Family Units", fontsize=12)
axes[2].set_xlabel("Units")

axes[3].hist(hmda_df.number_of_owner_occupied_units, bins=40)
axes[3].axvline(hmda_df.number_of_owner_occupied_units.mean(), color='red', linestyle='--')
axes[3].set_xlim([0, 2500])
axes[3].set_title("Number of Owner Occupied Units", fontsize=12)
axes[3].set_xlabel("Units")

plt.tight_layout()
plt.show()

They are roughly bell-shaped except for Minority Population (%). The Minority Population feature indicates that the neighborhoods are severely segregated. For the rest, they don't seem to need transformation.

In [43]:
fig, axes = plt.subplots(1, 3, figsize=(15,4.5))

for action in hmda_df['action_taken_name'].unique():
    sns.distplot(hmda_df[hmda_df['action_taken_name']==action].population,
                 hist=False, kde=True, ax=axes[0])
    sns.distplot(hmda_df[hmda_df['action_taken_name']==action].minority_population,
                            hist=False, kde=True, label=str(action), ax=axes[1])
    sns.distplot(hmda_df[hmda_df['action_taken_name']==action].number_of_1_to_4_family_units,
                            hist=False, kde=True, ax=axes[2])

axes[1].legend(bbox_to_anchor=(0.7,-0.2))
plt.show()

From another view, an application from high minority population rate is more likely to be denied. Again, the rest don't quite seem to have differences by class. However, histogram can't tell everything about the differences between classes until many other features possibly reveal underlying properties.

In [44]:
# add more ratio features
hmda_df['tract_income_to_loan'] = hmda_df['tract_to_msamd_income'] / hmda_df['applicant_income_000s']
hmda_df['income_to_loan'] = hmda_df['loan_amount_000s'] / hmda_df['applicant_income_000s']
In [45]:
corr = hmda_df.corr()
corr
Out[45]:
applicant_income_000s loan_amount_000s minority_population number_of_1_to_4_family_units number_of_owner_occupied_units population tract_to_msamd_income income_sqrt loan_amount_sqrt tract_income_to_loan income_to_loan
applicant_income_000s 1.000000 0.517786 -0.154108 0.016629 0.135023 0.109391 0.412028 0.983278 0.476673 -0.241609 -0.124608
loan_amount_000s 0.517786 1.000000 -0.271881 0.032788 0.209857 0.178856 0.565910 0.530721 0.968685 -0.055753 0.131545
minority_population -0.154108 -0.271881 1.000000 -0.152041 -0.404635 -0.189936 -0.463538 -0.176079 -0.329330 -0.024006 -0.043099
number_of_1_to_4_family_units 0.016629 0.032788 -0.152041 1.000000 0.882745 0.854852 0.143533 0.016996 0.028809 0.018445 0.002901
number_of_owner_occupied_units 0.135023 0.209857 -0.404635 0.882745 1.000000 0.820659 0.432146 0.147526 0.220513 0.031800 0.024365
population 0.109391 0.178856 -0.189936 0.854852 0.820659 1.000000 0.219182 0.115367 0.175825 0.003819 0.020837
tract_to_msamd_income 0.412028 0.565910 -0.463538 0.143533 0.432146 0.219182 1.000000 0.426615 0.546941 0.049301 0.049926
income_sqrt 0.983278 0.530721 -0.176079 0.016996 0.147526 0.115367 0.426615 1.000000 0.498302 -0.307051 -0.169688
loan_amount_sqrt 0.476673 0.968685 -0.329330 0.028809 0.220513 0.175825 0.546941 0.498302 1.000000 -0.052419 0.145504
tract_income_to_loan -0.241609 -0.055753 -0.024006 0.018445 0.031800 0.003819 0.049301 -0.307051 -0.052419 1.000000 0.868928
income_to_loan -0.124608 0.131545 -0.043099 0.002901 0.024365 0.020837 0.049926 -0.169688 0.145504 0.868928 1.000000

It is natural that sqrt features are highly correlated with their original features. Other than them, we see high correlations between dollar amount features. Another thing to note is that census-based feature are also meaningfully related. This can be easily seen on heatmap.

In [46]:
plt.figure(figsize=(8,8))
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask, square=True, cmap="PiYG", vmin=-0.7, vmax=0.7)
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f03e5e0828>

Population is related with Number of 1~4 Family Units as well as Number of Owner Occupied Units, and this is because the number of units are absolute counts. These numbers are supposed to increase as population is bigger. What we can do is turning these into relative numbers by dividing the number of units with population. Also, we need to drop the original features as we transformed them. This is important because we want features to be independent each other for better classification.

In [47]:
# transform to ratio
hmda_df['number_of_1_to_4_family_units'] = hmda_df['number_of_1_to_4_family_units'] / hmda_df['population']
hmda_df['number_of_owner_occupied_units'] = hmda_df['number_of_owner_occupied_units'] / hmda_df['population']

# drop old features
hmda_df.drop(columns=['applicant_income_000s', 'loan_amount_000s', 'tract_to_msamd_income'], inplace=True)

# updated heatmap
corr = hmda_df.corr()
plt.figure(figsize=(8,8))
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask, square=True, cmap="PiYG", vmin=-0.7, vmax=0.7)
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f03e840fd0>

• Scatterplot

In [48]:
# for better visualization or faster calculation
def df_reducer(df, size):
    indice = np.random.choice(list(df.index), size, replace=False)
    return df.loc[indice]

- Income vs. Loan Amount

In [49]:
reduced_df = df_reducer(hmda_df, 5000)

# income vs. loan amount by sex and race
fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))

sns.scatterplot(x="income_sqrt", y="loan_amount_sqrt",
                data=reduced_df, hue="action_taken_name", s=15, ax=axes[0])
axes[0].set_title("Income vs. Loan Amount by Loan Result", fontsize=12)
axes[0].legend(bbox_to_anchor=(0.5, -0.6), loc=8)

sns.scatterplot(x="income_sqrt", y="loan_amount_sqrt",
                data=reduced_df, hue="applicant_sex_name", s=15, ax=axes[1])
axes[1].set_title("Income vs. Loan Amount by Sex", fontsize=12)
axes[1].legend(bbox_to_anchor=(0.5, -0.6), loc=8)

sns.scatterplot(x="income_sqrt", y="loan_amount_sqrt",
                data=reduced_df, hue="applicant_race_name_1", s=15, ax=axes[2])
axes[2].set_title("Income vs. Loan Amount by Race", fontsize=12)
axes[2].legend(bbox_to_anchor=(0.5, -0.6), loc=8)

plt.show()

Again, larger income and larger loan amount help applicants get their mortgage loans. There is a pattern that women are less likely to get their loans approved than men, even if the visualization is not balanced for categories. Also, black population has weaker chances to get their mortgage loans. However, all these categories are not decisively strong.

- 3D Plot: Income vs. Loan Amount vs. Minority Population

In [50]:
reduced_df = df_reducer(hmda_df, 500)

# comment out for interactive plotting
# %matplotlib notebook

fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection='3d')

actions = reduced_df['action_taken_name'].unique()
ranges = [('r', actions[0]), ('b', actions[1])]

for c, action in ranges:
    xs = reduced_df[reduced_df['action_taken_name']==action]['minority_population']
    ys = reduced_df[reduced_df['action_taken_name']==action]['income_sqrt']
    zs = reduced_df[reduced_df['action_taken_name']==action]['loan_amount_sqrt']
    ax.scatter(xs, ys, zs, c=c, s=5)

ax.set_title('Mortgage Approval (red) vs. Denial (blue)\n', fontsize=15)
ax.set_xlabel('Minority Population (%)')
ax.set_ylabel('Income')
ax.set_zlabel('Loan Amount')

plt.show()

As for high minority population rate, it is quite clear to group certain classes. For example, an application of high minority population rate/low income/low loan amount is very likely to be denied. The problem is when the minority population rate is low where dots are mixed together. In other words, Loan Amount and Income lose its sharpness to cut the classes into groups when the racial factor becomes mild.


- Minority Ratio by Loan Result

In [51]:
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection='3d')

actions = reduced_df['action_taken_name'].unique()
ranges = [('r', actions[0]), ('b', actions[1])]

for c, action in ranges:
    xs = reduced_df[reduced_df['action_taken_name']==action]['minority_population']
    ys = reduced_df[reduced_df['action_taken_name']==action]['number_of_1_to_4_family_units']
    zs = reduced_df[reduced_df['action_taken_name']==action]['number_of_owner_occupied_units']
    ax.scatter(xs, ys, zs, c=c, s=5)

ax.set_xlabel('Minority Population (%)')
ax.set_ylabel('Number of 1~4 Family Units')
ax.set_zlabel('Number of Owner Occupied Units')

plt.show()

Unit related features have the same problem that it becomes unrecognizable as minority population rate is lower. This may or may not be solved by categorical features.


• Pie Chart

In [52]:
%matplotlib inline

def cate_pie(col):
    fig, axes = plt.subplots(1, 2, figsize=(8, 4))

    actions = list(hmda_df['action_taken_name'].unique())
    indice = np.arange(len(actions))
    for action, i in zip(actions, indice):
        category = hmda_df[hmda_df['action_taken_name']==action][col]
        vc = category.value_counts(sort=False)
        axes[i].pie(vc.values, labels=vc.keys(), autopct='%1.1f%%')
        axes[i].set_title(action, fontsize=14)
        axes[i].axis('equal')

- Applicant Race

In [53]:
cate_pie('applicant_race_name_1')

The pie chart reveals very clearly that white applicants have more chances to earn the approval.


- Applicant Sex

In [54]:
cate_pie("applicant_sex_name")

Sex is not as strong, but still there is discrepany between men and women.


- Co-applicant

In [55]:
cate_pie("co_applicant")

Chances are high that a sole applicant is a single, reversely, applicant with somebody is married. A married applicant would probably be financially more established than a single who is probably younger and early in his/her career, which could explain the two different charts.


- Agency

In [56]:
cate_pie("agency_abbr")


- Loan Purpose

In [57]:
cate_pie('loan_purpose_name')

It seems like mortgage loans are more open to home purchase purpose. Put it another way, loan officers are hard to find reasons to deny home purchase applications or encouraged to give them approval for their benefits. Clearly, home improvement is not always necessary for a living.


- Loan Type

In [58]:
cate_pie('loan_type_name')

This categorical feature doesn't look like it will help classification from the pies. However, boxplot might uncover some differences.


• Boxplot

The purpose of boxplot for this project is to see if categorical data can work for classification by interacting with numerical data. Therefore, a categorical value measured differently by numerical data is likely helpful to distinguish classes. We want to go through some of the categorical data that showed certain patterns.

In [59]:
""" draw boxplot by category and loan results """
def category_boxplot(cate, title):
    sns.set(style="whitegrid")
    plt.figure(figsize=(8, 4))
    # income
    ax = sns.boxplot(x="income_sqrt", y="action_taken_name", hue=cate,
                     data=hmda_df)
    ax.set_title("Income by " + title, fontsize=15)
    ax.set_ylabel("")
    ax.legend(bbox_to_anchor=(1.05, 1), loc=2)

    plt.show()
In [60]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

sns.boxplot(x="action_taken_name", y="loan_amount_sqrt", hue="applicant_sex_name",
            data=hmda_df, ax=axes[0])
axes[0].set_title("Loan Amount by Sex", fontsize=15)
axes[0].set_xlabel("")
axes[0].legend(bbox_to_anchor=(0.5, -0.4), loc=8)

sns.boxplot(x="action_taken_name", y="loan_amount_sqrt", hue="agency_abbr",
            data=hmda_df, ax=axes[1])
axes[1].set_title("Loan Amount by Agency", fontsize=15)
axes[1].set_xlabel("")
axes[1].legend(bbox_to_anchor=(0.5, -0.4), loc=8)

sns.boxplot(x="action_taken_name", y="loan_amount_sqrt", hue="loan_type_name",
            data=hmda_df, ax=axes[2])
axes[2].set_title("Loan Amount by Loan Type", fontsize=15)
axes[2].set_xlabel("")
axes[2].legend(bbox_to_anchor=(0.5, -0.35), loc=8)
Out[60]:
<matplotlib.legend.Legend at 0x1f03dcd9128>

None of them are impressively separating classes. But they are still plotting two classes more or less differently.


5. Clustering

Now that we have found out some patterns, we can base them for clustering census tract numbers. To visualize about a hundred of them, we can draw yet another scatterplots using numerical features. By doing so, we can decide which clustering methods to use. Since often there is no clear standard when it comes to clustering, it is critical to plot data and visually judge what method is most suitable. But we can always go with trial and error.

In [61]:
# criteria 1 based on loan approval rate
census_minor = hmda_df.groupby(['census_tract_number'])['minority_population'].mean()
census_ownership = hmda_df.groupby(['census_tract_number'])['number_of_owner_occupied_units'].mean()
census_df_1 = pd.DataFrame({'minority': census_minor, 'ownership': census_ownership})
census_df_1.reset_index(inplace=True)

# criteria 2 based on income vs. loan amount
census_income = hmda_df.groupby(['census_tract_number'])['income_sqrt'].mean()
census_loan_amount = hmda_df.groupby(['census_tract_number'])['loan_amount_sqrt'].mean()
census_df_2 = pd.DataFrame({'income': census_income, 'loan_amount': census_loan_amount})
census_df_2.reset_index(inplace=True)

# merge criteria 1 and 2
census_df = pd.merge(census_df_1, census_df_2)
In [62]:
# combination of numerical data
fig, axes = plt.subplots(2, 2, figsize=(12, 11))

sns.scatterplot(x="minority", y="ownership", data=census_df, ax=axes[0,0])
axes[0,0].set_title("Minority vs. Ownership", fontsize=15)

sns.scatterplot(x="minority", y="loan_amount", data=census_df, ax=axes[0,1])
axes[0,1].set_title("Minority vs. Loan Amount", fontsize=15)

sns.scatterplot(x="income", y="loan_amount", data=census_df, ax=axes[1,0])
axes[1,0].set_title("Income vs. Loan Amount", fontsize=15)

sns.scatterplot(x="ownership", y="loan_amount", data=census_df, ax=axes[1,1])
axes[1,1].set_title("Ownership vs. Loan Amount", fontsize=15)

plt.tight_layout()

• Hierarchical Clustering: Dendrogram

The utmost advantage of hierarchical clustering is that it gives a general idea of how many groups in the data. Especially, dendrogram is very useful to see it visually. First, we can use the approval rate plot (first from the previous plots) for this method as we focus on distance.

In [63]:
def hier_clustering_labels(X, n):
    clustering = AgglomerativeClustering(
        n_clusters=n, affinity='euclidean', linkage='ward')
    clustering.fit_predict(X)

    return clustering.labels_
In [64]:
X = [[x[0], x[1]] for x in zip(census_df['minority'], census_df['loan_amount'])]
X = np.array(X)
In [65]:
sns.set(style='white')
plt.figure(figsize=(12,6))
plt.title("Census Tract Dendrogram")
dend = shc.dendrogram(shc.linkage(X, method='ward'))

It is hard to which census tract number is related to another from the dendrogram, but we can still figure out the number to group them depending on where to cut off. The bset way is plotting them and iterate some numbers for clustering to decide the number of clusters.

In [66]:
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
sns.despine()

axes[0,0].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 2), cmap='rainbow')
axes[0,0].set_title("Cluster n=2", fontsize=15)

axes[0,1].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 3), cmap='rainbow')
axes[0,1].set_title("Cluster n=3", fontsize=15)

axes[0,2].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 4), cmap='rainbow')
axes[0,2].set_title("Cluster n=4", fontsize=15)

axes[1,0].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 5), cmap='rainbow')
axes[1,0].set_title("Cluster n=5", fontsize=15)

axes[1,1].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 6), cmap='rainbow')
axes[1,1].set_title("Cluster n=6", fontsize=15)

axes[1,2].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 7), cmap='rainbow')
axes[1,2].set_title("Cluster n=7", fontsize=15)

plt.tight_layout()

Unfortunately, the clusters are cut vertically, which means that only x-axis may be enough to group them. So, the clustering might be meaningless because an existing feature already can do the job. However, clustering with n=3 seems to make some sense. We can add this feature and always remove it later when modeling.

In [67]:
# create a cluster feature
clusters = dict(zip(census_df['census_tract_number'], hier_clustering_labels(X, 3)))
hmda_df['census_hier'] = hmda_df['census_tract_number'].apply(lambda x : clusters[x])

• Centroid Based Clustering: K-Means

In [68]:
def kmeans_labels(X, n_clusters):
    kmeans = KMeans(n_clusters=n_clusters).fit(X)

    return kmeans.labels_
In [69]:
X = [[x[0], x[1]] for x in zip(census_df['income'], census_df['loan_amount'])]
X = np.array(X)
In [70]:
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
sns.despine()

axes[0,0].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 2), cmap='rainbow')
axes[0,0].set_title("Cluster n=2", fontsize=15)

axes[0,1].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 3), cmap='rainbow')
axes[0,1].set_title("Cluster n=3", fontsize=15)

axes[0,2].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 4), cmap='rainbow')
axes[0,2].set_title("Cluster n=4", fontsize=15)

axes[1,0].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 5), cmap='rainbow')
axes[1,0].set_title("Cluster n=5", fontsize=15)

axes[1,1].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 6), cmap='rainbow')
axes[1,1].set_title("Cluster n=6", fontsize=15)

axes[1,2].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 7), cmap='rainbow')
axes[1,2].set_title("Cluster n=7", fontsize=15)
Out[70]:
Text(0.5,1,'Cluster n=7')
In [71]:
# create a cluster feature
clusters = dict(zip(census_df['census_tract_number'], kmeans_labels(X, 5)))
hmda_df['census_kmeans'] = hmda_df['census_tract_number'].apply(lambda x : clusters[x])

This could be useful because the clusters are naturally separated by income level and loan amount. It could strengthen the pattern we have already found by increasing the sparsity.


• Density Based Clustering: DBSCAN

DBSCAN takes a different course than the hierarchical clustering does as previously shown. This method is based on density, which means how closely the dots are located from each other. It needs scailing for better processing beforehand, but the rest are very similar steps as done for hierarchical clustering. That is, plotting and iteration are heavily used to find the best number of clusters. Specifically speaking, finding the best epilon and minimum number of samples is our goal.

In [72]:
def db_scan_labels(X, eps, min_samples):
    db_scan = DBSCAN(eps=eps, min_samples=min_samples).fit(X)

    return db_scan.labels_

- Income vs. Loan Amount: Noise Removal

In [73]:
X = [[x[0], x[1]] for x in zip(census_df['minority'], census_df['ownership'])]
X = np.array(X)
X = ppr.StandardScaler().fit_transform(X)
In [74]:
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
sns.despine()

axes[0,0].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.15, 2), cmap='rainbow')
axes[0,0].set_title("eps=0.15, min_sample=2", fontsize=15)

axes[0,1].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.15, 3), cmap='rainbow')
axes[0,1].set_title("eps=0.15, min_sample=3", fontsize=15)

axes[0,2].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.2, 4), cmap='rainbow')
axes[0,2].set_title("eps=0.2, min_sample=4", fontsize=15)

axes[1,0].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.25, 4), cmap='rainbow')
axes[1,0].set_title("eps=0.25, min_sample=5", fontsize=15)

axes[1,1].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.3, 4), cmap='rainbow')
axes[1,1].set_title("eps=0.3, min_sample=4", fontsize=15)

axes[1,2].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.3, 5), cmap='rainbow')
axes[1,2].set_title("eps=0.3, min_sample=5", fontsize=15)
Out[74]:
Text(0.5,1,'eps=0.3, min_sample=5')

Aware that purple dots are noises, the result of epsilon equal to 0.15 and minimum samples 3 should probably be the best clustering as the others have a couple or more flaws. However, there are so many unneccesary small clusters which we can get rid of once clustering process is done.

In [75]:
# create a cluster feature
clusters = dict(zip(census_df['census_tract_number'], db_scan_labels(X, 0.15, 3)))
hmda_df['census_dbscan_1'] = hmda_df['census_tract_number'].apply(lambda x : clusters[x])

# 2 clusters and all noises
hmda_df['census_dbscan_1'] = hmda_df['census_dbscan_1'].apply(lambda x : -1
                                    if x not in [0, 10] else x)


- Ownership vs. Loan Amount: Grouping

In [76]:
X = [[x[0], x[1]] for x in zip(census_df['ownership'], census_df['loan_amount'])]
X = np.array(X)
X = ppr.StandardScaler().fit_transform(X)
In [77]:
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
sns.despine()

axes[0,0].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.15, 5), cmap='rainbow')
axes[0,0].set_title("eps=0.15, min_sample=4", fontsize=15)

axes[0,1].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.2, 6), cmap='rainbow')
axes[0,1].set_title("eps=0.2, min_sample=3", fontsize=15)

axes[0,2].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.2, 7), cmap='rainbow')
axes[0,2].set_title("eps=0.2, min_sample=7", fontsize=15)

axes[1,0].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.25, 6), cmap='rainbow')
axes[1,0].set_title("eps=0.25, min_sample=6", fontsize=15)

axes[1,1].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.3, 6), cmap='rainbow')
axes[1,1].set_title("eps=0.3, min_sample=6", fontsize=15)

axes[1,2].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.35, 5), cmap='rainbow')
axes[1,2].set_title("eps=0.35, min_sample=5", fontsize=15)
Out[77]:
Text(0.5,1,'eps=0.35, min_sample=5')

It should be reasonable to have only one large cluster and tag all the rest as noise.

In [78]:
# create a cluster feature
clusters = dict(zip(census_df['census_tract_number'], db_scan_labels(X, 0.2, 7)))
hmda_df['census_dbscan_2'] = hmda_df['census_tract_number'].apply(lambda x : clusters[x])

# 1 cluster and all noises
hmda_df['census_dbscan_2'] = hmda_df['census_dbscan_2'].apply(lambda x : x if x == 0 else -1)
In [79]:
# drop census tract number
hmda_df.drop(['census_tract_number'], axis=1, inplace=True)

6. Modeling

Decision Tree is simple, thus, it is often used as a baseline model. We will build a Decision Tree model with all features included, and then test other models like Logistics, Random Forest (bagging), Naive Bayes (independence), Gradient Boosting (boosting). Also, we will manipulate a feature set using Recursive Feature Elimination and Pinciple Component Analysis.

In [80]:
# reduce the most class
second_large_supp = len(hmda_df[hmda_df['action_taken_name']=="Denied"])
size_drop = len(hmda_df[hmda_df['action_taken_name']=="Approved"]) - second_large_supp
drop_indice = np.random.choice(list(hmda_df[hmda_df['action_taken_name']=="Approved"].index),
                               size_drop, replace=False)

hmda_df = hmda_df.loc[~hmda_df.index.isin(drop_indice)]
In [81]:
def split_X_y(df, features, y):
    cate_cols = df.select_dtypes(include='category').columns
    for col in cate_cols:
        le = ppr.LabelEncoder()
        le.fit(df[col])
        df[col] = le.transform(df[col])

    X = np.array(df[features])
    y = np.array(df[y])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33)

    # normalize for faster calculation
    X_train = ppr.normalize(X_train)
    X_test = ppr.normalize(X_test)

    return X_train, X_test, y_train, y_test
In [82]:
# all features
features = list(hmda_df.columns)
features.remove('action_taken_name')

# X, y
X_train, X_test, y_train, y_test = split_X_y(hmda_df, features, 'action_taken_name')

In [83]:
# train and show scores
def train_evaluate(clf, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)

    print("5-fold Cross Validation Accuracy:", round(cross_val_score(clf, X_train, y_train, cv=5).mean(), 2))
    print("\nConfusion Matrix\n", confusion_matrix(y_test, pred))
    print("\n", classification_report(y_test, pred))

    return clf

• Baseline: Decision Tree

In [84]:
clf_dt = train_evaluate(tree.DecisionTreeClassifier(), X_train, X_test, y_train, y_test)
5-fold Cross Validation Accuracy: 0.61

Confusion Matrix
 [[5718 3691]
 [3524 5863]]

              precision    recall  f1-score   support

          0       0.62      0.61      0.61      9409
          1       0.61      0.62      0.62      9387

avg / total       0.62      0.62      0.62     18796

The result is not so impressive, however, the errors are uniformly distributed at least.

• Logistic Regression

In [85]:
clf_lr = train_evaluate(LogisticRegression(), X_train, X_test, y_train, y_test)
5-fold Cross Validation Accuracy: 0.61

Confusion Matrix
 [[7944 1465]
 [5835 3552]]

              precision    recall  f1-score   support

          0       0.58      0.84      0.69      9409
          1       0.71      0.38      0.49      9387

avg / total       0.64      0.61      0.59     18796

Unlike the Decision Tree result, the errors are concentrated on Approval cases.

• Bagging: Random Forest

In [86]:
clf_rf = train_evaluate(RandomForestClassifier(max_depth=20), X_train, X_test, y_train, y_test)
5-fold Cross Validation Accuracy: 0.66

Confusion Matrix
 [[6745 2664]
 [3460 5927]]

              precision    recall  f1-score   support

          0       0.66      0.72      0.69      9409
          1       0.69      0.63      0.66      9387

avg / total       0.68      0.67      0.67     18796

The result is better than the Decision Tree, but it is still so good.

• Naive Bayes

In [87]:
clf_nb = train_evaluate(GaussianNB(), X_train, X_test, y_train, y_test)
5-fold Cross Validation Accuracy: 0.58

Confusion Matrix
 [[8723  686]
 [6969 2418]]

              precision    recall  f1-score   support

          0       0.56      0.93      0.70      9409
          1       0.78      0.26      0.39      9387

avg / total       0.67      0.59      0.54     18796

Like the Logistic Regression, it is biased to predict Denied only.


• Stochastic: Gradient Boosting

Gradient Boosting is a very sophisticated model to configurate parameters for the best performance. The following will seek the optimal parameters based on AUC scores.

In [88]:
def roc_auc(clf, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train)
    train_pred = clf.predict(X_train)
    fpr, tpr, thrd = roc_curve(y_train, train_pred)
    roc_auc_train = auc(fpr, tpr)

    test_pred = clf.predict(X_test)
    fpr, tpr, thrd = roc_curve(y_test, test_pred)
    roc_auc_test = auc(fpr, tpr)

    return roc_auc_train, roc_auc_test
In [89]:
# learning rate
learning_rates = [1, 0.5, 0.25, 0.1, 0.05, 0.01]
lr_train = []
lr_test = []
for lr in learning_rates:
    clf_gb = GradientBoostingClassifier(learning_rate=lr)
    roc_auc_train, roc_auc_test = roc_auc(
        clf_gb, X_train, X_test, y_train, y_test)
    lr_train.append(roc_auc_train)
    lr_test.append(roc_auc_test)

# number of estimators
n_estimators = [1, 2, 4, 8, 16, 32, 64, 100, 150, 200]
est_train = []
est_test = []
for n in n_estimators:
    clf_gb = GradientBoostingClassifier(n_estimators=n)
    roc_auc_train, roc_auc_test = roc_auc(
        clf_gb, X_train, X_test, y_train, y_test)
    est_train.append(roc_auc_train)
    est_test.append(roc_auc_test)

# max depth
max_depths = np.arange(1, 16)
md_train = []
md_test = []
for d in max_depths:
    clf_gb = GradientBoostingClassifier(max_depth=d)
    roc_auc_train, roc_auc_test = roc_auc(
        clf_gb, X_train, X_test, y_train, y_test)
    md_train.append(roc_auc_train)
    md_test.append(roc_auc_test)

# max_features
max_features = list(range(1, len(features)))
mf_train = []
mf_test = []
for f in max_features:
    clf_gb = GradientBoostingClassifier(max_features=f)
    roc_auc_train, roc_auc_test = roc_auc(
        clf_gb, X_train, X_test, y_train, y_test)
    mf_train.append(roc_auc_train)
    mf_test.append(roc_auc_test)
In [90]:
fig, axes = plt.subplots(2, 2, figsize=(9, 9))

axes[0,0].plot(learning_rates, lr_train, 'b')
axes[0,0].plot(learning_rates, lr_test, 'r')
axes[0,0].set_title("Learning Rate")

axes[0,1].plot(n_estimators, est_train, 'b')
axes[0,1].plot(n_estimators, est_test, 'r')
axes[0,1].set_title("N Estimators")

axes[1,0].plot(max_depths, md_train, 'b')
axes[1,0].plot(max_depths, md_test, 'r')
axes[1,0].set_title("Max Depth")

axes[1,1].plot(max_features, mf_train, 'b')
axes[1,1].plot(max_features, mf_test, 'r')
axes[1,1].set_title("Max Features")
Out[90]:
Text(0.5,1,'Max Features')
In [91]:
clf_gb = train_evaluate(GradientBoostingClassifier(
        learning_rate=0.2,
        n_estimators=150,
        max_depth=7,
        max_features=5
    ), X_train, X_test, y_train, y_test)
5-fold Cross Validation Accuracy: 0.68

Confusion Matrix
 [[6833 2576]
 [3280 6107]]

              precision    recall  f1-score   support

          0       0.68      0.73      0.70      9409
          1       0.70      0.65      0.68      9387

avg / total       0.69      0.69      0.69     18796



• Neural Network

Similar with Gradient Boosting, Neural Network need a certain set of parameters for tuning. This can be done using the Grid Search method.

In [92]:
params = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate': ['constant','adaptive'],
}

clf = GridSearchCV(MLPClassifier(), params, n_jobs=-1, cv=3)
clf.fit(X_train, y_train)

print('Best parameters found:', clf.best_params_)
Best parameters found: {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'adaptive', 'solver': 'adam'}
In [93]:
clf_mlp = train_evaluate(MLPClassifier(
                    alpha=0.001,
                    activation='relu',
                    solver='adam',
                    hidden_layer_sizes=(50,100,50),
                    learning_rate='constant'),
                X_train, X_test, y_train, y_test)
5-fold Cross Validation Accuracy: 0.65

Confusion Matrix
 [[8102 1307]
 [5343 4044]]

              precision    recall  f1-score   support

          0       0.60      0.86      0.71      9409
          1       0.76      0.43      0.55      9387

avg / total       0.68      0.65      0.63     18796


• Feature Manipulation: RFE, Heuristic

In [94]:
model = LogisticRegression()
rfe = RFE(model, 9).fit(X_train, y_train)

feature_rank = []
features_rfe = []

for feature, rank in zip(features, rfe.ranking_):
    feature_rank.append((feature, rank))
    if rank == 1:
        features_rfe.append(feature)

X_train_rfe, X_test_rfe, y_train_rfe, y_test_rfe = split_X_y(
    hmda_df, features_rfe, 'action_taken_name')

sorted(feature_rank, key=lambda tup : tup[1])
Out[94]:
[('agency_abbr', 1),
 ('applicant_race_name_1', 1),
 ('minority_population', 1),
 ('income_sqrt', 1),
 ('loan_amount_sqrt', 1),
 ('tract_income_to_loan', 1),
 ('income_to_loan', 1),
 ('census_hier', 1),
 ('census_dbscan_1', 1),
 ('census_kmeans', 2),
 ('population', 3),
 ('loan_purpose_name', 4),
 ('co_applicant', 5),
 ('census_dbscan_2', 6),
 ('loan_type_name', 7),
 ('number_of_1_to_4_family_units', 8),
 ('number_of_owner_occupied_units', 9),
 ('applicant_sex_name', 10)]
In [95]:
clf_lr_rfe = train_evaluate(LogisticRegression(), X_train_rfe, X_test_rfe, y_train_rfe, y_test_rfe)
5-fold Cross Validation Accuracy: 0.65

Confusion Matrix
 [[6621 2692]
 [3971 5512]]

              precision    recall  f1-score   support

          0       0.63      0.71      0.67      9313
          1       0.67      0.58      0.62      9483

avg / total       0.65      0.65      0.64     18796

Selecting 9 of half the features, the f1-score of Logistic Regression is improved from 0.59. This indicates that the features had too much interaction to interfere with our target classes. Now, we can take this into consideration to select features in a heuristic way. The rule is to include features that care less correlated and use the Naive Bayes model which assumes independence among features.

In [96]:
features_comb = ['agency_abbr', 'applicant_sex_name', 'minority_population', 'income_to_loan',
                 'loan_purpose_name', 'loan_type_name', 'co_applicant',
                 'census_kmeans', 'census_dbscan_1', 'census_hier',
                 'number_of_1_to_4_family_units', 'number_of_owner_occupied_units']

X_train_comb, X_test_comb, y_train_comb, y_test_comb = split_X_y(
    hmda_df, features_comb, 'action_taken_name')
In [97]:
clf_nb_comb = train_evaluate(GaussianNB(), X_train_comb, X_test_comb, y_train_comb, y_test_comb)
5-fold Cross Validation Accuracy: 0.6

Confusion Matrix
 [[6179 3188]
 [4227 5202]]

              precision    recall  f1-score   support

          0       0.59      0.66      0.62      9367
          1       0.62      0.55      0.58      9429

avg / total       0.61      0.61      0.60     18796



• Principle Component Analysis

PCA takes a totally differnt look at features in that RFE and other feature selection methods simply include/exclude features. For that reason, we can hardly say which feature makes how much contributions.

In [98]:
reduced_df = df_reducer(hmda_df, 10000)

X_train_pca, X_test_pca, y_train_pca, y_test_pca = split_X_y(
    reduced_df, features_rfe, 'action_taken_name')

pca = PCA(n_components=5)
pca.fit(X_train_pca)

X_train_pca = pca.transform(X_train_pca)
X_test_pca = pca.transform(X_test_pca)

clf_pca = SVC(C=5, gamma=10)
clf_pca.fit(X_train_pca, y_train_pca)

print(pca.explained_variance_ratio_)

train_evaluate(clf_pca, X_train_pca, X_test_pca, y_train_pca, y_test_pca)
[0.77424267 0.10446486 0.05925091 0.03537464 0.01358645]
5-fold Cross Validation Accuracy: 0.66

Confusion Matrix
 [[1326  342]
 [ 809  823]]

              precision    recall  f1-score   support

          0       0.62      0.79      0.70      1668
          1       0.71      0.50      0.59      1632

avg / total       0.66      0.65      0.64      3300

Out[98]:
SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=10, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)



• Voting Ensemble

So far, we have found out that Gradient Boosting works the best followed by Random Forest. What we can do with these is mixing them together to make them vote. Also, we can add Naive Bayes because the model is fundamentally different. Voting Ensemble is useful when a certain level of diversity is secured.

In [99]:
estimators = []

estimators.append(('nb', clf_nb))
estimators.append(('rf', clf_rf))
estimators.append(('gb', clf_gb))

ensemble = VotingClassifier(estimators)
clf_ens = train_evaluate(ensemble, X_train, X_test, y_train, y_test)
5-fold Cross Validation Accuracy: 0.68

Confusion Matrix
 [[7470 1939]
 [4048 5339]]

              precision    recall  f1-score   support

          0       0.65      0.79      0.71      9409
          1       0.73      0.57      0.64      9387

avg / total       0.69      0.68      0.68     18796


7. Conclusion

Through the data analysis, we learnt the following patterns.

  • Income and loan amount are correlated, and the larger they are the higher chance to get approval
  • Higher minority population rate affects mortgage loan results negatively
  • Low minority population rate has a weak influence on approval
  • A white male is more likely to get approval then a black female
  • Racial/Sex factors become insignificant as loan amount and income increase
  • Home purchase is the best purpose for approval

However, these patterns are not so strong that it was difficult to clearly cut the groups. There were such interactions among features that classifying observations in one way would be no longer the same if in another way. Figuratively speaking, observations were tangled even in a finely separated group. For example, two cases having almost or exact the same condition would be classified differently; Approved and Denied, repectively.

To solve this, we used sophisticated models like Gradient Boosting and MLP. Also, we tried a different feature set as well as PCA. However, the entanglement was at a serious level making the best model performance around 0.7 accuracy and f1-score.


Reference