HMDA Data Analysis and Prediction Model¶

By Kyoosik Kim

Home Mortgage Disclosure Act (HMDA) is a federal law in the US that requires financial institutions to provide mortgage data to the public. The data contains basic information regarding mortgage loans such as applicant's income and race, loan amount, regions, and many others. What we focus is the relationship between these factors and the result of evaluation. There are 6 different available results, among which approval and denial are the major counts. Through the analysis, we will find patterns in numerical and categorical features, and eventually apply them to building a predictive model.

Setup and Query Dataset

Preparation

Data Cleaning: Category Integration, Type Conversion

Data Analysis: Scatterplot, Piechart, Boxplot

Clustering: Hierarchical, K-means, Density-based

Modeling: Decision Tree, Naive Bayes, Random Forest, Gradient Boosting, RFE, PCA, Voting Ensemble

Conclusion

1. Setup and Query Dataset¶

The data is pulled directly from the CFPB API. Although downloading in CSV is also available on Custom Dataset page at the same website, API has a benefit of flexibility like querying data. The data set is HMDA data of Detroit Metropolitan Area where social issues often arise such as racial discrimination or high crime rate. The data format is JSON to be able to fit MongoDB so that we can query the data fast. The purpose of querying is to briefly learn about the features as a preliminary process before detailed analysis. The following are libraries to be used for the entire project.

# file control, api, mongodb
import json
import os
import urllib3
from pymongo import MongoClient

# dataframe
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D

# clustering
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans, DBSCAN
import scipy.cluster.hierarchy as shc

# model preparation
from sklearn import preprocessing as ppr
from sklearn.utils import resample
from sklearn.model_selection import train_test_split

# modeling
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import RFE

# model evaluation
from sklearn.model_selection import cross_val_score
from sklearn.metrics import *

# others
import scipy.stats as stats
from pprint import pprint
import warnings
warnings.filterwarnings('ignore')

""" 
year, state, county can be given as condition
pull all the features in JSON
save the result as a file for later use
"""
def pull_dataset(file_name, year, msamd, code):
    http = urllib3.PoolManager()
    # request if file doesn't exist
    if not os.path.isfile(file_name):
        cfpb_url = ('https://api.consumerfinance.gov/data/hmda/slice/hmda_lar.json?&'
            '$where=as_of_year+IN+('+year+')+AND+action_taken+IN+(1,3)+AND+(msamd=%'+msamd+'%'+code+')&'
            '$select=tract_to_msamd_income,'
            'population,'
            'minority_population,'
            'number_of_owner_occupied_units,'
            'number_of_1_to_4_family_units,'
            'loan_amount_000s,'
            'applicant_income_000s,'
            'loan_type_name,'
            'loan_purpose_name,'
            'co_applicant_race_name_1,'
            'census_tract_number,'
            'applicant_sex_name,'
            'applicant_race_name_1,'
            'agency_abbr,'
            'action_taken_name&$limit=0')

        # receive data and store in JSON
        response = http.request('GET', cfpb_url)
        json_data = json.loads(response.data.decode('utf-8'))
        results = json_data['results']
        # write file
        with open(file_name, 'w', encoding='utf-8') as outfile:
            json_string = json.dumps(results)
            outfile.write(json_string)
        outfile.close()

    # read file if file already exists
    with open(file_name, 'rb') as infile:
        data = json.loads(infile.read())
    infile.close()

    return data

# recent 3 years to local DB
detroit = pull_dataset("detroit.json", "2017,2016,2015", "2219804", "22")

# do not write if db already exists
client = MongoClient('localhost:27017')
if "hmda" not in client.list_database_names():
    db = client.hmda
    db['detroit'].insert_many(detroit)

db = client.hmda
query_count = db.detroit.count()
query_find_one = db.detroit.find_one()

print("Number of Cases:", query_count)
print("\n<Case Example>")
pprint(query_find_one)

Number of Cases: 119794

<Case Example>
{'_id': ObjectId('5c38f906d9973504dc7db2e4'),
 'action_taken_name': 'Loan originated',
 'agency_abbr': 'HUD',
 'applicant_income_000s': 43,
 'applicant_race_name_1': 'White',
 'applicant_sex_name': 'Female',
 'census_tract_number': '5833.00',
 'co_applicant_race_name_1': 'No co-applicant',
 'loan_amount_000s': 66,
 'loan_purpose_name': 'Refinancing',
 'loan_type_name': 'FHA-insured',
 'minority_population': 11.329999923706055,
 'number_of_1_to_4_family_units': 1031,
 'number_of_owner_occupied_units': 862,
 'population': 2550,
 'tract_to_msamd_income': 96.79000091552734}

# screen meaningless features
def check_disticnt(keys):
    one_value_keys = []
    for key in keys:
        query_distinct = db.detroit.distinct(key)
        if len(query_distinct) == 1:
            one_value_keys.append(key)

    return one_value_keys

keys = query_find_one
print("Number of Keys:", len(keys))
print("\nKeys to be dropped for having the same value:")
pprint(check_disticnt(keys))

Number of Keys: 16

Keys to be dropped for having the same value:
[]

def convert_to_list(query_result, roundup_keys):
    doc_list = []
    for doc in query_result:
        for key in roundup_keys:
            try:
                doc[key] = round(doc[key], 2)
            except:
                doc[key] = "NaN"
        doc_list.append(doc)

    return doc_list

query_action_taken = db.detroit.aggregate([
    {"$group":
     {
        "_id": "$action_taken_name",
        "count": {"$sum": 1},
        "amount": {"$avg": "$loan_amount_000s"},
        "income": {"$avg": "$applicant_income_000s"}
     }
    },
    {"$sort": {"count": -1}}
])

pprint(convert_to_list(query_action_taken, ['amount', 'income']))

[{'_id': 'Loan originated', 'amount': 159.8, 'count': 88522, 'income': 90.57},
 {'_id': 'Application denied by financial institution',
  'amount': 109.23,
  'count': 31272,
  'income': 70.96}]

As the target is loan evaluation results; approved vs. denied, the query result above help us understand how the classes are distributed. In addition, it is also telling that loan amount and income are different across the loan results. Therefore, these features are suspected to play a major role for analysis.

# query loan accepted
field_val = "Loan originated"
total_count = db.detroit.find({"action_taken_name": field_val}).count()

query_loan_originated = db.detroit.aggregate([
    {"$match": {"action_taken_name": field_val}},
    {"$group": {"_id": "$applicant_race_name_1",
                "count": {"$sum": 1},
                "amount": {"$avg": "$loan_amount_000s"},
                "income": {"$avg": "$applicant_income_000s"}}
    },
    {"$sort": {"count": -1}},
    {"$addFields": {"case_share": {"$divide": ["$count", total_count]}}}
])

acceptance = convert_to_list(query_loan_originated, ['amount', 'income', 'case_share'])

# query loan denied
field_val = "Application denied by financial institution"
total_count = db.detroit.find({"action_taken_name": field_val}).count()

query_loan_originated = db.detroit.aggregate([
    {"$match": {"action_taken_name": field_val}},
    {"$group": {"_id": "$applicant_race_name_1",
                "count": {"$sum": 1},
                "amount": {"$avg": "$loan_amount_000s"},
                "income": {"$avg": "$applicant_income_000s"}}
    },
    {"$sort": {"count": -1}},
    {"$addFields": {"case_share": {"$divide": ["$count", total_count]}}}
])

denial = convert_to_list(query_loan_originated, ['amount', 'income', 'case_share'])

pd.concat({'Accepted': pd.DataFrame(acceptance), 'Denied': pd.DataFrame(denial)})

Another query result of racial factors seem to be useful. It is quite clear that there are differences among different races, thus, race-related features should be significant for analysis. Along with these, some other features like loan type, agency, loan purpose will be put together with aforementioned features to find patterns.

2. Preparation¶

The next step becomes turning JSON data on MongoDB into a CSV file so we can make use of analytical tools on Python. Once this step is done, we should handle missing values first before moving forward.

""" Read from Mongo and Store into DataFrame """
def read_mongo(collection):
    # Connect to MongoDB
    client = MongoClient('localhost:27017')
    db = client.hmda
    # Make a query to the specific DB and Collection
    cursor = db[collection].find({})
    # Expand the cursor and construct the DataFrame
    df =  pd.DataFrame(list(cursor))
    df.drop(['_id'], axis=1, inplace=True)

    return df

hmda_df = read_mongo("detroit")
hmda_df.shape

(119794, 15)

hmda_df.head()

hmda_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119794 entries, 0 to 119793
Data columns (total 15 columns):
action_taken_name                 119794 non-null object
agency_abbr                       119794 non-null object
applicant_income_000s             113106 non-null float64
applicant_race_name_1             119794 non-null object
applicant_sex_name                119794 non-null object
census_tract_number               119794 non-null object
co_applicant_race_name_1          119794 non-null object
loan_amount_000s                  119794 non-null int64
loan_purpose_name                 119794 non-null object
loan_type_name                    119794 non-null object
minority_population               119787 non-null float64
number_of_1_to_4_family_units     119787 non-null float64
number_of_owner_occupied_units    119784 non-null float64
population                        119794 non-null int64
tract_to_msamd_income             119787 non-null float64
dtypes: float64(5), int64(2), object(8)
memory usage: 13.7+ MB

Fortunately, most of the features are complete or missing a very few. For those features with missing values, the following can be considered for handling.

Drop the cases
Fill with statistics such as mean/median (numeric), or mode (categorical)
Estimate from other features

While there could be other methods, the above are largely suitable for the situation. To decide which, we need to see if the missing values are systematic or random. The latter is more flexible for both imputation methods while the former case needs further investigation. Since there does not seem to have a pattern in missing values, we can simply drop them when filling with statistics could make it more complicated.

hmda_df.dropna(inplace=True)
hmda_df.shape

(113101, 15)

3. Data Cleaning: Categorical Data¶

The last step before analysis is cleaning the data. Since we have many categorical features, it is necessary to look at them to see if any of them is problematic. This is a safer way because some analysis would later need to use categorical features together with numerical features like boxplot as an instance. It would be a good idea to organize categories first before doing such a thing.

# handle a trivial error
def fix_error_message(df, cols):
    error_message = "Information not provided by applicant in mail, Internet, or telephone application"
    for col in cols:
        df[col] = df[col].apply(lambda x :
                    "Not available" if x==error_message else x)
    return df

# fix error values
cols = list(hmda_df.select_dtypes('object').columns)
hmda_df = fix_error_message(hmda_df, cols)

# long name to shorter version
hmda_df = hmda_df.apply(pd.Series.replace, to_replace="Native Hawaiian or Other Pacific Islander",
                        value="Islander")
hmda_df = hmda_df.apply(pd.Series.replace, to_replace="American Indian or Alaska Native",
                        value="Native American")
hmda_df = hmda_df.apply(pd.Series.replace, to_replace="Black or African American",
                        value="Black")

# horizontal barplot
def value_counts_bar(col, title):
    vc = col.value_counts()
    plt.figure(figsize=(6, len(vc.keys())/2))
    plt.barh(vc.keys(), vc.values)
    plt.title(title, fontsize=15)
    plt.xticks(rotation='vertical', fontsize=10)
    plt.show()

• Loan Result

value_counts_bar(hmda_df.action_taken_name, "Action Taken")

# change class value names
hmda_df['action_taken_name'] = hmda_df['action_taken_name'].apply(lambda x : "Approved"
                                    if x == "Loan originated" else "Denied")

• Agency

value_counts_bar(hmda_df.agency_abbr, "Agency")

hmda_df['agency_abbr'] = hmda_df['agency_abbr'].apply(lambda x : "Others"
                                if x not in ['CFPB', 'HUD', 'NCUA'] else x)

While the top 3 values (HUD/CFPB/NCUA) account for the most, the rest are only a fraction. Since having too many values for such a small count could lead to overfitting, they can be merged together as Others.

• Applicant Race

value_counts_bar(hmda_df.applicant_race_name_1, "Applicant Race")

# White and Black are major two
hmda_df['applicant_race_name_1'] = hmda_df['applicant_race_name_1'].apply(lambda x : 'Others'
                                        if x not in ['White', 'Black'] else x)

• Applicant Sex

value_counts_bar(hmda_df.applicant_sex_name, "Applicant Sex")

hmda_df['applicant_sex_name'] = hmda_df['applicant_sex_name'].apply(lambda x : "Others"
                                        if x not in ['Male', 'Female'] else x)

• Co-applicant Race

value_counts_bar(hmda_df.co_applicant_race_name_1, "Co-applicant Race")

# with or without co-applicant
hmda_df['co_applicant_race_name_1'] = hmda_df['co_applicant_race_name_1'].apply(lambda x : "With co-applicant"
                                            if x != 'No co-applicant' else x)
hmda_df.rename(columns={'co_applicant_race_name_1': 'co_applicant'}, inplace=True)

• Loan Purpose

value_counts_bar(hmda_df.loan_purpose_name, "Loan Purpose")

• Loan Type

value_counts_bar(hmda_df.loan_type_name, "Loan Type")

hmda_df['loan_type_name'] = hmda_df['loan_type_name'].apply(lambda x : "Non-conventional"
                                    if x != "Conventional" else x)

• Census Tract Number

hmda_df = hmda_df[hmda_df['census_tract_number'] != ""]
print("Number of Census Tract:", len(hmda_df['census_tract_number'].unique()))

Number of Census Tract: 598

Census tract numbers are not numerical data as their zipcode-like numbers do not mean anything. Detroit Census Information can help to understand the basic statistics with a visual support. Census tract numbers are basically deemed as categorical data, but there are over a hundred different values. In fact, we might still use it as categorical data but this could cause overfitting issues, missing general trends. The best way to deal with this is clustering. This technique, though, requires us to know about the dataset first as we have yet to know what based we should cluster them on.

• Type Conversion and Summary

# object to category
object_cols = list(hmda_df.select_dtypes('object').columns)
object_cols.remove('census_tract_number')
hmda_df[object_cols] = hmda_df[object_cols].astype('category')

hmda_df.describe(include=['category']).drop('count').T

3. Data Analysis¶

As plotting data always comes in handy to feel the data, scatterplot and boxplot will be used as data visualization. Before them, we need to know the distribution for numerical data. If they are overly skewed or they have strange patterns, they should be handled first.

hmda_df.describe().drop('count').T

The statistics are already showing some problems with dollar-related features that the max is too much compared with 75 quantile. This means the observations are highly skewed. The details can be revealed by plotting them.

• Histogram

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

axes[0].hist(hmda_df.applicant_income_000s, bins=1000)
axes[0].axvline(hmda_df.applicant_income_000s.mean(), color='red', linestyle='--')
axes[0].set_xlim([0, 300])
axes[0].set_title("Income", fontsize=15)
axes[0].set_xlabel("Thousand $")

axes[1].hist(hmda_df.loan_amount_000s, bins=300)
axes[1].axvline(hmda_df.loan_amount_000s.mean(), color='red', linestyle='--')
axes[1].set_xlim([0, 500])
axes[1].set_title("Loan Amount", fontsize=15)
axes[1].set_xlabel("Thousand $")

axes[2].hist(hmda_df.tract_to_msamd_income, bins=40)
axes[2].axvline(hmda_df.tract_to_msamd_income.mean(), color='red', linestyle='--')
axes[2].set_xlim([0, 300])
axes[2].set_title("Income by MSA Tract", fontsize=15)
axes[2].set_xlabel("Thousand $")

plt.tight_layout()
plt.show()

As expected from statistics, Income and Loan Amount are positively skewed, having a lot of small dollar amounts. Before considering transformation, we need to note that there are 0 amount cases which doesn't make sense. Especially, there are so many cases of loan amount equal to zero (possibly just less than a $1,000). We could suspect that these are administrative issues such as applications filed for fixing errors or any kind of paper jobs for any reason. Therefore, we would consider excluding these zero amount cases.

hmda_df = hmda_df[(hmda_df['applicant_income_000s'] > 0) & (hmda_df['applicant_income_000s'] < 250)]
hmda_df = hmda_df[(hmda_df['loan_amount_000s'] > 0) & (hmda_df['loan_amount_000s'] < 400)]
hmda_df = hmda_df[(hmda_df['tract_to_msamd_income'] > 0) & (hmda_df['tract_to_msamd_income'] < 280)]

hmda_df.shape

(105712, 15)

print("Average Income: $", round(hmda_df['applicant_income_000s'].mean(), 2), "K")
print("Average Loan Amount: $", round(hmda_df['loan_amount_000s'].mean(), 2), "K")

Average Income: $ 73.27 K
Average Loan Amount: $ 123.88 K

Now, we need to decide whether to transform the distributions of the numerical variables. Our goal is classification, which often times requires sparsity among cases throughout features. Although we need more investigation, the skewed distributions suggest the close distance that will make it difficult to classify. Transforming the numerical features will help to spread out the observations across the range.

hmda_df['income_sqrt'] = np.sqrt(hmda_df['applicant_income_000s'])
hmda_df['loan_amount_sqrt'] = np.sqrt(hmda_df['loan_amount_000s'])

fig, axes = plt.subplots(1, 2, figsize=(10,4))

for action in hmda_df['action_taken_name'].unique():
    sns.distplot(hmda_df[hmda_df['action_taken_name']==action].income_sqrt,
                 hist=False, kde=True, ax=axes[0])
    sns.distplot(hmda_df[hmda_df['action_taken_name']==action].loan_amount_sqrt,
                            hist=False, kde=True, label=action, ax=axes[1])

axes[1].legend(bbox_to_anchor=(1,0.6))
plt.show()

dist_0 = hmda_df[hmda_df['action_taken_name']=='Approved']['income_sqrt']
dist_1 = hmda_df[hmda_df['action_taken_name']=='Denied']['income_sqrt']

stat, p_value = stats.ttest_ind(dist_0, dist_1, equal_var=False)
print("Stat/P-value: (", round(stat, 2), ',', p_value, ")")

Stat/P-value: ( 61.47 , 0.0 )

Even if the t-test is telling the two distributions are significantly different. That is, an application of small income and small loan amount is more likely to be denied than that of large income and large amount. there is a large overlap shared by two classes. Our goal then becomes utilizing other features to separate the observations even more clearly. The following distributions are from the rest of numerical features.

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

axes[0].hist(hmda_df.population, bins=40)
axes[0].axvline(hmda_df.population.mean(), color='red', linestyle='--')
axes[0].set_xlim([0, 8000])
axes[0].set_title("Population per Census Tract", fontsize=12)
axes[0].set_xlabel("Number of Residents")

axes[1].hist(hmda_df.minority_population, bins=40)
axes[1].axvline(hmda_df.minority_population.mean(), color='red', linestyle='--')
axes[1].set_xlim([0, 100])
axes[1].set_title("Minority per Census Tract", fontsize=12)
axes[1].set_xlabel("Ratio (%)")

axes[2].hist(hmda_df.number_of_1_to_4_family_units, bins=40)
axes[2].axvline(hmda_df.number_of_1_to_4_family_units.mean(), color='red', linestyle='--')
axes[2].set_xlim([0, 3200])
axes[2].set_title("1~4 Family Units", fontsize=12)
axes[2].set_xlabel("Units")

axes[3].hist(hmda_df.number_of_owner_occupied_units, bins=40)
axes[3].axvline(hmda_df.number_of_owner_occupied_units.mean(), color='red', linestyle='--')
axes[3].set_xlim([0, 2500])
axes[3].set_title("Number of Owner Occupied Units", fontsize=12)
axes[3].set_xlabel("Units")

plt.tight_layout()
plt.show()

They are roughly bell-shaped except for Minority Population (%). The Minority Population feature indicates that the neighborhoods are severely segregated. For the rest, they don't seem to need transformation.

fig, axes = plt.subplots(1, 3, figsize=(15,4.5))

for action in hmda_df['action_taken_name'].unique():
    sns.distplot(hmda_df[hmda_df['action_taken_name']==action].population,
                 hist=False, kde=True, ax=axes[0])
    sns.distplot(hmda_df[hmda_df['action_taken_name']==action].minority_population,
                            hist=False, kde=True, label=str(action), ax=axes[1])
    sns.distplot(hmda_df[hmda_df['action_taken_name']==action].number_of_1_to_4_family_units,
                            hist=False, kde=True, ax=axes[2])

axes[1].legend(bbox_to_anchor=(0.7,-0.2))
plt.show()

From another view, an application from high minority population rate is more likely to be denied. Again, the rest don't quite seem to have differences by class. However, histogram can't tell everything about the differences between classes until many other features possibly reveal underlying properties.

# add more ratio features
hmda_df['tract_income_to_loan'] = hmda_df['tract_to_msamd_income'] / hmda_df['applicant_income_000s']
hmda_df['income_to_loan'] = hmda_df['loan_amount_000s'] / hmda_df['applicant_income_000s']

corr = hmda_df.corr()
corr

It is natural that sqrt features are highly correlated with their original features. Other than them, we see high correlations between dollar amount features. Another thing to note is that census-based feature are also meaningfully related. This can be easily seen on heatmap.

plt.figure(figsize=(8,8))
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask, square=True, cmap="PiYG", vmin=-0.7, vmax=0.7)

<matplotlib.axes._subplots.AxesSubplot at 0x1f03e5e0828>

Population is related with Number of 1~4 Family Units as well as Number of Owner Occupied Units, and this is because the number of units are absolute counts. These numbers are supposed to increase as population is bigger. What we can do is turning these into relative numbers by dividing the number of units with population. Also, we need to drop the original features as we transformed them. This is important because we want features to be independent each other for better classification.

# transform to ratio
hmda_df['number_of_1_to_4_family_units'] = hmda_df['number_of_1_to_4_family_units'] / hmda_df['population']
hmda_df['number_of_owner_occupied_units'] = hmda_df['number_of_owner_occupied_units'] / hmda_df['population']

# drop old features
hmda_df.drop(columns=['applicant_income_000s', 'loan_amount_000s', 'tract_to_msamd_income'], inplace=True)

# updated heatmap
corr = hmda_df.corr()
plt.figure(figsize=(8,8))
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask, square=True, cmap="PiYG", vmin=-0.7, vmax=0.7)

<matplotlib.axes._subplots.AxesSubplot at 0x1f03e840fd0>

• Scatterplot

# for better visualization or faster calculation
def df_reducer(df, size):
    indice = np.random.choice(list(df.index), size, replace=False)
    return df.loc[indice]

- Income vs. Loan Amount

reduced_df = df_reducer(hmda_df, 5000)

# income vs. loan amount by sex and race
fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))

sns.scatterplot(x="income_sqrt", y="loan_amount_sqrt",
                data=reduced_df, hue="action_taken_name", s=15, ax=axes[0])
axes[0].set_title("Income vs. Loan Amount by Loan Result", fontsize=12)
axes[0].legend(bbox_to_anchor=(0.5, -0.6), loc=8)

sns.scatterplot(x="income_sqrt", y="loan_amount_sqrt",
                data=reduced_df, hue="applicant_sex_name", s=15, ax=axes[1])
axes[1].set_title("Income vs. Loan Amount by Sex", fontsize=12)
axes[1].legend(bbox_to_anchor=(0.5, -0.6), loc=8)

sns.scatterplot(x="income_sqrt", y="loan_amount_sqrt",
                data=reduced_df, hue="applicant_race_name_1", s=15, ax=axes[2])
axes[2].set_title("Income vs. Loan Amount by Race", fontsize=12)
axes[2].legend(bbox_to_anchor=(0.5, -0.6), loc=8)

plt.show()

Again, larger income and larger loan amount help applicants get their mortgage loans. There is a pattern that women are less likely to get their loans approved than men, even if the visualization is not balanced for categories. Also, black population has weaker chances to get their mortgage loans. However, all these categories are not decisively strong.

- 3D Plot: Income vs. Loan Amount vs. Minority Population

reduced_df = df_reducer(hmda_df, 500)

# comment out for interactive plotting
# %matplotlib notebook

fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection='3d')

actions = reduced_df['action_taken_name'].unique()
ranges = [('r', actions[0]), ('b', actions[1])]

for c, action in ranges:
    xs = reduced_df[reduced_df['action_taken_name']==action]['minority_population']
    ys = reduced_df[reduced_df['action_taken_name']==action]['income_sqrt']
    zs = reduced_df[reduced_df['action_taken_name']==action]['loan_amount_sqrt']
    ax.scatter(xs, ys, zs, c=c, s=5)

ax.set_title('Mortgage Approval (red) vs. Denial (blue)\n', fontsize=15)
ax.set_xlabel('Minority Population (%)')
ax.set_ylabel('Income')
ax.set_zlabel('Loan Amount')

plt.show()

As for high minority population rate, it is quite clear to group certain classes. For example, an application of high minority population rate/low income/low loan amount is very likely to be denied. The problem is when the minority population rate is low where dots are mixed together. In other words, Loan Amount and Income lose its sharpness to cut the classes into groups when the racial factor becomes mild.

- Minority Ratio by Loan Result

fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection='3d')

actions = reduced_df['action_taken_name'].unique()
ranges = [('r', actions[0]), ('b', actions[1])]

for c, action in ranges:
    xs = reduced_df[reduced_df['action_taken_name']==action]['minority_population']
    ys = reduced_df[reduced_df['action_taken_name']==action]['number_of_1_to_4_family_units']
    zs = reduced_df[reduced_df['action_taken_name']==action]['number_of_owner_occupied_units']
    ax.scatter(xs, ys, zs, c=c, s=5)

ax.set_xlabel('Minority Population (%)')
ax.set_ylabel('Number of 1~4 Family Units')
ax.set_zlabel('Number of Owner Occupied Units')

plt.show()

Unit related features have the same problem that it becomes unrecognizable as minority population rate is lower. This may or may not be solved by categorical features.

• Pie Chart

%matplotlib inline

def cate_pie(col):
    fig, axes = plt.subplots(1, 2, figsize=(8, 4))

    actions = list(hmda_df['action_taken_name'].unique())
    indice = np.arange(len(actions))
    for action, i in zip(actions, indice):
        category = hmda_df[hmda_df['action_taken_name']==action][col]
        vc = category.value_counts(sort=False)
        axes[i].pie(vc.values, labels=vc.keys(), autopct='%1.1f%%')
        axes[i].set_title(action, fontsize=14)
        axes[i].axis('equal')

- Applicant Race

cate_pie('applicant_race_name_1')

The pie chart reveals very clearly that white applicants have more chances to earn the approval.

- Applicant Sex

cate_pie("applicant_sex_name")

Sex is not as strong, but still there is discrepany between men and women.

- Co-applicant

cate_pie("co_applicant")

Chances are high that a sole applicant is a single, reversely, applicant with somebody is married. A married applicant would probably be financially more established than a single who is probably younger and early in his/her career, which could explain the two different charts.

- Agency

cate_pie("agency_abbr")

- Loan Purpose

cate_pie('loan_purpose_name')

It seems like mortgage loans are more open to home purchase purpose. Put it another way, loan officers are hard to find reasons to deny home purchase applications or encouraged to give them approval for their benefits. Clearly, home improvement is not always necessary for a living.

- Loan Type

cate_pie('loan_type_name')

This categorical feature doesn't look like it will help classification from the pies. However, boxplot might uncover some differences.

• Boxplot

The purpose of boxplot for this project is to see if categorical data can work for classification by interacting with numerical data. Therefore, a categorical value measured differently by numerical data is likely helpful to distinguish classes. We want to go through some of the categorical data that showed certain patterns.

""" draw boxplot by category and loan results """
def category_boxplot(cate, title):
    sns.set(style="whitegrid")
    plt.figure(figsize=(8, 4))
    # income
    ax = sns.boxplot(x="income_sqrt", y="action_taken_name", hue=cate,
                     data=hmda_df)
    ax.set_title("Income by " + title, fontsize=15)
    ax.set_ylabel("")
    ax.legend(bbox_to_anchor=(1.05, 1), loc=2)

    plt.show()

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

sns.boxplot(x="action_taken_name", y="loan_amount_sqrt", hue="applicant_sex_name",
            data=hmda_df, ax=axes[0])
axes[0].set_title("Loan Amount by Sex", fontsize=15)
axes[0].set_xlabel("")
axes[0].legend(bbox_to_anchor=(0.5, -0.4), loc=8)

sns.boxplot(x="action_taken_name", y="loan_amount_sqrt", hue="agency_abbr",
            data=hmda_df, ax=axes[1])
axes[1].set_title("Loan Amount by Agency", fontsize=15)
axes[1].set_xlabel("")
axes[1].legend(bbox_to_anchor=(0.5, -0.4), loc=8)

sns.boxplot(x="action_taken_name", y="loan_amount_sqrt", hue="loan_type_name",
            data=hmda_df, ax=axes[2])
axes[2].set_title("Loan Amount by Loan Type", fontsize=15)
axes[2].set_xlabel("")
axes[2].legend(bbox_to_anchor=(0.5, -0.35), loc=8)

<matplotlib.legend.Legend at 0x1f03dcd9128>

None of them are impressively separating classes. But they are still plotting two classes more or less differently.

5. Clustering¶

Now that we have found out some patterns, we can base them for clustering census tract numbers. To visualize about a hundred of them, we can draw yet another scatterplots using numerical features. By doing so, we can decide which clustering methods to use. Since often there is no clear standard when it comes to clustering, it is critical to plot data and visually judge what method is most suitable. But we can always go with trial and error.

# criteria 1 based on loan approval rate
census_minor = hmda_df.groupby(['census_tract_number'])['minority_population'].mean()
census_ownership = hmda_df.groupby(['census_tract_number'])['number_of_owner_occupied_units'].mean()
census_df_1 = pd.DataFrame({'minority': census_minor, 'ownership': census_ownership})
census_df_1.reset_index(inplace=True)

# criteria 2 based on income vs. loan amount
census_income = hmda_df.groupby(['census_tract_number'])['income_sqrt'].mean()
census_loan_amount = hmda_df.groupby(['census_tract_number'])['loan_amount_sqrt'].mean()
census_df_2 = pd.DataFrame({'income': census_income, 'loan_amount': census_loan_amount})
census_df_2.reset_index(inplace=True)

# merge criteria 1 and 2
census_df = pd.merge(census_df_1, census_df_2)

# combination of numerical data
fig, axes = plt.subplots(2, 2, figsize=(12, 11))

sns.scatterplot(x="minority", y="ownership", data=census_df, ax=axes[0,0])
axes[0,0].set_title("Minority vs. Ownership", fontsize=15)

sns.scatterplot(x="minority", y="loan_amount", data=census_df, ax=axes[0,1])
axes[0,1].set_title("Minority vs. Loan Amount", fontsize=15)

sns.scatterplot(x="income", y="loan_amount", data=census_df, ax=axes[1,0])
axes[1,0].set_title("Income vs. Loan Amount", fontsize=15)

sns.scatterplot(x="ownership", y="loan_amount", data=census_df, ax=axes[1,1])
axes[1,1].set_title("Ownership vs. Loan Amount", fontsize=15)

plt.tight_layout()

• Hierarchical Clustering: Dendrogram

The utmost advantage of hierarchical clustering is that it gives a general idea of how many groups in the data. Especially, dendrogram is very useful to see it visually. First, we can use the approval rate plot (first from the previous plots) for this method as we focus on distance.

def hier_clustering_labels(X, n):
    clustering = AgglomerativeClustering(
        n_clusters=n, affinity='euclidean', linkage='ward')
    clustering.fit_predict(X)

    return clustering.labels_

X = [[x[0], x[1]] for x in zip(census_df['minority'], census_df['loan_amount'])]
X = np.array(X)

sns.set(style='white')
plt.figure(figsize=(12,6))
plt.title("Census Tract Dendrogram")
dend = shc.dendrogram(shc.linkage(X, method='ward'))

It is hard to which census tract number is related to another from the dendrogram, but we can still figure out the number to group them depending on where to cut off. The bset way is plotting them and iterate some numbers for clustering to decide the number of clusters.

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
sns.despine()

axes[0,0].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 2), cmap='rainbow')
axes[0,0].set_title("Cluster n=2", fontsize=15)

axes[0,1].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 3), cmap='rainbow')
axes[0,1].set_title("Cluster n=3", fontsize=15)

axes[0,2].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 4), cmap='rainbow')
axes[0,2].set_title("Cluster n=4", fontsize=15)

axes[1,0].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 5), cmap='rainbow')
axes[1,0].set_title("Cluster n=5", fontsize=15)

axes[1,1].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 6), cmap='rainbow')
axes[1,1].set_title("Cluster n=6", fontsize=15)

axes[1,2].scatter(X[:,0], X[:,1], c=hier_clustering_labels(X, 7), cmap='rainbow')
axes[1,2].set_title("Cluster n=7", fontsize=15)

plt.tight_layout()

Unfortunately, the clusters are cut vertically, which means that only x-axis may be enough to group them. So, the clustering might be meaningless because an existing feature already can do the job. However, clustering with n=3 seems to make some sense. We can add this feature and always remove it later when modeling.

# create a cluster feature
clusters = dict(zip(census_df['census_tract_number'], hier_clustering_labels(X, 3)))
hmda_df['census_hier'] = hmda_df['census_tract_number'].apply(lambda x : clusters[x])

• Centroid Based Clustering: K-Means

def kmeans_labels(X, n_clusters):
    kmeans = KMeans(n_clusters=n_clusters).fit(X)

    return kmeans.labels_

X = [[x[0], x[1]] for x in zip(census_df['income'], census_df['loan_amount'])]
X = np.array(X)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
sns.despine()

axes[0,0].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 2), cmap='rainbow')
axes[0,0].set_title("Cluster n=2", fontsize=15)

axes[0,1].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 3), cmap='rainbow')
axes[0,1].set_title("Cluster n=3", fontsize=15)

axes[0,2].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 4), cmap='rainbow')
axes[0,2].set_title("Cluster n=4", fontsize=15)

axes[1,0].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 5), cmap='rainbow')
axes[1,0].set_title("Cluster n=5", fontsize=15)

axes[1,1].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 6), cmap='rainbow')
axes[1,1].set_title("Cluster n=6", fontsize=15)

axes[1,2].scatter(X[:,0], X[:,1], c=kmeans_labels(X, 7), cmap='rainbow')
axes[1,2].set_title("Cluster n=7", fontsize=15)

Text(0.5,1,'Cluster n=7')

# create a cluster feature
clusters = dict(zip(census_df['census_tract_number'], kmeans_labels(X, 5)))
hmda_df['census_kmeans'] = hmda_df['census_tract_number'].apply(lambda x : clusters[x])

This could be useful because the clusters are naturally separated by income level and loan amount. It could strengthen the pattern we have already found by increasing the sparsity.

• Density Based Clustering: DBSCAN

DBSCAN takes a different course than the hierarchical clustering does as previously shown. This method is based on density, which means how closely the dots are located from each other. It needs scailing for better processing beforehand, but the rest are very similar steps as done for hierarchical clustering. That is, plotting and iteration are heavily used to find the best number of clusters. Specifically speaking, finding the best epilon and minimum number of samples is our goal.

def db_scan_labels(X, eps, min_samples):
    db_scan = DBSCAN(eps=eps, min_samples=min_samples).fit(X)

    return db_scan.labels_

- Income vs. Loan Amount: Noise Removal

X = [[x[0], x[1]] for x in zip(census_df['minority'], census_df['ownership'])]
X = np.array(X)
X = ppr.StandardScaler().fit_transform(X)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
sns.despine()

axes[0,0].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.15, 2), cmap='rainbow')
axes[0,0].set_title("eps=0.15, min_sample=2", fontsize=15)

axes[0,1].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.15, 3), cmap='rainbow')
axes[0,1].set_title("eps=0.15, min_sample=3", fontsize=15)

axes[0,2].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.2, 4), cmap='rainbow')
axes[0,2].set_title("eps=0.2, min_sample=4", fontsize=15)

axes[1,0].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.25, 4), cmap='rainbow')
axes[1,0].set_title("eps=0.25, min_sample=5", fontsize=15)

axes[1,1].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.3, 4), cmap='rainbow')
axes[1,1].set_title("eps=0.3, min_sample=4", fontsize=15)

axes[1,2].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.3, 5), cmap='rainbow')
axes[1,2].set_title("eps=0.3, min_sample=5", fontsize=15)

Text(0.5,1,'eps=0.3, min_sample=5')

Aware that purple dots are noises, the result of epsilon equal to 0.15 and minimum samples 3 should probably be the best clustering as the others have a couple or more flaws. However, there are so many unneccesary small clusters which we can get rid of once clustering process is done.

# create a cluster feature
clusters = dict(zip(census_df['census_tract_number'], db_scan_labels(X, 0.15, 3)))
hmda_df['census_dbscan_1'] = hmda_df['census_tract_number'].apply(lambda x : clusters[x])

# 2 clusters and all noises
hmda_df['census_dbscan_1'] = hmda_df['census_dbscan_1'].apply(lambda x : -1
                                    if x not in [0, 10] else x)

- Ownership vs. Loan Amount: Grouping

X = [[x[0], x[1]] for x in zip(census_df['ownership'], census_df['loan_amount'])]
X = np.array(X)
X = ppr.StandardScaler().fit_transform(X)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
sns.despine()

axes[0,0].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.15, 5), cmap='rainbow')
axes[0,0].set_title("eps=0.15, min_sample=4", fontsize=15)

axes[0,1].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.2, 6), cmap='rainbow')
axes[0,1].set_title("eps=0.2, min_sample=3", fontsize=15)

axes[0,2].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.2, 7), cmap='rainbow')
axes[0,2].set_title("eps=0.2, min_sample=7", fontsize=15)

axes[1,0].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.25, 6), cmap='rainbow')
axes[1,0].set_title("eps=0.25, min_sample=6", fontsize=15)

axes[1,1].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.3, 6), cmap='rainbow')
axes[1,1].set_title("eps=0.3, min_sample=6", fontsize=15)

axes[1,2].scatter(X[:,0], X[:,1], c=db_scan_labels(X, 0.35, 5), cmap='rainbow')
axes[1,2].set_title("eps=0.35, min_sample=5", fontsize=15)

Text(0.5,1,'eps=0.35, min_sample=5')

It should be reasonable to have only one large cluster and tag all the rest as noise.

# create a cluster feature
clusters = dict(zip(census_df['census_tract_number'], db_scan_labels(X, 0.2, 7)))
hmda_df['census_dbscan_2'] = hmda_df['census_tract_number'].apply(lambda x : clusters[x])

# 1 cluster and all noises
hmda_df['census_dbscan_2'] = hmda_df['census_dbscan_2'].apply(lambda x : x if x == 0 else -1)

# drop census tract number
hmda_df.drop(['census_tract_number'], axis=1, inplace=True)

6. Modeling¶

Decision Tree is simple, thus, it is often used as a baseline model. We will build a Decision Tree model with all features included, and then test other models like Logistics, Random Forest (bagging), Naive Bayes (independence), Gradient Boosting (boosting). Also, we will manipulate a feature set using Recursive Feature Elimination and Pinciple Component Analysis.

# reduce the most class
second_large_supp = len(hmda_df[hmda_df['action_taken_name']=="Denied"])
size_drop = len(hmda_df[hmda_df['action_taken_name']=="Approved"]) - second_large_supp
drop_indice = np.random.choice(list(hmda_df[hmda_df['action_taken_name']=="Approved"].index),
                               size_drop, replace=False)

hmda_df = hmda_df.loc[~hmda_df.index.isin(drop_indice)]

def split_X_y(df, features, y):
    cate_cols = df.select_dtypes(include='category').columns
    for col in cate_cols:
        le = ppr.LabelEncoder()
        le.fit(df[col])
        df[col] = le.transform(df[col])

    X = np.array(df[features])
    y = np.array(df[y])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33)

    # normalize for faster calculation
    X_train = ppr.normalize(X_train)
    X_test = ppr.normalize(X_test)

    return X_train, X_test, y_train, y_test

# all features
features = list(hmda_df.columns)
features.remove('action_taken_name')

# X, y
X_train, X_test, y_train, y_test = split_X_y(hmda_df, features, 'action_taken_name')

# train and show scores
def train_evaluate(clf, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)

    print("5-fold Cross Validation Accuracy:", round(cross_val_score(clf, X_train, y_train, cv=5).mean(), 2))
    print("\nConfusion Matrix\n", confusion_matrix(y_test, pred))
    print("\n", classification_report(y_test, pred))

    return clf

• Baseline: Decision Tree

clf_dt = train_evaluate(tree.DecisionTreeClassifier(), X_train, X_test, y_train, y_test)

5-fold Cross Validation Accuracy: 0.61

Confusion Matrix
 [[5718 3691]
 [3524 5863]]

              precision    recall  f1-score   support

          0       0.62      0.61      0.61      9409
          1       0.61      0.62      0.62      9387

avg / total       0.62      0.62      0.62     18796

The result is not so impressive, however, the errors are uniformly distributed at least.

• Logistic Regression

clf_lr = train_evaluate(LogisticRegression(), X_train, X_test, y_train, y_test)

5-fold Cross Validation Accuracy: 0.61

Confusion Matrix
 [[7944 1465]
 [5835 3552]]

              precision    recall  f1-score   support

          0       0.58      0.84      0.69      9409
          1       0.71      0.38      0.49      9387

avg / total       0.64      0.61      0.59     18796

Unlike the Decision Tree result, the errors are concentrated on Approval cases.

• Bagging: Random Forest

clf_rf = train_evaluate(RandomForestClassifier(max_depth=20), X_train, X_test, y_train, y_test)

5-fold Cross Validation Accuracy: 0.66

Confusion Matrix
 [[6745 2664]
 [3460 5927]]

              precision    recall  f1-score   support

          0       0.66      0.72      0.69      9409
          1       0.69      0.63      0.66      9387

avg / total       0.68      0.67      0.67     18796

The result is better than the Decision Tree, but it is still so good.

• Naive Bayes

clf_nb = train_evaluate(GaussianNB(), X_train, X_test, y_train, y_test)

5-fold Cross Validation Accuracy: 0.58

Confusion Matrix
 [[8723  686]
 [6969 2418]]

              precision    recall  f1-score   support

          0       0.56      0.93      0.70      9409
          1       0.78      0.26      0.39      9387

avg / total       0.67      0.59      0.54     18796

Like the Logistic Regression, it is biased to predict Denied only.

• Stochastic: Gradient Boosting

Gradient Boosting is a very sophisticated model to configurate parameters for the best performance. The following will seek the optimal parameters based on AUC scores.

def roc_auc(clf, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train)
    train_pred = clf.predict(X_train)
    fpr, tpr, thrd = roc_curve(y_train, train_pred)
    roc_auc_train = auc(fpr, tpr)

    test_pred = clf.predict(X_test)
    fpr, tpr, thrd = roc_curve(y_test, test_pred)
    roc_auc_test = auc(fpr, tpr)

    return roc_auc_train, roc_auc_test

# learning rate
learning_rates = [1, 0.5, 0.25, 0.1, 0.05, 0.01]
lr_train = []
lr_test = []
for lr in learning_rates:
    clf_gb = GradientBoostingClassifier(learning_rate=lr)
    roc_auc_train, roc_auc_test = roc_auc(
        clf_gb, X_train, X_test, y_train, y_test)
    lr_train.append(roc_auc_train)
    lr_test.append(roc_auc_test)

# number of estimators
n_estimators = [1, 2, 4, 8, 16, 32, 64, 100, 150, 200]
est_train = []
est_test = []
for n in n_estimators:
    clf_gb = GradientBoostingClassifier(n_estimators=n)
    roc_auc_train, roc_auc_test = roc_auc(
        clf_gb, X_train, X_test, y_train, y_test)
    est_train.append(roc_auc_train)
    est_test.append(roc_auc_test)

# max depth
max_depths = np.arange(1, 16)
md_train = []
md_test = []
for d in max_depths:
    clf_gb = GradientBoostingClassifier(max_depth=d)
    roc_auc_train, roc_auc_test = roc_auc(
        clf_gb, X_train, X_test, y_train, y_test)
    md_train.append(roc_auc_train)
    md_test.append(roc_auc_test)

# max_features
max_features = list(range(1, len(features)))
mf_train = []
mf_test = []
for f in max_features:
    clf_gb = GradientBoostingClassifier(max_features=f)
    roc_auc_train, roc_auc_test = roc_auc(
        clf_gb, X_train, X_test, y_train, y_test)
    mf_train.append(roc_auc_train)
    mf_test.append(roc_auc_test)

fig, axes = plt.subplots(2, 2, figsize=(9, 9))

axes[0,0].plot(learning_rates, lr_train, 'b')
axes[0,0].plot(learning_rates, lr_test, 'r')
axes[0,0].set_title("Learning Rate")

axes[0,1].plot(n_estimators, est_train, 'b')
axes[0,1].plot(n_estimators, est_test, 'r')
axes[0,1].set_title("N Estimators")

axes[1,0].plot(max_depths, md_train, 'b')
axes[1,0].plot(max_depths, md_test, 'r')
axes[1,0].set_title("Max Depth")

axes[1,1].plot(max_features, mf_train, 'b')
axes[1,1].plot(max_features, mf_test, 'r')
axes[1,1].set_title("Max Features")

Text(0.5,1,'Max Features')

clf_gb = train_evaluate(GradientBoostingClassifier(
        learning_rate=0.2,
        n_estimators=150,
        max_depth=7,
        max_features=5
    ), X_train, X_test, y_train, y_test)

5-fold Cross Validation Accuracy: 0.68

Confusion Matrix
 [[6833 2576]
 [3280 6107]]

              precision    recall  f1-score   support

          0       0.68      0.73      0.70      9409
          1       0.70      0.65      0.68      9387

avg / total       0.69      0.69      0.69     18796

• Neural Network

Similar with Gradient Boosting, Neural Network need a certain set of parameters for tuning. This can be done using the Grid Search method.

params = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate': ['constant','adaptive'],
}

clf = GridSearchCV(MLPClassifier(), params, n_jobs=-1, cv=3)
clf.fit(X_train, y_train)

print('Best parameters found:', clf.best_params_)

Best parameters found: {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'adaptive', 'solver': 'adam'}

clf_mlp = train_evaluate(MLPClassifier(
                    alpha=0.001,
                    activation='relu',
                    solver='adam',
                    hidden_layer_sizes=(50,100,50),
                    learning_rate='constant'),
                X_train, X_test, y_train, y_test)

5-fold Cross Validation Accuracy: 0.65

Confusion Matrix
 [[8102 1307]
 [5343 4044]]

              precision    recall  f1-score   support

          0       0.60      0.86      0.71      9409
          1       0.76      0.43      0.55      9387

avg / total       0.68      0.65      0.63     18796

• Feature Manipulation: RFE, Heuristic

model = LogisticRegression()
rfe = RFE(model, 9).fit(X_train, y_train)

feature_rank = []
features_rfe = []

for feature, rank in zip(features, rfe.ranking_):
    feature_rank.append((feature, rank))
    if rank == 1:
        features_rfe.append(feature)

X_train_rfe, X_test_rfe, y_train_rfe, y_test_rfe = split_X_y(
    hmda_df, features_rfe, 'action_taken_name')

sorted(feature_rank, key=lambda tup : tup[1])

[('agency_abbr', 1),
 ('applicant_race_name_1', 1),
 ('minority_population', 1),
 ('income_sqrt', 1),
 ('loan_amount_sqrt', 1),
 ('tract_income_to_loan', 1),
 ('income_to_loan', 1),
 ('census_hier', 1),
 ('census_dbscan_1', 1),
 ('census_kmeans', 2),
 ('population', 3),
 ('loan_purpose_name', 4),
 ('co_applicant', 5),
 ('census_dbscan_2', 6),
 ('loan_type_name', 7),
 ('number_of_1_to_4_family_units', 8),
 ('number_of_owner_occupied_units', 9),
 ('applicant_sex_name', 10)]

clf_lr_rfe = train_evaluate(LogisticRegression(), X_train_rfe, X_test_rfe, y_train_rfe, y_test_rfe)

5-fold Cross Validation Accuracy: 0.65

Confusion Matrix
 [[6621 2692]
 [3971 5512]]

              precision    recall  f1-score   support

          0       0.63      0.71      0.67      9313
          1       0.67      0.58      0.62      9483

avg / total       0.65      0.65      0.64     18796

Selecting 9 of half the features, the f1-score of Logistic Regression is improved from 0.59. This indicates that the features had too much interaction to interfere with our target classes. Now, we can take this into consideration to select features in a heuristic way. The rule is to include features that care less correlated and use the Naive Bayes model which assumes independence among features.

features_comb = ['agency_abbr', 'applicant_sex_name', 'minority_population', 'income_to_loan',
                 'loan_purpose_name', 'loan_type_name', 'co_applicant',
                 'census_kmeans', 'census_dbscan_1', 'census_hier',
                 'number_of_1_to_4_family_units', 'number_of_owner_occupied_units']

X_train_comb, X_test_comb, y_train_comb, y_test_comb = split_X_y(
    hmda_df, features_comb, 'action_taken_name')

clf_nb_comb = train_evaluate(GaussianNB(), X_train_comb, X_test_comb, y_train_comb, y_test_comb)

5-fold Cross Validation Accuracy: 0.6

Confusion Matrix
 [[6179 3188]
 [4227 5202]]

              precision    recall  f1-score   support

          0       0.59      0.66      0.62      9367
          1       0.62      0.55      0.58      9429

avg / total       0.61      0.61      0.60     18796

• Principle Component Analysis

PCA takes a totally differnt look at features in that RFE and other feature selection methods simply include/exclude features. For that reason, we can hardly say which feature makes how much contributions.

reduced_df = df_reducer(hmda_df, 10000)

X_train_pca, X_test_pca, y_train_pca, y_test_pca = split_X_y(
    reduced_df, features_rfe, 'action_taken_name')

pca = PCA(n_components=5)
pca.fit(X_train_pca)

X_train_pca = pca.transform(X_train_pca)
X_test_pca = pca.transform(X_test_pca)

clf_pca = SVC(C=5, gamma=10)
clf_pca.fit(X_train_pca, y_train_pca)

print(pca.explained_variance_ratio_)

train_evaluate(clf_pca, X_train_pca, X_test_pca, y_train_pca, y_test_pca)

[0.77424267 0.10446486 0.05925091 0.03537464 0.01358645]
5-fold Cross Validation Accuracy: 0.66

Confusion Matrix
 [[1326  342]
 [ 809  823]]

              precision    recall  f1-score   support

          0       0.62      0.79      0.70      1668
          1       0.71      0.50      0.59      1632

avg / total       0.66      0.65      0.64      3300

SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=10, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

• Voting Ensemble

So far, we have found out that Gradient Boosting works the best followed by Random Forest. What we can do with these is mixing them together to make them vote. Also, we can add Naive Bayes because the model is fundamentally different. Voting Ensemble is useful when a certain level of diversity is secured.

estimators = []

estimators.append(('nb', clf_nb))
estimators.append(('rf', clf_rf))
estimators.append(('gb', clf_gb))

ensemble = VotingClassifier(estimators)
clf_ens = train_evaluate(ensemble, X_train, X_test, y_train, y_test)

5-fold Cross Validation Accuracy: 0.68

Confusion Matrix
 [[7470 1939]
 [4048 5339]]

              precision    recall  f1-score   support

          0       0.65      0.79      0.71      9409
          1       0.73      0.57      0.64      9387

avg / total       0.69      0.68      0.68     18796

7. Conclusion¶

Through the data analysis, we learnt the following patterns.

Income and loan amount are correlated, and the larger they are the higher chance to get approval
Higher minority population rate affects mortgage loan results negatively
Low minority population rate has a weak influence on approval
A white male is more likely to get approval then a black female
Racial/Sex factors become insignificant as loan amount and income increase
Home purchase is the best purpose for approval

However, these patterns are not so strong that it was difficult to clearly cut the groups. There were such interactions among features that classifying observations in one way would be no longer the same if in another way. Figuratively speaking, observations were tangled even in a finely separated group. For example, two cases having almost or exact the same condition would be classified differently; Approved and Denied, repectively.

To solve this, we used sophisticated models like Gradient Boosting and MLP. Also, we tried a different feature set as well as PCA. However, the entanglement was at a serious level making the best model performance around 0.7 accuracy and f1-score.

Reference¶

Dataset

Theory and Syntax

		_id	amount	case_share	count	income
Accepted	0	White	153.18	0.72	64065	92.32
	1	Information not provided by applicant in mail,...	142.64	0.12	10753	89.92
	2	Black or African American	115.17	0.10	8973	69.56
	3	Asian	230.07	0.04	3497	112.08
	4	Not applicable	1179.68	0.01	758	119.50
	5	American Indian or Alaska Native	122.13	0.00	327	78.69
	6	Native Hawaiian or Other Pacific Islander	178.23	0.00	149	91.58
Denied	0	White	125.41	0.51	15893	76.92
	1	Black or African American	73.23	0.24	7607	57.51
	2	Information not provided by applicant in mail,...	91.27	0.21	6417	70.92
	3	Asian	168.86	0.03	877	84.43
	4	American Indian or Alaska Native	90.10	0.01	296	56.45
	5	Not applicable	925.04	0.00	105	43.75
	6	Native Hawaiian or Other Pacific Islander	102.83	0.00	77	78.18

	mean	std	min	25%	50%	75%	max
applicant_income_000s	85.417123	98.959746	1.00	43.000000	66.000000	102.000000	8998.000000
loan_amount_000s	139.464921	112.363212	1.00	69.000000	118.000000	184.000000	3934.000000
minority_population	27.860051	26.912933	1.74	9.760000	17.049999	33.970001	100.000000
number_of_1_to_4_family_units	1413.824246	475.310410	30.00	1088.000000	1388.000000	1686.000000	2638.000000
number_of_owner_occupied_units	1152.064173	464.993325	6.00	823.000000	1115.000000	1450.000000	2551.000000
population	3957.019911	1383.486978	436.00	2929.000000	3794.000000	4781.000000	7968.000000
tract_to_msamd_income	143.221244	58.357710	0.00	106.239998	132.679993	176.369995	351.420013

	applicant_income_000s	loan_amount_000s	minority_population	number_of_1_to_4_family_units	number_of_owner_occupied_units	population	tract_to_msamd_income	income_sqrt	loan_amount_sqrt	tract_income_to_loan	income_to_loan
applicant_income_000s	1.000000	0.517786	-0.154108	0.016629	0.135023	0.109391	0.412028	0.983278	0.476673	-0.241609	-0.124608
loan_amount_000s	0.517786	1.000000	-0.271881	0.032788	0.209857	0.178856	0.565910	0.530721	0.968685	-0.055753	0.131545
minority_population	-0.154108	-0.271881	1.000000	-0.152041	-0.404635	-0.189936	-0.463538	-0.176079	-0.329330	-0.024006	-0.043099
number_of_1_to_4_family_units	0.016629	0.032788	-0.152041	1.000000	0.882745	0.854852	0.143533	0.016996	0.028809	0.018445	0.002901
number_of_owner_occupied_units	0.135023	0.209857	-0.404635	0.882745	1.000000	0.820659	0.432146	0.147526	0.220513	0.031800	0.024365
population	0.109391	0.178856	-0.189936	0.854852	0.820659	1.000000	0.219182	0.115367	0.175825	0.003819	0.020837
tract_to_msamd_income	0.412028	0.565910	-0.463538	0.143533	0.432146	0.219182	1.000000	0.426615	0.546941	0.049301	0.049926
income_sqrt	0.983278	0.530721	-0.176079	0.016996	0.147526	0.115367	0.426615	1.000000	0.498302	-0.307051	-0.169688
loan_amount_sqrt	0.476673	0.968685	-0.329330	0.028809	0.220513	0.175825	0.546941	0.498302	1.000000	-0.052419	0.145504
tract_income_to_loan	-0.241609	-0.055753	-0.024006	0.018445	0.031800	0.003819	0.049301	-0.307051	-0.052419	1.000000	0.868928
income_to_loan	-0.124608	0.131545	-0.043099	0.002901	0.024365	0.020837	0.049926	-0.169688	0.145504	0.868928	1.000000

	action_taken_name	agency_abbr	applicant_income_000s	applicant_race_name_1	applicant_sex_name	census_tract_number	co_applicant_race_name_1	loan_amount_000s	loan_purpose_name	loan_type_name	minority_population	number_of_1_to_4_family_units	number_of_owner_occupied_units	population	tract_to_msamd_income
0	Loan originated	HUD	43.0	White	Female	5833.00	No co-applicant	66	Refinancing	FHA-insured	11.330000	1031.0	862.0	2550	96.790001
1	Loan originated	FDIC	32.0	White	Female	5837.00	No co-applicant	100	Home purchase	Conventional	14.890000	1417.0	1216.0	3721	99.059998
2	Loan originated	HUD	147.0	White	Male	5970.00	No co-applicant	250	Refinancing	Conventional	7.350000	1995.0	1540.0	4656	132.679993
3	Loan originated	CFPB	57.0	White	Male	5761.00	White	71	Home purchase	VA-guaranteed	16.049999	2463.0	2141.0	6082	111.169998
4	Loan originated	NCUA	104.0	White	Female	5645.01	White	219	Refinancing	Conventional	26.900000	1186.0	1116.0	4246	244.990005

	unique	top	freq
action_taken_name	2	Approved	83400
agency_abbr	4	HUD	58077
applicant_race_name_1	3	White	76333
applicant_sex_name	3	Male	67355
co_applicant	2	No co-applicant	77015
loan_purpose_name	3	Home purchase	52227
loan_type_name	2	Conventional	84550