When Will It Rain In Philadelphia?

By Kyoosik Kim on Oct 2018

Predict Rainy Weather from the Past Pattern

This project analyzes two years of actual NOAA weather data and builds several predictive models to predict rainy weather. The main purpose is to walk through the standard steps of building a prediction model: data preparation, data exploration, modeling, and cross validation. In particular, the project focuses on evaluating the performance of different models to justify choosing one model over the others.

1. Preparation: Data Load, Missing Values
2. Data Exploration: Lineplot, Correlation Matrix
3. Balancing Classes: Random Undersampling
4. Modeling: Logistic Regression, SVM, Random Forest, Gradient Boosting, Multi-layer Perceptron
5. Evaluation and Model Selection: Accuracy, Precision, Recall, F1-score, Log-loss, ROC-AUC
6. Performance Comparison: Full Dataset, Balanced Dataset


1. Preparation


This section reads and formats the dataset before analysis, focusing on dropping meaningless columns, handling missing values, and converting data types where necessary. The following libraries are used throughout the project; the modeling libraries are imported along the way for clarity.

In [1]:
# dataframe
import numpy as np
import pandas as pd

# visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.3)
sns.set_style("whitegrid")

# others
import re
from sklearn.utils import shuffle
import warnings
warnings.filterwarnings('ignore')
In [2]:
%%bash
ls
LCD_documentation.pdf
climatological_data_philadelphia.csv
rain_prediction_models.ipynb
temp_prcp_jonesboro_2016-2017_selected.csv
In [3]:
weather_df = pd.read_csv("climatological_data_philadelphia.csv")
weather_df.head()
Out[3]:
STATION STATION_NAME ELEVATION LATITUDE LONGITUDE DATE REPORTTPYE HOURLYSKYCONDITIONS HOURLYVISIBILITY HOURLYPRSENTWEATHERTYPE ... MonthlyMaxSeaLevelPressureTime MonthlyMinSeaLevelPressureValue MonthlyMinSeaLevelPressureDate MonthlyMinSeaLevelPressureTime MonthlyTotalHeatingDegreeDays MonthlyTotalCoolingDegreeDays MonthlyDeptFromNormalHeatingDD MonthlyDeptFromNormalCoolingDD MonthlyTotalSeasonToDateHeatingDD MonthlyTotalSeasonToDateCoolingDD
0 WBAN:13739 PHILADELPHIA INTERNATIONAL AIRPORT PA US 3 39.87327 -75.22678 2016-09-01 00:54 FM-15 SCT:04 46 BKN:07 70 OVC:08 100 8.00 -RA:02 |RA:61 | ... -9999 NaN -9999 -9999 NaN NaN NaN NaN NaN NaN
1 WBAN:13739 PHILADELPHIA INTERNATIONAL AIRPORT PA US 3 39.87327 -75.22678 2016-09-01 01:00 FM-12 41 NaN ||RA:61 ... -9999 NaN -9999 -9999 NaN NaN NaN NaN NaN NaN
2 WBAN:13739 PHILADELPHIA INTERNATIONAL AIRPORT PA US 3 39.87327 -75.22678 2016-09-01 01:14 FM-16 SCT:04 19 BKN:07 43 OVC:08 95 6.00 -RA:02 BR:1 |RA:61 | ... -9999 NaN -9999 -9999 NaN NaN NaN NaN NaN NaN
3 WBAN:13739 PHILADELPHIA INTERNATIONAL AIRPORT PA US 3 39.87327 -75.22678 2016-09-01 01:31 FM-16 BKN:07 27 OVC:08 42 4.00 RA:02 BR:1 |RA:62 | ... -9999 NaN -9999 -9999 NaN NaN NaN NaN NaN NaN
4 WBAN:13739 PHILADELPHIA INTERNATIONAL AIRPORT PA US 3 39.87327 -75.22678 2016-09-01 01:37 FM-16 SCT:04 12 BKN:07 17 OVC:08 40 3.00 +RA:02 BR:1 |RA:63 | ... -9999 NaN -9999 -9999 NaN NaN NaN NaN NaN NaN

5 rows × 90 columns

In [4]:
weather_df.shape
Out[4]:
(26837, 90)

Many of these columns do not seem useful because of the excessive number of NaN values. The columns describing the station can also be ignored, since the whole dataset comes from a single station. We choose Date, Temperature, Humidity, and Pressure as predictors, and of course we keep Precipitation, since the goal is to predict 'Rain' or 'No Rain'.

In [5]:
col_keep = ['DATE', 'HOURLYDRYBULBTEMPF', 'HOURLYRelativeHumidity', 'HOURLYStationPressure', 'HOURLYPrecip']
col_drop = set(weather_df.columns) - set(col_keep)

weather_df.drop(col_drop, axis=1, inplace=True)
weather_df.columns
Out[5]:
Index(['DATE', 'HOURLYDRYBULBTEMPF', 'HOURLYRelativeHumidity',
       'HOURLYStationPressure', 'HOURLYPrecip'],
      dtype='object')
In [6]:
weather_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26837 entries, 0 to 26836
Data columns (total 5 columns):
DATE                      26837 non-null object
HOURLYDRYBULBTEMPF        26108 non-null float64
HOURLYRelativeHumidity    26108 non-null float64
HOURLYStationPressure     26094 non-null object
HOURLYPrecip              20784 non-null object
dtypes: float64(2), object(3)
memory usage: 1.0+ MB

The target variable needs a little more processing because it mixes continuous values with a few categorical codes. For example, T stands for Trace, meaning a negligible amount of rain, which we will treat as 'None'. In the end, every numerical value other than zero becomes a precipitation amount in float, and everything else becomes zero, meaning no rain.

In [7]:
def isPrecip(x):
    # non-numeric codes (e.g. 'T' for Trace) count as no rain
    try:
        x = float(x)
    except:
        return 0
    
    # keep positive precipitation amounts, everything else becomes 0
    if x > 0:
        return x
    else:
        return 0
    
weather_df['HOURLYPrecip'] = weather_df['HOURLYPrecip'].apply(isPrecip)

The predictor Pressure has a problem: its type is 'object' instead of float. In addition, some values contain an alphabetic character, like '22.45s', when we only need the numbers. The task, then, is to extract the numbers and convert the type to 'float'.

In [8]:
def convPres(x):
    # extract the numeric part of strings like '22.45s'; leave other values (e.g. NaN) untouched
    try:
        x = float(re.findall(r"\d+\.\d+", x)[0])
    except:
        return x
    
    return x
    
weather_df['HOURLYStationPressure'] = weather_df['HOURLYStationPressure'].apply(convPres)

Lastly, the Date column is more precise than we need. Going by minute or hour would invite overfitting, and since the type is string it is awkward to manipulate. The following truncates it to the date level and converts it to a timestamp.

In [9]:
# into timestamp (%Y-%m-%d)
weather_df['DATE'] = weather_df['DATE'].apply(lambda d : d[:10])
weather_df['DATE'] = pd.to_datetime(weather_df['DATE'])


• Handling Missing Values

In [10]:
weather_df.describe().T
Out[10]:
count mean std min 25% 50% 75% max
HOURLYDRYBULBTEMPF 26108.0 57.205493 17.488205 6.00 43.00 58.00 72.00 97.00
HOURLYRelativeHumidity 26108.0 68.965873 20.302824 13.00 52.00 71.00 88.00 100.00
HOURLYStationPressure 26094.0 29.997287 0.231539 28.99 29.86 29.99 30.13 30.81
HOURLYPrecip 26837.0 0.007818 0.048327 0.00 0.00 0.00 0.00 2.52

Every column except the target Hourly Precipitation has missing values. By the nature of weather data, a missing value should be similar to the previous or next observation. Referring to neighbors becomes a problem, however, if missing values sit in long consecutive runs; if they are spread out, filling them this way is fine.

In [11]:
""" Count the longest consecutive rows with missing values """
def longestNull(x_index):
    length = 1
    max_length = length
    prev_item = x_index[0]
    for item in x_index[1:]:
        if item == prev_item+1: # consecutive
            length += 1
        else:
            max_length = max(max_length, length)
            length = 1
        prev_item = item
            
    return max_length

print("Longest Consecutive NaN")
print("Temperature:", longestNull(weather_df[weather_df['HOURLYDRYBULBTEMPF'].isnull()].index))
print("Humidity:", longestNull(weather_df[weather_df['HOURLYRelativeHumidity'].isnull()].index))
print("Pressure:", longestNull(weather_df[weather_df['HOURLYStationPressure'].isnull()].index))
Longest Consecutive NaN
Temperature: 1
Humidity: 1
Pressure: 3

It seems that the missing values are well spread out. Since this is time series weather data, observations rarely jump around; for example, it would hardly start pouring right after sunny hours. Forward-filling each gap from the previous observation is therefore reasonable.

In [12]:
# filling missing values
weather_df.fillna(method='ffill', inplace=True)

2. Data Exploration


Since the goal is to predict Rain, it is wise to look into the target variable first. How to split the dataset for training and validation fundamentally depends on how its values are distributed.

In [13]:
rain_ratio = (weather_df['HOURLYPrecip']!=0).sum()/len(weather_df)
print("RAIN Ratio:", rain_ratio.round(4)*100, "%")
RAIN Ratio: 11.31 %
In [14]:
sns.kdeplot(weather_df[weather_df['HOURLYPrecip']!=0]['HOURLYPrecip'], 
            shade=True, label="Hourly Precip")
plt.title("Distribution of Precipitation")
plt.xlabel("Inches")
plt.show()

Rainy observations account for just over 11%, and precipitation is usually very low even then. We can label dry observations '0' and rainy observations '1' to turn this into a classification problem, which also makes the classes convenient to handle in code.

In [15]:
weather_df['HOURLYPrecip'] = weather_df['HOURLYPrecip'].apply(lambda x : 1 if x!=0 else 0)


• Line Plot

To plot the features over time for each class, we need to split the dataset accordingly. We also aggregate the hourly observations by day, since the focus at this stage is on general trends.

In [16]:
# divide it by class
none_df = weather_df[weather_df['HOURLYPrecip']==0]
rain_df = weather_df[weather_df['HOURLYPrecip']==1]
print("Number of Class Rain:", len(rain_df))
print("Number of Class None:", len(none_df))
Number of Class Rain: 3036
Number of Class None: 23801
In [17]:
# group them by day for easier plotting
none_df = none_df.groupby('DATE').mean()
none_df = none_df.reset_index(level=[0])
rain_df = rain_df.groupby('DATE').mean()
rain_df = rain_df.reset_index(level=[0])
In [18]:
# plot temperature
plt.figure(figsize=(15, 8))
sns.lineplot(data=none_df, x="DATE", y="HOURLYDRYBULBTEMPF", label="No Rain")
sns.lineplot(data=rain_df, x="DATE", y="HOURLYDRYBULBTEMPF", label="Rain")
plt.title("Dry Bulb Temperature (Daily Avg)", size=20)
plt.ylabel("")
plt.legend()
Out[18]:
<matplotlib.legend.Legend at 0x2540eb18f98>

Dry bulb temperature on rainy days appears lower than on dry days in summer, and the other way around in winter. Although the pattern is not very clear, it makes sense: rainy days are relatively cooler in summer and warmer in winter.

In [19]:
# plot relative humidity
plt.figure(figsize=(15, 8))
sns.lineplot(data=none_df, x="DATE", y="HOURLYRelativeHumidity", label="No Rain")
sns.lineplot(data=rain_df, x="DATE", y="HOURLYRelativeHumidity", label="Rain")
plt.title("Relative Humidity (Daily Avg)", size=20)
plt.ylabel("")
plt.legend()
Out[19]:
<matplotlib.legend.Legend at 0x2540eb5e160>

It is quite obvious that rainy days are generally more humid than dry days across all seasons. The result may differ from what some might expect for winter days, but it is consistent with the fact that snowy days also tend to be humid.

In [20]:
# plot pressure
plt.figure(figsize=(15, 8))
sns.lineplot(data=none_df, x="DATE", y="HOURLYStationPressure", label="No Rain")
sns.lineplot(data=rain_df, x="DATE", y="HOURLYStationPressure", label="Rain")
plt.title("Pressure (Daily Avg)", size=20)
plt.ylabel("")
plt.legend()
Out[20]:
<matplotlib.legend.Legend at 0x2540ee1c2e8>

The pressure looks lower when it is rainy, which agrees with the well-known fact that rain is more likely when pressure drops. In summary, humidity and pressure seem useful for recognizing rainy days.


• Correlation Map

In [21]:
# extract months as string type (categorical: 01, 02, ..., 11, 12)
weather_df['DATE'] = weather_df['DATE'].apply(lambda x : x.strftime('%m'))
weather_df.rename(columns={'DATE': 'MONTH'}, inplace=True)
In [22]:
# change the labels into digits
weather_df['HOURLYPrecip'] = (weather_df['HOURLYPrecip']==1).astype(int)

corr = weather_df.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, annot=True, mask=mask,
            cmap='YlGnBu', vmax=0.5, vmin=-0.5)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x2540f339668>

As seen in the plot, Relative Humidity has the highest correlation with Precipitation, while the rest are only moderately related or barely related at all. Another noticeable correlation is among the features themselves, such as humidity vs. pressure. This is collinearity, which could hinder some modeling. It is less of a concern here, however, because the goal is prediction with classification models rather than interpreting regression coefficients, where collinearity is a serious problem.
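To put a rough number on that collinearity, here is a minimal sketch, assuming the optional statsmodels package is available (it is not imported elsewhere in this notebook). It computes variance inflation factors (VIF) for the three numeric predictors; values well above roughly 5 to 10 are usually read as strong collinearity.

# Illustrative only: quantify the collinearity seen in the heatmap with VIFs.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

num_cols = ['HOURLYDRYBULBTEMPF', 'HOURLYRelativeHumidity', 'HOURLYStationPressure']
# coerce to numeric and drop anything that cannot be parsed, then add an intercept column
X_vif = sm.add_constant(weather_df[num_cols].apply(pd.to_numeric, errors='coerce').dropna())

# index 0 is the constant, so the predictors start at index 1
for i, col in enumerate(num_cols, start=1):
    print(col, round(variance_inflation_factor(X_vif.values, i), 2))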


3. Balancing Classes


The original dataset is imbalanced, with relatively few 'Rain' observations. Our predictive model will be given a set of conditions and asked to predict either 'Rain' or 'No Rain' (binary), so we want to make sure it sees fair training data. One way to do so is to balance the classes by sampling an equal number of observations from each.

In [23]:
""" generate random indice without replacement """
np.random.seed(123)
def random_index(idx_range, size):
    return np.random.choice(idx_range, size, replace=False)
In [24]:
# divide it by class again for balanced set
rain_df = weather_df[weather_df['HOURLYPrecip']==1]
none_df = weather_df[weather_df['HOURLYPrecip']==0]
In [25]:
# create balanced dataset
none_df = none_df.iloc[random_index(len(none_df), len(rain_df))]
balanced_df = pd.concat([rain_df, none_df])
balanced_df = shuffle(balanced_df)

print("Number of Class Rain:", len(balanced_df[balanced_df['HOURLYPrecip']==1]))
print("Number of Class None:", len(balanced_df[balanced_df['HOURLYPrecip']==0]))
balanced_df.head(10) # classes randomly distributed
Number of Class Rain: 3036
Number of Class None: 3036
Out[25]:
MONTH HOURLYDRYBULBTEMPF HOURLYRelativeHumidity HOURLYStationPressure HOURLYPrecip
22657 05 54.0 97.0 30.03 1
17983 01 63.0 97.0 29.73 1
1362 10 66.0 90.0 29.97 1
16369 11 42.0 49.0 30.01 0
10177 06 59.0 90.0 29.83 0
13557 09 62.0 90.0 29.86 1
25788 08 79.0 88.0 30.09 1
12982 08 76.0 91.0 30.04 0
8528 04 53.0 93.0 29.97 1
21635 04 42.0 92.0 29.86 1
In [26]:
# isolate validation set from training set
val_size = int(len(balanced_df)*0.2) # 20% from balanced set
validation_set = balanced_df.iloc[random_index(len(balanced_df), val_size)]
training_set = balanced_df[~balanced_df.index.isin(validation_set.index)]

print("Validation Set Size:", len(validation_set))
print("Training Set Size:", len(training_set))
Validation Set Size: 1214
Training Set Size: 4858
In [27]:
predictors = set(training_set.columns.values) - set(['HOURLYPrecip'])

# X, y for training set
X_train = training_set[list(predictors)].values
y_train = training_set['HOURLYPrecip']

# X, y for validation set
X_val = validation_set[list(predictors)].values
y_val = validation_set['HOURLYPrecip']

4. Modeling


The weather dataset is fundamentally time series data, so standard cross validation is arguably not ideal. However, the models used below do not explicitly model time; they treat each observation independently. For that reason, we will use 5-fold cross validation to measure the general performance of each model, and then evaluate each one against the validation set we set aside. (A time-aware alternative is sketched below for reference.)
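For reference only, here is a minimal sketch of a time-aware alternative to plain k-fold cross validation, using scikit-learn's TimeSeriesSplit, which always trains on earlier observations and validates on later ones. It assumes the rows are in chronological order, which is no longer true for our shuffled, balanced training set, so it is shown purely to illustrate the idea and is not used in the experiments below.

# Illustrative only: time-ordered cross validation instead of plain k-fold.
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import LogisticRegression

tscv = TimeSeriesSplit(n_splits=5)
# X_train/y_train would need to be kept in time order for this to be meaningful
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=tscv)
print("Fold accuracies:", scores.round(3))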

In [28]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                             roc_curve, auc, log_loss)
In [29]:
def grid_CV(model, params, X_train, y_train, X_val, y_val):
    clf = GridSearchCV(model, params, cv=5)
    clf.fit(X_train, y_train)
    
    # grid search
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    print("Score / Parameters")
    for mean, std, param in zip(means, stds, clf.cv_results_['params']):  # avoid shadowing the 'params' argument
        print("%0.3f (+/-%0.03f) / %r" % (mean, std*2, param))
    print("Best Parameters:", clf.best_params_)

    # confusion matrix, report
    pred = clf.predict(X_val)
    accu = accuracy_score(y_val, pred)
    print("\n[Confusion Matrix]\n", confusion_matrix(y_val, pred))
    print("\n[Report]\n", classification_report(y_val, pred))
    
    try:
        prob = clf.predict_proba(X_val)[:,1]
    except AttributeError:
        # some models do not have 'predict_proba'; fall back to the hard predictions
        # so the ROC and log-loss calls below do not fail on an undefined 'prob'
        print("This model does not have the method 'predict_proba'")
        prob = pred.astype(float)
        
    # ROC, AUC
    fpr, tpr, threshold = roc_curve(y_val, prob)
    auc_score = auc(fpr, tpr)
    
    #log loss
    ll = log_loss(y_val, prob)
    
    return fpr, tpr, accu, auc_score, ll


• Logistic Regression

In [30]:
from sklearn.linear_model import LogisticRegression
In [31]:
logis_params = {'C': [1,10,100], 
                'solver': ['liblinear', 'lbfgs']}
logis_fpr, logis_tpr, logis_accu, logis_auc, logis_lloss = \
    grid_CV(LogisticRegression(), logis_params, X_train, y_train, X_val, y_val)
Score / Parameters
0.819 (+/-0.008) / {'C': 1, 'solver': 'liblinear'}
0.823 (+/-0.012) / {'C': 1, 'solver': 'lbfgs'}
0.820 (+/-0.011) / {'C': 10, 'solver': 'liblinear'}
0.822 (+/-0.011) / {'C': 10, 'solver': 'lbfgs'}
0.822 (+/-0.013) / {'C': 100, 'solver': 'liblinear'}
0.822 (+/-0.011) / {'C': 100, 'solver': 'lbfgs'}
Best Parameters: {'C': 1, 'solver': 'lbfgs'}

[Confusion Matrix]
 [[451 136]
 [ 70 557]]

[Report]
              precision    recall  f1-score   support

          0       0.87      0.77      0.81       587
          1       0.80      0.89      0.84       627

avg / total       0.83      0.83      0.83      1214

• Support Vector Machine

In [32]:
from sklearn.svm import SVC
In [33]:
svc_params = {'C': [1,10,100]}
svc_fpr, svc_tpr, svc_accu, svc_auc, svc_lloss = \
    grid_CV(SVC(probability=True), svc_params, X_train, y_train, X_val, y_val)
Score / Parameters
0.828 (+/-0.013) / {'C': 1}
0.815 (+/-0.018) / {'C': 10}
0.811 (+/-0.017) / {'C': 100}
Best Parameters: {'C': 1}

[Confusion Matrix]
 [[466 121]
 [ 76 551]]

[Report]
              precision    recall  f1-score   support

          0       0.86      0.79      0.83       587
          1       0.82      0.88      0.85       627

avg / total       0.84      0.84      0.84      1214

• Random Forest

In [34]:
from sklearn.ensemble import RandomForestClassifier
In [35]:
rf_params = {'n_estimators': [100,500,1000], 
             'max_depth': [2, 5, 10]}
rf_fpr, rf_tpr, rf_accu, rf_auc, rf_lloss = \
    grid_CV(RandomForestClassifier(), rf_params, X_train, y_train, X_val, y_val)
Score / Parameters
0.807 (+/-0.010) / {'max_depth': 2, 'n_estimators': 100}
0.808 (+/-0.013) / {'max_depth': 2, 'n_estimators': 500}
0.808 (+/-0.012) / {'max_depth': 2, 'n_estimators': 1000}
0.828 (+/-0.007) / {'max_depth': 5, 'n_estimators': 100}
0.829 (+/-0.006) / {'max_depth': 5, 'n_estimators': 500}
0.828 (+/-0.006) / {'max_depth': 5, 'n_estimators': 1000}
0.845 (+/-0.010) / {'max_depth': 10, 'n_estimators': 100}
0.846 (+/-0.009) / {'max_depth': 10, 'n_estimators': 500}
0.846 (+/-0.009) / {'max_depth': 10, 'n_estimators': 1000}
Best Parameters: {'max_depth': 10, 'n_estimators': 500}

[Confusion Matrix]
 [[472 115]
 [ 60 567]]

[Report]
              precision    recall  f1-score   support

          0       0.89      0.80      0.84       587
          1       0.83      0.90      0.87       627

avg / total       0.86      0.86      0.86      1214

• Gradient Boosting

In [36]:
from sklearn.ensemble import GradientBoostingClassifier
In [37]:
# reuse the random forest grid (rf_params) so the two ensemble searches are directly comparable
gb_fpr, gb_tpr, gb_accu, gb_auc, gb_lloss = \
    grid_CV(GradientBoostingClassifier(), rf_params, X_train, y_train, X_val, y_val)
Score / Parameters
0.829 (+/-0.005) / {'max_depth': 2, 'n_estimators': 100}
0.835 (+/-0.011) / {'max_depth': 2, 'n_estimators': 500}
0.839 (+/-0.013) / {'max_depth': 2, 'n_estimators': 1000}
0.846 (+/-0.012) / {'max_depth': 5, 'n_estimators': 100}
0.850 (+/-0.010) / {'max_depth': 5, 'n_estimators': 500}
0.847 (+/-0.016) / {'max_depth': 5, 'n_estimators': 1000}
0.846 (+/-0.016) / {'max_depth': 10, 'n_estimators': 100}
0.850 (+/-0.016) / {'max_depth': 10, 'n_estimators': 500}
0.850 (+/-0.020) / {'max_depth': 10, 'n_estimators': 1000}
Best Parameters: {'max_depth': 10, 'n_estimators': 500}

[Confusion Matrix]
 [[498  89]
 [ 81 546]]

[Report]
              precision    recall  f1-score   support

          0       0.86      0.85      0.85       587
          1       0.86      0.87      0.87       627

avg / total       0.86      0.86      0.86      1214

• Multi-Layer Perceptron

In [38]:
from sklearn.neural_network import MLPClassifier
In [39]:
mlp_params = {'learning_rate': ['constant', 'adaptive'], 
              'alpha': 10.0**-np.arange(1,7), 
              'activation': ['logistic', 'tanh']}
mlp_fpr, mlp_tpr, mlp_accu, mlp_auc, mlp_lloss = \
    grid_CV(MLPClassifier(), mlp_params, X_train, y_train, X_val, y_val)
Score / Parameters
0.819 (+/-0.005) / {'activation': 'logistic', 'alpha': 0.1, 'learning_rate': 'constant'}
0.816 (+/-0.015) / {'activation': 'logistic', 'alpha': 0.1, 'learning_rate': 'adaptive'}
0.821 (+/-0.010) / {'activation': 'logistic', 'alpha': 0.01, 'learning_rate': 'constant'}
0.819 (+/-0.010) / {'activation': 'logistic', 'alpha': 0.01, 'learning_rate': 'adaptive'}
0.819 (+/-0.012) / {'activation': 'logistic', 'alpha': 0.001, 'learning_rate': 'constant'}
0.819 (+/-0.008) / {'activation': 'logistic', 'alpha': 0.001, 'learning_rate': 'adaptive'}
0.817 (+/-0.009) / {'activation': 'logistic', 'alpha': 0.0001, 'learning_rate': 'constant'}
0.820 (+/-0.007) / {'activation': 'logistic', 'alpha': 0.0001, 'learning_rate': 'adaptive'}
0.820 (+/-0.008) / {'activation': 'logistic', 'alpha': 1e-05, 'learning_rate': 'constant'}
0.818 (+/-0.016) / {'activation': 'logistic', 'alpha': 1e-05, 'learning_rate': 'adaptive'}
0.821 (+/-0.008) / {'activation': 'logistic', 'alpha': 1e-06, 'learning_rate': 'constant'}
0.817 (+/-0.007) / {'activation': 'logistic', 'alpha': 1e-06, 'learning_rate': 'adaptive'}
0.820 (+/-0.016) / {'activation': 'tanh', 'alpha': 0.1, 'learning_rate': 'constant'}
0.818 (+/-0.013) / {'activation': 'tanh', 'alpha': 0.1, 'learning_rate': 'adaptive'}
0.822 (+/-0.005) / {'activation': 'tanh', 'alpha': 0.01, 'learning_rate': 'constant'}
0.818 (+/-0.014) / {'activation': 'tanh', 'alpha': 0.01, 'learning_rate': 'adaptive'}
0.823 (+/-0.007) / {'activation': 'tanh', 'alpha': 0.001, 'learning_rate': 'constant'}
0.821 (+/-0.013) / {'activation': 'tanh', 'alpha': 0.001, 'learning_rate': 'adaptive'}
0.821 (+/-0.010) / {'activation': 'tanh', 'alpha': 0.0001, 'learning_rate': 'constant'}
0.815 (+/-0.017) / {'activation': 'tanh', 'alpha': 0.0001, 'learning_rate': 'adaptive'}
0.820 (+/-0.011) / {'activation': 'tanh', 'alpha': 1e-05, 'learning_rate': 'constant'}
0.817 (+/-0.015) / {'activation': 'tanh', 'alpha': 1e-05, 'learning_rate': 'adaptive'}
0.818 (+/-0.009) / {'activation': 'tanh', 'alpha': 1e-06, 'learning_rate': 'constant'}
0.818 (+/-0.009) / {'activation': 'tanh', 'alpha': 1e-06, 'learning_rate': 'adaptive'}
Best Parameters: {'activation': 'tanh', 'alpha': 0.001, 'learning_rate': 'constant'}

[Confusion Matrix]
 [[445 142]
 [ 56 571]]

[Report]
              precision    recall  f1-score   support

          0       0.89      0.76      0.82       587
          1       0.80      0.91      0.85       627

avg / total       0.84      0.84      0.84      1214


5. Evaluation and Model Selection


Even though the libraries can compute virtually everything, it helps to understand how these evaluation metrics work in order to use them well. First, Precision, Recall, and F1-score are all derived from the same confusion-matrix counts. The definitions below assume binary classification.

$$Precision=\frac{True\ Positive}{True\ Positive\ +\ False\ Positive}\ \ Recall=\frac{True\ Positive}{True\ Positive\ +\ False\ Negative}$$

Precision is about the quality of the positive predictions: it is perfect (=*1.0*) when there are no false positives, which is why it is also called *Positive Predictive Value (PPV)*. Recall measures how many of the actual positive cases were classified as positive, so it is perfect when no positive case is labeled as negative.

$$F1\text{-}score=\frac{2\times\ Precision\ \times\ Recall}{Precision\ +\ Recall}$$

F1-score is the harmonic mean of Precision and Recall, which makes it a good summary for uneven class distributions. In our case, the classes have already been evened out.

$$Log\ Loss=-\frac{1}{N}\sum_{i=1}^{N}[y_i\log p_i\ +\ (1-y_i)\log (1-p_i)]$$

Unlike the other metrics, a smaller Log Loss means a better model; theoretically it can range from 0 to infinity. Here $y_i$ is always 0 or 1 in the binary case, and $p_i$ is the predicted probability for each observation. Log Loss measures how confidently and correctly a model classifies: a model that predicts class 1 with probability 0.96 performs much better than one that predicts the same class with probability 0.51.

The ROC-AUC shown below is essentially Accuracy evaluated across different cut-off points. Accuracy always decides on a class at a fixed 50% threshold; ROC-AUC, by contrast, sweeps the decision boundary and measures performance at every threshold, which makes it more useful for evaluating overall performance. (The worked sketch below checks these definitions against scikit-learn's implementations.)
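To make the definitions concrete, here is a minimal sketch on a tiny made-up set of labels and predicted probabilities (not part of the weather data), computing the metrics by hand from raw counts and checking them against scikit-learn.

# Illustrative only: hand-computed metrics vs. scikit-learn on toy data.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, log_loss

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_hat  = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])
y_pred = (p_hat >= 0.5).astype(int)          # accuracy-style 50% cut-off

tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
ll = -np.mean(y_true*np.log(p_hat) + (1 - y_true)*np.log(1 - p_hat))

print(np.isclose(precision, precision_score(y_true, y_pred)))  # True
print(np.isclose(recall, recall_score(y_true, y_pred)))         # True
print(np.isclose(f1, f1_score(y_true, y_pred)))                 # True
print(np.isclose(ll, log_loss(y_true, p_hat)))                  # True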

In [40]:
plt.figure(figsize=(10, 10))
plt.title("Recevier Operationg Characteristic (ROC)", size=20)
plt.xlabel("False Positive Rate $(1 - Specificity)$", size=15)
plt.ylabel("True Positive Rate $(Sensitivity)$", size=15)

plt.plot([0, 1], [0, 1], 'r--', lw=1.5, alpha=0.5, label='Baseline')
plt.plot(logis_fpr, logis_tpr, lw=1.5, label='Logistics')
plt.plot(svc_fpr, svc_tpr, lw=1.5, label='SVC')
plt.plot(rf_fpr, rf_tpr, lw=1.5, label='Random Forest')
plt.plot(gb_fpr, gb_tpr, lw=1.5, label='Gradient Boosting')
plt.plot(mlp_fpr, mlp_tpr, lw=1.5, label='Multi-Layer Perceptron')

plt.legend()
Out[40]:
<matplotlib.legend.Legend at 0x2540ec1bbe0>
In [41]:
model_names = ['Logistics', 'SVC', 'RandomForest', 'GradientBoosting', 'Multi-Layer Perceptron']
auc_dict = {'AUC': [logis_auc, svc_auc, rf_auc, gb_auc, mlp_auc],
            'Accuracy': [logis_accu, svc_accu, rf_accu, gb_accu, mlp_accu],
            'LogLoss': [logis_lloss, svc_lloss, rf_lloss, gb_lloss, mlp_lloss]}
auc_rank = pd.DataFrame(data=auc_dict, 
                        index=model_names)
auc_rank.sort_values(by='AUC', ascending=False, inplace=True)
auc_rank
Out[41]:
AUC Accuracy LogLoss
GradientBoosting 0.927491 0.859967 0.840931
RandomForest 0.924410 0.855848 0.330863
Logistics 0.893033 0.830313 0.379041
Multi-Layer Perceptron 0.884409 0.836903 0.387249
SVC 0.880395 0.837727 0.416869

6. Performance Comparison


The graph and the table suggest that the Random Forest classifier is the most successful predictor overall. Interestingly, all five models show similar Accuracy, around 0.83 to 0.86, yet they differ considerably on AUC and Log-loss. That is because their false positive and false negative rates differ: looking at each confusion matrix, the misclassified cases are distributed differently from model to model (the sketch below makes this explicit). In conclusion, Random Forest should be selected: its AUC and Accuracy are close to Gradient Boosting's, but its Log-loss is much lower.
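As a quick illustration, the counts below are copied from the confusion matrices printed above; even with similar accuracies, the models trade false positives against false negatives quite differently.

# Illustrative only: false positive / false negative rates per model,
# using the confusion matrices shown earlier ([[TN, FP], [FN, TP]]).
import pandas as pd

cms = {
    'Logistic':         [[451, 136], [ 70, 557]],
    'SVC':              [[466, 121], [ 76, 551]],
    'RandomForest':     [[472, 115], [ 60, 567]],
    'GradientBoosting': [[498,  89], [ 81, 546]],
    'MLP':              [[445, 142], [ 56, 571]],
}
rates = {name: {'FPR': round(fp/(tn+fp), 3), 'FNR': round(fn/(fn+tp), 3)}
         for name, ((tn, fp), (fn, tp)) in cms.items()}
print(pd.DataFrame(rates).T)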

Best Predictive Model: Random Forest ('max_depth': 10, 'n_estimators': 100)

Now we would like to measure the performance of the selected model. The goal is to see how well the model trained on the balanced dataset performs, compared with a model trained on all the data and with a naive model that always predicts 'No rain'.

In [42]:
# create test set considering the rain ratio
none_idx = set(weather_df.index) - set(balanced_df.index) # not from balanced set
extra_none_size = int(len(validation_set)/2/rain_ratio - len(validation_set)) # proportional "NONE"
none_idx = random_index(list(none_idx), extra_none_size)
test_set_idx = list(set(validation_set.index) | set(none_idx)) # partly from validation set
test_set = weather_df.loc[test_set_idx]

# train/test split
final_training_set = weather_df[~weather_df.index.isin(test_set_idx)]
final_test_set = weather_df.loc[test_set_idx]

# using all data
X_full_train, y_full_train = final_training_set[list(predictors)].values, \
                             final_training_set['HOURLYPrecip'].values
X_test, y_test = final_test_set[list(predictors)].values, \
                 final_test_set['HOURLYPrecip'].values

One thing to note here is that the X and y built from all the data are much larger than the X and y built from the balanced dataset. This gives the full-data model the advantage of more training examples and the disadvantage of class imbalance.

In [43]:
def auc_score(y_true, prob):
    fpr, tpr, threshold = roc_curve(y_true, prob)
    return auc(fpr, tpr)


• Actual Ratio: 1 Rain vs. 9 Dry

In [44]:
# always predict the larger class
none_pred = np.zeros(len(X_test)).T # No Rain (=0)

# train with full dataset
rf_full_clf = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=42)
rf_full_clf.fit(X_full_train, y_full_train)
rf_full_pred = rf_full_clf.predict(X_test)
rf_full_prob = rf_full_clf.predict_proba(X_test)[:,1]

# train with balanced dataset
rf_clf = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)
rf_prob = rf_clf.predict_proba(X_test)[:,1]

accu_auc = {'One_Class': [round(accuracy_score(y_test, none_pred), 3), 
                          round(auc_score(y_test, none_pred), 3)],
            'Unbalanced': [round(accuracy_score(y_test, rf_full_pred), 3), 
                           round(auc_score(y_test, rf_full_prob), 3)],
            'Balanced': [round(accuracy_score(y_test, rf_pred), 3),
                         round(auc_score(y_test, rf_prob), 3)]}
general_result = pd.DataFrame(data=accu_auc, index=["Accuracy", "AUC"])
general_result
Out[44]:
One_Class Unbalanced Balanced
Accuracy 0.883 0.906 0.809
AUC 0.500 0.931 0.923

If a model always says "It's not rainy", it will be about 88% correct, simply because it is not rainy most of the time. The model trained on all the data (unbalanced) shows both high Accuracy and high AUC; in fact, its AUC is a bit higher than that of the balanced model, which is likely due to its larger training set.


• Half Rain vs. Half Dry

In [45]:
# always predict the larger class
none_pred = np.zeros(len(X_val)).T # No Rain (=0)

# train with full dataset
rf_full_pred = rf_full_clf.predict(X_val)
rf_full_prob = rf_full_clf.predict_proba(X_val)[:,1]

# train with balanced dataset
rf_pred = rf_clf.predict(X_val)
rf_prob = rf_clf.predict_proba(X_val)[:,1]

accu_auc = {'One_Class': [round(accuracy_score(y_val, none_pred), 3), 
                          round(auc_score(y_val, none_pred), 3)],
            'Unbalanced': [round(accuracy_score(y_val, rf_full_pred), 3), 
                           round(auc_score(y_val, rf_full_prob), 3)],
            'Balanced': [round(accuracy_score(y_val, rf_pred), 3),
                         round(auc_score(y_val, rf_prob), 3)]}
equal_result = pd.DataFrame(data=accu_auc, index=["Accuracy", "AUC"])
equal_result
Out[45]:
One_Class Unbalanced Balanced
Accuracy 0.484 0.624 0.856
AUC 0.500 0.935 0.923

Here is where the model trained on balanced data outperforms the model trained on all the data; this corresponds to a situation where the weather changes quickly. The AUC of the unbalanced model, however, is still higher than the balanced model's. Again, the unbalanced model had roughly ten times as much training data, enough to rank observations by probability correctly even though many of those probabilities were too low to produce a 'Rain' prediction.


• All Rain

In [46]:
# all Rain
r_list = []
for i, j in zip(X_val, y_val):
    if j == 1:
        r_list.append(i)
X_val_rain = np.asarray(r_list)
y_val_rain = np.ones(len(r_list), dtype=int)
y_val_rain[0] = 0 # flip one label to 0 so ROC-AUC is defined (it needs both classes present)

# always predict the larger class
none_pred = np.zeros(len(X_val_rain)).T # No Rain (=0)

# train with full dataset
rf_full_pred = rf_full_clf.predict(X_val_rain)
rf_full_prob = rf_full_clf.predict_proba(X_val_rain)[:,1]

# train with balanced dataset
rf_pred = rf_clf.predict(X_val_rain)
rf_prob = rf_clf.predict_proba(X_val_rain)[:,1]

accu_auc = {'One_Class': [round(accuracy_score(y_val_rain, none_pred), 3), 
                          round(auc_score(y_val_rain, none_pred), 3)],
            'Unbalanced': [round(accuracy_score(y_val_rain, rf_full_pred), 3), 
                           round(auc_score(y_val_rain, rf_full_prob), 3)],
            'Balanced': [round(accuracy_score(y_val_rain, rf_pred), 3),
                         round(auc_score(y_val_rain, rf_prob), 3)]}
rain_result = pd.DataFrame(data=accu_auc, index=["Accuracy", "AUC"])
rain_result
Out[46]:
One_Class Unbalanced Balanced
Accuracy 0.002 0.285 0.900
AUC 0.500 0.845 0.757

If it rains all day, as in a monsoon season, the balanced model does a remarkably better job on accuracy. The unbalanced model drops below 0.3 accuracy because it is biased toward predicting 'No rain'. To sum up, we choose the model trained on the balanced dataset, since it is more stable regardless of which class it has to predict.


Conclusion


We analyzed the dataset and built predictive models. Along the way we handled the class imbalance by randomly selecting the same number of 'Rain' and 'No rain' observations. Each model was tuned with grid search and evaluated with 5-fold cross validation.

The models were then evaluated by Accuracy, Precision, Recall, F1-score, and AUC. The selected model, Random Forest, was tested again in different situations by changing the 'Rain' class ratio. While the performance of the model trained on imbalanced data varied across situations, the balanced model performed with notable stability, so the balanced model should be selected as the final model. The model could perform even better if the following were done.

1. Observations: more data is always a plus for prediction
2. Features: more features could help if they are related to rain
3. Modeling: further parameter tuning would improve performance

In our case we had only 4 features and a limited amount of rainy-day data. The model would do better if we extended the observed period to add more 'Rain' observations, or if we found additional useful predictors in the dataset; with those, both Accuracy and AUC would likely improve in every situation.
