Predicting Rainy Weather from Past Patterns
This project analyzes two recent years of actual NOAA weather data and builds several predictive models to classify rainy weather. The main purpose is to walk through the standard steps of building a prediction model: data preparation, data exploration, modeling, and cross validation. In particular, the project focuses on evaluating the performance of different models to justify choosing one model over the others.
1. Preparation: Data Load, Missing Values
2. Data Exploration: Lineplot, Correlation Matrix
3. Balancing Classes: Random Undersampling
4. Modeling: Logistic Regression, SVM, Random Forest, Gradient Boosting, Multi-layer Perceptron
5. Evaluation and Model Selection: Accuracy, Precision, Recall, F1-score, Log-loss, ROC-AUC
6. Performance Comparison: Full Dataset, Balanced Dataset
This section reads and formats the dataset before analysis, focusing on dropping uninformative columns, handling missing values, and converting data types where necessary. The following libraries are used throughout the project; the modeling libraries are imported along the way for clarity.
# dataframe
import numpy as np
import pandas as pd
# visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.3)
sns.set_style("whitegrid")
# others
import re
from sklearn.utils import shuffle
import warnings
warnings.filterwarnings('ignore')
%%bash
ls
weather_df = pd.read_csv("climatological_data_philadelphia.csv")
weather_df.head()
weather_df.shape
Many columns are not useful because of an excessive number of NaN values, and the station-related columns can be ignored since all observations come from the same station. I keep Date, Temperature, Humidity, and Pressure as predictors, and Precipitation stays in because the goal is to predict 'Rain' or 'No Rain'.
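As a quick check (not part of the original notebook), the fraction of missing values per column makes the choice of columns to drop easy to verify:
# fraction of NaN per column, highest first; columns near 1.0 carry almost no information
missing_ratio = weather_df.isnull().mean().sort_values(ascending=False)
print(missing_ratio.head(20))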
col_keep = ['DATE', 'HOURLYDRYBULBTEMPF', 'HOURLYRelativeHumidity', 'HOURLYStationPressure', 'HOURLYPrecip']
col_drop = set(weather_df.columns) - set(col_keep)
weather_df.drop(col_drop, axis=1, inplace=True)
weather_df.columns
weather_df.info()
The target variable needs a little more processing because it is mostly continuous values mixed with a few categorical codes. For example, 'T' stands for Trace, a negligible amount of rain, which we classify as no rain. In the end, every numerical value other than zero becomes a precipitation amount in float, and everything else becomes zero, meaning no rain.
def isPrecip(x):
    """Convert a precipitation reading to float; non-numeric codes such as 'T' become 0."""
    try:
        x = float(x)
    except (ValueError, TypeError):
        return 0
    if x > 0:
        return x
    else:
        return 0
weather_df['HOURLYPrecip'] = weather_df['HOURLYPrecip'].apply(lambda x : isPrecip(x))
The predictor Pressure has a type problem: it is stored as 'object' instead of float. In addition, some values carry a trailing letter, such as '22.45s', when only the number is needed. The task is therefore to extract the number and convert the type to float.
def convPres(x):
    """Extract the numeric part of a pressure reading (e.g. '22.45s' -> 22.45)."""
    try:
        x = float(re.findall(r"\d+\.\d+", x)[0])
    except (TypeError, IndexError):
        return x
    return x
weather_df['HOURLYStationPressure'] = weather_df['HOURLYStationPressure'].apply(lambda x : convPres(x))
Lastly, the Date column is too fine-grained: using minute- or hour-level timestamps would invite overfitting. It is also stored as a string, which makes it difficult to manipulate. The following truncates it to the date level and converts it to a timestamp.
# into timestamp (%Y-%m-%d)
weather_df['DATE'] = weather_df['DATE'].apply(lambda d : d[:10])
weather_df['DATE'] = pd.to_datetime(weather_df['DATE'])
weather_df.describe().T
Except for the target variable Hourly Precipitation, every column has some missing values. By the nature of weather data, a missing value should be similar to the previous or next observation. However, filling from neighbors becomes a problem if missing values sit in long consecutive runs; it is fine if they are spread out.
""" Count the longest consecutive rows with missing values """
def longestNull(x_index):
length = 1
max_length = length
prev_item = x_index[0]
for item in x_index[1:]:
if item == prev_item+1: # consecutive
length += 1
else:
max_length = max(max_length, length)
length = 1
prev_item = item
return max_length
print("Longest Consecutive NaN")
print("Temperature:", longestNull(weather_df[weather_df['HOURLYDRYBULBTEMPF'].isnull()].index))
print("Humidity:", longestNull(weather_df[weather_df['HOURLYRelativeHumidity'].isnull()].index))
print("Pressure:", longestNull(weather_df[weather_df['HOURLYStationPressure'].isnull()].index))
It seems like most of the missing values are spread out. Again, this is time series weather data, so observations do not jump up and down abruptly; for example, it rarely starts pouring right after sunny hours. Forward-filling from the previous observation is therefore a reasonable way to impute them.
# filling missing values
weather_df.fillna(method='ffill', inplace=True)
rain_ratio = (weather_df['HOURLYPrecip']!=0).sum()/len(weather_df)
print("RAIN Ratio:", rain_ratio.round(4)*100, "%")
sns.kdeplot(weather_df[weather_df['HOURLYPrecip']!=0]['HOURLYPrecip'],
shade=True, label="Hourly Precip")
plt.title("Distribution of Precipitation")
plt.xlabel("Inches")
plt.show()
Rainy observations account for just over 11% of the data, and precipitation is usually very low even then. We label dry hours '0' and rainy hours '1' to turn this into a binary classification problem, which also makes the classes easy to handle programmatically.
weather_df['HOURLYPrecip'] = weather_df['HOURLYPrecip'].apply(lambda x : 1 if x!=0 else 0)
To plot the features over time for each class, we first split the dataset by class. We also aggregate the hourly observations by day, since the focus at this stage is on general trends.
# divide it by class
none_df = weather_df[weather_df['HOURLYPrecip']==0]
rain_df = weather_df[weather_df['HOURLYPrecip']==1]
print("Number of Class Rain:", len(rain_df))
print("Number of Class None:", len(none_df))
# group them by day for easier plotting
none_df = none_df.groupby('DATE').mean()
none_df = none_df.reset_index(level=[0])
rain_df = rain_df.groupby('DATE').mean()
rain_df = rain_df.reset_index(level=[0])
# plot temperature
plt.figure(figsize=(15, 8))
sns.lineplot(data=none_df, x="DATE", y="HOURLYDRYBULBTEMPF", label="No Rain")
sns.lineplot(data=rain_df, x="DATE", y="HOURLYDRYBULBTEMPF", label="Rain")
plt.title("Dry Bulb Temperature (Daily Avg)", size=20)
plt.ylabel("")
plt.legend()
Dry bulb temperature on rainy days tends to be lower than on dry days in summer, and the other way around in winter. Although the pattern is not very clear, it makes sense: rainy days are relatively cooler in summer and warmer in winter.
# plot relative humidity
plt.figure(figsize=(15, 8))
sns.lineplot(data=none_df, x="DATE", y="HOURLYRelativeHumidity", label="No Rain")
sns.lineplot(data=rain_df, x="DATE", y="HOURLYRelativeHumidity", label="Rain")
plt.title("Relative Humidity (Daily Avg)", size=20)
plt.ylabel("")
plt.legend()
It is quite clear that rainy days are generally more humid than dry days across all seasons. The result for winter days may differ from what some would expect, but it is consistent with the fact that snowy days also tend to be humid.
# plot pressure
plt.figure(figsize=(15, 8))
sns.lineplot(data=none_df, x="DATE", y="HOURLYStationPressure", label="No Rain")
sns.lineplot(data=rain_df, x="DATE", y="HOURLYStationPressure", label="Rain")
plt.title("Pressure (Daily Avg)", size=20)
plt.ylabel("")
plt.legend()
It looks like pressure is lower when it is rainy. This agrees with the well-known relationship that rain is more likely when pressure drops. In summary, humidity and pressure appear useful for recognizing rainy days.
# extract months as string type (categorical: 01, 02, ..., 11, 12)
weather_df['DATE'] = weather_df['DATE'].apply(lambda x : x.strftime('%m'))
weather_df.rename(columns={'DATE': 'MONTH'}, inplace=True)
# change the labels into digits
weather_df['HOURLYPrecip'] = (weather_df['HOURLYPrecip']==1).astype(int)
corr = weather_df.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, annot=True, mask=mask,
cmap='YlGnBu', vmax=0.5, vmin=-0.5)
As seen in the plot, Relative Humidity has the highest correlation with Precipitation, while the other predictors are only moderately related or not related at all. Another noticeable correlation is between the features themselves, such as humidity vs. pressure. This is collinearity, which mainly distorts coefficient estimates and their interpretation; since the goal here is predictive classification rather than inference, it is much less of a concern, although it can still affect models such as logistic regression to some degree.
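To make the collinearity remark concrete, here is a minimal sketch (not in the original analysis) that computes variance inflation factors for the numeric predictors; VIF values above roughly 5-10 are usually read as strong collinearity. It assumes only the column names used above.
from sklearn.linear_model import LinearRegression
num_cols = ['HOURLYDRYBULBTEMPF', 'HOURLYRelativeHumidity', 'HOURLYStationPressure']
X_num = weather_df[num_cols].values
for j, col in enumerate(num_cols):
    others = np.delete(X_num, j, axis=1)            # all predictors except column j
    r2 = LinearRegression().fit(others, X_num[:, j]).score(others, X_num[:, j])
    vif = 1.0 / (1.0 - r2) if r2 < 1 else np.inf    # VIF_j = 1 / (1 - R_j^2)
    print("VIF(%s): %.2f" % (col, vif))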
""" generate random indice without replacement """
np.random.seed(123)
def random_index(idx_range, size):
return np.random.choice(idx_range, size, replace=False)
# divide it by class again for balanced set
rain_df = weather_df[weather_df['HOURLYPrecip']==1]
none_df = weather_df[weather_df['HOURLYPrecip']==0]
# create balanced dataset
none_df = none_df.iloc[random_index(len(none_df), len(rain_df))]
balanced_df = pd.concat([rain_df, none_df])
balanced_df = shuffle(balanced_df)
print("Number of Class Rain:", len(balanced_df[balanced_df['HOURLYPrecip']==1]))
print("Number of Class None:", len(balanced_df[balanced_df['HOURLYPrecip']==0]))
balanced_df.head(10) # classes randomly distributed
# isolate validation set from training set
val_size = int(len(balanced_df)*0.2) # 20% from balanced set
validation_set = balanced_df.iloc[random_index(len(balanced_df), val_size)]
training_set = balanced_df[~balanced_df.index.isin(validation_set.index)]
print("Validation Set Size:", len(validation_set))
print("Training Set Size:", len(training_set))
predictors = set(training_set.columns.values) - set(['HOURLYPrecip'])
# X, y for training set
X_train = training_set[list(predictors)].values
y_train = training_set['HOURLYPrecip']
# X, y for validation set
X_val = validation_set[list(predictors)].values
y_val = validation_set['HOURLYPrecip']
The weather dataset is fundamentally time series data, so standard k-fold cross validation may not be ideal. However, the models used below are not time-series models: they do not exploit the temporal structure of the data. For that reason, we use 5-fold cross validation to measure the general performance of each model, and then evaluate the models against the validation set we set aside earlier.
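For completeness, here is a hedged sketch (not used in the rest of this analysis) of a time-aware alternative using sklearn's TimeSeriesSplit. It has to run on the chronologically ordered data (weather_df), not on the shuffled balanced set, and it uses only the numeric predictors named above.
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import LogisticRegression
ts_cols = ['HOURLYDRYBULBTEMPF', 'HOURLYRelativeHumidity', 'HOURLYStationPressure']
X_ts = weather_df[ts_cols].values
y_ts = weather_df['HOURLYPrecip'].values
tscv = TimeSeriesSplit(n_splits=5)  # each fold trains on the past and validates on the future
scores = cross_val_score(LogisticRegression(max_iter=1000), X_ts, y_ts, cv=tscv, scoring='roc_auc')
print("Per-fold ROC-AUC:", scores.round(3))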
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import *
def grid_CV(model, params, X_train, y_train, X_val, y_val):
    """Run a 5-fold grid search, report validation metrics, and return ROC/score summaries."""
    clf = GridSearchCV(model, params, cv=5)
    clf.fit(X_train, y_train)
    # grid search results
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    print("Score / Parameters")
    for mean, std, param in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) / %r" % (mean, std * 2, param))
    print("Best Parameters:", clf.best_params_)
    # confusion matrix, report
    pred = clf.predict(X_val)
    accu = accuracy_score(y_val, pred)
    print("\n[Confusion Matrix]\n", confusion_matrix(y_val, pred))
    print("\n[Report]\n", classification_report(y_val, pred))
    try:
        # some models do not have 'predict_proba'
        prob = clf.predict_proba(X_val)[:, 1]
    except AttributeError:
        print("This model does not have the method 'predict_proba'")
        prob = pred  # fall back to hard predictions so the metrics below can still be computed
    # ROC, AUC
    fpr, tpr, threshold = roc_curve(y_val, prob)
    auc_score = auc(fpr, tpr)
    # log loss
    ll = log_loss(y_val, prob)
    return fpr, tpr, accu, auc_score, ll
from sklearn.linear_model import LogisticRegression
logis_params = {'C': [1,10,100],
'solver': ['liblinear', 'lbfgs']}
logis_fpr, logis_tpr, logis_accu, logis_auc, logis_lloss = \
grid_CV(LogisticRegression(), logis_params, X_train, y_train, X_val, y_val)
from sklearn.svm import SVC
svc_params = {'C': [1,10,100]}
svc_fpr, svc_tpr, svc_accu, svc_auc, svc_lloss = \
grid_CV(SVC(probability=True), svc_params, X_train, y_train, X_val, y_val)
from sklearn.ensemble import RandomForestClassifier
rf_params = {'n_estimators': [100,500,1000],
'max_depth': [2, 5, 10]}
rf_fpr, rf_tpr, rf_accu, rf_auc, rf_lloss = \
grid_CV(RandomForestClassifier(), rf_params, X_train, y_train, X_val, y_val)
from sklearn.ensemble import GradientBoostingClassifier
gb_params = {'n_estimators': [100, 500, 1000],
             'learning_rate': [0.1, 0.2, 0.3],
             'max_depth': [3, 5, 10]}
gb_fpr, gb_tpr, gb_accu, gb_auc, gb_lloss = \
    grid_CV(GradientBoostingClassifier(), gb_params, X_train, y_train, X_val, y_val)
from sklearn.neural_network import MLPClassifier
mlp_params = {'learning_rate': ['constant', 'adaptive'],
'alpha': 10.0**-np.arange(1,7),
'activation': ['logistic', 'tanh']}
mlp_fpr, mlp_tpr, mlp_accu, mlp_auc, mlp_lloss = \
grid_CV(MLPClassifier(), mlp_params, X_train, y_train, X_val, y_val)
Even though the libraries handle virtually everything, it is worth understanding how these evaluation metrics work in order to use them well. Precision, Recall, and F1-score are derived from the same building blocks; the definitions below assume binary classification.
$$Precision=\frac{True\ Positive}{True\ Positive\ +\ False\ Positive}\ \ Recall=\frac{True\ Positive}{True\ Positive\ +\ False\ Negative}$$
Precision is about the quality of positive predictions: it is perfect (=*1.0*) when there are no false positive cases, which is why it is also called *Positive Predictive Value (PPV)*. Recall measures how many of the actual positive cases were classified as positive, so it is perfect when no positive case is labeled as negative.
$$F1\text{-}score=\frac{2\times\ Precision\ \times\ Recall}{Precision\ +\ Recall}$$
F1-score is the harmonic mean of Precision and Recall, which makes it a good single metric for uneven class distributions. In this case, the classes have already been evened out.
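A tiny worked example (toy labels, not from this dataset) connects the formulas above to the sklearn helpers used later in the report:
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]                     # TP=3, FP=1, FN=1, TN=3
print("Precision:", precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print("Recall:   ", recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print("F1-score: ", f1_score(y_true, y_pred))         # 2*0.75*0.75 / (0.75 + 0.75) = 0.75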
$$Log\ Loss=-\frac{1}{N}\sum_{i=1}^{N}[y_i\log p_i\ +\ (1-y_i)\log (1-p_i)]$$
Unlike the other metrics, a smaller Log Loss means a better model; theoretically it ranges from 0 to infinity. In the binary case $y_i$ is always 0 or 1, and $p_i$ is the predicted probability for each observation. Log Loss measures how confidently a model classifies cases: a model that predicts class 1 with probability 0.96 scores much better than one that predicts the same class with probability 0.51.
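The 0.96 vs. 0.51 comparison can be checked directly (a minimal example, not from this dataset): both predictions are "correct" at a 0.5 cut-off, yet the confident one is penalized far less.
from sklearn.metrics import log_loss
print(log_loss([1], [0.96], labels=[0, 1]))   # -ln(0.96) ≈ 0.041
print(log_loss([1], [0.51], labels=[0, 1]))   # -ln(0.51) ≈ 0.673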
The ROC-AUC shown below is, roughly, Accuracy evaluated across different cut-off points. Accuracy always uses a fixed 50% threshold to decide on a class, whereas ROC-AUC sweeps the decision boundary and measures performance at every threshold. This makes ROC-AUC more useful for evaluating overall performance.
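As a small standalone illustration (toy data, not from this project) of how roc_curve and auc combine inside grid_CV above:
from sklearn.metrics import roc_curve, auc
toy_true = [0, 0, 1, 1]
toy_prob = [0.10, 0.40, 0.35, 0.80]            # predicted probability of class 1
toy_fpr, toy_tpr, toy_thr = roc_curve(toy_true, toy_prob)
print("Thresholds swept:", toy_thr)
print("AUC:", auc(toy_fpr, toy_tpr))           # 0.75 for this toy example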
plt.figure(figsize=(10, 10))
plt.title("Recevier Operationg Characteristic (ROC)", size=20)
plt.xlabel("False Positive Rate $(1 - Specificity)$", size=15)
plt.ylabel("True Positive Rate $(Sensitivity)$", size=15)
plt.plot([0, 1], [0, 1], 'r--', lw=1.5, alpha=0.5, label='Baseline')
plt.plot(logis_fpr, logis_tpr, lw=1.5, label='Logistic Regression')
plt.plot(svc_fpr, svc_tpr, lw=1.5, label='SVC')
plt.plot(rf_fpr, rf_tpr, lw=1.5, label='Random Forest')
plt.plot(gb_fpr, gb_tpr, lw=1.5, label='Gradient Boosting')
plt.plot(mlp_fpr, mlp_tpr, lw=1.5, label='Multi-Layer Perceptron')
plt.legend()
model_names = ['Logistic Regression', 'SVC', 'RandomForest', 'GradientBoosting', 'Multi-Layer Perceptron']
auc_dict = {'AUC': [logis_auc, svc_auc, rf_auc, gb_auc, mlp_auc],
'Accuracy': [logis_accu, svc_accu, rf_accu, gb_accu, mlp_accu],
'LogLoss': [logis_lloss, svc_lloss, rf_lloss, gb_lloss, mlp_lloss]}
auc_rank = pd.DataFrame(data=auc_dict,
index=model_names)
auc_rank.sort_values(by='AUC', ascending=False, inplace=True)
auc_rank
The graph and the table suggest that the Random Forest classifier predicts best. Interestingly, all five models show similar Accuracy, around 0.83, but they differ considerably in AUC and Log-loss. This is because they have different false positive and false negative rates; looking at the confusion matrices, the misclassified cases differ by model. In conclusion, Random Forest should be selected: its AUC and Accuracy are similar to Gradient Boosting's, but its Log-loss is much lower.
Best Predictive Model: Random Forest ('max_depth': 10, 'n_estimators': 100)
Now we measure the performance of the selected model. The goal is to see how well the model trained on the balanced dataset performs compared with the model trained on all the data and with a naive baseline that always predicts 'No rain'.
# create test set considering the rain ratio
none_idx = set(weather_df.index) - set(balanced_df.index) # not from balanced set
extra_none_size = int(len(validation_set)/2/rain_ratio - len(validation_set)) # proportional "NONE"
none_idx = random_index(list(none_idx), extra_none_size)
test_set_idx = list(set(validation_set.index) | set(none_idx)) # partly from validation set
test_set = weather_df.loc[test_set_idx]
# train/test split
final_training_set = weather_df[~weather_df.index.isin(test_set_idx)]
final_test_set = weather_df.loc[test_set_idx]
# using all data
X_full_train, y_full_train = final_training_set[list(predictors)].values, \
final_training_set['HOURLYPrecip'].values
X_test, y_test = final_test_set[list(predictors)].values, \
final_test_set['HOURLYPrecip'].values
One thing to note is that the X and y built from all the data are much larger than the X and y built from the balanced dataset, which gives the full-data model an advantage in sample size but a disadvantage in class balance.
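A quick sanity check of the size gap described above (exact values depend on the run):
print("Full training set:    ", len(X_full_train), "rows")
print("Balanced training set:", len(X_train), "rows")
print("Ratio: ~%.1fx" % (len(X_full_train) / len(X_train)))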
def auc_score(y_true, prob):
    """Compute ROC-AUC from true labels and predicted probabilities."""
    fpr, tpr, threshold = roc_curve(y_true, prob)
    return auc(fpr, tpr)
# always predict the larger class
none_pred = np.zeros(len(X_test)).T # No Rain (=0)
# train with full dataset
rf_full_clf = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=42)
rf_full_clf.fit(X_full_train, y_full_train)
rf_full_pred = rf_full_clf.predict(X_test)
rf_full_prob = rf_full_clf.predict_proba(X_test)[:,1]
# train with balanced dataset
rf_clf = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)
rf_prob = rf_clf.predict_proba(X_test)[:,1]
accu_auc = {'One_Class': [round(accuracy_score(y_test, none_pred), 3),
round(auc_score(y_test, none_pred), 3)],
'Unbalanced': [round(accuracy_score(y_test, rf_full_pred), 3),
round(auc_score(y_test, rf_full_prob), 3)],
'Balanced': [round(accuracy_score(y_test, rf_pred), 3),
round(auc_score(y_test, rf_prob), 3)]}
general_result = pd.DataFrame(data=accu_auc, index=["Accuracy", "AUC"])
general_result
If a model always says "it's not rainy", it will be about 88% accurate because it is not rainy most of the time. The model trained on all the data (unbalanced) shows both high Accuracy and high AUC; in fact, its AUC is a bit higher than the balanced model's, which is likely due to the larger training set.
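The 88% figure is simply the share of non-rainy observations computed earlier, which can be verified in one line:
print("Always-dry baseline accuracy ≈", round(1 - rain_ratio, 3))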
# always predict the larger class
none_pred = np.zeros(len(X_val)).T # No Rain (=0)
# train with full dataset
rf_full_pred = rf_full_clf.predict(X_val)
rf_full_prob = rf_full_clf.predict_proba(X_val)[:,1]
# train with balanced dataset
rf_pred = rf_clf.predict(X_val)
rf_prob = rf_clf.predict_proba(X_val)[:,1]
accu_auc = {'One_Class': [round(accuracy_score(y_val, none_pred), 3),
round(auc_score(y_val, none_pred), 3)],
'Unbalanced': [round(accuracy_score(y_val, rf_full_pred), 3),
round(auc_score(y_val, rf_full_prob), 3)],
'Balanced': [round(accuracy_score(y_val, rf_pred), 3),
round(auc_score(y_val, rf_prob), 3)]}
equal_result = pd.DataFrame(data=accu_auc, index=["Accuracy", "AUC"])
equal_result
Here the model trained on balanced data outperforms the model trained on all the data: the validation set is half 'Rain' and half 'No rain', similar to a period of quickly changing weather. The AUC of the unbalanced model, however, is still higher than the balanced model's. Again, the unbalanced model had roughly ten times more data, enough to rank observations correctly by probability even though many of those probabilities were too low to cross the 'Rain' threshold.
# keep only the 'Rain' observations from the validation set
r_list = []
for i, j in zip(X_val, y_val):
    if j == 1:
        r_list.append(i)
X_val_rain = np.asarray(r_list)
y_val_rain = np.ones(len(r_list), dtype=int)
y_val_rain[0] = 0  # flip one label so both classes are present and roc_curve/auc can be computed
# always predict the larger class
none_pred = np.zeros(len(X_val_rain)).T # No Rain (=0)
# train with full dataset
rf_full_pred = rf_full_clf.predict(X_val_rain)
rf_full_prob = rf_full_clf.predict_proba(X_val_rain)[:,1]
# train with balanced dataset
rf_pred = rf_clf.predict(X_val_rain)
rf_prob = rf_clf.predict_proba(X_val_rain)[:,1]
accu_auc = {'One_Class': [round(accuracy_score(y_val_rain, none_pred), 3),
round(auc_score(y_val_rain, none_pred), 3)],
'Unbalanced': [round(accuracy_score(y_val_rain, rf_full_pred), 3),
round(auc_score(y_val_rain, rf_full_prob), 3)],
'Balanced': [round(accuracy_score(y_val_rain, rf_pred), 3),
round(auc_score(y_val_rain, rf_prob), 3)]}
rain_result = pd.DataFrame(data=accu_auc, index=["Accuracy", "AUC"])
rain_result
If it rained all day, as in a monsoon season, the balanced model does a remarkably better job on accuracy. The unbalanced model drops below 0.3 accuracy because it is biased toward predicting 'No rain'. To sum up, we choose the model trained on the balanced dataset because it is more stable regardless of which class it has to predict.
We analyzed the dataset and built a predictive model. Along the way, we handled the class imbalance by randomly sampling the same number of 'Rain' and 'No rain' observations. Each model was evaluated with 5-fold cross validation, with its best parameters found by grid search.
The models were then evaluated by Accuracy, Precision and Recall, F1-score, and AUC. The selected model, Random Forest, was tested again in different situations by changing the 'Rain' class ratio. While the performance of the unbalanced model varied across situations, the balanced model performed with strong stability, so the balanced model was selected as the final model. However, the model could perform even better if the following were done.
1. Observations: More data is always plus for prediction
2. Features: More features could be helpful if related
3. Modeling: Tweaking parameters would improve the performance (a sketch follows below)
In our case, we only had 4 features and a limited number of rainy observations. The model would do better if we extended the observed period to add more rainy cases or found additional predictors in the dataset, in which case both Accuracy and AUC would likely improve across all scenarios.
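As one example of item 3 above, a broader randomized hyper-parameter search could be run over the Random Forest; the ranges below are illustrative assumptions, not tuned values.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
rf_space = {'n_estimators': [100, 300, 500, 1000],
            'max_depth': [5, 10, 20, None],
            'min_samples_leaf': [1, 2, 5, 10],
            'max_features': ['sqrt', 'log2', None]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42), rf_space,
                            n_iter=20, cv=5, scoring='roc_auc', random_state=42)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best CV ROC-AUC:", round(search.best_score_, 3))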