Predicting Rainy Weather from Past Patterns
This project analyzes two recent years of actual NOAA weather data and builds several predictive models to classify rainy weather. The main purpose is to walk through the standard steps of building a prediction model: data preparation, data exploration, modeling, and cross validation. In particular, the project focuses on evaluating the performance of different models to justify choosing one model over the others.
1. Preparation: Data Load, Missing Values
2. Data Exploration: Lineplot, Correlation Matrix
3. Balancing Classes: Random Undersampling
4. Modeling: Logistic Regression, SVM, Random Forest, Gradient Boosting, Multi-layer Perceptron
5. Evaluation and Model Selection: Accuracy, Precision, Recall, F1-score, Log-loss, ROC-AUC
6. Performance Comparison: Full Dataset, Balanced Dataset
This section reads and formats the dataset before analysis, focusing on dropping uninformative columns, handling missing values, and converting data types where necessary. The following libraries are used throughout the project; the modeling libraries are imported along the way for clarity.
# dataframe
import numpy as np
import pandas as pd
# visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.3)
sns.set_style("whitegrid")
# others
import re
from sklearn.utils import shuffle
import warnings
warnings.filterwarnings('ignore')
%%bash
ls
weather_df = pd.read_csv("climatological_data_philadelphia.csv")
weather_df.head()
weather_df.shape
Many columns are not useful because of an excessive number of NaN values, and the station-related columns can be ignored since all observations come from the same station. I keep Date, Temperature, Humidity, and Pressure as predictors, and Precipitation stays in because the goal is to predict 'Rain' or 'No Rain'.
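As a quick check (not part of the original notebook), the fraction of missing values per column makes the choice of columns to drop easy to verify:
# fraction of NaN per column, highest first; columns near 1.0 carry almost no information
missing_ratio = weather_df.isnull().mean().sort_values(ascending=False)
print(missing_ratio.head(20))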
col_keep = ['DATE', 'HOURLYDRYBULBTEMPF', 'HOURLYRelativeHumidity', 'HOURLYStationPressure', 'HOURLYPrecip']
col_drop = set(weather_df.columns) - set(col_keep)
weather_df.drop(col_drop, axis=1, inplace=True)
weather_df.columns
weather_df.info()
The target variable needs a little more processing because it is mostly continuous values mixed with a few categorical codes. For example, 'T' stands for Trace, a negligible amount of rain, which we classify as no rain. In the end, every numerical value other than zero becomes a precipitation amount in float, and everything else becomes zero, meaning no rain.
def isPrecip(x):
    """Convert a precipitation reading to float; non-numeric codes such as 'T' become 0."""
    try:
        x = float(x)
    except (ValueError, TypeError):
        return 0
    if x > 0:
        return x
    else:
        return 0
weather_df['HOURLYPrecip'] = weather_df['HOURLYPrecip'].apply(lambda x : isPrecip(x))
The predictor Pressure has a type problem: it is stored as 'object' instead of float. In addition, some values carry a trailing letter, such as '22.45s', when only the number is needed. The task is therefore to extract the number and convert the type to float.
def convPres(x):
    """Extract the numeric part of a pressure reading (e.g. '22.45s' -> 22.45)."""
    try:
        x = float(re.findall(r"\d+\.\d+", x)[0])
    except (TypeError, IndexError):
        return x
    return x
weather_df['HOURLYStationPressure'] = weather_df['HOURLYStationPressure'].apply(lambda x : convPres(x))
Lastly, the Date column is too fine-grained: using minute- or hour-level timestamps would invite overfitting. It is also stored as a string, which makes it difficult to manipulate. The following truncates it to the date level and converts it to a timestamp.
# into timestamp (%Y-%m-%d)
weather_df['DATE'] = weather_df['DATE'].apply(lambda d : d[:10])
weather_df['DATE'] = pd.to_datetime(weather_df['DATE'])
weather_df.describe().T
Except for the target variable Hourly Precipitation, every column has some missing values. By the nature of weather data, a missing value should be similar to the previous or next observation. However, filling from neighbors becomes a problem if missing values sit in long consecutive runs; it is fine if they are spread out.
""" Count the longest consecutive rows with missing values """
def longestNull(x_index):
length = 1
max_length = length
prev_item = x_index[0]
for item in x_index[1:]:
if item == prev_item+1: # consecutive
length += 1
else:
max_length = max(max_length, length)
length = 1
prev_item = item
return max_length
print("Longest Consecutive NaN")
print("Temperature:", longestNull(weather_df[weather_df['HOURLYDRYBULBTEMPF'].isnull()].index))
print("Humidity:", longestNull(weather_df[weather_df['HOURLYRelativeHumidity'].isnull()].index))
print("Pressure:", longestNull(weather_df[weather_df['HOURLYStationPressure'].isnull()].index))
It seems like most of the missing values are spread out. Again, this is time series weather data, so observations do not jump up and down abruptly; for example, it rarely starts pouring right after sunny hours. Forward-filling from the previous observation is therefore a reasonable way to impute them.
# filling missing values
weather_df.fillna(method='ffill', inplace=True)
rain_ratio = (weather_df['HOURLYPrecip']!=0).sum()/len(weather_df)
print("RAIN Ratio:", rain_ratio.round(4)*100, "%")
sns.kdeplot(weather_df[weather_df['HOURLYPrecip']!=0]['HOURLYPrecip'],
shade=True, label="Hourly Precip")
plt.title("Distribution of Precipitation")
plt.xlabel("Inches")
plt.show()
Rainy observations account for just over 11% of the data, and precipitation is usually very low even then. We label dry hours '0' and rainy hours '1' to turn this into a binary classification problem, which also makes the classes easy to handle programmatically.
weather_df['HOURLYPrecip'] = weather_df['HOURLYPrecip'].apply(lambda x : 1 if x!=0 else 0)
To plot the features over time for each class, we first split the dataset by class. We also aggregate the hourly observations by day, since the focus at this stage is on general trends.
# divide it by class
none_df = weather_df[weather_df['HOURLYPrecip']==0]
rain_df = weather_df[weather_df['HOURLYPrecip']==1]
print("Number of Class Rain:", len(rain_df))
print("Number of Class None:", len(none_df))
# group them by day for easier plotting
none_df = none_df.groupby('DATE').mean()
none_df = none_df.reset_index(level=[0])
rain_df = rain_df.groupby('DATE').mean()
rain_df = rain_df.reset_index(level=[0])
# plot temperature
plt.figure(figsize=(15, 8))
sns.lineplot(data=none_df, x="DATE", y="HOURLYDRYBULBTEMPF", label="No Rain")
sns.lineplot(data=rain_df, x="DATE", y="HOURLYDRYBULBTEMPF", label="Rain")
plt.title("Dry Bulb Temperature (Daily Avg)", size=20)
plt.ylabel("")
plt.legend()
Dry bulb temperature on rainy days tends to be lower than on dry days in summer, and the other way around in winter. Although the pattern is not very clear, it makes sense: rainy days are relatively cooler in summer and warmer in winter.
# plot relative humidity
plt.figure(figsize=(15, 8))
sns.lineplot(data=none_df, x="DATE", y="HOURLYRelativeHumidity", label="No Rain")
sns.lineplot(data=rain_df, x="DATE", y="HOURLYRelativeHumidity", label="Rain")
plt.title("Relative Humidity (Daily Avg)", size=20)
plt.ylabel("")
plt.legend()
It is quite clear that rainy days are generally more humid than dry days across all seasons. The result for winter days may differ from what some would expect, but it is consistent with the fact that snowy days also tend to be humid.
# plot pressure
plt.figure(figsize=(15, 8))
sns.lineplot(data=none_df, x="DATE", y="HOURLYStationPressure", label="No Rain")
sns.lineplot(data=rain_df, x="DATE", y="HOURLYStationPressure", label="Rain")
plt.title("Pressure (Daily Avg)", size=20)
plt.ylabel("")
plt.legend()
It looks like pressure is lower when it is rainy. This agrees with the well-known relationship that rain is more likely when pressure drops. In summary, humidity and pressure appear useful for recognizing rainy days.
# extract months as string type (categorical: 01, 02, ..., 11, 12)
weather_df['DATE'] = weather_df['DATE'].apply(lambda x : x.strftime('%m'))
weather_df.rename(columns={'DATE': 'MONTH'}, inplace=True)
# change the labels into digits
weather_df['HOURLYPrecip'] = (weather_df['HOURLYPrecip']==1).astype(int)
corr = weather_df.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, annot=True, mask=mask,
cmap='YlGnBu', vmax=0.5, vmin=-0.5)
As seen in the plot, Relative Humidity has the highest correlation with Precipitation, while the other predictors are only moderately related or not related at all. Another noticeable correlation is between the features themselves, such as humidity vs. pressure. This is collinearity, which mainly distorts coefficient estimates and their interpretation; since the goal here is predictive classification rather than inference, it is much less of a concern, although it can still affect models such as logistic regression to some degree.
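To make the collinearity remark concrete, here is a minimal sketch (not in the original analysis) that computes variance inflation factors for the numeric predictors; VIF values above roughly 5-10 are usually read as strong collinearity. It assumes only the column names used above.
from sklearn.linear_model import LinearRegression
num_cols = ['HOURLYDRYBULBTEMPF', 'HOURLYRelativeHumidity', 'HOURLYStationPressure']
X_num = weather_df[num_cols].values
for j, col in enumerate(num_cols):
    others = np.delete(X_num, j, axis=1)            # all predictors except column j
    r2 = LinearRegression().fit(others, X_num[:, j]).score(others, X_num[:, j])
    vif = 1.0 / (1.0 - r2) if r2 < 1 else np.inf    # VIF_j = 1 / (1 - R_j^2)
    print("VIF(%s): %.2f" % (col, vif))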
""" generate random indice without replacement """
np.random.seed(123)
def random_index(idx_range, size):
return np.random.choice(idx_range, size, replace=False)
# divide it by class again for balanced set
rain_df = weather_df[weather_df['HOURLYPrecip']==1]
none_df = weather_df[weather_df['HOURLYPrecip']==0]
# create balanced dataset
none_df = none_df.iloc[random_index(len(none_df), len(rain_df))]
balanced_df = pd.concat([rain_df, none_df])
balanced_df = shuffle(balanced_df)
print("Number of Class Rain:", len(balanced_df[balanced_df['HOURLYPrecip']==1]))
print("Number of Class None:", len(balanced_df[balanced_df['HOURLYPrecip']==0]))
balanced_df.head(10) # classes randomly distributed
# isolate validation set from training set
val_size = int(len(balanced_df)*0.2) # 20% from balanced set
validation_set = balanced_df.iloc[random_index(len(balanced_df), val_size)]
training_set = balanced_df[~balanced_df.index.isin(validation_set.index)]
print("Validation Set Size:", len(validation_set))
print("Training Set Size:", len(training_set))
predictors = set(training_set.columns.values) - set(['HOURLYPrecip'])
# X, y for training set
X_train = training_set[list(predictors)].values
y_train = training_set['HOURLYPrecip']
# X, y for validation set
X_val = validation_set[list(predictors)].values
y_val = validation_set['HOURLYPrecip']
The weather dataset is fundamentally time series data, so standard k-fold cross validation may not be ideal. However, the models used below are not time-series models: they do not exploit the temporal structure of the data. For that reason, we use 5-fold cross validation to measure the general performance of each model, and then evaluate the models against the validation set we set aside earlier.
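For completeness, here is a hedged sketch (not used in the rest of this analysis) of a time-aware alternative using sklearn's TimeSeriesSplit. It has to run on the chronologically ordered data (weather_df), not on the shuffled balanced set, and it uses only the numeric predictors named above.
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import LogisticRegression
ts_cols = ['HOURLYDRYBULBTEMPF', 'HOURLYRelativeHumidity', 'HOURLYStationPressure']
X_ts = weather_df[ts_cols].values
y_ts = weather_df['HOURLYPrecip'].values
tscv = TimeSeriesSplit(n_splits=5)  # each fold trains on the past and validates on the future
scores = cross_val_score(LogisticRegression(max_iter=1000), X_ts, y_ts, cv=tscv, scoring='roc_auc')
print("Per-fold ROC-AUC:", scores.round(3))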
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import *
def grid_CV(model, params, X_train, y_train, X_val, y_val):
    """Run a 5-fold grid search, report validation metrics, and return ROC/score summaries."""
    clf = GridSearchCV(model, params, cv=5)
    clf.fit(X_train, y_train)
    # grid search results
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    print("Score / Parameters")
    for mean, std, param in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) / %r" % (mean, std * 2, param))
    print("Best Parameters:", clf.best_params_)
    # confusion matrix, report
    pred = clf.predict(X_val)
    accu = accuracy_score(y_val, pred)
    print("\n[Confusion Matrix]\n", confusion_matrix(y_val, pred))
    print("\n[Report]\n", classification_report(y_val, pred))
    try:
        # some models do not have 'predict_proba'
        prob = clf.predict_proba(X_val)[:, 1]
    except AttributeError:
        print("This model does not have the method 'predict_proba'")
        prob = pred  # fall back to hard predictions so the metrics below can still be computed
    # ROC, AUC
    fpr, tpr, threshold = roc_curve(y_val, prob)
    auc_score = auc(fpr, tpr)
    # log loss
    ll = log_loss(y_val, prob)
    return fpr, tpr, accu, auc_score, ll
from sklearn.linear_model import LogisticRegression
logis_params = {'C': [1,10,100],
'solver': ['liblinear', 'lbfgs']}
logis_fpr, logis_tpr, logis_accu, logis_auc, logis_lloss = \
grid_CV(LogisticRegression(), logis_params, X_train, y_train, X_val, y_val)
from sklearn.svm import SVC
svc_params = {'C': [1,10,100]}
svc_fpr, svc_tpr, svc_accu, svc_auc, svc_lloss = \
grid_CV(SVC(probability=True), svc_params, X_train, y_train, X_val, y_val)
from sklearn.ensemble import RandomForestClassifier
rf_params = {'n_estimators': [100,500,1000],
'max_depth': [2, 5, 10]}
rf_fpr, rf_tpr, rf_accu, rf_auc, rf_lloss = \
grid_CV(RandomForestClassifier(), rf_params, X_train, y_train, X_val, y_val)
from sklearn.ensemble import GradientBoostingClassifier
gb_params = {'n_estimators': [100, 500, 1000],
             'learning_rate': [0.1, 0.2, 0.3],
             'max_depth': [3, 5, 10]}
gb_fpr, gb_tpr, gb_accu, gb_auc, gb_lloss = \
    grid_CV(GradientBoostingClassifier(), gb_params, X_train, y_train, X_val, y_val)
from sklearn.neural_network import MLPClassifier
mlp_params = {'learning_rate': ['constant', 'adaptive'],
'alpha': 10.0**-np.arange(1,7),
'activation': ['logistic', 'tanh']}
mlp_fpr, mlp_tpr, mlp_accu, mlp_auc, mlp_lloss = \
grid_CV(MLPClassifier(), mlp_params, X_train, y_train, X_val, y_val)
Even though the libraries handle virtually everything, it is worth understanding how these evaluation metrics work in order to use them well. Precision, Recall, and F1-score are derived from the same building blocks; the definitions below assume binary classification.
$$Precision=\frac{True\ Positive}{True\ Positive\ +\ False\ Positive}\ \ Recall=\frac{True\ Positive}{True\ Positive\ +\ False\ Negative}$$
Precision is about the quality of positive predictions: it is perfect (=*1.0*) when there are no false positive cases, which is why it is also called *Positive Predictive Value (PPV)*. Recall measures how many of the actual positive cases were classified as positive, so it is perfect when no positive case is labeled as negative.
$$F1\text{-}score=\frac{2\times\ Precision\ \times\ Recall}{Precision\ +\ Recall}$$
F1-score is the harmonic mean of Precision and Recall, which makes it a good single metric for uneven class distributions. In this case, the classes have already been evened out.
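A tiny worked example (toy labels, not from this dataset) connects the formulas above to the sklearn helpers used later in the report:
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]                     # TP=3, FP=1, FN=1, TN=3
print("Precision:", precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print("Recall:   ", recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print("F1-score: ", f1_score(y_true, y_pred))         # 2*0.75*0.75 / (0.75 + 0.75) = 0.75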
$$Log\ Loss=-\frac{1}{N}\sum_{i=1}^{N}[y_i\log p_i\ +\ (1-y_i)\log (1-p_i)]$$
Unlike the other metrics, a smaller Log Loss means a better model; theoretically it ranges from 0 to infinity. In the binary case $y_i$ is always 0 or 1, and $p_i$ is the predicted probability for each observation. Log Loss measures how confidently a model classifies cases: a model that predicts class 1 with probability 0.96 scores much better than one that predicts the same class with probability 0.51.
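The 0.96 vs. 0.51 comparison can be checked directly (a minimal example, not from this dataset): both predictions are "correct" at a 0.5 cut-off, yet the confident one is penalized far less.
from sklearn.metrics import log_loss
print(log_loss([1], [0.96], labels=[0, 1]))   # -ln(0.96) ≈ 0.041
print(log_loss([1], [0.51], labels=[0, 1]))   # -ln(0.51) ≈ 0.673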
The ROC-AUC shown below is, roughly, Accuracy evaluated across different cut-off points. Accuracy always uses a fixed 50% threshold to decide on a class, whereas ROC-AUC sweeps the decision boundary and measures performance at every threshold. This makes ROC-AUC more useful for evaluating overall performance.
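As a small standalone illustration (toy data, not from this project) of how roc_curve and auc combine inside grid_CV above:
from sklearn.metrics import roc_curve, auc
toy_true = [0, 0, 1, 1]
toy_prob = [0.10, 0.40, 0.35, 0.80]            # predicted probability of class 1
toy_fpr, toy_tpr, toy_thr = roc_curve(toy_true, toy_prob)
print("Thresholds swept:", toy_thr)
print("AUC:", auc(toy_fpr, toy_tpr))           # 0.75 for this toy example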
plt.figure(figsize=(10, 10))
plt.title("Recevier Operationg Characteristic (ROC)", size=20)
plt.xlabel("False Positive Rate $(1 - Specificity)$", size=15)
plt.ylabel("True Positive Rate $(Sensitivity)$", size=15)
plt.plot([0, 1], [0, 1], 'r--', lw=1.5, alpha=0.5, label='Baseline')
plt.plot(logis_fpr, logis_tpr, lw=1.5, label='Logistic Regression')
plt.plot(svc_fpr, svc_tpr, lw=1.5, label='SVC')
plt.plot(rf_fpr, rf_tpr, lw=1.5, label='Random Forest')
plt.plot(gb_fpr, gb_tpr, lw=1.5, label='Gradient Boosting')
plt.plot(mlp_fpr, mlp_tpr, lw=1.5, label='Multi-Layer Perceptron')
plt.legend()
model_names = ['Logistic Regression', 'SVC', 'RandomForest', 'GradientBoosting', 'Multi-Layer Perceptron']
auc_dict = {'AUC': [logis_auc, svc_auc, rf_auc, gb_auc, mlp_auc],
'Accuracy': [logis_accu, svc_accu, rf_accu, gb_accu, mlp_accu],
'LogLoss': [logis_lloss, svc_lloss, rf_lloss, gb_lloss, mlp_lloss]}
auc_rank = pd.DataFrame(data=auc_dict,
index=model_names)
auc_rank.sort_values(by='AUC', ascending=False, inplace=True)
auc_rank
The graph and the table suggest that the Random Forest classifier predicts best. Interestingly, all five models show similar Accuracy, around 0.83, but they differ considerably in AUC and Log-loss. This is because they have different false positive and false negative rates; looking at the confusion matrices, the misclassified cases differ by model. In conclusion, Random Forest should be selected: its AUC and Accuracy are similar to Gradient Boosting's, but its Log-loss is much lower.
Best Predictive Model: Random Forest ('max_depth': 10, 'n_estimators': 100)
Now we measure the performance of the selected model. The goal is to see how well the model trained on the balanced dataset performs compared with the model trained on all the data and with a naive baseline that always predicts 'No rain'.
# create test set considering the rain ratio
none_idx = set(weather_df.index) - set(balanced_df.index) # not from balanced set
extra_none_size = int(len(validation_set)/2/rain_ratio - len(validation_set)) # proportional "NONE"
none_idx = random_index(list(none_idx), extra_none_size)
test_set_idx = list(set(validation_set.index) | set(none_idx)) # partly from validation set
test_set = weather_df.loc[test_set_idx]
# train/test split
final_training_set = weather_df[~weather_df.index.isin(test_set_idx)]
final_test_set = weather_df.loc[test_set_idx]
# using all data
X_full_train, y_full_train = final_training_set[list(predictors)].values, \
final_training_set['HOURLYPrecip'].values
X_test, y_test = final_test_set[list(predictors)].values, \
final_test_set['HOURLYPrecip'].values
One thing to note is that the X and y built from all the data are much larger than the X and y built from the balanced dataset, which gives the full-data model an advantage in sample size but a disadvantage in class balance.
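A quick sanity check of the size gap described above (exact values depend on the run):
print("Full training set:    ", len(X_full_train), "rows")
print("Balanced training set:", len(X_train), "rows")
print("Ratio: ~%.1fx" % (len(X_full_train) / len(X_train)))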
def auc_score(y_true, prob):
    """Compute ROC-AUC from true labels and predicted probabilities."""
    fpr, tpr, threshold = roc_curve(y_true, prob)
    return auc(fpr, tpr)
# always predict the larger class
none_pred = np.zeros(len(X_test)).T # No Rain (=0)
# train with full dataset
rf_full_clf = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=42)
rf_full_clf.fit(X_full_train, y_full_train)
rf_full_pred = rf_full_clf.predict(X_test)
rf_full_prob = rf_full_clf.predict_proba(X_test)[:,1]
# train with balanced dataset
rf_clf = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)
rf_prob = rf_clf.predict_proba(X_test)[:,1]
accu_auc = {'One_Class': [round(accuracy_score(y_test, none_pred), 3),
round(auc_score(y_test, none_pred), 3)],
'Unbalanced': [round(accuracy_score(y_test, rf_full_pred), 3),
round(auc_score(y_test, rf_full_prob), 3)],
'Balanced': [round(accuracy_score(y_test, rf_pred), 3),
round(auc_score(y_test, rf_prob), 3)]}
general_result = pd.DataFrame(data=accu_auc, index=["Accuracy", "AUC"])
general_result
If a model always says "it's not rainy", it will be about 88% accurate because it is not rainy most of the time. The model trained on all the data (unbalanced) shows both high Accuracy and high AUC; in fact, its AUC is a bit higher than the balanced model's, which is likely due to the larger training set.
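The 88% figure is simply the share of non-rainy observations computed earlier, which can be verified in one line:
print("Always-dry baseline accuracy ≈", round(1 - rain_ratio, 3))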
# always predict the larger class
none_pred = np.zeros(len(X_val)).T # No Rain (=0)
# train with full dataset
rf_full_pred = rf_full_clf.predict(X_val)
rf_full_prob = rf_full_clf.predict_proba(X_val)[:,1]
# train with balanced dataset
rf_pred = rf_clf.predict(X_val)
rf_prob = rf_clf.predict_proba(X_val)[:,1]
accu_auc = {'One_Class': [round(accuracy_score(y_val, none_pred), 3),
round(auc_score(y_val, none_pred), 3)],
'Unbalanced': [round(accuracy_score(y_val, rf_full_pred), 3),
round(auc_score(y_val, rf_full_prob), 3)],
'Balanced': [round(accuracy_score(y_val, rf_pred), 3),
round(auc_score(y_val, rf_prob), 3)]}
equal_result = pd.DataFrame(data=accu_auc, index=["Accuracy", "AUC"])
equal_result
Here the model trained on balanced data outperforms the model trained on all the data: the validation set is half 'Rain' and half 'No rain', similar to a period of quickly changing weather. The AUC of the unbalanced model, however, is still higher than the balanced model's. Again, the unbalanced model had roughly ten times more data, enough to rank observations correctly by probability even though many of those probabilities were too low to cross the 'Rain' threshold.
# keep only the 'Rain' observations from the validation set
r_list = []
for i, j in zip(X_val, y_val):
    if j == 1:
        r_list.append(i)
X_val_rain = np.asarray(r_list)
y_val_rain = np.ones(len(r_list), dtype=int)
y_val_rain[0] = 0  # flip one label so both classes are present and roc_curve/auc can be computed
# always predict the larger class
none_pred = np.zeros(len(X_val_rain)).T # No Rain (=0)
# train with full dataset
rf_full_pred = rf_full_clf.predict(X_val_rain)
rf_full_prob = rf_full_clf.predict_proba(X_val_rain)[:,1]
# train with balanced dataset
rf_pred = rf_clf.predict(X_val_rain)
rf_prob = rf_clf.predict_proba(X_val_rain)[:,1]
accu_auc = {'One_Class': [round(accuracy_score(y_val_rain, none_pred), 3),
round(auc_score(y_val_rain, none_pred), 3)],
'Unbalanced': [round(accuracy_score(y_val_rain, rf_full_pred), 3),
round(auc_score(y_val_rain, rf_full_prob), 3)],
'Balanced': [round(accuracy_score(y_val_rain, rf_pred), 3),
round(auc_score(y_val_rain, rf_prob), 3)]}
rain_result = pd.DataFrame(data=accu_auc, index=["Accuracy", "AUC"])
rain_result
If it rained all day, as in a monsoon season, the balanced model does a remarkably better job on accuracy. The unbalanced model drops below 0.3 accuracy because it is biased toward predicting 'No rain'. To sum up, we choose the model trained on the balanced dataset because it is more stable regardless of which class it has to predict.
We analyzed the dataset and built a predictive model. Along the way, we handled the class imbalance by randomly sampling the same number of 'Rain' and 'No rain' observations. Each model was evaluated with 5-fold cross validation, with its best parameters found by grid search.
The models were then evaluated by Accuracy, Precision and Recall, F1-score, and AUC. The selected model, Random Forest, was tested again in different situations by changing the 'Rain' class ratio. While the performance of the unbalanced model varied across situations, the balanced model performed with strong stability, so the balanced model was selected as the final model. However, the model could perform even better if the following were done.
1. Observations: More data is always plus for prediction
2. Features: More features could be helpful if related
3. Modeling: Tweaking parameters would improve the performance (a sketch follows below)
In our case, we only had 4 features and a limited number of rainy observations. The model would do better if we extended the observed period to add more rainy cases or found additional predictors in the dataset, in which case both Accuracy and AUC would likely improve across all scenarios.
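As one example of item 3 above, a broader randomized hyper-parameter search could be run over the Random Forest; the ranges below are illustrative assumptions, not tuned values.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
rf_space = {'n_estimators': [100, 300, 500, 1000],
            'max_depth': [5, 10, 20, None],
            'min_samples_leaf': [1, 2, 5, 10],
            'max_features': ['sqrt', 'log2', None]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42), rf_space,
                            n_iter=20, cv=5, scoring='roc_auc', random_state=42)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best CV ROC-AUC:", round(search.best_score_, 3))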