Los Alamos National Laboratory Earthquake Prediction
This is my Bronze Medal submission to the 2019 Kaggle competition LANL Earthquake Prediction, which placed me in the top 9% of 4,521 teams.
My solution was based on a couple of ideas shared on Kaggle (a rough sketch of both is given below):
- Vettejeep's idea of splitting the data and generating 24k samples;
- Andrew Lukyanenko's approach to feature generation.
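To make these two ideas concrete, here is a minimal sketch of the kind of pipeline they describe: slide an overlapping window over the raw acoustic signal to multiply the number of training samples, and compute simple statistical features for each window. The window length of 150,000 matches the test segment size; the stride, the helper names and the exact feature set are my own illustrative assumptions, not the code from either kernel. Overlapping windows yield several times more samples than non-overlapping ones, which is roughly how the 24k samples are obtained.

import numpy as np
import pandas as pd

def make_features(segment):
    # simple statistical features for one acoustic segment (illustrative set)
    return {'mean': segment.mean(),
            'std': segment.std(),
            'min': segment.min(),
            'max': segment.max(),
            'q05': np.quantile(segment, 0.05),
            'q95': np.quantile(segment, 0.95),
            'abs_mean': np.abs(segment).mean()}

def generate_samples(train, window=150_000, stride=30_000):
    # slide an overlapping window over the signal to create extra samples;
    # the time_to_failure at the end of each window is the regression target
    rows, targets = [], []
    for start in range(0, len(train) - window, stride):
        seg = train['acoustic_data'].values[start:start + window]
        rows.append(make_features(seg))
        targets.append(train['time_to_failure'].values[start + window - 1])
    return pd.DataFrame(rows), pd.Series(targets, name='target')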
While working on this competition I tried quite a few models (boosting models and neural networks), ensembling and stacking, and different sets of features (from 981 down to 36).
But in the end the best solution was a single GradientBoostingRegressor with 865 features, using KFold with 8 folds and the parameter set {'max_depth': 10, 'learning_rate': 0.1, 'min_samples_split': 2, 'min_samples_leaf': 15}.
After reading through all the top solutions I realized there were a few more tweaks I didn't implement that could have boosted my result significantly, including joining two short earthquakes into a single 'long' one, adding random noise, and subtracting the median from the signal.
And probably the most important step was choosing the right set of features: the 1st place solution used only 6 features.
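As a rough illustration of the last two tweaks (median subtraction and noise augmentation), a preprocessing step along these lines could be applied to each raw segment; the function name and the noise level are placeholder assumptions, not values taken from the winning solutions.

import numpy as np

def preprocess_segment(segment, noise_std=0.5, seed=None):
    # centre the segment by subtracting its median, then add a small amount
    # of Gaussian noise as augmentation (noise_std is an assumed placeholder)
    rng = np.random.default_rng(seed)
    centred = segment - np.median(segment)
    return centred + rng.normal(0.0, noise_std, size=centred.shape)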
Project code
"""
Created on Mon May 6 14:55:13 2019
@author: alex
"""
from sklearn.model_selection import KFold, train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
import pandas as pd
import numpy as np
path = './new_input/'
# pre-computed feature matrices: ~24k samples with 865 scaled features each
df_train = pd.read_csv(path+'scaled_train_X.csv')
y = pd.read_csv(path+'train_y.csv', dtype={'target': np.float32})
df_test = pd.read_csv(path+'scaled_test_X.csv')
submission = pd.read_csv(path+'sample_submission.csv')
# keep 20% aside as a hold-out set for a final sanity check
X_train, X_val, y_train, y_val = train_test_split(df_train, y, test_size=0.2, shuffle=True)
cols = df_train.columns.tolist()
params = {'max_depth': 10,
          'learning_rate': 0.1,
          'min_samples_split': 2,
          'min_samples_leaf': 15}
#
#-----------------------------------------
n_fold = 8
folds = KFold(n_splits=n_fold, shuffle=True, random_state=1970)
# out-of-fold predictions on the CV split and fold-averaged test predictions
oof = np.zeros(len(X_train))
predictions = np.zeros(len(df_test))
# run CV: train one model per fold, collect out-of-fold and averaged test predictions
for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train.values)):
    print("fold {}".format(fold_))
    X_trn, X_valid = X_train.iloc[trn_idx], X_train.iloc[val_idx]
    y_trn, y_valid = y_train.iloc[trn_idx], y_train.iloc[val_idx]
    regressor = GradientBoostingRegressor(**params,
                                          n_estimators=1000,
                                          subsample=1,
                                          verbose=1,
                                          random_state=1970,
                                          validation_fraction=0.2,
                                          n_iter_no_change=20)
    regressor.fit(X_trn, y_trn.values.ravel())
    # out-of-fold predictions for this fold's validation indices
    oof[val_idx] = regressor.predict(X_valid)
    # average the test predictions over all folds
    predictions += regressor.predict(df_test) / folds.n_splits
# refit on the full 80% training split and check MAE on the 20% hold-out set
regressor.fit(X_train, y_train.values.ravel())
val_prediction = regressor.predict(X_val)
acc_09 = mean_absolute_error(y_val, val_prediction)  # hold-out MAE
acc_10 = mean_absolute_error(y_train, oof)           # out-of-fold MAE
print('hold-out MAE: {:.4f}, OOF MAE: {:.4f}'.format(acc_09, acc_10))
# predict on the test set and write the submission file
test_prediction = regressor.predict(df_test)
submission.time_to_failure = test_prediction
submission.to_csv('GradientBoostingRegressor-24k-865f-2.csv', index=False)