Week 5. Kaggle "Catch Me If You Can" Competition
This week we will recall the concept of stochastic gradient descent and try Scikit-learn's SGDClassifier, which works much faster on large samples than the algorithms we tried in week 4. We will also get acquainted with the Kaggle user identification competition and make our first submissions to it.
In this part of the project, the video lectures of the course "Learning from labeled data" may be useful to us:
We can also go back and review the assignment "Linear regression and stochastic gradient descent" from week 1 of the 2nd course of the specialization.
from __future__ import division, print_function
# disable any Anaconda warnings
import warnings
warnings.filterwarnings('ignore')
import os
import pickle
import numpy as np
import pandas as pd
import scipy.sparse as sps
from time import time
from scipy.sparse import csr_matrix, hstack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
Read the competition data into the DataFrames train_df and test_df (training and test sets).
PATH_TO_DATA = '/content/drive/MyDrive/DATA/Stepik/catch_me_if_you_can'
from google.colab import drive
drive.mount('/content/drive')
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'),
index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'),
index_col='session_id')
train_df.head()
Let's combine the training and test sets – we will need this later to convert them to the sparse format consistently.
train_test_df = pd.concat([train_df, test_df])
In the training set we see the following features:
- site1 – index of the first site visited in the session
- time1 – time of visiting the first site in the session
- ...
- site10 – index of the 10th site visited in the session
- time10 – time of visiting the 10th site in the session
- user_id – user ID
User sessions are built so that a session cannot be longer than half an hour or 10 sites. That is, a session is considered over either when the user has visited 10 sites in a row or when the session has lasted more than 30 minutes.
Let's look at the feature statistics.
Missing values occur where sessions are short (fewer than 10 sites). For example, if a person visited vk.com at 20:01 on January 1, 2015, then yandex.ru at 20:29 and google.com at 20:33, their first session will consist of only two sites (site1 – the site ID of vk.com, time1 – 2015-01-01 20:01:00, site2 – the site ID of yandex.ru, time2 – 2015-01-01 20:29:00, the remaining features are NaN), and a new session will start from google.com, because more than 30 minutes have passed since the visit to vk.com.
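The data already comes pre-split into sessions, but as a rough illustration of this rule, here is a minimal sketch (the split_into_sessions helper and its toy visit log are our own, not part of the competition code) that reproduces the example above:
from datetime import datetime, timedelta

def split_into_sessions(visits, max_sites=10, max_duration=timedelta(minutes=30)):
    """Split a chronologically ordered list of (timestamp, site) visits
    into sessions of at most max_sites sites and max_duration length."""
    sessions, current = [], []
    for ts, site in visits:
        # start a new session once 10 sites are reached
        # or more than 30 minutes have passed since the session started
        if current and (len(current) == max_sites or
                        ts - current[0][0] > max_duration):
            sessions.append(current)
            current = []
        current.append((ts, site))
    if current:
        sessions.append(current)
    return sessions

visits = [(datetime(2015, 1, 1, 20, 1), 'vk.com'),
          (datetime(2015, 1, 1, 20, 29), 'yandex.ru'),
          (datetime(2015, 1, 1, 20, 33), 'google.com')]
print([[site for _, site in s] for s in split_into_sessions(visits)])
# -> [['vk.com', 'yandex.ru'], ['google.com']]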
train_df.info()
test_df.head()
test_df.info()
The training set contains 2297 sessions of one user (Alice) and 251264 sessions of other users (not Alice). The class imbalance is very strong, so the proportion of correct answers (accuracy) is not an informative metric.
train_df['target'].value_counts()
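To see why accuracy is uninformative here: a trivial classifier that always answers "not Alice" already gets about 99% accuracy. A quick sanity check (just the majority-class share computed from the counts above):
# accuracy of a constant "not Alice" prediction equals the majority-class share
counts = train_df['target'].value_counts()
print(round(counts.max() / counts.sum(), 3))  # ~0.991, which is why we use ROC AUC below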
For now we will use only the indices of visited sites for prediction. The site indices are numbered from 1, so we replace the missing values with zeros.
train_test_df_sites = train_test_df[['site%d' % i for i in range(1, 11)]].fillna(0).astype('int')
train_test_df_sites.head(10)
Let's create sparse matrices X_train_sparse and X_test_sparse in the same way as we did before. We use the combined matrix train_test_df_sites and then split it back into the training and test parts.
Note that sessions with fewer than 10 sites contain zeros, so the first column (how many times 0 occurred) differs in meaning from the rest (how many times the site with index $i$ occurred). Therefore we will delete the first column of the sparse matrix.
Let's put the target values of the training set into a separate vector y.
def to_sparse(X):
    """Convert a dense matrix of site indices into a sparse bag-of-sites matrix.

    Args:
        X (numpy.ndarray): matrix of site indices, one session per row.
    Returns:
        scipy.sparse.csr_matrix: entry (i, j) is how many times
            the site with index j + 1 occurred in session i.
    """
    # CSR constructor from (data, indices, indptr):
    #   data    - all ones, one per matrix entry,
    #   indices - the flattened site indices (duplicates within a row add up to counts),
    #   indptr  - each row spans exactly X.shape[1] entries.
    # The first column (counts of the zero padding) is dropped.
    return csr_matrix((np.ones(X.size, dtype=int),
                       X.reshape(-1),
                       np.arange(X.shape[0] + 1) * X.shape[1]))[:, 1:]
train_test_sparse = to_sparse(train_test_df_sites.values)
X_train_sparse = train_test_sparse[:train_df.shape[0]]
X_test_sparse = train_test_sparse[train_df.shape[0]:]
y = train_df.target
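To make the bag-of-sites encoding concrete, here is a quick check of to_sparse on a hypothetical toy matrix: column j of the result counts how many times the site with index j + 1 occurred in the session, and the padding-zero column is dropped.
# hypothetical toy input: 2 sessions of at most 5 sites, padded with zeros
toy = np.array([[1, 2, 2, 0, 0],
                [3, 1, 1, 1, 0]])
print(to_sparse(toy).todense())
# expected output:
# [[1 2 0]
#  [3 0 1]]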
Question 1. Output the dimensions of the matrices X_train_sparse and X_test_sparse – 4 numbers on one line separated by a space: the number of rows and columns of the matrix X_train_sparse, then the number of rows and columns of the matrix X_test_sparse.
print(X_train_sparse.shape[0], X_train_sparse.shape[1], X_test_sparse.shape[0], X_test_sparse.shape[1])
Save the objects X_train_sparse, X_test_sparse and y to the pickle files (the latter to the file kaggle_data/train_target.pkl).
with open(os.path.join(PATH_TO_DATA, 'X_train_sparse.pkl'), 'wb') as X_train_sparse_pkl:
pickle.dump(X_train_sparse, X_train_sparse_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'X_test_sparse.pkl'), 'wb') as X_test_sparse_pkl:
pickle.dump(X_test_sparse, X_test_sparse_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'train_target.pkl'), 'wb') as train_target_pkl:
pickle.dump(y, train_target_pkl, protocol=2)
Let's split the training set into two parts in a 7:3 ratio, without shuffling. The original data is ordered in time, and the test set is clearly separated in time from the training set, so we preserve the same property here.
train_share = int(.7 * X_train_sparse.shape[0])
X_train, y_train = X_train_sparse[:train_share, :], y[:train_share]
X_valid, y_valid = X_train_sparse[train_share:, :], y[train_share:]
Let's create an sklearn.linear_model.SGDClassifier object with the logistic loss function and the parameter random_state=17, leave the other parameters at their defaults, and train the model on the sample (X_train, y_train).
# note: in scikit-learn >= 1.3 the logistic loss is spelled loss='log_loss'
sgd_logit = SGDClassifier(loss='log', random_state=17, n_jobs=-1)
sgd_logit.fit(X_train, y_train)
Let's predict the probability that a session belongs to Alice on the hold-out set (X_valid, y_valid).
logit_valid_pred_proba = sgd_logit.predict_proba(X_valid)
Question 2. Calculate the ROC AUC of the logistic regression trained with stochastic gradient descent on the hold-out set. Round it to 3 decimal places.
from sklearn.metrics import roc_auc_score
round(roc_auc_score(y_valid, logit_valid_pred_proba[:, 1]), 3)
Let's predict the probability of class 1 for the test set using the same sgd_logit configuration, now trained on the entire training set (and not just 70% of it).
%%time
sgd_logit = SGDClassifier(loss='log', random_state=17, n_jobs=-1)
sgd_logit.fit(X_train_sparse, y)
logit_test_pred_proba = sgd_logit.predict_proba(X_test_sparse)
We will write the answers to a file and submit it to Kaggle. Let's give our team (of one person) on Kaggle a recognizable name following the template "[YDF & MIPT] Coursera_Username", so that our entry is easy to find on the leaderboard.
def write_to_submission_file(predicted_labels, out_file,
target='target', index_label="session_id"):
# turn predictions into data frame and save as csv file
predicted_df = pd.DataFrame(predicted_labels,
index = np.arange(1, predicted_labels.shape[0] + 1),
columns=[target])
predicted_df.to_csv(out_file, index_label=index_label)
write_to_submission_file(logit_test_pred_proba[:, 1], os.path.join(PATH_TO_DATA, 'prediction_2.csv'))
Improving the model, building new features
Ways to improve
- Use the previously constructed features to improve the model (you can test them on the smaller sample of 150 users by separating one user from the rest – it is faster)
- Tune the model parameters (for example, the regularization coefficient) – see the sketch after this list
- If computing power allows (or you have enough patience), you can try blending the predictions of a boosting model and a linear model. Here is one of the best-known tutorials on blending model predictions; there is also a good article by Alexander Dyakonov
- Note that the competition also provides the raw data on the web pages visited by Alice and the other 1557 users (train.zip). From this data you can build your own training set.
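For instance, the regularization strength alpha of SGDClassifier can be tuned on the same time-based validation split; a minimal sketch (the grid of values below is just an assumption, not a recommendation from the course):
# hedged sketch: try several regularization strengths and keep the best validation ROC AUC
best_auc, best_alpha = 0.0, None
for alpha in [1e-5, 1e-4, 1e-3, 1e-2]:
    model = SGDClassifier(loss='log', alpha=alpha, random_state=17, n_jobs=-1)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
    print('alpha=%g: ROC AUC=%.3f' % (alpha, auc))
    if auc > best_auc:
        best_auc, best_alpha = auc, alpha
print('best alpha:', best_alpha)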
Let's create a feature that is a number of the form YYYYMM built from the session date: for example, 201407 stands for July 2014. This way we take into account a monthly linear trend over the whole period covered by the data.
train_test_df_ver_2 = pd.concat([train_df, test_df])
train_test_df_sites_ver_2 = train_test_df_ver_2[['site%d' % i for i in range(1, 11)]].fillna(0).astype('int')
train_test_df_time_ver_2 = train_test_df_ver_2[['time%d' % i for i in range(1, 11)]].apply(pd.to_datetime)
new_feat = pd.DataFrame(index = train_test_df_time_ver_2.index)
def morning(hour):
if hour <= 11:
return 1
else:
return 0
new_feat['year_month'] = train_test_df_time_ver_2['time1'].apply(lambda ts: 100 * ts.year + ts.month)
new_feat.head()
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
new_feat['year_month_scaled'] = scaler.fit_transform(
    new_feat['year_month'].values.reshape(-1, 1)).ravel()
Let's add two new features: start_hour and morning.
The start_hour feature is the hour at which the session started (from 0 to 23), and the binary morning feature equals 1 if the session started in the morning and 0 if it started later (we will consider it morning if start_hour is 11 or less).
new_feat['start_hour'] = train_test_df_time_ver_2['time1'].apply(lambda ts: ts.hour)
new_feat['morning'] = new_feat['start_hour'].apply(morning)
new_feat = new_feat.drop(['year_month'], axis=1)
train_test_df_sites_new_feat = pd.concat([train_test_df_sites_ver_2, new_feat], axis=1)
train_test_df_sites_new_feat.head()
# the new features are not site indices, so we build the bag-of-sites matrix
# from the site columns only and append the new features with scipy.sparse.hstack
train_test_sparse = csr_matrix(hstack([to_sparse(train_test_df_sites_ver_2.values),
                                       csr_matrix(new_feat.values)]))
X_train_sparse_new_feat = train_test_sparse[:train_df.shape[0]]
X_test_sparse_new_feat = train_test_sparse[train_df.shape[0]:]
train_share = int(.7 * X_train_sparse.shape[0])
X_train, y_train = X_train_sparse_new_feat[:train_share, :], y[:train_share]
X_valid, y_valid = X_train_sparse_new_feat[train_share:, :], y[train_share:]
sgd_logit = SGDClassifier(loss='log', random_state=17, n_jobs=-1)
sgd_logit.fit(X_train, y_train)
logit_valid_pred_proba = sgd_logit.predict_proba(X_valid)
round(roc_auc_score(y_valid, logit_valid_pred_proba[:, 1]), 3)
%%time
sgd_logit = SGDClassifier(loss='log', random_state=17, n_jobs=-1)
sgd_logit.fit(X_train_sparse_new_feat, y)
logit_test_pred_proba = sgd_logit.predict_proba(X_test_sparse_new_feat)
write_to_submission_file(logit_test_pred_proba[:, 1], os.path.join(PATH_TO_DATA, 'prediction_new_feat_2.csv'))
My best result on the leaderboard last time was a score of 0.92159. This time we reached a score of 0.91881.