Week 5. Kaggle "Catch Me If You Can" Competition

This week we will recall the concept of stochastic gradient descent and try out scikit-learn's SGDClassifier, which works much faster on large samples than the algorithms we tested in week 4. We will also get acquainted with the Kaggle competition on user identification and make our first submissions to it.

In this part of the project, video lectures from the course "Learning from labeled data" may come in handy.

We can also go back and review the assignment "Linear regression and stochastic gradient descent" from week 1 of the 2nd course of the specialization.

from __future__ import division, print_function
# disable any Anaconda warnings
import warnings
warnings.filterwarnings('ignore')
import os
import pickle
import numpy as np
import pandas as pd
import scipy.sparse as sps
from time import time
from scipy.sparse import csr_matrix, hstack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

Let's read the competition data into the DataFrames train_df and test_df (training and test samples).

PATH_TO_DATA = '/content/drive/MyDrive/DATA/Stepik/catch_me_if_you_can'
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'),
                       index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'),
                      index_col='session_id')
train_df.head()
site1 time1 site2 time2 site3 time3 site4 time4 site5 time5 site6 time6 site7 time7 site8 time8 site9 time9 site10 time10 target
session_id
1 718 2014-02-20 10:02:45 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0
2 890 2014-02-22 11:19:50 941.0 2014-02-22 11:19:50 3847.0 2014-02-22 11:19:51 941.0 2014-02-22 11:19:51 942.0 2014-02-22 11:19:51 3846.0 2014-02-22 11:19:51 3847.0 2014-02-22 11:19:52 3846.0 2014-02-22 11:19:52 1516.0 2014-02-22 11:20:15 1518.0 2014-02-22 11:20:16 0
3 14769 2013-12-16 16:40:17 39.0 2013-12-16 16:40:18 14768.0 2013-12-16 16:40:19 14769.0 2013-12-16 16:40:19 37.0 2013-12-16 16:40:19 39.0 2013-12-16 16:40:19 14768.0 2013-12-16 16:40:20 14768.0 2013-12-16 16:40:21 14768.0 2013-12-16 16:40:22 14768.0 2013-12-16 16:40:24 0
4 782 2014-03-28 10:52:12 782.0 2014-03-28 10:52:42 782.0 2014-03-28 10:53:12 782.0 2014-03-28 10:53:42 782.0 2014-03-28 10:54:12 782.0 2014-03-28 10:54:42 782.0 2014-03-28 10:55:12 782.0 2014-03-28 10:55:42 782.0 2014-03-28 10:56:12 782.0 2014-03-28 10:56:42 0
5 22 2014-02-28 10:53:05 177.0 2014-02-28 10:55:22 175.0 2014-02-28 10:55:22 178.0 2014-02-28 10:55:23 177.0 2014-02-28 10:55:23 178.0 2014-02-28 10:55:59 175.0 2014-02-28 10:55:59 177.0 2014-02-28 10:55:59 177.0 2014-02-28 10:57:06 178.0 2014-02-28 10:57:11 0

Let's combine the training and test samples: we will need this later to convert them to the sparse format together.

train_test_df = pd.concat([train_df, test_df])

In the training sample we have the following features:

  • site1 - index of the first visited site in the session
  • time1 – time of visiting the first site in the session
  • ...
  • site10 - index of the 10th visited site in the session
  • time10 – time of visiting the 10th site in the session
  • target - 1 if the session belongs to Alice, 0 otherwise

User sessions are constructed so that a session cannot be longer than half an hour or contain more than 10 sites. That is, a session is considered over either when the user has visited 10 sites in a row or when the session has lasted more than 30 minutes.

Let's look at the feature statistics.

Missing values occur where sessions are short (fewer than 10 sites). For example, if a person visited vk.com at 20:01 on January 1, 2015, then yandex.ru at 20:29 and then google.com at 20:33, their first session will consist of only two sites (site1 – the ID of vk.com, time1 – 2015-01-01 20:01:00, site2 – the ID of yandex.ru, time2 – 2015-01-01 20:29:00, the remaining features are NaN), and a new session will start from google.com, because more than 30 minutes have passed since the visit to vk.com.
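
As a rough illustration of this rule, here is a minimal sketch (the split_sessions helper and the toy log are hypothetical, not part of the competition data) that cuts a chronologically sorted click log into sessions:

import pandas as pd

SESSION_LENGTH = 10                       # at most 10 sites per session
SESSION_TIMEOUT = pd.Timedelta('30min')   # or at most 30 minutes from the session start

def split_sessions(log):
    """Split a time-ordered list of (site, timestamp) pairs into sessions."""
    sessions, current = [], []
    for site, ts in log:
        # start a new session after 10 sites or once 30 minutes have passed
        if current and (len(current) == SESSION_LENGTH
                        or ts - current[0][1] > SESSION_TIMEOUT):
            sessions.append(current)
            current = []
        current.append((site, ts))
    if current:
        sessions.append(current)
    return sessions

toy_log = [('vk.com', pd.Timestamp('2015-01-01 20:01')),
           ('yandex.ru', pd.Timestamp('2015-01-01 20:29')),
           ('google.com', pd.Timestamp('2015-01-01 20:33'))]
print(len(split_sessions(toy_log)))  # 2 sessions, as in the example above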

train_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 253561 entries, 1 to 253561
Data columns (total 21 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   site1   253561 non-null  int64  
 1   time1   253561 non-null  object 
 2   site2   250098 non-null  float64
 3   time2   250098 non-null  object 
 4   site3   246919 non-null  float64
 5   time3   246919 non-null  object 
 6   site4   244321 non-null  float64
 7   time4   244321 non-null  object 
 8   site5   241829 non-null  float64
 9   time5   241829 non-null  object 
 10  site6   239495 non-null  float64
 11  time6   239495 non-null  object 
 12  site7   237297 non-null  float64
 13  time7   237297 non-null  object 
 14  site8   235224 non-null  float64
 15  time8   235224 non-null  object 
 16  site9   233084 non-null  float64
 17  time9   233084 non-null  object 
 18  site10  231052 non-null  float64
 19  time10  231052 non-null  object 
 20  target  253561 non-null  int64  
dtypes: float64(9), int64(2), object(10)
memory usage: 42.6+ MB
test_df.head()
site1 time1 site2 time2 site3 time3 site4 time4 site5 time5 site6 time6 site7 time7 site8 time8 site9 time9 site10 time10
session_id
1 29 2014-10-04 11:19:53 35.0 2014-10-04 11:19:53 22.0 2014-10-04 11:19:54 321.0 2014-10-04 11:19:54 23.0 2014-10-04 11:19:54 2211.0 2014-10-04 11:19:54 6730.0 2014-10-04 11:19:54 21.0 2014-10-04 11:19:54 44582.0 2014-10-04 11:20:00 15336.0 2014-10-04 11:20:00
2 782 2014-07-03 11:00:28 782.0 2014-07-03 11:00:53 782.0 2014-07-03 11:00:58 782.0 2014-07-03 11:01:06 782.0 2014-07-03 11:01:09 782.0 2014-07-03 11:01:10 782.0 2014-07-03 11:01:23 782.0 2014-07-03 11:01:29 782.0 2014-07-03 11:01:30 782.0 2014-07-03 11:01:53
3 55 2014-12-05 15:55:12 55.0 2014-12-05 15:55:13 55.0 2014-12-05 15:55:14 55.0 2014-12-05 15:56:15 55.0 2014-12-05 15:56:16 55.0 2014-12-05 15:56:17 55.0 2014-12-05 15:56:18 55.0 2014-12-05 15:56:19 1445.0 2014-12-05 15:56:33 1445.0 2014-12-05 15:56:36
4 1023 2014-11-04 10:03:19 1022.0 2014-11-04 10:03:19 50.0 2014-11-04 10:03:20 222.0 2014-11-04 10:03:21 202.0 2014-11-04 10:03:21 3374.0 2014-11-04 10:03:22 50.0 2014-11-04 10:03:22 48.0 2014-11-04 10:03:22 48.0 2014-11-04 10:03:23 3374.0 2014-11-04 10:03:23
5 301 2014-05-16 15:05:31 301.0 2014-05-16 15:05:32 301.0 2014-05-16 15:05:33 66.0 2014-05-16 15:05:39 67.0 2014-05-16 15:05:40 69.0 2014-05-16 15:05:40 70.0 2014-05-16 15:05:40 68.0 2014-05-16 15:05:40 71.0 2014-05-16 15:05:40 167.0 2014-05-16 15:05:44
test_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 82797 entries, 1 to 82797
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   site1   82797 non-null  int64  
 1   time1   82797 non-null  object 
 2   site2   81308 non-null  float64
 3   time2   81308 non-null  object 
 4   site3   80075 non-null  float64
 5   time3   80075 non-null  object 
 6   site4   79182 non-null  float64
 7   time4   79182 non-null  object 
 8   site5   78341 non-null  float64
 9   time5   78341 non-null  object 
 10  site6   77566 non-null  float64
 11  time6   77566 non-null  object 
 12  site7   76840 non-null  float64
 13  time7   76840 non-null  object 
 14  site8   76151 non-null  float64
 15  time8   76151 non-null  object 
 16  site9   75484 non-null  float64
 17  time9   75484 non-null  object 
 18  site10  74806 non-null  float64
 19  time10  74806 non-null  object 
dtypes: float64(9), int64(1), object(10)
memory usage: 13.3+ MB

The training sample contains 2297 sessions of one user (Alice) and 251264 sessions of other users (not Alice). The class imbalance is very strong, so accuracy (the proportion of correct answers) is not an informative metric here.

train_df['target'].value_counts()
0    251264
1      2297
Name: target, dtype: int64
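
For instance, a trivial classifier that always predicts class 0 would already be right about 99% of the time, which is why we will rely on ROC AUC instead; a quick check, assuming the counts above:

# share of the majority class: a constant "not Alice" prediction is ~0.991 accurate
print(round((train_df['target'] == 0).mean(), 3))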

For now we will use only the indices of the visited sites for prediction. The indices are numbered starting from 1, so we replace the missing values with zeros.

train_test_df_sites = train_test_df[['site%d' % i for i in range(1, 11)]].fillna(0).astype('int')
train_test_df_sites.head(10)
site1 site2 site3 site4 site5 site6 site7 site8 site9 site10
session_id
1 718 0 0 0 0 0 0 0 0 0
2 890 941 3847 941 942 3846 3847 3846 1516 1518
3 14769 39 14768 14769 37 39 14768 14768 14768 14768
4 782 782 782 782 782 782 782 782 782 782
5 22 177 175 178 177 178 175 177 177 178
6 570 21 570 21 21 0 0 0 0 0
7 803 23 5956 17513 37 21 803 17514 17514 17514
8 22 21 29 5041 14422 23 21 5041 14421 14421
9 668 940 942 941 941 942 940 23 21 22
10 3700 229 570 21 229 21 21 21 2336 2044

Let's create the sparse matrices X_train_sparse and X_test_sparse in the same way as we did earlier. We use the combined matrix train_test_df_sites and then split it back into the training and test parts.

Note that zeros mark sessions with fewer than 10 sites, so the first feature (how many times 0 occurred) differs in meaning from the rest (how many times the site with index $i$ occurred). Therefore, we delete the first column of the sparse matrix.

Let's put the answers of the training sample into a separate vector y.

def to_sparse(X):
    """Transformation of the matrix from dense to sparse.

    Args:
        X (numpy.ndarray): The original (dense) matrix.

    Returns:
        scipy.sparse.csr.csr_matrix: Sparse matrix.
    
    """
    # CSR components: data is all ones (each visit counts once), the column
    # indices are the site IDs themselves, and indptr marks row boundaries
    # (every session occupies exactly X.shape[1] entries).
    # Column 0 corresponds to the "no site" placeholder, so we drop it.
    return csr_matrix((np.ones(X.size, dtype=int),
                       X.reshape(-1),
                       np.arange(X.shape[0] + 1) * X.shape[1]))[:, 1:]


train_test_sparse = to_sparse(train_test_df_sites.values)
X_train_sparse = train_test_sparse[:train_df.shape[0]]
X_test_sparse = train_test_sparse[train_df.shape[0]:]
y = train_df.target
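
To see what this representation gives, we can apply to_sparse to a tiny toy matrix (an illustrative sketch; the toy values are made up):

toy = np.array([[1, 2, 1],
                [3, 0, 0]])  # two "sessions", value 0 means "no site"
print(to_sparse(toy).todense())
# [[2 1 0]
#  [0 0 1]]  -- per-session counts of sites 1..3, the column for 0 is dropped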

Question 1. Output the dimensions of the matrices X_train_sparse and X_test_sparse – 4 numbers on one line separated by a space: the number of rows and columns of the matrix X_train_sparse, then the number of rows and columns of the matrix X_test_sparse.

print(X_train_sparse.shape[0], X_train_sparse.shape[1],
      X_test_sparse.shape[0], X_test_sparse.shape[1])
253561 48371 82797 48371

Let's save the objects X_train_sparse, X_test_sparse and y to pickle files (the latter to the file train_target.pkl).

with open(os.path.join(PATH_TO_DATA, 'X_train_sparse.pkl'), 'wb') as X_train_sparse_pkl:
    pickle.dump(X_train_sparse, X_train_sparse_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'X_test_sparse.pkl'), 'wb') as X_test_sparse_pkl:
    pickle.dump(X_test_sparse, X_test_sparse_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'train_target.pkl'), 'wb') as train_target_pkl:
    pickle.dump(y, train_target_pkl, protocol=2)
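
To make sure the dumps can be read back later, a quick optional check:

with open(os.path.join(PATH_TO_DATA, 'X_train_sparse.pkl'), 'rb') as f:
    X_train_sparse_check = pickle.load(f)
# the reloaded matrix should have the same shape as the original
assert X_train_sparse_check.shape == X_train_sparse.shape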

Let's split the training sample into 2 parts in the proportion 7/3, without shuffling. The initial data is ordered by time, and the test sample is clearly separated in time from the training one, so we preserve the same property here.

train_share = int(.7 * X_train_sparse.shape[0])
X_train, y_train = X_train_sparse[:train_share, :], y[:train_share]
X_valid, y_valid  = X_train_sparse[train_share:, :], y[train_share:]

Let's create a sklearn.linear_model.SGDClassifier object with the logistic loss function and the parameter random_state=17, leaving the other parameters at their defaults, and train the model on (X_train, y_train).

sgd_logit = SGDClassifier(loss='log', random_state=17, n_jobs=-1)
sgd_logit.fit(X_train, y_train)
SGDClassifier(loss='log', n_jobs=-1, random_state=17)

Let's make a prediction in the form of predicted probabilities that the session is Alice's, on the hold-out sample (X_valid, y_valid).

logit_valid_pred_proba = sgd_logit.predict_proba(X_valid)

Question 2. Calculate the ROC AUC of the logistic regression trained with stochastic gradient descent on the hold-out sample. Round it to 3 decimal places.

from sklearn.metrics import roc_auc_score
round(roc_auc_score(y_valid, logit_valid_pred_proba[:, 1]), 3)
0.934

Let's make a prediction in the form of predicted probabilities of belonging to class 1 for the test sample, using the same sgd_logit trained on the entire training sample (rather than on 70% of it).

%%time
sgd_logit = SGDClassifier(loss='log', random_state=17, n_jobs=-1)
sgd_logit.fit(X_train_sparse, y)
logit_test_pred_proba = sgd_logit.predict_proba(X_test_sparse)
CPU times: user 775 ms, sys: 104 ms, total: 879 ms
Wall time: 777 ms

We will write the answers to a file and submit it to Kaggle. Let's give our one-person team on Kaggle a descriptive name following the template "[YDF & MIPT] Coursera_Username", so that we can easily find our result on the leaderboard.

def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    # turn predictions into data frame and save as csv file
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)
write_to_submission_file(logit_test_pred_proba[:, 1], os.path.join(PATH_TO_DATA, 'prediction_2.csv'))

Improving the model, building new features

Ways to improve

  • Use the previously constructed features to improve the model (you can check them on a smaller sample of 150 users by separating one of the users from the rest – it is faster)
  • Tune the model parameters (for example, the regularization coefficient) – a small sketch follows this list
  • If computing power (or your patience) allows, you can try blending the responses of a boosting model and a linear model; there are well-known tutorials on blending algorithm responses, as well as a good article by Alexander Dyakonov on the topic
  • Note that the competition also provides the raw data on the web pages visited by Alice and by the other 1557 users (train.zip). From this data you can build your own training sample.
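
As an illustration of the second point, here is a minimal sketch (the alpha grid is arbitrary) comparing several regularization strengths of SGDClassifier on the hold-out sample built earlier:

# compare hold-out ROC AUC for a few regularization coefficients
for alpha in [1e-5, 1e-4, 1e-3]:
    model = SGDClassifier(loss='log', alpha=alpha, random_state=17, n_jobs=-1)
    model.fit(X_train, y_train)
    pred = model.predict_proba(X_valid)[:, 1]
    print(alpha, round(roc_auc_score(y_valid, pred), 3))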

Let's create a feature that is a number of the form YYYYMM derived from the date when the session took place: for example, 201407 stands for July 2014. In this way we account for a monthly linear trend over the whole period covered by the data.

train_test_df_ver_2 = pd.concat([train_df, test_df])
train_test_df_sites_ver_2 = train_test_df_ver_2[['site%d' % i for i in range(1, 11)]].fillna(0).astype('int')
train_test_df_time_ver_2 = train_test_df_ver_2[['time%d' % i for i in range(1, 11)]].fillna(0).astype('datetime64[ns]')
new_feat = pd.DataFrame(index = train_test_df_time_ver_2.index)
def morning(hour):
    if hour <= 11:
        return 1
    else: 
        return 0
new_feat['year_month'] = train_test_df_time_ver_2['time1'].apply(lambda ts: 100 * ts.year + ts.month)
new_feat.head()
year_month
session_id
1 201402
2 201402
3 201312
4 201403
5 201402
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
new_feat['year_month_scaled'] = scaler.fit_transform(new_feat['year_month'].values.reshape(-1,1))

Let's add two new features: start_hour and morning.

The start_hour feature is the hour at which the session started (from 0 to 23), and the binary morning feature is 1 if the session started in the morning and 0 if it started later (we assume that it is morning if start_hour is 11 or less).

new_feat['start_hour'] = train_test_df_time_ver_2['time1'].apply(lambda ts: ts.hour)
new_feat['morning'] = new_feat['start_hour'].apply(morning)
new_feat = new_feat.drop(['year_month'], axis=1)
new_feat.head()
year_month_scaled start_hour morning
session_id
1 0.476232 10 1
2 0.476232 11 1
3 -1.800775 16 0
4 0.501532 10 1
5 0.476232 10 1
train_test_df_sites_new_feat = pd.concat([train_test_df_sites_ver_2, new_feat], axis=1)
train_test_df_sites_new_feat.head()
site1 site2 site3 site4 site5 site6 site7 site8 site9 site10 year_month_scaled start_hour morning
session_id
1 718 0 0 0 0 0 0 0 0 0 0.476232 10 1
2 890 941 3847 941 942 3846 3847 3846 1516 1518 0.476232 11 1
3 14769 39 14768 14769 37 39 14768 14768 14768 14768 -1.800775 16 0
4 782 782 782 782 782 782 782 782 782 782 0.501532 10 1
5 22 177 175 178 177 178 175 177 177 178 0.476232 10 1
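
Note that to_sparse interprets every cell value as a column index, which is exactly what we want for site IDs but questionable for real-valued features such as year_month_scaled. Below we still feed the whole combined matrix into to_sparse; an alternative worth trying (a sketch only, using the variables defined above) is to convert just the site columns and append the three new features as extra columns via scipy.sparse.hstack:

# sketch: keep the bag-of-sites counts sparse and attach the hand-made
# features as ordinary columns instead of treating them as site indices
sites_sparse = to_sparse(train_test_df_sites_ver_2.values)
extra_sparse = csr_matrix(new_feat[['year_month_scaled', 'start_hour', 'morning']].values)
train_test_sparse_alt = hstack([sites_sparse, extra_sparse]).tocsr()
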
train_test_sparse = to_sparse(train_test_df_sites_new_feat.values)
X_train_sparse_new_feat = train_test_sparse[:train_df.shape[0]]
X_test_sparse_new_feat = train_test_sparse[train_df.shape[0]:]
train_share = int(.7 * X_train_sparse.shape[0])
X_train, y_train = X_train_sparse_new_feat[:train_share, :], y[:train_share]
X_valid, y_valid  = X_train_sparse_new_feat[train_share:, :], y[train_share:]
sgd_logit = SGDClassifier(loss='log', random_state=17, n_jobs=-1)
sgd_logit.fit(X_train, y_train)
SGDClassifier(loss='log', n_jobs=-1, random_state=17)
logit_valid_pred_proba = sgd_logit.predict_proba(X_valid)
round(roc_auc_score(y_valid, logit_valid_pred_proba[:, 1]), 3)
0.96
%%time
sgd_logit = SGDClassifier(loss='log', random_state=17, n_jobs=-1)
sgd_logit.fit(X_train_sparse_new_feat, y)
logit_test_pred_proba = sgd_logit.predict_proba(X_test_sparse_new_feat)
CPU times: user 1.05 s, sys: 77.9 ms, total: 1.13 s
Wall time: 1.19 s
write_to_submission_file(logit_test_pred_proba[:, 1], os.path.join(PATH_TO_DATA, 'prediction_new_feat_2.csv'))

My best result on the leaderboard last time was a score of 0.92159. This time we have reached a score of 0.91881.