Week 5. Kaggle "Catch Me If You Can" Competition
This week we will recall the concept of stochastic gradient descent and try Scikit-learn's SGDClassifier, which works much faster on large samples than the algorithms we tried in week 4. We will also get acquainted with the Kaggle user identification competition and make our first submissions to it.
In this part of the project, the video lectures of the course "Learning from labeled data" may be useful to us:
We can also go back and review the assignment "Linear regression and stochastic gradient descent" from week 1 of the 2nd course of the specialization.
from __future__ import division, print_function
# disable any Anaconda warnings
import warnings
warnings.filterwarnings('ignore')
import os
import pickle
import numpy as np
import pandas as pd
import scipy.sparse as sps
from time import time
from scipy.sparse import csr_matrix, hstack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
Read the competition data into the DataFrames train_df and test_df (training and test sets).
PATH_TO_DATA = '/content/drive/MyDrive/DATA/Stepik/catch_me_if_you_can'
from google.colab import drive
drive.mount('/content/drive')
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'),
index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'),
index_col='session_id')
train_df.head()
Let's combine the training and test sets – we will need this later to convert them to the sparse format consistently.
train_test_df = pd.concat([train_df, test_df])
In the training set we see the following features:
- site1 – index of the first site visited in the session
- time1 – time of visiting the first site in the session
- ...
- site10 – index of the 10th site visited in the session
- time10 – time of visiting the 10th site in the session
- user_id – user ID
User sessions are built so that a session cannot be longer than half an hour or 10 sites. That is, a session is considered over either when the user has visited 10 sites in a row or when the session has lasted more than 30 minutes.
Let's look at the feature statistics.
Missing values occur where sessions are short (fewer than 10 sites). For example, if a person visited vk.com at 20:01 on January 1, 2015, then yandex.ru at 20:29 and google.com at 20:33, their first session will consist of only two sites (site1 – the site ID of vk.com, time1 – 2015-01-01 20:01:00, site2 – the site ID of yandex.ru, time2 – 2015-01-01 20:29:00, the remaining features are NaN), and a new session will start from google.com, because more than 30 minutes have passed since the visit to vk.com.
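The data already comes pre-split into sessions, but as a rough illustration of this rule, here is a minimal sketch (the split_into_sessions helper and its toy visit log are our own, not part of the competition code) that reproduces the example above:
from datetime import datetime, timedelta

def split_into_sessions(visits, max_sites=10, max_duration=timedelta(minutes=30)):
    """Split a chronologically ordered list of (timestamp, site) visits
    into sessions of at most max_sites sites and max_duration length."""
    sessions, current = [], []
    for ts, site in visits:
        # start a new session once 10 sites are reached
        # or more than 30 minutes have passed since the session started
        if current and (len(current) == max_sites or
                        ts - current[0][0] > max_duration):
            sessions.append(current)
            current = []
        current.append((ts, site))
    if current:
        sessions.append(current)
    return sessions

visits = [(datetime(2015, 1, 1, 20, 1), 'vk.com'),
          (datetime(2015, 1, 1, 20, 29), 'yandex.ru'),
          (datetime(2015, 1, 1, 20, 33), 'google.com')]
print([[site for _, site in s] for s in split_into_sessions(visits)])
# -> [['vk.com', 'yandex.ru'], ['google.com']]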
train_df.info()
test_df.head()
test_df.info()
The training set contains 2297 sessions of one user (Alice) and 251264 sessions of other users (not Alice). The class imbalance is very strong, so the proportion of correct answers (accuracy) is not an informative metric.
train_df['target'].value_counts()
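To see why accuracy is uninformative here: a trivial classifier that always answers "not Alice" already gets about 99% accuracy. A quick sanity check (just the majority-class share computed from the counts above):
# accuracy of a constant "not Alice" prediction equals the majority-class share
counts = train_df['target'].value_counts()
print(round(counts.max() / counts.sum(), 3))  # ~0.991, which is why we use ROC AUC below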
For now we will use only the indices of visited sites for prediction. The site indices are numbered from 1, so we replace the missing values with zeros.
train_test_df_sites = train_test_df[['site%d' % i for i in range(1, 11)]].fillna(0).astype('int')
train_test_df_sites.head(10)
Let's create sparse matrices X_train_sparse and X_test_sparse in the same way as we did before. We use the combined matrix train_test_df_sites and then split it back into the training and test parts.
Note that sessions with fewer than 10 sites contain zeros, so the first column (how many times 0 occurred) differs in meaning from the rest (how many times the site with index $i$ occurred). Therefore we will delete the first column of the sparse matrix.
Let's put the target values of the training set into a separate vector y.
def to_sparse(X):
    """Convert a dense matrix of site indices into a sparse bag-of-sites matrix.

    Args:
        X (numpy.ndarray): matrix of site indices, one session per row.
    Returns:
        scipy.sparse.csr_matrix: entry (i, j) is how many times
            the site with index j + 1 occurred in session i.
    """
    # CSR constructor from (data, indices, indptr):
    #   data    - all ones, one per matrix entry,
    #   indices - the flattened site indices (duplicates within a row add up to counts),
    #   indptr  - each row spans exactly X.shape[1] entries.
    # The first column (counts of the zero padding) is dropped.
    return csr_matrix((np.ones(X.size, dtype=int),
                       X.reshape(-1),
                       np.arange(X.shape[0] + 1) * X.shape[1]))[:, 1:]
train_test_sparse = to_sparse(train_test_df_sites.values)
X_train_sparse = train_test_sparse[:train_df.shape[0]]
X_test_sparse = train_test_sparse[train_df.shape[0]:]
y = train_df.target
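To make the bag-of-sites encoding concrete, here is a quick check of to_sparse on a hypothetical toy matrix: column j of the result counts how many times the site with index j + 1 occurred in the session, and the padding-zero column is dropped.
# hypothetical toy input: 2 sessions of at most 5 sites, padded with zeros
toy = np.array([[1, 2, 2, 0, 0],
                [3, 1, 1, 1, 0]])
print(to_sparse(toy).todense())
# expected output:
# [[1 2 0]
#  [3 0 1]]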
Question 1. Output the dimensions of the matrices X_train_sparse and X_test_sparse – 4 numbers on one line separated by a space: the number of rows and columns of the matrix X_train_sparse, then the number of rows and columns of the matrix X_test_sparse.
print(X_train_sparse.shape[0], X_train_sparse.shape[1], X_test_sparse.shape[0], X_test_sparse.shape[1])
Save the objects X_train_sparse, X_test_sparse and y to the pickle files (the latter to the file kaggle_data/train_target.pkl).
with open(os.path.join(PATH_TO_DATA, 'X_train_sparse.pkl'), 'wb') as X_train_sparse_pkl:
pickle.dump(X_train_sparse, X_train_sparse_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'X_test_sparse.pkl'), 'wb') as X_test_sparse_pkl:
pickle.dump(X_test_sparse, X_test_sparse_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'train_target.pkl'), 'wb') as train_target_pkl:
pickle.dump(y, train_target_pkl, protocol=2)
Let's split the training set into two parts in a 7:3 ratio, without shuffling. The original data is ordered in time, and the test set is clearly separated in time from the training set, so we preserve the same property here.
train_share = int(.7 * X_train_sparse.shape[0])
X_train, y_train = X_train_sparse[:train_share, :], y[:train_share]
X_valid, y_valid = X_train_sparse[train_share:, :], y[train_share:]
Let's create an sklearn.linear_model.SGDClassifier object with the logistic loss function and the parameter random_state=17, leave the other parameters at their defaults, and train the model on the sample (X_train, y_train).
# note: in scikit-learn >= 1.3 the logistic loss is spelled loss='log_loss'
sgd_logit = SGDClassifier(loss='log', random_state=17, n_jobs=-1)
sgd_logit.fit(X_train, y_train)
Let's predict the probability that a session belongs to Alice on the hold-out set (X_valid, y_valid).
logit_valid_pred_proba = sgd_logit.predict_proba(X_valid)
Question 2. Calculate the ROC AUC of the logistic regression trained with stochastic gradient descent on the hold-out set. Round it to 3 decimal places.
from sklearn.metrics import roc_auc_score
round(roc_auc_score(y_valid, logit_valid_pred_proba[:, 1]), 3)
Let's predict the probability of class 1 for the test set using the same sgd_logit configuration, now trained on the entire training set (and not just 70% of it).
%%time
sgd_logit = SGDClassifier(loss='log', random_state=17, n_jobs=-1)
sgd_logit.fit(X_train_sparse, y)
logit_test_pred_proba = sgd_logit.predict_proba(X_test_sparse)
We will write the answers to a file and submit it to Kaggle. Let's give our team (of one person) on Kaggle a recognizable name following the template "[YDF & MIPT] Coursera_Username", so that our entry is easy to find on the leaderboard.
def write_to_submission_file(predicted_labels, out_file,
target='target', index_label="session_id"):
# turn predictions into data frame and save as csv file
predicted_df = pd.DataFrame(predicted_labels,
index = np.arange(1, predicted_labels.shape[0] + 1),
columns=[target])
predicted_df.to_csv(out_file, index_label=index_label)
write_to_submission_file(logit_test_pred_proba[:, 1], os.path.join(PATH_TO_DATA, 'prediction_2.csv'))
Improving the model, building new features
Ways to improve
- Use the previously constructed features to improve the model (you can test them on the smaller sample of 150 users by separating one user from the rest – it is faster)
- Tune the model parameters (for example, the regularization coefficient) – see the sketch after this list
- If computing power allows (or you have enough patience), you can try blending the predictions of a boosting model and a linear model. Here is one of the best-known tutorials on blending model predictions; there is also a good article by Alexander Dyakonov
- Note that the competition also provides the raw data on the web pages visited by Alice and the other 1557 users (train.zip). From this data you can build your own training set.
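For instance, the regularization strength alpha of SGDClassifier can be tuned on the same time-based validation split; a minimal sketch (the grid of values below is just an assumption, not a recommendation from the course):
# hedged sketch: try several regularization strengths and keep the best validation ROC AUC
best_auc, best_alpha = 0.0, None
for alpha in [1e-5, 1e-4, 1e-3, 1e-2]:
    model = SGDClassifier(loss='log', alpha=alpha, random_state=17, n_jobs=-1)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
    print('alpha=%g: ROC AUC=%.3f' % (alpha, auc))
    if auc > best_auc:
        best_auc, best_alpha = auc, alpha
print('best alpha:', best_alpha)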
Let's create a feature that is a number of the form YYYYMM built from the session date: for example, 201407 stands for July 2014. This way we take into account a monthly linear trend over the whole period covered by the data.
train_test_df_ver_2 = pd.concat([train_df, test_df])
train_test_df_sites_ver_2 = train_test_df_ver_2[['site%d' % i for i in range(1, 11)]].fillna(0).astype('int')
train_test_df_time_ver_2 = train_test_df_ver_2[['time%d' % i for i in range(1, 11)]].apply(pd.to_datetime)
new_feat = pd.DataFrame(index = train_test_df_time_ver_2.index)
def morning(hour):
if hour <= 11:
return 1
else:
return 0
new_feat['year_month'] = train_test_df_time_ver_2['time1'].apply(lambda ts: 100 * ts.year + ts.month)
new_feat.head()
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
new_feat['year_month_scaled'] = scaler.fit_transform(
    new_feat['year_month'].values.reshape(-1, 1)).ravel()
Let's add two new features: start_hour and morning.
The start_hour feature is the hour at which the session started (from 0 to 23), and the binary morning feature equals 1 if the session started in the morning and 0 if it started later (we will consider it morning if start_hour is 11 or less).
new_feat['start_hour'] = train_test_df_time_ver_2['time1'].apply(lambda ts: ts.hour)
new_feat['morning'] = new_feat['start_hour'].apply(morning)
new_feat = new_feat.drop(['year_month'], axis=1)
train_test_df_sites_new_feat = pd.concat([train_test_df_sites_ver_2, new_feat], axis=1)
train_test_df_sites_new_feat.head()
# the new features are not site indices, so we build the bag-of-sites matrix
# from the site columns only and append the new features with scipy.sparse.hstack
train_test_sparse = csr_matrix(hstack([to_sparse(train_test_df_sites_ver_2.values),
                                       csr_matrix(new_feat.values)]))
X_train_sparse_new_feat = train_test_sparse[:train_df.shape[0]]
X_test_sparse_new_feat = train_test_sparse[train_df.shape[0]:]
train_share = int(.7 * X_train_sparse.shape[0])
X_train, y_train = X_train_sparse_new_feat[:train_share, :], y[:train_share]
X_valid, y_valid = X_train_sparse_new_feat[train_share:, :], y[train_share:]
sgd_logit = SGDClassifier(loss='log', random_state=17, n_jobs=-1)
sgd_logit.fit(X_train, y_train)
logit_valid_pred_proba = sgd_logit.predict_proba(X_valid)
round(roc_auc_score(y_valid, logit_valid_pred_proba[:, 1]), 3)
%%time
sgd_logit = SGDClassifier(loss='log', random_state=17, n_jobs=-1)
sgd_logit.fit(X_train_sparse_new_feat, y)
logit_test_pred_proba = sgd_logit.predict_proba(X_test_sparse_new_feat)
write_to_submission_file(logit_test_pred_proba[:, 1], os.path.join(PATH_TO_DATA, 'prediction_new_feat_2.csv'))
My best result on the leaderboard last time was a score of 0.92159. This time we reached a score of 0.91881.