Week 6. Vowpal Wabbit. Tutorial + Programming Assignment
This week we will get acquainted with the popular Vowpal Wabbit library and try it on site visit data.
Week 6 plan:
- Part 1. Article on Vowpal Wabbit
- Part 2. Applying Vowpal Wabbit to site visit data
- 2.1. Data preparation
- 2.2. Validation on a hold-out set
- 2.3. Validation on the test set (Public Leaderboard)
In this part of the project, the video lectures of the course "Learning from labeled data" may be useful to us.
[The presentation](https://github.com/esokolov/ml-course-msu/blob/master/ML15/lecture-notes/Sem08_vw.pdf) by specialization lecturer Evgeny Sokolov will also be useful, as will, of course, the Vowpal Wabbit documentation.
Part 1. Article about Vowpal Wabbit
Let's read the article about Vowpal Wabbit on Habr from the OpenDataScience open machine learning course series. Download the notebook attached to the article, look through the code, study it, and modify it: that is the only real way to get a feel for Vowpal Wabbit.
Next, let's see Vowpal Wabbit in action. In our competition's binary web-session classification task we would not notice a difference in either quality or speed (although you can check). So we will demonstrate all the agility of VW on a 400-class problem. The raw data is still the same, but 400 users have been selected, and the task is to identify them. Download the data from here, and submit the results here: the files are train_sessions_400users.csv and test_sessions_400users.csv.
import os
import pandas as pd
import numpy as np
import scipy.sparse as sps
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
PATH_TO_DATA = '/content/drive/MyDrive/DATA/Stepik/Kaggle'
Let's load the training and test sets. Note that the test sessions are clearly separated in time from the sessions in the training set.
train_df_400 = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions_400users.csv'),
                           index_col='session_id')
test_df_400 = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions_400users.csv'),
                          index_col='session_id')
test_df_400.shape
train_df_400.head()
We see that there are 182793 sessions in the training set and 46473 in the test set, and the sessions indeed belong to 400 different users.
train_df_400.shape, test_df_400.shape, train_df_400['user_id'].nunique()
Vowpal Wabbit expects class labels to run from 1 to K, where K is the number of classes in the classification problem (400 in our case). So we use LabelEncoder and then add 1 (LabelEncoder maps labels into the range 0 to K-1). Later we will need to apply the inverse transformation.
y = train_df_400.user_id
class_encoder = preprocessing.LabelEncoder()
y_for_vw = class_encoder.fit_transform(y)+1
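As a sanity check, here is a minimal sketch of the full round trip with made-up toy labels (hypothetical values, standing in for real user_id values): encode, shift by +1 for VW, then decode.
toy_labels = np.array([128, 7, 2047, 7])  # hypothetical user_id values
enc = preprocessing.LabelEncoder()
toy_for_vw = enc.fit_transform(toy_labels) + 1    # labels now in 1..K, as VW expects
restored = enc.inverse_transform(toy_for_vw - 1)  # undo the +1, then decode
assert (restored == toy_labels).all()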
Next we will compare VW with SGDClassifier and with logistic regression. All of these models need preprocessed input, so let's prepare sparse matrices for the sklearn models, as we did in part 5:
- combine the training and test sets
- select only the sites (features 'site1' through 'site10')
- replace missing values with zeros (site IDs start from 1, so 0 is free to mean "no site")
- convert to the sparse csr_matrix format (a toy sketch of this step follows the code below)
- split back into the training and test parts
train_test_df = pd.concat([train_df_400, test_df_400])
sites = ['site' + str(i) for i in range(1, 11)]
train_test_df_sites = train_test_df[sites]
train_test_df_sites.isnull().sum().sum()
train_test_df_sites = train_test_df_sites.fillna(0)
idx_split = train_df_400.shape[0]
# Bag-of-sites matrix: each visit contributes a 1 in the column of its site ID;
# rows have a fixed length of 10 sites, which gives the indptr array directly,
# and column 0 (the "no site" filler) is dropped at the end
train_test_sparse = csr_matrix((np.ones(train_test_df_sites.values.size, dtype=np.uint8),
                                train_test_df_sites.values.reshape(-1),
                                np.arange(train_test_df_sites.values.shape[0] + 1) *
                                train_test_df_sites.values.shape[1]))[:, 1:]
X_train_sparse = train_test_sparse[:idx_split, :]
X_test_sparse = train_test_sparse[idx_split:, :]
y = train_df_400['user_id'].values
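To see what the csr_matrix one-liner above builds, here is a minimal sketch on a toy 2x3 "sessions" matrix (made-up values, purely for illustration):
# Toy example: 2 sessions, 3 sites each; 0 means "no site"
toy = np.array([[1, 2, 2],
                [3, 0, 3]])
data = np.ones(toy.size, dtype=np.uint8)             # each visit contributes a 1
indices = toy.reshape(-1)                            # column index = site ID
indptr = np.arange(toy.shape[0] + 1) * toy.shape[1]  # 3 entries per row
toy_sparse = csr_matrix((data, indices, indptr))[:, 1:]  # drop the "no site" column
print(toy_sparse.todense())  # [[1 2 0]
                             #  [0 0 2]]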
Let's split off the training (70%) and hold-out (30%) parts of the original training set. We do not shuffle the data; we take into account that the sessions are ordered in time.
train_share = int(.7 * train_df_400.shape[0])
train_df_part = train_df_400[sites].iloc[:train_share, :]
valid_df = train_df_400[sites].iloc[train_share:, :]
X_train_part_sparse = X_train_sparse[:train_share, :]
X_valid_sparse = X_train_sparse[train_share:, :]
y_train_part = y[:train_share]
y_valid = y[train_share:]
y_train_part_for_vw = y_for_vw[:train_share]
y_valid_for_vw = y_for_vw[train_share:]
Let's implement a function arrays_to_vw that converts a sample to the Vowpal Wabbit format.
Input:
- X, a NumPy matrix (the sample)
- y (optional), the target vector (NumPy); optional because we will process the test matrix with the same function
- train, a flag: True for a training sample, False for a test sample
- out_file, the path to the .vw file the result will be written to
Details:
- iterate over all rows of the matrix X and write out the values separated by spaces, prepending the class label from the vector y and the separator sign |
- in the test sample, an arbitrary label (for example, 1) can be written in place of the target class
def arrays_to_vw(X, y=None, train=True, out_file='tmp.vw'):
    # train is kept for the interface described above; the label logic is driven by y
    # Replace NaNs with 0 ("no site") and cast site IDs to int
    X = np.nan_to_num(X).astype(int)
    with open(out_file, 'w') as f:
        for i in range(X.shape[0]):
            features = ' '.join(str(x) for x in X[i])
            # The test sample has no labels, so write a dummy label of 1
            label = 1 if y is None else y[i]
            f.write('{} | {}\n'.format(label, features))
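For a 10-site session, the function produces lines of the form below (illustrative values); the number before | is the class label, and each site ID after it acts as a categorical feature that VW hashes:
245 | 23 784 302 55 55 0 0 0 0 0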
Let's apply this function to the training part (train_df_part, y_train_part_for_vw), to the hold-out sample (valid_df, y_valid_for_vw), to the entire training set, and to the entire test set. Note that the function accepts NumPy matrices and vectors as input.
%%time
arrays_to_vw(train_df_part.values, y_train_part_for_vw, True, os.path.join(PATH_TO_DATA,'train_part.vw'))
arrays_to_vw(valid_df.values, y_valid_for_vw, False, os.path.join(PATH_TO_DATA,'valid.vw'))
arrays_to_vw(train_df_400[sites].values, y_for_vw, True, os.path.join(PATH_TO_DATA,'train.vw'))
arrays_to_vw(test_df_400[sites].values, None, False, os.path.join(PATH_TO_DATA,'test.vw'))
Let's check the result
!head -3 $PATH_TO_DATA/train_part.vw
!head -3 $PATH_TO_DATA/valid.vw
!head -3 $PATH_TO_DATA/test.vw
Let's train a Vowpal Wabbit model on the train_part.vw sample. We indicate that a classification problem with 400 classes is being solved (--oaa 400) and that we make 3 passes over the sample (--passes 3). We set a cache file (--cache_file, or simply the -c flag) so that VW runs all passes after the first one faster (the -k flag tells VW not to reuse an existing cache file). We also set the parameter b=26, the number of bits used for feature hashing; here we need more than the default 18. Finally, we set random_seed=17. We leave the other parameters at their defaults for now.
train_part_vw = os.path.join(PATH_TO_DATA, 'train_part.vw')
valid_vw = os.path.join(PATH_TO_DATA, 'valid.vw')
train_vw = os.path.join(PATH_TO_DATA, 'train.vw')
test_vw = os.path.join(PATH_TO_DATA, 'test.vw')
model = os.path.join(PATH_TO_DATA, 'vw_model.vw')
pred = os.path.join(PATH_TO_DATA, 'vw_pred.csv')
%%time
!vw --oaa 400 $train_part_vw --passes 3 -c -k -b 26 --random_seed 17 -f $model
Let's write the predictions on the valid.vw sample to vw_valid_pred.csv.
%%time
!vw -i $model -t -d $valid_vw -p $PATH_TO_DATA/vw_valid_pred.csv
Let's read the predictions from vw_valid_pred.csv and look at the accuracy on the hold-out part.
vw_valid = pd.read_csv(os.path.join(PATH_TO_DATA, 'vw_valid_pred.csv'), header=None)
print('Accuracy on the hold-out sample for Vowpal Wabbit: %f' %
      accuracy_score(y_valid_for_vw, vw_valid[0]))
Now let's train SGDClassifier (3 passes over the sample, logistic loss) and LogisticRegression on 70% of the sparse training set (X_train_part_sparse, y_train_part), predict on the hold-out set (X_valid_sparse, y_valid), and compute the accuracy. Logistic regression will not train quickly; that is normal. We specify random_state=17 and n_jobs=-1 everywhere. For SGDClassifier we also specify max_iter=3.
logit = LogisticRegression(random_state=17, n_jobs=-1)
sgd_logit = SGDClassifier(loss='log', random_state=17, max_iter=3, n_jobs=-1)
%%time
logit.fit(X_train_part_sparse, y_train_part)
%%time
sgd_logit.fit(X_train_part_sparse, y_train_part)
Question 1. Calculate the accuracy on the hold-out sample for Vowpal Wabbit, rounded to 3 decimal places.
Question 2. Calculate the accuracy on the hold-out sample for SGD, rounded to 3 decimal places.
Question 3. Calculate the accuracy on the hold-out sample for logistic regression, rounded to 3 decimal places.
vw_valid_acc = accuracy_score(y_valid_for_vw, vw_valid[0])
sgd_valid_acc = accuracy_score(y_valid, sgd_logit.predict(X_valid_sparse))
logit_valid_acc = accuracy_score(y_valid, logit.predict(X_valid_sparse))
def write_answer_to_file(answer, file_address):
    with open(file_address, 'w') as out_f:
        out_f.write(str(answer))
write_answer_to_file(round(vw_valid_acc, 3), os.path.join(PATH_TO_DATA, 'answer6_1.txt'))
write_answer_to_file(round(sgd_valid_acc, 3), os.path.join(PATH_TO_DATA, 'answer6_2.txt'))
write_answer_to_file(round(logit_valid_acc, 3), os.path.join(PATH_TO_DATA, 'answer6_3.txt'))
Let's train a VW model with the same parameters on the entire training set, train.vw.
%%time
!vw --oaa 400 $train_vw --passes 3 -c -k -b 26 --random_seed 17 -f $model
Let's make a forecast for the test sample.
%%time
!vw -t -d $test_vw -i $model -p $PATH_TO_DATA/vw_test_pred.csv
Let's write the predictions to a file, apply the inverse label transformation (subtract the 1 we added, then call inverse_transform of the LabelEncoder), and submit the solution to Kaggle.
def write_to_submission_file(predicted_labels, out_file,
                             target='user_id', index_label="session_id"):
    # Turn predictions into a data frame and save as a CSV file
    predicted_df = pd.DataFrame(predicted_labels,
                                index=np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)
vw_pred = pd.read_csv(os.path.join(PATH_TO_DATA, 'vw_test_pred.csv'), header=None)
vw_subm = class_encoder.inverse_transform(np.ravel(vw_pred) - 1)
write_to_submission_file(vw_subm, os.path.join(PATH_TO_DATA, 'vw_pred_kaggle.csv'))
Let's do the same for SGD and logistic regression.
sgd_logit = SGDClassifier(loss='log', random_state=17, max_iter=3, n_jobs=-1)
# Train on the entire training set, as we do for logistic regression below
sgd_logit.fit(X_train_sparse, y)
sgd_logit_test_pred = sgd_logit.predict(X_test_sparse)
logit = LogisticRegression(random_state=17, n_jobs=-1, solver = 'lbfgs')
logit.fit(X_train_sparse, y)
logit_test_pred = logit.predict(X_test_sparse)
write_to_submission_file(sgd_logit_test_pred,
                         os.path.join(PATH_TO_DATA, 'sgd_pred.csv'))
write_to_submission_file(logit_test_pred,
                         os.path.join(PATH_TO_DATA, 'logit_pred.csv'))
Let's look at the accuracy on the public part (public leaderboard) of this competition's test sample.
Question 4. What is the accuracy on the public part of the test sample (public leaderboard) for Vowpal Wabbit?
Question 5. What is the accuracy on the public part of the test sample (public leaderboard) for SGD?
Question 6. What is the accuracy on the public part of the test sample (public leaderboard) for logistic regression?
vw_lb_score, sgd_lb_score, logit_lb_score = 0.18164, 0.16994, 0.19060
write_answer_to_file(round(vw_lb_score, 3), os.path.join(PATH_TO_DATA,'answer6_4.txt'))
write_answer_to_file(round(sgd_lb_score, 3), os.path.join(PATH_TO_DATA,'answer6_5.txt'))
write_answer_to_file(round(logit_lb_score, 3), os.path.join(PATH_TO_DATA,'answer6_6.txt'))
Logistic regression showed the best result of the three algorithms, but it takes the most time to train. SGD showed the worst result, but it trains quickly. Vowpal Wabbit achieved higher quality than SGD.