- Part 1. Comparison of several algorithms in sessions of 10 sites
- Part 2. Selection of parameters - session length and window width
- Part 3. Identification of a specific user and learning curves
- Ways to improve
Week 4. Comparison of classification algorithms
Now we will finally get to training classification models, compare several algorithms on cross-validation, and figure out which session parameters (session_length and window_size) are better to use. Also, for the selected algorithm, we will construct validation curves (how classification quality depends on one of the algorithm's hyperparameters) and learning curves (how classification quality depends on the sample size).
Week 4 plan:
- Part 1. Comparison of several algorithms in sessions of 10 sites
- Part 2. Selection of parameters - session length and window width
- Part 3. Identification of a specific user and learning curves
In this part of the project, video recordings of the following lectures from the course "Learning from labeled data" may be useful:
#%load_ext watermark
from __future__ import division, print_function
# disable any Anaconda warnings
import warnings
warnings.filterwarnings('ignore')
from time import time
import itertools
import os
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
import pickle
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score
from google.colab import drive
drive.mount('/content/drive')
PATH_TO_DATA = '/content/drive/MyDrive/DATA/Stepik/capstone_user_identification'
Let's load the previously serialized objects X_sparse_10users and y_10users, corresponding to the training sample for 10 users.
with open(os.path.join(PATH_TO_DATA,
                       'X_sparse_10users.pkl'), 'rb') as X_sparse_10users_pkl:
    X_sparse_10users = pickle.load(X_sparse_10users_pkl)
with open(os.path.join(PATH_TO_DATA,
                       'y_10users.pkl'), 'rb') as y_10users_pkl:
    y_10users = pickle.load(y_10users_pkl)
There are more than 14 thousand sessions and almost 5 thousand unique visited sites.
X_sparse_10users.shape
Let's split the sample into two parts: on the first we will run cross-validation, on the second we will evaluate the model trained after cross-validation.
# X_sparse_10users = X_sparse_10users.todense()
X_train, X_valid, y_train, y_valid = train_test_split(X_sparse_10users, y_10users,
test_size=0.3,
random_state=17, stratify=y_10users)
Let's fix the cross-validation scheme in advance: 3-fold, with shuffling, and random_state=17 for reproducibility.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)
Auxiliary function for plotting validation curves after running GridSearchCV (or RandomizedSearchCV).
def plot_validation_curves(param_values, grid_cv_results_):
    train_mu, train_std = grid_cv_results_['mean_train_score'], grid_cv_results_['std_train_score']
    valid_mu, valid_std = grid_cv_results_['mean_test_score'], grid_cv_results_['std_test_score']
    train_line = plt.plot(param_values, train_mu, '-', label='train', color='green')
    valid_line = plt.plot(param_values, valid_mu, '-', label='test', color='red')
    plt.fill_between(param_values, train_mu - train_std, train_mu + train_std, edgecolor='none',
                     facecolor=train_line[0].get_color(), alpha=0.2)
    plt.fill_between(param_values, valid_mu - valid_std, valid_mu + valid_std, edgecolor='none',
                     facecolor=valid_line[0].get_color(), alpha=0.2)
    plt.legend()
1. Let's train a KNeighborsClassifier with 100 nearest neighbors (leaving the rest of the parameters at their defaults, except n_jobs=-1 for parallelization) and look at the accuracy on 3-fold cross-validation (for reproducibility, use the skf StratifiedKFold object created earlier) on the sample (X_train, y_train) and separately on the sample (X_valid, y_valid).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from time import time
knn = KNeighborsClassifier(n_neighbors=100, n_jobs=-1)
t_start = time()
knn_cv_score = cross_val_score(knn, X_train, y_train, cv=skf, n_jobs=-1)
print("CV scores:", knn_cv_score)
print("mean:", np.mean(knn_cv_score))
print("std:", np.std(knn_cv_score))
print("Time elapsed: ", time()-t_start)
t_start = time()
scores = cross_val_score(knn, X_valid, y_valid, cv=skf)
print("CV scores:", scores)
print("mean:", np.mean(scores))
print("std:", np.std(scores))
print("Time elapsed: ", time()-t_start)
Important: I ran into a problem with my code at this point: cross_val_score either returned NaN, or, if I converted the matrix with todense(), ran for far too long.
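A hedged debugging sketch (not part of the original assignment): passing error_score='raise' makes cross_val_score re-raise the underlying exception instead of silently returning NaN, and keeping the matrix in CSR format avoids the memory blow-up of todense().
# Debugging sketch (assumption: the NaN comes from an exception swallowed by cross_val_score).
# error_score='raise' surfaces the real error instead of returning NaN;
# keeping X in sparse CSR format avoids the cost of todense().
X_train_csr = csr_matrix(X_train)  # no-op if X_train is already CSR
try:
    debug_scores = cross_val_score(knn, X_train_csr, y_train, cv=skf,
                                   n_jobs=1, error_score='raise')
    print(debug_scores)
except Exception as e:
    print('cross_val_score failed with:', e)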
Question 1. Let's calculate the accuracy (proportion of correct answers) for KNeighborsClassifier on cross-validation and on the hold-out set. Round each value to 3 decimal places.
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_valid)
print(f"KNN Cross-Validation Score: {knn_cv_score.mean():.3f}")
print(f"KNN Validation Score: {accuracy_score(y_valid, knn_pred):.3f}")
2. Let's train a random forest (RandomForestClassifier) of 100 trees (random_state=17 for reproducibility). Let's look at the OOB estimate (for which we immediately set oob_score=True) and the accuracy on the hold-out set (X_valid, y_valid). For parallelization, set n_jobs=-1.
Question 2. Calculate the accuracy for RandomForestClassifier on the out-of-bag estimate and on the hold-out set.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=17, oob_score=True, n_jobs=-1)
t_start = time()
clf.fit(X_train, y_train)
print('Train accuracy: ', clf.score(X_train, y_train))
print("Time elapsed: ", time()-t_start)
print('OOB score: ', clf.oob_score_)
t_start = time()
clf_pred = clf.predict(X_valid)
print(accuracy_score(y_valid, clf_pred))
print("Time elapsed: ", time()-t_start)
3. Let's train logistic regression (LogisticRegression) with the default parameter C and random_state=17 (for reproducibility). Let's look at the accuracy on cross-validation (using the skf object created earlier) and on the hold-out set (X_valid, y_valid). For parallelization, set n_jobs=-1.
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
logit = LogisticRegression(random_state=17, n_jobs=-1)
t_start = time()
logit_cv_score = cross_val_score(logit, X_train, y_train, cv=skf, n_jobs=-1)
logit.fit(X_train, y_train)
logit_y_pred = logit.predict(X_valid)
logit_val_score = accuracy_score(y_valid, logit_y_pred)
print(f"LogReg Cross-Validation Score: {logit_cv_score.mean():.3f}")
print(f"LogReg Validation Score: {logit_val_score:.3f}")
print("Time elapsed: ", time()-t_start)
Using LogisticRegressionCV, we will select the parameter C for logistic regression, first over a wide range: 10 values from 1e-4 to 1e2, using logspace from NumPy. Specify the LogisticRegressionCV parameters multi_class='multinomial' and random_state=17. For cross-validation, use the skf object created earlier. For parallelization, set n_jobs=-1.
At the end, we will draw validation curves for parameter C.
t_start = time()
logit_c_values1 = np.logspace(-4, 2, 10)
logit_grid_searcher1 = LogisticRegressionCV(Cs=logit_c_values1, cv=skf,
                                            multi_class="multinomial",
                                            random_state=17, n_jobs=-1)
logit_grid_searcher1.fit(X_train, y_train)
print("Time elapsed: ", time()-t_start)
The mean cross-validation accuracy for each of the 10 values of C:
logit_mean_cv_scores1 = np.array(
list(logit_grid_searcher1.scores_.values())).mean(axis=(0, 1))
logit_mean_cv_scores1
Let's output the best cross-validation accuracy and the corresponding value of C.
best_score1 = np.max(logit_mean_cv_scores1)
best_C1 = logit_grid_searcher1.Cs_[np.argmax(logit_mean_cv_scores1)]
print(f"Best Score: {best_score1}")
print(f"Best C: {best_C1}")
Let's plot the cross-validation accuracy as a function of C.
plt.plot(logit_c_values1, logit_mean_cv_scores1);
Now the same thing, but with the values of parameter C taken from the range np.linspace(0.1, 7, 20). Let's draw validation curves again and determine the maximum cross-validation accuracy.
t_start = time()
logit_c_values2 = np.linspace(0.1, 7, 20)
logit_grid_searcher2 = LogisticRegressionCV(Cs=logit_c_values2, cv=skf, multi_class='multinomial', random_state=17, n_jobs=-1)
logit_grid_searcher2.fit(X_train, y_train)
print("Time elapsed: ", time()-t_start)
The mean cross-validation accuracy for each of the 20 values of C:
logit_mean_cv_scores2 = np.array(
list(logit_grid_searcher2.scores_.values())).mean(axis=(0, 1))
logit_mean_cv_scores2
Let's output the best cross-validation accuracy and the corresponding value of C.
best_score2 = np.max(logit_mean_cv_scores2)
best_C2 = logit_grid_searcher2.Cs_[np.argmax(logit_mean_cv_scores2)]
print(f"Best Score: {best_score2}")
print(f"Best C: {best_C2}")
Let's plot the cross-validation accuracy as a function of C.
plt.plot(logit_c_values2, logit_mean_cv_scores2);
Let's output the accuracy on the hold-out set (X_valid, y_valid) for logistic regression with the best value of C found.
t_start = time()
logit = LogisticRegression(C=best_C2, n_jobs=-1, random_state=17)
logit_cv_score = cross_val_score(logit, X_train, y_train, cv=skf, n_jobs=-1)
logit.fit(X_train, y_train)
logit_y_pred = logit.predict(X_valid)
logit_val_score = accuracy_score(y_valid, logit_y_pred)
print("Time elapsed: ", time()-t_start)
Question 3. Let's calculate the accuracy for logit_grid_searcher2 on cross-validation with the best value of parameter C and on the hold-out set. Round each to 3 decimal places and print them separated by a space.
print(f"LogReg Cross-Validation Score: {logit_cv_score.mean():.3f}")
print(f"LogReg Validation Score: {logit_val_score:.3f}")
4. Let's train a linear SVM (LinearSVC) with the parameter C=1 and random_state=17 (for reproducibility). Let's look at the accuracy on cross-validation (using the skf object created earlier) and on the hold-out set (X_valid, y_valid).
from sklearn.svm import LinearSVC
t_start = time()
svm = LinearSVC(C=1, random_state=17)
scores_svm = cross_val_score(svm, X_train, y_train, cv=skf, n_jobs=-1)
print("CV scores:", scores_svm)
print("mean:", np.mean(scores_svm))
print("std:", np.std(scores_svm))
print("Time elapsed: ", time()-t_start)
t_start = time()
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_valid)
print(accuracy_score(y_valid, svm_pred))
print("Time elapsed: ", time()-t_start)
Using GridSearchCV, we will select the parameter C for SVM, first over a wide range: 10 values from 1e-4 to 1e4, using linspace from NumPy. Let's draw validation curves.
%%time
svm_params1 = {'C': np.linspace(1e-4, 1e4, 10)}
svm_grid_searcher1 = GridSearchCV(svm, param_grid=svm_params1, cv=skf,
                                  return_train_score=True, n_jobs=-1)
svm_grid_searcher1.fit(X_train, y_train)
Let's output the best cross-validation accuracy and the corresponding value of C.
print(f"Best Score: {svm_grid_searcher1.best_score_}")
print(f"Best Params: {svm_grid_searcher1.best_params_}")
Let's plot the cross-validation accuracy as a function of C.
plot_validation_curves(svm_params1['C'], svm_grid_searcher1.cv_results_)
But recall that with the default regularization parameter (C=1) the cross-validation accuracy was higher. This is a case (not uncommon) where it is easy to make a mistake and search over the wrong range of parameters: we took a uniform grid over a large interval and missed the genuinely good range of C values. It is much more sensible to search for C around 1; besides, the model also trains faster with small C than with large C.
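To see why the first grid missed the interesting region, it helps to print it: with a uniform linspace over [1e-4, 1e4], the first point is 1e-4 and the very next one is already above 1000, so nothing near C=1 is ever tried. A log-spaced grid of the same size covers small values much better (a sketch for illustration only).
# Why the coarse uniform grid misses C ~ 1: the step is ~1111, so after 1e-4
# the next candidate is already > 1000.
print(np.linspace(1e-4, 1e4, 10))
# A log-spaced grid of the same size would cover the small-C region evenly.
print(np.logspace(-4, 4, 10))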
Using GridSearchCV, we will select the parameter C for SVM in the range (1e-3, 1): 30 values using linspace from NumPy. Let's draw validation curves.
%%time
svm_params2 = {'C': np.linspace(1e-3, 1, 30)}
svm_grid_searcher2 = GridSearchCV(svm, param_grid=svm_params2, cv=skf, return_train_score=True, n_jobs=-1)
svm_grid_searcher2.fit(X_train, y_train)
Let's output the best cross-validation accuracy and the corresponding value of C.
best_score = svm_grid_searcher2.best_score_
best_params = svm_grid_searcher2.best_params_
print(f"Best Score: {best_score}")
print(f"Best Params: {best_params}")
Let's plot the cross-validation accuracy as a function of C.
plot_validation_curves(svm_params2['C'], svm_grid_searcher2.cv_results_)
Let's output the accuracy on the hold-out set (X_valid, y_valid) for LinearSVC with the best value of C found.
%%time
svm = LinearSVC(**best_params, random_state=17)
svm_cv_score = cross_val_score(svm, X_train, y_train, cv=skf, n_jobs=-1)
svm.fit(X_train, y_train)
svm_y_pred = svm.predict(X_valid)
svm_val_score = accuracy_score(y_valid, svm_y_pred)
Question 4. Let's calculate the accuracy for svm_grid_searcher2 on cross-validation with the best value of parameter C and on the hold-out set. Round each to 3 decimal places and print them separated by a space.
print(f"SVC Cross-Validation Score: {svm_cv_score.mean():.3f}")
print(f"SVC Validation Score: {svm_val_score:.3f}")
Let's take the LinearSVC that showed the best quality on cross-validation in Part 1 and check how it works on 8 more samples for 10 users (with different combinations of the session_length and window_size parameters). Since there are more computations here, we will not re-tune the regularization parameter C every time.
Let's define the model_assessment function, documented below. The train_test_split of the sample should be stratified.
def model_assessment(estimator, path_to_X_pickle, path_to_y_pickle, cv, random_state=17, test_size=0.3):
    '''
    Estimates CV-accuracy for the (1 - test_size) share of (X_sparse, y)
    loaded from path_to_X_pickle and path_to_y_pickle, and holdout accuracy for the (test_size) share of (X_sparse, y).
    The split is made with stratified train_test_split with params random_state and test_size.

    :param estimator: Scikit-learn estimator (classifier or regressor)
    :param path_to_X_pickle: path to pickled sparse X (instances and their features)
    :param path_to_y_pickle: path to pickled y (responses)
    :param cv: cross-validation as in cross_val_score (use StratifiedKFold here)
    :param random_state: for train_test_split
    :param test_size: for train_test_split

    :returns: mean CV-accuracy for (X_train, y_train) and accuracy for (X_valid, y_valid), where (X_train, y_train)
    and (X_valid, y_valid) are the (1 - test_size) and (test_size) shares of (X_sparse, y).
    '''
    # load the serialized sparse feature matrix and the target vector
    with open(path_to_X_pickle, 'rb') as X_sparse_pkl:
        X_sparse = pickle.load(X_sparse_pkl)
    with open(path_to_y_pickle, 'rb') as y_pkl:
        y = pickle.load(y_pkl)
    # stratified split into train and hold-out parts, as required
    X_train, X_valid, y_train, y_valid = train_test_split(X_sparse, y,
                                                          test_size=test_size,
                                                          random_state=random_state, stratify=y)
    # cross-validation on the train part (uses the cv argument, not the global skf)
    cv_scores = cross_val_score(estimator, X_train, y_train, cv=cv, n_jobs=-1)
    # refit on the whole train part and evaluate on the hold-out part
    estimator.fit(X_train, y_train)
    valid_pred = estimator.predict(X_valid)
    return np.mean(cv_scores), accuracy_score(y_valid, valid_pred)
Let's make sure that the function works.
model_assessment(svm_grid_searcher2.best_estimator_,
os.path.join(PATH_TO_DATA, 'X_sparse_10users.pkl'),
os.path.join(PATH_TO_DATA, 'y_10users.pkl'), skf, random_state=17, test_size=0.3)
Let's apply the model_assessment function to the best algorithm from the previous part (namely, svm_grid_searcher2.best_estimator_) and 9 samples with different combinations of the session_length and window_size parameters for 10 users. In the loop we will print the session_length and window_size parameters along with the result of model_assessment.
For convenience, it is worth making copies of the previously created pickle files X_sparse_10users.pkl, X_sparse_150users.pkl, y_10users.pkl and y_150users.pkl, adding the suffix s10_w10 to their names, which means a session length of 10 and a window width of 10.
!cp $PATH_TO_DATA/X_sparse_10users.pkl $PATH_TO_DATA/X_sparse_10users_s10_w10.pkl
!cp $PATH_TO_DATA/X_sparse_150users.pkl $PATH_TO_DATA/X_sparse_150users_s10_w10.pkl
!cp $PATH_TO_DATA/y_10users.pkl $PATH_TO_DATA/y_10users_s10_w10.pkl
!cp $PATH_TO_DATA/y_150users.pkl $PATH_TO_DATA/y_150users_s10_w10.pkl
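If the shell magic is not available (outside Colab/Jupyter), the same copies can be made in pure Python; a small sketch, assuming the week-2 pickle files already exist under PATH_TO_DATA:
# Pure-Python equivalent of the !cp commands above (sketch; assumes the
# original week-2 pickle files exist in PATH_TO_DATA).
import shutil

for name in ['X_sparse_10users', 'X_sparse_150users', 'y_10users', 'y_150users']:
    shutil.copyfile(os.path.join(PATH_TO_DATA, f'{name}.pkl'),
                    os.path.join(PATH_TO_DATA, f'{name}_s10_w10.pkl'))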
%%time
estimator = svm_grid_searcher2.best_estimator_
for window_size, session_length in itertools.product([10, 7, 5], [15, 10, 7, 5]):
    if window_size <= session_length:
        path_to_X_pkl = os.path.join(
            PATH_TO_DATA, f"X_sparse_10users_s{session_length}_w{window_size}.pkl")
        path_to_y_pkl = os.path.join(
            PATH_TO_DATA, f"y_10users_s{session_length}_w{window_size}.pkl")
        print(window_size, session_length,
              model_assessment(estimator=estimator,
                               path_to_X_pickle=path_to_X_pkl,
                               path_to_y_pickle=path_to_y_pkl,
                               cv=skf))
%%time
estimator = svm_grid_searcher2.best_estimator_
for window_size, session_length in itertools.product([10, 7, 5], [15, 10, 7, 5]):
    if window_size <= session_length:
        path_to_X_pkl = os.path.join(
            PATH_TO_DATA, f"X_sparse_150users_s{session_length}_w{window_size}.pkl")
        path_to_y_pkl = os.path.join(
            PATH_TO_DATA, f"y_150users_s{session_length}_w{window_size}.pkl")
        print(window_size, session_length,
              model_assessment(estimator=estimator,
                               path_to_X_pickle=path_to_X_pkl,
                               path_to_y_pickle=path_to_y_pkl,
                               cv=skf))
Question 5. Let's calculate the accuracy for LinearSVC with the tuned parameter C on the sample X_sparse_10users_s15_w5. Report the accuracy on cross-validation and on the hold-out set. Round each to 3 decimal places and print them separated by a space.
%%time
estimator = svm_grid_searcher2.best_estimator_
path_to_X_pkl = os.path.join(PATH_TO_DATA, "X_sparse_10users_s15_w5.pkl")
path_to_y_pkl = os.path.join(PATH_TO_DATA, "y_10users_s15_w5.pkl")
with open(path_to_X_pkl, 'rb') as X_sparse_10users_pkl:
    X_sparse_10users = pickle.load(X_sparse_10users_pkl)
with open(path_to_y_pkl, 'rb') as y_10users_pkl:
    y_10users = pickle.load(y_10users_pkl)
X_train, X_valid, y_train, y_valid = train_test_split(X_sparse_10users, y_10users,
test_size=0.3,
random_state=17, stratify=y_10users)
scores_svm = cross_val_score(estimator, X_train, y_train, cv=skf, n_jobs=-1)
estimator.fit(X_train, y_train)
svm_pred = estimator.predict(X_valid)
print(f"SVC Cross-Validation Score: {np.mean(scores_svm):.3f}")
print(f"SVC Validation Score: {accuracy_score(y_valid, svm_pred):.3f}")
Let's draw a conclusion about how the classification quality depends on session length and window width.
%%time
estimator = svm_grid_searcher2.best_estimator_
for window_size, session_length in [(5, 5), (7, 7), (10, 10)]:
    path_to_X_pkl = os.path.join(
        PATH_TO_DATA, f"X_sparse_150users_s{session_length}_w{window_size}.pkl")
    path_to_y_pkl = os.path.join(
        PATH_TO_DATA, f"y_150users_s{session_length}_w{window_size}.pkl")
    print(window_size, session_length,
          model_assessment(estimator=estimator,
                           path_to_X_pickle=path_to_X_pkl,
                           path_to_y_pickle=path_to_y_pkl,
                           cv=skf))
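To make the comparison across parameter combinations easier to read, the results can be collected into a table instead of printed tuples; a sketch, assuming the s{session_length}_w{window_size} pickle files from week 2 for 10 users are present:
# Sketch: collect model_assessment results for 10 users into a DataFrame
# (assumes the s{session_length}_w{window_size} pickle files exist).
rows = []
for window_size, session_length in itertools.product([10, 7, 5], [15, 10, 7, 5]):
    if window_size <= session_length:
        cv_acc, valid_acc = model_assessment(
            estimator=svm_grid_searcher2.best_estimator_,
            path_to_X_pickle=os.path.join(
                PATH_TO_DATA, f"X_sparse_10users_s{session_length}_w{window_size}.pkl"),
            path_to_y_pickle=os.path.join(
                PATH_TO_DATA, f"y_10users_s{session_length}_w{window_size}.pkl"),
            cv=skf)
        rows.append({'session_length': session_length, 'window_size': window_size,
                     'cv_accuracy': cv_acc, 'valid_accuracy': valid_acc})
results_df = pd.DataFrame(rows).sort_values('cv_accuracy', ascending=False)
results_df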
Question 6. Calculate the accuracy for LinearSVC with the tuned parameter C on the X_sparse_150users sample. Report the accuracy on cross-validation and on the hold-out set. Round each to 3 decimal places and print them separated by a space.
%%time
estimator = svm_grid_searcher2.best_estimator_
path_to_X_pkl = os.path.join(PATH_TO_DATA, "X_sparse_150users.pkl")
path_to_y_pkl = os.path.join(PATH_TO_DATA, "y_150users.pkl")
with open(path_to_X_pkl, 'rb') as X_sparse_150users_pkl:
    X_sparse_150users = pickle.load(X_sparse_150users_pkl)
with open(path_to_y_pkl, 'rb') as y_150users_pkl:
    y_150users = pickle.load(y_150users_pkl)
X_train, X_valid, y_train, y_valid = train_test_split(X_sparse_150users, y_150users,
test_size=0.3,
random_state=17, stratify=y_150users)
scores_svm = cross_val_score(estimator, X_train, y_train, cv=skf, n_jobs=-1)
estimator.fit(X_train, y_train)
svm_pred = estimator.predict(X_valid)
print(f"SVC Cross-Validation Score: {np.mean(scores_svm):.3f}")
print(f"SVC Validation Score: {accuracy_score(y_valid, svm_pred):.3f}")
The multiclass accuracy on a sample of 150 users may look disappointingly low, but the good news is that an individual user can be identified quite well.
Let's load the previously serialized objects X_sparse_150users and y_150users, corresponding to the training sample for 150 users with parameters (session_length, window_size) = (10, 10). Split them into 70% and 30% in exactly the same way as before.
with open(os.path.join(PATH_TO_DATA, 'X_sparse_150users.pkl'), 'rb') as X_sparse_150users_pkl:
    X_sparse_150users = pickle.load(X_sparse_150users_pkl)
with open(os.path.join(PATH_TO_DATA, 'y_150users.pkl'), 'rb') as y_150users_pkl:
    y_150users = pickle.load(y_150users_pkl)
X_train_150, X_valid_150, y_train_150, y_valid_150 = train_test_split(X_sparse_150users,
y_150users, test_size=0.3,
random_state=17, stratify=y_150users)
Let's train LogisticRegressionCV for a single value of the parameter C (the best one found on cross-validation in Part 1; use the exact value, not an eyeballed one). Now we will be solving 150 "One-vs-All" tasks, so we specify the argument multi_class='ovr'. As always, where possible, set n_jobs=-1 and random_state=17.
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
best_C2_tmp = 1.9157894736842107
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)
# convert string labels like 'user0128' to integer user ids
y_train_150_tmp = []
for i in y_train_150:
    y_train_150_tmp.append(int(i[4:]))
y_train_150_work = np.array(y_train_150_tmp, dtype=int)
%%time
logit_cv_150users = LogisticRegressionCV(Cs=[best_C2_tmp], cv=skf, multi_class="ovr",
n_jobs=-1, random_state=17)
logit_cv_150users.fit(X_train_150, y_train_150_work)
Let's look at the mean cross-validation accuracy in the task of identifying each user individually.
cv_scores_by_user = logit_cv_150users.scores_
for user_id in logit_cv_150users.scores_:
    print(f"User {user_id}, CV score: {cv_scores_by_user[user_id].mean()}")
The results look impressive, but we may be forgetting about class imbalance: a high accuracy can be obtained by a constant prediction. Let's calculate, for each user, the difference between the cross-validation accuracy (just computed with LogisticRegressionCV) and the share of labels in y_train_150 that differ from this user's ID (this is the accuracy a classifier would get in the i-vs-All task if it always "said" that this is not user i).
class_distr = np.bincount(y_train_150_work)
acc_diff_vs_constant = []
for user_id in np.unique(y_train_150_work):
    # accuracy of the constant "this is not user_id" prediction
    val = (class_distr.sum() - class_distr[user_id]) / class_distr.sum()
    diff = cv_scores_by_user[user_id].mean() - val
    acc_diff_vs_constant.append(diff)
    print(f"User: {user_id} Val: {val:.3f} Diff: {diff}")
num_better_than_default = (np.array(acc_diff_vs_constant) > 0).sum()
num_better_than_default
Question 7. Let's calculate the proportion of users for whom logistic regression on cross-validation gives a better prediction than the constant one. Round it to 3 decimal places.
better = num_better_than_default / len(acc_diff_vs_constant)
print(round(better, 3))
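As an optional sanity check (not part of the assignment), the same constant baseline can be reproduced with sklearn's DummyClassifier, which always predicts the majority class, i.e. "not user i", in each i-vs-All problem; a sketch for a single user id:
# Sanity-check sketch: DummyClassifier(strategy='most_frequent') reproduces
# the constant "this is not user i" baseline computed above.
from sklearn.dummy import DummyClassifier

some_user = np.unique(y_train_150_work)[0]  # any user id will do
y_binary_some_user = (y_train_150_work == some_user).astype(int)
dummy = DummyClassifier(strategy='most_frequent')
dummy_scores = cross_val_score(dummy, X_train_150, y_binary_some_user, cv=skf)
print(f"User {some_user}: constant-prediction CV accuracy {dummy_scores.mean():.3f}")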
Next, we will build learning curves for a specific user, for example, user 128. Let's make a new binary vector based on y_150users; its values will be 1 or 0, depending on whether the user ID equals 128.
y_binary_128 = (y_150users == 'user0128').astype(int)
from sklearn.model_selection import learning_curve
def plot_learning_curve(val_train, val_test, train_sizes,
                        xlabel='Training Set Size', ylabel='score'):
    def plot_with_err(x, data, **kwargs):
        mu, std = data.mean(1), data.std(1)
        lines = plt.plot(x, mu, '-', **kwargs)
        plt.fill_between(x, mu - std, mu + std, edgecolor='none',
                         facecolor=lines[0].get_color(), alpha=0.2)
    plot_with_err(train_sizes, val_train, label='train')
    plot_with_err(train_sizes, val_test, label='valid')
    plt.xlabel(xlabel); plt.ylabel(ylabel)
    plt.legend(loc='lower right');
Let's calculate the cross-validation accuracy in the "user128-vs-All" classification task as a function of the training-set size.
%%time
train_sizes = np.linspace(0.25, 1, 20)
estimator = svm_grid_searcher2.best_estimator_
n_train, val_train, val_test = learning_curve(
estimator=estimator,
X=X_sparse_150users,
y=y_binary_128,
train_sizes=train_sizes,
cv=skf,
n_jobs=-1,
random_state=17)
plot_learning_curve(val_train, val_test, n_train,
xlabel='train_size', ylabel='accuracy')
Ways to improve
- of course, you could try a bunch of other algorithms, for example Xgboost, but in a task like this it is very unlikely that anything will beat linear methods
- it would be interesting to check the quality of the algorithm on data where sessions are delimited not by the number of sites visited but by time, for example 5, 7, 10 and 15 minutes; the data of our competition is worth noting separately
- again, if resources allow, you could check how well the problem can be solved for 3000 users
Next week we will recall linear models trained with stochastic gradient descent and appreciate how much faster they are. We will also make our first (or not-so-first) submissions in the Kaggle Inclass [competition](https://inclass.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2).
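As a small preview (a sketch only, not next week's actual assignment), an SGD-based linear classifier can be dropped into the same cross-validation setup; defaults are used here, and the real choice of loss and regularization is left for next week:
# Preview sketch: a linear model trained with stochastic gradient descent
# (default hinge loss, i.e. a linear SVM) in the same CV setup as above.
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(random_state=17, n_jobs=-1)
sgd_cv_scores = cross_val_score(sgd, X_train, y_train, cv=skf, n_jobs=-1)
print(f"SGD CV accuracy: {sgd_cv_scores.mean():.3f}")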