Week 2. Preparation and initial data analysis
In the second week we will continue preparing the data for further analysis and for building predictive models. Specifically, earlier we defined a session as a sequence of 10 sites visited by a user; now we will make the session length a parameter, and later, when training predictive models, we will choose the best session length. We will also take a closer look at the preprocessed data and statistically test the first hypotheses about our observations.
Week 2 plan:
- Preparation of several training samples for comparison
- Primary data analysis, hypothesis testing
Preparation of several training samples for comparison
Let's make the number of sites in a session a parameter, so that later we can compare classification models trained on different samples: with 5, 7, 10 and 15 sites per session. Moreover, so far we took 10 sites in a row, without overlap. Now let's apply the idea of a sliding window: sessions will overlap.
Example: for a session length of 10 and a window width of 7, a file of 30 records generates not 3 sessions, as before (1-10, 11-20, 21-30), but 5 (1-10, 8-17, 15-24, 22-30, 29-30). The penultimate session is padded with one zero and the last one with 8 zeros.
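To make the splitting concrete, here is a minimal sketch of the idea for the toy numbers above (plain Python; the project function below does the same thing in a more general way):
records = list(range(1, 31))                  # a toy "file" of 30 visited sites
session_length, window_size = 10, 7
sessions = []
for start in range(0, len(records), window_size):
    sess = records[start:start + session_length]
    sess = sess + [0] * (session_length - len(sess))   # pad an incomplete tail with zeros
    sessions.append(sess)
# 5 sessions: 1-10, 8-17, 15-24, 22-30 plus one zero, 29-30 plus eight zeros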
Let's create several samples for different combinations of session length and window width parameters. All of them are presented in the table below:
| window_size \ session_length | 5 | 7 | 10 | 15 |
|---|---|---|---|---|
| 5 | ✓ | ✓ | ✓ | ✓ |
| 7 |  | ✓ | ✓ | ✓ |
| 10 |  |  | ✓ (done) | ✓ |
In total, there should be 18 sparse matrices: the 9 combinations of session-formation parameters shown in the table, for each of the samples of 10 and 150 users. Two of these samples were already built in the previous part; they correspond to the combination session_length=10, window_size=10, marked in the table above as done.
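As a quick sanity check of this count (not part of the assignment code), the valid parameter pairs can be enumerated with the same filter that the processing loop below will use:
from itertools import product

combos = [(s, w) for w, s in product([10, 7, 5], [15, 10, 7, 5]) if w <= s]
print(len(combos))       # 9 (session_length, window_size) pairs
print(2 * len(combos))   # 18 sparse matrices for the 10- and 150-user samples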
Let's implement the function prepare_sparse_train_set_window.
Arguments:
- path_to_csv_files – path to the directory with csv files
- site_freq_path - path to the pickle file with the frequency dictionary obtained in part 1 of the project
- session_length – session length (parameter)
- window_size – window width (parameter)
The function should return 2 objects:
- a sparse matrix X_sparse (a two-dimensional scipy.sparse.csr_matrix), in which rows correspond to sessions of session_length sites, and each of the max(site_id) columns holds the number of visits of that site_id within the session
- a vector y (a NumPy array) of "answers": the IDs of the users to whom the sessions in X_sparse belong
Details:
- Modify the prepare_train_set function created in part 1 of the project
- Some sessions may be repeated; leave them as they are, do not remove duplicates
- Measure the execution time of the loop iterations with time from time, tqdm from tqdm, or the log_progress widget (there is an article about it on Habrahabr)
- The 150 files from capstone_websites_data/150users/ should be processed within a few seconds (depending on the input parameters). If it takes longer, that is not a problem, but the function can be sped up.
import os
import time
import pickle
import math
import pylab
import collections
import pandas as pd
import numpy as np
import scipy.sparse as sps
import scipy.stats as stats
import matplotlib.pyplot as plt
from glob import glob
from tqdm.auto import tqdm
from scipy.sparse import csr_matrix
from datetime import timedelta
from scipy import stats
from statsmodels.stats.proportion import proportion_confint
from collections import Counter
import warnings
warnings.filterwarnings('ignore')
PATH_TO_DATA = '/content/drive/MyDrive/DATA/Stepik/capstone_user_identification'
def prepare_sparse_train_set_window(path_to_csv_files, site_freq_path, session_length=10, window_size=10):
    stock_files = sorted(glob(path_to_csv_files))
    # read the pickle with the frequency dictionary: site name -> (identifier, frequency)
    with open(site_freq_path, "rb") as fp:
        df_site_dict = pickle.load(fp)
    list_all_site = []
    user_list = []
    for filename in stock_files:
        tmp_df = pd.read_csv(filename)
        user = filename[-12:-4]
        # read each user's log separately and convert the sites into their identifiers
        list_site = []
        for site in tmp_df.site:
            list_site.append(df_site_dict.get(site)[0])
        # iterate over window starts, stepping by the window width
        for start in range(0, len(list_site) + window_size, window_size):
            ind_1 = start
            ind_2 = start + session_length
            if ind_2 <= len(list_site) - 1:
                # a full session of session_length sites
                list_all_site.append(list_site[ind_1:ind_2])
                user_list.append(user)
            elif len(list_site[ind_1:]) != 0:
                # an incomplete tail: pad the session with zeros
                sess = list_site[ind_1:] + [0] * (session_length - len(list_site[ind_1:]))
                list_all_site.append(sess)
                user_list.append(user)
    # build the sparse matrix: row i holds the site_id visit counts of session i
    X = pd.DataFrame(list_all_site).values
    X_sparse = csr_matrix((np.ones(X.size, dtype=int), X.reshape(-1),
                           np.arange(X.shape[0] + 1) * X.shape[1]))[:, 1:]
    return X_sparse, np.array(user_list)
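The least obvious part is the construction of the sparse matrix at the end: the csr_matrix((data, indices, indptr)) constructor turns each row of site IDs into per-site visit counts (duplicate column indices are summed when the matrix is materialized), and [:, 1:] drops column 0, which corresponds to the zero padding. A tiny standalone illustration with made-up numbers, using the imports above:
sessions = np.array([[1, 2, 1],
                     [3, 0, 0]])                       # two toy sessions of length 3; 0 is padding
data = np.ones(sessions.size, dtype=int)               # every visit contributes 1
indices = sessions.reshape(-1)                         # the column index is the site_id itself
indptr = np.arange(sessions.shape[0] + 1) * sessions.shape[1]   # 3 entries per row: [0, 3, 6]
X_demo = csr_matrix((data, indices, indptr))[:, 1:]    # drop the padding column 0
print(X_demo.todense())
# expected:
# [[2 1 0]
#  [0 0 1]]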
Let's apply the resulting function with the parameters session_length=5 and window_size=3 to the toy example to make sure that everything works as it should.
start = time.time()
X_toy_s5_w3, y_s5_w3 = prepare_sparse_train_set_window(os.path.join(PATH_TO_DATA, '3users/*.csv'), os.path.join(PATH_TO_DATA, 'site_freq_3users.pkl'), session_length=5, window_size=3)
end = time.time()
print(timedelta(seconds=end-start))
X_toy_s5_w3.todense()
y_s5_w3
Everything matches the example provided in the assignment.
Let's run the created function 16 times in loops over the number of users num_users (10 or 150), the values of the session_length parameter (15, 10, 7 or 5) and the values of the window_size parameter (10, 7 or 5). Serialize all 16 sparse matrices (training samples) and vectors (target class labels, i.e. user IDs) into the files X_sparse_{num_users}users_s{session_length}_w{window_size}.pkl and y_{num_users}users_s{session_length}_w{window_size}.pkl.
To make sure that we keep working with identical objects, write the numbers of rows of all the resulting sparse matrices (16 values) to the list data_lengths. If some of them coincide, that is fine (you can figure out why).
import itertools

start = time.time()
data_lengths = []
user_tmp = []

for num_users in [10, 150]:
    for window_size, session_length in itertools.product([10, 7, 5], [15, 10, 7, 5]):
        # keep only valid pairs and skip (10, 10), which was already built in part 1
        if (window_size <= session_length) and ((window_size, session_length) != (10, 10)):
            if num_users == 10:
                path = os.path.join(PATH_TO_DATA, '10users/*.csv')
                unpickled_df = os.path.join(PATH_TO_DATA, 'site_freq_10users.pkl')
            else:
                path = os.path.join(PATH_TO_DATA, '150users/*.csv')
                unpickled_df = os.path.join(PATH_TO_DATA, 'site_freq_150users.pkl')
            print("NUM USER = ", num_users, " Window Size = ", window_size, " Session Length = ", session_length)
            print(timedelta(seconds=time.time() - start))
            X_sparse, y = prepare_sparse_train_set_window(path, unpickled_df, session_length, window_size)
            data_lengths.append(X_sparse.shape[0])
            user_tmp.append(len(y))
            file_name = os.path.join(PATH_TO_DATA, 'sparse/X_sparse_%dusers_s%d_w%d.pkl' % (num_users, session_length, window_size))
            with open(file_name, 'wb') as fp:
                pickle.dump(X_sparse, fp)
            file_name = os.path.join(PATH_TO_DATA, 'sparse/y_%dusers_s%d_w%d.pkl' % (num_users, session_length, window_size))
            with open(file_name, 'wb') as fp:
                pickle.dump(y, fp)

end = time.time()
print(timedelta(seconds=end - start))
It is important to note that after I disabled tqdm, the processing time dropped from 1.5 hours to 51 seconds.
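If progress reporting is still wanted, one low-overhead option (just a sketch, not what was used above) is to wrap only the per-file loop inside prepare_sparse_train_set_window, so tqdm updates once per file rather than once per processed row:
for filename in tqdm(stock_files, desc='files', leave=False):
    tmp_df = pd.read_csv(filename)
    # ...the rest of the per-file processing stays unchanged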
Write all numbers from the data_lengths list, separated by spaces, to the file answer2_1.txt. This file will be the answer to question 1 of the quiz.
def write_answer_to_file(answer, file_address):
    with open(file_address, 'w') as out_f:
        out_f.write(str(answer))

write_answer_to_file(' '.join([str(elem) for elem in data_lengths]), 'answer2_1.txt')
It is worth noting that Stepik did not accept the space-separated answer; the spaces had to be removed.
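The version that was accepted (a sketch using the same write_answer_to_file helper) simply concatenates the numbers without separators:
write_answer_to_file(''.join([str(elem) for elem in data_lengths]), 'answer2_1.txt')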
Let's read the train_data_10users.csv file prepared in week 1 into a DataFrame. We will work with it from here on.
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_data_10users.csv'),
index_col='session_id')
train_df.head()
train_df.info()
Distribution of the target class:
train_df['user_id'].value_counts()
Let's calculate the distribution of the number of unique sites in each session out of 10 sites visited in a row.
num_unique_sites = [np.unique(train_df.values[i, :-1]).shape[0]
for i in range(train_df.shape[0])]
pd.Series(num_unique_sites).value_counts()
pd.Series(num_unique_sites).hist();
Let's check, using a Q-Q plot and the Shapiro-Wilk test, whether this quantity is normally distributed. The answer to the second question of the quiz will be a file with the word "YES" or "NO", depending on whether the number of unique sites in a session is normally distributed.
stats.probplot(num_unique_sites, dist="norm", plot=pylab)
pylab.show()
stat, p = stats.shapiro(num_unique_sites)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
The distribution is not normal.
Let's test the hypothesis that, within a session of 10 sites, a user will visit at least one site they have already visited in that session. Using the binomial test for a proportion, let's check that the share of cases where the user revisited a site (i.e. the number of unique sites in the session is less than 10) is large: greater than 95% (note that the alternative to the hypothesis that the share equals 95% is one-sided). The resulting p-value will be the answer to question 3 of the quiz.
has_two_similar = (np.array(num_unique_sites) < 10).astype('int')
len(num_unique_sites)
stats.binom_test(has_two_similar.sum(), has_two_similar.shape[0], 0.95, alternative='greater')
The resulting p-value is approximately 0.022.
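As a cross-check, the same one-sided p-value can be computed directly as a binomial tail probability, since binom_test with alternative='greater' returns P(X >= k) under p = 0.95:
k, n = has_two_similar.sum(), has_two_similar.shape[0]
p_value = stats.binom.sf(k - 1, n, 0.95)   # P(X >= k) for X ~ Bin(n, 0.95)
print(round(p_value, 3))                   # ~0.022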
Let's construct a 95% Wilson confidence interval for this proportion and round its boundaries to 3 decimal places.
wilson_interval = proportion_confint(has_two_similar.sum(), has_two_similar.shape[0], method='wilson')
wilson_interval
The interval is approximately (0.950, 0.957).
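For reference, the Wilson bounds can also be computed by hand from the score-interval formula; this sketch should reproduce the proportion_confint result above:
k, n = has_two_similar.sum(), has_two_similar.shape[0]
p_hat = k / n
z = stats.norm.ppf(0.975)                                   # two-sided 95% level
center = (p_hat + z ** 2 / (2 * n)) / (1 + z ** 2 / n)
half = z / (1 + z ** 2 / n) * np.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
print(round(center - half, 3), round(center + half, 3))     # ~0.950 0.957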
What is the 95% confidence interval for the average site visit frequency in the sample? We need to build a 95% confidence interval for the mean frequency with which sites appear in the sample (over all sites, not only those visited at least 1000 times) using the bootstrap. Use as many bootstrap subsamples as there were sites in the original sample of 10 users. Take the subsamples from the already computed list of site visit frequencies; there is no need to recount them. Note that the frequency of the zero site (the site with index 0, which appears where sessions were shorter than 10 sites) must not be included. Round the boundaries of the interval to 3 decimal places.
Bagging (from Bootstrap aggregation) is one of the earliest and simplest kinds of ensembles, proposed by Leo Breiman in 1994. Bagging is based on the statistical bootstrap method, which makes it possible to estimate many statistics of complex distributions.
The bootstrap method works as follows. Let there be a sample X of size N. Draw N objects from it uniformly and with replacement: each time we pick an arbitrary object (each object is chosen with the same probability 1/N), and every draw is made from all N original objects. You can picture a bag of balls: a ball selected at some step is put back into the bag, and the next draw is again made uniformly from the same number of balls. Because of the replacement, there will be repeats among the selected objects. Denote the new sample by X1. Repeating the procedure M times, we generate M subsamples X1, ..., XM. Now we have a sufficiently large number of samples and can estimate various statistics of the original distribution.
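A tiny numeric illustration of a single bootstrap draw (made-up numbers, unrelated to the project data); drawing N objects with replacement typically produces repeats, which is exactly what the bag-of-balls picture describes:
rng = np.random.RandomState(17)
X = np.array([3, 1, 4, 1, 5])
X1 = X[rng.randint(0, len(X), len(X))]   # N = 5 draws with replacement; repeats are expected
print(X1)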
def get_bootstrap_samples(data, n_samples, random_seed=56):
    # generate subsamples using the bootstrap (sampling with replacement)
    np.random.seed(random_seed)
    indices = np.random.randint(0, len(data), (n_samples, len(data)))
    samples = data[indices]
    return samples

def stat_intervals(stat, alpha):
    # percentile-based interval estimation
    boundaries = np.percentile(stat, [100 * alpha / 2., 100 * (1 - alpha / 2.)])
    return boundaries
stock_files = sorted(glob(os.path.join(PATH_TO_DATA, '10users/*.csv')))
df = pd.concat((pd.read_csv(file) for file in stock_files), ignore_index=True)
df.head()
sorted_site = dict(sorted(Counter(df.site).items(), key=lambda kv: kv[1], reverse=True))
sorted_site_1000 = {}
for key, value in sorted_site.items():
    if value >= 1000:
        sorted_site_1000[key] = value
plt.hist(list(sorted_site_1000.values()))
plt.show()
site_freq = list(sorted_site.values())
site_mean_scores = list(map(np.mean, get_bootstrap_samples(np.array(site_freq), len(site_freq))))
print ("95% confidence interval for the ILEC median repair time:", stat_intervals(site_mean_scores, 0.05))
As a result, the 95% confidence interval for the mean site visit frequency is approximately (22.515, 35.763).