Week 2. Preparation and initial data analysis
In the second week we will continue preparing the data for further analysis and for building predictive models. Specifically, earlier we defined a session as a sequence of 10 sites visited by a user; now we will make the session length a parameter, and later, when training predictive models, we will choose the best session length. We will also take a closer look at the preprocessed data and statistically test the first hypotheses about our observations.
Week 2 plan:
- Preparation of several training samples for comparison
- Primary data analysis, hypothesis testing
Preparation of several training samples for comparison
Let's make the number of sites in a session a parameter, so that later we can compare classification models trained on different samples: with 5, 7, 10 and 15 sites per session. Moreover, so far we took 10 sites in a row, without overlap. Now let's apply the idea of a sliding window: sessions will overlap.
Example: for a session length of 10 and a window width of 7, a file of 30 records generates not 3 sessions, as before (1-10, 11-20, 21-30), but 5 (1-10, 8-17, 15-24, 22-30, 29-30). The penultimate session is padded with one zero and the last one with 8 zeros.
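To make the splitting concrete, here is a minimal sketch of the idea for the toy numbers above (plain Python; the project function below does the same thing in a more general way):
records = list(range(1, 31))                  # a toy "file" of 30 visited sites
session_length, window_size = 10, 7
sessions = []
for start in range(0, len(records), window_size):
    sess = records[start:start + session_length]
    sess = sess + [0] * (session_length - len(sess))   # pad an incomplete tail with zeros
    sessions.append(sess)
# 5 sessions: 1-10, 8-17, 15-24, 22-30 plus one zero, 29-30 plus eight zeros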
Let's create several samples for different combinations of session length and window width parameters. All of them are presented in the table below:
| window_size \ session_length | 5 | 7 | 10 | 15 |
|---|---|---|---|---|
| 5 | ✓ | ✓ | ✓ | ✓ |
| 7 |  | ✓ | ✓ | ✓ |
| 10 |  |  | ✓ (done) | ✓ |
In total, there should be 18 sparse matrices: the 9 combinations of session-formation parameters shown in the table, for each of the samples of 10 and 150 users. Two of these samples were already built in the previous part; they correspond to the combination session_length=10, window_size=10, marked in the table above as done.
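As a quick sanity check of this count (not part of the assignment code), the valid parameter pairs can be enumerated with the same filter that the processing loop below will use:
from itertools import product

combos = [(s, w) for w, s in product([10, 7, 5], [15, 10, 7, 5]) if w <= s]
print(len(combos))       # 9 (session_length, window_size) pairs
print(2 * len(combos))   # 18 sparse matrices for the 10- and 150-user samples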
Let's implement the function prepare_sparse_train_set_window.
Arguments:
- path_to_csv_files – path to the directory with csv files
- site_freq_path - path to the pickle file with the frequency dictionary obtained in part 1 of the project
- session_length – session length (parameter)
- window_size – window width (parameter)
The function should return 2 objects:
- a sparse matrix X_sparse (a two-dimensional scipy.sparse.csr_matrix), in which rows correspond to sessions of session_length sites, and each of the max(site_id) columns holds the number of visits of that site_id within the session
- a vector y (a NumPy array) of "answers": the IDs of the users to whom the sessions in X_sparse belong
Details:
- Modify the prepare_train_set function created in part 1 of the project
- Some sessions may be repeated; leave them as they are, do not remove duplicates
- Measure the execution time of the loop iterations with time from time, tqdm from tqdm, or the log_progress widget (there is an article about it on Habrahabr)
- The 150 files from capstone_websites_data/150users/ should be processed within a few seconds (depending on the input parameters). If it takes longer, that is not a problem, but the function can be sped up.
import os
import time
import pickle
import math
import pylab
import collections
import pandas as pd
import numpy as np
import scipy.sparse as sps
import scipy.stats as stats
import matplotlib.pyplot as plt
from glob import glob
from tqdm.auto import tqdm
from scipy.sparse import csr_matrix
from datetime import timedelta
from scipy import stats
from statsmodels.stats.proportion import proportion_confint
from collections import Counter
import warnings
warnings.filterwarnings('ignore')
PATH_TO_DATA = '/content/drive/MyDrive/DATA/Stepik/capstone_user_identification'
def prepare_sparse_train_set_window(path_to_csv_files, site_freq_path, session_length=10, window_size=10):
    stock_files = sorted(glob(path_to_csv_files))
    # read the pickle with the frequency dictionary: site name -> (identifier, frequency)
    with open(site_freq_path, "rb") as fp:
        df_site_dict = pickle.load(fp)
    list_all_site = []
    user_list = []
    for filename in stock_files:
        tmp_df = pd.read_csv(filename)
        user = filename[-12:-4]
        # read each user's log separately and convert the sites into their identifiers
        list_site = []
        for site in tmp_df.site:
            list_site.append(df_site_dict.get(site)[0])
        # iterate over window starts, stepping by the window width
        for start in range(0, len(list_site) + window_size, window_size):
            ind_1 = start
            ind_2 = start + session_length
            if ind_2 <= len(list_site) - 1:
                # a full session of session_length sites
                list_all_site.append(list_site[ind_1:ind_2])
                user_list.append(user)
            elif len(list_site[ind_1:]) != 0:
                # an incomplete tail: pad the session with zeros
                sess = list_site[ind_1:] + [0] * (session_length - len(list_site[ind_1:]))
                list_all_site.append(sess)
                user_list.append(user)
    # build the sparse matrix: row i holds the site_id visit counts of session i
    X = pd.DataFrame(list_all_site).values
    X_sparse = csr_matrix((np.ones(X.size, dtype=int), X.reshape(-1),
                           np.arange(X.shape[0] + 1) * X.shape[1]))[:, 1:]
    return X_sparse, np.array(user_list)
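The least obvious part is the construction of the sparse matrix at the end: the csr_matrix((data, indices, indptr)) constructor turns each row of site IDs into per-site visit counts (duplicate column indices are summed when the matrix is materialized), and [:, 1:] drops column 0, which corresponds to the zero padding. A tiny standalone illustration with made-up numbers, using the imports above:
sessions = np.array([[1, 2, 1],
                     [3, 0, 0]])                       # two toy sessions of length 3; 0 is padding
data = np.ones(sessions.size, dtype=int)               # every visit contributes 1
indices = sessions.reshape(-1)                         # the column index is the site_id itself
indptr = np.arange(sessions.shape[0] + 1) * sessions.shape[1]   # 3 entries per row: [0, 3, 6]
X_demo = csr_matrix((data, indices, indptr))[:, 1:]    # drop the padding column 0
print(X_demo.todense())
# expected:
# [[2 1 0]
#  [0 0 1]]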
Let's apply the resulting function with the parameters session_length=5 and window_size=3 to the toy example to make sure that everything works as it should.
start = time.time()
X_toy_s5_w3, y_s5_w3 = prepare_sparse_train_set_window(os.path.join(PATH_TO_DATA, '3users/*.csv'), os.path.join(PATH_TO_DATA, 'site_freq_3users.pkl'), session_length=5, window_size=3)
end = time.time()
print(timedelta(seconds=end-start))
X_toy_s5_w3.todense()
y_s5_w3
Everything matches the example provided in the assignment.
Let's run the created function 16 times in loops over the number of users num_users (10 or 150), the values of the session_length parameter (15, 10, 7 or 5) and the values of the window_size parameter (10, 7 or 5). Serialize all 16 sparse matrices (training samples) and vectors (target class labels, i.e. user IDs) into the files X_sparse_{num_users}users_s{session_length}_w{window_size}.pkl and y_{num_users}users_s{session_length}_w{window_size}.pkl.
To make sure that we keep working with identical objects, write the numbers of rows of all the resulting sparse matrices (16 values) to the list data_lengths. If some of them coincide, that is fine (you can figure out why).
import itertools

start = time.time()
data_lengths = []
user_tmp = []

for num_users in [10, 150]:
    for window_size, session_length in itertools.product([10, 7, 5], [15, 10, 7, 5]):
        # keep only valid pairs and skip (10, 10), which was already built in part 1
        if (window_size <= session_length) and ((window_size, session_length) != (10, 10)):
            if num_users == 10:
                path = os.path.join(PATH_TO_DATA, '10users/*.csv')
                unpickled_df = os.path.join(PATH_TO_DATA, 'site_freq_10users.pkl')
            else:
                path = os.path.join(PATH_TO_DATA, '150users/*.csv')
                unpickled_df = os.path.join(PATH_TO_DATA, 'site_freq_150users.pkl')
            print("NUM USER = ", num_users, " Window Size = ", window_size, " Session Length = ", session_length)
            print(timedelta(seconds=time.time() - start))
            X_sparse, y = prepare_sparse_train_set_window(path, unpickled_df, session_length, window_size)
            data_lengths.append(X_sparse.shape[0])
            user_tmp.append(len(y))
            file_name = os.path.join(PATH_TO_DATA, 'sparse/X_sparse_%dusers_s%d_w%d.pkl' % (num_users, session_length, window_size))
            with open(file_name, 'wb') as fp:
                pickle.dump(X_sparse, fp)
            file_name = os.path.join(PATH_TO_DATA, 'sparse/y_%dusers_s%d_w%d.pkl' % (num_users, session_length, window_size))
            with open(file_name, 'wb') as fp:
                pickle.dump(y, fp)

end = time.time()
print(timedelta(seconds=end - start))
It is important to note that after I disabled tqdm, the processing time dropped from 1.5 hours to 51 seconds.
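If progress reporting is still wanted, one low-overhead option (just a sketch, not what was used above) is to wrap only the per-file loop inside prepare_sparse_train_set_window, so tqdm updates once per file rather than once per processed row:
for filename in tqdm(stock_files, desc='files', leave=False):
    tmp_df = pd.read_csv(filename)
    # ...the rest of the per-file processing stays unchanged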
Write all numbers from the data_lengths list, separated by spaces, to the file answer2_1.txt. This file will be the answer to question 1 of the quiz.
def write_answer_to_file(answer, file_address):
    with open(file_address, 'w') as out_f:
        out_f.write(str(answer))

write_answer_to_file(' '.join([str(elem) for elem in data_lengths]), 'answer2_1.txt')
It is worth noting that Stepik did not accept the space-separated answer; the spaces had to be removed.
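The version that was accepted (a sketch using the same write_answer_to_file helper) simply concatenates the numbers without separators:
write_answer_to_file(''.join([str(elem) for elem in data_lengths]), 'answer2_1.txt')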
Let's read the train_data_10users.csv file prepared in week 1 into a DataFrame. We will work with it from here on.
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_data_10users.csv'),
index_col='session_id')
train_df.head()
train_df.info()
Distribution of the target class:
train_df['user_id'].value_counts()
Let's calculate the distribution of the number of unique sites in each session out of 10 sites visited in a row.
num_unique_sites = [np.unique(train_df.values[i, :-1]).shape[0]
for i in range(train_df.shape[0])]
pd.Series(num_unique_sites).value_counts()
pd.Series(num_unique_sites).hist();
Let's check, using a Q-Q plot and the Shapiro-Wilk test, whether this quantity is normally distributed. The answer to the second question of the quiz will be a file with the word "YES" or "NO", depending on whether the number of unique sites in a session is normally distributed.
stats.probplot(num_unique_sites, dist="norm", plot=pylab)
pylab.show()
stat, p = stats.shapiro(num_unique_sites)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
The distribution is not normal.
Let's test the hypothesis that, within a session of 10 sites, a user will visit at least one site they have already visited in that session. Using the binomial test for a proportion, let's check that the share of cases where the user revisited a site (i.e. the number of unique sites in the session is less than 10) is large: greater than 95% (note that the alternative to the hypothesis that the share equals 95% is one-sided). The resulting p-value will be the answer to question 3 of the quiz.
has_two_similar = (np.array(num_unique_sites) < 10).astype('int')
len(num_unique_sites)
stats.binom_test(has_two_similar.sum(), has_two_similar.shape[0], 0.95, alternative='greater')
The resulting p-value is approximately 0.022.
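As a cross-check, the same one-sided p-value can be computed directly as a binomial tail probability, since binom_test with alternative='greater' returns P(X >= k) under p = 0.95:
k, n = has_two_similar.sum(), has_two_similar.shape[0]
p_value = stats.binom.sf(k - 1, n, 0.95)   # P(X >= k) for X ~ Bin(n, 0.95)
print(round(p_value, 3))                   # ~0.022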
Let's construct a 95% Wilson confidence interval for this proportion and round its boundaries to 3 decimal places.
wilson_interval = proportion_confint(has_two_similar.sum(), has_two_similar.shape[0], method='wilson')
wilson_interval
The interval is approximately (0.950, 0.957).
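For reference, the Wilson bounds can also be computed by hand from the score-interval formula; this sketch should reproduce the proportion_confint result above:
k, n = has_two_similar.sum(), has_two_similar.shape[0]
p_hat = k / n
z = stats.norm.ppf(0.975)                                   # two-sided 95% level
center = (p_hat + z ** 2 / (2 * n)) / (1 + z ** 2 / n)
half = z / (1 + z ** 2 / n) * np.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
print(round(center - half, 3), round(center + half, 3))     # ~0.950 0.957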
What is the 95% confidence interval for the average site visit frequency in the sample? We need to build a 95% confidence interval for the mean frequency with which sites appear in the sample (over all sites, not only those visited at least 1000 times) using the bootstrap. Use as many bootstrap subsamples as there were sites in the original sample of 10 users. Take the subsamples from the already computed list of site visit frequencies; there is no need to recount them. Note that the frequency of the zero site (the site with index 0, which appears where sessions were shorter than 10 sites) must not be included. Round the boundaries of the interval to 3 decimal places.
Bagging (from Bootstrap aggregation) is one of the earliest and simplest kinds of ensembles, proposed by Leo Breiman in 1994. Bagging is based on the statistical bootstrap method, which makes it possible to estimate many statistics of complex distributions.
The bootstrap method works as follows. Let there be a sample X of size N. Draw N objects from it uniformly and with replacement: each time we pick an arbitrary object (each object is chosen with the same probability 1/N), and every draw is made from all N original objects. You can picture a bag of balls: a ball selected at some step is put back into the bag, and the next draw is again made uniformly from the same number of balls. Because of the replacement, there will be repeats among the selected objects. Denote the new sample by X1. Repeating the procedure M times, we generate M subsamples X1, ..., XM. Now we have a sufficiently large number of samples and can estimate various statistics of the original distribution.
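A tiny numeric illustration of a single bootstrap draw (made-up numbers, unrelated to the project data); drawing N objects with replacement typically produces repeats, which is exactly what the bag-of-balls picture describes:
rng = np.random.RandomState(17)
X = np.array([3, 1, 4, 1, 5])
X1 = X[rng.randint(0, len(X), len(X))]   # N = 5 draws with replacement; repeats are expected
print(X1)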
def get_bootstrap_samples(data, n_samples, random_seed=56):
    # generate subsamples using the bootstrap (sampling with replacement)
    np.random.seed(random_seed)
    indices = np.random.randint(0, len(data), (n_samples, len(data)))
    samples = data[indices]
    return samples

def stat_intervals(stat, alpha):
    # percentile-based interval estimation
    boundaries = np.percentile(stat, [100 * alpha / 2., 100 * (1 - alpha / 2.)])
    return boundaries
stock_files = sorted(glob(os.path.join(PATH_TO_DATA, '10users/*.csv')))
df = pd.concat((pd.read_csv(file) for file in stock_files), ignore_index=True)
df.head()
sorted_site = dict(sorted(Counter(df.site).items(), key=lambda kv: kv[1], reverse=True))
sorted_site_1000 = {}
for key, value in sorted_site.items():
    if value >= 1000:
        sorted_site_1000[key] = value
plt.hist(list(sorted_site_1000.values()))
plt.show()
site_freq = list(sorted_site.values())
site_mean_scores = list(map(np.mean, get_bootstrap_samples(np.array(site_freq), len(site_freq))))
print ("95% confidence interval for the ILEC median repair time:", stat_intervals(site_mean_scores, 0.05))
As a result, the 95% confidence interval for the mean site visit frequency is approximately (22.515, 35.763).