Week 2. Preparation and initial data analysis

In the second week we continue preparing the data for further analysis and for building forecast models. Earlier we defined a session as a sequence of 10 sites visited by a user; now we will make the session length a parameter, and later, when training predictive models, we will choose the best session length. We will also get acquainted with the preprocessed data and statistically test the first hypotheses related to our observations.

Week 2 plan:

  • Preparation of several training samples for comparison
  • Primary data analysis, hypothesis testing

Preparation of several training samples for comparison

Let's make the number of sites in a session a parameter so that we can later compare classification models trained on different samples – with 5, 7, 10 and 15 sites per session. So far we have taken 10 sites in a row, without overlap. Now let's apply the idea of a sliding window: sessions will overlap.

Example: for a session length of 10 and a window width of 7, a file of 30 records generates not 3 sessions, as before (1-10, 11-20, 21-30), but 5 (1-10, 8-17, 15-24, 22-30, 29-30). The penultimate session is padded with one zero and the last one with 8 zeros.
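As a quick illustration, the windowing with zero padding from this example can be reproduced with a minimal sketch (the site IDs 1–30 here are made up and stand in for a real file of 30 records):

site_ids = list(range(1, 31))          # a toy sequence of 30 visited site IDs
session_length, window_size = 10, 7

sessions = []
for start in range(0, len(site_ids), window_size):
    sess = site_ids[start:start + session_length]
    sess += [0] * (session_length - len(sess))   # pad incomplete trailing sessions with zeros
    sessions.append(sess)

print(len(sessions))  # 5 sessions: 1-10, 8-17, 15-24, 22-30 plus one zero, 29-30 plus eight zeros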

Let's create several samples for different combinations of session length and window width parameters. All of them are presented in the table below:

session_length →    5    7   10   15
window_size
  5                 v    v    v    v
  7                      v    v    v
 10                           v    v

In total there should be 18 sparse matrices – the 9 combinations of session-formation parameters shown in the table, for each of the samples of 10 and 150 users. Two of these samples were already built in part 1; they correspond to the combination session_length=10, window_size=10 (already done).

Let's implement the function prepare_sparse_train_set_window.

Arguments:

  • path_to_csv_files – path to the directory with csv files
  • site_freq_path - path to the pickle file with the frequency dictionary obtained in part 1 of the project
  • session_length – session length (parameter)
  • window_size – window width (parameter)

The function should return 2 objects:

  • a sparse matrix X_sparse (a two-dimensional scipy.sparse.csr_matrix) in which rows correspond to sessions of session_length sites, and the max(site_id) columns hold the number of visits of each site_id within the session (see the short sketch right after this list)
  • vector y (Numpy array) of "responses" in the form of user IDs that belong to sessions from X_sparse
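
Here is a minimal sketch of how such a count matrix can be assembled with csr_matrix (the two sessions below are hypothetical; data = ones, indices = the flattened site IDs, indptr = row boundaries, and column 0, which corresponds to the zero padding, is dropped):

import numpy as np
from scipy.sparse import csr_matrix

sessions = np.array([[3, 1, 3, 0, 0],     # two toy sessions of length 5, 0 = padding
                     [2, 2, 2, 1, 0]])

X_sparse = csr_matrix((np.ones(sessions.size, dtype=int),
                       sessions.flatten(),
                       np.arange(sessions.shape[0] + 1) * sessions.shape[1]))[:, 1:]

print(X_sparse.todense())
# [[1 0 2]     duplicate column indices are summed, so site 3 gives a count of 2
#  [1 3 0]]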

Details:

  • Modify the prepare_train_set function created in part 1 of the project
  • Some sessions may be repeated – leave them as they are, do not remove duplicates
  • Measure the execution time of the loop iterations with time from time, tqdm from tqdm, or the log_progress widget (there is an article about it on Habrahabr)
  • The 150 files from capstone_websites_data/150users/ should be processed in a few seconds (depending on the input parameters). If it takes longer, that is not a problem, but the function can be sped up.
import os
import time
import pickle
import math
import pylab 
import collections

import pandas as pd
import numpy as np
import scipy.sparse as sps
import scipy.stats as stats
import matplotlib.pyplot as plt

from glob import glob
from tqdm.auto import tqdm
from scipy.sparse import csr_matrix
from datetime import timedelta
from scipy import stats
from statsmodels.stats.proportion import proportion_confint
from collections import Counter

import warnings
warnings.filterwarnings('ignore')
PATH_TO_DATA = '/content/drive/MyDrive/DATA/Stepik/capstone_user_identification'
def prepare_sparse_train_set_window(path_to_csv_files, site_freq_path, session_length=10, window_size=10):
    
    stock_files = sorted(glob(path_to_csv_files))
    
    # read the pickled frequency dictionary: site name -> (site_id, visit count)
    with open(site_freq_path, "rb") as fp:
        df_site_dict = pickle.load(fp)
    
    list_all_site = []
    user_list = []
    for filename in stock_files:
        tmp_df = pd.read_csv(filename)
        user = filename[-12:-4]
        # convert the user's visited sites into their numeric identifiers
        list_site = [df_site_dict.get(site)[0] for site in tmp_df.site]
        # slide the window over the sequence of site ids
        for start in range(0, len(list_site), window_size):
            ind_1 = start
            ind_2 = start + session_length
            if ind_2 <= len(list_site):
                sess = list_site[ind_1 : ind_2]
            else:
                # pad the incomplete trailing session with zeros
                sess = list_site[ind_1:] + [0] * (session_length - len(list_site[ind_1:]))
            list_all_site.append(sess)
            user_list.append(user)
    
    # now the sparse matrix: data = ones, indices = flattened site ids,
    # indptr = row boundaries; column 0 (the zero padding) is dropped
    X = pd.DataFrame(list_all_site).values
    X_sparse = csr_matrix((np.ones(X.size, dtype=int), X.reshape(-1),
                           np.arange(X.shape[0] + 1) * X.shape[1]))[:, 1:]
    
    return X_sparse, np.array(user_list)

Let's apply the resulting function with the parameters session_length=5 and window_size=3 to the toy example to make sure that everything works as it should.

start = time.time()

X_toy_s5_w3, y_s5_w3 = prepare_sparse_train_set_window(os.path.join(PATH_TO_DATA, '3users/*.csv'), os.path.join(PATH_TO_DATA, 'site_freq_3users.pkl'), session_length=5, window_size=3)

end = time.time()

print(timedelta(seconds=end-start))
0:00:00.022730
X_toy_s5_w3.todense()
matrix([[0, 3, 1, 0, 0, 0, 1, 0, 0, 0, 0],
        [1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0],
        [0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
        [3, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
        [2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 2, 1, 0, 0, 2, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [2, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0],
        [3, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 2, 1, 0, 0, 0, 0, 0, 1],
        [1, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
y_s5_w3
array(['user0001', 'user0001', 'user0001', 'user0001', 'user0001',
       'user0002', 'user0002', 'user0003', 'user0003', 'user0003',
       'user0003', 'user0003'], dtype='<U8')

Everything matches the provided example.


Let's run the created function 16 times in loops over the number of users num_users (10 or 150), the session_length values (15, 10, 7 or 5) and the window_size values (10, 7 or 5). Serialize all 16 sparse matrices (training samples) and vectors (target class labels – user IDs) into the files X_sparse_{num_users}users_s{session_length}_w{window_size}.pkl and y_{num_users}users_s{session_length}_w{window_size}.pkl.

To make sure that we will later be working with identical objects, we write into the list data_lengths the number of rows of every sparse matrix obtained (16 values). If some of them coincide, that is fine (you can work out why).

import itertools

start = time.time()

data_lengths = []
user_tmp=[]

for num_users in [10, 150]:
    for window_size, session_length in itertools.product([10, 7, 5], [15, 10, 7, 5]):
        if (window_size <= session_length) and ((window_size, session_length) != (10, 10)):
            if num_users == 10:
                path = os.path.join(PATH_TO_DATA, '10users/*.csv')
                unpickled_df = os.path.join(PATH_TO_DATA, 'site_freq_10users.pkl')
            else:
                path = os.path.join(PATH_TO_DATA, '150users/*.csv')
                unpickled_df = os.path.join(PATH_TO_DATA, 'site_freq_150users.pkl')   
            print("NUM USER = ", num_users, " Window Size = ", window_size, " Session Length = ", session_length)
            end = time.time()
            print(timedelta(seconds=end-start))

            X_sparse, y = prepare_sparse_train_set_window(path, unpickled_df, session_length, window_size)
            data_lengths.append(X_sparse.shape[0])
            user_tmp.append(len(y))
            
            file_name = os.path.join(PATH_TO_DATA, 'sparse/X_sparse_%dusers_s%d_w%d.pkl' % (num_users, session_length, window_size))
            with open(file_name, 'wb') as fp:
                pickle.dump(X_sparse, fp)
            file_name = os.path.join(PATH_TO_DATA,'sparse/y_%dusers_s%d_w%d.pkl' % (num_users, session_length, window_size))
            with open(file_name,'wb') as fp:
                pickle.dump(y, fp)
                


end = time.time()

print(timedelta(seconds=end-start))
NUM USER =  10  Window Size =  10  Session Length =  15
0:00:00.003979
NUM USER =  10  Window Size =  7  Session Length =  15
0:00:00.457232
NUM USER =  10  Window Size =  7  Session Length =  10
0:00:01.009698
NUM USER =  10  Window Size =  7  Session Length =  7
0:00:01.452218
NUM USER =  10  Window Size =  5  Session Length =  15
0:00:01.917664
NUM USER =  10  Window Size =  5  Session Length =  10
0:00:02.484663
NUM USER =  10  Window Size =  5  Session Length =  7
0:00:02.953907
NUM USER =  10  Window Size =  5  Session Length =  5
0:00:03.404516
NUM USER =  150  Window Size =  10  Session Length =  15
0:00:03.863693
NUM USER =  150  Window Size =  7  Session Length =  15
0:00:08.548853
NUM USER =  150  Window Size =  7  Session Length =  10
0:00:13.495581
NUM USER =  150  Window Size =  7  Session Length =  7
0:00:18.213572
NUM USER =  150  Window Size =  5  Session Length =  15
0:00:22.648985
NUM USER =  150  Window Size =  5  Session Length =  10
0:00:28.012223
NUM USER =  150  Window Size =  5  Session Length =  7
0:00:33.058571
NUM USER =  150  Window Size =  5  Session Length =  5
0:00:37.866631
0:00:42.397943

It is important to note that after I disabled tqdm, the processing time dropped from 1.5 hours to 51 seconds.
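A likely reason is that the progress bar was being updated inside the innermost loop over individual site visits. If a progress bar is still wanted, a sketch like the following (wrapping only the outer, per-file loop of a few hundred iterations; the path is the one from the task description above) keeps the overhead negligible:

from glob import glob
import pandas as pd
from tqdm.auto import tqdm

# progress over files (hundreds of iterations), not over individual sites (millions of rows)
for filename in tqdm(sorted(glob('capstone_websites_data/150users/*.csv')), desc='files'):
    tmp_df = pd.read_csv(filename)
    # ... per-file session processing goes here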

Let's write all the numbers from the list data_lengths, separated by a space, to the file answer2_1.txt. The resulting file will be the answer to question 1 of the quiz.

def write_answer_to_file(answer, file_address):
    with open(file_address, 'w') as out_f:
        out_f.write(str(answer))
write_answer_to_file(' '.join([str(elem) for elem in data_lengths]), 'answer2_1.txt')

It is important to note that the space-separated answer was not accepted on Stepik, and the spaces had to be removed.
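For reference, a variant consistent with that note simply concatenates the numbers without a separator, reusing the same helper:

write_answer_to_file(''.join([str(elem) for elem in data_lengths]), 'answer2_1.txt')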

Primary data analysis, hypothesis testing

Let's read the file train_data_10users.csv prepared in week 1 into a DataFrame. We will work with it from here on.

train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_data_10users.csv'), 
                       index_col='session_id')
train_df.head()
            site1  site2  site3  site4  site5  site6  site7  site8  site9  site10   user_id
session_id
0             192    574    133      3    133    133      3    133    203     133  user0031
1             415    193    674    254    133     31    393   3305    217      55  user0031
2              55      3     55     55      5    293    415    333    897      55  user0031
3             473   3306    473     55     55     55     55    937    199     123  user0031
4             342     55      5   3307    258    211   3308   2086    675    2086  user0031
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14061 entries, 0 to 14060
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   site1    14061 non-null  int64 
 1   site2    14061 non-null  int64 
 2   site3    14061 non-null  int64 
 3   site4    14061 non-null  int64 
 4   site5    14061 non-null  int64 
 5   site6    14061 non-null  int64 
 6   site7    14061 non-null  int64 
 7   site8    14061 non-null  int64 
 8   site9    14061 non-null  int64 
 9   site10   14061 non-null  int64 
 10  user_id  14061 non-null  object
dtypes: int64(10), object(1)
memory usage: 1.3+ MB

Distribution of the target class:

train_df['user_id'].value_counts()
user0128    2796
user0039    2204
user0207    1868
user0127    1712
user0237    1643
user0033    1022
user0050     802
user0031     760
user0100     720
user0241     534
Name: user_id, dtype: int64

Let's calculate the distribution of the number of unique sites in each session out of 10 sites visited in a row.

num_unique_sites = [np.unique(train_df.values[i, :-1]).shape[0] 
                    for i in range(train_df.shape[0])]
pd.Series(num_unique_sites).value_counts()
7     2308
6     2197
8     2046
5     1735
9     1394
2     1246
4     1163
3      894
10     651
1      427
dtype: int64
pd.Series(num_unique_sites).hist();

Let's check, using a QQ plot and the Shapiro–Wilk test, whether this quantity is normally distributed. The answer to the second question of the quiz will be a file with the word "YES" or "NO", depending on whether the number of unique sites in a session is normally distributed.

stats.probplot(num_unique_sites, dist="norm", plot=pylab)
pylab.show()
stat, p = stats.shapiro(num_unique_sites)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
Statistics=0.955, p=0.000
Sample does not look Gaussian (reject H0)

The distribution is not normal.

Let's test the hypothesis that, within a session of 10 sites, a user visits at least one site that they have already visited earlier in that session. Using the binomial test for a proportion, check that the share of cases in which the user revisited a site (that is, the number of unique sites in the session is < 10) is large: greater than 95% (note that the alternative to the share being equal to 95% is one-sided). The resulting p-value will be the answer to question 3 of the quiz.

has_two_similar = (np.array(num_unique_sites) < 10).astype('int')
len(num_unique_sites)
14061
stats.binom_test(has_two_similar.sum(), has_two_similar.shape[0], 0.95, alternative='greater')
0.02207653769072678

The p-value is 0.022.

Let's construct a 95% Wilson confidence interval for this proportion. Round the interval boundaries to 3 decimal places.

wilson_interval = proportion_confint(has_two_similar.sum(), has_two_similar.shape[0], method='wilson')
wilson_interval
(0.9501028841411286, 0.9570527377232229)

The interval is (0.950, 0.957).

What is the 95% confidence interval for the mean frequency of site appearances in the sample? We need to build a 95% confidence interval for the mean site visit frequency over all sites (not only those visited at least 1000 times) based on the bootstrap. Use as many bootstrap subsamples as there were sites in the original sample of 10 users. Take the subsamples from the already computed list of site visit frequencies – there is no need to recount them. Note that the frequency of site 0 (the zero index appears where sessions were shorter than 10 sites) should not be included. Round the interval boundaries to 3 decimal places.

Bagging (from Bootstrap aggregation) is one of the first and simplest types of ensembles. It was invented by Leo Breiman in 1994. Bagging is based on the statistical bootstrap method, which makes it possible to estimate many statistics of complex distributions.

The bootstrap method works as follows. Let there be a sample X of size N. Draw N objects from it uniformly and with replacement: N times we pick an arbitrary object of the sample (each object is picked with the same probability 1/N), each time choosing from all of the original N objects. You can picture a bag from which balls are drawn: a ball selected at some step is put back into the bag, and the next draw is again made with equal probability from the same number of balls. Because of the replacement, there will be repeats among the drawn objects. Denote the new sample by X1. Repeating the procedure M times, we generate M subsamples X1, ..., XM. Now we have a sufficiently large number of samples and can estimate various statistics of the original distribution.

def get_bootstrap_samples(data, n_samples, random_seed=56):
    # function for generating subsamples using bootstrap
    np.random.seed(random_seed)
    indices = np.random.randint(0, len(data), (n_samples, len(data)))
    samples = data[indices]
    return samples
def stat_intervals(stat, alpha):
    # function for interval estimation
    boundaries = np.percentile(stat, 
                 [100 * alpha / 2., 100 * (1 - alpha / 2.)])
    return boundaries
stock_files = sorted(glob(os.path.join(PATH_TO_DATA, '10users/*.csv')))
df = pd.concat((pd.read_csv(file) for file in stock_files), ignore_index=True)
df.head()
timestamp site
0 2013-11-15 08:12:07 fpdownload2.macromedia.com
1 2013-11-15 08:12:17 laposte.net
2 2013-11-15 08:12:17 www.laposte.net
3 2013-11-15 08:12:17 www.google.com
4 2013-11-15 08:12:18 www.laposte.net
# site -> number of visits, sorted by frequency in descending order
sorted_site = dict(sorted(Counter(df.site).items(), key=lambda kv: kv[1], reverse=True))
sorted_site_1000 = {}

for key, value in sorted_site.items():
    if value >= 1000:
        sorted_site_1000[key] = value
    
plt.hist(list(sorted_site_1000.values()))
plt.show()
site_freq = list(sorted_site.values())
site_mean_scores = list(map(np.mean, get_bootstrap_samples(np.array(site_freq), len(site_freq))))
print ("95% confidence interval for the ILEC median repair time:",  stat_intervals(site_mean_scores, 0.05))
95% confidence interval for the ILEC median repair time: [22.53311622 36.08414411]

As a result, we find that with 95% confidence the mean site visit frequency lies between 22.533 and 36.084.