Week 1. Data preparation for analysis and model building. Programming Assignment

The first part of the project is devoted to preparing the data for further descriptive analysis and for building predictive models. We will need to write code to preprocess the data (the visited websites are listed in a separate file for each user) and to form a single training sample. In this part we will also get acquainted with the sparse data format (scipy.sparse matrices), which is well suited for this task.

  • Preparation of a training sample
  • Working with sparse data format

Data preparation for analysis and model building


Preparation of the training sample


import os
import math
import collections
import time
import pickle

import pandas as pd
import numpy as np

from tqdm.auto import tqdm
from glob import glob
from collections import Counter
from scipy.sparse import csr_matrix
PATH_TO_DATA = '/content/drive/MyDrive/DATA/Stepik/capstone_user_identification'

According to the assignment, we need to implement a function prepare_train_set that takes as input the path to the directory with csv files (path_to_csv_files) and the parameter session_length (the length of a session), and returns two objects:

  • a DataFrame whose rows correspond to unique sessions of session_length sites, whose first session_length columns contain the indexes of those sites, and whose last column is the user ID
  • a frequency dictionary of sites of the form {'site_string': (site_id, site_freq)}, for example, for a small toy example it would be {'vk.com': (1, 2), 'google.com': (2, 2), 'yandex.ru': (3, 3), 'facebook.com': (4, 1)}

Details:

  • Use glob (or an analogue) to iterate over the files in the directory. For determinism, sort the list of files lexicographically. It is convenient to use tqdm to track the progress of the loop.
  • Create a frequency dictionary of unique sites (of the form {'site_string': (site_id, site_freq)}) and fill it in while reading the files. Start indexing from 1.
  • It is recommended to give smaller indexes to more frequently encountered sites (the minimum description length principle).
  • Do not do any entity resolution: count google.com, http://www.google.com and www.google.com as different sites.
  • Most likely, the number of records in a file is not a multiple of session_length. In that case the last session will be shorter; pad the remainder with zeros. For example, if a file has 24 entries and sessions have length 10, the third session will consist of 4 sites and will correspond to the vector [site1_id, site2_id, site3_id, site4_id, 0, 0, 0, 0, 0, 0, user_id].
  • As a result, some sessions may repeat; leave them as is, do not delete duplicates. If two sessions contain exactly the same sites but belong to different users, also leave them as is: this is a natural ambiguity in the data.
  • Do not include the padding site 0 in the frequency dictionary.

def prepare_train_set(path_to_csv_files, session_length=10):

    stock_files = sorted(glob(path_to_csv_files))

    # read all users' files into one shared dataframe
    df = pd.concat((pd.read_csv(file) for file in stock_files), ignore_index=True)

    # count site frequencies and sort them in descending order
    sorted_site = dict(sorted(Counter(df.site).items(),
                              key=lambda kv: kv[1], reverse=True))

    # assign site_id starting from 1, so that more frequent sites get smaller ids
    site_freq_dict = {}
    for i, site in enumerate(sorted_site, 1):
        site_freq_dict[site] = (i, sorted_site[site])

    # build the list of sessions: site ids plus the user id
    list_all_site = []
    user = 1
    for filename in tqdm(stock_files, desc='Loop2'):
        tmp_df = pd.read_csv(filename)
        list_site = []
        # translate this user's sites into site ids
        for site in tqdm(tmp_df.site, desc='Loop3'):
            list_site.append(site_freq_dict[site][0])
        # pad the last session with zeros up to a multiple of session_length
        multiple_len = math.ceil(len(list_site) / session_length) * session_length
        list_site.extend([0] * (multiple_len - len(list_site)))
        # split into sessions of session_length sites and append the user id
        count = 0
        while count < len(list_site) / session_length:
            ind_1 = count * session_length
            count += 1
            ind_2 = count * session_length
            sess = list_site[ind_1:ind_2]
            sess.append(user)
            list_all_site.append(sess)
        user += 1

    # assemble the resulting dataframe
    name_site = ['site' + str(i + 1) for i in range(session_length)]
    name_site.append('user_id')
    train_df = pd.DataFrame(list_all_site, columns=name_site)

    return train_df, site_freq_dict

I will test the function on three users first and check the result.

path = os.path.join(PATH_TO_DATA, '3users/*.csv')
path
'/content/drive/MyDrive/DATA/Stepik/capstone_user_identification/3users/*.csv'
start = time.time()

train_data_toy, site_freq_3users = prepare_train_set(path)

end = time.time()
print(end - start)
2.2746620178222656
site_freq_3users
{'accounts.google.com': (8, 1),
 'apis.google.com': (9, 1),
 'football.kulichki.ru': (6, 2),
 'geo.mozilla.org': (7, 1),
 'google.com': (1, 9),
 'mail.google.com': (5, 2),
 'meduza.io': (4, 3),
 'oracle.com': (2, 8),
 'plus.google.com': (10, 1),
 'vk.com': (3, 3),
 'yandex.ru': (11, 1)}
train_data_toy
site1 site2 site3 site4 site5 site6 site7 site8 site9 site10 user_id
0 3 2 2 7 2 1 8 5 9 10 1
1 3 1 1 1 0 0 0 0 0 0 1
2 3 2 6 6 2 0 0 0 0 0 2
3 4 1 2 1 2 1 1 5 11 4 3
4 4 1 2 0 0 0 0 0 0 0 3

Everything works, so I move on to 10 users.

Part 1. How many unique sessions of 10 sites are there in the sample with 10 users?

path = os.path.join(PATH_TO_DATA, '10users/*.csv')
start = time.time()

train_data_toy10, site_freq_10users = prepare_train_set(path)

end = time.time()
print(end - start)
1.2744076251983643
len(train_data_toy10)
14061

As a result, I got 14061 unique sessions of 10 sites from the 10 users.

Part 2. How many unique sites are there in a sample of 10 users?

len(site_freq_10users)
4913

In total, there are 4913 unique sites in the sample of 10 users.

Part 3. How many unique sessions of 10 sites are there in the sample with 150 users?

path = os.path.join(PATH_TO_DATA, '150users/*.csv')
start = time.time()

train_data_toy150, site_freq_150users = prepare_train_set(path)

end = time.time()
print(end - start)
16.766276597976685
len(train_data_toy150)
137019

This does not seem to be the most efficient implementation of the function: it took about 1.5 minutes for me, while the teacher's version runs in 1.7 seconds. The result is 137019 unique sessions.
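
This is likely because every site is processed in a pure-Python loop (with a nested tqdm) and each file is effectively read twice. Below is a hedged sketch of a vectorised mapping step, not the graded solution, assuming the site_freq_150users dictionary built above:

# sketch: vectorise the site -> id mapping with pandas instead of a per-row loop
site_to_id = {site: idx for site, (idx, _) in site_freq_150users.items()}

one_file = sorted(glob(os.path.join(PATH_TO_DATA, '150users/*.csv')))[0]
site_ids = pd.read_csv(one_file)['site'].map(site_to_id).to_numpy()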

Part 4. How many unique sites are there in the sample of 150 users?

len(site_freq_150users)
27797

There are 27797 unique sites in the sample of 150 users.

Part 5. What are the top 10 most visited sites among the 150 users?

list(site_freq_150users.keys())[:10]
['www.google.fr',
 'www.google.com',
 'www.facebook.com',
 'apis.google.com',
 's.youtube.com',
 'clients1.google.com',
 'mail.google.com',
 'plus.google.com',
 'safebrowsing-cache.google.com',
 'www.youtube.com']
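
The first 10 keys give the answer because the dictionary was built in descending order of frequency. An equivalent check that does not rely on dict insertion order, just sorting the stored (site_id, site_freq) tuples by frequency:

# equivalent: sort by the stored frequency and take the top 10 sites
top10 = sorted(site_freq_150users.items(), key=lambda kv: kv[1][1], reverse=True)[:10]
[site for site, _ in top10]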

Working with sparse data format

If you think about it, the obtained features site1, ..., site10 do not make much sense as features in a classification problem. But if we use the bag-of-words idea from text analysis, it is a different matter. Let's create new matrices in which rows correspond to sessions of 10 sites and columns correspond to site indexes. At the intersection of row $i$ and column $j$ there will be the number $n_{ij}$: how many times site $j$ occurred in session $i$. We will do this with sparse Scipy matrices, [csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html). First we test it on the toy example, then apply it to the 10 and 150 users.

Note that in short sessions (fewer than 10 sites) zeros are left over, so the first feature (how many times 0 occurred) differs in meaning from the rest (how many times the site with a given index occurred). Therefore the first column of the sparse matrix will need to be dropped.

X_toy, y_toy = train_data_toy.iloc[:, :-1].values, train_data_toy.iloc[:, -1].values
X_toy
array([[ 3,  2,  2,  7,  2,  1,  8,  5,  9, 10],
       [ 3,  1,  1,  1,  0,  0,  0,  0,  0,  0],
       [ 3,  2,  6,  6,  2,  0,  0,  0,  0,  0],
       [ 4,  1,  2,  1,  2,  1,  1,  5, 11,  4],
       [ 4,  1,  2,  0,  0,  0,  0,  0,  0,  0]])

There are two types of matrices: dense and sparse.

A sparse matrix is a matrix with predominantly zero elements; conversely, if most of the elements are non-zero, the matrix is considered dense.

There is no universally accepted threshold for how many non-zero elements a matrix may have and still be called sparse; different authors use different definitions.
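
As a quick illustration (the sizes here are arbitrary, not project data), a mostly-zero matrix takes far less memory in CSR form than as a dense array:

# arbitrary example: a mostly-zero matrix stored densely vs. as CSR
dense = np.zeros((10_000, 5_000), dtype=np.int8)
dense[::100, ::50] = 1                 # only 10_000 of 50_000_000 cells are non-zero
sparse = csr_matrix(dense)

print(dense.nbytes)                    # 50_000_000 bytes for the dense array
print(sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)  # roughly 90 KB

Now let's build the count matrix for the toy example using the same csr_matrix constructor: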

X_sparse_toy = csr_matrix((np.ones(X_toy.size, dtype=int), X_toy.reshape(-1), \
                           np.arange(X_toy.shape[0] + 1) * X_toy.shape[1]))[:, 1:]
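
In the (data, indices, indptr) form of the csr_matrix constructor, repeated column indices within a row end up being summed (as the todense() output below confirms), so passing ones as data and the site ids as column indices turns each row of ids into per-site counts. A small sketch of the same trick as a helper function (the name is mine, assuming X_toy as built above):

# sketch: the same CSR trick wrapped in a helper (hypothetical name)
def sessions_to_counts(X, drop_padding_column=True):
    n_sessions, session_length = X.shape
    data = np.ones(X.size, dtype=int)                     # every visit contributes 1
    indices = X.reshape(-1)                               # column index = site id
    indptr = np.arange(n_sessions + 1) * session_length   # row i spans session_length flat entries
    counts = csr_matrix((data, indices, indptr))          # repeated site ids add up to counts
    return counts[:, 1:] if drop_padding_column else counts

# sanity check against the inline expression above
assert (sessions_to_counts(X_toy).toarray() == X_sparse_toy.toarray()).all()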

The sparse matrix should have 11 columns, since in the toy example the 3 users visited 11 unique sites.

X_sparse_toy.todense()
matrix([[1, 3, 1, 0, 1, 0, 1, 1, 1, 1, 0],
        [3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 2, 1, 0, 0, 2, 0, 0, 0, 0, 0],
        [4, 2, 0, 2, 1, 0, 0, 0, 0, 0, 1],
        [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]])
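
A quick check of this claim, in the same spirit as the asserts at the end of the notebook:

# the toy matrix has one column per unique site of the 3 users
assert X_sparse_toy.shape[1] == len(site_freq_3users)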
train_data_toy10
X_10users, y_10users = train_data_toy10.iloc[:, :-1].values, \
                       train_data_toy10.iloc[:, -1].values
X_150users, y_150users = train_data_toy150.iloc[:, :-1].values, \
                         train_data_toy150.iloc[:, -1].values
X_sparse_10users = csr_matrix((np.ones(X_10users.size, dtype=int), X_10users.reshape(-1), \
                           np.arange(X_10users.shape[0] + 1) * X_10users.shape[1]))[:, 1:]
X_sparse_150users = csr_matrix((np.ones(X_150users.size, dtype=int), X_150users.reshape(-1), \
                           np.arange(X_150users.shape[0] + 1) * X_150users.shape[1]))[:, 1:]

Let's save these sparse matrices using pickle (Python serialization), along with the vectors y_10users and y_150users, the target values (user ids) for the samples of 10 and 150 users. The fact that the names of these matrices start with X and y hints that we will test the first classification models on this data. Finally, we will also save the frequency dictionaries of sites for 3, 10 and 150 users.

with open(os.path.join(PATH_TO_DATA, 
                       'X_sparse_10users.pkl'), 'wb') as X10_pkl:
    pickle.dump(X_sparse_10users, X10_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 
                       'y_10users.pkl'), 'wb') as y10_pkl:
    pickle.dump(y_10users, y10_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 
                       'X_sparse_150users.pkl'), 'wb') as X150_pkl:
    pickle.dump(X_sparse_150users, X150_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 
                       'y_150users.pkl'), 'wb') as y150_pkl:
    pickle.dump(y_150users, y150_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 
                       'site_freq_3users.pkl'), 'wb') as site_freq_3users_pkl:
    pickle.dump(site_freq_3users, site_freq_3users_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 
                       'site_freq_10users.pkl'), 'wb') as site_freq_10users_pkl:
    pickle.dump(site_freq_10users, site_freq_10users_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 
                       'site_freq_150users.pkl'), 'wb') as site_freq_150users_pkl:
    pickle.dump(site_freq_150users, site_freq_150users_pkl, protocol=2)
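
For the following parts these objects can be loaded back with pickle.load; as a quick round-trip check on one of them:

# round-trip check: load one of the pickled matrices back and compare its contents
with open(os.path.join(PATH_TO_DATA, 'X_sparse_10users.pkl'), 'rb') as X10_pkl:
    X_sparse_10users_loaded = pickle.load(X10_pkl)

assert X_sparse_10users_loaded.shape == X_sparse_10users.shape
assert np.array_equal(X_sparse_10users_loaded.data, X_sparse_10users.data)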

Just to be safe, let's check that the number of columns in the sparse matrices X_sparse_10users and X_sparse_150users equals the previously computed number of unique sites for 10 and 150 users, respectively.

assert X_sparse_10users.shape[1] == len(site_freq_10users)
assert X_sparse_150users.shape[1] == len(site_freq_150users)