Identification of Internet users
In this project, we will tackle the problem of identifying a user by their behavior on the Internet. This is a complex and interesting task at the intersection of data analysis and behavioral psychology. For example, Yandex uses behavior to detect someone who has broken into a mailbox. In a nutshell, the intruder behaves differently from the mailbox owner: they may not delete messages immediately after reading them, as the owner did, they check messages in a different way, and they even move the mouse differently. Such an attacker can then be identified and "thrown out" of the mailbox, with the owner asked to log in using an SMS code. This pilot project is described in an article on Habrahabr. Similar things are done, for example, in Google Analytics and are described in scientific papers; you can find many of them by searching for the phrases "Traversal Pattern Mining" and "Sequential Pattern Mining".
We will solve a similar problem: given a sequence of several websites visited in a row by the same person, we will identify that person. The idea is that Internet users follow links in different ways, and this can help identify them (one person first checks mail, then reads about football, then the news, then a social network, and only then gets to work, while another gets to work straight away, if that is possible).
We will use data from the article "A Tool for Classification of Sequential Data". We cannot recommend the article itself (the methods it describes are far from state of the art; it is better to consult the book "Frequent Pattern Mining" and recent ICDM papers), but the data were collected carefully and are of interest.
The data come from the proxy servers of Blaise Pascal University and have a very simple structure: user ID, timestamp, visited website.
You can download the source data from the link in the article (a description is provided there as well). For this project, you do not need the data for all 3000 users; the subsets for 10 and 150 users are enough. Link to the archive capstone_user_identification.zip (~7 MB, ~60 MB unpacked).
In the course of the project, you will have 4 Programming Assignment tasks, dedicated to data preprocessing, initial analysis, visual data analysis, comparison of classification models, and tuning the selected model and studying its overfitting. You will also have 3 peer-reviewed tasks (Peer Review): on data visualization (including newly created features), on evaluating the results of participation in the Kaggle Inclass competition, and on the project as a whole.
During the project, we will work with the Vowpal Wabbit library. If you have problems installing it, you can use a Docker image, for example the one described in the wiki of the OpenDataScience open machine learning course repository.
The project plan is as follows:
Week 1. Preparing data for analysis and model building. Programming Assignment
The first part of the project is devoted to preparing the data for further descriptive analysis and for building predictive models. You will need to write code that preprocesses the data (initially, the websites visited by each user are stored in a separate file) and assembles a single training sample. In this part we will also get acquainted with the sparse data format (scipy.sparse matrices), which is well suited to this task.
- Preparing a training sample
- Working with the sparse data format
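As an illustration of the two steps above, here is a minimal sketch of how a training sample and its sparse representation might be built. It assumes the per-user files have already been read into a dictionary mapping a user ID to the ordered list of sites that user visited; the function names, the padding value 0, and the choice of non-overlapping sessions are assumptions made for this example, not the assignment's required interface.

```python
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix


def build_site_index(visits_by_user):
    """Map each site name to an integer ID (1 = most frequent site)."""
    counts = Counter(site for sites in visits_by_user.values() for site in sites)
    return {site: i for i, (site, _) in enumerate(counts.most_common(), start=1)}


def make_sessions(visits_by_user, site_index, session_length=10):
    """Cut each user's visits into non-overlapping sessions, padded with 0."""
    rows, labels = [], []
    for user_id, sites in visits_by_user.items():
        ids = [site_index[s] for s in sites]
        for start in range(0, len(ids), session_length):
            chunk = ids[start:start + session_length]
            chunk += [0] * (session_length - len(chunk))  # pad a short last session
            rows.append(chunk)
            labels.append(user_id)
    return np.array(rows), np.array(labels)


def sessions_to_sparse(sessions, n_sites):
    """Convert sessions of site IDs into a sparse (n_sessions x n_sites) count matrix."""
    n_sessions, session_length = sessions.shape
    row_idx = np.repeat(np.arange(n_sessions), session_length)
    col_idx = sessions.ravel()
    mask = col_idx > 0  # drop the padding zeros
    data = np.ones(mask.sum(), dtype=np.int64)
    # csr_matrix sums duplicate (row, column) pairs, which yields visit counts.
    return csr_matrix(
        (data, (row_idx[mask], col_idx[mask] - 1)), shape=(n_sessions, n_sites)
    )


# Toy input standing in for the real per-user files (hypothetical site names).
toy_visits = {
    1: ["mail.com", "news.com", "mail.com"],
    2: ["work.org", "mail.com"],
}
toy_index = build_site_index(toy_visits)
toy_sessions, toy_users = make_sessions(toy_visits, toy_index, session_length=3)
toy_matrix = sessions_to_sparse(toy_sessions, n_sites=len(toy_index))
```

Each row of the resulting matrix corresponds to one session and each column to one site, so most entries are zero; this is why a sparse format saves memory compared with a dense array.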
Week 2. Preparation and initial analysis of data. Programming Assignment
In the second week, we will continue preparing the data for further analysis and for building predictive models. Earlier we defined a session as a sequence of 10 sites visited by a user. Now we will make the session length a parameter, so that later, when training predictive models, we can choose the best session length. We will also get acquainted with the preprocessed data and statistically test the first hypotheses suggested by our observations.
- Preparation of several training samples for comparison
- Primary data analysis, hypothesis testing
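A parameterized session builder could look roughly like the sketch below. The session_length parameter follows the description above; window_size is interpreted here as the step between the starts of consecutive sessions, which is an assumption made for this example, as are the function name and the zero padding.

```python
def sliding_sessions(site_ids, session_length=10, window_size=10):
    """Split one user's sequence of site IDs into (possibly overlapping) sessions.

    A new session starts every `window_size` visits and contains up to
    `session_length` site IDs; shorter sessions are padded with zeros.
    """
    sessions = []
    for start in range(0, len(site_ids), window_size):
        chunk = list(site_ids[start:start + session_length])
        chunk += [0] * (session_length - len(chunk))
        sessions.append(chunk)
    return sessions


# With window_size < session_length, consecutive sessions overlap.
example = sliding_sessions([1, 2, 3, 4, 5], session_length=3, window_size=2)
# example == [[1, 2, 3], [3, 4, 5], [5, 0, 0]]
```

When window_size equals session_length the sessions do not overlap; smaller windows produce more (correlated) training examples from the same data.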
Week 3. Visual data analysis and feature construction. Peer-Review
In week 3, we will work on visual data analysis and feature construction. First, we will build and analyze several features together, and then you will be able to come up with and describe features of your own. The task is a Peer-Review, so creativity is very welcome here. If you use IPython widgets, the Plotly library, animations, and other interactive tools, so much the better for everyone.
- Visual data analysis
- Building features
Week 4. Comparison of classification algorithms. Programming Assignment
Here we will finally get to training classification models, compare several algorithms using cross-validation, and figure out which session parameters (session_length and window_size) work best. For the selected algorithm, we will also plot validation curves (how classification quality depends on one of the algorithm's hyperparameters) and learning curves (how classification quality depends on the sample size).
- Comparison of several algorithms on sessions of 10 sites
- Selection of parameters - session length and window width
- User-specific identification and learning curves
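The comparison, validation curve, and learning curve described above can be sketched with Scikit-learn as follows. The data here is synthetic (three imaginary users, each preferring their own block of ten sites), and the two models, the accuracy metric, the fold count, and the grid of C values are illustrative assumptions rather than the settings used in the assignment.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    learning_curve,
    validation_curve,
)
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(17)

# Synthetic stand-in for the session/site count matrix.
n_users, sessions_per_user, n_sites = 3, 40, 30
y = np.repeat(np.arange(n_users), sessions_per_user)
X_dense = rng.poisson(0.2, size=(len(y), n_sites))
for u in range(n_users):
    # Each user visits "their own" ten sites more often.
    X_dense[y == u, u * 10:(u + 1) * 10] += rng.poisson(1.0, size=(sessions_per_user, 10))
order = rng.permutation(len(y))  # shuffle so users are not stored in blocks
X, y = csr_matrix(X_dense[order]), y[order]

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)

# 1. Compare several algorithms with the same cross-validation splits.
models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logit": LogisticRegression(C=1.0, max_iter=1000),
}
cv_scores = {
    name: cross_val_score(model, X, y, cv=skf, scoring="accuracy").mean()
    for name, model in models.items()
}

# 2. Validation curve: quality as a function of one hyperparameter (here C).
C_values = np.logspace(-2, 1, 4)
vc_train, vc_valid = validation_curve(
    LogisticRegression(max_iter=1000), X, y,
    param_name="C", param_range=C_values, cv=skf, scoring="accuracy",
)

# 3. Learning curve: quality as a function of the training sample size.
lc_sizes, lc_train, lc_valid = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=[0.3, 0.6, 1.0], cv=skf, scoring="accuracy",
    shuffle=True, random_state=17,
)
```

Averaging vc_valid and lc_valid over the folds (axis 1) gives the curves to plot; a large gap between the training and validation curves is a typical sign of overfitting.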
Week 5. Kaggle Inclass User Identification competition. Peer-Review
Here we will recall the concept of stochastic gradient descent and try Scikit-learn's SGDClassifier, which works much faster on large samples than the algorithms we tested in week 4. We will also get acquainted with the data of the Kaggle user identification competition and make our first submissions to it. At the end of this week, those who beat the benchmarks specified in the competition will receive additional points.
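To make the idea concrete, here is a small sketch of training SGDClassifier in mini-batches with partial_fit, which is one reason it scales to large samples. The data is synthetic and the batch size, number of passes, and default loss are arbitrary choices for illustration; the competition data and its evaluation metric are not reproduced here.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(17)

# Synthetic sessions: three imaginary users, each with ten "favorite" sites.
n_sessions, n_sites, n_users = 600, 30, 3
y = rng.randint(0, n_users, size=n_sessions)
X_dense = rng.poisson(0.2, size=(n_sessions, n_sites))
for u in range(n_users):
    X_dense[y == u, u * 10:(u + 1) * 10] += rng.poisson(1.0, size=((y == u).sum(), 10))
X = csr_matrix(X_dense, dtype=np.float64)

# Simple holdout split: first 450 sessions for training, the rest for validation.
X_train, y_train = X[:450], y[:450]
X_valid, y_valid = X[450:], y[450:]

sgd = SGDClassifier(random_state=17)
classes = np.unique(y)  # partial_fit needs the full list of classes up front
batch_size = 150
for epoch in range(5):  # several passes over the training data
    for start in range(0, X_train.shape[0], batch_size):
        stop = start + batch_size
        sgd.partial_fit(X_train[start:stop], y_train[start:stop], classes=classes)

valid_accuracy = accuracy_score(y_valid, sgd.predict(X_valid))
```

Each partial_fit call updates the weights using only one batch, so the full matrix never has to be processed at once.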
Week 6. Vowpal Wabbit. Tutorial + Programming Assignment
This week we will get acquainted with the popular Vowpal Wabbit library and try it on web session data. We will first work with the news data that ships with Scikit-learn, in binary classification mode and then in multiclass mode. Then we will classify movie reviews from the IMDB website. Finally, we will apply Vowpal Wabbit to the web session data. There is a lot of material, but Vowpal Wabbit is worth it!
- Article about Vowpal Wabbit
- Applying Vowpal Wabbit to site visit data
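Before Vowpal Wabbit can be applied to site visit data, the sessions have to be written in its plain-text input format, where a line looks like "label |namespace feature feature ...". The helper below is a sketch of such a conversion; the function name, the namespace name "sites", and the convention that 0 marks padding are assumptions made for this example. For multiclass training in one-against-all mode, Vowpal Wabbit expects integer labels starting from 1.

```python
def session_to_vw(site_ids, label=None, namespace="sites"):
    """Format one session as a line in Vowpal Wabbit's text input format.

    `site_ids` is a sequence of integer site IDs where 0 means padding.
    `label` is the class label; leave it as None for unlabeled test examples.
    """
    features = " ".join(str(site_id) for site_id in site_ids if site_id != 0)
    prefix = "" if label is None else f"{label} "
    return f"{prefix}|{namespace} {features}"


def write_vw_file(path, sessions, labels=None):
    """Write all sessions to `path`, one Vowpal Wabbit example per line."""
    with open(path, "w") as f:
        for i, session in enumerate(sessions):
            label = None if labels is None else labels[i]
            f.write(session_to_vw(session, label) + "\n")


train_line = session_to_vw([1, 2, 1, 0], label=3)  # "3 |sites 1 2 1"
test_line = session_to_vw([4, 5, 0, 0])            # "|sites 4 5"
```

The resulting file can then be passed to the vw command-line tool for training.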
Week 7. Finalizing the project. Peer-Review
At the very end, peer review of the final versions of the project awaits you. There is plenty of room to explore here, because every stage of the project leaves freedom for creativity: you can use all the source data for 3000 users, create your own interesting features, build attractive plots, use your own models or ensembles of models, and draw conclusions. The advice is therefore as follows: as you complete the tasks, copy the code and its description into the project's .ipynb file in parallel, or describe the results along the way in a text editor.