I recently completed a test task for one organization. The data were presented and the main task was to predict the profitability based on the presented data. The values themselves were significantly scattered and unbalanced. Also, according to the data, information was not provided about what each value means and what kind of relationship exists.I immediately decided that I would conduct the analysis using popular boosting methods.

The first step was to download the necessary libraries.

import time
from datetime import datetime

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

from collections import Counter
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler


%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

DataFarame

Now you can look at the data itself.

df = pd.read_csv('/content/drive/MyDrive/DATA/dataset.csv')
df.head()
Unnamed: 0 cmlt_daily_game_currency cmlt_seconds_with_us cmlt_max_sessions_duration cmlt_sum_sessions_duration cmlt_count_sessions cmlt_sum_quantum_duration cmlt_count_quant cmlt_max_quantum_duration cmlt_max_quant ... cmlt_spent_hc_per_grind cmlt_spent_hc_div_active_time cmlt_seconds_div_active_time cmlt_hard_med_spent birthday sex is_cheater has_email time_confirm_email target_game_currency
0 0 0.0 0 117 234 2 0 0 0 0 ... 0.000000 0.000000 0.000000 0.000000 NaN NaN NaN NaN NaN 0.0
1 1 0.0 125 668 668 1 1448 4 271 28 ... 0.000000 0.000000 62.500000 0.000000 NaN NaN False False NaN 0.0
2 2 0.0 14070 4137 15551 12 41220 119 796 275 ... 0.298507 0.298507 210.000000 2.857143 1935-10-22 female False False NaN 0.0
3 3 0.0 0 290 290 1 176 2 44 3 ... 0.000000 0.000000 0.000000 0.000000 1987-08-08 female False True NaN 0.0
4 4 0.0 887 1002 1002 1 2448 12 123 121 ... 0.000000 0.000000 126.714286 0.000000 NaN NaN False False NaN 0.0

5 rows × 40 columns

df_work = df.copy()
df_work = df_work.drop(['Unnamed: 0'], axis = 1)
futures_number = df_work.select_dtypes(include=['int64', 'float64']).columns
df_work.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250006 entries, 0 to 250005
Data columns (total 39 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   cmlt_daily_game_currency       250006 non-null  float64
 1   cmlt_seconds_with_us           250006 non-null  int64  
 2   cmlt_max_sessions_duration     250006 non-null  int64  
 3   cmlt_sum_sessions_duration     250006 non-null  int64  
 4   cmlt_count_sessions            250006 non-null  int64  
 5   cmlt_sum_quantum_duration      250006 non-null  int64  
 6   cmlt_count_quant               250006 non-null  int64  
 7   cmlt_max_quantum_duration      250006 non-null  int64  
 8   cmlt_max_quant                 250006 non-null  int64  
 9   cmlt_final_level               250006 non-null  int64  
 10  cmlt_count_grind               250006 non-null  int64  
 11  cmlt_max_grind                 250006 non-null  int64  
 12  cmlt_count_won_grind           250006 non-null  int64  
 13  cmlt_hard_buy                  250006 non-null  int64  
 14  cmlt_hard_earn                 250006 non-null  int64  
 15  cmlt_hard_gift                 250006 non-null  int64  
 16  cmlt_hard_spent                250006 non-null  int64  
 17  cmlt_hard_max_spent            250006 non-null  int64  
 18  country                        249950 non-null  object 
 19  country_top_tier               250006 non-null  int64  
 20  network_name                   224850 non-null  object 
 21  date_install                   250006 non-null  object 
 22  first_command_time             250006 non-null  object 
 23  cmlt_time_4grind               250006 non-null  float64
 24  cmlt_time_with_us_4grind       250006 non-null  float64
 25  cmlt_avg_time_for_level        250006 non-null  float64
 26  cmlt_avg_duration              250006 non-null  float64
 27  cmlt_avg_grind_duration        250006 non-null  float64
 28  cmlt_winrate                   250006 non-null  float64
 29  cmlt_spent_hc_per_grind        250006 non-null  float64
 30  cmlt_spent_hc_div_active_time  250006 non-null  float64
 31  cmlt_seconds_div_active_time   250006 non-null  float64
 32  cmlt_hard_med_spent            250006 non-null  float64
 33  birthday                       64243 non-null   object 
 34  sex                            63041 non-null   object 
 35  is_cheater                     244347 non-null  object 
 36  has_email                      244347 non-null  object 
 37  time_confirm_email             4234 non-null    object 
 38  target_game_currency           250006 non-null  float64
dtypes: float64(12), int64(18), object(9)
memory usage: 74.4+ MB
df_work.shape
(250006, 39)
df.astype(bool).sum(axis=0)
Unnamed: 0                       250005
cmlt_daily_game_currency           4835
cmlt_seconds_with_us             187835
cmlt_max_sessions_duration       250006
cmlt_sum_sessions_duration       250006
cmlt_count_sessions              250006
cmlt_sum_quantum_duration        207246
cmlt_count_quant                 207246
cmlt_max_quantum_duration        207246
cmlt_max_quant                   207246
cmlt_final_level                 250006
cmlt_count_grind                 207246
cmlt_max_grind                   207246
cmlt_count_won_grind             206057
cmlt_hard_buy                      4727
cmlt_hard_earn                   166791
cmlt_hard_gift                   230864
cmlt_hard_spent                   79057
cmlt_hard_max_spent               79057
country                          250006
country_top_tier                 123327
network_name                     250006
date_install                     250006
first_command_time               250006
cmlt_time_4grind                 207246
cmlt_time_with_us_4grind         187835
cmlt_avg_time_for_level          250006
cmlt_avg_duration                250006
cmlt_avg_grind_duration          207246
cmlt_winrate                     206057
cmlt_spent_hc_per_grind           78630
cmlt_spent_hc_div_active_time     78630
cmlt_seconds_div_active_time     187835
cmlt_hard_med_spent               79057
birthday                         250006
sex                              250006
is_cheater                         5850
has_email                         44690
time_confirm_email               250006
target_game_currency               9086
dtype: int64
df.nunique()
Unnamed: 0                       250006
cmlt_daily_game_currency            576
cmlt_seconds_with_us              74344
cmlt_max_sessions_duration        16404
cmlt_sum_sessions_duration        45566
cmlt_count_sessions                 130
cmlt_sum_quantum_duration         37932
cmlt_count_quant                   1120
cmlt_max_quantum_duration          4806
cmlt_max_quant                     3027
cmlt_final_level                    119
cmlt_count_grind                    669
cmlt_max_grind                     3976
cmlt_count_won_grind                467
cmlt_hard_buy                       343
cmlt_hard_earn                      122
cmlt_hard_gift                       63
cmlt_hard_spent                     911
cmlt_hard_max_spent                 455
country                             216
country_top_tier                      2
network_name                         13
date_install                         90
first_command_time               245466
cmlt_time_4grind                  96448
cmlt_time_with_us_4grind         110912
cmlt_avg_time_for_level           77350
cmlt_avg_duration                 76352
cmlt_avg_grind_duration           96448
cmlt_winrate                      14332
cmlt_spent_hc_per_grind           10339
cmlt_spent_hc_div_active_time     10339
cmlt_seconds_div_active_time     110912
cmlt_hard_med_spent                1426
birthday                          21972
sex                                   3
is_cheater                            2
has_email                             2
time_confirm_email                 4110
target_game_currency               2713
dtype: int64
unique_values = df_work.select_dtypes(include="number").nunique().sort_values()

unique_values.plot.bar(logy=True, figsize=(15, 4), title="Unique values per feature");
df.isna().sum()
Unnamed: 0                            0
cmlt_daily_game_currency              0
cmlt_seconds_with_us                  0
cmlt_max_sessions_duration            0
cmlt_sum_sessions_duration            0
cmlt_count_sessions                   0
cmlt_sum_quantum_duration             0
cmlt_count_quant                      0
cmlt_max_quantum_duration             0
cmlt_max_quant                        0
cmlt_final_level                      0
cmlt_count_grind                      0
cmlt_max_grind                        0
cmlt_count_won_grind                  0
cmlt_hard_buy                         0
cmlt_hard_earn                        0
cmlt_hard_gift                        0
cmlt_hard_spent                       0
cmlt_hard_max_spent                   0
country                              56
country_top_tier                      0
network_name                      25156
date_install                          0
first_command_time                    0
cmlt_time_4grind                      0
cmlt_time_with_us_4grind              0
cmlt_avg_time_for_level               0
cmlt_avg_duration                     0
cmlt_avg_grind_duration               0
cmlt_winrate                          0
cmlt_spent_hc_per_grind               0
cmlt_spent_hc_div_active_time         0
cmlt_seconds_div_active_time          0
cmlt_hard_med_spent                   0
birthday                         185763
sex                              186965
is_cheater                         5659
has_email                          5659
time_confirm_email               245772
target_game_currency                  0
dtype: int64
plt.figure(figsize=(10, 8))
plt.imshow(df_work.isna(), aspect="auto", interpolation="nearest", cmap="gray")
plt.xlabel("Column Number")
plt.ylabel("Sample Number");
import missingno as msno

msno.matrix(df_work, labels=True, sort="descending");
df_work.isna().mean().sort_values().plot(
    kind="bar", figsize=(15, 4),
    title="Percentage of missing values per feature",
    ylabel="Ratio of missing values per feature");
global view of the dataset
df_work.plot(lw=0, marker=".", subplots=True, layout=(-1, 4),
          figsize=(15, 30), markersize=1);
Feature distribution
df_work.hist(bins=25, figsize=(15, 25), layout=(-1, 5), edgecolor="black")
plt.tight_layout();

I deleted the parasitic column and selected a separate variable that will contain the values of numeric columns. Many columns contain a large number of null values. Text columns are characterized by the presence of a large number of NaN values. The values are characterized by a significant spread. The dataframe size is 25,000 rows by 39 columns. The analysis itself is carried out in colab.

Data Preparation

Prepare int and float

df_work = df_work.drop(['birthday', 'sex', 'time_confirm_email'], axis=1)
prepare_data = pd.DataFrame()
futures_object = df_work.select_dtypes(include=['object']).columns
X_num = df_work[futures_number]
mask = np.zeros_like(X_num.corr(), dtype=np.bool) 
mask[np.triu_indices_from(mask)] = True 
f, ax = plt.subplots(figsize=(16, 12))
plt.title('Pearson Correlation Matrix',fontsize=25)


sns.heatmap(X_num.corr(),linewidths=0.25,vmax=0.7,square=True,cmap="BuGn", #"BuGn_r" to reverse 
            linecolor='w',annot=True,annot_kws={"size":8},mask=mask,cbar_kws={"shrink": .9});