I recently completed a test task for an organization. I was given a dataset and asked to predict profitability from it. The values themselves were widely scattered and unbalanced, and no information was provided about what each feature means or how the features relate to one another. I decided right away to run the analysis with popular boosting methods.

The first step was to import the necessary libraries.

import time
from datetime import datetime

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

from collections import Counter
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler


%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

DataFrame

Now you can look at the data itself.

df = pd.read_csv('/content/drive/MyDrive/DATA/dataset.csv')
df.head()
Unnamed: 0 cmlt_daily_game_currency cmlt_seconds_with_us cmlt_max_sessions_duration cmlt_sum_sessions_duration cmlt_count_sessions cmlt_sum_quantum_duration cmlt_count_quant cmlt_max_quantum_duration cmlt_max_quant ... cmlt_spent_hc_per_grind cmlt_spent_hc_div_active_time cmlt_seconds_div_active_time cmlt_hard_med_spent birthday sex is_cheater has_email time_confirm_email target_game_currency
0 0 0.0 0 117 234 2 0 0 0 0 ... 0.000000 0.000000 0.000000 0.000000 NaN NaN NaN NaN NaN 0.0
1 1 0.0 125 668 668 1 1448 4 271 28 ... 0.000000 0.000000 62.500000 0.000000 NaN NaN False False NaN 0.0
2 2 0.0 14070 4137 15551 12 41220 119 796 275 ... 0.298507 0.298507 210.000000 2.857143 1935-10-22 female False False NaN 0.0
3 3 0.0 0 290 290 1 176 2 44 3 ... 0.000000 0.000000 0.000000 0.000000 1987-08-08 female False True NaN 0.0
4 4 0.0 887 1002 1002 1 2448 12 123 121 ... 0.000000 0.000000 126.714286 0.000000 NaN NaN False False NaN 0.0

5 rows × 40 columns

df_work = df.copy()
df_work = df_work.drop(['Unnamed: 0'], axis = 1)
futures_number = df_work.select_dtypes(include=['int64', 'float64']).columns
df_work.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250006 entries, 0 to 250005
Data columns (total 39 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   cmlt_daily_game_currency       250006 non-null  float64
 1   cmlt_seconds_with_us           250006 non-null  int64  
 2   cmlt_max_sessions_duration     250006 non-null  int64  
 3   cmlt_sum_sessions_duration     250006 non-null  int64  
 4   cmlt_count_sessions            250006 non-null  int64  
 5   cmlt_sum_quantum_duration      250006 non-null  int64  
 6   cmlt_count_quant               250006 non-null  int64  
 7   cmlt_max_quantum_duration      250006 non-null  int64  
 8   cmlt_max_quant                 250006 non-null  int64  
 9   cmlt_final_level               250006 non-null  int64  
 10  cmlt_count_grind               250006 non-null  int64  
 11  cmlt_max_grind                 250006 non-null  int64  
 12  cmlt_count_won_grind           250006 non-null  int64  
 13  cmlt_hard_buy                  250006 non-null  int64  
 14  cmlt_hard_earn                 250006 non-null  int64  
 15  cmlt_hard_gift                 250006 non-null  int64  
 16  cmlt_hard_spent                250006 non-null  int64  
 17  cmlt_hard_max_spent            250006 non-null  int64  
 18  country                        249950 non-null  object 
 19  country_top_tier               250006 non-null  int64  
 20  network_name                   224850 non-null  object 
 21  date_install                   250006 non-null  object 
 22  first_command_time             250006 non-null  object 
 23  cmlt_time_4grind               250006 non-null  float64
 24  cmlt_time_with_us_4grind       250006 non-null  float64
 25  cmlt_avg_time_for_level        250006 non-null  float64
 26  cmlt_avg_duration              250006 non-null  float64
 27  cmlt_avg_grind_duration        250006 non-null  float64
 28  cmlt_winrate                   250006 non-null  float64
 29  cmlt_spent_hc_per_grind        250006 non-null  float64
 30  cmlt_spent_hc_div_active_time  250006 non-null  float64
 31  cmlt_seconds_div_active_time   250006 non-null  float64
 32  cmlt_hard_med_spent            250006 non-null  float64
 33  birthday                       64243 non-null   object 
 34  sex                            63041 non-null   object 
 35  is_cheater                     244347 non-null  object 
 36  has_email                      244347 non-null  object 
 37  time_confirm_email             4234 non-null    object 
 38  target_game_currency           250006 non-null  float64
dtypes: float64(12), int64(18), object(9)
memory usage: 74.4+ MB
df_work.shape
(250006, 39)
df.astype(bool).sum(axis=0)
Unnamed: 0                       250005
cmlt_daily_game_currency           4835
cmlt_seconds_with_us             187835
cmlt_max_sessions_duration       250006
cmlt_sum_sessions_duration       250006
cmlt_count_sessions              250006
cmlt_sum_quantum_duration        207246
cmlt_count_quant                 207246
cmlt_max_quantum_duration        207246
cmlt_max_quant                   207246
cmlt_final_level                 250006
cmlt_count_grind                 207246
cmlt_max_grind                   207246
cmlt_count_won_grind             206057
cmlt_hard_buy                      4727
cmlt_hard_earn                   166791
cmlt_hard_gift                   230864
cmlt_hard_spent                   79057
cmlt_hard_max_spent               79057
country                          250006
country_top_tier                 123327
network_name                     250006
date_install                     250006
first_command_time               250006
cmlt_time_4grind                 207246
cmlt_time_with_us_4grind         187835
cmlt_avg_time_for_level          250006
cmlt_avg_duration                250006
cmlt_avg_grind_duration          207246
cmlt_winrate                     206057
cmlt_spent_hc_per_grind           78630
cmlt_spent_hc_div_active_time     78630
cmlt_seconds_div_active_time     187835
cmlt_hard_med_spent               79057
birthday                         250006
sex                              250006
is_cheater                         5850
has_email                         44690
time_confirm_email               250006
target_game_currency               9086
dtype: int64
df.nunique()
Unnamed: 0                       250006
cmlt_daily_game_currency            576
cmlt_seconds_with_us              74344
cmlt_max_sessions_duration        16404
cmlt_sum_sessions_duration        45566
cmlt_count_sessions                 130
cmlt_sum_quantum_duration         37932
cmlt_count_quant                   1120
cmlt_max_quantum_duration          4806
cmlt_max_quant                     3027
cmlt_final_level                    119
cmlt_count_grind                    669
cmlt_max_grind                     3976
cmlt_count_won_grind                467
cmlt_hard_buy                       343
cmlt_hard_earn                      122
cmlt_hard_gift                       63
cmlt_hard_spent                     911
cmlt_hard_max_spent                 455
country                             216
country_top_tier                      2
network_name                         13
date_install                         90
first_command_time               245466
cmlt_time_4grind                  96448
cmlt_time_with_us_4grind         110912
cmlt_avg_time_for_level           77350
cmlt_avg_duration                 76352
cmlt_avg_grind_duration           96448
cmlt_winrate                      14332
cmlt_spent_hc_per_grind           10339
cmlt_spent_hc_div_active_time     10339
cmlt_seconds_div_active_time     110912
cmlt_hard_med_spent                1426
birthday                          21972
sex                                   3
is_cheater                            2
has_email                             2
time_confirm_email                 4110
target_game_currency               2713
dtype: int64
unique_values = df_work.select_dtypes(include="number").nunique().sort_values()

unique_values.plot.bar(logy=True, figsize=(15, 4), title="Unique values per feature");
df.isna().sum()
Unnamed: 0                            0
cmlt_daily_game_currency              0
cmlt_seconds_with_us                  0
cmlt_max_sessions_duration            0
cmlt_sum_sessions_duration            0
cmlt_count_sessions                   0
cmlt_sum_quantum_duration             0
cmlt_count_quant                      0
cmlt_max_quantum_duration             0
cmlt_max_quant                        0
cmlt_final_level                      0
cmlt_count_grind                      0
cmlt_max_grind                        0
cmlt_count_won_grind                  0
cmlt_hard_buy                         0
cmlt_hard_earn                        0
cmlt_hard_gift                        0
cmlt_hard_spent                       0
cmlt_hard_max_spent                   0
country                              56
country_top_tier                      0
network_name                      25156
date_install                          0
first_command_time                    0
cmlt_time_4grind                      0
cmlt_time_with_us_4grind              0
cmlt_avg_time_for_level               0
cmlt_avg_duration                     0
cmlt_avg_grind_duration               0
cmlt_winrate                          0
cmlt_spent_hc_per_grind               0
cmlt_spent_hc_div_active_time         0
cmlt_seconds_div_active_time          0
cmlt_hard_med_spent                   0
birthday                         185763
sex                              186965
is_cheater                         5659
has_email                          5659
time_confirm_email               245772
target_game_currency                  0
dtype: int64
plt.figure(figsize=(10, 8))
plt.imshow(df_work.isna(), aspect="auto", interpolation="nearest", cmap="gray")
plt.xlabel("Column Number")
plt.ylabel("Sample Number");
import missingno as msno

msno.matrix(df_work, labels=True, sort="descending");
df_work.isna().mean().sort_values().plot(
    kind="bar", figsize=(15, 4),
    title="Percentage of missing values per feature",
    ylabel="Ratio of missing values per feature");
Global view of the dataset
df_work.plot(lw=0, marker=".", subplots=True, layout=(-1, 4),
          figsize=(15, 30), markersize=1);
Feature distribution
df_work.hist(bins=25, figsize=(15, 25), layout=(-1, 5), edgecolor="black")
plt.tight_layout();

I dropped the spurious index column and stored the names of the numeric columns in a separate variable. Many columns contain a large number of zero values, and the object (text) columns contain many NaNs. The values show a significant spread. The dataframe has 250,006 rows and 39 columns. The analysis itself was carried out in Colab.
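The spread and imbalance of the target can be put into numbers with a quick check like the one below (not part of the original notebook; it only summarizes the column already shown above).

print(df_work['target_game_currency'].describe())
print('Share of players with a non-zero target:', (df_work['target_game_currency'] > 0).mean())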

Data Preparation

Prepare int and float

df_work = df_work.drop(['birthday', 'sex', 'time_confirm_email'], axis=1)
prepare_data = pd.DataFrame()
futures_object = df_work.select_dtypes(include=['object']).columns
X_num = df_work[futures_number]
mask = np.zeros_like(X_num.corr(), dtype=bool)  # plain bool: np.bool is deprecated in recent NumPy
mask[np.triu_indices_from(mask)] = True 
f, ax = plt.subplots(figsize=(16, 12))
plt.title('Pearson Correlation Matrix',fontsize=25)


sns.heatmap(X_num.corr(),linewidths=0.25,vmax=0.7,square=True,cmap="BuGn", #"BuGn_r" to reverse 
            linecolor='w',annot=True,annot_kws={"size":8},mask=mask,cbar_kws={"shrink": .9});
X_num = X_num.drop(['target_game_currency'], axis = 1)

y = df_work['target_game_currency']
scaler = MinMaxScaler()
scaled = scaler.fit_transform(X_num)  # fit and transform in one step
X_num_scaled = pd.DataFrame(scaled, columns = X_num.columns)
prepare_data = pd.concat([prepare_data, X_num_scaled])
del(X_num_scaled)
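A side note on methodology: here the scaler is fitted on the whole dataset before cross-validation. A common refinement, shown below only as a minimal sketch (it uses scikit-learn's Pipeline with a placeholder LinearRegression and is not part of the original solution), is to fit the scaler inside each training fold so that the validation folds cannot influence the scaling parameters.

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# The MinMaxScaler is re-fitted on every training fold when the pipeline is cross-validated,
# so no information from the validation fold leaks into the scaling.
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
leakage_free_scores = cross_val_score(pipe, X_num, y, cv=5, scoring='neg_mean_absolute_error')
print(leakage_free_scores.mean())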

Prepare date columns

prepare_data_date = pd.DataFrame()
# .astype(int) on a datetime column gives nanoseconds since the UNIX epoch
prepare_data_date['date_install'] = pd.to_datetime(df_work['date_install']).astype(int)
prepare_data_date['data_first_command_time'] = pd.to_datetime(df_work['first_command_time']).astype(int)
prepare_data = pd.concat([prepare_data, prepare_data_date], axis = 1)
del(prepare_data_date)

Prepare Boolean data

# is_cheater: fill NaN with the most frequent value, then encode as 0/1
most_common = df_work['is_cheater'].value_counts().index[0]
df_work['is_cheater'] = df_work['is_cheater'].fillna(most_common)
prepare_data['is_cheater'] = (df_work['is_cheater'] != most_common).astype(int)
# has_email: same approach
most_common = df_work['has_email'].value_counts().index[0]
df_work['has_email'] = df_work['has_email'].fillna(most_common)
prepare_data['has_email'] = (df_work['has_email'] != most_common).astype(int)

Prepare object columns

df_work_str = df_work[['country', 'network_name']].fillna('other')
prepare_data = pd.concat([prepare_data, df_work_str], axis = 1)
del(df_work_str)

I dropped the columns with a large share of NaN values and created a separate dataframe to hold the prepared data, plus a separate variable listing the object (text) columns. The Pearson correlation matrix showed that some features are strongly related to one another. I scaled the numeric data with MinMaxScaler(), converted the date columns to UNIX timestamps, and mapped the Boolean columns to 0 and 1.
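As a possible follow-up to the correlation matrix (not done in this solution), strongly correlated features could be pruned before modelling. A minimal sketch, assuming an illustrative threshold of 0.9:

# Find feature pairs whose absolute Pearson correlation exceeds the threshold
corr = X_num.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print('Candidates for dropping:', to_drop)
# X_num_reduced = X_num.drop(columns=to_drop)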

Undersampling and Oversampling

Undersampling

not_zero_target = (df_work['target_game_currency'] > 0).astype(int)

df_for_res = prepare_data.copy()

df_for_res['target_game_currency'] = df_work['target_game_currency']

print('Original dataset shape %s' % Counter(not_zero_target))

rus = RandomUnderSampler(random_state=56)

X_res, y_res = rus.fit_resample(df_for_res, not_zero_target)

print('Resampled dataset shape %s' % Counter(y_res))
Original dataset shape Counter({0: 240920, 1: 9086})
Resampled dataset shape Counter({0: 9086, 1: 9086})
y_usamp = X_res['target_game_currency']
X_usamp = X_res.drop(['target_game_currency'], axis = 1)

Oversampling

not_zero_target = (df_work['target_game_currency'] > 0).astype(int)

df_for_res = prepare_data.copy()

df_for_res['target_game_currency'] = df_work['target_game_currency']

print('Original dataset shape %s' % Counter(not_zero_target))

ros = RandomOverSampler(random_state=56)

X_res, y_res = ros.fit_resample(df_for_res, not_zero_target)

print('Resampled dataset shape %s' % Counter(y_res))
Original dataset shape Counter({0: 240920, 1: 9086})
Resampled dataset shape Counter({0: 240920, 1: 240920})
y_osamp = X_res['target_game_currency']
X_osamp = X_res.drop(['target_game_currency'], axis = 1)

Because the target is heavily unbalanced (the vast majority of values are zero) and very widely spread, I decided to apply and compare two approaches: undersampling and oversampling.
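For reference, the zero/non-zero split used above is only one way to drive the resamplers. A hypothetical variant (not used in this solution; the bin edges below are purely illustrative) would bin the continuous target and balance the bins instead:

# Balance coarse bins of the continuous target rather than just zero vs non-zero
target_bins = pd.cut(df_work['target_game_currency'],
                     bins=[-1, 0, 100, 1000, np.inf], labels=False)
rus_binned = RandomUnderSampler(random_state=56)
X_res_binned, bins_res = rus_binned.fit_resample(df_for_res, target_bins)
print('Resampled bin counts:', Counter(bins_res))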

Machine Learning

The experiments were carried out with CatBoost and XGBoost; the latter additionally requires the text (categorical) columns to be encoded, since CatBoost can consume them directly through its cat_features argument.

CatBoost

!pip install catboost
import math
from catboost import CatBoostRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
K = 5
kf = KFold(n_splits = K, random_state = 1, shuffle = True)
categorical_var = np.where(prepare_data.dtypes == object)[0]  # positions of the text columns for CatBoost
model_catboost = CatBoostRegressor(verbose=0, n_estimators=100)
def cv_catboost(dataframe, y, categorical_var = categorical_var):
  cum_MAE = 0
  cum_MSE = 0
  cum_RMSE = 0

  for i, (train_index, test_index) in enumerate(kf.split(dataframe)):

      # Create data for this fold
      y_train, y_valid = y.iloc[train_index], y.iloc[test_index]
      X_train, X_valid = dataframe.iloc[train_index,:], dataframe.iloc[test_index,:]
      print( "\nFold ", i)
      
      # Run model for this fold     
      fit_model = model_catboost.fit( X_train, y_train, cat_features = categorical_var)
      
      print( "  N trees = ", model_catboost.tree_count_ )
          
      # Generate validation predictions for this fold
      pred = fit_model.predict(X_valid)
      mae = metrics.mean_absolute_error(y_valid, pred)
      mse = mean_squared_error(y_valid, pred)
      rmse = math.sqrt(mse)

      cum_MAE += mae
      cum_MSE += mse
      cum_RMSE += rmse
  print('Mean Absolute Error (MAE):', cum_MAE/K)
  print('Mean Square Error (MSE):', cum_MSE/K)
  print('Root Mean Square Error (RMSE):', cum_RMSE/K)
Original data
%%time
cv_catboost(prepare_data, y, categorical_var)
Fold  0
  N trees =  100

Fold  1
  N trees =  100

Fold  2
  N trees =  100

Fold  3
  N trees =  100

Fold  4
  N trees =  100
Mean Absolute Error (MAE): 1543.1100970401158
Mean Square Error (MSE): 73376637.45596504
Root Mean Square Error (RMSE): 8559.830035963872
CPU times: user 1min 8s, sys: 1.94 s, total: 1min 10s
Wall time: 54.2 s
Undersampling
%%time
cv_catboost(X_usamp, y_usamp, categorical_var)
Fold  0
  N trees =  100

Fold  1
  N trees =  100

Fold  2
  N trees =  100

Fold  3
  N trees =  100

Fold  4
  N trees =  100
Mean Absolute Error (MAE): 8405.613128733008
Mean Square Error (MSE): 901831720.617851
Root Mean Square Error (RMSE): 29983.3911328569
CPU times: user 12.7 s, sys: 344 ms, total: 13.1 s
Wall time: 6.96 s
Oversampling
%%time
cv_catboost(X_osamp, y_osamp, categorical_var)
Fold  0
  N trees =  100

Fold  1
  N trees =  100

Fold  2
  N trees =  100

Fold  3
  N trees =  100

Fold  4
  N trees =  100
Mean Absolute Error (MAE): 4925.311693019621
Mean Square Error (MSE): 147691934.14503184
Root Mean Square Error (RMSE): 12151.051371115089
CPU times: user 2min 18s, sys: 1.32 s, total: 2min 19s
Wall time: 1min 16s

XGBoost

from xgboost import XGBRegressor
from sklearn.preprocessing import OneHotEncoder
def ohe_xgboost(dataframe):
  # One-hot encode the two text columns so that XGBoost can consume them.
  # get_feature_names() matches the sklearn version in Colab; newer versions use get_feature_names_out().
  enc = OneHotEncoder()
  df_for_ohc = dataframe[['country', 'network_name']]
  enc.fit(df_for_ohc)
  x_cat = enc.transform(df_for_ohc)
  df_ohe = pd.DataFrame(x_cat.toarray(), columns = enc.get_feature_names())
  dataframe = dataframe.drop(['country', 'network_name'], axis=1)
  return pd.concat([dataframe, df_ohe], axis=1)

Prepare data for XGBoost

model_xgb = XGBRegressor(objective='reg:squarederror')  # explicit objective; the old default name reg:linear is deprecated
def cv_xgboost(dataframe, y):
  cum_MAE = 0
  cum_MSE = 0
  cum_RMSE = 0

  for i, (train_index, test_index) in enumerate(kf.split(dataframe)):

      # Create data for this fold
      y_train, y_valid = y.iloc[train_index], y.iloc[test_index]
      X_train, X_valid = dataframe.iloc[train_index,:], dataframe.iloc[test_index,:]
      print( "\nFold ", i)

      # Run model for this fold
      fit_model = model_xgb.fit(X_train, y_train)

      # Generate validation predictions for this fold
      pred = fit_model.predict(X_valid)
      mae = metrics.mean_absolute_error(y_valid, pred)
      mse = mean_squared_error(y_valid, pred)
      rmse = math.sqrt(mse)

      cum_MAE += mae
      cum_MSE += mse
      cum_RMSE += rmse
  print('Mean Absolute Error (MAE):', cum_MAE/K)
  print('Mean Square Error (MSE):', cum_MSE/K)
  print('Root Mean Square Error (RMSE):', cum_RMSE/K)
Original data
%%time
df_xgboost = ohe_xgboost(prepare_data)
CPU times: user 1.98 s, sys: 1.28 s, total: 3.26 s
Wall time: 3.27 s
%%time
cv_xgboost(df_xgboost, y)
Fold  0

Fold  1

Fold  2

Fold  3

Fold  4
Mean Absolute Error (MAE): 1289.408277519576
Mean Square Error (MSE): 64241993.82769384
Root Mean Square Error (RMSE): 8001.296507446097
CPU times: user 10min 27s, sys: 3.68 s, total: 10min 31s
Wall time: 10min 37s
Undersampling
%%time
df_xgboost_usamp = ohe_xgboost(X_usamp)
CPU times: user 55.7 ms, sys: 0 ns, total: 55.7 ms
Wall time: 58.6 ms
%%time
cv_xgboost(df_xgboost_usamp, y_usamp)
Fold  0

Fold  1

Fold  2

Fold  3

Fold  4
Mean Absolute Error (MAE): 16329.402967079233
Mean Square Error (MSE): 898315432.5187995
Root Mean Square Error (RMSE): 29904.80882465578
CPU times: user 36.3 s, sys: 157 ms, total: 36.4 s
Wall time: 36.4 s
Oversampling
%%time
df_xgboost_osamp = ohe_xgboost(X_osamp)
CPU times: user 3.78 s, sys: 20.9 ms, total: 3.8 s
Wall time: 3.81 s
%%time
cv_xgboost(df_xgboost_osamp, y_osamp)
Fold  0

Fold  1

Fold  2

Fold  3

Fold  4
Mean Absolute Error (MAE): 14163.804018335964
Mean Square Error (MSE): 488596143.2320727
Root Mean Square Error (RMSE): 22102.677885694204
CPU times: user 21min 32s, sys: 14.1 s, total: 21min 47s
Wall time: 21min 59s

Both algorithms produced significant errors, and the limits of the free Colab tier left no room to optimize them further. It is also worth noting that neither undersampling nor oversampling brought any tangible benefit; both made the metrics worse.

Normalize target

After that, I resorted to a method that is generally not recommended: normalizing the target variable.

normalized_y_usampl = (y_usamp - y_usamp.mean()) / y_usamp.std()
def different_model(dataframe, y, model):
  cum_MAE = 0
  cum_MSE = 0
  cum_RMSE = 0

  for i, (train_index, test_index) in enumerate(kf.split(dataframe)):

      # Create data for this fold
      y_train, y_valid = y.iloc[train_index], y.iloc[test_index]
      X_train, X_valid = dataframe.iloc[train_index,:], dataframe.iloc[test_index,:]
      print( "\nFold ", i)

      # Run model for this fold
      fit_model = model.fit(X_train, y_train)

      # Generate validation predictions for this fold
      pred = fit_model.predict(X_valid)
      mae = metrics.mean_absolute_error(y_valid, pred)
      mse = mean_squared_error(y_valid, pred)
      rmse = math.sqrt(mse)

      cum_MAE += mae
      cum_MSE += mse
      cum_RMSE += rmse
  print('Mean Absolute Error (MAE):', cum_MAE/K)
  print('Mean Square Error (MSE):', cum_MSE/K)
  print('Root Mean Square Error (RMSE):', cum_RMSE/K)
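Since the target was standardized, the errors reported below are in units of the target's standard deviation rather than in game currency. A small, hypothetical helper (not part of the original solution) for translating predictions and errors back to the original scale, assuming the same mean and standard deviation used for the normalization:

# Invert the z-score transform: value = z * std + mean
def to_currency(pred_normalized, y_reference=y_usamp):
    return pred_normalized * y_reference.std() + y_reference.mean()

# Because errors are differences, the shift by the mean cancels out:
# an MAE or RMSE on the normalized scale corresponds to error * y_reference.std() in game currency.
print('1 normalized unit =', y_usamp.std(), 'units of game currency')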

Linear regression

lr = LinearRegression()
different_model(df_xgboost_usamp, normalized_y_usampl, lr)
Fold  0

Fold  1

Fold  2

Fold  3

Fold  4
Mean Absolute Error (MAE): 0.674615230140304
Mean Square Error (MSE): 1.0000402612691461
Root Mean Square Error (RMSE): 0.9968567057322633

Ridge

from sklearn.linear_model import Ridge
ridge_model = Ridge()
different_model(df_xgboost_usamp, normalized_y_usampl, ridge_model)
Fold  0

Fold  1

Fold  2

Fold  3

Fold  4
Mean Absolute Error (MAE): 0.548602407961031
Mean Square Error (MSE): 0.8106651301818637
Root Mean Square Error (RMSE): 0.8972975463889405

Lasso

from sklearn.linear_model import Lasso
lasso_model = Lasso()
different_model(df_xgboost_usamp, normalized_y_usampl, lasso_model)
Fold  0

Fold  1

Fold  2

Fold  3

Fold  4
Mean Absolute Error (MAE): 0.662553113205324
Mean Square Error (MSE): 0.9997445124896084
Root Mean Square Error (RMSE): 0.9963907854485617

k-nearest neighbors

from sklearn.neighbors import KNeighborsRegressor
neigh = KNeighborsRegressor()
different_model(df_xgboost_usamp, normalized_y_usampl, neigh)
Fold  0

Fold  1

Fold  2

Fold  3

Fold  4
Mean Absolute Error (MAE): 0.7085960003479421
Mean Square Error (MSE): 1.204286903861976
Root Mean Square Error (RMSE): 1.0951818336923362

Support Vector Regression

from sklearn.svm import SVR
svr_model = SVR()
different_model(df_xgboost_usamp, normalized_y_usampl, svr_model)
Fold  0

Fold  1

Fold  2

Fold  3

Fold  4
Mean Absolute Error (MAE): 0.5137363720489654
Mean Square Error (MSE): 1.0111543429807524
Root Mean Square Error (RMSE): 1.0020891316710951

Decision Tree

from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
different_model(df_xgboost_usamp, normalized_y_usampl, dtr)
Fold  0

Fold  1

Fold  2

Fold  3

Fold  4
Mean Absolute Error (MAE): 0.6457364186513036
Mean Square Error (MSE): 1.8099131652811817
Root Mean Square Error (RMSE): 1.3394468455732116

Random Forest

from sklearn.ensemble import RandomForestRegressor
forest=RandomForestRegressor()
different_model(df_xgboost_usamp, normalized_y_usampl, forest)
Fold  0

Fold  1

Fold  2

Fold  3

Fold  4
Mean Absolute Error (MAE): 0.5483571262602982
Mean Square Error (MSE): 0.8930494185630643
Root Mean Square Error (RMSE): 0.9425253291821919

CatBoost

model_catboost = CatBoostRegressor(verbose=False)
def cv_catboost(dataframe, y, categorical_var = categorical_var):
  cum_MAE = 0
  cum_MSE = 0
  cum_RMSE = 0

  for i, (train_index, test_index) in enumerate(kf.split(dataframe)):

      # Create data for this fold
      y_train, y_valid = y.iloc[train_index], y.iloc[test_index]
      X_train, X_valid = dataframe.iloc[train_index,:], dataframe.iloc[test_index,:]
      print( "\nFold ", i)
      
      # Run model for this fold     
      fit_model = model_catboost.fit( X_train, y_train, cat_features = categorical_var)
      
      print( "  N trees = ", model_catboost.tree_count_ )
          
      # Generate validation predictions for this fold
      pred = fit_model.predict(X_valid)
      mae = metrics.mean_absolute_error(y_valid, pred)
      mse = mean_squared_error(y_valid, pred)
      rmse = math.sqrt(mse)

      cum_MAE += mae
      cum_MSE += mse
      cum_RMSE += rmse
  print('Mean Absolute Error (MAE):', cum_MAE/K)
  print('Mean Square Error (MSE):', cum_MSE/K)
  print('Root Mean Square Error (RMSE):', cum_RMSE/K)
Undersampling
cv_catboost(X_usamp, normalized_y_usampl, categorical_var)
Fold  0
  N trees =  1000

Fold  1
  N trees =  1000

Fold  2
  N trees =  1000

Fold  3
  N trees =  1000

Fold  4
  N trees =  1000
Mean Absolute Error (MAE): 0.2576291394335407
Mean Square Error (MSE): 0.8534013262771272
Root Mean Square Error (RMSE): 0.9211363112168826

Feature Importance

feats = {}
for feature, importance in zip(X_usamp.columns[:30], model_catboost.feature_importances_[:30]):
    feats[feature] = importance
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-Importance'})
importances = importances.sort_values(by='Gini-Importance', ascending=False)
importances = importances.reset_index()
importances = importances.rename(columns={'index': 'Features'})
sns.set(style="whitegrid", color_codes=True, font_scale = 1.7)
fig, ax = plt.subplots()
fig.set_size_inches(30, 15)
sns.barplot(x=importances['Gini-Importance'], y=importances['Features'], data=importances, color='skyblue')
plt.xlabel('Importance', fontsize=25, weight = 'bold')
plt.ylabel('Features', fontsize=25, weight = 'bold')
plt.title('Feature Importance', fontsize=25, weight = 'bold')
plt.show()
display(importances)
Features Gini-Importance
0 cmlt_daily_game_currency 13.747446
1 cmlt_hard_spent 8.419419
2 cmlt_hard_buy 8.012952
3 cmlt_hard_max_spent 6.346073
4 cmlt_hard_med_spent 4.148895
5 cmlt_avg_duration 3.889323
6 cmlt_count_sessions 3.558750
7 cmlt_final_level 3.148223
8 cmlt_max_quantum_duration 3.059444
9 cmlt_seconds_div_active_time 2.957643
10 cmlt_time_with_us_4grind 2.761090
11 cmlt_seconds_with_us 2.519549
12 date_install 2.322614
13 cmlt_max_quant 2.294351
14 cmlt_time_4grind 2.229900
15 cmlt_winrate 2.134882
16 cmlt_max_grind 2.023839
17 cmlt_sum_sessions_duration 1.949491
18 cmlt_spent_hc_per_grind 1.919232
19 cmlt_avg_grind_duration 1.794314
20 cmlt_max_sessions_duration 1.627571
21 cmlt_avg_time_for_level 1.580913
22 cmlt_spent_hc_div_active_time 1.577912
23 cmlt_sum_quantum_duration 1.452757
24 cmlt_hard_earn 1.444816
25 cmlt_count_quant 1.282531
26 cmlt_count_won_grind 1.236169
27 cmlt_count_grind 0.994787
28 cmlt_hard_gift 0.257577
29 country_top_tier 0.251787
CatBoost Optimization
from sklearn.model_selection import GridSearchCV
parameters = {'depth'         : [10, 50, 100],
              'learning_rate' : [0.1, 0.5, 0.8],
              'iterations'    : [100, 500, 1000]
             }
# Note: CatBoost only supports tree depths up to 16, so the candidates with depth 50 or 100 cannot be fitted.
Grid_CBC = GridSearchCV(estimator=model_catboost, param_grid = parameters, cv = 2, verbose=10, n_jobs=-1)
Grid_CBC.fit(X_usamp, normalized_y_usampl, cat_features = categorical_var)
Fitting 2 folds for each of 27 candidates, totalling 54 fits
GridSearchCV(cv=2,
             estimator=<catboost.core.CatBoostRegressor object at 0x7f5ad3c5a250>,
             n_jobs=-1,
             param_grid={'depth': [10, 50, 100], 'iterations': [100, 500, 1000],
                         'learning_rate': [0.1, 0.5, 0.8]},
             verbose=10)
print(" Results from Grid Search " )
print("\n The best estimator across ALL searched params:\n", Grid_CBC.best_estimator_)
print("\n The best score across ALL searched params:\n", Grid_CBC.best_score_)
print("\n The best parameters across ALL searched params:\n", Grid_CBC.best_params_)
 Results from Grid Search 

 The best estimator across ALL searched params:
 <catboost.core.CatBoostRegressor object at 0x7f5ad86e3b50>

 The best score across ALL searched params:
 nan

 The best parameters across ALL searched params:
 {'depth': 10, 'iterations': 100, 'learning_rate': 0.1}
model_catboost_opt = CatBoostRegressor(depth=10, iterations = 100, learning_rate = 0.1, verbose=False)
cum_MAE = 0
cum_MSE = 0
cum_RMSE = 0

for i, (train_index, test_index) in enumerate(kf.split(X_usamp)):
    
    # Create data for this fold
    y_train, y_valid = normalized_y_usampl.iloc[train_index], normalized_y_usampl.iloc[test_index]
    X_train, X_valid = X_usamp.iloc[train_index,:], X_usamp.iloc[test_index,:]
    print( "\nFold ", i)
    
    # Run model for this fold     
    fit_model = model_catboost.fit(X_train, y_train, cat_features = categorical_var)
    
    print( "  N trees = ", model_catboost.tree_count_ )
        
    # Generate validation predictions for this fold
    pred = fit_model.predict(X_valid)
    mae = metrics.mean_absolute_error(y_valid, pred)
    mse = mean_squared_error(y_valid, pred)
    rmse = math.sqrt(mse)

    cum_MAE += mae
    cum_MSE += mse
    cum_RMSE += rmse
print('Mean Absolute Error (MAE):', cum_MAE/K)
print('Mean Square Error (MSE):', cum_MSE/K)
print('Root Mean Square Error (RMSE):', cum_RMSE/K)
Fold  0
  N trees =  1000

Fold  1
  N trees =  1000

Fold  2
  N trees =  1000

Fold  3
  N trees =  1000

Fold  4
  N trees =  1000
Mean Absolute Error (MAE): 0.2576291394335407
Mean Square Error (MSE): 0.8534013262771272
Root Mean Square Error (RMSE): 0.9211363112168826
Full data
normalized_y = (y - y.mean()) / y.std()
cum_MAE = 0
cum_MSE = 0
cum_RMSE = 0

for i, (train_index, test_index) in enumerate(kf.split(prepare_data)):
    
    # Create data for this fold
    y_train, y_valid = normalized_y.iloc[train_index], normalized_y.iloc[test_index]
    X_train, X_valid = prepare_data.iloc[train_index,:], prepare_data.iloc[test_index,:]
    print( "\nFold ", i)
    
    # Run model for this fold     
    fit_model = model_catboost.fit(X_train, y_train, cat_features = categorical_var)
    
    print( "  N trees = ", model_catboost.tree_count_ )
        
    # Generate validation predictions for this fold
    pred = fit_model.predict(X_valid)
    mae = metrics.mean_absolute_error(y_valid, pred)
    mse = mean_squared_error(y_valid, pred)
    rmse = math.sqrt(mse)

    cum_MAE += mae
    cum_MSE += mse
    cum_RMSE += rmse
print('Mean Absolute Error (MAE):', cum_MAE/K)
print('Mean Square Error (MSE):', cum_MSE/K)
print('Root Mean Square Error (RMSE):', cum_RMSE/K)
Fold  0
  N trees =  1000

Fold  1
  N trees =  1000

Fold  2
  N trees =  1000

Fold  3
  N trees =  1000

Fold  4
  N trees =  1000
Mean Absolute Error (MAE): 0.07252360673312311
Mean Square Error (MSE): 1.0925019398307274
Root Mean Square Error (RMSE): 0.8952761280877471

Conclusions:

  1. I worked in the free versions of Colab and Paperspace.
  2. The target variable has a very large spread, so every algorithm tried gives a significant error. After normalizing the target the error decreased, but normalizing the target variable is not a recommended practice. The spread may be caused by different currencies being used, or by accounting over different periods of time; in the first case the target could be normalized by an average exchange rate (the rates themselves are unknown), in the second by the time spent in the game.
  3. For the initial analysis, the following algorithms were compared with cross-validation: Linear Regression, Ridge, Lasso, k-nearest neighbors, Support Vector Regression, Decision Tree, and Random Forest.
  4. The data is not balanced with respect to the target variable, but applying resampling only increased the error.
  5. NaN values were present only in the object columns.
  6. For optimization I considered three algorithms: Random Forest, XGBoost, and CatBoost. The free environment did not have enough resources for the first two, so I settled on CatBoost.
  7. An attempt to normalize the target by the most strongly correlated features, or by the time since installation, increased the error.
  8. Of all the algorithms, Colab only had enough resources to optimize CatBoost.
  9. Possible next steps: creating new features from the existing ones (a small sketch is given below), trying other sampling strategies, and repeating the analysis with the previously dropped columns.
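As an illustration of the first improvement path, here is a minimal, hypothetical sketch of deriving new ratio features from columns that already exist in the dataset; the specific combinations are my own assumptions, not something evaluated above.

# Hypothetical derived features built from existing columns (illustrative only)
eps = 1e-9  # guard against division by zero

fe = prepare_data.copy()
fe['hard_spent_per_session'] = df_work['cmlt_hard_spent'] / (df_work['cmlt_count_sessions'] + eps)
fe['seconds_per_session'] = df_work['cmlt_sum_sessions_duration'] / (df_work['cmlt_count_sessions'] + eps)
fe['grind_win_share'] = df_work['cmlt_count_won_grind'] / (df_work['cmlt_count_grind'] + eps)

# The enriched frame could then go through the same cross-validation helper as before:
# cv_catboost(fe, y, categorical_var)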