- DataFrame
- Data Preparation
- Undersampling and Oversampling
- Machine Learning
- Normalize target
- Conclusions:
I recently completed a test task for one organization. A dataset was provided, and the main task was to predict profitability from it. The values themselves were widely scattered and imbalanced, and no information was given about what each field means or how the fields relate to one another. I immediately decided to run the analysis with popular boosting methods.
The first step was to import the necessary libraries.
import time
from datetime import datetime
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from collections import Counter
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Now you can look at the data itself.
df = pd.read_csv('/content/drive/MyDrive/DATA/dataset.csv')
df.head()
df_work = df.copy()
df_work = df_work.drop(['Unnamed: 0'], axis = 1)
futures_number = df_work.select_dtypes(include=['int64', 'float64']).columns
df_work.info()
df_work.shape
df.astype(bool).sum(axis=0)
df.nunique()
unique_values = df_work.select_dtypes(include="number").nunique().sort_values()
unique_values.plot.bar(logy=True, figsize=(15, 4), title="Unique values per feature");
df.isna().sum()
plt.figure(figsize=(10, 8))
plt.imshow(df_work.isna(), aspect="auto", interpolation="nearest", cmap="gray")
plt.xlabel("Column Number")
plt.ylabel("Sample Number");
import missingno as msno
msno.matrix(df_work, labels=True, sort="descending");
df_work.isna().mean().sort_values().plot(
kind="bar", figsize=(15, 4),
title="Percentage of missing values per feature",
ylabel="Ratio of missing values per feature");
df_work.plot(lw=0, marker=".", subplots=True, layout=(-1, 4),
figsize=(15, 30), markersize=1);
df_work.hist(bins=25, figsize=(15, 25), layout=(-1, 5), edgecolor="black")
plt.tight_layout();
I deleted the junk 'Unnamed: 0' column and stored the names of the numeric columns in a separate variable. Many columns contain a large number of zero values, and the text columns contain many NaN values. The numeric values show a very wide spread. The dataframe size is 25,000 rows by 39 columns. The analysis itself was carried out in Colab.
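To quantify that spread, one quick, illustrative check (using the futures_number list defined earlier; this snippet is mine, not part of the original notebook) is to look at the summary statistics and the skewness of the numeric columns:
# Illustrative: summary statistics and the most skewed numeric columns
display(df_work[futures_number].describe().T)
df_work[futures_number].skew().sort_values(ascending=False).head(10)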
df_work = df_work.drop(['birthday', 'sex', 'time_confirm_email'], axis=1)
prepare_data = pd.DataFrame()
futures_object = df_work.select_dtypes(include=['object']).columns
X_num = df_work[futures_number]
mask = np.zeros_like(X_num.corr(), dtype=bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(16, 12))
plt.title('Pearson Correlation Matrix',fontsize=25)
sns.heatmap(X_num.corr(),linewidths=0.25,vmax=0.7,square=True,cmap="BuGn", #"BuGn_r" to reverse
linecolor='w',annot=True,annot_kws={"size":8},mask=mask,cbar_kws={"shrink": .9});
X_num = X_num.drop(['target_game_currency'], axis = 1)
y = df_work['target_game_currency']
scaler = MinMaxScaler()
scaled = scaler.fit_transform(X_num)
X_num_scaled = pd.DataFrame(scaled, columns = X_num.columns)
prepare_data = pd.concat([prepare_data, X_num_scaled])
del(X_num_scaled)
prepare_data_date = pd.DataFrame()
prepare_data_date['date_install'] = pd.to_datetime(df_work['date_install']).astype('int64')  # nanoseconds since the UNIX epoch
prepare_data_date['data_first_command_time'] = pd.to_datetime(df_work['first_command_time']).astype('int64')
prepare_data = pd.concat([prepare_data, prepare_data_date], axis = 1)
del(prepare_data_date)
# Fill NaNs with the most frequent value and encode the flags as 0/1
most_frequent = df_work['is_cheater'].value_counts().index[0]
df_work['is_cheater'] = df_work['is_cheater'].fillna(most_frequent)
prepare_data['is_cheater'] = (df_work['is_cheater'] != most_frequent).astype(int)
most_frequent = df_work['has_email'].value_counts().index[0]
df_work['has_email'] = df_work['has_email'].fillna(most_frequent)
prepare_data['has_email'] = (df_work['has_email'] != most_frequent).astype(int)
df_work_str = df_work[['country', 'network_name']].fillna('other')
prepare_data = pd.concat([prepare_data, df_work_str], axis = 1)
del(df_work_str)
I deleted the columns with a large number of NaN values, created a separate dataframe for the prepared data, and stored the names of the text columns in their own variable. The Pearson correlation matrix showed that some features are strongly related to each other. I scaled the numeric data with MinMaxScaler(), converted the dates to UNIX timestamps, and encoded the Boolean flags as 0 and 1.
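To make the observation about correlated features concrete, here is a small illustrative snippet (not part of the original pipeline) that lists the most strongly correlated numeric pairs from the same Pearson matrix:
# Illustrative: top absolute pairwise correlations among the numeric features
corr_abs = X_num.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(10))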
not_zero_target = (df_work['target_game_currency'] > 0).astype(int)
df_for_res = prepare_data.copy()
df_for_res['target_game_currency'] = df_work['target_game_currency']
print('Original dataset shape %s' % Counter(not_zero_target))
rus = RandomUnderSampler(random_state=56)
X_res, y_res = rus.fit_resample(df_for_res, not_zero_target)
print('Resampled dataset shape %s' % Counter(y_res))
y_usamp = X_res['target_game_currency']
X_usamp = X_res.drop(['target_game_currency'], axis = 1)
not_zero_target = (df_work['target_game_currency'] > 0).astype(int)
df_for_res = prepare_data.copy()
df_for_res['target_game_currency'] = df_work['target_game_currency']
print('Original dataset shape %s' % Counter(not_zero_target))
ros = RandomOverSampler(random_state=56)
X_res, y_res = ros.fit_resample(df_for_res, not_zero_target)
print('Resampled dataset shape %s' % Counter(y_res))
y_osamp = X_res['target_game_currency']
X_osamp = X_res.drop(['target_game_currency'], axis = 1)
Because of the large spread of the target variable, and because most of its values are zero, I decided to apply and compare two approaches: undersampling and oversampling.
The experiments were carried out with CatBoost and XGBoost; the latter additionally requires the categorical (text) columns to be encoded.
!pip install catboost
import math
from catboost import CatBoostRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
K = 5
kf = KFold(n_splits = K, random_state = 1, shuffle = True)
categorical_var = np.where(prepare_data.dtypes == object)[0]
model_catboost = CatBoostRegressor(verbose=0, n_estimators=100)
def cv_catboost(dataframe, y, categorical_var=categorical_var):
    cum_MAE = 0
    cum_MSE = 0
    cum_RMSE = 0
    for i, (train_index, test_index) in enumerate(kf.split(dataframe)):
        # Create data for this fold
        y_train, y_valid = y.iloc[train_index], y.iloc[test_index]
        X_train, X_valid = dataframe.iloc[train_index, :], dataframe.iloc[test_index, :]
        print("\nFold ", i)
        # Run model for this fold
        fit_model = model_catboost.fit(X_train, y_train, cat_features=categorical_var)
        print(" N trees = ", model_catboost.tree_count_)
        # Generate validation predictions for this fold
        pred = fit_model.predict(X_valid)
        mae = metrics.mean_absolute_error(y_valid, pred)
        mse = mean_squared_error(y_valid, pred)
        rmse = math.sqrt(mse)
        cum_MAE += mae
        cum_MSE += mse
        cum_RMSE += rmse
    print('Mean Absolute Error (MAE):', cum_MAE / K)
    print('Mean Square Error (MSE):', cum_MSE / K)
    print('Root Mean Square Error (RMSE):', cum_RMSE / K)
%%time
cv_catboost(prepare_data, y, categorical_var)
%%time
cv_catboost(X_usamp, y_usamp, categorical_var)
%%time
cv_catboost(X_osamp, y_osamp, categorical_var)
from xgboost import XGBRegressor
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
def ohe_xgboost(dataframe):
    # One-hot encode the two categorical columns so XGBoost can use them
    enc = OneHotEncoder()
    df_for_ohc = dataframe[['country', 'network_name']]
    enc.fit(df_for_ohc)
    x_cat = enc.transform(df_for_ohc)
    # on newer scikit-learn, use enc.get_feature_names_out() instead
    df_ohe = pd.DataFrame(x_cat.toarray(), columns=enc.get_feature_names())
    dataframe = dataframe.drop(['country', 'network_name'], axis=1)
    return pd.concat([dataframe, df_ohe], axis=1)
Prepare data for XGBoost
model_xgb = XGBRegressor()
def cv_xgboost(dataframe, y):
    cum_MAE = 0
    cum_MSE = 0
    cum_RMSE = 0
    for i, (train_index, test_index) in enumerate(kf.split(dataframe)):
        # Create data for this fold
        y_train, y_valid = y.iloc[train_index], y.iloc[test_index]
        X_train, X_valid = dataframe.iloc[train_index, :], dataframe.iloc[test_index, :]
        print("\nFold ", i)
        # Run model for this fold
        fit_model = model_xgb.fit(X_train, y_train)
        # Generate validation predictions for this fold
        pred = fit_model.predict(X_valid)
        mae = metrics.mean_absolute_error(y_valid, pred)
        mse = mean_squared_error(y_valid, pred)
        rmse = math.sqrt(mse)
        cum_MAE += mae
        cum_MSE += mse
        cum_RMSE += rmse
    print('Mean Absolute Error (MAE):', cum_MAE / K)
    print('Mean Square Error (MSE):', cum_MSE / K)
    print('Root Mean Square Error (RMSE):', cum_RMSE / K)
%%time
df_xgboost = ohe_xgboost(prepare_data)
%%time
cv_xgboost(df_xgboost, y)
%%time
df_xgboost_usamp = ohe_xgboost(X_usamp)
%%time
cv_xgboost(df_xgboost_usamp, y_usamp)
%%time
df_xgboost_osamp = ohe_xgboost(X_osamp)
%%time
cv_xgboost(df_xgboost_osamp, y_osamp)
Both algorithms produced large errors, and the limitations of the free version of Colab did not allow them to be tuned much further. It is also worth noting that neither undersampling nor oversampling brought tangible benefits; both made the metrics worse.
normalized_y_usampl = (y_usamp - y_usamp.mean()) / y_usamp.std()
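Because the target is now standardized, the MAE and RMSE reported below are in units of standard deviations rather than in game currency, so they are not directly comparable with the earlier runs. A minimal sketch (the helper name is mine, not from the original notebook) of how standardized predictions could be converted back to the original scale:
# Illustrative: invert the z-normalization so errors can be reported in the original units
def denormalize(pred_norm, y_ref=y_usamp):
    # inverse of (y - y.mean()) / y.std()
    return pred_norm * y_ref.std() + y_ref.mean()
Applying this to the fold predictions and to the corresponding slice of y_usamp would give an MAE comparable with the non-normalized experiments.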
def different_model(dataframe, y, model):
    cum_MAE = 0
    cum_MSE = 0
    cum_RMSE = 0
    for i, (train_index, test_index) in enumerate(kf.split(dataframe)):
        # Create data for this fold
        y_train, y_valid = y.iloc[train_index], y.iloc[test_index]
        X_train, X_valid = dataframe.iloc[train_index, :], dataframe.iloc[test_index, :]
        print("\nFold ", i)
        # Run model for this fold
        fit_model = model.fit(X_train, y_train)
        # Generate validation predictions for this fold
        pred = fit_model.predict(X_valid)
        mae = metrics.mean_absolute_error(y_valid, pred)
        mse = mean_squared_error(y_valid, pred)
        rmse = math.sqrt(mse)
        cum_MAE += mae
        cum_MSE += mse
        cum_RMSE += rmse
    print('Mean Absolute Error (MAE):', cum_MAE / K)
    print('Mean Square Error (MSE):', cum_MSE / K)
    print('Root Mean Square Error (RMSE):', cum_RMSE / K)
lr = LinearRegression()
different_model(df_xgboost_usamp, normalized_y_usampl, lr)
from sklearn.linear_model import Ridge
ridge_model = Ridge()
different_model(df_xgboost_usamp, normalized_y_usampl, ridge_model)
from sklearn.linear_model import Lasso
lasso_model = Lasso()
different_model(df_xgboost_usamp, normalized_y_usampl, lasso_model)
from sklearn.neighbors import KNeighborsRegressor
neigh = KNeighborsRegressor()
different_model(df_xgboost_usamp, normalized_y_usampl, neigh)
from sklearn.svm import SVR
svr_model = SVR()
different_model(df_xgboost_usamp, normalized_y_usampl, svr_model)
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
different_model(df_xgboost_usamp, normalized_y_usampl, dtr)
from sklearn.ensemble import RandomForestRegressor
forest=RandomForestRegressor()
different_model(df_xgboost_usamp, normalized_y_usampl, forest)
model_catboost = CatBoostRegressor(verbose=False)
def cv_catboost(dataframe, y, categorical_var=categorical_var):
    cum_MAE = 0
    cum_MSE = 0
    cum_RMSE = 0
    for i, (train_index, test_index) in enumerate(kf.split(dataframe)):
        # Create data for this fold
        y_train, y_valid = y.iloc[train_index], y.iloc[test_index]
        X_train, X_valid = dataframe.iloc[train_index, :], dataframe.iloc[test_index, :]
        print("\nFold ", i)
        # Run model for this fold
        fit_model = model_catboost.fit(X_train, y_train, cat_features=categorical_var)
        print(" N trees = ", model_catboost.tree_count_)
        # Generate validation predictions for this fold
        pred = fit_model.predict(X_valid)
        mae = metrics.mean_absolute_error(y_valid, pred)
        mse = mean_squared_error(y_valid, pred)
        rmse = math.sqrt(mse)
        cum_MAE += mae
        cum_MSE += mse
        cum_RMSE += rmse
    print('Mean Absolute Error (MAE):', cum_MAE / K)
    print('Mean Square Error (MSE):', cum_MSE / K)
    print('Root Mean Square Error (RMSE):', cum_RMSE / K)
cv_catboost(X_usamp, normalized_y_usampl, categorical_var)
Feature Importance
feats = {}
for feature, importance in zip(X_usamp.columns[:30], model_catboost.feature_importances_[:30]):
    feats[feature] = importance
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-Importance'})
importances = importances.sort_values(by='Gini-Importance', ascending=False)
importances = importances.reset_index()
importances = importances.rename(columns={'index': 'Features'})
sns.set(style="whitegrid", color_codes=True, font_scale=1.7)
fig, ax = plt.subplots()
fig.set_size_inches(30,15)
sns.barplot(x=importances['Gini-Importance'], y=importances['Features'], data=importances, color='skyblue')
plt.xlabel('Importance', fontsize=25, weight = 'bold')
plt.ylabel('Features', fontsize=25, weight = 'bold')
plt.title('Feature Importance', fontsize=25, weight = 'bold')
plt.show()
display(importances)
from sklearn.model_selection import GridSearchCV
parameters = {'depth': [4, 6, 8, 10, 16],  # CatBoost supports tree depths up to 16
              'learning_rate': [0.01, 0.02, 0.03, 0.04],
              'iterations': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
              }
Grid_CBC = GridSearchCV(estimator=model_catboost, param_grid = parameters, cv = 2, verbose=10, n_jobs=-1)
Grid_CBC.fit(X_usamp, normalized_y_usampl, cat_features = categorical_var)
print(" Results from Grid Search " )
print("\n The best estimator across ALL searched params:\n", Grid_CBC.best_estimator_)
print("\n The best score across ALL searched params:\n", Grid_CBC.best_score_)
print("\n The best parameters across ALL searched params:\n", Grid_CBC.best_params_)
model_catboost_opt = CatBoostRegressor(depth=10, iterations=100, learning_rate=0.1, verbose=False)
cum_MAE = 0
cum_MSE = 0
cum_RMSE = 0
for i, (train_index, test_index) in enumerate(kf.split(X_usamp)):
    # Create data for this fold
    y_train, y_valid = normalized_y_usampl.iloc[train_index], normalized_y_usampl.iloc[test_index]
    X_train, X_valid = X_usamp.iloc[train_index, :], X_usamp.iloc[test_index, :]
    print("\nFold ", i)
    # Run the tuned model for this fold
    fit_model = model_catboost_opt.fit(X_train, y_train, cat_features=categorical_var)
    print(" N trees = ", model_catboost_opt.tree_count_)
    # Generate validation predictions for this fold
    pred = fit_model.predict(X_valid)
    mae = metrics.mean_absolute_error(y_valid, pred)
    mse = mean_squared_error(y_valid, pred)
    rmse = math.sqrt(mse)
    cum_MAE += mae
    cum_MSE += mse
    cum_RMSE += rmse
print('Mean Absolute Error (MAE):', cum_MAE / K)
print('Mean Square Error (MSE):', cum_MSE / K)
print('Root Mean Square Error (RMSE):', cum_RMSE / K)
normalized_y = (y - y.mean()) / y.std()
cum_MAE = 0
cum_MSE = 0
cum_RMSE = 0
for i, (train_index, test_index) in enumerate(kf.split(prepare_data)):
    # Create data for this fold
    y_train, y_valid = normalized_y.iloc[train_index], normalized_y.iloc[test_index]
    X_train, X_valid = prepare_data.iloc[train_index, :], prepare_data.iloc[test_index, :]
    print("\nFold ", i)
    # Run model for this fold
    fit_model = model_catboost.fit(X_train, y_train, cat_features=categorical_var)
    print(" N trees = ", model_catboost.tree_count_)
    # Generate validation predictions for this fold
    pred = fit_model.predict(X_valid)
    mae = metrics.mean_absolute_error(y_valid, pred)
    mse = mean_squared_error(y_valid, pred)
    rmse = math.sqrt(mse)
    cum_MAE += mae
    cum_MSE += mse
    cum_RMSE += rmse
print('Mean Absolute Error (MAE):', cum_MAE / K)
print('Mean Square Error (MSE):', cum_MSE / K)
print('Root Mean Square Error (RMSE):', cum_RMSE / K)
Conclusions:
- Worked in free versions of Colab and Paperspace.
- The target variable has a very large spread, so all of the algorithms tried give a significant error. After normalizing the target the error decreased, but normalizing the target variable is generally not recommended. The spread may be caused by different currencies being used, or by the values being accumulated over different periods of time. In the first case one could normalize by the (unknown) exchange rates, for example through the mean; in the second, by the time spent in the network.
- For the initial analysis, the following algorithms were compared using cross-validation: Linear Regression, Decision Tree, Lasso, Ridge, k-nearest neighbors, and LightGBM.
- The data is imbalanced with respect to the target variable, but resampling increased the error.
- NaN values were present only in the object (text) columns.
- I considered three algorithms for tuning: Random Forest, XGBoost and CatBoost. There was not enough compute for the first two, so I settled on CatBoost alone.
- Attempts to normalize the target by the most correlated features, or by the time since installation, led to an increase in the error.
- Of all the algorithms, Colab had enough power to tune only CatBoost.
- Further ways to improve: creating new variables from the existing ones (see the sketch below), trying other sampling strategies, and analysing the previously deleted columns.
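As an illustration of the first direction (new variables from existing ones), here is a minimal, hypothetical sketch of a derived feature built from columns that are already in the dataset:
# Illustrative feature engineering: hours between installation and the first command
delta = pd.to_datetime(df_work['first_command_time']) - pd.to_datetime(df_work['date_install'])
prepare_data['hours_to_first_command'] = delta.dt.total_seconds() / 3600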