Problem statement

Task

Using the available data on the bank's customers, build a model on the training dataset that predicts non-fulfillment of debt obligations on the current loan, and produce a forecast for the examples in the test dataset.

Name of data files

course_project_train.csv - training dataset
course_project_test.csv - test dataset

Target variable

Credit Default - the fact of non-fulfillment of credit obligations

Quality metric

F1-score (sklearn.metrics.f1_score)

Solution requirements

Target metric

  • F1 > 0.5
  • The metric is evaluated by the quality of the forecast for the main class (1 - loan delinquency)
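
For reference, a minimal sketch of how the target metric can be computed with sklearn (the labels below are purely illustrative; in the project they come from the hold-out evaluation):

from sklearn.metrics import f1_score

y_true = [0, 1, 1, 0, 1]   # illustrative ground-truth labels
y_pred = [0, 1, 0, 0, 1]   # illustrative predictions

# F1 for the positive class (1 - loan delinquency)
print(f1_score(y_true, y_pred, pos_label=1))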

The solution must contain

  1. Jupyter Notebook with the code of your solution, named following the pattern {FULL name}_solution.ipynb, e.g. SShirkin_solution.ipynb
  2. CSV file with the forecasts of the target variable for the test dataset, named following the pattern {FULL name}_predictions.csv, e.g. SShirkin_predictions.csv

Recommendations for the code file (ipynb)

  1. The file must contain headings and comments (markdown)
  2. Wrap repetitive operations in functions
  3. Do not print large portions of tables (5-10 rows is enough)
  4. If possible, add plots describing the data (about 3-5)
  5. Include only the best model; do not include every solution variant you tried
  6. The project script should run from beginning to end (from loading the data to exporting the predictions)
  7. The whole project should be in one script (ipynb file)
  8. You may use the Python libraries and machine learning models covered in this course

Deadlines for delivery

Submit the project within 5 days after the end of the last webinar. Work submitted before the deadline will be ranked by the quality metric. Projects submitted after the deadline, or resubmitted, do not enter the ranking, but you will still be able to find out your result.

An approximate description of the stages of the course project

Building a classification model

  1. Overview of the training dataset
  2. Handling outliers
  3. Handling missing values
  4. Data analysis
  5. Selection of features
  6. Balancing classes
  7. Selection of models, obtaining a baseline
  8. Choosing the best model, setting hyperparameters
  9. Quality control, preventing overfitting
  10. Interpretation of results

Forecasting on a test dataset

  1. Apply the same processing and feature-building steps to the test dataset
  2. Predict the target variable with the model trained on the training dataset
  3. Forecasts must be made for all examples from the test dataset (for all rows)
  4. Preserve the original order of the examples from the test dataset
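
As a hedged illustration of these requirements (prepare_features and model are hypothetical placeholders for your own preprocessing and fitted estimator, and pandas/numpy are assumed to be imported as pd/np):

test_df = pd.read_csv('course_project_test.csv')        # keep the original row order
X_test_prepared = prepare_features(test_df)             # same steps as for the training data
predictions = model.predict(X_test_prepared)            # one forecast per row
result = pd.DataFrame({'Id': np.arange(len(test_df)), 'Credit Default': predictions})
result.to_csv('SShirkin_predictions.csv', index=False)  # example file name from the requirements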

Data overview

Description of the dataset

  • Home Ownership - Home ownership
  • Annual Income - annual income
  • Years in current job - the number of years at the current job
  • Tax Liens - tax encumbrances
  • Number of Open Accounts - number of open accounts
  • Years of Credit History - number of years of credit history
  • Maximum Open Credit - the largest open credit
  • Number of Credit Problems - number of credit problems
  • Months since last delinquent - the number of months since the last payment delay
  • Bankruptcies - bankruptcies
  • Purpose - purpose of the loan
  • Term - loan term
  • Current Loan Amount - current loan amount
  • Current Credit Balance - current credit balance
  • Monthly Debt - monthly debt
  • Credit Default - the fact of non-fulfillment of credit obligations (0 - repaid on time, 1 - overdue)

Importing libraries

import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score


from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV, KFold
import xgboost as xgb, lightgbm as lgbm, catboost as catb
from scipy.stats import uniform as sp_randFloat
from scipy.stats import randint as sp_randInt 
import warnings
warnings.filterwarnings('ignore')
sns.set(style='whitegrid')
sns.set_context("paper", font_scale=1.5)  
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_rows', 50)
def get_classification_report(y_train_true, y_train_pred, y_test_true, y_test_pred):
    print('TRAIN\n\n' + classification_report(y_train_true, y_train_pred))
    print('TEST\n\n' + classification_report(y_test_true, y_test_pred))
    print('CONFUSION MATRIX\n')
    print(pd.crosstab(y_test_true, y_test_pred))
def balance_df_by_target(df, target_name):

    target_counts = df[target_name].value_counts()

    # idxmax/idxmin return the class labels themselves (argmax/argmin would return positions)
    major_class_name = target_counts.idxmax()
    minor_class_name = target_counts.idxmin()

    disbalance_coeff = int(target_counts[major_class_name] / target_counts[minor_class_name]) - 1

    # oversample the minor class; pd.concat replaces the removed DataFrame.append
    for i in range(disbalance_coeff):
        sample = df[df[target_name] == minor_class_name].sample(target_counts[minor_class_name])
        df = pd.concat([df, sample], ignore_index=True)

    return df.sample(frac=1)
def show_learning_curve_plot(estimator, X, y, cv=3, n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):

    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, 
                                                            cv=cv, 
                                                            scoring='f1',
                                                            train_sizes=train_sizes, 
                                                            n_jobs=n_jobs)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    plt.figure(figsize=(15,8))
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.title(f"Learning curves ({type(estimator).__name__})")
    plt.xlabel("Training examples")
    plt.ylabel("Score")     
    plt.legend(loc="best")
    plt.grid()
    plt.show()
def show_proba_calibration_plots(y_predicted_probs, y_true_labels):
    preds_with_true_labels = np.array(list(zip(y_predicted_probs, y_true_labels)))

    thresholds = []
    precisions = []
    recalls = []
    f1_scores = []

    for threshold in np.linspace(0.1, 0.9, 9):
        thresholds.append(threshold)
        precisions.append(precision_score(y_true_labels, list(map(int, y_predicted_probs > threshold))))
        recalls.append(recall_score(y_true_labels, list(map(int, y_predicted_probs > threshold))))
        f1_scores.append(f1_score(y_true_labels, list(map(int, y_predicted_probs > threshold))))

    scores_table = pd.DataFrame({'f1':f1_scores,
                                 'precision':precisions,
                                 'recall':recalls,
                                 'probability':thresholds}).sort_values('f1', ascending=False).round(3)
  
    figure = plt.figure(figsize = (15, 5))

    plt1 = figure.add_subplot(121)
    plt1.plot(thresholds, precisions, label='Precision', linewidth=4)
    plt1.plot(thresholds, recalls, label='Recall', linewidth=4)
    plt1.plot(thresholds, f1_scores, label='F1', linewidth=4)
    plt1.set_ylabel('Scores')
    plt1.set_xlabel('Probability threshold')
    plt1.set_title('Probabilities threshold calibration')
    plt1.legend(bbox_to_anchor=(0.25, 0.25))   
    plt1.table(cellText = scores_table.values,
               colLabels = scores_table.columns, 
               colLoc = 'center', cellLoc = 'center', loc = 'bottom', bbox = [0, -1.3, 1, 1])

    plt2 = figure.add_subplot(122)
    plt2.hist(preds_with_true_labels[preds_with_true_labels[:, 1] == 0][:, 0], 
              label='Another class', color='royalblue', alpha=1)
    plt2.hist(preds_with_true_labels[preds_with_true_labels[:, 1] == 1][:, 0], 
              label='Main class', color='darkcyan', alpha=0.8)
    plt2.set_ylabel('Number of examples')
    plt2.set_xlabel('Probabilities')
    plt2.set_title('Probability histogram')
    plt2.legend(bbox_to_anchor=(1, 1))

    plt.show()
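This helper is not called later in the notebook; a possible usage, assuming a fitted classifier with predict_proba (for example the final CatBoost model trained below), could be:

show_proba_calibration_plots(final_model.predict_proba(X_test)[:, 1], y_test)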
def show_feature_importances(feature_names, feature_importances, get_top=None):
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importances})
    feature_importances = feature_importances.sort_values('importance', ascending=False)

    plt.figure(figsize=(20, len(feature_importances) * 0.355))

    # keyword arguments are required by recent seaborn versions
    sns.barplot(x=feature_importances['importance'], y=feature_importances['feature'])

    plt.xlabel('Importance')
    plt.title('Importance of features')
    plt.show()

    if get_top is not None:
        return feature_importances['feature'][:get_top].tolist()
def plot_feature_importance(importance,names,model_type):
    
    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)
    
    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)
    
    #Sort the DataFrame in order decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    
    #Define size of bar plot
    plt.figure(figsize=(10,8))
    #Plot Searborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')
    plt.show()
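
This plotting helper is also defined but never invoked; one possible call, assuming the final CatBoost model trained on train_new below, would be:

plot_feature_importance(final_model.feature_importances_, train_new.columns, 'CATBOOST')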

Paths to directories and files

TARGET_NAME = 'Credit Default'

TRAIN_DATASET_PATH = 'data/train.csv'
TEST_DATASET_PATH = 'data/test.csv'
SCALER_FILE_PATH = 'data/scaler.pkl'

TRAIN_PART_PATH = 'data/training_project_train_part.csv'
TEST_PART_PATH = 'data/training_project_test_part.csv'

Loading data

train_df = pd.read_csv(TRAIN_DATASET_PATH)
train_df.head()
Home Ownership Annual Income Years in current job Tax Liens Number of Open Accounts Years of Credit History Maximum Open Credit Number of Credit Problems Months since last delinquent Bankruptcies Purpose Term Current Loan Amount Current Credit Balance Monthly Debt Credit Score Credit Default
0 Own Home 482,087.00 NaN 0.00 11.00 26.30 685,960.00 1.00 NaN 1.00 debt consolidation Short Term 99,999,999.00 47,386.00 7,914.00 749.00 0
1 Own Home 1,025,487.00 10+ years 0.00 15.00 15.30 1,181,730.00 0.00 NaN 0.00 debt consolidation Long Term 264,968.00 394,972.00 18,373.00 737.00 1
2 Home Mortgage 751,412.00 8 years 0.00 11.00 35.00 1,182,434.00 0.00 NaN 0.00 debt consolidation Short Term 99,999,999.00 308,389.00 13,651.00 742.00 0
3 Own Home 805,068.00 6 years 0.00 8.00 22.50 147,400.00 1.00 NaN 1.00 debt consolidation Short Term 121,396.00 95,855.00 11,338.00 694.00 0
4 Rent 776,264.00 8 years 0.00 13.00 13.60 385,836.00 1.00 NaN 0.00 debt consolidation Short Term 125,840.00 93,309.00 7,180.00 719.00 0
test_df = pd.read_csv(TEST_DATASET_PATH)
test_df.head()
Home Ownership Annual Income Years in current job Tax Liens Number of Open Accounts Years of Credit History Maximum Open Credit Number of Credit Problems Months since last delinquent Bankruptcies Purpose Term Current Loan Amount Current Credit Balance Monthly Debt Credit Score
0 Rent NaN 4 years 0.00 9.00 12.50 220,968.00 0.00 70.00 0.00 debt consolidation Short Term 162,470.00 105,906.00 6,813.00 NaN
1 Rent 231,838.00 1 year 0.00 6.00 32.70 55,946.00 0.00 8.00 0.00 educational expenses Short Term 78,298.00 46,037.00 2,318.00 699.00
2 Home Mortgage 1,152,540.00 3 years 0.00 10.00 13.70 204,600.00 0.00 NaN 0.00 debt consolidation Short Term 200,178.00 146,490.00 18,729.00 7,260.00
3 Home Mortgage 1,220,313.00 10+ years 0.00 16.00 17.00 456,302.00 0.00 70.00 0.00 debt consolidation Short Term 217,382.00 213,199.00 27,559.00 739.00
4 Home Mortgage 2,340,952.00 6 years 0.00 11.00 23.60 1,207,272.00 0.00 NaN 0.00 debt consolidation Long Term 777,634.00 425,391.00 42,605.00 706.00

1. Overview of the training dataset

print(train_df.shape, test_df.shape)
(7500, 17) (2500, 16)
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 17 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Home Ownership                7500 non-null   object 
 1   Annual Income                 5943 non-null   float64
 2   Years in current job          7129 non-null   object 
 3   Tax Liens                     7500 non-null   float64
 4   Number of Open Accounts       7500 non-null   float64
 5   Years of Credit History       7500 non-null   float64
 6   Maximum Open Credit           7500 non-null   float64
 7   Number of Credit Problems     7500 non-null   float64
 8   Months since last delinquent  3419 non-null   float64
 9   Bankruptcies                  7486 non-null   float64
 10  Purpose                       7500 non-null   object 
 11  Term                          7500 non-null   object 
 12  Current Loan Amount           7500 non-null   float64
 13  Current Credit Balance        7500 non-null   float64
 14  Monthly Debt                  7500 non-null   float64
 15  Credit Score                  5943 non-null   float64
 16  Credit Default                7500 non-null   int64  
dtypes: float64(12), int64(1), object(4)
memory usage: 996.2+ KB
columns_name = train_df.columns
train_df.nunique(dropna=False)
Home Ownership                     4
Annual Income                   5479
Years in current job              12
Tax Liens                          8
Number of Open Accounts           39
Years of Credit History          408
Maximum Open Credit             6963
Number of Credit Problems          8
Months since last delinquent      90
Bankruptcies                       6
Purpose                           15
Term                               2
Current Loan Amount             5386
Current Credit Balance          6592
Monthly Debt                    6716
Credit Score                     269
Credit Default                     2
dtype: int64
columns_name
Index(['Home Ownership', 'Annual Income', 'Years in current job', 'Tax Liens',
       'Number of Open Accounts', 'Years of Credit History',
       'Maximum Open Credit', 'Number of Credit Problems',
       'Months since last delinquent', 'Bankruptcies', 'Purpose', 'Term',
       'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt',
       'Credit Score', 'Credit Default'],
      dtype='object')
# I will also highlight the analyzed variable

y = train_df['Credit Default']

cat_col = ['Home Ownership', 'Years in current job', 'Tax Liens',
           'Number of Credit Problems','Bankruptcies', 'Purpose', 'Term']

num_col = ['Annual Income', 'Number of Open Accounts', 'Maximum Open Credit',
           'Years of Credit History', 'Months since last delinquent', 
           'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt',
           'Credit Score',]
cat_df = train_df[cat_col]
cat_df = cat_df.astype(str)
num_df = train_df[num_col]
num_df = num_df.astype(float)
fig, ax = plt.subplots(4, 2, figsize=(40, 35))

# keyword arguments are required by recent seaborn versions
sns.countplot(x='Home Ownership', data=train_df, ax=ax[0, 0])
sns.countplot(x='Years in current job', data=train_df, ax=ax[0, 1])
sns.countplot(x='Tax Liens', data=train_df, ax=ax[1, 0])
sns.countplot(x='Number of Credit Problems', data=train_df, ax=ax[1, 1])
sns.countplot(x='Bankruptcies', data=train_df, ax=ax[2, 0])
sns.countplot(x='Purpose', data=train_df, ax=ax[2, 1])
sns.countplot(x='Term', data=train_df, ax=ax[3, 0])

plt.show()
for c in cat_df.columns:
    print ("---- %s ---" % c)
    print (cat_df[c].value_counts())
---- Home Ownership ---
Home Mortgage    3637
Rent             3204
Own Home          647
Have Mortgage      12
Name: Home Ownership, dtype: int64
---- Years in current job ---
10+ years    2332
2 years       705
3 years       620
< 1 year      563
5 years       516
1 year        504
4 years       469
6 years       426
7 years       396
nan           371
8 years       339
9 years       259
Name: Years in current job, dtype: int64
---- Tax Liens ---
0.0    7366
1.0      83
2.0      30
3.0      10
4.0       6
6.0       2
5.0       2
7.0       1
Name: Tax Liens, dtype: int64
---- Number of Credit Problems ---
0.0    6469
1.0     882
2.0      93
3.0      35
4.0       9
5.0       7
6.0       4
7.0       1
Name: Number of Credit Problems, dtype: int64
---- Bankruptcies ---
0.0    6660
1.0     786
2.0      31
nan      14
3.0       7
4.0       2
Name: Bankruptcies, dtype: int64
---- Purpose ---
debt consolidation      5944
other                    665
home improvements        412
business loan            129
buy a car                 96
medical bills             71
major purchase            40
take a trip               37
buy house                 34
small business            26
wedding                   15
moving                    11
educational expenses      10
vacation                   8
renewable energy           2
Name: Purpose, dtype: int64
---- Term ---
Short Term    5556
Long Term     1944
Name: Term, dtype: int64
h = num_df.hist(bins=25,figsize=(16,16),xlabelsize='10',ylabelsize='10',xrot=-15)
sns.despine(left=True, bottom=True)
[x.title.set_size(12) for x in h.ravel()];
[x.yaxis.tick_left() for x in h.ravel()];
mask = np.zeros_like(num_df.corr(), dtype=bool)
mask[np.triu_indices_from(mask)] = True 

f, ax = plt.subplots(figsize=(16, 12))
plt.title('Pearson Correlation Matrix',fontsize=25)


sns.heatmap(num_df.corr(),linewidths=0.25,vmax=0.7,square=True,cmap="BuGn", #"BuGn_r" to reverse 
            linecolor='w',annot=True,annot_kws={"size":8},mask=mask,cbar_kws={"shrink": .9});

2. Handling missing values

train_df.isna().sum()
Home Ownership                     0
Annual Income                   1557
Years in current job             371
Tax Liens                          0
Number of Open Accounts            0
Years of Credit History            0
Maximum Open Credit                0
Number of Credit Problems          0
Months since last delinquent    4081
Bankruptcies                      14
Purpose                            0
Term                               0
Current Loan Amount                0
Current Credit Balance             0
Monthly Debt                       0
Credit Score                    1557
Credit Default                     0
dtype: int64
train_df.describe()
Annual Income Tax Liens Number of Open Accounts Years of Credit History Maximum Open Credit Number of Credit Problems Months since last delinquent Bankruptcies Current Loan Amount Current Credit Balance Monthly Debt Credit Score Credit Default
count 5,943.00 7,500.00 7,500.00 7,500.00 7,500.00 7,500.00 3,419.00 7,486.00 7,500.00 7,500.00 7,500.00 5,943.00 7,500.00
mean 1,366,391.72 0.03 11.13 18.32 945,153.73 0.17 34.69 0.12 11,873,177.45 289,833.24 18,314.45 1,151.09 0.28
std 845,339.20 0.27 4.91 7.04 16,026,216.67 0.50 21.69 0.35 31,926,122.97 317,871.38 11,926.76 1,604.45 0.45
min 164,597.00 0.00 2.00 4.00 0.00 0.00 0.00 0.00 11,242.00 0.00 0.00 585.00 0.00
25% 844,341.00 0.00 8.00 13.50 279,229.50 0.00 16.00 0.00 180,169.00 114,256.50 10,067.50 711.00 0.00
50% 1,168,386.00 0.00 10.00 17.00 478,159.00 0.00 32.00 0.00 309,573.00 209,323.00 16,076.50 731.00 0.00
75% 1,640,137.00 0.00 14.00 21.80 793,501.50 0.00 50.00 0.00 519,882.00 360,406.25 23,818.00 743.00 1.00
max 10,149,344.00 7.00 43.00 57.70 1,304,726,170.00 7.00 118.00 4.00 99,999,999.00 6,506,797.00 136,679.00 7,510.00 1.00

First, I fill the missing Annual Income values with the median

median_income = train_df['Annual Income'].median()

train_df['Annual Income'] = train_df['Annual Income'].fillna(median_income)
test_df['Annual Income'] = test_df['Annual Income'].fillna(median_income)

Next, Years in current job. The values are currently text categories, so I take the most frequent category and fill the missing values with it

max_YCJ = train_df['Years in current job'].value_counts().index[0]

train_df['Years in current job'] = train_df['Years in current job'].fillna(max_YCJ)
test_df['Years in current job'] = test_df['Years in current job'].fillna(max_YCJ)

I also convert these text categories to numbers, which are easier to use in further calculations.

train_df['Years in current job'] = train_df['Years in current job'].replace({'10+ years':10,'2 years':2, '3 years':3,
                                                                             '< 1 year':0, '5 years':5, '1 year':1,
                                                                             '4 years':4, '6 years':6,'7 years':7,
                                                                             '8 years':8, '9 years':9})
test_df['Years in current job'] = test_df['Years in current job'].replace({'10+ years':10,'2 years':2, '3 years':3,
                                                                             '< 1 year':0, '5 years':5, '1 year':1,
                                                                             '4 years':4, '6 years':6,'7 years':7,
                                                                             '8 years':8, '9 years':9})

For Months since last delinquent I do the same as for Annual Income and use the median. It might make sense to drop this column altogether, since most of its values are missing

median_delinquent = train_df['Months since last delinquent'].median()

train_df['Months since last delinquent'] = train_df['Months since last delinquent'].fillna(median_delinquent)
test_df['Months since last delinquent'] = test_df['Months since last delinquent'].fillna(median_delinquent)
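
An alternative worth considering, given that more than half of the values in this column are missing, is an explicit missingness indicator (a sketch only, not used in this solution; the indicator would have to be created before the fillna above, and the column name is hypothetical):

train_df['Delinquent Missing'] = train_df['Months since last delinquent'].isna().astype(int)
test_df['Delinquent Missing'] = test_df['Months since last delinquent'].isna().astype(int)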

For Bankruptcies, the frequency check shows that most customers have never gone bankrupt, so I replace the missing values with 0.

train_df['Bankruptcies'].value_counts()
0.00    6660
1.00     786
2.00      31
3.00       7
4.00       2
Name: Bankruptcies, dtype: int64
train_df['Bankruptcies'] = train_df['Bankruptcies'].fillna(0.00)
test_df['Bankruptcies'] = test_df['Bankruptcies'].fillna(0.00)

The last column with NaN values is Credit Score. I fill it with the same, already familiar approach: the median

median_CS = train_df['Credit Score'].median()

train_df['Credit Score'] = train_df['Credit Score'].fillna(median_CS)
test_df['Credit Score'] = test_df['Credit Score'].fillna(median_CS)

3. Handling outliers

I examine outliers using plots and frequency counts.

I leave the categorical columns unchanged: 'Years in current job', 'Tax Liens', 'Number of Credit Problems', 'Bankruptcies', 'Purpose', 'Term'.

In Home Ownership, the frequency counts show that Have Mortgage is far too rare. Most likely these are erroneous entries for Home Mortgage.

test_df['Home Ownership'].value_counts()
Home Mortgage    1225
Rent             1020
Own Home          248
Have Mortgage       7
Name: Home Ownership, dtype: int64
train_df.loc[train_df['Home Ownership'] == 'Have Mortgage', 'Home Ownership'] = 'Home Mortgage'
test_df.loc[test_df['Home Ownership'] == 'Have Mortgage', 'Home Ownership'] = 'Home Mortgage'

Since the 'renewable energy' purpose does not appear in the test DataFrame, I merge it, together with the equally rare 'vacation', into the 'other' category

train_df['Purpose'] = train_df['Purpose'].replace({'vacation':'other','renewable energy':'other'})
test_df['Purpose'] = test_df['Purpose'].replace({'vacation':'other','renewable energy':'other'})

Next, I cap strong outliers in Annual Income, Number of Open Accounts, Maximum Open Credit, Years of Credit History, Months since last delinquent, Current Credit Balance, Monthly Debt and Credit Score. If a value exceeds the mean plus three standard deviations (computed on the training data), I replace it with the mean plus three sigma.

Current Loan Amount has no values above three sigma, so I leave it unchanged.
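
Since the same capping is repeated for several columns, it could be wrapped in a small helper (a sketch only; the cells below keep the explicit per-column version):

def cap_by_three_sigma(train, test, column):
    # threshold is computed on the training data only and then applied to both sets
    upper = train[column].mean() + 3 * train[column].std()
    train.loc[train[column] > upper, column] = upper
    test.loc[test[column] > upper, column] = upper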

mean_AI = train_df['Annual Income'].mean()
sigma_AI = train_df['Annual Income'].std()
train_df.loc[train_df['Annual Income'] > (mean_AI + 3 * sigma_AI), 'Annual Income'] = (mean_AI + 3 * sigma_AI)
test_df.loc[test_df['Annual Income'] > (mean_AI + 3 * sigma_AI), 'Annual Income'] = (mean_AI + 3 * sigma_AI)
mean_NOA = train_df['Number of Open Accounts'].mean()
sigma_NOA = train_df['Number of Open Accounts'].std()

train_df.loc[train_df['Number of Open Accounts'] > (mean_NOA + 3 * sigma_NOA), 'Number of Open Accounts'] = (mean_NOA + 3 * sigma_NOA)
test_df.loc[test_df['Number of Open Accounts'] > (mean_NOA + 3 * sigma_NOA), 'Number of Open Accounts'] = (mean_NOA + 3 * sigma_NOA)
mean_MOC = train_df['Maximum Open Credit'].mean()
sigma_MOC = train_df['Maximum Open Credit'].std()

train_df.loc[train_df['Maximum Open Credit'] > (mean_MOC + 3 * sigma_MOC), 'Maximum Open Credit'] = (mean_MOC + 3 * sigma_MOC)
test_df.loc[test_df['Maximum Open Credit'] > (mean_MOC + 3 * sigma_MOC), 'Maximum Open Credit'] = (mean_MOC + 3 * sigma_MOC)
mean_YCH = train_df['Years of Credit History'].mean()
sigma_YCH = train_df['Years of Credit History'].std()

train_df.loc[train_df['Years of Credit History'] > (mean_YCH + 3 * sigma_YCH), 'Years of Credit History'] = (mean_YCH + 3 * sigma_YCH)
test_df.loc[test_df['Years of Credit History'] > (mean_YCH + 3 * sigma_YCH), 'Years of Credit History'] = (mean_YCH + 3 * sigma_YCH)
mean_MSLD = train_df['Months since last delinquent'].mean()
sigma_MSLD = train_df['Months since last delinquent'].std()

train_df.loc[train_df['Months since last delinquent'] > (mean_MSLD + 3 * sigma_MSLD), 'Months since last delinquent'] = (mean_MSLD + 3 * sigma_MSLD)
test_df.loc[test_df['Months since last delinquent'] > (mean_MSLD + 3 * sigma_MSLD), 'Months since last delinquent'] = (mean_MSLD + 3 * sigma_MSLD)
mean_CCB = train_df['Current Credit Balance'].mean()
sigma_CCB = train_df['Current Credit Balance'].std()

train_df.loc[train_df['Current Credit Balance'] > (mean_CCB + 3 * sigma_CCB), 'Current Credit Balance'] = (mean_CCB + 3 * sigma_CCB)
test_df.loc[test_df['Current Credit Balance'] > (mean_CCB + 3 * sigma_CCB), 'Current Credit Balance'] = (mean_CCB + 3 * sigma_CCB)
mean_MD = train_df['Monthly Debt'].mean()
sigma_MD = train_df['Monthly Debt'].std()

train_df.loc[train_df['Monthly Debt'] > (mean_MD + 3 * sigma_MD), 'Monthly Debt'] = (mean_MD + 3 * sigma_MD)
test_df.loc[test_df['Monthly Debt'] > (mean_MD + 3 * sigma_MD), 'Monthly Debt'] = (mean_MD + 3 * sigma_MD)
mean_CS = train_df['Credit Score'].mean()
sigma_CS = train_df['Credit Score'].std()

train_df.loc[train_df['Credit Score'] > (mean_CS + 3 * sigma_CS), 'Credit Score'] = (mean_CS + 3 * sigma_CS)
test_df.loc[test_df['Credit Score'] > (mean_CS + 3 * sigma_CS), 'Credit Score'] = (mean_CS + 3 * sigma_CS)

4. Preparing data for analysis

The first step is to one-hot encode the categorical features (dummy variables) and standardize the numeric ones. The resulting DataFrames are then used for modeling

cat_dum_train = pd.get_dummies(train_df[cat_col])
cat_dum_test = pd.get_dummies(test_df[cat_col])
scaler = StandardScaler()

num_norm_train = pd.DataFrame(scaler.fit_transform(train_df[num_col]), columns = num_col)
num_norm_test = pd.DataFrame(scaler.transform(test_df[num_col]), columns = num_col)
train_new = pd.concat([cat_dum_train, num_norm_train], axis=1)
test_new = pd.concat([cat_dum_test, num_norm_test], axis=1)
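One caveat when encoding train and test separately: pd.get_dummies can produce different column sets if some category is absent from one of the frames. A defensive sketch (here the rare purposes were already merged above, so the sets should match) is to align the test columns to the train columns:

test_new = test_new.reindex(columns=train_new.columns, fill_value=0)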
X_train, X_test, y_train, y_test = train_test_split(train_new, y, shuffle=True, test_size=0.2, random_state=56)

5. Balancing the target variable

y.value_counts()
0    5387
1    2113
Name: Credit Default, dtype: int64
df_for_balancing = pd.concat([X_train, y_train], axis=1)
df_balanced = balance_df_by_target(df_for_balancing, TARGET_NAME)
    
df_balanced[TARGET_NAME].value_counts()
0    4292
1    3416
Name: Credit Default, dtype: int64
X_train = df_balanced.drop(columns=TARGET_NAME)
y_train = df_balanced[TARGET_NAME]

6. Building and evaluating baseline models

Logistic regression

model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)

y_train_pred = model_lr.predict(X_train)
y_test_pred = model_lr.predict(X_test)

get_classification_report(y_train, y_train_pred, y_test, y_test_pred)
TRAIN

              precision    recall  f1-score   support

           0       0.69      0.83      0.75      4292
           1       0.71      0.54      0.61      3416

    accuracy                           0.70      7708
   macro avg       0.70      0.68      0.68      7708
weighted avg       0.70      0.70      0.69      7708

TEST

              precision    recall  f1-score   support

           0       0.82      0.80      0.81      1095
           1       0.50      0.53      0.51       405

    accuracy                           0.73      1500
   macro avg       0.66      0.66      0.66      1500
weighted avg       0.73      0.73      0.73      1500

CONFUSION MATRIX

col_0             0    1
Credit Default          
0               879  216
1               192  213

k nearest neighbors

model_knn = KNeighborsClassifier()
model_knn.fit(X_train, y_train)

y_train_pred = model_knn.predict(X_train)
y_test_pred = model_knn.predict(X_test)

get_classification_report(y_train, y_train_pred, y_test, y_test_pred)
TRAIN

              precision    recall  f1-score   support

           0       0.79      0.85      0.82      4292
           1       0.80      0.72      0.76      3416

    accuracy                           0.80      7708
   macro avg       0.80      0.79      0.79      7708
weighted avg       0.80      0.80      0.79      7708

TEST

              precision    recall  f1-score   support

           0       0.81      0.77      0.79      1095
           1       0.44      0.50      0.47       405

    accuracy                           0.70      1500
   macro avg       0.63      0.63      0.63      1500
weighted avg       0.71      0.70      0.70      1500

CONFUSION MATRIX

col_0             0    1
Credit Default          
0               839  256
1               201  204

Boosting algorithms

XGBoost

model_xgb = xgb.XGBClassifier(random_state=56)
model_xgb.fit(X_train, y_train)

y_train_pred = model_xgb.predict(X_train)
y_test_pred = model_xgb.predict(X_test)

get_classification_report(y_train, y_train_pred, y_test, y_test_pred)
[20:38:14] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
TRAIN

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      4292
           1       0.97      0.96      0.96      3416

    accuracy                           0.97      7708
   macro avg       0.97      0.97      0.97      7708
weighted avg       0.97      0.97      0.97      7708

TEST

              precision    recall  f1-score   support

           0       0.81      0.81      0.81      1095
           1       0.48      0.47      0.48       405

    accuracy                           0.72      1500
   macro avg       0.64      0.64      0.64      1500
weighted avg       0.72      0.72      0.72      1500

CONFUSION MATRIX

col_0             0    1
Credit Default          
0               886  209
1               213  192

LightGBM

model_lgbm = lgbm.LGBMClassifier(random_state=56)
model_lgbm.fit(X_train, y_train)

y_train_pred = model_lgbm.predict(X_train)
y_test_pred = model_lgbm.predict(X_test)

get_classification_report(y_train, y_train_pred, y_test, y_test_pred)
TRAIN

              precision    recall  f1-score   support

           0       0.91      0.95      0.93      4292
           1       0.93      0.88      0.90      3416

    accuracy                           0.92      7708
   macro avg       0.92      0.91      0.92      7708
weighted avg       0.92      0.92      0.92      7708

TEST

              precision    recall  f1-score   support

           0       0.81      0.84      0.83      1095
           1       0.52      0.48      0.50       405

    accuracy                           0.74      1500
   macro avg       0.67      0.66      0.66      1500
weighted avg       0.73      0.74      0.74      1500

CONFUSION MATRIX

col_0             0    1
Credit Default          
0               917  178
1               211  194

CatBoost

model_catb = catb.CatBoostClassifier(silent=True, random_state=56)
model_catb.fit(X_train, y_train)

y_train_pred = model_catb.predict(X_train)
y_test_pred = model_catb.predict(X_test)

get_classification_report(y_train, y_train_pred, y_test, y_test_pred)
TRAIN

              precision    recall  f1-score   support

           0       0.89      0.95      0.92      4292
           1       0.93      0.86      0.89      3416

    accuracy                           0.91      7708
   macro avg       0.91      0.90      0.90      7708
weighted avg       0.91      0.91      0.91      7708

TEST

              precision    recall  f1-score   support

           0       0.81      0.83      0.82      1095
           1       0.51      0.49      0.50       405

    accuracy                           0.74      1500
   macro avg       0.66      0.66      0.66      1500
weighted avg       0.73      0.74      0.73      1500

CONFUSION MATRIX

col_0             0    1
Credit Default          
0               907  188
1               208  197

On the test set, the boosting classifiers give the best results, and I will tune hyperparameters for CatBoost

7. Choosing the best model and selecting hyperparameters

model_catb = catb.CatBoostClassifier(class_weights=[1, 3.5], silent=True, random_state=56)
params_1 = {'n_estimators':[1500, 1800, 2100],
          'max_depth':[1, 2, 3]}
%%time

# 3-fold cross-validation with shuffling
cv = KFold(n_splits=3, shuffle=True, random_state=56)

rs = RandomizedSearchCV(model_catb, params_1, scoring='f1', cv=cv, n_jobs=-1)
rs.fit(train_new, y)
CPU times: user 17.6 s, sys: 4.03 s, total: 21.6 s
Wall time: 1min 56s
RandomizedSearchCV(cv=KFold(n_splits=3, random_state=56, shuffle=True),
                   estimator=<catboost.core.CatBoostClassifier object at 0x7f8b78854dc0>,
                   n_jobs=-1,
                   param_distributions={'max_depth': [1, 2, 3],
                                        'n_estimators': [1500, 1800, 2100]},
                   scoring='f1')
rs.best_params_
{'n_estimators': 1500, 'max_depth': 3}
rs.best_score_
0.5418973339546257

Training and evaluation of the final model

%%time

final_model = catb.CatBoostClassifier(n_estimators=1500, max_depth=3,
                                      silent=True, random_state=56)
final_model.fit(X_train, y_train)

y_train_pred = final_model.predict(X_train)
y_test_pred = final_model.predict(X_test)

get_classification_report(y_train, y_train_pred, y_test, y_test_pred)
TRAIN

              precision    recall  f1-score   support

           0       0.76      0.87      0.81      4292
           1       0.80      0.66      0.73      3416

    accuracy                           0.78      7708
   macro avg       0.78      0.77      0.77      7708
weighted avg       0.78      0.78      0.78      7708

TEST

              precision    recall  f1-score   support

           0       0.82      0.82      0.82      1095
           1       0.52      0.52      0.52       405

    accuracy                           0.74      1500
   macro avg       0.67      0.67      0.67      1500
weighted avg       0.74      0.74      0.74      1500

CONFUSION MATRIX

col_0             0    1
Credit Default          
0               899  196
1               194  211
CPU times: user 16.2 s, sys: 4.13 s, total: 20.4 s
Wall time: 3.53 s

Getting the result

%%time

final_model = catb.CatBoostClassifier(n_estimators=1500, max_depth=3,
                                      silent=True, random_state=56)
final_model.fit(train_new, y)
CPU times: user 15.6 s, sys: 4.25 s, total: 19.8 s
Wall time: 3.43 s
<catboost.core.CatBoostClassifier at 0x7f8b795233a0>
y_pred = final_model.predict(test_new)
result=pd.DataFrame({'Id':np.arange(2500), 'Credit Default': y_pred})
RESULT_PATH='solutions.csv'
result.to_csv(RESULT_PATH, index=False)

In the overall ranking table I did not place very high, with a final score of 0.45695