- Problem statement
- Outline of the course project stages
- Data overview
- 1. Overview of the training dataset
- 2. Handling missing values
- 3. Handling outliers
- 4. Preparing data for analysis
- 5. Balancing the target variable
- 6. Building and evaluating baseline models
- 7. Choosing the best model and tuning hyperparameters
Task
Using the available data on the bank's customers, build a model on the training dataset that predicts default on a current loan, then make predictions for the examples in the test dataset.
Name of data files
course_project_train.csv - training dataset
course_project_test.csv - test dataset
Target variable
Credit Default - the fact of non-fulfillment of credit obligations
Quality metric
F1-score (sklearn.metrics.f1_score)
Solution requirements
Target metric
- F1 > 0.5
- The metric is evaluated by the quality of the forecast for the main class (1 - loan delinquency)
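As a small illustration of how the metric is computed for the main class, the sketch below counts true positives, false positives and false negatives by hand. The labels are invented for the example and the helper name `f1_for_class` is hypothetical. With its default arguments, `sklearn.metrics.f1_score` also scores class 1 in a binary problem.

```python
# Toy labels, invented for illustration only (1 = loan delinquency).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

def f1_for_class(y_true, y_pred, positive=1):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

toy_f1 = f1_for_class(y_true, y_pred)
print(round(toy_f1, 3))
```

Here precision and recall are both 2/3, so F1 is also 2/3; the project target requires this value to exceed 0.5 on the real test data.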
The solution must contain
- A Jupyter Notebook with the code of your solution, named following the pattern {FULL name}_solution.ipynb, for example SShirkin_solution.ipynb
- A CSV file with predictions of the target variable for the test dataset, named following the pattern {FULL name}_predictions.csv, for example SShirkin_predictions.csv
Recommendations for the code file (ipynb)
- The file must contain headers and comments (markdown)
- It is better to design repetitive operations in the form of functions
- Do not output a large number of rows of tables (5-10 is enough)
- If possible, add graphs describing the data (about 3-5)
- Include only the best model; do not keep every attempted solution in the code
- The project script should run from beginning to end (from loading data to saving predictions)
- The whole project should be in one script (ipynb file).
- You may use the Python libraries and machine learning models covered in this course.
Submission deadline
Submit the project within 5 days after the end of the last webinar. Scores for work submitted before the deadline will be published as a leaderboard ranked by the given quality metric. Projects submitted after the deadline, or resubmitted, are not included in the leaderboard, but you will still be able to see your result.
Building a classification model
- Overview of the training dataset
- Handling outliers
- Handling missing values
- Data analysis
- Selection of features
- Balancing classes
- Model selection, obtaining a baseline
- Choosing the best model, tuning hyperparameters
- Quality control, guarding against overfitting
- Interpretation of results
Forecasting on a test dataset
- Perform the same stages of processing and building features for the test dataset
- Predict the target variable using a model based on a training dataset
- Forecasts should be for all examples from the test dataset (for all rows)
- Observe the original order of the examples from the test dataset
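The last two requirements can be checked mechanically. The sketch below is one possible way to do it, not a required format: the helper name `make_submission` and the toy frame are assumptions made for illustration, while the `Id` and `Credit Default` column names follow the ones used at the end of this notebook.

```python
import numpy as np
import pandas as pd

def make_submission(test_df, y_pred, target_name="Credit Default"):
    """Build a predictions table with one row per test example, in the original order."""
    y_pred = np.asarray(y_pred).ravel()
    # Every test row must receive exactly one prediction.
    if len(y_pred) != len(test_df):
        raise ValueError("number of predictions does not match number of test rows")
    # Positional ids keep the original row order of the test dataset.
    return pd.DataFrame({"Id": np.arange(len(test_df)),
                         target_name: y_pred.astype(int)})

# Toy frame standing in for the real test dataset.
toy_test = pd.DataFrame({"Annual Income": [1.0, 2.0, 3.0]})
submission = make_submission(toy_test, [0, 1, 0])
print(submission)
```

The length check fails loudly if any rows were dropped during preprocessing, which is the usual way predictions end up misaligned with the test file.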
Description of the dataset
- Home Ownership - Home ownership
- Annual Income - annual income
- Years in current job - the number of years at the current job
- Tax Liens - tax encumbrances
- Number of Open Accounts - number of open accounts
- Years of Credit History - number of years of credit history
- Maximum Open Credit - the largest open credit
- Number of Credit Problems - number of credit problems
- Months since last delinquent - the number of months since the last payment delay
- Bankruptcies - bankruptcies
- Purpose - purpose of the loan
- Term - loan term
- Current Loan Amount - current loan amount
- Current Credit Balance - current credit balance
- Monthly Debt - monthly debt
- Credit Default - the fact of non-fulfillment of credit obligations (0 - repaid on time, 1 - overdue)
Importing libraries and defining helper functions
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV, KFold
import xgboost as xgb, lightgbm as lgbm, catboost as catb
from scipy.stats import uniform as sp_randFloat
from scipy.stats import randint as sp_randInt
import warnings
warnings.filterwarnings('ignore')
sns.set(style='whitegrid')
sns.set_context("paper", font_scale=1.5)
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_rows', 50)
def get_classification_report(y_train_true, y_train_pred, y_test_true, y_test_pred):
    print('TRAIN\n\n' + classification_report(y_train_true, y_train_pred))
    print('TEST\n\n' + classification_report(y_test_true, y_test_pred))
    print('CONFUSION MATRIX\n')
    print(pd.crosstab(y_test_true, y_test_pred))
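def balance_df_by_target(df, target_name):
    """Oversample the minority class until it roughly matches the majority class."""
    target_counts = df[target_name].value_counts()
    # idxmax/idxmin return class labels; argmax/argmin return positions in recent pandas
    major_class_name = target_counts.idxmax()
    minor_class_name = target_counts.idxmin()
    disbalance_coeff = int(target_counts[major_class_name] / target_counts[minor_class_name]) - 1
    for i in range(disbalance_coeff):
        sample = df[df[target_name] == minor_class_name].sample(target_counts[minor_class_name])
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df = pd.concat([df, sample], ignore_index=True)
    return df.sample(frac=1)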
def show_learning_curve_plot(estimator, X, y, cv=3, n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y,
                                                            cv=cv,
                                                            scoring='f1',
                                                            train_sizes=train_sizes,
                                                            n_jobs=n_jobs)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.figure(figsize=(15,8))
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.title(f"Learning curves ({type(estimator).__name__})")
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.legend(loc="best")
    plt.grid()
    plt.show()
def show_proba_calibration_plots(y_predicted_probs, y_true_labels):
    preds_with_true_labels = np.array(list(zip(y_predicted_probs, y_true_labels)))
    thresholds = []
    precisions = []
    recalls = []
    f1_scores = []
    for threshold in np.linspace(0.1, 0.9, 9):
        thresholds.append(threshold)
        precisions.append(precision_score(y_true_labels, list(map(int, y_predicted_probs > threshold))))
        recalls.append(recall_score(y_true_labels, list(map(int, y_predicted_probs > threshold))))
        f1_scores.append(f1_score(y_true_labels, list(map(int, y_predicted_probs > threshold))))
    scores_table = pd.DataFrame({'f1':f1_scores,
                                 'precision':precisions,
                                 'recall':recalls,
                                 'probability':thresholds}).sort_values('f1', ascending=False).round(3)
    figure = plt.figure(figsize = (15, 5))
    plt1 = figure.add_subplot(121)
    plt1.plot(thresholds, precisions, label='Precision', linewidth=4)
    plt1.plot(thresholds, recalls, label='Recall', linewidth=4)
    plt1.plot(thresholds, f1_scores, label='F1', linewidth=4)
    plt1.set_ylabel('Scores')
    plt1.set_xlabel('Probability threshold')
    plt1.set_title('Probabilities threshold calibration')
    plt1.legend(bbox_to_anchor=(0.25, 0.25))
    plt1.table(cellText = scores_table.values,
               colLabels = scores_table.columns,
               colLoc = 'center', cellLoc = 'center', loc = 'bottom', bbox = [0, -1.3, 1, 1])
    plt2 = figure.add_subplot(122)
    plt2.hist(preds_with_true_labels[preds_with_true_labels[:, 1] == 0][:, 0],
              label='Another class', color='royalblue', alpha=1)
    plt2.hist(preds_with_true_labels[preds_with_true_labels[:, 1] == 1][:, 0],
              label='Main class', color='darkcyan', alpha=0.8)
    plt2.set_ylabel('Number of examples')
    plt2.set_xlabel('Probabilities')
    plt2.set_title('Probability histogram')
    plt2.legend(bbox_to_anchor=(1, 1))
    plt.show()
def show_feature_importances(feature_names, feature_importances, get_top=None):
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importances})
    feature_importances = feature_importances.sort_values('importance', ascending=False)
    plt.figure(figsize = (20, len(feature_importances) * 0.355))
    # Keyword arguments: recent seaborn versions no longer accept x and y positionally
    sns.barplot(x=feature_importances['importance'], y=feature_importances['feature'])
    plt.xlabel('Importance')
    plt.title('Importance of features')
    plt.show()
    if get_top is not None:
        return feature_importances['feature'][:get_top].tolist()
def plot_feature_importance(importance, names, model_type):
    # Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)
    # Create a DataFrame using a dictionary
    data = {'feature_names': feature_names, 'feature_importance': feature_importance}
    fi_df = pd.DataFrame(data)
    # Sort the DataFrame in order of decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
    # Define size of bar plot
    plt.figure(figsize=(10,8))
    # Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    # Add chart labels (note the space between the model name and the title text)
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')
Paths to directories and files
TARGET_NAME = 'Credit Default'
TRAIN_DATASET_PATH = 'data/train.csv'
TEST_DATASET_PATH = 'data/test.csv'
SCALER_FILE_PATH = 'data/scaler.pkl'
TRAIN_PART_PATH = 'data/training_project_train_part.csv'
TEST_PART_PATH = 'data/training_project_test_part.csv'
Loading data
train_df = pd.read_csv(TRAIN_DATASET_PATH)
train_df.head()
test_df = pd.read_csv(TEST_DATASET_PATH)
test_df.head()
print(train_df.shape, test_df.shape)
train_df.info()
columns_name = train_df.columns
train_df.nunique(dropna=False)
columns_name
# Separate the target variable
y = train_df['Credit Default']
cat_col = ['Home Ownership', 'Years in current job', 'Tax Liens',
'Number of Credit Problems','Bankruptcies', 'Purpose', 'Term']
num_col = ['Annual Income', 'Number of Open Accounts', 'Maximum Open Credit',
'Years of Credit History', 'Months since last delinquent',
'Current Loan Amount', 'Current Credit Balance', 'Monthly Debt',
'Credit Score',]
cat_df = train_df[cat_col]
cat_df = cat_df.astype(str)
num_df = train_df[num_col]
num_df = num_df.astype(float)
fig, ax = plt.subplots(4,2, figsize=(40,35))
# Pass the column as x=: recent seaborn versions treat the first positional argument as data
sns.countplot(x=train_df['Home Ownership'], ax=ax[0,0])
sns.countplot(x=train_df['Years in current job'], ax=ax[0,1])
sns.countplot(x=train_df['Tax Liens'], ax=ax[1,0])
sns.countplot(x=train_df['Number of Credit Problems'], ax=ax[1,1])
sns.countplot(x=train_df['Bankruptcies'], ax=ax[2,0])
sns.countplot(x=train_df['Purpose'], ax=ax[2,1])
sns.countplot(x=train_df['Term'], ax=ax[3,0])
plt.show()
for c in cat_df.columns:
    print("---- %s ---" % c)
    print(cat_df[c].value_counts())
h = num_df.hist(bins=25,figsize=(16,16),xlabelsize='10',ylabelsize='10',xrot=-15)
sns.despine(left=True, bottom=True)
[x.title.set_size(12) for x in h.ravel()];
[x.yaxis.tick_left() for x in h.ravel()];
mask = np.zeros_like(num_df.corr(), dtype=bool)  # np.bool was removed in NumPy 1.24
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(16, 12))
plt.title('Pearson Correlation Matrix',fontsize=25)
sns.heatmap(num_df.corr(),linewidths=0.25,vmax=0.7,square=True,cmap="BuGn", #"BuGn_r" to reverse
linecolor='w',annot=True,annot_kws={"size":8},mask=mask,cbar_kws={"shrink": .9});
train_df.isna().sum()
train_df.describe()
First, I fill the missing Annual Income values with the median of the training data.
median_income = train_df['Annual Income'].median()
train_df['Annual Income'] = train_df['Annual Income'].fillna(median_income)
test_df['Annual Income'] = test_df['Annual Income'].fillna(median_income)
Next comes Years in current job. Its values are currently text labels, so I replace the missing values with the most frequent label.
max_YCJ = train_df['Years in current job'].value_counts().index[0]
train_df['Years in current job'] = train_df['Years in current job'].fillna(max_YCJ)
test_df['Years in current job'] = test_df['Years in current job'].fillna(max_YCJ)
I also replace the text labels with numbers, which are easier to use in later calculations.
train_df['Years in current job'] = train_df['Years in current job'].replace({'10+ years':10,'2 years':2, '3 years':3,
'< 1 year':0, '5 years':5, '1 year':1,
'4 years':4, '6 years':6,'7 years':7,
'8 years':8, '9 years':9})
test_df['Years in current job'] = test_df['Years in current job'].replace({'10+ years':10,'2 years':2, '3 years':3,
'< 1 year':0, '5 years':5, '1 year':1,
'4 years':4, '6 years':6,'7 years':7,
'8 years':8, '9 years':9})
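Since the same dictionary is written out twice above, one possible tidy-up is to define it once and apply it with `Series.map`. This is only a sketch: the names `YEARS_MAP` and `encode_years` are hypothetical, and unlike `replace`, `map` turns any label missing from the dictionary into NaN instead of leaving it unchanged.

```python
import pandas as pd

# Same mapping as in the cell above, defined once so both frames can share it.
YEARS_MAP = {'< 1 year': 0, '1 year': 1, '2 years': 2, '3 years': 3,
             '4 years': 4, '5 years': 5, '6 years': 6, '7 years': 7,
             '8 years': 8, '9 years': 9, '10+ years': 10}

def encode_years(series):
    """Map tenure labels to integers; labels not in YEARS_MAP become NaN."""
    return series.map(YEARS_MAP)

# Toy values standing in for the 'Years in current job' column.
toy = pd.Series(['10+ years', '< 1 year', '3 years'])
encoded = encode_years(toy)
print(encoded.tolist())
```

The NaN behaviour can be useful here: an unexpected label in the test data would show up in `isna().sum()` rather than silently staying a string.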
For Months since last delinquent I do the same as for Annual Income and use the median. It may make more sense to drop this column altogether, though, since most of its values are missing.
median_delinquent = train_df['Months since last delinquent'].median()
train_df['Months since last delinquent'] = train_df['Months since last delinquent'].fillna(median_delinquent)
test_df['Months since last delinquent'] = test_df['Months since last delinquent'].fillna(median_delinquent)
For Bankruptcies, the frequency counts show that most customers have never been bankrupt, so I replace the missing values with 0.
train_df['Bankruptcies'].value_counts()
train_df['Bankruptcies'] = train_df['Bankruptcies'].fillna(0.00)
test_df['Bankruptcies'] = test_df['Bankruptcies'].fillna(0.00)
The last column with NaN values is Credit Score. I fill it with the same method as before, the median.
median_CS = train_df['Credit Score'].median()
train_df['Credit Score'] = train_df['Credit Score'].fillna(median_CS)
test_df['Credit Score'] = test_df['Credit Score'].fillna(median_CS)
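The median fills above all follow one pattern: the statistic is computed on the training data and applied to both frames. A compact way to express that is sketched below; the helper name `fill_with_train_median` and the toy frames are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_with_train_median(train, test, columns):
    """Fill NaNs in both frames with medians computed on the training frame only."""
    medians = train[columns].median()
    train = train.copy()
    test = test.copy()
    train[columns] = train[columns].fillna(medians)
    test[columns] = test[columns].fillna(medians)
    return train, test, medians

# Toy frames standing in for train_df and test_df.
toy_train = pd.DataFrame({'Annual Income': [10.0, 20.0, np.nan, 40.0]})
toy_test = pd.DataFrame({'Annual Income': [np.nan, 5.0]})
toy_train, toy_test, medians = fill_with_train_median(toy_train, toy_test, ['Annual Income'])
print(toy_test['Annual Income'].tolist())
```

Using the training median for the test data, as the cells above do, keeps information from the test set out of the preprocessing.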
I look for outliers using the plots and the frequency counts.
I leave these categorical columns unchanged: 'Years in current job', 'Tax Liens', 'Number of Credit Problems', 'Bankruptcies', 'Purpose', 'Term'.
In the frequency counts for Home Ownership, the value Have Mortgage stands out as too rare. Most likely it is an erroneous entry for Home Mortgage.
test_df['Home Ownership'].value_counts()
train_df.loc[train_df['Home Ownership'] == 'Have Mortgage', 'Home Ownership'] = 'Home Mortgage'
test_df.loc[test_df['Home Ownership'] == 'Have Mortgage', 'Home Ownership'] = 'Home Mortgage'
Since the purpose 'renewable energy' does not occur in the test DataFrame, I merge it together with 'vacation' into the 'other' category.
train_df['Purpose'] = train_df['Purpose'].replace({'vacation':'other','renewable energy':'other'})
test_df['Purpose'] = test_df['Purpose'].replace({'vacation':'other','renewable energy':'other'})
Next, I cap strong outliers in annual income, number of open accounts, maximum open credit, years of credit history, months since the last delinquency, current credit balance, monthly debt and credit score. If a value exceeds the mean plus three sigma (both computed on the training data), I set it to the mean plus three sigma.
No values in Current Loan Amount exceed the three-sigma limit, so I leave that column unchanged.
mean_AI = train_df['Annual Income'].mean()
sigma_AI = train_df['Annual Income'].std()
train_df.loc[train_df['Annual Income'] > (mean_AI + 3 * sigma_AI), 'Annual Income'] = (mean_AI + 3 * sigma_AI)
test_df.loc[test_df['Annual Income'] > (mean_AI + 3 * sigma_AI), 'Annual Income'] = (mean_AI + 3 * sigma_AI)
mean_NOA = train_df['Number of Open Accounts'].mean()
sigma_NOA = train_df['Number of Open Accounts'].std()
train_df.loc[train_df['Number of Open Accounts'] > (mean_NOA + 3 * sigma_NOA), 'Number of Open Accounts'] = (mean_NOA + 3 * sigma_NOA)
test_df.loc[test_df['Number of Open Accounts'] > (mean_NOA + 3 * sigma_NOA), 'Number of Open Accounts'] = (mean_NOA + 3 * sigma_NOA)
mean_MOC = train_df['Maximum Open Credit'].mean()
sigma_MOC = train_df['Maximum Open Credit'].std()
train_df.loc[train_df['Maximum Open Credit'] > (mean_MOC + 3 * sigma_MOC), 'Maximum Open Credit'] = (mean_MOC + 3 * sigma_MOC)
test_df.loc[test_df['Maximum Open Credit'] > (mean_MOC + 3 * sigma_MOC), 'Maximum Open Credit'] = (mean_MOC + 3 * sigma_MOC)
mean_YCH = train_df['Years of Credit History'].mean()
sigma_YCH = train_df['Years of Credit History'].std()
train_df.loc[train_df['Years of Credit History'] > (mean_YCH + 3 * sigma_YCH), 'Years of Credit History'] = (mean_YCH + 3 * sigma_YCH)
test_df.loc[test_df['Years of Credit History'] > (mean_YCH + 3 * sigma_YCH), 'Years of Credit History'] = (mean_YCH + 3 * sigma_YCH)
mean_MSLD = train_df['Months since last delinquent'].mean()
sigma_MSLD = train_df['Months since last delinquent'].std()
train_df.loc[train_df['Months since last delinquent'] > (mean_MSLD + 3 * sigma_MSLD), 'Months since last delinquent'] = (mean_MSLD + 3 * sigma_MSLD)
test_df.loc[test_df['Months since last delinquent'] > (mean_MSLD + 3 * sigma_MSLD), 'Months since last delinquent'] = (mean_MSLD + 3 * sigma_MSLD)
mean_CCB = train_df['Current Credit Balance'].mean()
sigma_CCB = train_df['Current Credit Balance'].std()
train_df.loc[train_df['Current Credit Balance'] > (mean_CCB + 3 * sigma_CCB), 'Current Credit Balance'] = (mean_CCB + 3 * sigma_CCB)
test_df.loc[test_df['Current Credit Balance'] > (mean_CCB + 3 * sigma_CCB), 'Current Credit Balance'] = (mean_CCB + 3 * sigma_CCB)
mean_MD = train_df['Monthly Debt'].mean()
sigma_MD = train_df['Monthly Debt'].std()
train_df.loc[train_df['Monthly Debt'] > (mean_MD + 3 * sigma_MD), 'Monthly Debt'] = (mean_MD + 3 * sigma_MD)
test_df.loc[test_df['Monthly Debt'] > (mean_MD + 3 * sigma_MD), 'Monthly Debt'] = (mean_MD + 3 * sigma_MD)
mean_CS = train_df['Credit Score'].mean()
sigma_CS = train_df['Credit Score'].std()
train_df.loc[train_df['Credit Score'] > (mean_CS + 3 * sigma_CS), 'Credit Score'] = (mean_CS + 3 * sigma_CS)
test_df.loc[test_df['Credit Score'] > (mean_CS + 3 * sigma_CS), 'Credit Score'] = (mean_CS + 3 * sigma_CS)
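The eight blocks above repeat the same three-sigma rule, so it could be wrapped in a single function, in line with the recommendation to put repeated operations into functions. The sketch below is one way to do that with `Series.clip`; the helper name `cap_upper_three_sigma` and the toy data are assumptions for illustration.

```python
import pandas as pd

def cap_upper_three_sigma(train, test, columns):
    """Clip values above mean + 3*std; both statistics come from the training frame."""
    train = train.copy()
    test = test.copy()
    limits = {}
    for col in columns:
        limit = train[col].mean() + 3 * train[col].std()
        train[col] = train[col].clip(upper=limit)
        test[col] = test[col].clip(upper=limit)
        limits[col] = limit
    return train, test, limits

# Toy data: nineteen ordinary values and one extreme value.
toy_train = pd.DataFrame({'Monthly Debt': [10.0] * 19 + [1000.0]})
toy_test = pd.DataFrame({'Monthly Debt': [5000.0, 10.0]})
capped_train, capped_test, limits = cap_upper_three_sigma(toy_train, toy_test, ['Monthly Debt'])
print(round(limits['Monthly Debt'], 1), capped_test['Monthly Debt'].round(1).tolist())
```

In the notebook this would be called once with the list of eight numeric columns handled above. As in the original cells, only the upper tail is capped.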
The first step is to build dummy variables for the categorical columns and standardize the numeric ones. I then combine both into a new DataFrame for further analysis.
cat_dum_train = pd.get_dummies(train_df[cat_col])
cat_dum_test = pd.get_dummies(test_df[cat_col])
scaler = StandardScaler()
num_norm_train = pd.DataFrame(scaler.fit_transform(train_df[num_col]), columns = num_col)
num_norm_test = pd.DataFrame(scaler.transform(test_df[num_col]), columns = num_col)
train_new = pd.concat([cat_dum_train, num_norm_train], axis=1)
test_new = pd.concat([cat_dum_test, num_norm_test], axis=1)
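Because `pd.get_dummies` is called separately on the training and test data, the two matrices can end up with different columns whenever a category is present in only one of them. A minimal sketch of one way to guard against this is shown below, using `reindex` on toy data; the category values are invented for the example.

```python
import pandas as pd

# Toy frames: 'business loan' appears in the training data only.
toy_train = pd.DataFrame({'Purpose': ['other', 'debt consolidation', 'business loan']})
toy_test = pd.DataFrame({'Purpose': ['other', 'debt consolidation']})

dum_train = pd.get_dummies(toy_train, dtype=int)
dum_test = pd.get_dummies(toy_test, dtype=int)

# Give the test matrix exactly the training columns, in the same order;
# categories absent from the test data become all-zero columns,
# and categories unseen in training are dropped.
dum_test = dum_test.reindex(columns=dum_train.columns, fill_value=0)

print(dum_test.columns.tolist())
```

Merging rare categories by hand, as done for Purpose above, solves the same problem for known cases; the `reindex` step covers any that were missed.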
X_train, X_test, y_train, y_test = train_test_split(train_new, y, shuffle=True, test_size=0.2, random_state=56)
y.value_counts()
df_for_balancing = pd.concat([X_train, y_train], axis=1)
df_balanced = balance_df_by_target(df_for_balancing, TARGET_NAME)
df_balanced[TARGET_NAME].value_counts()
X_train = df_balanced.drop(columns=TARGET_NAME)
y_train = df_balanced[TARGET_NAME]
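To make the balancing arithmetic concrete, here is a simplified sketch on invented counts. It appends whole copies of the minority class with `pd.concat`, whereas `balance_df_by_target` draws each batch with `sample`; for a single extra copy the two give the same class counts.

```python
import pandas as pd

# Toy target: 7 examples of class 0 and 3 of class 1 (invented counts).
toy = pd.DataFrame({'x': range(10), 'Credit Default': [0] * 7 + [1] * 3})

counts = toy['Credit Default'].value_counts()
major, minor = counts.idxmax(), counts.idxmin()
# int(7 / 3) - 1 = 1 extra copy of the minority class.
n_copies = int(counts[major] / counts[minor]) - 1

minority = toy[toy['Credit Default'] == minor]
balanced = pd.concat([toy] + [minority] * n_copies, ignore_index=True)
print(balanced['Credit Default'].value_counts().to_dict())
```

The classes end up close but not equal (7 versus 6), because the ratio is truncated to an integer. Note also that only the training part is balanced; the hold-out part keeps the original class ratio, so its scores are measured on realistic data.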
Logistic regression
model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)
y_train_pred = model_lr.predict(X_train)
y_test_pred = model_lr.predict(X_test)
get_classification_report(y_train, y_train_pred, y_test, y_test_pred)
k-nearest neighbors
model_knn = KNeighborsClassifier()
model_knn.fit(X_train, y_train)
y_train_pred = model_knn.predict(X_train)
y_test_pred = model_knn.predict(X_test)
get_classification_report(y_train, y_train_pred, y_test, y_test_pred)
Boosting algorithms
XGBoost
model_xgb = xgb.XGBClassifier(random_state=56)
model_xgb.fit(X_train, y_train)
y_train_pred = model_xgb.predict(X_train)
y_test_pred = model_xgb.predict(X_test)
get_classification_report(y_train, y_train_pred, y_test, y_test_pred)
LightGBM
model_lgbm = lgbm.LGBMClassifier(random_state=56)
model_lgbm.fit(X_train, y_train)
y_train_pred = model_lgbm.predict(X_train)
y_test_pred = model_lgbm.predict(X_test)
get_classification_report(y_train, y_train_pred, y_test, y_test_pred)
CatBoost
model_catb = catb.CatBoostClassifier(silent=True, random_state=56)
model_catb.fit(X_train, y_train)
y_train_pred = model_catb.predict(X_train)
y_test_pred = model_catb.predict(X_test)
get_classification_report(y_train, y_train_pred, y_test, y_test_pred)
On the test split, the CatBoost classifier gives the best results, so I will tune its hyperparameters.
model_catb = catb.CatBoostClassifier(class_weights=[1, 3.5], silent=True, random_state=56)
params_1 = {'n_estimators':[1500, 1800, 2100],
'max_depth':[1, 2, 3]}
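%%time
# cv was not defined earlier in the notebook; a shuffled 3-fold KFold is assumed here
cv = KFold(n_splits=3, shuffle=True, random_state=56)
rs = RandomizedSearchCV(model_catb, params_1, scoring='f1', cv=cv, n_jobs=-1)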
rs.fit(train_new, y)
rs.best_params_
rs.best_score_
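The search above needs a cross-validation splitter passed as `cv`. A self-contained sketch of the same pattern is given below on synthetic data, with logistic regression standing in for CatBoost to keep it fast; the 3-fold split, the parameter grid and all `*_toy` names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic binary classification data, used only to make the sketch runnable.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=56)

# Shuffled K-fold splitter with a fixed seed for reproducible folds.
cv_toy = KFold(n_splits=3, shuffle=True, random_state=56)

param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}
search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_grid,
                            n_iter=3, scoring='f1', cv=cv_toy, random_state=56)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is the mean F1 over the folds for the best sampled parameter set, so it is the number to compare against the F1 > 0.5 target before evaluating on the hold-out split.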
Training and evaluation of the final model
%%time
final_model = catb.CatBoostClassifier(n_estimators=1500, max_depth=3,
silent=True, random_state=56)
final_model.fit(X_train, y_train)
y_train_pred = final_model.predict(X_train)
y_test_pred = final_model.predict(X_test)
get_classification_report(y_train, y_train_pred, y_test, y_test_pred)
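Since the target is F1 for class 1, the default 0.5 probability cut-off is not necessarily the best choice. The sketch below sweeps the same threshold grid as `show_proba_calibration_plots` and returns the threshold with the highest F1. The helper name `best_f1_threshold` and the toy probabilities are assumptions; in the notebook the probabilities would presumably come from `final_model.predict_proba(X_test)[:, 1]`.

```python
import numpy as np

def best_f1_threshold(y_true, y_prob, thresholds=np.linspace(0.1, 0.9, 9)):
    """Return (threshold, f1) maximizing F1 for class 1 over the given grid."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        y_hat = (y_prob > t).astype(int)
        tp = np.sum((y_hat == 1) & (y_true == 1))
        fp = np.sum((y_hat == 1) & (y_true == 0))
        fn = np.sum((y_hat == 0) & (y_true == 1))
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no true positives.
        f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Toy labels and probabilities, invented for illustration.
toy_true = [0, 0, 0, 1, 1, 1]
toy_prob = [0.05, 0.25, 0.45, 0.35, 0.65, 0.85]
threshold, score = best_f1_threshold(toy_true, toy_prob)
print(round(threshold, 2), round(score, 3))
```

A threshold chosen on the hold-out split makes the hold-out score optimistic, so it would be safer to pick it on a separate validation part or via cross-validation.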
Getting the result
%%time
final_model = catb.CatBoostClassifier(n_estimators=1500, max_depth=3,
silent=True, random_state=56)
final_model.fit(train_new, y)
y_pred = final_model.predict(test_new)
# One row per test example, in the original order
result = pd.DataFrame({'Id': np.arange(len(test_new)), TARGET_NAME: y_pred})
RESULT_PATH = 'solutions.csv'
result.to_csv(RESULT_PATH, index=False)
On the overall leaderboard I did not rank highly: the final score was 0.45695.