As a starting point, I took a notebook published on Colab for oil. This notebook applies machine learning methods to the analysis of gold prices and shares of gold mining companies: linear regression, cluster analysis, and random forest. Let me warn you right away that this post does not attempt to describe the current situation or predict the future direction. Like the original author did for oil, I do not aim to prove or refute the capabilities of machine learning for analyzing stock prices or other instruments. I adapted the code for gold in order to encourage further reflection among those who are interested, and I welcome constructive criticism.
import yfinance as yf
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
For the price of gold, we take the value of the exchange-traded fund SPDR Gold Trust, whose shares are 100% backed by the precious metal. Its quotes will be compared with the share prices of the following gold mining companies:
- Newmont Goldcorp (NMM)
- Barrick Gold (GOLD)
- AngloGold Ashanti (AU)
- Kinross Gold (KGC)
- Newcrest Mining (NCM)
- Polyus (PLZL)
- Polymetal (POLY)
- Seligdar (SELG)
gold = pd.DataFrame(yf.download("GLD", start="2010-01-01", end="2019-12-31")['Adj Close'])
gold = gold.reset_index()
gold.columns = ["Date","gold_price"]
gold['Date'] = pd.to_datetime(gold['Date'])
gold.head()
We need to shift the gold price by one day, since we are interested in how yesterday's gold price affected today's stock price.
gold["gold_price"] = gold["gold_price"].shift(1)
shares=["NMM.SG","GOLD","AU","KGC","NCM.AX","PLZL.ME","POLY.ME","SELG.ME"]
data= yf.download(shares, start="2010-01-01", end="2019-12-31")['Adj Close']
data = data.reset_index()
data.head()
data['Date'] = pd.to_datetime(data['Date'])
all_data=pd.DataFrame()
for index in range(len(shares)):
    # transform the data for one ticker
    stock = data.loc[:, ["Date", shares[index]]].copy()
    stock["Date"] = stock["Date"].astype('datetime64[ns]')
    stock.columns = ["Date", "share_price"]
    test = pd.DataFrame(gold)
    output = stock.merge(test, on="Date", how="left") # combine the two data sets
    stock["gold_price"] = output["gold_price"]
    stock['share_price'] = pd.to_numeric(stock['share_price'], errors='coerce')
    stock['gold_price'] = pd.to_numeric(stock['gold_price'], errors='coerce')
    stock["year"] = pd.to_datetime(stock["Date"]).dt.year # create a column with years for subsequent filtering
    stock["name"] = shares[index]
    stock = stock.dropna() # delete all rows with NaN
    # create a column with a scaled share price
    scaler = MinMaxScaler()
    stock["share_price_scaled"] = scaler.fit_transform(stock["share_price"].to_frame())
    # add the data to the main dataframe
    all_data = pd.concat([all_data, stock])
all_data_15 = all_data[(all_data['year']>2014)&(all_data['year']<2020)]
all_data_15.head()
It is best to begin any data analysis by presenting the data visually, which helps to understand it better.
gold[['Date','gold_price']].set_index('Date').plot(color="green", linewidth=1.0)
plt.show()
palette=sns.cubehelix_palette(18, start=2, rot=0, dark=0, light=.95, reverse=False)
g = sns.pairplot(all_data[(all_data['name']=="POLY.ME")&(all_data['year']>2014)&(all_data['year']<2020)].\
drop(["share_price_scaled"],axis=1), hue="year",height=4)
g.fig.suptitle("Polyuse", y=1.08)
palette=sns.cubehelix_palette(18, start=2, rot=0, dark=0, light=.95, reverse=False)
f = sns.pairplot(all_data[(all_data['name']=="GOLD")&(all_data['year']>2014)&(all_data['year']<2020)].\
drop(["share_price_scaled"],axis=1), hue="year",height=4)
f.fig.suptitle('Barrick Gold', y=1.08)
plt.show()
A pair plot shows the pairwise relationships in the data set along with the univariate distribution of each variable, and the palette lets you see how this data changed across the years.
The chart is particularly interesting for 2016 and 2019: it looks as if the Polymetal and Barrick Gold share prices and the price of gold line up along the same line. From the distribution plots we can also conclude that the price of gold and the share prices gradually moved towards higher values.
plt.figure(figsize=(10,10))
sns.set_style("whitegrid")
palette=sns.cubehelix_palette(5, start=2.8, rot=0, dark=0.2, light=0.8, reverse=False)
sns.violinplot(x="year", y="gold_price", data=all_data_15[["gold_price","year"]],
inner="quart", palette=palette, trim=True)
plt.xlabel("Year")
plt.ylabel("Price gold")
plt.show()
sns.catplot(x="year", y="share_price_scaled", col='name', col_wrap=3,kind="violin",
split=True, data=all_data_15,inner="quart", palette=palette, trim=True, height=4, aspect=1.2)
sns.despine(left=True)
The charts show large fluctuations in the gold price in 2016 and 2019. As the plots in the next figure show, some companies, such as Newmont Mining, Barrick Gold, AngloGold Ashanti, Newcrest Mining and Polymetal, were affected as well. Note that all prices are scaled to the range from 0 to 1, which may lead to inaccuracies in interpretation.
Next, let us build distribution charts for one Russian company, Polymetal, and one foreign company, Barrick Gold.
sns.jointplot("gold_price", "share_price",data=all_data_15[all_data_15['name']=="POLY.ME"],kind="kde",
height=6,ratio=2,color="red").plot_joint(sns.kdeplot, zorder=0, n_levels=20)
sns.jointplot("gold_price", "share_price",data=all_data_15[all_data_15['name']=="GOLD"],kind="kde",
height=6,ratio=2,color="red").plot_joint(sns.kdeplot, zorder=0, n_levels=20)
plt.show()
Pay attention to the distribution of the share price for the two companies: it becomes clear that the shape of the density plot is the same for both.
sns.lmplot(x="gold_price", y="share_price_scaled", col="name",ci=None, col_wrap=3,
data=all_data_15, order=1,line_kws={'color': 'blue'},scatter_kws={'color': 'grey'}).set(ylim=(0, 1))
plt.show()
In fact, these charts do not show much, although some stocks do seem to exhibit a relationship.
The next step is to try coloring the charts by year.
palette=sns.cubehelix_palette(5, start=2, rot=0, dark=0, light=.95, reverse=False)
sns.lmplot(x="gold_price", y="share_price_scaled",hue="year", col="name",ci=None,
col_wrap=3, data=all_data_15, order=1,palette=palette,height=4).set(ylim=(0, 1))
plt.show()
Here the picture is somewhat better: for some companies the data cloud stretches along a straight line in certain years, which may indicate the existence of a dependency.
I will give a definition of machine learning from Wikipedia: machine learning is a class of artificial intelligence methods characterized not by solving a problem directly, but by learning while applying solutions to many similar problems. Such methods draw on mathematical statistics, numerical methods, optimization methods, probability theory, graph theory, and various techniques for working with data in digital form.
Machine learning algorithms are usually divided into the following categories: supervised learning and unsupervised learning. Here are their definitions from one of the sites:
Supervised learning is a branch of machine learning devoted to the following problem. There is a set of objects (situations) and a set of possible answers (responses, reactions). Some relationship exists between the objects and the responses, but it is unknown; only a finite set of examples, the "object, response" pairs called the training sample, is known. Based on this data, we need to recover the dependency, that is, build an algorithm that can give a sufficiently accurate answer for any object. To measure the accuracy of the answers, a quality functional is introduced (see the links).
Unsupervised learning is a branch of machine learning that studies a wide class of data processing problems in which only descriptions of a set of objects (the training sample) are known, and the goal is to discover the internal relationships, dependencies, and patterns that exist between the objects. Unsupervised learning is often contrasted with supervised learning, where each training object is assigned a "correct answer" and the task is to find the relationship between the objects and the answers (see the links).
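To make the difference concrete, here is a minimal sketch in scikit-learn terms (the arrays are made-up toy numbers, not our gold data): a supervised estimator is fitted on "object, response" pairs, while an unsupervised one sees only the objects.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
X = np.array([[1.0], [2.0], [3.0], [4.0]]) # objects (toy data)
y = np.array([2.1, 3.9, 6.2, 8.1]) # responses, known only in the supervised case
# supervised: learn the object -> response mapping from (X, y) pairs
supervised = LinearRegression().fit(X, y)
print(supervised.predict([[5.0]])) # answer for a new, unseen object
# unsupervised: only X is given; the algorithm looks for internal structure (here, 2 clusters)
unsupervised = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(unsupervised.labels_)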
The following machine learning methods will be discussed later:
- Cluster analysis
- Linear regression
- Random forest
Using these algorithms, we can try to assess which stocks are overvalued or undervalued relative to the price of gold, and their possible movement on the next day. I remind you that you must be very careful and use the conclusions from this post at your own risk. I also remind you that my main goal is to show the potential of machine learning for stock valuation.
Clustering is the task of dividing a set of objects into groups called clusters, so that each group contains "similar" objects, while objects from different groups are as different as possible.
from sklearn.cluster import KMeans
poly = all_data_15[all_data_15['name']=="GOLD"] # note: despite the variable name, this is Barrick Gold data
# We also need to scale the gold price, so that clustering is not influenced by the relative size of one axis.
poly=pd.DataFrame(poly)
poly['gold_price_scaled'] = scaler.fit_transform(poly["gold_price"].to_frame())
poly["cluster"] = KMeans(n_clusters=5, random_state=1).fit_predict(poly[["share_price_scaled","gold_price_scaled"]])
# The 954 most common RGB monitor colors https://xkcd.com/color/rgb/
colors = ["baby blue", "amber", "scarlet", "grey","milk chocolate", "windows blue"]
palette=sns.xkcd_palette(colors)
sns.lmplot(x="gold_price", y="share_price_scaled",ci=None,palette=palette, hue="cluster",fit_reg=0 ,data=poly)
plt.show()
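The choice of n_clusters=5 above is arbitrary. A common heuristic for picking it is the "elbow" method: plot the within-cluster sum of squares (inertia) for several values of k and look for the bend in the curve. Here is a minimal sketch, reusing the scaled columns of poly from the code above:
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, random_state=1, n_init=10).fit(poly[["share_price_scaled", "gold_price_scaled"]])
    inertias.append(km.inertia_) # within-cluster sum of squared distances
plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()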
Cluster analysis is used in a large number of machine learning tasks, but I have shown it here only for reference, since in this form it does not add much to our analysis.
Next, we will build an ordinary linear regression using supervised learning. The goal is to forecast the last 100 trading days of 2019 from a model trained on the preceding data. The training data is the data used to build the model, and the test data is the data we will try to predict.
for sh in shares:
    print(sh)
    # data preparation: take one company's prices
    share_18 = all_data_15[all_data_15['name']==sh][["share_price","gold_price"]].reset_index()
    # use just 1 variable for the linear regression; split the data into training/testing sets
    train = share_18[:-100]
    test = share_18[-100:]
    x_train = train["gold_price"].to_frame()
    y_train = train['share_price'].to_frame()
    x_test = test["gold_price"].to_frame()
    y_test = test['share_price'].to_frame()
    regr = LinearRegression() # create the linear regression object
    regr.fit(x_train, y_train) # train the model on the training set
    print("Coefficients: ", float(regr.coef_[0][0]))
    print(np.corrcoef(x_train, y_train, rowvar=False))
    y_pred = regr.predict(x_test)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    # plot the outputs using matplotlib
    plt_train = plt.scatter(x_train["gold_price"], y_train, color='grey')
    plt_test = plt.scatter(x_test["gold_price"], y_test, color='green')
    plt_pred = plt.scatter(x_test["gold_price"], y_pred, color='black')
    plt.xlabel("gold_price")
    plt.ylabel("share_price")
    plt.legend((plt_train, plt_test, plt_pred), ("train data", "test data", "prediction"))
    plt.show()
From the above charts, we can conclude that the gold price predicts the next-day share prices of foreign companies quite well. For Russian companies the picture looks much worse. The numbers might give a false impression about the Seligdar shares, but visual analysis of the chart lets us discard that assumption.
Random forest is a machine learning algorithm that uses a committee (ensemble) of decision trees. The main idea is to use a large ensemble of decision trees, each of which by itself gives a very low classification quality, but thanks to their large number the result turns out to be good.
The random forest algorithm accepts more than one input variable to predict the output. It works very efficiently on large amounts of data, can handle many input variables, has effective methods for estimating missing data, and many other advantages. The main disadvantages are:
- Random forest is slow at generating forecasts because it contains many decision trees: whenever it makes a forecast, all the trees in the forest must produce a forecast for the same input and then vote on it, and this whole process takes time.
- The model is harder to interpret than a single decision tree, where you can easily trace a decision by following the path in the tree.
One of the great advantages of a random forest is that it can be used for both classification and regression problems, which make up most of today's machine learning systems. I will briefly illustrate random forests in classification, since classification is sometimes considered a building block of machine learning; a small sketch of a random forest with two trees follows below.
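Since the original illustration is not reproduced here, a minimal sketch on made-up toy data (unrelated to our gold data) shows two trees voting and the forest aggregating their votes:
from sklearn.ensemble import RandomForestClassifier
# toy objects with two features and a binary class label (made-up numbers)
X_toy = [[0, 0], [1, 1], [0, 1], [1, 0], [2, 2], [2, 1]]
y_toy = [0, 1, 0, 0, 1, 1]
# a "forest" of just two shallow trees, purely for illustration
forest = RandomForestClassifier(n_estimators=2, max_depth=2, random_state=0).fit(X_toy, y_toy)
# each tree makes its own prediction; the forest aggregates them
for i, tree in enumerate(forest.estimators_):
    print("tree", i, "predicts:", tree.predict([[2, 0]]))
print("forest predicts:", forest.predict([[2, 0]]))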
In addition to the gold price, we will use other variables to forecast the Newmont Goldcorp share price: the share prices of the other foreign gold mining companies. I know this does not make much sense, but we just want to see how to build this type of model, and it will let us see the impact of each variable on the final forecast.
from sklearn.ensemble import RandomForestRegressor
# 1.- Data Preparation
nmm15=pd.DataFrame()
nmm15=all_data_15[(all_data_15['name']=="NMM.SG") & (all_data_15['year']>2016 )]
nmm15=nmm15[["share_price","gold_price"]].reset_index()
# load the previous-day share prices of the other companies as extra variables
for ticker in ["GOLD", "AU", "KGC", "NCM.AX"]:
    nmm15[ticker] = all_data_15[all_data_15['name']==ticker][-980:].reset_index()['share_price']
    nmm15[ticker] = nmm15[ticker].shift(1)
nmm15 = nmm15.drop(nmm15.index[0])
train = nmm15[:-100]
test = nmm15[-100:]
x_train=train[["gold_price","GOLD","AU","KGC","NCM.AX"]]
y_train=train['share_price']
x_test=test[["gold_price","GOLD","AU","KGC","NCM.AX",]]
y_test=test['share_price'].to_frame()
# 2.- Create the RandomForestRegressor object using max depth = 5 (random_state fixed for reproducibility)
regressor = RandomForestRegressor(n_estimators=200, max_depth=5, random_state=0)
# 3.- Train the model
clf = regressor.fit(x_train, y_train)
# 4.- Predict!
y_pred = regressor.predict(x_test)
y_pred = pd.DataFrame(y_pred)
plt_train=plt.scatter(x_train["gold_price"],y_train, color='grey')
plt_pred=plt.scatter(nmm15["gold_price"], regressor.predict(nmm15[["gold_price","GOLD","AU","KGC","NCM.AX"]]), color='black')
plt.xlabel("gold_price")
plt.ylabel("share_price")
plt.legend((plt_train,plt_pred),("train data","prediction"))
plt.show()
The resulting model looks really good. In addition, we must remember that random forest has many more parameters to configure, but the key one is the maximum depth, which is unlimited by default. Next, we will check how this model predicts on the test data.
plt_train=plt.scatter(x_train["gold_price"],y_train, color='grey')
plt_test=plt.scatter(x_test["gold_price"],y_test, color='green')
plt_pred=plt.scatter(x_test["gold_price"], y_pred, color='black')
plt.xlabel("gold_price")
plt.ylabel("share_price")
plt.legend((plt_train, plt_test,plt_pred),("train data", "test data","prediction"))
plt.show()
y_pred = clf.predict(x_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
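As a quick sanity check, since the maximum depth was named above as the key parameter, here is a minimal sketch comparing several values of max_depth by RMSE on the same train/test split:
for depth in [2, 5, 10, None]: # None means unlimited depth, the scikit-learn default
    rf = RandomForestRegressor(n_estimators=200, max_depth=depth, random_state=0)
    rf.fit(x_train, y_train)
    rmse = np.sqrt(metrics.mean_squared_error(y_test, rf.predict(x_test)))
    print("max_depth =", depth, "RMSE =", round(rmse, 3))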
importances=regressor.feature_importances_
indices=list(x_train)
print("Feature ranking:")
for f in range(x_train.shape[1]):
    print("Feature %s (%f)" % (indices[f], importances[f]))
f, (ax1) = plt.subplots(1, 1, figsize=(8, 6), sharex=True)
sns.barplot(x=indices, y=importances, palette="BrBG", ax=ax1)
ax1.set_ylabel("Importance")
The feature importances immediately make clear how strong the influence of the gold price is.
In short, I hope I have managed to show you the beginnings of a project that uses machine learning to study stock prices, and I look forward to your comments.