
Stock Market Prediction with Uncertainties using LSTM Networks and

Monte Carlo Dropouts


Edvin Aspelin

Abstract— This project aims to explore the field of stock market prediction using deep neural networks; more specifically, LSTM networks with Monte Carlo dropout used to model the prediction uncertainty.
Previously, the exponential moving average has proven very capable of predicting stock prices from one day to the next. The aim here, however, is short-term prediction where the time span of the prediction is several days. The performance of the network is evaluated over different time horizons, and both the mean and the variance of a prediction are illustrated.
Moreover, the report goes through the details of implementing such a network and improving upon it. The network is, however, limited by a dataset containing only daily price and volume data of stock trades.

I. INTRODUCTION

The ability to reliably predict the stock market is obviously a rewarding prospect, as the result would be rather profitable. But no matter how tempting such an implementation may be, the task turns out to be a very difficult one. As one might imagine, there have been many attempts. One example is [1], where the authors catch stock behaviour with some success using one-day-ahead predictions and various machine learning techniques. Generally, though, predicting further into the future has not yet proven reliable.

An interesting note is that most of the networks proposed in current articles take a deterministic approach: the prediction outputs a single value, and the model is evaluated by its accuracy or loss. This implies using, for example, a mean square error loss to determine the performance during training. However, if the network were to be used on a real stock market, there would be no way of telling the loss of the current prediction, or the risk of an investment. The key problem with such a method is that there is no way of determining the accuracy on real-time data, where the answer is unknown.

Risk is an important factor in stock market trading; it is an indication of how uncertain an investment is. In other words, we want to know how certain our predictions are. The behaviour of stock markets tends to be stochastic, and often shows a lot of behaviour that is uncorrelated with its previous values alone. If we accept that we do not have the tools to make reliable stock market predictions, can we instead measure the uncertainty of them? Implementing Monte Carlo dropout in prediction leads the model to inherit Bayesian properties [2], satisfying the want of a way to measure uncertainty and thus creating a tool to analyse the prediction with a Bayesian approach.

Making predictions of the future in stock markets can be expressed as associating observations of previous price trends with future yield. Somehow the network will need to capture these trends in time and make appropriate predictions from them. Hence, the central work revolves around an LSTM-based network. While there is no attempt to claim this is the obvious choice, LSTMs have proven to be very effective at processing sequential data.

II. BACKGROUND THEORY

A. Monte Carlo Dropout

As with Monte Carlo techniques in reinforcement learning (RL), the goal of the method is to create unique scenarios each time a sequence is run through. Compared to RL, one way to obtain the same behaviour in networks with, for example, LSTM layers is to introduce dropouts. Generally, dropouts are used to inhibit overfitting: the dropout layer is applied during training in an attempt to make the expressiveness of the network more general.

Fig. 1. Illustration of how dropout affects a classic network of nodes, where blacked-out nodes are dropped out. The network to the right is without dropout and the one to the left is with dropout.

Classically, the approach shown in Fig. 1 is used only during training. The difference with Monte Carlo dropout is that the same behaviour also applies to predictions. It then follows that each prediction will produce a different result, so one can extract a mean and a variance. Suppose a trained network is a generic function, y(x) = f(x). If the function is stochastic and we fixate the input x, the properties of the output y can be analysed as follows:

E(y) \approx \frac{1}{N} \sum_{n=1}^{N} \hat{y}_n \qquad (1)

\mathrm{Var}(y) \approx \frac{\sum_{n=1}^{N} \left( \hat{y}_n - E(y) \right)^2}{N - 1} \qquad (2)

With Eq. (1) and (2) in combination with the law of large numbers, the mean and variance will converge as N increases.
It is not obvious that this is a Gaussian process. However, the proof is beyond the scope of this project and is covered by [2]. The important conclusion from that derivation is that averaging forward passes through the network is in fact equivalent to Monte Carlo integration over a Gaussian process posterior estimate.
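To make the procedure concrete, the following is a minimal sketch of Monte Carlo dropout at prediction time. It assumes a Keras-style model whose dropout layers can be forced active by calling the model with training=True; the function name, window shape, and sample count are illustrative assumptions rather than the exact setup of this project.

import numpy as np
import tensorflow as tf

def mc_dropout_predict(model: tf.keras.Model, x: np.ndarray,
                       n_samples: int = 1000):
    """Estimate the predictive mean and variance of Eq. (1) and (2)
    by running N stochastic forward passes with dropout kept active."""
    # training=True keeps the dropout layers sampling new masks on
    # every pass, which is what makes each prediction unique.
    samples = np.stack([
        model(x, training=True).numpy() for _ in range(n_samples)
    ])
    mean = samples.mean(axis=0)         # Eq. (1)
    var = samples.var(axis=0, ddof=1)   # Eq. (2), unbiased (N - 1)
    return mean, var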

B. LSTM networks

A major shortcoming of traditional neural networks is their inability to use previous events in a sequence to make better predictions. Long Short-Term Memory (LSTM) networks try to solve this problem by introducing a cell state, which the LSTM layer has the ability to read from and write to. The cell state is forwarded through the whole sequence, which is why information can easily be handed forward through the sequence.

Fig. 2. A single cell in an LSTM network. Note that there are variations to this, but it is the structure used in this report.

As illustrated in Fig. 2, the cell state c is passed forward through the sequence of several LSTM cells. Applying this structure to stock market prices means that the network can more easily capture price behaviour through time, potentially giving a better prediction.
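For reference, a common formulation of a cell such as the one in Fig. 2 is given below. These are the standard LSTM gate equations; since the exact variant in the figure is not reproduced here, they should be read as an assumption about the cell's structure rather than a statement of it:

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

Here f_t, i_t, and o_t are the forget, input, and output gates, c_t is the cell state passed along the sequence, and h_t is the hidden output.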
III. METHOD

A. Performance Evaluation

When evaluating a network, one generally uses classic loss functions such as cross entropy or mean square error. The model should learn to minimise the absolute value of the Euclidean distance between the true value and the prediction. Thus the model is set to minimise the Mean Square Error (MSE) between prediction and ground truth, as it is simply the squared Euclidean distance. However, the output of the model in prediction will be a Gaussian distribution. To evaluate the network's ability to capture that distribution, it is no longer sufficient to calculate the MSE alone. Instead, the measure of the network's performance in prediction will be to minimise the Standard Error (SE) between the posterior distribution and the prediction. The SE is a measure of how many standard deviations the prediction lies from the truth. To summarise, the loss function during training is MSE, and SE is used when evaluating the prediction.

Fig. 3. An example of a Gaussian distribution with standard deviations and percentiles illustrated.

From Fig. 3 we can conclude that minimising the SE expresses the error of our prediction in terms of what the distribution looks like. In other words, a bad prediction will be measured differently depending on the uncertainty of the prediction.

The SE, \sigma_e, is calculated from the prediction \hat{y} and its variance \mu_{\hat{y}}, compared to the ground truth y:

\mu_{\hat{y}} = \sigma_{\hat{y}}^2 \qquad (3)

\sigma_e = \frac{\sqrt{(\hat{y} - y)^2}}{\sigma_{\hat{y}}} \qquad (4)

Eq. (4) then expresses the performance for the final prediction test on the trained network.
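As a concrete sketch, the two measures could be computed as follows; the function names and the assumption of per-day NumPy arrays are illustrative, not the project's actual code:

import numpy as np

def mse(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Training objective: mean square error between prediction and truth."""
    return float(np.mean((y_pred - y_true) ** 2))

def standard_error(y_pred: np.ndarray, y_true: np.ndarray,
                   y_std: np.ndarray) -> np.ndarray:
    """Evaluation measure of Eq. (4): how many predictive standard
    deviations each prediction lies from the ground truth."""
    return np.sqrt((y_pred - y_true) ** 2) / y_std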

B. Data Preprocessing

When training the model, the dataset called Huge Stock Market Dataset [3] is used. The dataset is available at Kaggle free of charge and contains stock prices from the very beginning of NYSE, NASDAQ, and NYSE MKT. The stored data are limited to {Date, Open, High, Low, Close, Volume, OpenInt}; an important note is that the price is adjusted for dividends and splits. Training data for the network are taken from the closing price of each day and fed as a time sequence.

Fig. 4. Illustration of how the data from the major time sequence is split up into x and y data for the network to train on.

A key component when training networks is data preprocessing. Without it, the network might learn a different task than intended. Stock prices can differ vastly between just a couple of different stocks, and the only reasonable way to measure yield or loss is to analyse it in percentages. Predicting stock markets therefore forces some kind of generalisation of the data input, so that the network does not mislearn from stock prices that differ from the one to predict. This can be applied in various ways. Suppose an input sequence x with target y, and normalised counterparts \tilde{x} and \tilde{y}:

\tilde{x} = \frac{x}{\|x\|_\infty} \qquad (5)

\tilde{y} = \frac{y}{\|x\|_\infty} \qquad (6)

With the input data normalised as in Eq. (5) and (6), the problem of differing stock prices is solved. Note that the normalisation is done per input sequence, not over the stock's entire price sequence, which avoids learning differently from stocks that have increased or decreased a lot. However, this is not completely true: the same problem still persists, just on a smaller time frame. The small bias it produces in the input is then considered negligible.
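A minimal sketch of this windowed normalisation, assuming daily closing prices in a NumPy array; the window and horizon lengths are illustrative assumptions:

import numpy as np

def make_training_pairs(close: np.ndarray, window: int = 30,
                        horizon: int = 3):
    """Split a closing-price sequence into (x, y) pairs and normalise
    each pair by the max-norm of its own input window, per Eq. (5)-(6)."""
    xs, ys = [], []
    for start in range(len(close) - window - horizon + 1):
        x = close[start:start + window]
        y = close[start + window:start + window + horizon]
        scale = np.max(np.abs(x))  # the max-norm |x|_inf of this window
        xs.append(x / scale)
        ys.append(y / scale)       # target shares the input window's scale
    return np.array(xs), np.array(ys)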

C. Training the Model

To find an optimal design of the network, several models were designed and tested, all with only LSTM and dropout layers but with various layer widths and depths. Some experiments were done with varying layer widths throughout the network, but it was concluded that the best result was achieved with roughly the same width in all layers.

TABLE I
MODEL WITH BEST PERFORMANCE

Model        No. of layers   No. of nodes in each layer
Best model   2               512

In all models there was only one dropout layer, which was applied on the last LSTM layer. When trying to find the best model, deeper models were also experimented with, but the added depth did not noticeably increase performance beyond that of the best model and only increased the training time.

Fig. 5. Training loss with various models over epochs.

Fig. 5 shows that the models' performance converges given a sufficient number of training epochs. An explanation for this could be the lack of input data: currently, the input is a 2-dimensional vector with time and price. If more information were fed to the deeper networks, they might put the increased expressiveness to use.

D. Model Prediction Uncertainty

As stated previously, the mean and variance are extracted from the following equations:

E(y) \approx \frac{1}{N} \sum_{n=1}^{N} \hat{y}_n

\mathrm{Var}(y) \approx \frac{\sum_{n=1}^{N} \left( \hat{y}_n - E(y) \right)^2}{N - 1}

Training our network in a classical manner yields the trained parameters used in the prediction. Then, using Monte Carlo dropouts in the prediction and iterating over it gives the mean and variance. To make sure we obtained the correct mean and variance, we iterated until the values converged (≈ 1000 times).
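The best model in Table I could be sketched roughly as below, assuming a Keras implementation. The two layer widths come from Table I, while the dropout rate, input window, and prediction horizon are illustrative assumptions; the dropout layer is the single one described above, placed after the last LSTM layer.

import tensorflow as tf

def build_model(window: int = 30, horizon: int = 3,
                width: int = 512, dropout_rate: float = 0.5) -> tf.keras.Model:
    """Two equally wide LSTM layers with a single dropout layer on
    the last LSTM layer, as in Table I (the rate is an assumption)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window, 1)),
        tf.keras.layers.LSTM(width, return_sequences=True),
        tf.keras.layers.LSTM(width),
        tf.keras.layers.Dropout(dropout_rate),  # kept active at prediction time
        tf.keras.layers.Dense(horizon),
    ])
    model.compile(optimizer="adam", loss="mse")  # MSE training loss, Sec. III-A
    return model

At prediction time, such a model would be passed to a routine like mc_dropout_predict above, with dropout forced active, to obtain the mean and 3σ-bounds of the kind shown in Fig. 6 and 7.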

Fig. 6. Using the network to predict three days into the future, illustrating how the stock price compares to the prediction, with both the mean and 3σ-lines to show how it aligns with the distribution.

Testing the best model on predicting the stock price seems to reveal some bias. During several tests the same behaviour, as seen in Fig. 6, was persistent, and no solution to it was found.
Fig. 7. Predicting the stock price at a single given point in time, with an increasing prediction horizon.

Now consider predicting from a single point in time with an increasing time horizon. In Fig. 7 the prediction horizon goes from 1 to 10 days into the future. Looking carefully, the variance actually behaves as one would expect: it increases through time. Often in these cases the prediction was actually better at the end than in the middle; for example, compare the predictions at day 5 and day 10. This could be explained by the stock market's natural movement being a slightly increasing line that the stock price oscillates around, and our network just happened to catch that in the given time frame.

IV. DISCUSSION

The result from predicting three days into the future, seen in Fig. 6, may resemble the correct price rather well. But keep in mind that the error from day to day is often around $1 out of a total of ∼$65, which is roughly a 1.5% error, and the mean error is around 1%. We concluded quite early that stock market predictions were going to be unreliable.

Predicting a stock price with a mean square error of roughly 1% one day ahead is not a very good prediction; a stock is nearly expected to move by that much in a single day. However, it was very interesting to look at the error in standard deviations instead, giving it another meaning. We could then conclude that the network actually was quite uncertain of its predictions, with the 3σ-lines bounding all true stock market prices. Despite the poor estimation, there is great strength in knowing the uncertainty of the network. Another approach to introducing similar behaviour could be a Bayesian Neural Network (BNN), which introduces weight uncertainty.

It was also quite difficult to improve upon the model. The best model was a rather simple one with only two LSTM layers, and other attempts to make it better did not really pay off.

A further improvement to the network would definitely be to increase the information fed to it. In reality, an investor looks at a vast amount of data to assess a company's value or future yield. Key figures like the solvency ratio, liquidity ratios, and debt ratios are all important information for the investor, so why would they not be for the network? The short answer is that they probably would be beneficial, as more information correlated with the prediction generally improves performance. The catch is that none of the datasets available free of charge include such data. It would be interesting to see how the network would analyse a more complex input, and to see which figures actually matter the most.

It is important to note that the uncertainty of the prediction is not measured in terms of the true uncertainty, or variance, of a stock price, as getting the true uncertainty would mean that we could tell in hindsight how a stock price has aligned with its distribution. We are not saying that there is no way of approximating this, but that is not what the network does. The output of the network tells how it believes the price will develop, as well as its own uncertainty about that prediction. If the implementation of the network with Monte Carlo dropout is just right, the uncertainty may very well approximate the true distribution of the stock price, but we have no way of verifying this with the tools used in this project.

V. CONCLUSIONS

Stock market predictions aside, introducing uncertainty in a network is an interesting take on the classical deterministic approach. In some cases one can already retrieve the network's probabilities from the softmax layer, for example in a simple classification network. But Gaussian processes have some interesting properties, and many methods assume the distribution to be just that. Hence, applying a Bayesian approach might open doors to further improving our networks.

REFERENCES

[1] O. Hegazy, O. S. Soliman, and M. Abdul Salam, "A Machine Learning Model for Stock Market Prediction," International Journal of Computer Science and Telecommunications, 2013.
[2] Y. Gal and Z. Ghahramani, "What My Deep Model Doesn't Know..." [Blog], 2015. Available at: http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html [Accessed 27 Oct. 2018].
[3] B. Marjanovic, Huge Stock Market Dataset. Kaggle, 2017. Accessed on: 10 October 2018. Available: https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs