On October 14, 2013, two long-distance phone calls were placed from Stockholm, Sweden: one to Eugene Fama in Chicago and the other to Robert Shiller in New Haven, both in the USA. It was a rainy and gloomy Monday, not an ideal start to the week by any standards (The Boomtown Rats would certainly agree), yet the two gentlemen were in high spirits as they hung up the phone, and who could blame them – they had just been informed that they were awarded the Nobel Prize in Economics. Behind every Nobel Prize there is an interesting story of lifelong commitment, sacrifices and professional ups and downs, culminating in a mind-boggling discovery that changes our understanding of the world. This one was no different, but it came with an additional twist. Mr. Fama and Mr. Shiller both received the award for their groundbreaking work on the same subject, i.e. what drives asset prices, but here is where things get very interesting – each thinks the other one is wrong.
Just to illustrate how crazy this is, try to imagine a Nobel Prize in physics being shared by a guy famous for advancing a particular hypothesis and a guy famous for relentlessly attacking that hypothesis. Fama won his economics Nobel for arguing that markets are efficient. His Efficient Market Hypothesis (EMH) states that prices rationally incorporate all available information, making it impossible for investors to either purchase undervalued stocks or sell stocks at inflated prices. As such, it should be impossible to outperform the overall market through expert stock selection or market timing, and the only way an investor can possibly obtain higher returns is by purchasing riskier investments. Shiller, on the other hand, the father of behavioral finance, is a diehard critic of rationality, so much so that he described market efficiency as “one of the most remarkable errors in the history of economic thought”. Shiller holds that investors, being human, can be swayed by psychology, which can lead to significant capital misallocations and market bubbles.
So who’s right? Well, both. Yeah, yeah, you probably expected some kind of drama, a winner-loser kind of thing, but in reality both of them are right… and kinda wrong at the same time. President Harry Truman once exclaimed in frustration: “Give me a one-handed economist! All my economists say, ‘on the one hand… on the other’”. Well, this is a one-hand-other-hand kind of situation; it depends on how you look at it… and when you look at it.
EMH believers argue it is pointless to search for undervalued stocks or to try to predict market trends through either fundamental or technical analysis… and, bar a few anomalies, research proves their point. One of those anomalies, going by the name of Warren Buffett, bet a million dollars in 2007 that an index fund (i.e. the market) would outperform a collection of hedge funds (i.e. active investors) over the course of 10 years. Last year he won the bet. This is anecdotal evidence, but research gives us the following stats: over a 15-year investment horizon, 92.33% of large-cap managers, 94.81% of mid-cap managers, and 95.73% of small-cap managers failed to beat the market. EMH opponents, on the other hand, point to events like those depicted in the graph above, such as the Stock Market Crash of 1987 (when indexes fell by over 20% in a single day), the Dot-com Bust of 1999-2000 or the “Great Recession” Crash of 2008, as evidence that stock prices can seriously deviate from their fair values. To me, it’s fair to say that these events have undermined the validity of EMH’s main idea – that market prices are always “right” (near fair value) – but have underlined the validity of its main implication for most investors – that beating the markets is extremely difficult, i.e. there are no free lunches.
When it comes to stock price forecasting, there are two main methods/approaches: fundamental and technical analysis. Fundamental analysis is a type of investment analysis where the share value of a company is estimated by analyzing its sales, profits and other financial and economic factors. Basically, fundamental analysis looks at a company’s business to determine whether its stock is properly valued. This method is best suited for long-term forecasting. Technical analysis, on the other hand, uses past price movements of stocks to determine where stocks will go next and is most suitable for short-term predictions. Technicians or chartists analyze time-series data to identify “existing” patterns, trends or cycles. While fundamental analysis possesses some track record within the active investing realm, research is unanimous when it comes to technical analysis – it does not work. Still, the rise of sophisticated deep learning models gives us a reason to test this for ourselves and to investigate the EMH once again. (To be honest, I just want to play with these fancy neural networks, so I wrote two pages of intro just to hook you. I’m kidding, we are going to make a shit load of money, so stay tuned.) Anyway, back to the subject: we are wondering whether these powerful models, with their high degree of nonlinearity, can utilize historical data to predict market returns. Furthermore, do the methods perform differently on different markets? The purpose of this post is to investigate the weak form of the EMH on the American and Croatian stock markets.
Time-series forecasting involves basically two classes of algorithms: linear and non-linear models. However, the basic assumption that the forecasted time series is stationary and linear does not hold for financial/stock time series, so linear models like AR/ARMA/ARIMA are not very useful. That’s where non-linear models enter the picture: first methods like ARCH/GARCH/TAR, and later heavy artillery like ANNs and deep learning models. So let’s take a step back and do a short intro on the subject of artificial neural networks (ANNs) and deep learning (DL).
ANNs represent a class of machine learning models loosely inspired by the study of our brains, i.e. the neocortex, the wrinkly 80 percent of the brain where thinking occurs. Each net is made up of several interconnected neurons, organized in layers, which exchange messages (in jargon: they fire) when certain conditions are met. These models learn, in a very real sense, to recognize patterns in digital representations of sounds, images and other types of data by approximating/modeling the underlying non-linear functions/processes.
To make things clearer: a single perceptron (the basic building block of ANNs) will pass a message to another neuron if the sum of weighted input signals into it from one or more neurons (summation) is great enough (exceeds a threshold) to cause the message transmission (activation). In addition, each perceptron applies a function or transformation to the weighted inputs, which means that the combined weighted input signal is transformed (linearly or non-linearly) prior to evaluating whether the activation threshold has been exceeded. All the weights mentioned above are adaptive; they are tuned constantly by a learning algorithm during the training phase, when the algorithm learns from observed data in order to improve the model. This is done by constantly assessing the correctness of the network’s output, i.e. comparing the actual output with the expected (correct) output and calculating a loss (or cost) function. An optimization technique called gradient descent is used to minimize the cost function, iteratively bringing your network’s actual outputs closer and closer to the expected outputs over the course of training. If you are new to the whole NN subject, you’ll definitely find much better and more comprehensive literature online, so I won’t spend more time on this introduction.
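To ground this in code, here is a minimal sketch (assuming nothing beyond numpy) of a single sigmoid neuron trained by gradient descent on a toy logical-AND dataset; every name in it is illustrative, not from any particular framework:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 0.0, 0.0, 1.0])   # logical AND as target

w = rng.normal(size=2)   # adaptive weights
b = 0.0                  # bias (the learned activation threshold)
lr = 1.0                 # learning rate

for _ in range(10000):
    out = sigmoid(X @ w + b)          # summation + activation
    loss = np.mean((out - y) ** 2)    # squared-error cost
    # Gradient of the cost w.r.t. weights and bias (chain rule).
    delta = (out - y) * out * (1 - out)
    w -= lr * (X.T @ delta) / len(X)
    b -= lr * delta.mean()

preds = (sigmoid(X @ w + b) > 0.5).astype(int)
print(preds)  # with enough epochs this recovers the AND truth table
```

The loop is exactly the assess-compare-adjust cycle described above, just on four data points instead of years of market data.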
Now that we are done with intros, let’s kick off the real thing. In this exercise we’ll try to model individual stock behavior and predict its returns. However, there are two ways to run this exercise: as a classification or as a regression problem. Because accurately predicting stock market prices is an extremely challenging problem, for a start we will use the somewhat simpler binary classification approach, i.e. we will try to predict whether the stock price will go up or down. This is called “direction of change” forecasting; it tries to predict the direction of the market rather than the direction and magnitude (exact price). We will evaluate our model based on its power to make money, simple as that. This last sentence might sound obvious, not even worth stating, but in so many places I encounter stuff like this: “we’ll do stock market prediction with deep learning, we used X model with Y topology, blah blah blah… and it worked like magic, we got great results, just look at our graphs!”. Then you see something like this:
The graph is so zoomed out that you can’t really see shit: you can’t get a sense of the accuracy, and the authors even fail to calculate potential gains compared to some benchmark (a buy & hold strategy would be a good choice). To make it worse, this was from a conference/science journal paper. Other folks might try to convince you that the following graph is an example of good short-term forecasting, when in fact the model just lags the real-world signal. You don’t need DL for this; you could just use Y(t+1) = X(t) and the model would be equally good, meaning bad. In both cases you’ll end up losing money. So if you see a DL application for stock market forecasting without a profit/loss calculation (compared to some benchmark), be very sceptical, especially with the EMH in mind (sorry, Robert!).
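To see how deceptive that lagging baseline is, here is a quick sketch of the Y(t+1) = X(t) model on a synthetic random-walk price series (so all numbers are illustrative only): tiny per-step error, yet zero directional information.

```python
import numpy as np

rng = np.random.default_rng(42)
returns = rng.normal(0.0005, 0.01, size=1000)   # synthetic daily returns
prices = 100 * np.cumprod(1 + returns)          # synthetic price path

naive_pred = prices[:-1]                        # y_hat(t+1) = x(t)
actual = prices[1:]

# Per-step error is roughly the daily volatility, so a plot looks great.
mape = np.mean(np.abs(actual - naive_pred) / actual) * 100
print(f"Persistence baseline MAPE: {mape:.2f}%")

# Directional accuracy, however, is only chance level: the "prediction"
# of tomorrow's move is just yesterday's move.
up_actual = np.sign(np.diff(actual))
up_naive = np.sign(np.diff(naive_pred))
direction_acc = np.mean(up_actual == up_naive)
print(f"Directional accuracy: {direction_acc:.2f}")
```

A chart of `naive_pred` against `actual` would look like near-perfect forecasting, which is exactly the trap described above.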
Now that we have a clear problem statement and defined KPIs, it’s time to outline our approach. In short, we want to predict whether the stock will go up or down tomorrow based on the stock’s behavior over the last D days. In order to do this, we need to perform the following steps:
- data preprocessing & feature selection/engineering;
- select observation window;
- perform train-test split;
- do feature scaling, i.e. normalization;
- prepare the data: we want to predict whether the stock will go up or down tomorrow (T+1) based on the stock’s behavior over the last D days, so we need to prepare the samples accordingly. If we use N features, one input sample X will be an array of D rows and N columns (D×N), while one sample of the target variable Y will be either the vector [1, 0] (stock goes up) or [0, 1] (stock goes down) – remember one-hot encoding;
- train the model;
- apply the model and evaluate it on the test dataset;
- slide across time and repeat from point 2.
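The data preparation step above can be sketched like this (shapes only; the features here are random placeholders, and the function name is just illustrative):

```python
import numpy as np

def make_windows(features, close, D):
    """features: (T, N) feature matrix; close: (T,) closing prices."""
    X, Y = [], []
    for t in range(D, len(close) - 1):
        X.append(features[t - D:t])             # last D days, N features
        up = close[t + 1] > close[t]            # direction on day t+1
        Y.append([1, 0] if up else [0, 1])      # one-hot encoding
    return np.array(X), np.array(Y)

T, N, D = 100, 5, 20
rng = np.random.default_rng(0)
feats = rng.normal(size=(T, N))                 # placeholder features
close = 100 * np.cumprod(1 + rng.normal(0, 0.01, size=T))

X, Y = make_windows(feats, close, D)
print(X.shape, Y.shape)   # (79, 20, 5) (79, 2)
```

Each X sample is the D×N matrix described in the steps above, and each Y row is the one-hot up/down label for the following day.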
There’s one thing you need to keep in mind when normalizing features: any preprocessing statistics (e.g. mean/variance or min/max) must be computed on the training data only, and then applied to the test data. I’ve seen examples where folks first normalized the entire dataset and then performed the train-test split – by doing this you are introducing future, unknown information into the training data set, which can then influence the results.
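A minimal sketch of leakage-free min/max scaling (toy data, plain numpy): fit the statistics on the training slice, then reuse them on the test slice.

```python
import numpy as np

data = np.arange(100, dtype=float).reshape(-1, 1)   # toy feature column
split = 80
train, test = data[:split], data[split:]            # split FIRST

lo, hi = train.min(axis=0), train.max(axis=0)       # stats from train ONLY
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)               # reuse train stats

print(train_scaled.max())   # 1.0 by construction
print(test_scaled.max())    # above 1.0: test values exceed the train range
```

Note that scaled test values can fall outside [0, 1]; that is expected and correct. If you instead normalized the full series first, the train set would silently "know" the future min/max.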
I’ve decided to use two combinations of features during the model construction process: the first combo containing 5 features, namely Open-Close-Low-High-Volume, and the second with 3 variables: Daily Return, Volatility (=High/Low) and Volume. For modeling I’ve decided to try Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), namely the Long Short-Term Memory (LSTM) network.
CNNs are a bit different from standard neural networks; they make the explicit assumption that inputs to the model are images, which constrains their architecture considerably but permits them to scale really well to this type of input. Every layer has a simple API: it transforms an input 3D volume into an output 3D volume with some differentiable function that may or may not have parameters. In contrast to regular networks, layers of CNNs are organized in 3 dimensions: width, height and depth, and neurons in one layer do not connect to all the neurons in the next layer but only to a small region of it. At first a CNN might sound like a weird choice, considering this class of ANNs is most commonly applied to analyzing visual imagery, but our case is not that different. Think about it: CNNs accept 2D/3D inputs, which is exactly what we have here – one sample of our data set is an array/matrix of D (days) × N (features). Intuitively, the idea of applying CNNs to time-series forecasting is to learn filters that represent certain repeating patterns in the series and use them to forecast future values. I’ve used a Conv1D layer, which creates a convolution kernel that is convolved with the input layer over a single spatial – or in our case, temporal – dimension to produce a tensor of outputs. I know this is a lot of technical jargon, so in case you are not familiar with CNNs, please find more info online; it’s just impossible to cram it all into one blog post.
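To demystify the jargon a bit, here is a hand-rolled sketch of what a Conv1D layer does to one D×N sample: slide a kernel of length K along the temporal axis, mixing all N feature channels at each position (one output channel, no bias, "valid" padding). Real frameworks do this batched, with many filters, but the sliding-window arithmetic is the same.

```python
import numpy as np

def conv1d_valid(sample, kernel):
    """sample: (D, N); kernel: (K, N) -> output: (D - K + 1,)"""
    D, N = sample.shape
    K = kernel.shape[0]
    # At each temporal position, take the dot product of the kernel
    # with a K-day slice across all N feature channels.
    return np.array([np.sum(sample[i:i + K] * kernel)
                     for i in range(D - K + 1)])

D, N, K = 10, 3, 4
rng = np.random.default_rng(1)
sample = rng.normal(size=(D, N))   # one D-day, N-feature window
kernel = rng.normal(size=(K, N))   # a learned filter, here random

out = conv1d_valid(sample, kernel)
print(out.shape)   # (7,)
```

A trained filter would respond strongly wherever its learned K-day pattern recurs in the window, which is exactly the "repeating patterns" intuition above.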
If CNNs sound complicated, please welcome to the stage: RNNs. Recurrent nets are a type of artificial neural network designed to recognize patterns in sequences of data, such as text, genomes or, in our case (hopefully!), stock market data. These algorithms take time and sequence into account; they have a temporal dimension, or let’s say memory. Basically, they are networks with loops in them, allowing information to persist. In contrast to classical feed-forward networks, recurrent networks take as their input not just the current input example they see, but also what they have perceived previously in time. Again, this is just to give you an intuition about the functioning of LSTM networks; in reality it’s much, much more complicated, but I don’t want to go into detail about cell state, different types of gates etc.
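For intuition only, here is a stripped-down sketch of a single LSTM cell step with random (untrained) weights, following the standard forget/input/output gate formulation; it shows how the cell state c carries information from step to step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """x: (n_in,); h, c: (n_hid,); W: (4*n_hid, n_in); U: (4*n_hid, n_hid)."""
    n = h.shape[0]
    z = W @ x + U @ h + b
    f = sigmoid(z[:n])          # forget gate: what to drop from memory
    i = sigmoid(z[n:2*n])       # input gate: what to write to memory
    o = sigmoid(z[2*n:3*n])     # output gate: what to expose
    g = np.tanh(z[3*n:])        # candidate cell state
    c_new = f * c + i * g       # old memory persists, scaled by f
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hid = 5, 8
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)

h = np.zeros(n_hid)
c = np.zeros(n_hid)
for x in rng.normal(size=(20, n_in)):   # run over a 20-step sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)   # (8,) (8,)
```

The key line is `c_new = f * c + i * g`: unlike a plain feed-forward layer, each step sees a summary of everything that came before.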
When you start playing with DL models, it’s all fine and dandy until you get the first results. Then you start thinking: hmm, I can do better than this, maybe if I change this and adjust that… and before long, without realizing it, you’re in the rabbit hole of hyperparameter optimization… and unlike classical machine learning models, deep learning models are literally full of hyperparameters (e.g. learning rate, dropout, batch size, etc.).
When you also count in model design variables (e.g. number of layers, activation functions, optimizers etc.), it all escalates really quickly. Obviously, not all of these variables contribute equally to the model’s learning process; however, finding the best configuration for them in such a high-dimensional space is an extremely complex challenge. After a long and painful hyperparameter search, we ended up with the following two model candidates:
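The search itself can be as simple as random sampling over a small grid. In this sketch, `train_and_evaluate` is a hypothetical stand-in (a dummy deterministic scorer) for an actual training-and-validation run:

```python
import random

search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "dropout": [0.2, 0.3, 0.5],
    "batch_size": [16, 32, 64],
}

def train_and_evaluate(cfg):
    # Placeholder: in practice, train the model with cfg and return
    # validation accuracy. Here: a fake score so the loop is runnable.
    return 0.5 + 0.1 * cfg["dropout"] - 10 * cfg["learning_rate"]

random.seed(0)
best_cfg, best_score = None, float("-inf")
for _ in range(20):
    cfg = {k: random.choice(v) for k, v in search_space.items()}
    score = train_and_evaluate(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score

print(best_cfg)
```

Random search is a common default because it covers a high-dimensional space more evenly than a full grid for the same budget; fancier options (Bayesian optimization etc.) exist, but the loop structure is the same.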
As you can see, we ended up with stacked CNN and LSTM network architectures, the LSTM being a bit more complex considering its number of trainable parameters. Complex neural networks are very prone to high variance, meaning the model learns the noise in the data rather than the actual relation between the input features and the classes it is supposed to predict. Consequently, when the model is applied to new inputs that were not used in training, it will fail to make good predictions. You can mitigate this with a large amount of training data, which is not the case here. Yes, we have all historical data for all S&P 500 stocks, but our problem is slightly different: basically, every stock and every time period is a different problem. Stock market behavior is a non-stationary, non-ergodic process (which is what makes this such a difficult problem to solve), meaning history does not repeat itself, so feeding the model 10 years of data means very little. We tinkered a lot and got the best results with shorter time windows, i.e. training on 3-4 months of data in order to predict the near-term future. This means we had huge problems with overfitting, as you can see from our models, and we ended up employing dropout, batch normalization and L1 (lasso) regularization to manage it.
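For illustration, here is what two of those regularizers look like outside any framework (toy values throughout): inverted dropout randomly zeroes activations during training and rescales the survivors, while the L1 penalty adds the absolute weight sum to the loss, pushing weights toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate):
    """Inverted dropout: zero a fraction `rate`, rescale the survivors."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

def l1_penalty(weights, lam):
    """Lasso-style term added to the loss function."""
    return lam * np.sum(np.abs(weights))

a = np.ones(1000)
dropped = dropout(a, rate=0.5)
zero_frac = np.mean(dropped == 0)
print(zero_frac)                                    # close to 0.5
print(l1_penalty(np.array([0.5, -0.5]), lam=0.01))  # 0.01
```

At inference time dropout is switched off entirely; the rescaling during training is what keeps the expected activation magnitude consistent between the two modes.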
Accuracy was used as the primary metric for measuring the performance of our models. This was not an ideal choice, for at least two reasons. First, there was a slight imbalance within the up/down target outputs in the data set, i.e. days when the stock went up were more frequent than days when the stock went down (51-49%), so just predicting that the stock will go up every day would already make a model better than a coin flip, i.e. random guessing. For this reason, we used a confusion matrix to monitor both the TP rate (True Positive, i.e. the stock went up and we correctly predicted it) and the TN rate (True Negative, i.e. the stock went down and we correctly predicted it).
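Here is a tiny sketch of that monitoring on toy labels, showing how a degenerate "always up" predictor can post a decent-looking accuracy while its TN rate is exactly zero:

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])  # 1 = up, 0 = down (toy imbalance)
y_pred = np.ones(10, dtype=int)                    # degenerate "always up" model

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

accuracy = (tp + tn) / len(y_true)
tp_rate = tp / (tp + fn)          # sensitivity: up days caught
tn_rate = tn / (tn + fp)          # specificity: down days caught
print(accuracy, tp_rate, tn_rate)   # 0.6 1.0 0.0
```

With a real 51-49 split the same degenerate model would score 51% accuracy; watching the TP and TN rates separately is what exposes the trick.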
The second reason why accuracy is not an ideal performance metric is simple, but people tend to forget it: accuracy is not perfectly correlated with the money-making power of the model. Why? Let’s give an example: if an investor stayed fully invested in the S&P 500 from 1993 to 2013, they would have had a 9.2% annualized return. However, if trading resulted in them missing just the ten best days during that same period, those annualized returns would collapse to 5.4%. So not predicting a few important days correctly (big up or down swings) can make a huge difference. That’s why theory says you shouldn’t trade, you shouldn’t try to time the market – you just buy and hold… and you buy “Jack Bogle” type index funds 🙂
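The effect is easy to reproduce on synthetic returns (the numbers below are illustrative only, not the actual 1993-2013 S&P 500 figures quoted above): compound the full series, then the same series with its ten best days removed, and compare annualized returns.

```python
import numpy as np

rng = np.random.default_rng(7)
years = 20
daily = rng.normal(0.0004, 0.012, size=252 * years)   # synthetic daily returns

def annualized(returns, years):
    total = np.prod(1 + returns)
    return total ** (1 / years) - 1

full = annualized(daily, years)
# Drop the 10 largest daily returns from the 20-year series.
without_best10 = annualized(np.sort(daily)[:-10], years)

print(f"Fully invested:       {full:.2%}")
print(f"Missing 10 best days: {without_best10:.2%}")
```

Ten days out of roughly five thousand, and the gap in annualized return is clearly visible; that asymmetry is why a few mispredicted big swings can dominate an otherwise accurate model.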
Anyway, back to the topic… we used two types of network architectures, LSTM and CNN, on two different feature sets, and the following table shows the percentage of correct directional predictions for the different model-feature set combinations. First, note that all models on all datasets predict the correct direction with a probability of 50% or higher, meaning we are better than a coin flip. Interestingly enough, the CNN model outperformed the LSTM one, not only accuracy-wise but also regarding robustness, as we observed occurrences where the TP and TN rates differed considerably, meaning the LSTM just “memorized” the prevalent behavior. The stock market prediction problem is a very difficult one and the loss function surface is very convoluted, with many local minima, so the final weights and biases – and the observed accuracy – can vary with each training session. Because of temporal dependencies, classical k-fold cross-validation was not really an option, so we are reporting averaged generalization accuracy. It’s also interesting to observe that the 3-feature set performed considerably worse. The motivation for using changes instead of values was to allow an apples-to-apples comparison across different time periods – it kinda makes our time series stationary – however, our models didn’t appreciate the dimensionality reduction, as if by removing prices we removed “memory”. One potential idea for the future: it would be interesting to combine the good aspects of both feature sets, i.e. create features that are more stationary but preserve “memory”.
As we previously mentioned, accuracy is not the best metric for model evaluation, as it does not translate well to earning power, so in order to make any judgements we’ll need to evaluate the performance of our trading strategy. As a benchmark we’ll use buy & hold and check whether our Long-Out strategy can beat it. By Long-Out we mean buying the stock when the model predicts the stock will go up and staying out of the market when negative returns are predicted. As we can see, our model consistently earns money but does not consistently outperform the buy & hold approach. Despite outperforming buy & hold on average, performance variation/instability due to the previously mentioned local minima of the loss function remains a problem.
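The Long-Out evaluation itself boils down to a few lines; in this sketch the predictions are random placeholders standing in for real model output, and the returns are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
returns = rng.normal(0.0005, 0.01, size=252)   # one synthetic trading year
preds_up = rng.random(252) > 0.5               # placeholder "up" signals

# Buy & hold: compound every day's return.
buy_hold = np.prod(1 + returns) - 1

# Long-Out: earn the return on predicted-up days, hold cash otherwise.
long_out = np.prod(np.where(preds_up, 1 + returns, 1.0)) - 1

print(f"Buy & hold: {buy_hold:.2%}")
print(f"Long-Out:   {long_out:.2%}")
```

With real model predictions you would also subtract transaction costs on every entry/exit, which, as discussed below, is one of the frictions this analysis ignores.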
We tried improving accuracy and stability by aggregating the outputs of several predictors, and while initial testing showed some improvement, the performance benefits compared to the buy & hold approach are still very small, which brings us to another important disclaimer. We performed this analysis in a hypothetical, frictionless world under a set of assumptions that do not or might not hold in reality. First and foremost, transaction costs would be a significant factor when operating this kind of day trading strategy. We also make certain liquidity assumptions that would prevent executing this strategy on a larger scale.
Even if the above-described approach and results might look interesting at first glance, the reported accuracy is not enough to cover real-world frictions, so a lot of important parts for profitable trading are still missing. We’ve already mentioned some of the issues encountered and ways for improvement, including making the time series more stationary while preserving “memory”. Also, looking back at the whole process, the way we prepared our data sets is something that can be improved. To illustrate, the X samples (for predicting Ys) overlap a lot in time; they are similar, and it’s hard for the model to figure out which observed feature caused the effect. Other potential ways to improve performance might include additional features such as bids and asks from the order book, alternative data such as real-time news and social media feeds, shortening the time frame to intra-day trading etc. Other input variables could be economic indicators or other financial data such as exchange rates, volatility indices and/or interest rates. This brings us to multivariate time-series forecasting, where we can extract meaningful trends and relationships/dependencies in and between the noisy datasets. There are also other approaches such as momentum and (contrarian) value investing, which in theory provide much more opportunity than technical analysis. Value investing is something we definitely plan to investigate from an ML perspective, but as always, getting quality data is a problem, and that’s especially true for companies’ fundamental data. Anyway, all these options represent potential future extensions to this project, and we’ll be doing some of them for sure.
I guess it’s time to wrap this up and summarize what we’ve learned. In short, it’s not enough to plug in a neural network and make a shit load of cash; finance is just not a plug & play field. Also, despite the power and potential of deep learning models, the lack of effectiveness of technical analysis was demonstrated once again. Our results suggest that the technical analysis used by Wall Street practitioners may be more of an art than a science. There’s just too much randomness, too much noise and obviously not enough predictive power within the available feature set (stock prices and volume). Prices and volumes just don’t reflect enough information about the market and what its participants are doing. Still, despite this and the other reported issues, this was a great exercise; we learned a lot about the subject domain and the application of ML to stock market prediction and financial time-series modeling, with all its promises and perils and specific problems that you probably won’t encounter anywhere else. It’s impossible to be an expert in all Data Science domains, and I certainly learned how much I don’t know about time-series forecasting, especially tricky subjects like this one. However, despite the rather modest (though expected) results, it was a great learning opportunity; we got an idea about the next steps and better ways to approach the problem. To conclude, financial machine learning is different and difficult, but fun… so stay tuned.