Please see the GitHub repository.
This repository presents our work on a project realized in the context of the IEOR 8100 Reinforcement Learning course at Columbia University.
This Deep Policy Network Reinforcement Learning project is our implementation of, and further research based on, the original paper A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem (Jiang et al., 2017).
Objective: The problem is automated portfolio management: given a set of stocks, how should money be allocated over time to maximize returns after a given number of time steps? We aim to build an automated agent that allocates the weights of its investment among the different stocks as effectively as possible.
Data: Jiang et al. use 13 cryptocurrencies from the Poloniex exchange. They take into account the open, high, low, close (OHLC) prices, minute by minute, and allow a portfolio rebalance every 30 minutes. They preprocess the data and create an input tensor based on the last 50 time steps.
We extend the experiment to the stock market, using the framework on daily data and on intraday data with a daily rebalance.
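As a rough illustration of the preprocessing step, the sketch below builds the rolling-window input tensor from raw price arrays. The function name, the ordering of the price features (close first), and the normalization by the latest closing price are assumptions for the example, not a description of the exact code in the repository.

```python
import numpy as np

def build_input_tensor(prices, window=50):
    """Stack the last `window` time steps for each of the m assets.

    prices: array of shape (m, T, f) with f = 3 or 4 price features,
            where column 0 is assumed to be the close price.
    Returns an array of shape (T - window + 1, m, window, f), with each
    window normalized by the asset's latest close (an assumption; the
    exact normalization may differ in the repository code).
    """
    m, T, f = prices.shape
    windows = []
    for t in range(window, T + 1):
        x = prices[:, t - window:t, :].astype(float)   # (m, window, f)
        latest_close = x[:, -1:, 0:1]                   # (m, 1, 1)
        windows.append(x / latest_close)
    return np.stack(windows)

# Example usage with random data: 5 assets, 200 time steps, 4 features
tensor = build_input_tensor(np.random.rand(5, 200, 4))
```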
The project is decomposed into three parts:
The files are:
For each stock, the input is a raw time series of the prices (Open, High, Low, Close).
The output is a matrix with 4 rows (3 in the case of the cryptocurrencies, since the market never closes and Open(t) = Close(t-1)) and n columns, where n is the number of available data points.
The columns correspond to:
The portfolio manager agent is set up as follows:
The policy function is a deep neural network which takes as input a tensor of shape m x 50 x (3 or 4) composed of:
A first convolution produces a smaller tensor. A second convolution then produces 20 vectors of shape (m x 1 x 1). The previous output vector is stacked onto these feature maps. The last layer is a final convolution that reduces everything to a single vector of size m. A cash bias is then appended and a softmax is applied.
The output of the neural network is the vector of actions (the portfolio weights) the agent will take; a sketch of this architecture is given below.
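The following is a minimal sketch of that convolutional policy, written here with PyTorch purely for illustration (the repository may use a different framework). The kernel sizes, the number of intermediate channels, and the interpretation of the stacked "previous output vector" as the last portfolio weights are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EIIEPolicy(nn.Module):
    """Sketch of the convolutional policy described above.

    Input:  price tensor of shape (batch, features, m, window),
            with features = 3 or 4 and window = 50,
            plus the previous weights w_prev of shape (batch, m).
    Output: portfolio weights over cash + m assets, shape (batch, m + 1).
    """
    def __init__(self, n_features=3, window=50, n_filters=20):
        super().__init__()
        # first convolution over the time axis, shrinking the window
        self.conv1 = nn.Conv2d(n_features, 2, kernel_size=(1, 3))
        # second convolution collapses the remaining time axis into
        # n_filters feature maps of shape (m, 1)
        self.conv2 = nn.Conv2d(2, n_filters, kernel_size=(1, window - 2))
        # final 1x1 convolution over the stacked maps + previous weights
        self.conv3 = nn.Conv2d(n_filters + 1, 1, kernel_size=(1, 1))
        # learnable cash bias appended before the softmax
        self.cash_bias = nn.Parameter(torch.zeros(1, 1))

    def forward(self, x, w_prev):
        h = F.relu(self.conv1(x))                   # (batch, 2, m, window-2)
        h = F.relu(self.conv2(h))                   # (batch, n_filters, m, 1)
        w = w_prev[:, None, :, None]                # (batch, 1, m, 1)
        h = torch.cat([h, w], dim=1)                # stack previous weights
        h = self.conv3(h).squeeze(1).squeeze(-1)    # (batch, m)
        cash = self.cash_bias.expand(h.size(0), 1)  # (batch, 1)
        return F.softmax(torch.cat([cash, h], dim=1), dim=1)

# Example: batch of 8 states, 11 assets, 50 time steps, 3 features
policy = EIIEPolicy()
weights = policy(torch.rand(8, 3, 11, 50), torch.full((8, 11), 1.0 / 11))
```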
The environment then computes the new vector of weights, the new portfolio value and the instant reward.
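A minimal sketch of that environment step is shown below. It uses a first-order approximation of the transaction-cost factor and a log-return reward; the commission rate and the exact reward definition are assumptions of the sketch, not necessarily what the repository or the original paper compute.

```python
import numpy as np

def step_reward(w_prev, w_new, price_relatives, trading_cost=0.0025):
    """One environment step, as a rough sketch of the reward described above.

    w_prev, w_new:    weight vectors over cash + m assets (summing to 1).
    price_relatives:  vector of close(t) / close(t-1) for cash + m assets
                      (1.0 for cash).
    trading_cost:     proportional commission rate (assumed value).
    Returns the drifted weights and the log-return used as instant reward.
    """
    # cost of rebalancing from the previous weights to w_new
    # (first-order approximation of the commission factor)
    mu = 1.0 - trading_cost * np.abs(w_new - w_prev).sum()
    # gross growth of the portfolio value over the period
    growth = mu * np.dot(price_relatives, w_new)
    # weights drift with prices before the next rebalance
    w_drifted = (price_relatives * w_new) / np.dot(price_relatives, w_new)
    return w_drifted, np.log(growth)
```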
This part is still in progress. So far, we have not been able to reproduce the paper's results. The algorithm demonstrates the capacity to identify high-potential stocks, which maximizes returns, but it shows little tendency to change its position during the trading process.
We tried many initial parameters, such as a low trading cost, to give the agent an incentive to change its position.
The agent is 'training sensitive' but not 'input state sensitive': its allocation depends more on the training run than on the current input state.
To make the policy more dynamic, we are considering a discrete action space based on pre-defined return thresholds. We would reframe the problem by replacing the softmax with a tanh, or by turning it into a classification task (a sketch of the thresholding idea is given below).
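As a small illustration of the classification idea, the hypothetical helper below maps predicted per-asset returns to discrete actions using pre-defined thresholds. The function name and the threshold values are illustrative only and do not reflect settled design choices.

```python
import numpy as np

def discretize_returns(returns, thresholds=(-0.01, 0.01)):
    """Map each asset's predicted return to a discrete action class
    using pre-defined return thresholds (values here are illustrative):
    0 = sell, 1 = hold, 2 = buy."""
    return np.digitize(returns, thresholds)

# Example: three assets with predicted returns of -2%, 0% and +3%
actions = discretize_returns(np.array([-0.02, 0.0, 0.03]))  # -> [0, 1, 2]
```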