[D] How to feed variable-length text data with a temporal structure?
I am working on a project that aims to predict stock returns from tweet data. I have been experimenting with this online dataset: https://github.com/yumoxu/stocknet-dataset. My aim is to feed in, for example, one day's tweets for 30 stocks (the number of tweets varies from day to day) and output a vector of return predictions for those 30 stocks. Since each tweet has a different length, I was thinking of implementing an RNN that reads the words sequentially. It seems to me that such a model would capture the “temporal structure” within the text, but I am not sure how to capture the time-series aspect of the data across days.
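To make the "read each tweet word by word" idea concrete, here is a minimal, dependency-free sketch of what I have in mind. Everything in it is an assumption for illustration: the function names (`make_rnn`, `encode_tweet`, `encode_day`), the toy vocabulary and hidden sizes, the made-up token ids, and especially the choice to average the tweet vectors within a day. The weights are random and untrained, and this is not the method used by the StockNet dataset authors.

```python
import math
import random


def make_rnn(vocab_size, hidden_size, seed=0):
    """Randomly initialised parameters for a tiny Elman RNN (untrained)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "embed": mat(vocab_size, hidden_size),   # one vector per token id
        "w_in": mat(hidden_size, hidden_size),   # input -> hidden weights
        "w_rec": mat(hidden_size, hidden_size),  # hidden -> hidden weights
    }


def encode_tweet(params, token_ids):
    """Read one tweet token by token and return the final hidden state."""
    hidden_size = len(params["w_in"])
    h = [0.0] * hidden_size
    for tok in token_ids:  # loop length follows the tweet length, no padding
        x = params["embed"][tok]
        h = [
            math.tanh(
                sum(params["w_in"][i][j] * x[j] for j in range(hidden_size))
                + sum(params["w_rec"][i][j] * h[j] for j in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
    return h


def encode_day(params, tweets):
    """Mean-pool the tweet vectors for one stock on one day (an assumption)."""
    hidden_size = len(params["w_in"])
    if not tweets:  # a day with no tweets maps to a zero vector
        return [0.0] * hidden_size
    vecs = [encode_tweet(params, t) for t in tweets]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(hidden_size)]


# Hypothetical token ids for three tweets of different lengths on one day.
params = make_rnn(vocab_size=50, hidden_size=4)
day_tweets = [[3, 17, 8], [5, 2, 9, 41, 7, 1], [12]]
day_vector = encode_day(params, day_tweets)
print(len(day_vector))  # 4
```

With this shape, each stock gets one fixed-size vector per day no matter how many tweets arrived or how long they were. What I do not know is how to model the sequence of these daily vectors over time.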
My questions can be summarised as follows:
(1) How can I incorporate both the time-series structure and the temporal structure of the text in the data I have?
(2) Or am I modelling the problem wrongly?
Edit: I have heard of encoder-decoder structures, of more sophisticated models like BERT, and of using <EOS> tags to tell the model where each sentence (tweet) ends. I think that might be something I should look into, but it seemed a little complicated when I was reading the BERT paper. I am rather new to this area, so I would prefer something beginner-friendly to start with. Thanks!
Any ideas or references will be greatly appreciated. Cheers!
submitted by /u/blueclover