[D] Methods to handle streaming/real-time data storage, wrangling and prediction?
Say data is being streamed into Python (Kafka, Kinesis, etc.) every 10 seconds, and I would like to wrangle it and predict on it. What is the best way to store this streaming data for that purpose? In the past I have used online learning methods for this; I am curious how to do it with a batch learning method.
I was thinking of iteratively populating a DataFrame with the incoming data until the stream stops, preprocessing the entire DataFrame, predicting, and then clearing/deleting the DataFrame (rough sketch at the end of this post). One caveat I can think of with this method is the scenario where the preprocessing and prediction take longer than 10 seconds.
What are some ways to handle this?
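
For concreteness, here is a minimal sketch of the loop I have in mind, assuming one wrangle/predict pass per 10-second window. I'm using kafka-python here (Kinesis etc. would look similar), and the topic name, broker address, and the `preprocess`/`DummyModel` pieces are just placeholders for whatever the real pipeline would use:

```python
import json
import time

import pandas as pd
from kafka import KafkaConsumer  # pip install kafka-python

# --- placeholder wrangling/model, stand-ins for whatever you actually use ---
def preprocess(df: pd.DataFrame):
    return df.select_dtypes("number").to_numpy()

class DummyModel:
    def predict(self, X):
        return X.sum(axis=1)  # dummy "prediction"

model = DummyModel()

# Placeholder topic name and broker address.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

WINDOW_SECONDS = 10
buffer = []                      # raw records for the current window
window_start = time.monotonic()

while True:
    # poll() returns {TopicPartition: [ConsumerRecord, ...]}; a short timeout
    # keeps the loop ticking even when nothing new arrives.
    for partition_records in consumer.poll(timeout_ms=1000).values():
        buffer.extend(record.value for record in partition_records)

    # Every ~10 seconds: build the DataFrame in one go, wrangle, predict, clear.
    if buffer and time.monotonic() - window_start >= WINDOW_SECONDS:
        df = pd.DataFrame(buffer)
        X = preprocess(df)
        predictions = model.predict(X)
        print(predictions)           # stand-in for the real downstream sink
        buffer.clear()
        window_start = time.monotonic()
```

Appending records to a plain list and building the DataFrame once per window avoids growing a DataFrame row by row, but as noted above, if the preprocess/predict step takes longer than 10 seconds this loop falls behind the stream.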