[D] Choosing an algorithm that handles a variable number of input records per label and yields a generalized regression model
Disclaimer: I’m not totally new to Machine Learning. I have seen some projects up close and I’m generally familiar with how it all works, but I have never applied it myself.
Anyway, I have a Machine Learning problem and because of the properties of the problem I’m not sure what algorithm to use.
I’m trying to predict the travel time of an object.
Each trip is a timestamped series of observations (infrequent, because of detection and sensor placement) and consists of one or more data points (records in my dataset). Each record contains geospatial information about the object.
Basically, the dataset consists of a large number of records, and each record belongs to exactly one trip. Each trip is marked with a unique ID. All records are important since they contain information such as geospatial data.
Connecting all the data points of each trip ID will generate a “path”. However, some trips in the dataset only have one data point but with a known start and end point. This means a path cannot be constructed from those records.
I have created labels by checking that certain conditions are met. Each label consists of at least two records. This means that a lot of the information in between does not have to be present: as long as the object was observed close to the start and end points, the trip can be used as a label and the total travel time can be estimated. Conversely, a trip with many data points but none close to the start and end locations of the object cannot be used as a label.
I have added additional features like the distance to the previous point in km and total trip distance in km (float values). Most features describe the relationship to the previous data point.
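To illustrate the kind of features described above, here is a minimal sketch using only the Python standard library. The record keys (`trip_id`, `timestamp`, `lat`, `lon`) and the output names (`dist_prev_km`, `trip_dist_km`) are hypothetical and not the actual schema of my dataset. It also assumes the haversine great-circle distance is a good enough approximation, and it uses the sum of observed steps as a stand-in for the total trip distance.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def add_distance_features(records):
    """Return copies of the records with two extra keys.

    `records` is a list of dicts with the hypothetical keys
    `trip_id`, `timestamp`, `lat`, `lon`.
    """
    # Group records by trip so distances never cross trip boundaries.
    by_trip = defaultdict(list)
    for rec in records:
        by_trip[rec["trip_id"]].append(rec)

    out = []
    for recs in by_trip.values():
        recs = sorted(recs, key=lambda r: r["timestamp"])
        # The first record of a trip has no predecessor, so its step is 0.
        steps = [0.0]
        for prev, cur in zip(recs, recs[1:]):
            steps.append(
                haversine_km(prev["lat"], prev["lon"], cur["lat"], cur["lon"])
            )
        total = sum(steps)  # assumption: observed path length as trip distance
        for rec, step in zip(recs, steps):
            out.append({**rec, "dist_prev_km": step, "trip_dist_km": total})
    return out
```

A trip with a single record simply gets `dist_prev_km = 0.0` and `trip_dist_km = 0.0` here, which is one reason the known start-to-end distance is the more useful feature for those trips.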
For prediction: I will at least know the total distance the object will travel, and I will have at least one data point with geospatial information between the start location and the end location.
For the training labels: I will have at least two data points with geospatial information close to the start and end locations, together with the total travel time.
Ideally I want to predict the travel time for each trip with one or more data points (records). I’m not trying to predict or construct the path taken. I just want an estimate of the travel time.
The problem is that I’m not sure what algorithm to use, since:
- I have multiple records that belong to one trip. So far I haven’t really come across a way to deal with a variable input shape. Most advice says to reduce the input to a single record/row (I would call it “flattening”) by binning values or one-hot encoding values.
- “Flattening the features” is not really possible. The added depth of the additionally computed features would be lost by binning, since they describe how each data point is connected to the previous one (distance).
- A trip can be represented by a single record in the database or by many records. The greater the record count, the better. Some trips consist of 50 records, which allows for a better estimate.
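To make the “flattening” in the list above concrete, this is a minimal sketch of what collapsing one trip into a single fixed-length row could look like. The function name `flatten_trip` and the record keys `timestamp` (in seconds) and `dist_prev_km` are hypothetical, and the chosen summary statistics are only an example.

```python
def flatten_trip(records):
    """Collapse one trip (a list of record dicts) into a fixed-length row.

    Returns [record count, total distance, largest step, mean step,
    observed time span], regardless of how many records the trip has.
    """
    recs = sorted(records, key=lambda r: r["timestamp"])
    # The first record has no predecessor, so it contributes no step.
    steps = [r["dist_prev_km"] for r in recs[1:]]
    return [
        float(len(recs)),
        float(sum(steps)),
        max(steps) if steps else 0.0,
        sum(steps) / len(steps) if steps else 0.0,
        float(recs[-1]["timestamp"] - recs[0]["timestamp"]),
    ]
```

Every trip ends up with the same five values, so any standard regressor can consume it, but the order of the steps and where along the trip they occurred is gone, which is exactly the loss I am worried about.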
I was thinking of a Recurrent Neural Network since they can deal with time-series sequences, but I’m questioning whether it can be applied to this problem.
Can I train a Recurrent Neural Network on a lot of groups (trips in this case) and generate a generalized model that I can use to predict other groups? Or can it only make predictions within each group? I have a lot of trips (groups), but the available information per trip (group) is very limited in most cases. I therefore want to develop a generalized model that will work for all groups.
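For context, sequence models are commonly fed batches of independent sequences, with the different lengths handled by padding every sequence to a common length and masking the padded steps. Below is a minimal sketch of that batching step only (no model), in plain Python. The function name `pad_trips` and the nested-list layout are assumptions for illustration.

```python
def pad_trips(trips, pad_value=0.0):
    """Pad variable-length trips to a common length.

    `trips` is a list of trips; each trip is a non-empty list of
    per-record feature lists, all with the same number of features.
    Returns (padded, mask, lengths), where mask is 1 for real records
    and 0 for padding.
    """
    max_len = max(len(trip) for trip in trips)
    n_feat = len(trips[0][0])
    padded, mask = [], []
    for trip in trips:
        n_pad = max_len - len(trip)
        # Copy the real steps, then append all-padding steps.
        padded.append(
            [list(step) for step in trip]
            + [[pad_value] * n_feat for _ in range(n_pad)]
        )
        mask.append([1] * len(trip) + [0] * n_pad)
    lengths = [len(trip) for trip in trips]
    return padded, mask, lengths
```

With this layout a trip with 1 record and a trip with 50 records sit in the same batch, and each trip is one training example with its total travel time as the target. In PyTorch, `torch.nn.utils.rnn.pad_sequence` and `pack_padded_sequence` cover the same step.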