[Project] Help with a quantitative dataset for an ML problem
I’m currently working on an ML project that deals (mostly) with quantitative figures, i.e. a ‘classical’ ML problem where the underlying data comes from a .csv file.
The problem setting is the following: I want to predict whether workers will miss their shifts. For this I have a shift plan available, among other datasets. The plan tracks the date, the shift duration and other fairly obvious data, and of course whether the worker was present for the respective shift or not (= target variable). I already did some EDA on the shift plan and added some features for the classifier that refer to the previous shift. For example, for the shift in question I added a feature that records the number of consecutive shifts for which the respective worker was present, and another for the number of consecutive shifts for which the worker was absent. To make sure these features only use information known before the shift, I simply shifted the calculated column down by one row. The following example might help:
I feel that this is a valid approach to incorporate historic information of the shifts/worker for the shift that needs to be predicted. If not, please tell me what I missed at this point and what approach I should rather consider.
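To make the shifted streak features concrete, here is a minimal pandas sketch of how they could be computed. The column names (`worker_id`, `date`, `present`), the helper `prior_streak` and the toy values are assumptions for illustration only, not my real data. The important part is that the shift is done per worker, so the first shift of one worker never inherits the streak of another worker.

```python
import pandas as pd

# Hypothetical toy shift plan; column names and values are made up for illustration.
shifts = pd.DataFrame({
    "worker_id": [1, 1, 1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime([
        "2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04", "2019-01-07",
        "2019-01-01", "2019-01-02", "2019-01-03",
    ]),
    "present": [1, 1, 0, 0, 1, 0, 1, 1],
}).sort_values(["worker_id", "date"]).reset_index(drop=True)


def prior_streak(present: pd.Series, value: int) -> pd.Series:
    """Length of the run of `value` that ends at the previous shift."""
    hit = present.eq(value)
    # Every row where `hit` is False starts a new block, so a running
    # sum of `hit` inside each block counts the current run length.
    block = (~hit).cumsum()
    run = hit.astype(int).groupby(block).cumsum()
    # Shift down by one so the feature only uses shifts before the current one.
    return run.shift(1, fill_value=0)


by_worker = shifts.groupby("worker_id")["present"]
shifts["prior_present_streak"] = by_worker.transform(lambda s: prior_streak(s, 1))
shifts["prior_absent_streak"] = by_worker.transform(lambda s: prior_streak(s, 0))

print(shifts)
```

Both columns are 0 for the first shift of each worker, because there is no history yet at that point.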
The actual problem comes with the other datasets that I want to join with the shift plan. For example, I have a dataset that tracks the assignments each worker completed per shift. Again, some quantitative figures are recorded per assignment. Along the same lines as for the shift plan, I want to add historic assignment information about the worker to the shift that needs to be predicted. Therefore, I tried to group the assignments per worker and shift and calculate some aggregate figures (e.g. min, max and mean values). However, at this point I ran into the problem that there can be several consecutive shifts that were not attended, and for those there is no data in the assignment dataset. Therefore I don’t know how to incorporate the historic data. Consider the following example:
If I were trying to predict the shift on 2019-01-03, I could shift the column ‘Assignment_Measure’ down by one row, as above. However, for the shift on 2019-01-07 I would still get no information. A naive solution would be to simply copy the values into the absent shifts before it, but I think this would be problematic for the model, as it would introduce many very similar rows.
Does anyone have an idea how to solve this problem, or how it could be tackled?
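One option I can imagine, sketched here under assumptions, is to carry forward the aggregate from the last attended shift and to add a second column that says how stale that value is. That way the copied rows are no longer identical from the model's point of view. Only the column name `Assignment_Measure` and the two dates mentioned above come from my data; the worker IDs, the other dates, the measure values and the new column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical per-shift table after aggregating the assignments;
# NaN means the shift was not attended, so no assignment data exists.
plan = pd.DataFrame({
    "worker_id": [1, 1, 1, 1, 2, 2],
    "date": pd.to_datetime([
        "2019-01-02", "2019-01-03", "2019-01-04", "2019-01-07",
        "2019-01-02", "2019-01-03",
    ]),
    "Assignment_Measure": [4.0, np.nan, np.nan, 5.0, np.nan, 7.0],
}).sort_values(["worker_id", "date"]).reset_index(drop=True)

worker = plan["worker_id"]

# Value from the immediately preceding shift of the same worker.
prev = plan.groupby("worker_id")["Assignment_Measure"].shift(1)

# Carry the most recent observed value forward across absent shifts.
plan["last_measure"] = prev.groupby(worker).ffill()

# Flag rows that have any assignment history at all.
plan["has_history"] = plan["last_measure"].notna().astype(int)

# Staleness: how many shifts without data lie between the last observed
# value and the shift to predict (0 = the previous shift had data).
seen = prev.notna()
block = seen.astype(int).groupby(worker).cumsum()
stale = (~seen).astype(int).groupby([worker, block]).cumsum()
plan["shifts_since_measure"] = stale.where(block > 0)  # NaN when no history yet

print(plan)
```

In this toy table the row for 2019-01-07 gets `last_measure` 4.0 with `shifts_since_measure` 2, while the row for 2019-01-03 gets the same 4.0 with staleness 0. Whether to leave the remaining NaN values as they are or to impute them would depend on the classifier, since not every model accepts missing values.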
Thanks a lot for your help and input!