[P] Filtering data in a PySpark Pipeline without losing all the data?
I have a project where I’m feeding a dataframe into a PipelineModel with two pretrained models inside. The flow goes something like this:
Input DF -> Preprocessing Transformers -> Model1 -> Model2 -> Output DF
The thing is, Model1 and Model2 are meant to predict on different subsets of the rows (e.g. Male vs. Female). I tried using SQLTransformer to filter the rows for each type, but the filter permanently drops every row that doesn't match, so by the time Model1 has run, the rows I need to predict on in Model2 are already gone.
Is there a way to filter data to be fed into Model1, then filter data to be fed into Model2, and then concatenate the dataframes to be returned?
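To make it concrete, this is roughly what I'd want to happen if I did it by hand outside the Pipeline. It's only a minimal sketch: `predict_by_group`, the `gender` column, and the variable names in the usage note are made up for illustration, and it assumes both models output the same columns so the pieces can be unioned back together.

```python
from functools import reduce


def predict_by_group(df, routes):
    """Apply each model only to its own rows, then recombine the results.

    df     -- a DataFrame that has already been through the preprocessing stages
    routes -- list of (condition, model) pairs; condition is a SQL filter
              string and model is a fitted Transformer (hypothetical inputs)
    """
    # Filter the *original* DataFrame once per model, so no model's filter
    # can discard rows that another model still needs.
    parts = [model.transform(df.filter(condition)) for condition, model in routes]

    # unionByName matches columns by name, so this assumes every model
    # produces the same set of output columns.
    return reduce(lambda left, right: left.unionByName(right), parts)
```

Usage would be something like `predict_by_group(preprocessed_df, [("gender = 'Male'", model1), ("gender = 'Female'", model2)])`. Rows that match none of the conditions are left out of the result, and the row order of the output isn't guaranteed.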
Please let me know if I can clarify anything!
submitted by /u/Octosaurus