[P] Filtering data in a PySpark Pipeline without losing all the data?

I have a project where I’m feeding a dataframe into a PipelineModel with two pretrained models inside. The flow goes something like this:

Input DF -> Preprocessing Transformers -> Model1 -> Model2 -> Output DF

The thing is, Model1 and Model2 predict on different subsets of the rows (e.g. Male vs Female). I tried using the SQLTransformer to filter the data for each type, but the filter in front of Model1 drops every row that doesn't match, so the output of Model1 no longer contains the rows I need to predict on in Model2.

Is there a way to filter data to be fed into Model1, then filter data to be fed into Model2, and then concatenate the dataframes to be returned?

Please let me know if I can clarify anything!

submitted by /u/Octosaurus
