[P] Filtering data in a PySpark Pipeline without losing all the data?

I have a project where I’m feeding a dataframe into a PipelineModel with two pretrained models inside. The flow goes something like this:

Input DF -> Preprocessing Transformers -> Model1 -> Model2 -> Output DF

The thing is, Model1 and Model2 are each meant to predict on a different subset of the rows (e.g. Male vs Female). I tried using a SQLTransformer to filter the data for each type, but the filter in front of Model1 drops every row that doesn't match, so the output of Model1 no longer contains any of the data I need to predict on in Model2.

Is there a way to filter the data fed into Model1, separately filter the data fed into Model2, and then concatenate the two resulting DataFrames into the one that gets returned?
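To make it concrete, this is roughly the behaviour I'm after, as a minimal sketch rather than a confirmed solution. The function name `route_and_union` and the filter strings are hypothetical, and it assumes both models add the same output columns so the two halves can be unioned by name. It only relies on `DataFrame.filter` (which accepts a SQL expression string), `.transform` on each fitted model, and `DataFrame.unionByName`.

```python
def route_and_union(df, model1, model2, cond1, cond2):
    """Score two disjoint subsets of `df` with different models, then recombine.

    Assumptions (all hypothetical, not from my actual project):
      - `df` is a pyspark.sql.DataFrame coming out of the preprocessing stages
      - `model1` / `model2` are fitted models exposing .transform(df)
      - `cond1` / `cond2` are SQL filter strings, e.g. "sex = 'Male'"
    """
    # Each filter returns a new DataFrame and leaves `df` untouched, so the
    # rows meant for model2 are still there after model1 has run.
    part1 = model1.transform(df.filter(cond1))
    part2 = model2.transform(df.filter(cond2))

    # unionByName matches columns by name, so both models need to emit the
    # same output columns (for example the same prediction column name).
    return part1.unionByName(part2)
```

Presumably something like this could be called from the `_transform` method of a custom `pyspark.ml.Transformer` subclass so that it still sits inside the PipelineModel as a single stage in place of Model1 -> Model2, but I haven't confirmed that. Rows matching neither condition would be dropped from the output.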

Please let me know if I can clarify anything!

submitted by /u/Octosaurus