[P] Spark ML – Saving PySpark custom transformers in a pipeline model
I created a Spark pipeline where the first stage is a custom transformer that only filters the data on a particular value of one column:
    from pyspark import keyword_only
    from pyspark.ml import Transformer

    class getPOST(Transformer):
        @keyword_only
        def __init__(self, inputCol=None, outputCol=None):
            super(getPOST, self).__init__()
            kwargs = self._input_kwargs
            self.setParams(**kwargs)

        @keyword_only
        def setParams(self, inputCol=None, outputCol=None):
            kwargs = self._input_kwargs
            return self._set(**kwargs)

        def _transform(self, dataset):
            # keep only the rows whose 'method' column is 'POST'
            return dataset.filter(dataset.method == 'POST')
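For context, the pipeline is assembled and fit roughly like this — the StringIndexer stage, the DataFrame name df, and the save path are just placeholders to show where the custom transformer sits, not my actual stages:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer

    # getPOST comes first; the StringIndexer is only a stand-in for
    # whatever stages actually follow it in the real pipeline.
    pipeline = Pipeline(stages=[
        getPOST(),
        StringIndexer(inputCol="method", outputCol="method_idx"),
    ])

    model = pipeline.fit(df)           # fitting works fine
    model.save("/tmp/post_pipeline")   # this is where the error below is raised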
The model works great and I'm getting good performance, but when I go to save the model, I'm met with:
    ValueError: ('Pipeline write will fail on this pipeline because stage %s of type %s is not MLWritable', 'getPOST_23cb579f79db', <class '__main__.getPOST'>)
I’ve been reading up and I don’t think a Transformer is the best fit here, since I’m not appending any columns to the dataset and I’m not touching any values or parameters that need to be declared, like in the example I found in this link. I can’t find other examples that let you filter out data inside a Spark ML pipeline.
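From what I've read, mixing DefaultParamsReadable and DefaultParamsWritable into the transformer might be enough to make the stage writable, though I'm not sure it's the right approach for a transformer that only filters rows — a rough sketch of what I'm considering:

    from pyspark.ml import Transformer
    from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

    # Same filter-only transformer, with the default params read/write
    # mixins added so Pipeline.save() stops rejecting the stage.
    # I dropped the unused inputCol/outputCol kwargs since the filter
    # never references them -- that part is just my guess at a minimal version.
    class getPOST(Transformer, DefaultParamsReadable, DefaultParamsWritable):
        def _transform(self, dataset):
            # keep only the rows whose 'method' column is 'POST'
            return dataset.filter(dataset.method == 'POST')

My understanding is that the class still has to be importable at load time (defined in a module on the Python path rather than only in my notebook) for PipelineModel.load() to find it, but I haven't verified that.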
This is the last stage of a project I’m working on and I’d greatly appreciate any push in the right direction. Thank you for taking the time to read this!
submitted by /u/Octosaurus