
[P] Spark ML – Saving PySpark custom transformers in a pipeline model

I created a Spark pipeline whose first stage is a custom transformer that just filters rows on a particular value of one column:

from pyspark import keyword_only
from pyspark.ml import Transformer

class getPOST(Transformer):
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(getPOST, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def _transform(self, dataset):
        # Keep only the rows whose 'method' column equals 'POST'
        return dataset.filter(dataset.method == 'POST')

The model works great and I’m getting good performance, but when I go to save the pipeline, I’m met with:

ValueError: ('Pipeline write will fail on this pipeline because stage %s of type %s is not MLWritable', 'getPOST_23cb579f79db', <class '__main__.getPOST'>) 
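From what I can tell, the error just means that the custom stage doesn’t implement MLWritable, so Pipeline.save() refuses to serialize it. Here’s a minimal sketch of what I understand the usual fix to look like, assuming Spark 2.3+ where the DefaultParamsReadable and DefaultParamsWritable mixins exist in pyspark.ml.util:

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

# Mixing in the default read/write helpers gives the stage a write()
# method, so saving the Pipeline/PipelineModel no longer rejects it.
class getPOST(Transformer, DefaultParamsReadable, DefaultParamsWritable):
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(getPOST, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def _transform(self, dataset):
        return dataset.filter(dataset.method == 'POST')

One caveat I’ve read about: PipelineModel.load() imports each stage’s class by its saved module path, so a class defined in __main__ (like mine) has to be importable under that same name when you reload.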

I’ve been reading up, and I don’t think a transformer is the most applicable tool here, since I’m not appending any columns onto the dataset and not touching any values or parameters that need to be declared, such as I found in this link. I can’t find other examples that allow you to filter out rows in a Spark ML pipeline.
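The closest thing I have come across is SQLTransformer, which can express a row filter and is MLWritable out of the box since it ships with Spark. A sketch, assuming the column is literally named method as in my snippet:

from pyspark.ml.feature import SQLTransformer

# __THIS__ is SQLTransformer's placeholder for the incoming DataFrame;
# this stage keeps only the rows where method == 'POST'.
post_filter = SQLTransformer(
    statement="SELECT * FROM __THIS__ WHERE method = 'POST'")

If that’s the idiomatic route, swapping it in for my custom class should let the pipeline save without the error above.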

This is the last stage of a project I’m working on and I’d greatly appreciate any push in the right direction. Thank you for taking the time to read this!

submitted by /u/Octosaurus