[P] Spark ML – Saving PySpark custom transformers in a pipeline model
I created a Spark pipeline where the first stage is a custom transformer that only filters the data on a particular value of one column:
    from pyspark import keyword_only
    from pyspark.ml import Transformer

    class getPOST(Transformer):
        @keyword_only
        def __init__(self, inputCol=None, outputCol=None):
            super(getPOST, self).__init__()
            kwargs = self._input_kwargs
            self.setParams(**kwargs)

        @keyword_only
        def setParams(self, inputCol=None, outputCol=None):
            kwargs = self._input_kwargs
            return self._set(**kwargs)

        def _transform(self, dataset):
            # keep only the rows whose 'method' column is 'POST'
            return dataset.filter(dataset.method == 'POST')
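For context, the pipeline is assembled and fit roughly like this — the StringIndexer stage, the DataFrame name df, and the save path are just placeholders to show where the custom transformer sits, not my actual stages:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer

    # getPOST comes first; the StringIndexer is only a stand-in for
    # whatever stages actually follow it in the real pipeline.
    pipeline = Pipeline(stages=[
        getPOST(),
        StringIndexer(inputCol="method", outputCol="method_idx"),
    ])

    model = pipeline.fit(df)           # fitting works fine
    model.save("/tmp/post_pipeline")   # this is where the error below is raised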
The model works great and I'm getting good performance, but when I go to save the model, I'm met with:
    ValueError: ('Pipeline write will fail on this pipeline because stage %s of type %s is not MLWritable', 'getPOST_23cb579f79db', <class '__main__.getPOST'>)
I’ve been reading up and I don’t think a Transformer is the best fit here, since I’m not appending any columns to the dataset and I’m not touching any values or parameters that need to be declared, like in the example I found in this link. I can’t find other examples that let you filter out data inside a Spark ML pipeline.
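From what I've read, mixing DefaultParamsReadable and DefaultParamsWritable into the transformer might be enough to make the stage writable, though I'm not sure it's the right approach for a transformer that only filters rows — a rough sketch of what I'm considering:

    from pyspark.ml import Transformer
    from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

    # Same filter-only transformer, with the default params read/write
    # mixins added so Pipeline.save() stops rejecting the stage.
    # I dropped the unused inputCol/outputCol kwargs since the filter
    # never references them -- that part is just my guess at a minimal version.
    class getPOST(Transformer, DefaultParamsReadable, DefaultParamsWritable):
        def _transform(self, dataset):
            # keep only the rows whose 'method' column is 'POST'
            return dataset.filter(dataset.method == 'POST')

My understanding is that the class still has to be importable at load time (defined in a module on the Python path rather than only in my notebook) for PipelineModel.load() to find it, but I haven't verified that.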
This is the last stage of a project I’m working on and I’d greatly appreciate any push in the right direction. Thank you for taking the time to read this!
submitted by /u/Octosaurus