[P] Gobbli: A Python Framework for Text Classification Projects
At my day job, we do a lot of text classification projects with small to medium-sized datasets. Recent advances in transfer learning for NLP have moved these types of projects from impossible to feasible, especially for the batch classification tasks we frequently see on survey projects with free-text responses. Models like BERT are well documented for research use, but in trying to use them we found ourselves spending a lot of time extending them to the non-benchmark applications and datasets we were curious about. Given these issues, we built a framework for text classification projects that aims to make it easier to apply transfer learning and other models consistently.
For a little more context, we started trying out BERT last year, and new models kept being released at a rapid pace. Every new model came with a new API to learn. pytorch-transformers from HuggingFace helped a lot with this standardization issue, so we also took a look at what happens before a model is built (data processing and augmentation) and afterwards (model evaluation), and built supporting tools around those problems as well.
In addition, since most models require GPUs, we were spending a lot of time configuring environments, code, and data in tandem with Docker, which gets messy. Because of this, we’ve abstracted most of that orchestration away so that almost everything is Python code.
Details on the library are below. We’ve battle-tested it on a few projects, and if you’re doing text classification, we’d love for you to kick the tires and give us feedback.