[P] For NLP researchers, Easy-to-use Text Preprocessing Package, PreNLP

Do simple text preprocessing (a.k.a. the dirty work) with the PreNLP package!

I’m working in NLP, and I’m implementing a package that handles the repetitive but necessary preprocessing work for NLP. Here are some examples of preprocessing text for downstream NLP tasks.

  • Frequently used normalization functions for text pre-processing are provided in prenlp. General use cases are as follows:

```python
>>> from prenlp.data import Normalizer
>>> normalizer = Normalizer()
>>> normalizer.normalize('Visit this link for more details: https://github.com/')
'Visit this link for more details: [URL]'
>>> normalizer.normalize('Use HTML with the desired attributes: <img src="cat.jpg" height="100" />')
'Use HTML with the desired attributes: [TAG]'
>>> normalizer.normalize('Hello 🤩, I love you 💓 !')
'Hello [EMOJI], I love you [EMOJI] !'
>>> normalizer.normalize('Contact me at lyeoni.g@gmail.com')
'Contact me at [EMAIL]'
>>> normalizer.normalize('Call +82 10-1234-5678')
'Call [TEL]'
```
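To give a feel for how this kind of token-replacement normalization can work under the hood, here is a minimal standalone sketch using regular expressions. This is not prenlp's actual implementation; the patterns and the `normalize` helper are illustrative assumptions, covering only a few of the cases above.

```python
import re

# Hypothetical patterns, applied in order. Real-world URL/email/tag
# matching is more involved; these are deliberately simple.
PATTERNS = [
    (re.compile(r'https?://\S+'), '[URL]'),       # URLs
    (re.compile(r'<[^>]+>'), '[TAG]'),            # HTML tags
    (re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+'), '[EMAIL]'),  # email addresses
]

def normalize(text: str) -> str:
    """Replace each matched span with its placeholder token."""
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text

print(normalize('Visit this link for more details: https://github.com/'))
# -> Visit this link for more details: [URL]
```

A single ordered pass like this keeps the behavior predictable: earlier patterns (e.g. URLs) consume their text before later, looser patterns get a chance to match.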
  • Quick tour of text classification: the following example trains a fastText classification model on IMDB in only 16 lines of code (excluding blank lines and comments).

```python
import fasttext
import prenlp
from prenlp.data import Normalizer
from prenlp.tokenizer import NLTKMosesTokenizer

# Data Preparation
imdb_train, imdb_test = prenlp.data.IMDB()

# Preprocessing
tokenizer = NLTKMosesTokenizer()
normalizer = Normalizer(url_repl=' ', tag_repl=' ', emoji_repl=' ',
                        email_repl=' ', tel_repl=' ')
for dataset in [imdb_train, imdb_test]:
    for i, (text, label) in enumerate(dataset):
        dataset[i][0] = ' '.join(tokenizer(normalizer.normalize(text.strip())))  # both

prenlp.data.fasttext_transform(imdb_train, 'imdb.train')
prenlp.data.fasttext_transform(imdb_test, 'imdb.test')

# Train
model = fasttext.train_supervised(input='imdb.train', epoch=20)

# Evaluate
print(model.test('imdb.train'))
print(model.test('imdb.test'))

# Inference
print(model.predict(imdb_test[0][0]))
```
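For context on the `fasttext_transform` step above: fastText's supervised mode expects a plain-text file with one example per line, each prefixed by `__label__<label>`. The helper presumably writes that format; the sketch below builds such lines by hand for a tiny hypothetical dataset (the sample texts and `to_fasttext_lines` are my own, not part of prenlp).

```python
# Toy dataset in the same (text, label) shape as the IMDB examples above.
samples = [
    ('this movie was wonderful', 'pos'),
    ('a dull and lifeless film', 'neg'),
]

def to_fasttext_lines(dataset):
    """Render (text, label) pairs as fastText supervised-input lines."""
    return ['__label__{} {}'.format(label, text) for text, label in dataset]

for line in to_fasttext_lines(samples):
    print(line)
# __label__pos this movie was wonderful
# __label__neg a dull and lifeless film
```

Writing these lines to a file is all that is needed before calling `fasttext.train_supervised(input=...)` on it.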

For more details, see: https://github.com/lyeoni/prenlp

P.S. If there is anything you would like to see implemented, let me know and I’ll add it to this package.

submitted by /u/lyeoni