[P] For NLP researchers, Easy-to-use Text Preprocessing Package, PreNLP

Do simple text preprocessing (a.k.a. the dirty work) with the PreNLP package!

I work in NLP, and I'm building a package to handle the repetitive but necessary work that NLP tasks require. Here are some examples of preprocessing text for downstream NLP tasks.

  • Frequently used normalization functions for text pre-processing are provided in prenlp. General use cases are as follows:

    >>> from import Normalizer
    >>> normalizer = Normalizer()
    >>> normalizer.normalize('Visit this link for more details:')
    Visit this link for more details: [URL]
    >>> normalizer.normalize('Use HTML with the desired attributes: <img src="cat.jpg" height="100" />')
    Use HTML with the desired attributes: [TAG]
    >>> normalizer.normalize('Hello 🤩, I love you 💓 !')
    Hello [EMOJI], I love you [EMOJI] !
    >>> normalizer.normalize('Contact me at')
    Contact me at [EMAIL]
    >>> normalizer.normalize('Call +82 10-1234-5678')
    Call [TEL]

  • Quick tour of text classification: the following example trains a fastText classification model on IMDB. It takes only 16 lines of code (excluding blank lines and comments).

    import fasttext
    import prenlp
    from import Normalizer
    from prenlp.tokenizer import NLTKMosesTokenizer

    # Data Preparation
    imdb_train, imdb_test =

    # Preprocessing
    tokenizer = NLTKMosesTokenizer()
    normalizer = Normalizer(url_repl=' ', tag_repl=' ', emoji_repl=' ', email_repl=' ', tel_repl=' ')
    for dataset in [imdb_train, imdb_test]:
        for i, (text, label) in enumerate(dataset):
            dataset[i][0] = ' '.join(tokenizer(normalizer.normalize(text.strip())))
    # both, 'imdb.train'), 'imdb.test')

    # Train
    model = fasttext.train_supervised(input='imdb.train', epoch=20)

    # Evaluate
    print(model.test('imdb.train'))
    print(model.test('imdb.test'))

    # Inference
    print(model.predict(imdb_test[0][0]))
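
If you're curious what this kind of normalization boils down to, here is a rough standalone sketch using only the standard library. To be clear, this is not PreNLP's implementation: the regex patterns and the helper names `normalize` and `to_fasttext_line` are my own assumptions for illustration. Only the keyword names (`url_repl`, `tag_repl`, and so on) mirror the ones shown in the example above. The second helper relies on fastText's default convention of one example per line with the label prefixed by `__label__`.

```python
import re

# Hypothetical patterns for illustration only; NOT PreNLP's actual rules.
TAG_RE = re.compile(r'<[^>]+>')
URL_RE = re.compile(r'https?://\S+|www\.\S+')
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
TEL_RE = re.compile(r'\+?\d{1,3}[ -]\d{1,4}(?:[ -]\d{3,4}){1,2}')
# Covers common emoji blocks only, not every emoji in Unicode.
EMOJI_RE = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF]')


def normalize(text, url_repl='[URL]', tag_repl='[TAG]', emoji_repl='[EMOJI]',
              email_repl='[EMAIL]', tel_repl='[TEL]'):
    """Replace tags, URLs, emails, phone numbers and emoji with placeholders."""
    # Tags go first so that URLs inside attributes vanish with the tag.
    text = TAG_RE.sub(tag_repl, text)
    text = URL_RE.sub(url_repl, text)
    text = EMAIL_RE.sub(email_repl, text)
    text = TEL_RE.sub(tel_repl, text)
    text = EMOJI_RE.sub(emoji_repl, text)
    return text


def to_fasttext_line(label, text):
    """Format one example as a fastText supervised-training line."""
    # fastText reads one example per line, so collapse all whitespace
    # (including newlines) into single spaces.
    return '__label__{} {}'.format(label, ' '.join(text.split()))


if __name__ == '__main__':
    print(normalize('Call +82 10-1234-5678'))
    print(normalize('Hello \U0001F929, I love you \U0001F493 !'))
    print(to_fasttext_line('pos', normalize('Great movie! <br />')))
```

Real-world text has far more edge cases than these patterns cover (international phone formats, newer emoji, malformed HTML), which is the kind of dirty work a dedicated package saves you from.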

For more details, see:

P.S. I'd also like to know what you want implemented. Let me know on this issue and I'll add it to the package.

submitted by /u/lyeoni