[P] For NLP researchers, Easy-to-use Text Preprocessing Package, PreNLP

Do simple text preprocessing (a.k.a. the dirty work) with the PreNLP package!

I’m working in NLP, and I’m implementing a package that handles the repetitive but necessary preprocessing work for NLP. Here are some examples of preprocessing text for downstream NLP tasks.

  • Frequently used normalization functions for text pre-processing are provided in prenlp. General use cases are as follows:

```python
>>> from prenlp.data import Normalizer
>>> normalizer = Normalizer()
>>> normalizer.normalize('Visit this link for more details: https://github.com/')
'Visit this link for more details: [URL]'
>>> normalizer.normalize('Use HTML with the desired attributes: <img src="cat.jpg" height="100" />')
'Use HTML with the desired attributes: [TAG]'
>>> normalizer.normalize('Hello 🤩, I love you 💓 !')
'Hello [EMOJI], I love you [EMOJI] !'
>>> normalizer.normalize('Contact me at lyeoni.g@gmail.com')
'Contact me at [EMAIL]'
>>> normalizer.normalize('Call +82 10-1234-5678')
'Call [TEL]'
```
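To give a feel for how this kind of token-replacement normalization can work under the hood, here is a minimal standalone sketch using regular expressions. This is not prenlp's actual implementation; the patterns and the `normalize` helper are illustrative assumptions, covering only a few of the cases above.

```python
import re

# Hypothetical patterns, applied in order. Real-world URL/email/tag
# matching is more involved; these are deliberately simple.
PATTERNS = [
    (re.compile(r'https?://\S+'), '[URL]'),       # URLs
    (re.compile(r'<[^>]+>'), '[TAG]'),            # HTML tags
    (re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+'), '[EMAIL]'),  # email addresses
]

def normalize(text: str) -> str:
    """Replace each matched span with its placeholder token."""
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text

print(normalize('Visit this link for more details: https://github.com/'))
# -> Visit this link for more details: [URL]
```

A single ordered pass like this keeps the behavior predictable: earlier patterns (e.g. URLs) consume their text before later, looser patterns get a chance to match.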
  • Quick tour of text classification: the following example trains a fastText classification model on IMDB in only 16 lines of code (excluding blank lines and comments).

```python
import fasttext
import prenlp
from prenlp.data import Normalizer
from prenlp.tokenizer import NLTKMosesTokenizer

# Data Preparation
imdb_train, imdb_test = prenlp.data.IMDB()

# Preprocessing
tokenizer = NLTKMosesTokenizer()
normalizer = Normalizer(url_repl=' ', tag_repl=' ', emoji_repl=' ',
                        email_repl=' ', tel_repl=' ')
for dataset in [imdb_train, imdb_test]:
    for i, (text, label) in enumerate(dataset):
        dataset[i][0] = ' '.join(tokenizer(normalizer.normalize(text.strip())))  # both

prenlp.data.fasttext_transform(imdb_train, 'imdb.train')
prenlp.data.fasttext_transform(imdb_test, 'imdb.test')

# Train
model = fasttext.train_supervised(input='imdb.train', epoch=20)

# Evaluate
print(model.test('imdb.train'))
print(model.test('imdb.test'))

# Inference
print(model.predict(imdb_test[0][0]))
```
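For context on the `fasttext_transform` step above: fastText's supervised mode expects a plain-text file with one example per line, each prefixed by `__label__<label>`. The helper presumably writes that format; the sketch below builds such lines by hand for a tiny hypothetical dataset (the sample texts and `to_fasttext_lines` are my own, not part of prenlp).

```python
# Toy dataset in the same (text, label) shape as the IMDB examples above.
samples = [
    ('this movie was wonderful', 'pos'),
    ('a dull and lifeless film', 'neg'),
]

def to_fasttext_lines(dataset):
    """Render (text, label) pairs as fastText supervised-input lines."""
    return ['__label__{} {}'.format(label, text) for text, label in dataset]

for line in to_fasttext_lines(samples):
    print(line)
# __label__pos this movie was wonderful
# __label__neg a dull and lifeless film
```

Writing these lines to a file is all that is needed before calling `fasttext.train_supervised(input=...)` on it.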

For more details, see: https://github.com/lyeoni/prenlp

P.S. If there is anything you would like to see implemented, let me know and I’ll add it to this package.

submitted by /u/lyeoni