Author: torontoai

[D] How to do trend analysis on textual data

Hi all, I am working on a dataset of customer reviews, and we would like to analyze how customer feedback changes over time. Sentiment is easy: I can build a sentiment classifier, get a sentiment score for each review, and do conventional time-series analysis on the scores. However, for something like topic modeling, is there any analysis that captures trends over time? Thanks for any advice.
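One simple way to carry the sentiment approach over to topics is to give each review a dominant topic and then track each topic's share of reviews per period, next to the per-period mean sentiment. A minimal sketch, assuming each review already has a month, a sentiment score and a topic label; the `reviews` records and topic names are made up for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (month, sentiment score, dominant topic).
reviews = [
    ("2019-01", 0.8, "delivery"),
    ("2019-01", 0.2, "pricing"),
    ("2019-02", 0.4, "delivery"),
    ("2019-02", 0.6, "delivery"),
    ("2019-02", 0.1, "pricing"),
    ("2019-02", 0.5, "support"),
]

def trends_by_period(records):
    """Return (mean sentiment per period, topic share per period)."""
    scores = defaultdict(list)
    topics = defaultdict(Counter)
    for period, score, topic in records:
        scores[period].append(score)
        topics[period][topic] += 1
    mean_sentiment = {p: sum(v) / len(v) for p, v in scores.items()}
    topic_share = {
        p: {t: n / sum(c.values()) for t, n in c.items()}
        for p, c in topics.items()
    }
    return mean_sentiment, topic_share

mean_sentiment, topic_share = trends_by_period(reviews)
for period in sorted(mean_sentiment):
    print(period, round(mean_sentiment[period], 2), topic_share[period])
```

Each topic's share then forms its own time series, so the same trend tools used on the sentiment score apply to it.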

submitted by /u/InventorWu

[D] Marketplace for machine learning?

The idea is a marketplace where researchers publish trained models and developers like me buy them (for example, .pb files in TensorFlow) and use them to solve clients’ problems. I’ve been searching on Google for a few days, but I can’t find any such marketplace, only free and open-source models.

Commercializing pre-trained models would create new jobs in the machine learning field and speed up the process of applying research results into practice.

For example, a researcher publishes a trained model for forecasting inventory demand, and a developer uses it to build software for eCommerce websites.

What do you think of the idea?

submitted by /u/ConVit

[R] Learn faster with smarter data labeling

Hey, here is some research we’ve done on active learning.

Labeling a big unlabeled dataset can become very expensive very fast, so it makes sense to invest time in techniques that optimize labeling. In the article below, we explore one of them, active learning. Active learning is a branch of machine learning that seeks to minimize the amount of data that needs to be labeled by strategically sampling observations that provide new insight into the problem. In particular, the algorithms try to select diverse and informative data for annotation (rather than random observations) from a pool of unlabeled data.

Excited to share:

https://towardsdatascience.com/learn-faster-with-smarter-data-labeling-15d0272614c4
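To illustrate what "informative" sampling can look like, here is a minimal sketch of uncertainty sampling, one common active learning strategy (not necessarily the one used in the article). It ranks unlabeled items by the entropy of the model's predicted class probabilities and ignores the diversity criterion. The function names and the `pool_probs` values are assumptions made for this example:

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pool_probs, k):
    """Return indices of the k pool items the model is least sure about."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical model outputs for four unlabeled items (two classes).
pool_probs = [
    [0.99, 0.01],  # confident prediction
    [0.55, 0.45],  # uncertain
    [0.80, 0.20],
    [0.50, 0.50],  # most uncertain
]

picked = select_for_labeling(pool_probs, k=2)
print(picked)
```

In a full loop, the picked items would be sent for annotation, added to the training set, and the model retrained before the next selection round.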

submitted by /u/michael_htx

[N] Deep Graph Library new release (v0.4)

This new release adds support for heterogeneous graphs. A heterogeneous graph is a graph whose nodes and edges are typed, which is very common in knowledge graphs, recommender systems and many other scenarios. Using this new feature, DGL brings many new models with efficient implementations. Here are some examples:

  • Graph Convolutional Matrix Completion [Code in MXNet]

    Dataset         RMSE (DGL)  RMSE (Official)  Speed (DGL)    Speed (Official)  Speed Comparison
    MovieLens-100K  0.9077      0.910            0.0246s/epoch  0.1008s/epoch     5x
    MovieLens-1M    0.8377      0.832            0.0695s/epoch  1.538s/epoch      22x
    MovieLens-10M   0.7875      0.777            0.6480s/epoch  OOM               -

One highlight is that DGL can train the GCMC model on the MovieLens-10M dataset on one GPU in only an hour. The previous implementation resorts to loading mini-batches on the fly from the CPU, which could take up to 24 hours.

Another highlight is that, using the heterograph interface, the new code can train an R-GCN on the full AM RDF graph (>5M edges) using one GPU, while the original implementation can only run on CPU and consumes 32GB of memory. The original takes 51.88s to train one epoch on CPU, while the new implementation takes only 0.1781s per epoch on a V100 GPU (291x faster!).

Apart from the heterogeneous graph support, a new package, DGL-KE, is released for training popular network embedding models. Currently, DGL-KE supports TransE, DistMult and ComplEx, and can train them very fast. It takes only 6.85 minutes to fully train a TransE model using one GPU on the FB15K graph. As a comparison, GraphVite takes 14 minutes using four GPUs. More models (RESCAL, RotatE, pRotatE, TransH, TransR, TransD, etc.) are under development and will be released in the future.

All the models and training scripts are available and can be run off the shelf. Check out this exciting new release (https://github.com/dmlc/dgl/releases/edit/v0.4.0) if you are working on network embedding or problems that can be formulated as heterogeneous graphs!
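To make the definition of a heterogeneous graph concrete, here is a minimal plain-Python sketch of a graph whose nodes and edges are typed. It does not use DGL's API; the `TypedGraph` class and the user/movie types are invented for illustration only:

```python
from collections import defaultdict

class TypedGraph:
    """Toy heterogeneous graph.

    Edges are grouped by the triple (src_type, edge_type, dst_type),
    so every node and every edge carries a type.
    """

    def __init__(self):
        self.edges = defaultdict(list)

    def add_edge(self, src_type, src, edge_type, dst_type, dst):
        self.edges[(src_type, edge_type, dst_type)].append((src, dst))

    def neighbors(self, src_type, src, edge_type, dst_type):
        """Destination nodes reachable from src via one edge of this type."""
        key = (src_type, edge_type, dst_type)
        return [d for s, d in self.edges.get(key, []) if s == src]

g = TypedGraph()
g.add_edge("user", 0, "rates", "movie", 10)
g.add_edge("user", 0, "rates", "movie", 11)
g.add_edge("user", 1, "follows", "user", 0)

print(g.neighbors("user", 0, "rates", "movie"))
```

Keeping one edge list per type triple is what lets a model such as R-GCN apply a different transformation to each relation.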

submitted by /u/jermainewang

[D] Architectural question: multiple input tensors, how best to combine to single output tensor?

Sorry if this question has been asked before. I’m making a classifier that takes multiple tensors (representing images) as input and produces a single output (a probability distribution). Each input has a few stacks of residual blocks on top, and I’m wondering how best to combine the outputs of these branches. As of now, I’m simply producing logits for each branch and doing an element-wise sum over them (with a coefficient for each branch, as one of the input tensors is much more important than the others). Is there a better approach? I’ve heard concatenation is another option, but I’m not sure which would be better. Should I create a loss expression for each branch and sum those loss expressions instead? Thanks for any clarity you can provide.
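The two fusion options mentioned above can be sketched in plain Python. This is only an illustration on lists of floats, not framework code; the function names, logits and branch weights are assumptions made for the example:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine_by_weighted_sum(branch_logits, weights):
    """Element-wise weighted sum of per-branch logits, then softmax."""
    n_classes = len(branch_logits[0])
    summed = [
        sum(w * logits[c] for w, logits in zip(weights, branch_logits))
        for c in range(n_classes)
    ]
    return softmax(summed)

def concatenate_features(branch_features):
    """Alternative: join branch features into one vector that a
    shared classification layer would then consume."""
    return [x for features in branch_features for x in features]

# Two hypothetical branches, three classes; the first branch matters more.
branch_logits = [[2.0, 0.0, -1.0], [0.5, 1.0, 0.0]]
probs = combine_by_weighted_sum(branch_logits, weights=[1.0, 0.25])
print(probs)
print(concatenate_features([[0.1, 0.2], [0.3]]))
```

With the weighted sum, branch importance is set by hand through the coefficients; with concatenation, the layer after the joined vector learns how much to rely on each branch.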

submitted by /u/lolololroflhax

[D] Predicting whether model made a mistake

In many cases, for example in policy networks, it would be useful to be able to assess whether user intervention is necessary (for example if there is no clear candidate intent/action for a given input). However, it is reasonable to assume that a model performing poorly is also bad at estimating whether it is performing poorly. Does there exist any research regarding this issue?
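As a simple baseline for the "no clear candidate" case, one can defer to the user whenever the top predicted probability is low or too close to the runner-up. A minimal sketch, assuming the model outputs a probability per action; the function name and both thresholds are hypothetical, and the caveat above still applies because a poorly performing model may produce poorly calibrated probabilities:

```python
def needs_intervention(probs, min_top=0.6, min_margin=0.2):
    """Flag a prediction for human review when no action clearly wins.

    probs: predicted probability per candidate intent/action.
    min_top: lowest acceptable probability for the best candidate.
    min_margin: lowest acceptable gap between the best two candidates.
    """
    ranked = sorted(probs, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    return top < min_top or (top - runner_up) < min_margin

print(needs_intervention([0.90, 0.05, 0.05]))
print(needs_intervention([0.40, 0.35, 0.25]))
```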

submitted by /u/_diffee_

[P] Tsanley: auto-finding subtle tensor shape errors in your deep learning code

When writing deep learning programs, keeping track of tensor shapes and dealing with subtle tensor shape errors (implicit broadcasts!!) gets quite frustrating.

We’ve been working on a tool tsanley (pronounced ‘stanley’) to enable finding subtle shape errors in your deep learning code quickly and cheaply. The key idea is to label tensor variables with their expected shapes (e.g., x : 'b,t,d' = ...) and let tsanley perform shape validity checks at runtime automatically. Works with small and big tensor programs.

repository: https://github.com/ofnote/tsanley

Quick example:

```python
def foo(x):
    x: 'b,t,d'                    # expected shape of x is (B, T, D)
    y: 'b,d' = x.mean(dim=0) * 2  # error!
    z: 'b,d' = x.mean(dim=1)      # ok
    return y, z
```

Function foo contains tensor variables labeled with their named shapes using a shorthand notation. It has a subtle shape error in the assignment to y: we expect the shape of y to be (B,D), however mean got rid of the first, and not the second, dimension. Your tensor library (PyTorch, TensorFlow, ...) won’t flag this as an error: instead, we will get a weird shape inconsistency error somewhere downstream.

tsanley finds such unexpected bugs quickly at runtime:

```
Update at line 37: actual shape of y = t,d
FAILED shape check at line 37
expected: (b:10, d:1024), actual: (100, 1024)
Update at line 38: actual shape of z = b,d
shape check succeeded at line 38
```

Writing these named shape annotations manually can also get tedious. tsanley can auto-annotate the tensor variables in your (or someone else’s) code, as long as the code is executable. This is especially useful when digging into or adapting existing code or a library for your project.

The tool builds upon the tsalib library, which introduced a shorthand notation for labeling tensor variables with their named shapes, irrespective of the backend tensor library used.
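The idea behind a named-shape check can be sketched without any tensor library. The helpers below are hypothetical and are not part of the tsanley or tsalib API; they work on plain shape tuples and reuse the dimension sizes from the example output (b=10, t=100, d=1024):

```python
def check_shape(actual, spec, dim_sizes):
    """Compare an actual shape tuple with a named spec such as 'b,d'.

    Returns (ok, expected_shape).
    """
    expected = tuple(dim_sizes[name] for name in spec.split(","))
    return actual == expected, expected

def mean_shape(shape, dim):
    """Shape that remains after averaging over one dimension."""
    return shape[:dim] + shape[dim + 1:]

dim_sizes = {"b": 10, "t": 100, "d": 1024}
x_shape = (10, 100, 1024)          # x: 'b,t,d'

y_shape = mean_shape(x_shape, 0)   # batch dimension dropped by mistake
z_shape = mean_shape(x_shape, 1)   # time dimension dropped, as intended

print(check_shape(y_shape, "b,d", dim_sizes))
print(check_shape(z_shape, "b,d", dim_sizes))
```

The check on y fails because its actual shape is (t, d) rather than (b, d), which is the kind of mismatch that would otherwise surface only further downstream.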

We would love feedback on tsanley and hope it is useful for your coding/debugging workflow.

submitted by /u/ekshaks

[D] NER – Data extraction for flight itineraries

I’m trying to use NER to extract data from flight itineraries rather than making regexes for each and every provider, unless they’re obviously similar.

My first question is what’s the current SOTA for tasks like this in seemingly unstructured HTML (although I am stripping the HTML and making it plain text first)? Secondly, how well would a technique like this ideally work for entities that look like YY57FLN5 of variable length?

I’ve found this paper, which uses hidden Markov models alongside NER for data extraction, but it seems quite old and doesn’t have all the details necessary to reproduce it.

Could anyone more familiar with NER and data extraction help steer me in the right direction?

So far I’m attempting to make a small dataset using the BRAT tool while I research the area in more detail.
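For code-like entities such as YY57FLN5, NER taggers are often given a word-shape feature, which describes the character classes in a token instead of its exact characters. A minimal sketch of that feature, offered as one possible input to a tagger and not as a full extraction pipeline; the function names are made up for this example:

```python
def word_shape(token):
    """Map each character to a class: X upper, x lower, d digit."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)  # keep punctuation as-is
    return "".join(out)

def collapsed_shape(token):
    """Collapse repeated classes so codes of different lengths
    can share the same shape."""
    collapsed = []
    for ch in word_shape(token):
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)

print(word_shape("YY57FLN5"))
print(collapsed_shape("YY57FLN5"))
```

The collapsed form addresses the variable-length concern: a longer or shorter code with the same letter and digit pattern yields the same feature value.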

submitted by /u/vectorizedboob

[Discussion] Exfiltrating copyright notices, news articles, and IRC conversations from the 774M parameter GPT-2 data set

Concerns around abuse of AI text generation have been widely discussed. In the original GPT-2 blog post from OpenAI, the team wrote:

Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a much smaller version of GPT-2 along with sampling code. We are not releasing the dataset, training code, or GPT-2 model weights.

These concerns about mass generation of plausible-looking text are valid. However, there have been fewer conversations around the GPT-2 data sets themselves. Google searches such as “GPT-2 privacy” and “GPT-2 copyright” consist substantially of spurious results. Believing that these topics are poorly explored, and need further exploration, I relate some concerns here.

Inspired by this delightful post about TalkTalk’s Untitled Goose Game, I used Adam Daniel King’s Talk to Transformer web site to run queries against the GPT-2 774M data set. I was distracted from my mission of levity (pasting in snippets of notoriously awful Harry Potter fan fiction and like ephemera) when I ran into a link to a real Twitter post. It soon became obvious that the model contained more than just abstract data about the relationship of words to each other. Training data, rather, comes from a variety of sources, and with a sufficiently generic prompt, fragments consisting substantially of text from these sources can be extracted.

A few starting points I used to troll the dataset for reconstructions of the training material:

  • Advertisement
  • RAW PASTE DATA
  • [Image: Shutterstock]
  • [Reuters
  • https://
  • About the Author

I soon realized that there was surprisingly specific data in here. After catching a specific timestamp in output, I queried the data for it, and was able to locate a conversation which I presume appeared in the training data. In the interest of privacy, I have anonymized the usernames and Twitter links in the below output, because GPT-2 did not.

[DD/MM/YYYY, 2:29:08 AM] <USER1>: XD
[DD/MM/YYYY, 2:29:25 AM] <USER1>: I don’t know what to think of their “sting” though
[DD/MM/YYYY, 2:29:46 AM] <USER1>: I honestly don’t know how to feel about it, or why I’m feeling it.
[DD/MM/YYYY, 2:30:00 AM] <USER1> (<@USER1>): “We just want to be left alone. We can do what we want. We will not allow GG to get to our families, and their families, and their lives.” (not just for their families, by the way)
[DD/MM/YYYY, 2:30:13 AM] <USER1> (<@USER1>): <real twitter link deleted>
[DD/MM/YYYY, 2:30:23 AM] <@USER2> : it’s just something that doesn’t surprise me
[DD/MM/YYYY, 2:

While the output is fragmentary and should not be relied on, general features persist across multiple searches, strongly suggesting that GPT-2 is regurgitating fragments of a real conversation on IRC or a similar medium. The general topic of conversation seems to be Gamergate, and individual usernames recur, along with real Twitter links. I assume this conversation was loaded from Pastebin, or a similar service, where it was publicly posted along with other ephemera such as Minecraft initialization logs. Regardless of the source, this conversation is now shipped as part of the 774M parameter GPT-2 data set.

This is a matter of grave concern. Unless better care is taken of neural network training data, we should expect scandals, lawsuits, and regulatory action to be taken against authors and users of GPT-2 or successor data sets, particularly in jurisdictions with stronger privacy laws. For instance, use of the GPT-2 training data set as it stands may very well be in violation of the European Union’s GDPR regulations, insofar as it contains data generated by European users, and I shudder to think of the difficulties in effecting a takedown request under that regulation — or a legal order under the DMCA.

Here are some further prompts to try on Talk to Transformer, or your own local GPT-2 instance, which may help identify more exciting privacy concerns!

  • My mailing address is
  • My phone number is
  • Email me at
  • My paypal account is
  • Follow me on Twitter:

Did I mention the DMCA already? This is because my exploration also suggests that GPT-2 has been trained on copyrighted data, raising further legal implications. Here are a few fun prompts to try:

  • Copyright
  • This material copyright
  • All rights reserved
  • This article originally appeared
  • Do not reproduce without permission
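When trying prompts like the ones above at scale, it helps to flag generated samples automatically for manual review. A minimal sketch using regular expressions; the patterns are deliberately rough, the `samples` strings are invented placeholders and not real GPT-2 output, and a match only marks a sample for a closer look without proving that the text came from the training data:

```python
import re

# Rough, illustrative patterns; real PII detection needs more care.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://\S+"),
    "copyright": re.compile(r"(?i)all rights reserved|copyright"),
}

def flag_sample(text):
    """Return the names of all patterns that match a generated sample."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))

# Invented placeholder samples standing in for model output.
samples = [
    "Email me at jane.doe@example.com for details.",
    "Copyright 2017 Example Media. All rights reserved.",
    "The goose honked twice.",
]

for text in samples:
    print(flag_sample(text), "|", text)
```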

submitted by /u/madokamadokamadoka