Category: Reddit MachineLearning

[D] The “test set” is nonsense

I often see ML practitioners, and even experts, pose the idea of the “test set” as the ultimate benchmark of a model’s performance. This is nonsense – and I’ll explain why.

Suppose you gather some data, label it, preprocess it, and compile it into a dataset. Now it’s time to split your data – train, validation, test; how will you do it?

  1. Random selection; may work well if the dataset is large enough
  2. Engineered selection – assign samples to each set according to some rationale

The ultimate goal of an ML model is generalization; as such, said ‘rationale’ could be:

  1. The best test set consists of the most “realistic” or “difficult” samples. Problem: model performance is harmed by artificially biasing the train set to exclude realistic/difficult samples.
  2. The best split gives each set data of the same quality. Justification: “poorer” quality usually means (a) noise or (b) low complexity (“too obvious”). Whatever the description, if you test the model on an information landscape (i.e. a probability distribution) that it wasn’t trained on, the model may perform poorly simply because it “learned” little that’s relevant to the test set.

Thus, “split equally” should work best. Onto the problem: why do we use a test set at all? Because we “fit” the validation set with our hyperparameters, and we need to test on “never seen” data to avoid bias. Agreed: the test set does suppress that bias. But here’s its red line: variance.

Basic statistical theory: a sample only approximates the population distribution, so its mean, standard deviation, and other statistics are uncertain. The more complex the problem, the greater the variation. So, is there a solution? Yes: K-Fold Cross-Validation. Per known theory, K-Fold CV can significantly cut the variance of the estimated model performance, and the higher the “K”, the better. Without it, classification error can easily differ by 5-15%, if not 20-30%. When deciding what’s “SOTA”, every single percentage point can be a hard-fought battle, so a “mere 5%” is already astronomical.

One may counter-argue, “it’s fine if the test set is large enough”. Except it’s not fine: you get a “large enough” test set either by sacrificing train data, or by having a dataset large enough to allow an even validation-test split. The former is undesirable for obvious reasons, and in the latter, unless you have a gargantuan dataset (extremely rare), your test samples are still subject to significant variance; merely swapping test and validation samples can turn the tables.
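To put a rough number on the “large enough” argument: if test samples are assumed independent and each prediction is simply right or wrong, measured accuracy behaves like a binomial proportion. The sketch below computes the resulting uncertainty for a few test-set sizes. The function names are mine and the 0.90 accuracy is a made-up example. It only covers sampling noise from the test set, not seed or training variance.

```python
import math


def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Standard error of a measured accuracy, assuming independent test
    samples (binomial model)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


def confidence_halfwidth(accuracy: float, n_test: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval (normal approximation)."""
    return z * accuracy_standard_error(accuracy, n_test)


# Hypothetical model with 90% measured accuracy, for three test-set sizes.
for n_test in (100, 1_000, 10_000):
    halfwidth = confidence_halfwidth(0.90, n_test)
    print(f"n_test={n_test:>6}: 0.90 +/- {halfwidth:.3f}")
```

Under these assumptions, a 90%-accurate model measured on 100 samples comes with roughly ±6 points of noise, and about ±2 points on 1,000 samples.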

As a final punchline, note that the random seed can also substantially impact the final outcome, further amplifying variance. Consequence: you don’t know whether you did well because Dropout(0.5) works better than Dropout(0.2) or because the dice rolled nicely. K-Fold CV will also reduce seed variance as a side effect, but ideally (though often prohibitively) you’d do “K seeds”.

Verdict: the test set isn’t good for testing. Instead, use K-fold CV, which both estimates generalization performance better by reducing variance and allows using more train data.
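A toy sketch of the comparison argued for above, using only the standard library. Everything in it is an assumption made for illustration: the synthetic 1-D data, the threshold “model”, and the helper names. The seed here only changes how the data is split, not any model initialization. It scores the same data with a single held-out fold and with the average over 5 folds, across 10 shuffles, and prints the spread of each estimate.

```python
import random
import statistics


def make_data(n, seed):
    """Toy 1-D, two-class data: class 0 ~ N(0, 1), class 1 ~ N(1.5, 1)."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n)]
    return [(rng.gauss(1.5 * y, 1.0), y) for y in labels]


def fit_threshold(train):
    """'Train' by placing a threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2.0


def accuracy(threshold, test):
    """Fraction of samples whose side of the threshold matches the label."""
    return sum((x > threshold) == (y == 1) for x, y in test) / len(test)


def k_fold_indices(n, k, seed):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fold_scores(data, k, seed):
    """Accuracy on each held-out fold, training on the remaining k-1 folds."""
    scores = []
    for held_out in k_fold_indices(len(data), k, seed):
        held = set(held_out)
        train = [sample for j, sample in enumerate(data) if j not in held]
        test = [data[j] for j in held_out]
        scores.append(accuracy(fit_threshold(train), test))
    return scores


data = make_data(200, seed=0)
single_split, cross_validated = [], []
for seed in range(10):
    scores = fold_scores(data, k=5, seed=seed)
    single_split.append(scores[0])                   # one train/test split
    cross_validated.append(statistics.mean(scores))  # average over all 5 folds

print("single split std over seeds:", statistics.stdev(single_split))
print("5-fold CV    std over seeds:", statistics.stdev(cross_validated))
```

To also cover the “K seeds” idea for a real model, the same loop would vary the training seed as well as the split seed.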


Though I am knowledgeable on the topic, I’m not an “expert” – and even experts disagree. Thus, counterarguments welcome.

submitted by /u/OverLordGoldDragon

[D] How do you handle sparse features?

I am working on a problem where I have a sequence of events. Every event generates a set of tokens (some tokens are shared between events, but not all). The task is to categorize the behavior that generated this set of events.

Let me give you a simple example to show what the input looks like.

event_type | order | value_type_1 | value_1 | value_type_2 | value_2
E1         | 1     | alpha 1      | 24      | alpha 2      | 33
E2         | 2     | beta         | 120     |              |
E1         | 3     | alpha 1      | 234     | alpha 2      | 56
E3         | 4     | theta        | 150     |              |
E4         | 5     |              |         |              |

You can see, for example, that the token “theta” doesn’t exist in event type E2; it only exists in some event types.

If I want to do feature engineering in this case, what is the best way to vectorize my data? If I make each token a column, like this, I end up with very sparse features.

event_type | order | alpha 1 | alpha 2 | beta | theta
E1         | 1     | 24      | 33      |      |
E2         | 2     |         |         | 120  |
E1         | 3     | 234     | 56      |      |
E3         | 4     |         |         |      | 150
E4         | 5     |         |         |      |

If I construct my features this way, they will be very sparse, and it doesn’t make sense to treat the empty cells as missing data (because the data doesn’t exist in the first place).

I don’t want to apply a data imputation method such as filling in the last value (the example below shows what that filling would produce). The reason is that some event types are very frequent and some are not.

event_type | order | alpha 1 | alpha 2 | beta | theta
E1         | 1     | 24      | 33      | 0    | 0
E2         | 2     | 24      | 33      | 120  | 0
E1         | 3     | 234     | 56      | 120  | 0
E3         | 4     | 234     | 56      | 120  | 150
E4         | 5     | 234     | 56      | 120  | 150

If you were in my shoes, how would you treat this problem? Ideas and references are welcome.

If you are wondering what I want to do: I want to categorize the behavior that generated this set of events. I can experiment with any method if I get the feature engineering right (you can think of clustering as an example).
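For concreteness, here is one possible way to hold the example above without a wide, mostly empty table. Each event keeps only the tokens it actually has, and a whole sequence is summarised into one feature dictionary. This is only a sketch of a representation, not a recommended answer. The feature names (`count[...]`, `n[...]`, `mean[...]`, `last[...]`) are invented here, and the summary deliberately ignores event order.

```python
from collections import defaultdict

# The example sequence from the post, one sparse mapping per event.
events = [
    {"event_type": "E1", "order": 1, "tokens": {"alpha 1": 24, "alpha 2": 33}},
    {"event_type": "E2", "order": 2, "tokens": {"beta": 120}},
    {"event_type": "E1", "order": 3, "tokens": {"alpha 1": 234, "alpha 2": 56}},
    {"event_type": "E3", "order": 4, "tokens": {"theta": 150}},
    {"event_type": "E4", "order": 5, "tokens": {}},
]


def sequence_features(events):
    """Summarise one event sequence as a flat feature dictionary.

    Only tokens that were actually observed produce features, so nothing
    has to be imputed for tokens an event type never emits.
    """
    features = defaultdict(float)
    observed = defaultdict(list)
    for event in sorted(events, key=lambda e: e["order"]):
        features[f"count[{event['event_type']}]"] += 1
        for token, value in event["tokens"].items():
            observed[token].append(value)
    for token, values in observed.items():
        features[f"n[{token}]"] = len(values)
        features[f"mean[{token}]"] = sum(values) / len(values)
        features[f"last[{token}]"] = values[-1]
    return dict(features)


features = sequence_features(events)
for name in sorted(features):
    print(name, features[name])
```

Dictionaries like this can be turned into a sparse matrix by any dictionary vectorizer, with absent keys treated as “not present” rather than as missing values.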

submitted by /u/__Julia

[D] Machine Learning – WAYR (What Are You Reading) – Week 74

This is a place to share machine learning research papers, journals, and articles that you’re reading this week. If it relates to what you’re researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you’ve read.

Please try to provide some insight from your understanding, and please don’t post things that are already in the wiki.

Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links.

Previous weeks:

1-10    | 11-20   | 21-30   | 31-40   | 41-50   | 51-60   | 61-70   | 71-80
Week 1  | Week 11 | Week 21 | Week 31 | Week 41 | Week 51 | Week 61 | Week 71
Week 2  | Week 12 | Week 22 | Week 32 | Week 42 | Week 52 | Week 62 | Week 72
Week 3  | Week 13 | Week 23 | Week 33 | Week 43 | Week 53 | Week 63 | Week 73
Week 4  | Week 14 | Week 24 | Week 34 | Week 44 | Week 54 | Week 64 |
Week 5  | Week 15 | Week 25 | Week 35 | Week 45 | Week 55 | Week 65 |
Week 6  | Week 16 | Week 26 | Week 36 | Week 46 | Week 56 | Week 66 |
Week 7  | Week 17 | Week 27 | Week 37 | Week 47 | Week 57 | Week 67 |
Week 8  | Week 18 | Week 28 | Week 38 | Week 48 | Week 58 | Week 68 |
Week 9  | Week 19 | Week 29 | Week 39 | Week 49 | Week 59 | Week 69 |
Week 10 | Week 20 | Week 30 | Week 40 | Week 50 | Week 60 | Week 70 |

Most upvoted papers two weeks ago:

/u/ecart33: https://arxiv.org/abs/1906.00817v1

Besides that, there are no rules. Have fun.

submitted by /u/ML_WAYR_bot

[D] Heads up – MLPerf Inference results publishing Wednesday

MLPerf, a project to benchmark machine learning hardware, is publishing their first round of Inference results this Wednesday.

Take some time to review the exact challenge they’re putting the hardware through: https://mlperf.org/inference-overview/ and the general rules for Inference submissions: mlperf/inference_policies: inference_rules.adoc

I’m excited to see some of the low-power chip results.

Source for date: #single-submission-round-schedule. Submission for this cycle was October 11th, so Week 1 Monday is October 14th, and Week 4 Wednesday (publication day) is November 6th, 10 AM US/Pacific time.
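The date arithmetic can be checked with a few lines of Python. Two things here are my assumptions rather than something the post states: the year (2019) and the rule that “Week 1” starts on the first Monday after the submission deadline. The function name is also just illustrative.

```python
from datetime import date, timedelta


def publication_date(submission: date, week: int = 4, weekday: int = 2) -> date:
    """Return the given weekday (0 = Monday, 2 = Wednesday) of the given
    week, where Week 1 starts on the first Monday after `submission`.

    The 'first Monday after submission' rule is an assumption.
    """
    days_to_monday = (7 - submission.weekday()) % 7 or 7
    week_one_monday = submission + timedelta(days=days_to_monday)
    return week_one_monday + timedelta(weeks=week - 1, days=weekday)


submission = date(2019, 10, 11)  # year is assumed, the post gives only the day
print(publication_date(submission, week=1, weekday=0))  # Week 1 Monday
print(publication_date(submission))                     # Week 4 Wednesday
```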

submitted by /u/riking27

[D] DeepMind’s PR regarding Alphastar is unbelievably baffling.

David Silver hinted that DeepMind is done with Starcraft in a BBC news article saying “the lab may rest now” and that they have “completed the Starcraft challenge”.

I thought this was a little disappointing since the skill level Alphastar reached on ladder was not enough to beat professional players. I think we all wanted a real nice showdown between the human champion and the robot, right? That would have been pretty cool.

The Nature paper had a nice graph depicting Alphastar’s MMR, which is basically Blizzard’s version of an Elo rating. The Protoss agent had reached an MMR of ~6200 and the aggregate of all three races was 6030 iirc. The graph also had the MMRs of Alphastar’s opponents and information on whether the agent won or lost.

Basically Alphastar had lost all but 2 games against players who had higher than 6200 MMR. On ladder, it could not beat the professionals.

The agent from January was estimated to have been over 7000 MMR. I figured it’d be nice to estimate how well this newest agent would have fared against MaNa. Right now, MaNa’s MMR is ~6700.

So I looked at the EU ladder, found someone with an MMR of ~6200, popped him and MaNa into Aligulac (sc2 database) and let it estimate some odds. MaNa had ~75% chance of winning a Best of 5, and his 6200 MMR opponent had less than 1% chance of beating MaNa 5-0.
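For anyone who wants to play with series odds themselves, here is a minimal sketch of how a per-game win probability turns into best-of-N odds. It assumes every game is independent with the same win probability, which is a simplification and not how Aligulac computes its predictions. The function names and the 0.6 example value are made up.

```python
from math import comb


def series_win_probability(p_game: float, wins_needed: int = 3) -> float:
    """Probability of reaching `wins_needed` wins first (3 = best of 5),
    assuming independent games with a fixed per-game win probability."""
    q_game = 1.0 - p_game
    return sum(
        # The winner takes the last game; the loser's wins can fall
        # anywhere among the earlier games.
        comb(wins_needed - 1 + losses, losses) * p_game**wins_needed * q_game**losses
        for losses in range(wins_needed)
    )


def sweep_probability(p_game: float, wins_needed: int = 3) -> float:
    """Probability of winning the series without dropping a game."""
    return p_game**wins_needed


p = 0.6  # hypothetical per-game win probability for the favourite
print("favourite wins the Bo5:", series_win_probability(p))
print("underdog sweeps 3-0:   ", sweep_probability(1.0 - p))
```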

At this point I became convinced that DeepMind was throwing in the towel on sc2 because the cost of further improving Alphastar was too high to justify the publicity they were getting from the project. The team looked to be moving on to different things and the showmatch vs the world champion had been cancelled.

But then something absolutely baffling happened which I don’t think anyone saw coming.

Blizzcon was this weekend. With little to no fanfare, DeepMind had brought Alphastar with them and let Blizzcon visitors play against it. Serral, one of the best players in the world, who had just finished top 4 in the biggest tournament of the year, wandered over to the arcade and played a few games against the bot. Serral’s MMR is over 7000.

He lost 0-3 to the Protoss agent. These games were not televised. All we have is some blurry smartphone footage. https://mobile.twitter.com/LiquidTLO/status/1190779241564000256

I don’t get it. If Alphastar was this strong why didn’t DeepMind let it play more on ladder and get a higher ranking? Why didn’t they organize a showmatch or something? They dropped the ball pretty hard on this one. This is so confusing to me.

First they beat two professional players but were hit with a huge, imo warranted, backlash due to the APM controversy.

Then they produced agents under more proper mechanical limitations and the agents turned out to be much weaker than the previous version.

Finally, they beat the best player in the world, seemingly accidentally while no one was looking.

From a PR standpoint, could this have gone any worse for DeepMind?

submitted by /u/SoulDrivenOlives

[D] Is finetuning on part of the evaluation dataset acceptable for publishing machine learning papers?

Hello everyone,

I have been trying to reproduce the results of a SOTA paper on object detection. I have reimplemented their method and trained on the same dataset, as described in the paper; however, I was not able to achieve their results on the datasets they use for evaluation, no matter what I tried.

Then I also studied their referenced papers and realised that many of them use a train-test split strategy for evaluating their models. This means that they use a part of the evaluation dataset for finetuning their already trained model and then evaluate it on the testing part of the same dataset. In the case of these papers, this fact was explicitly mentioned. I think that this also happened in the paper I tried to reproduce. However, they don’t mention it.

My question for discussion is, what do you think about this strategy? Is finetuning on part of the evaluation dataset a way to go? What about generalisation on totally unknown data? In my opinion it is ok if explicitly mentioned. Totally uncool in the opposite case, though.

EDIT: Just a clarification to be on the same page. What I mean by train, test and validation sets is a big dataset which is split in those three subsets.

By evaluation dataset I mean a benchmark dataset which researchers use to report their results on a specific task. So, finetuning on part of the evaluation dataset means retraining on one part of the benchmark dataset and then reporting results on the rest of it, which was not seen during finetuning.
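To make the protocol being discussed concrete, it could be sketched like this. The function name, the 50/50 fraction and the integer sample IDs are placeholders of mine, not taken from any of the papers. The point is only that the finetuning part and the reported part come from the same benchmark and never overlap.

```python
import random


def split_benchmark(sample_ids, finetune_fraction=0.5, seed=0):
    """Split benchmark sample IDs into a finetuning part and a disjoint
    part that is used only for reporting results."""
    ids = sorted(sample_ids)            # fixed order so the seed is reproducible
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * finetune_fraction)
    finetune_ids, report_ids = ids[:cut], ids[cut:]
    # No sample may be used for both finetuning and reporting.
    assert not set(finetune_ids) & set(report_ids)
    return finetune_ids, report_ids


benchmark_ids = range(1000)  # stand-in for real image or annotation IDs
finetune_ids, report_ids = split_benchmark(benchmark_ids)
print(len(finetune_ids), "samples for finetuning,", len(report_ids), "for reporting")
```

Publishing the seed or the ID lists alongside the paper is what makes a split like this reproducible for someone else.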

submitted by /u/roset_ta

[D] Conversational AI

I’m curious about the current state of conversational AI. To be more specific, by “conversation” I’m not talking about something that can take orders to schedule appointments or buy tickets or something like that. I mean something like discussing movies or TV shows or current events. I think I remember Amazon holding a contest for this kind of thing, but I haven’t really seen something like this implemented anywhere I can access it. Does anyone have any examples of what this kind of technology can do? Or better yet, anything I can play and experiment with on my own?

submitted by /u/Rioghasarig

[D] any principled reason for cross entropy instead of L2 in language modelling? (more details in post)

Is there any principled reason for using softmax and cross entropy as the loss in, for example, transformers, rather than an L2 loss between the target embedding and the output of the model?

When the output of your model is by necessity a dot product, as in shallow models, I understand why you need a cross entropy loss. But for models such as RNNs and some variants of transformers, wouldn’t an L2 loss directly between the desired embedding and the output work as well or better?
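To make the two options concrete, here is a toy comparison in plain Python. The three-word vocabulary, the 2-D embeddings and the model output are all invented for illustration. Cross entropy scores the output against every word in the vocabulary (a categorical distribution over words), while L2 only measures the distance to the one target embedding.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Negative log-probability of the target under softmax(logits)."""
    largest = max(logits)  # subtract the max for numerical stability
    log_normalizer = largest + math.log(sum(math.exp(v - largest) for v in logits))
    return log_normalizer - logits[target_index]


def l2_loss(output, target_embedding):
    """Squared Euclidean distance between the output and the target embedding."""
    return sum((o - t) ** 2 for o, t in zip(output, target_embedding))


# Toy embeddings (made up): "cat" and "dog" are close, "car" is far away.
embeddings = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [-1.0, 0.5]}
vocab = list(embeddings)

output = [0.95, 0.05]  # hypothetical model output, halfway between cat and dog
logits = [sum(o * e for o, e in zip(output, embeddings[word])) for word in vocab]

print("cross entropy vs 'cat':", softmax_cross_entropy(logits, vocab.index("cat")))
print("L2 vs 'cat' embedding: ", l2_loss(output, embeddings["cat"]))
```

In this toy case the L2 loss is close to zero even though the output does not separate “cat” from “dog”. In general, L2 is minimized by the average of the plausible target embeddings, which need not correspond to any single word.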

submitted by /u/mesmer_adama