Category: Reddit MachineLearning

[D] The “test set” is nonsense

Written on November 2, 2019. Posted in Reddit MachineLearning.

I often see ML practitioners, and even experts, pose the idea of the “test set” as the ultimate benchmark of a model’s performance. This is nonsense – and I’ll explain why.

Suppose you gather some data, label it, preprocess it, and compile it into a dataset. Now it’s time to split your data – train, validation, test; how will you do it?

Random selection; may work well if the dataset is large enough
Engineered selection – assign samples to each set according to some rationale

The ultimate goal of an ML model is generalization; as such, said ‘rationale’ could be:

The best test set is comprised of the most “realistic” or “difficult” samples. Problem: model performance is harmed by artificially biasing the train set to exclude realistic/difficult samples.
The best split gives each set data of the same quality. Justification: “poorer” quality usually means (a) noise or (b) low complexity (“too obvious”). Whatever the description, if you test the model on an information landscape (i.e. a probability distribution) that it wasn’t trained on, the model may perform poorly simply because it “learned” little that’s relevant to the test set.

Thus, “split equally” should work best. Onto the problem: why do we use a test set at all? Because – we “fit” the validation set with our hyperparameters, and we need to test on “never seen” data to avoid bias. Indeed, agreed – the test set does suppress said bias. But here’s its red line: variance.

From basic statistical theory: a sample is an approximation of the population distribution, with an uncertain mean, standard deviation, and other statistics. The more complex the problem, the greater the variation. So, is there a solution? Yes: K-Fold Cross-Validation. Per known theory, K-Fold CV can significantly slash variance of model performance – the higher the “K”, the better. Without it, classification error can easily differ by 5-15%, if not 20-30%, from one split to another. When deciding what’s “SOTA”, every single percentage point can be a battle hard-fought – so a “mere 5%” is already astronomical.

One may counter-argue, “it’s fine if the test set is large enough”. Except it’s not fine; you get a “large enough” test set either by sacrificing train data, or because the dataset is large enough that you can make an even validation-test split. The former is undesirable for obvious reasons – and in the latter, unless you have a gargantuan dataset (extremely rare), your test samples are still subject to significant-enough variance; merely swapping test & validation samples can flip the tables.
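To put a rough number on the "large enough" argument, here is a minimal sketch using the binomial standard error of an accuracy estimate. It assumes i.i.d. test samples and a normal approximation, and it only covers the sampling noise of the test set itself (not seed or split effects); the function names are made up for illustration.

```python
import math

def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test i.i.d. samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

def approx_95_interval(accuracy: float, n_test: int) -> tuple:
    """Normal-approximation 95% interval around the measured accuracy."""
    half_width = 1.96 * accuracy_standard_error(accuracy, n_test)
    return accuracy - half_width, accuracy + half_width

# How wide is the uncertainty around a measured 90% accuracy?
for n_test in (100, 1_000, 10_000):
    low, high = approx_95_interval(0.90, n_test)
    print(f"n_test={n_test:>6}: plausible range {low:.3f} to {high:.3f}")
```

Under these assumptions, 1,000 test samples still leave roughly ±1.9 accuracy points of uncertainty around a measured 90%, which is already larger than many claimed SOTA margins.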

As a final punchline, note that the random seed can also substantially impact the final outcome, further amplifying variance. Consequence: you don’t know whether you did well because Dropout(0.5) works better than Dropout(0.2) or because the dice rolled nicely. K-Fold CV will also reduce seed variance as a side-effect, but ideally (though often prohibitively) you’d do “K seeds”.
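A minimal sketch of K-fold CV repeated over several seeds, in plain Python. The threshold "model" and the synthetic data are stand-ins made up for illustration; in this toy the seed only reshuffles fold membership, whereas a real network would also use it for initialization, dropout masks, and so on.

```python
import random
import statistics

def kfold_indices(n_samples, k, seed):
    """Shuffle indices 0..n_samples-1 and cut them into k near-equal folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k, seed, fit, score):
    """Return one score per fold: train on k-1 folds, evaluate on the held-out fold."""
    scores = []
    for held_out in kfold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return scores

# Hypothetical stand-in model: a threshold halfway between the two class means.
def fit_threshold(xs, ys):
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2

def accuracy(threshold, xs, ys):
    return sum((x > threshold) == bool(y) for x, y in zip(xs, ys)) / len(xs)

# Synthetic 1-D data: two overlapping Gaussian classes.
data_rng = random.Random(0)
ys = [i % 2 for i in range(200)]
xs = [data_rng.gauss(1.5 * y, 1.0) for y in ys]

per_seed_means = []
for seed in range(5):
    fold_scores = cross_validate(xs, ys, k=5, seed=seed, fit=fit_threshold, score=accuracy)
    per_seed_means.append(statistics.mean(fold_scores))
    print(f"seed={seed} folds={[round(s, 2) for s in fold_scores]} mean={per_seed_means[-1]:.3f}")

print(f"stdev of the per-seed means: {statistics.stdev(per_seed_means):.4f}")
```

The individual fold scores play the role of single "test set" results; comparing their spread with the spread of the per-seed means gives a feel for how much averaging helps on a given problem.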

Verdict: the test set isn’t good for testing. Instead, use K-fold CV, which both better estimates generalization performance by reducing variance, and allows using more train data.

Though I am knowledgeable on the topic, I’m not an “expert” – and even experts disagree. Thus, counterarguments welcome.

submitted by /u/OverLordGoldDragon

[D] How do you handle sparse features?

Written on November 2, 2019. Posted in Reddit MachineLearning.

I am working on a problem where I have a sequence of events. Every event generates a set of tokens (some of the tokens are shared between events, but not all), and the task is to categorize the behavior that generated this set of events.

Let me give you a simple example to illustrate the input.

event_type	order	value_type_1	value_1	value_type_2	value_2
E1	1	alpha 1	24	alpha 2	33
E2	2	beta	120
E1	3	alpha 1	234	alpha 2	56
E3	4	theta	150
E4	5

You can notice, for example, that the token “theta” doesn’t exist in event_type E2; it only exists in some event types.

If I want to do feature engineering in this case, what is the best way to vectorize my data? If I take the tokens and lay them out as columns like this, I will end up with very sparse features.

event_type	order	alpha 1	alpha 2	beta	theta
E1	1	24	33
E2	2			120
E1	3	234	56
E3	4				150
E4	5

If I construct my features this way, they will be very sparse, and it doesn’t make sense to treat the empty cells as missing data (because the data doesn’t exist in the first place).
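For concreteness, the wide layout above can be produced from the raw events like this. It is only a sketch: the in-memory `events` structure and the `to_wide` helper are assumptions made for illustration, and `None` marks a token that the event type never emits.

```python
# Events copied from the first table: (event_type, order, {token: value}).
events = [
    ("E1", 1, {"alpha 1": 24, "alpha 2": 33}),
    ("E2", 2, {"beta": 120}),
    ("E1", 3, {"alpha 1": 234, "alpha 2": 56}),
    ("E3", 4, {"theta": 150}),
    ("E4", 5, {}),
]

def to_wide(events):
    """Turn per-event token dicts into one column per token; absent tokens become None."""
    columns = sorted({token for _, _, values in events for token in values})
    rows = [
        [event_type, order] + [values.get(token) for token in columns]
        for event_type, order, values in events
    ]
    return columns, rows

columns, rows = to_wide(events)
print(["event_type", "order"] + columns)
for row in rows:
    print(row)

# Fraction of token cells that actually hold a value.
cells = [value for row in rows for value in row[2:]]
filled = sum(value is not None for value in cells)
print(f"filled cells: {filled} of {len(cells)}")
```

Even in this five-row example only 6 of the 20 token cells are populated, and the ratio gets worse as more event types with their own tokens are added.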

I don’t want to apply a data imputation method such as filling in the last value (the example below shows what that would look like, with each empty cell taking the most recent value seen in its column). The reason is that some event types are very frequent, and some are not.

event_type	order	alpha 1	alpha 2	beta	theta
E1	1	24	33	0	0
E2	2	24	33	120	0
E1	3	234	56	120	0
E3	4	234	56	120	150
E4	5	234	56	120	150
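The last-value fill shown in the table above can be reproduced with a few lines of Python. This is a sketch of the imputation the post wants to avoid, not a recommendation; the `forward_fill` helper and the starting value of 0 are assumptions taken from how the table is filled in.

```python
# Wide rows from the sparse table: event_type, order, alpha 1, alpha 2, beta, theta.
wide_rows = [
    ["E1", 1, 24, 33, None, None],
    ["E2", 2, None, None, 120, None],
    ["E1", 3, 234, 56, None, None],
    ["E3", 4, None, None, None, 150],
    ["E4", 5, None, None, None, None],
]

def forward_fill(rows, n_key_columns=2, initial=0):
    """Replace each None with the last value seen in that column (or `initial`)."""
    last_seen = None
    filled = []
    for row in rows:
        keys, values = row[:n_key_columns], row[n_key_columns:]
        if last_seen is None:
            last_seen = [initial] * len(values)
        # Keep a fresh value when present, otherwise carry the previous one forward.
        last_seen = [v if v is not None else prev for v, prev in zip(values, last_seen)]
        filled.append(list(keys) + last_seen)
    return filled

for row in forward_fill(wide_rows):
    print(row)
```

Note how the E4 row ends up looking identical to the E3 row in every token column, which is exactly the kind of artifact that makes frequent event types dominate the filled features.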

If you were in my shoes, how would you treat this problem? Ideas and references are welcome.

If you are wondering what I want to do: I want to categorize the behavior that generated this set of events. I can experiment with any method if I get the feature engineering right (you can think of clustering as an example).

submitted by /u/__Julia

[D] Machine Learning – WAYR (What Are You Reading) – Week 74

Written on November 2, 2019. Posted in Reddit MachineLearning.

This is a place to share machine learning research papers, journals, and articles that you’re reading this week. If it relates to what you’re researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you’ve read.

Please try to provide some insight from your understanding, and please don’t post things which are already in the wiki.

Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links.

Previous weeks :

1-10	11-20	21-30	31-40	41-50	51-60	61-70	71-80
Week 1	Week 11	Week 21	Week 31	Week 41	Week 51	Week 61	Week 71
Week 2	Week 12	Week 22	Week 32	Week 42	Week 52	Week 62	Week 72
Week 3	Week 13	Week 23	Week 33	Week 43	Week 53	Week 63	Week 73
Week 4	Week 14	Week 24	Week 34	Week 44	Week 54	Week 64
Week 5	Week 15	Week 25	Week 35	Week 45	Week 55	Week 65
Week 6	Week 16	Week 26	Week 36	Week 46	Week 56	Week 66
Week 7	Week 17	Week 27	Week 37	Week 47	Week 57	Week 67
Week 8	Week 18	Week 28	Week 38	Week 48	Week 58	Week 68
Week 9	Week 19	Week 29	Week 39	Week 49	Week 59	Week 69
Week 10	Week 20	Week 30	Week 40	Week 50	Week 60	Week 70

Most upvoted papers two weeks ago:

/u/ecart33: https://arxiv.org/abs/1906.00817v1

Besides that, there are no rules, have fun.

submitted by /u/ML_WAYR_bot

[D] Heads up – MLPerf Inference results publishing Wednesday

Written on November 2, 2019. Posted in Reddit MachineLearning.

MLPerf, a project to benchmark machine learning hardware, is publishing their first round of Inference results this Wednesday.

Take some time to review the precise challenge they’re putting the hardware to: https://mlperf.org/inference-overview/ and the general rules for Inference submissions: mlperf/inference_policies: inference_rules.adoc

I’m excited to see some of the low-power chip results.

Source for date: #single-submission-round-schedule – Submission for this cycle was October 11th, so Week 1 Monday is October 14th, and Week 4 Wednesday (publication day) is November 6th, 10AM US/Pacific time.
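The date arithmetic can be double-checked with a few lines of Python. One assumption here: "Week 1 Monday" is taken to be the first Monday after the submission date, which is how the post reads the schedule.

```python
from datetime import date, timedelta

submission = date(2019, 10, 11)

# Assumption: Week 1 starts on the first Monday strictly after the submission date.
days_to_monday = (7 - submission.weekday()) % 7 or 7
week1_monday = submission + timedelta(days=days_to_monday)

# Week 4 Wednesday is three full weeks plus two days after Week 1 Monday.
publication = week1_monday + timedelta(weeks=3, days=2)

print("Week 1 Monday:   ", week1_monday.strftime("%A %Y-%m-%d"))
print("Week 4 Wednesday:", publication.strftime("%A %Y-%m-%d"))
```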

submitted by /u/riking27

[D] DeepMind’s PR regarding Alphastar is unbelievably baffling.

Written on November 2, 2019. Posted in Reddit MachineLearning.

David Silver hinted that DeepMind is done with Starcraft in a BBC news article saying “the lab may rest now” and that they have “completed the Starcraft challenge”.

I thought this was a little disappointing since the skill level Alphastar reached on ladder was not enough to beat professional players. I think we all wanted a real nice showdown between the human champion and the robot, right? That would have been pretty cool.

The Nature paper had a nice graph depicting Alphastar’s MMR, which is basically Blizzard’s version of an Elo rating. The Protoss agent had reached an MMR of ~6200 and the aggregate of all three races was 6030 iirc. The graph also had the MMRs of Alphastar’s opponents and information on whether the agent won or lost.

Basically Alphastar had lost all but 2 games against players who had higher than 6200 MMR. On ladder, it could not beat the professionals.

The agent from January was estimated to have been over 7000 MMR. I figured it’d be nice to estimate how well this newest agent would have fared against MaNa. Right now, MaNa’s MMR is ~6700.

So I looked at the EU ladder, found someone with an MMR of ~6200, popped him and MaNa into Aligulac (sc2 database) and let it estimate some odds. MaNa had ~75% chance of winning a Best of 5, and his 6200 MMR opponent had less than 1% chance of beating MaNa 5-0.
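As a sanity check on those odds, here is a small sketch of how a per-game win probability translates into a series result, assuming independent games. The 0.64 per-game figure is a hypothetical value picked so that the series odds land near the ~75% quoted above; it is not Aligulac's number, and Aligulac's actual rating model is not reproduced here.

```python
from math import comb

def series_win_probability(p_game: float, wins_needed: int = 3) -> float:
    """Probability of winning a first-to-`wins_needed` series with independent games."""
    q_game = 1.0 - p_game
    return sum(
        comb(wins_needed - 1 + losses, losses) * p_game**wins_needed * q_game**losses
        for losses in range(wins_needed)
    )

# Hypothetical per-game win probability for the stronger player (assumption).
p_mana = 0.64

bo5 = series_win_probability(p_mana, wins_needed=3)
underdog_five_straight = (1.0 - p_mana) ** 5

print(f"Best of 5 win probability for the favorite: {bo5:.3f}")
print(f"Underdog winning five games in a row:       {underdog_five_straight:.4f}")
```

With that assumed per-game edge, the favorite takes the Best of 5 about three times out of four, and the underdog winning five straight games comes out below 1%, in line with the estimates in the post.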

At this point I became convinced that DeepMind was throwing in the towel on sc2 because the cost of further improving Alphastar was too high to justify the publicity they were getting from the project. The team looked to be moving on to different things and the showmatch vs the world champion had been cancelled.

But then something absolutely baffling happened which I don’t think anyone saw coming.

Blizzcon was this weekend. With little to no fanfare, DeepMind had brought Alphastar with them and let Blizzcon visitors play against it. Serral, one of the best players in the world, who had just finished top 4 in the biggest tournament of the year, wandered to the arcade and played a few games against the bot. Serral’s MMR is over 7000.

He lost 0-3 to the Protoss agent. These games were not televised. All we have is some blurry smartphone footage. https://mobile.twitter.com/LiquidTLO/status/1190779241564000256

I don’t get it. If Alphastar was this strong why didn’t DeepMind let it play more on ladder and get a higher ranking? Why didn’t they organize a showmatch or something? They dropped the ball pretty hard on this one. This is so confusing to me.

First they beat two professional players but were hit with a huge, imo warranted backlash due to the APM controversy.

Then they produced agents under more proper mechanical limitations and the agents turned out to be much weaker than the previous version.

Finally, they beat the best player in the world, seemingly accidentally, while no one was looking.

From a PR standpoint, could this have gone any worse for DeepMind?

submitted by /u/SoulDrivenOlives

[Discussion] On what basis are anchors chosen in YOLO algorithm?

Written on November 2, 2019. Posted in Reddit MachineLearning.

So we had a project review, and our teacher asked us on what basis anchors are chosen in YOLO, Faster R-CNN and the lot.

Now I have no idea what criterion this is based on, so if anyone has something to say on this, please do. I would appreciate it!

submitted by /u/kirasama16997

[D] Is finetuning on part of the evaluation dataset acceptable for publishing machine learning papers?

Written on November 2, 2019. Posted in Reddit MachineLearning.

Hello everyone,

I have been trying to reproduce the results of a SOTA paper on object detection. I have reimplemented their method and trained on the same dataset, following the paper; however, I was not able to achieve their results on the datasets they use for evaluation, no matter what I tried.

Then I also studied their referenced papers and realised that many of them use a train-test split strategy for evaluating their models. This means that they use a part of the evaluation dataset for finetuning their already trained model and then evaluate it on the testing part of the same dataset. In the case of these papers, this fact was explicitly mentioned. I think that this also happened in the paper I tried to reproduce. However, they don’t mention it.

My question for discussion is, what do you think about this strategy? Is finetuning on part of the evaluation dataset a way to go? What about generalisation on totally unknown data? In my opinion it is ok if explicitly mentioned. Totally uncool in the opposite case, though.

EDIT: Just a clarification to be on the same page. What I mean by train, test and validation sets is a big dataset which is split in those three subsets.

By evaluation dataset I mean a benchmark dataset which researchers use to report their results on a specific task. So, finetuning on part of the evaluation dataset means retraining on one part of the benchmark dataset and later reporting the results on the rest of it, which was not seen during finetuning.

submitted by /u/roset_ta

parameter sharing decoder pair for auto composing

Written on November 1, 2019. Posted in Reddit MachineLearning.

submitted by /u/edisonzhao

[D] Conversational AI

Written on November 1, 2019. Posted in Reddit MachineLearning.

I’m curious about the current state of conversational AI. To be more specific, by “conversation” I’m not talking about something that can take orders to schedule appointments or buy tickets or something like that. I mean something like discussing movies or TV shows or current events. I think I remember Amazon holding a contest for this kind of thing, but I haven’t really seen something like this implemented anywhere I can access it. Does anyone have any examples of what this kind of technology can do? Or better yet, anything I can play and experiment with on my own?

submitted by /u/Rioghasarig

[D] any principled reason for cross entropy instead of L2 in language modelling? (more details in post)

Written on November 1, 2019. Posted in Reddit MachineLearning.

Is there any principled reason for doing softmax and cross entropy for the loss in, for example, transformers, rather than doing L2 between the target embeddings and the output from the model?

When the output from your model is by necessity a dot product, such as in shallow models, I understand why you need to use cross entropy loss. But for models such as RNNs and some variants of transformers, wouldn’t an L2 loss directly between the desired embedding and the output work as well or better?
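To make the two objectives being compared concrete, here is a toy sketch in plain Python. The three 2-D "embeddings" and the model output are made-up values, and the logits are taken to be dot products between the output vector and each token embedding, which is one common setup rather than the only one.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_cross_entropy(output, embeddings, target_index):
    """Cross entropy of the target token under softmax(output . embedding_i)."""
    logits = [dot(output, emb) for emb in embeddings]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_partition = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_partition - logits[target_index]

def l2_loss(output, embeddings, target_index):
    """Squared Euclidean distance between the output and the target embedding."""
    target = embeddings[target_index]
    return sum((o - t) ** 2 for o, t in zip(output, target))

# Made-up vocabulary of three tokens; token 2 points the same way as token 0 but is longer.
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
output = [1.0, 0.0]  # exactly the embedding of token 0
target_index = 0

print("L2 loss:           ", l2_loss(output, embeddings, target_index))
print("cross entropy loss:", round(softmax_cross_entropy(output, embeddings, target_index), 4))
```

In this toy case the L2 loss is zero because the output equals the target embedding, while the cross entropy is still well above zero: the dot-product softmax gives token 2 the largest logit. So the two losses are not interchangeable scorings of the same output, which may be part of what any answer to the question has to account for.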

submitted by /u/mesmer_adama