Category: Reddit MachineLearning
[R] NeurIPS 2019 Livestream
aideeptalk will livestream the expo and posters at NeurIPS 2019 on Twitch at twitch.tv/aideeptalk
To receive a notification when we go live, please follow us and enable notifications on our Twitch channel.
Follow us on twitter.com/aideeptalk for our schedule.
Please pass this on to those who can’t make it to NeurIPS.
For more details, see our website aideeptalk.com
submitted by /u/aideeptalk
[link] [comments]
[D] Confused about generating a translation using Transformer
I’m reading the Attention Is All You Need paper and it doesn’t seem to explain how exactly the Transformer is used to generate a translation. Here’s how I understand it so far (please correct if I’m wrong):
- A sequence of k tokens comes in as one-hot vectors of length v – the vocab size. This is a (k x v) token matrix.
- The tokens are embedded in d_m (model size, e.g. 512) dimensional space via multiplication by an embedding matrix E of dim (v x d_m), yielding a (k x d_m) matrix.
- Positional encodings are added; the dim is still (k x d_m).
Encoding:
- Encoder block takes in the (k x d_m) matrix and outputs another (k x d_m) matrix.
- Repeat N times to get a final (k x d_m) matrix, i.e. the encoder output.
Now for decoding:
- The decoder takes in a (p x d_m) matrix (the embedded target tokens generated so far) and adds positional encodings.
- The (non-masked) multi-head attention function inside the decoder receives the encoder’s (k x d_m) output as the key K and value V, and a (p x d_m) matrix as the query Q, yielding a (p x d_m) output.
- The final output of the decoder is therefore (p x d_m).
Final output:
- The (p x d_m) decoder output is mapped to (p x v) by a matrix multiply (Question: they say it’s “tied” to the embedding matrix E, so is this just E^T?).
- Apply a softmax to each of the p rows and take the argmax, so you get p tokens out.
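The shape bookkeeping in the steps above can be checked with random stand-in matrices (a toy sketch, not the real model: a small vocab, no attention layers, and random values in place of the learned weights):

```python
import numpy as np

v, d_m = 1000, 512    # toy vocab size, model dimension
k, p = 7, 3           # source length, current target length

E = np.random.randn(v, d_m)                            # embedding matrix, tied to the output
src_onehot = np.eye(v)[np.random.randint(v, size=k)]   # (k x v) one-hot tokens
src = src_onehot @ E                                   # (k x d_m) embedded source
# ... positional encodings added, encoder block applied N times: still (k x d_m)
enc_out = src                                          # stand-in for the encoder output

tgt = np.random.randn(p, d_m)                          # stand-in for the decoder output
logits = tgt @ E.T                  # weight tying: (p x d_m) @ (d_m x v) = (p x v)
next_tokens = logits.argmax(axis=1) # argmax per row (after softmax) -> p token ids
```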
Suppose I want to translate the sequence “This attention paper is super confusing !” into German. Here k = 7, so my encoder outputs a (7 x 512) matrix. From here, can someone walk me through the steps of generating the translation?
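As I understand it, the standard answer is autoregressive greedy decoding: start the decoder input with just a start-of-sequence token (p = 1), take the argmax of the last row, append it, re-run the decoder with p = 2, and repeat until an end token appears. A minimal sketch, where `decode` is a stand-in for the whole decoder stack (the names BOS/EOS and the dummy logits are mine, not from the paper):

```python
import numpy as np

BOS, EOS, v, d_m = 1, 2, 1000, 512

def decode(enc_out, tgt_ids):
    """Stand-in for the decoder stack: returns (p x v) logits.
    A real Transformer would embed tgt_ids, add positions, run the N
    masked-attention blocks against enc_out, then multiply by E^T."""
    rng = np.random.default_rng(len(tgt_ids))  # deterministic dummy logits
    return rng.standard_normal((len(tgt_ids), v))

enc_out = np.random.randn(7, d_m)   # encoder output for the 7 source tokens

tgt_ids = [BOS]
for _ in range(20):                        # cap the output length
    logits = decode(enc_out, tgt_ids)      # (p x v)
    next_id = int(logits[-1].argmax())     # only the LAST row picks a new token
    tgt_ids.append(next_id)
    if next_id == EOS:
        break

translation = tgt_ids[1:]   # drop BOS; map ids back to (German) tokens
```

Beam search replaces the per-step argmax with keeping the top few partial sequences, but the re-run-the-decoder loop is the same.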
Thanks for looking at my question and have an awesome day!
submitted by /u/ME_PhD
[link] [comments]
[R] Piecewise Strong Convexity of Neural Networks
Paper: https://arxiv.org/abs/1810.12805
Video summary: https://www.youtube.com/watch?v=z89BTMQGVn
Earlier related work: https://arxiv.org/abs/1607.04917 (piecewise convexity)
I am not the author. This paper will be presented at NeurIPS this month and establishes some convexity results about piecewise-linear neural networks under the least-squares loss – namely piecewise strong convexity and the non-existence of differentiable local maxima. The approach is a spectral analysis of the Hessian and the weights of the network. The result is a relatively attractive convergence estimate for SGD.
I guess this provides some more motivation for studying techniques like ADMM, which have convergence properties for some classes of piecewise functions and can exploit Lipschitz-continuous gradients. Nice work!
submitted by /u/i-heart-turtles
[link] [comments]
[D] Best network for battle game agents (neuroevolution)
Hi,
I’m working on an indie game where you evolve teams of agents that each have a neural network, and then battle them against other players’ teams (looks like this: https://youtu.be/EPekL1JMXEY).
I already have an implementation of sparse LSTM-ish networks (1), but I’d like to optimize this further and wanted to see what people here suggest. Since it’s an evolution-based game I don’t use backprop. It also needs to be fairly simple, since it all runs on the GPU (which is why I can simulate thousands of agents on a single machine), and for the same reason I’d prefer a fixed-size architecture, which is why I’ve stayed away from NEAT so far.
So my question is: what would be the best network for something like this?
(1) My current network works as follows. Each node has: a state (1 float), 12 indices, 12 weights, and 2 bias values. Each index decides which other node it “reads” from, so a node can be connected to any other node (layers are therefore less important; they only decide the order of updates). Eight of the inputs feed the next value of the state; the other four act as a write gate (-1 = keep state -> 1 = update state). There are some more details but that’s roughly it.
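A rough CPU sketch of one update step for the network described in (1) – the tanh activations and the exact way the gate blends old and new state are my assumptions, not the poster’s:

```python
import numpy as np

n_nodes = 64
state = np.random.randn(n_nodes)                      # 1 float of state per node
idx = np.random.randint(n_nodes, size=(n_nodes, 12))  # which node each input reads
w = np.random.randn(n_nodes, 12)                      # 12 weights per node
b = np.random.randn(n_nodes, 2)                       # 2 biases per node

def step(state):
    inputs = state[idx] * w                              # gather reads: (n_nodes, 12)
    cand = np.tanh(inputs[:, :8].sum(axis=1) + b[:, 0])  # 8 inputs -> candidate state
    gate = np.tanh(inputs[:, 8:].sum(axis=1) + b[:, 1])  # 4 inputs -> gate in [-1, 1]
    g = (gate + 1.0) / 2.0                               # -1 (keep) .. 1 (update) -> 0..1
    return (1.0 - g) * state + g * cand                  # blend old state and candidate

state = step(state)
```

Since this is just gathers and elementwise ops on fixed-size arrays, it maps straightforwardly onto a GPU kernel with one thread per node, and mutation is just perturbing `idx`, `w`, and `b`.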
submitted by /u/FredrikNoren
[link] [comments]
[D] Figure Eight and alternatives
Hey,
We are considering using Figure Eight at our company.
Has anyone tried using Figure Eight as an annotation platform? What are the pros and cons? How does it compare to Amazon in terms of price, annotation quality, etc.?
Would love to hear your opinions.
submitted by /u/guzguzit
[link] [comments]
[D] Evolutionary Algorithms researchers, do you feel like a new library is needed?
Researchers who work with Evolutionary Algorithms: do the current libraries satisfy your needs? Do you feel there is a need for a new library? What features are missing that you need?
submitted by /u/ghost_shaba7
[link] [comments]
[D] Efficient Partial Dependence Plots with decision trees
Partial Dependence Plots (PDPs) are a standard model inspection technique. It turns out that for decision trees, they can be computed very efficiently. This post explains how PDPs are computed in general, and goes into the details of the optimized version for tree models.
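For reference, the general (brute-force) PDP computation the post starts from is simple to sketch: for each grid value of the feature, overwrite that feature column for every row and average the model’s predictions (assuming any model with a scikit-learn-style `.predict`):

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """Brute-force PDP: one full prediction pass per grid value."""
    X = np.asarray(X, dtype=float)
    pdp = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v       # force the feature to the grid value for every row
        pdp.append(model.predict(Xv).mean())
    return np.array(pdp)
```

This costs one prediction pass over the whole dataset per grid value; the tree-specific optimization the post describes exploits the tree structure to avoid repeating that full pass.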
submitted by /u/Niourf
[link] [comments]
[D] Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
A recent paper by Cynthia Rudin claims “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead”: https://arxiv.org/abs/1811.10154
A summary of the paper can be found here: https://www.kdnuggets.com/2019/11/stop-explaining-black-box-models.html
Thoughts?
submitted by /u/selib
[link] [comments]
[D] Validating regression models on edge cases?
I’m trying to predict USED car prices, given some x number of parameters.
The R² is > 0.98 on the test data, but on new data with edge cases it misses predictions by (what I think is) too much.
Beyond the evaluation metric, how can we validate that a prediction is good enough, even for an edge case?
Currently, I’m thinking of fitting a simple linear regression of price on age and kilometers across a range of values. That would give me a baseline model I could check my edge-case predictions against, to pull them toward a more average case.
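One way that baseline idea could be operationalized (a sketch with made-up toy data and a made-up tolerance; the column names and the 50% threshold are my assumptions):

```python
import numpy as np

# toy data: columns are age (years) and kilometers; target is price
X = np.array([[2, 30_000], [5, 80_000], [8, 150_000], [10, 200_000]], dtype=float)
y = np.array([20_000, 14_000, 8_000, 5_000], dtype=float)

# fit price ~ age + km by ordinary least squares (with an intercept)
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def baseline_price(age, km):
    return coef[0] + coef[1] * age + coef[2] * km

def looks_sane(model_pred, age, km, tol=0.5):
    """Flag an edge-case prediction that strays more than tol (50%)
    from the simple linear baseline."""
    ref = baseline_price(age, km)
    return abs(model_pred - ref) <= tol * abs(ref)
```

A flagged prediction isn’t necessarily wrong, but it’s a cheap trigger for a manual look or for falling back to the baseline.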
I’m really just seeking advice on what to do here. Is the approach good enough? What are other approaches for validation / sanity checking if each sample we try to predict individually is good enough?
submitted by /u/permalip
[link] [comments]