Category: Reddit MachineLearning

[D] On pornographic, NSFW and non-consensual images in the ImageNet dataset. What’s the path forward?

Dear Reddit-ML community,

In the ImageNet dataset (classes 445 -n02892767- [’bikini, two-piece’] and 459 -n02837789- [’brassiere, bra, bandeau’]), there are many images that are verifiably pornographic (you can see the porn star’s webpage in the picture!), shot in non-consensual or voyeuristic settings, or that involve underage nudity (see collage here). This has deep ramifications not just legally, for anyone downloading and storing these images, but also has a trickle-down effect on the models trained on this dataset. For example, if you are an artist making or selling neural art, the unethical nature of the seed images could sully the sanctity of the art (see: https://openreview.net/forum?id=HJlrwcP9DB).

The question now is: What’s the best path forward? Image deletion and replacement? Do chime in with your thoughts!

PS: I had written to the creators of the dataset (waay before the ImageNet Roulette thingy), but received no replies.

submitted by /u/VinayUPrabhu

Is there a way to train a scikit classifier to make one prediction per N samples? [Project]

So, I originally posted this on StackOverflow, but I was told that my question was “too broad” and my thread was closed.

I’m working on replicating the research done in this paper.

I have a pandas DF which looks like this:

    Date   In1   In2   In3   ...   Out
    Day1    -1     1    -1   ...    -1
    Day2     1    -1     1   ...     1
    Day3    -1     1    -1   ...    -1
    Day4    -1     1     1   ...     1
    Day5     1     1     1   ...     1
    ...

Now, I’ve already done what they did in the paper; that is, I’ve trained multiple models in scikit to predict "Out" based on all the feature columns "In1", ..., "In10".

However, these are daily predictions and I wanna see what would happen if I make weekly predictions.

Essentially, I want to use df.loc[Day1:Day5, In1:In10] to predict df.loc[Day5, "Out"].

Of course, "Out" would be redefined as cumulative returns over the last 5 days, rather than what it currently is i.e. daily returns.

The problem is, I have absolutely no idea how to go about making a single prediction from N samples (in this case, 5).

My X_train/X_test are DataFrames with the "Out" column dropped, and my y_train/y_test are Series of the "Out" column. I prefer this because I’m not entirely comfortable with arrays.

Is there a way to make scikit use N samples for a single prediction?
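One possible way to frame this, as a rough sketch rather than a definitive answer: flatten each block of N consecutive days into a single wide row, so that one row (and therefore one prediction) covers the whole window. The helper name `make_windows`, the `_d1` ... `_d5` column suffixes, and the random demo data below are all hypothetical, and the sketch assumes "Out" has already been redefined as the weekly label.

```python
import numpy as np
import pandas as pd


def make_windows(df, feature_cols, target_col, n=5, step=None):
    """Flatten every block of n consecutive rows into one sample.

    Hypothetical helper: step=n (the default) gives non-overlapping
    windows, step=1 gives a rolling window.
    """
    step = n if step is None else step
    rows, labels, index = [], [], []
    for end in range(n, len(df) + 1, step):
        window = df.iloc[end - n:end]
        # Row-major flatten: all features of day 1, then day 2, and so on.
        rows.append(window[feature_cols].to_numpy().ravel())
        # Assumption: the target on the window's last day is the weekly label.
        labels.append(window[target_col].iloc[-1])
        index.append(window.index[-1])
    columns = [f"{col}_d{day}" for day in range(1, n + 1) for col in feature_cols]
    X = pd.DataFrame(rows, index=index, columns=columns)
    y = pd.Series(labels, index=index, name=target_col)
    return X, y


# Made-up demo data in the same shape as the table above (3 features, 10 days).
rng = np.random.default_rng(0)
days = [f"Day{i}" for i in range(1, 11)]
features = ["In1", "In2", "In3"]
df = pd.DataFrame(
    rng.choice([-1, 1], size=(10, 4)), index=days, columns=features + ["Out"]
)

X, y = make_windows(df, features, "Out", n=5)
print(X.shape)        # (2, 15): two weekly samples, 5 days x 3 features each
print(list(X.index))  # ['Day5', 'Day10']
```

The resulting `X` and `y` stay a DataFrame and a Series, so they can be passed to an estimator's `fit(X, y)` and `predict(X)` as before; each prediction then consumes one flattened 5-day window. With `step=1` the windows overlap, so a chronological train/test split is probably safer than a shuffled one.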

submitted by /u/JebusWasAnAlien

[D] Reasons for small RNN size in Neural Architecture Search paper

In the Neural Architecture Search paper it is stated that the controller RNN (used to generate architectures) had only 35 units in each of its 2 layers. This very small size seems strange to me. My initial explanation was that the authors had too few samples, but they actually used 15,000, which should be enough to train a bigger network. So what, in your opinion, could be the reason for such a small network? Or why would making the controller bigger not influence the results?

submitted by /u/LuxuriousLime

[D] XGBoost Notes

Hi all,

I was studying the XGBoost paper a couple of weeks ago and took quite a few notes. These notes are not the basic kind; they involve step-by-step derivations of the mathematical functions. I could not find a study of the paper this complete and detailed, so I wanted to share. Please comment below if you see any mistakes. Any feedback or comment is welcome.

Link: https://drive.google.com/file/d/15l9oAlavzG8MYA7oCUAUfqdCjen6jSdg/view?usp=sharing

submitted by /u/_kty

[N] HuggingFace releases Transformers 2.0, a library for state-of-the-art NLP in TensorFlow 2.0 and PyTorch

HuggingFace has just released Transformers 2.0, a library for Natural Language Processing in TensorFlow 2.0 and PyTorch which provides state-of-the-art pretrained models for the most recent NLP architectures (BERT, GPT-2, XLNet, RoBERTa, DistilBert, XLM…), including several multilingual models.

An interesting feature is that the library provides deep interoperability between TensorFlow 2.0 and PyTorch.

You can move a full model seamlessly from one framework to the other during its lifetime (instead of just exporting a static computation graph at the end like with ONNX). This way it’s possible to get the best of both worlds by selecting the best framework for each step of training, evaluation, production, e.g. train on TPUs before finetuning/testing in PyTorch and finally deploy with TF-X.

An example in the readme shows how Bert can be finetuned on GLUE in a few lines of code with the high-level API tf.keras.Model.fit() and then loaded in PyTorch for quick and easy inspection and debugging.

As TensorFlow and PyTorch are getting closer, this kind of deep interoperability between both frameworks could become a new norm for multi-backend libraries.

Repo: https://github.com/huggingface/transformers

submitted by /u/Thomjazz

[D] Self-citation issue

I just stumbled upon a paper https://openreview.net/forum?id=HylxE1HKwS / https://arxiv.org/abs/1908.09791 with quite an interesting idea of training a single deep network that can be deployed at many efficiency configurations. But what’s more “interesting” is the number of self-citations in the paper. Seven of the cited publications had the third author’s name (who I assume is the PI). I feel that it is excessive. Correct me if I’m wrong. And the fact that this paper is heavily self-citing but isn’t acknowledging existing research that pursued a similar direction (e.g., AuxNet, BranchyNet, IDK Cascades, Stochastic Downsampling, Anytime Neural Networks) is worrying.

What do you think of the self-citation trend (if there’s any at all) in machine learning research?

submitted by /u/TreeNetworks

[P] Community Machine Learning Platform

Hi everyone!

I am wondering what people think of an idea that I’m looking at turning into reality: a community-centred Machine Learning platform!

Ideas:

Main page: Similar to Reddit where people can post their projects, research, questions and requests.

Projects: People can form long term groups to share code bases, road maps, problems and tasks. Projects might be centred around a research area, a project at work (companies can work together), or something you are making for fun. People can request to be part of projects, so if you spot something you want to be involved in you can join, and if you need help you can ask people to join.

Modules: People can upload Docker containers with a standard API, which anyone can run. Modules might be an algorithm, a model or a utility tool. These can be attached to projects, and you can browse a library of modules sorted into categories (BioInformatics, Computer Vision, NLP etc.). Maybe you could optionally charge for the use of modules you make?

The main goal is to create a collaborative environment, so companies, researchers, and anyone else can show off what they are doing, share ideas and problems, and work on projects together.

Questions:

Is this reinventing the wheel? Is Kaggle + Reddit + GitHub etc. good enough?

If you made a dream ML social platform, what would you add?

Thanks 🙂

From Tom

submitted by /u/zonkosoft