Category: Reddit MachineLearning

[P] Using protein sequences to make better classifiers in bioinformatics

As a data scientist in the bioinformatics field, I often found it useful to add features describing proteins to my models. These were often manually engineered or based on heuristics and alignments, and lacked information on the structure of the protein, as that data is relatively sparse.

Recently I found a paper by Bepler and Berger, published at ICLR 2019, where they created a set of models that use weak supervision to create protein embeddings. In this blog post I take a look at the theory behind this paper and present an intermediate-level tutorial for people who want to include these embeddings in their own models. A comprehensive analysis of the predictive power of these embeddings is also included.

https://stephanheijl.com/protein_sequence_ml.html

submitted by /u/Yuras_Stephan

[Discussion] Measuring run-time of complex machine learning pipelines

I am working on a fairly long pipeline for a complex project on my hands at the moment. There are several modules that form part of the pipeline. I am also under a strict end-to-end run-time SLA, so need to make sure my pipeline runs within N seconds. I’m investigating timing mechanisms for ML projects as a result. I’m working with Python.

All I have found in my brief search is the **timeit** module, which measures the execution time of small code snippets, and the simple time() function in the **time** module. These can obviously work, but neither seems like an integrated, elegant way to do it when there are multiple stages in the pipeline whose execution time you need to measure.
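One way to time several stages using only the standard library is a small context manager around `time.perf_counter`. This is a minimal sketch rather than a recommended tool: the `stage_timer` helper, the toy stages, and the `SLA_SECONDS` budget are all hypothetical names chosen for illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(name, timings):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Accumulate, so a stage that runs several times is summed.
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)


def run_pipeline(timings):
    """Toy three-stage pipeline; each stage stands in for a real module."""
    with stage_timer("load", timings):
        data = list(range(10_000))
    with stage_timer("transform", timings):
        squared = [x * x for x in data]
    with stage_timer("score", timings):
        return sum(squared)


SLA_SECONDS = 5.0  # assumed budget, standing in for the "N seconds" above

timings = {}
result = run_pipeline(timings)
total = sum(timings.values())

for name, seconds in timings.items():
    print(f"{name:<10} {seconds:.6f}s")
print(f"total      {total:.6f}s (within SLA: {total <= SLA_SECONDS})")
```

Because the timings live in a plain dictionary, the per-stage numbers can be logged or compared against the budget after every run.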

Does anyone have opinions on what is good, or any alternatives that people use? Cheers!

submitted by /u/humanager

[N] Deconstruct how Google Tulip was built by using serverless tech and machine learning

This is a 35-minute talk from GOTO Amsterdam 2019 by some of the team who helped build Google Tulip: Christiaan Hees, Matt Feigal and Lee Boonstra.

https://www.youtube.com/watch?v=Gv5stbV7XT8&feature=youtu.be&list=PLEx5khR4g7PKT9RvuVyQxJLO8CZUJzNMy

I’ve dropped in the talk abstract below for a read before jumping into the talk:

How to compose an application of multiple serverless components. Training an ML model for your needs with minimal training data and applying it in your application.

What will the audience learn from this talk?
How to compose an application of multiple serverless components. Training an ML model for your needs with minimal training data, and applying it in your application. Building and running your code in serverless Knative. Using Dialogflow to power user conversations.

Does it feature code examples and/or live coding?
It features live demos of our components and end product. We’ll show code and run scripts to train ML models and deploy our code.

submitted by /u/mto96

[Discussion] How do you prepare a new dataset for your ML/DL project?

I found many guidelines online on how to prepare, analyze and clean datasets in tabular form (e.g. from CSV files). Typically, they correlate the features, look for inliers/outliers in the dataset and remove duplicates as well as corrupt samples.
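For the tabular case, two of those steps can be sketched with the standard library alone. The sample rows, the `clean` and `outliers` helpers, and the 3.5 threshold are assumptions made for illustration; the outlier rule shown is the median-based modified z-score, one common choice among several, and feature correlation is left out.

```python
import statistics

# Hypothetical (sample_id, height_cm) rows with one exact duplicate,
# one corrupt row (missing value) and one implausible measurement.
rows = [
    ("a", 170.0),
    ("b", 172.0),
    ("b", 172.0),
    ("c", None),
    ("d", 168.0),
    ("e", 171.0),
    ("f", 169.0),
    ("g", 400.0),
]


def clean(rows):
    """Drop rows with a missing value, then exact duplicates (order kept)."""
    complete = [row for row in rows if None not in row]
    return list(dict.fromkeys(complete))


def outliers(rows, threshold=3.5):
    """Return rows whose modified z-score exceeds `threshold`."""
    values = [value for _, value in rows]
    median = statistics.median(values)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []
    return [row for row in rows if 0.6745 * abs(row[1] - median) / mad > threshold]


cleaned = clean(rows)
print(len(rows), "rows before cleaning,", len(cleaned), "after")
print("flagged as outliers:", outliers(cleaned))
```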

But how do you perform such steps on a raw dataset consisting only of images or text, as is typically the case in deep learning?

Let’s assume I just gathered 100’000 unlabeled images. Are there any tools or guidelines on how to start from there?

Thanks a lot for your input!

submitted by /u/rogi_o

[P] Gobbli: A Python Framework for Text Classification Projects

At my day job, we do a lot of text classification projects with small/medium size data. Recent advances in transfer learning for NLP have moved these types of projects from impossible to feasible, especially for batch classification tasks we see frequently on survey projects with free-text responses. Models like BERT have been documented for research, and in trying to use them we found ourselves spending a lot of time extending them to the non-benchmarking applications and datasets we were curious about. Given these issues, we built a framework for text classification projects that aims to make the consistent application of transfer learning and other models easier.

For a little more context, we started trying out BERT last year and new models continued to be rapidly released. Every time there was a new model there was a new API to learn. pytorch-transformers from HuggingFace helped a lot with this standardization issue, so we also took a look at what happens before a model is built (data processing and augmentation) and afterwards (model evaluation), and built supporting tools around those problems as well.

In addition, since most models require GPUs, we were spending a lot of time configuring environments, code, and data in tandem with Docker, which gets messy. Because of this, we’ve abstracted most of that orchestration out so that almost everything is Python code.

Details on the library are below. We’ve battle tested it on a few projects and are curious to have others kick the tires and give us feedback if you’re doing text classification.

submitted by /u/pbaumgartner_rti

[P] ClearHead.ai – A marketplace for machine learning models

https://clearhead.ai

Hello /r/MachineLearning

I’m the founder of ClearHead.ai, a marketplace for machine learning models that lets modelers upload their models via a Python SDK and lets developers use those models with a simple request to an API.

We are looking for some alpha (not ethological, more this-software-is-rough-around-the-edges) machine learning modelers who have models they would like to monetize, ideally image-based ones (CNNs, GANs) trained on a non-research dataset. If you are interested (both modelers and consumers), please sign up via the website or this link.

This approach is similar to Algorithmia, but we want to really focus in on the problem of model comparability and transparency.

I’ll be in the comments, if you have any questions, feedback or complaints.

submitted by /u/vishaan

[P] Access raw pointers of TensorFlow tensors

Hi, I collaborated on Redner (https://github.com/BachiLi/redner), a differentiable graphics renderer, over the summer. I did something uncommon there: accessing the raw pointers of TensorFlow tensors. Since I couldn’t find any example that does this, I thought it might be useful to share my experience here. Thank you and have a lovely day 🙂

http://typingducks.com/blog/tensorflow-tensor-pointer-access/

Br,

Seyoung

submitted by /u/SuperShinyEyes

[P] Command-line tool for Bayesian black-box optimization

Hi r/machinelearning,

I wrote a CLI tool that runs black-box Bayesian optimization on an arbitrary shell command.

https://gitlab.com/gwerbin/bayesopt-cli

Usage:

1. Define a “space” to optimize over, using the Scikit-optimize YAML specification format
2. Run the bayesopt tool
3. Save a summary of results from stdout, and/or load the full saved “optimization result” object from disk

There is an example in the repo that should make this usage clearer.

Internally, it’s just a wrapper for Scikit-optimize, which has been giving me good results. This code could easily be extended to use Hyperopt’s TPE optimizer, but I wanted to start simple with one backend.

It also lets you emit output from your command as JSON, with an option to extract a number from it using JMESPath.
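The general idea of scoring a shell command by a number inside its JSON output can be sketched as follows. This is not the tool's actual code or interface: `run_and_score` and `extract` are hypothetical helpers, the dotted key path is a simplified stand-in for a JMESPath expression, and the child Python process stands in for a real training script.

```python
import json
import subprocess
import sys


def extract(data, path):
    """Follow a dotted key path such as 'metrics.loss' through nested dicts."""
    for key in path.split("."):
        data = data[key]
    return float(data)


def run_and_score(command, path):
    """Run `command`, parse its stdout as JSON, return the number at `path`."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return extract(json.loads(completed.stdout), path)


# Stand-in for a training script: a child process that prints JSON to stdout.
command = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'metrics': {'loss': 0.25}}))",
]

score = run_and_score(command, "metrics.loss")
print(score)
```

An optimizer would call something like `run_and_score` once per candidate parameter set and treat the returned number as the objective value.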

Unfortunately installation is annoying at the moment. The Scikit-optimize devs seem to have abandoned the project, so this relies on a patched fork of the latest version. Hopefully that will get straightened out (community fork, maybe?) and I can distribute this like a normal Python package.

Let me know what you think! Is this useful to anyone other than me?

submitted by /u/nerdponx