Author: torontoai

Chaining Amazon SageMaker Ground Truth jobs to label progressively

Amazon SageMaker Ground Truth helps you build highly accurate training datasets for machine learning. It can reduce your labeling costs by up to 70% using automatic labeling.

This blog post explains the Amazon SageMaker Ground Truth chaining feature with a few examples and its potential in labeling your datasets. Chaining reduces time and cost significantly as Amazon SageMaker Ground Truth determines the objects that are already labeled and optimizes the data for automated data labeling mode. As a prerequisite, you might want to check the post “Creating hierarchical label taxonomies using Amazon SageMaker Ground Truth” that shows how to achieve multi-step hierarchical labeling and the documentation on how to use the augmented manifest functionality.

Chaining a labeling job

Chaining can help in the following scenarios:

  • Partially completed labeling job – A labeling job in which the input manifest already contains a few labels and the rest are still to be labeled.
  • Failed labeling job – A labeling job in which you generated a few labels successfully and the rest of the labels either failed or expired.
  • Stopped labeling job – A labeling job that a user stopped, which may have generated a few labels before stopping.

The chaining feature allows you to reuse these previous labels and get the remaining labels coherently. For more information, see Chaining labeling jobs.

Chaining uses the output from a previous job as the input for a subsequent job.

The following are the artifacts used to bootstrap the new chained labeling job:

  1. LabelAttributeName
  2. Output manifest file contents from the previous labeling job
  3. The model, if available

If you start a job from the Amazon SageMaker Ground Truth console, the labeling job name is used as the LabelAttributeName by default. For more information, see LabelAttributeName.

If you are chaining a partially completed job, the console uses the LabelAttributeName of the parent job to decide which object is already labeled and which is not, so that only unlabeled or previously failed objects are sent for labeling. You can override this behavior by providing a different LabelAttributeName, in which case the previous labels aren’t counted and a new labeling job sends all the data for labeling. This post describes this process in more detail later.
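The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Ground Truth's actual implementation: the helper name `split_by_label_attribute` is hypothetical, and the sketch assumes an object counts as labeled when its manifest line contains the label attribute name as a key.

```python
import json


def split_by_label_attribute(manifest_lines, label_attribute_name):
    """Partition JSON Lines manifest entries into (labeled, unlabeled).

    Assumption: an entry counts as labeled when it has a key equal to
    label_attribute_name. This mirrors the behavior described in the post,
    not the service's internal logic.
    """
    labeled, unlabeled = [], []
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if label_attribute_name in entry:
            labeled.append(entry)
        else:
            unlabeled.append(entry)
    return labeled, unlabeled


sample = [
    '{"source-ref": "s3://bucket_name/a.jpg", "Streetscenes-Job1": 0}',
    '{"source-ref": "s3://bucket_name/b.jpg"}',
]

# Same attribute name as the parent job: only b.jpg still needs a label.
done, todo = split_by_label_attribute(sample, "Streetscenes-Job1")

# A different attribute name: the previous labels are ignored, so both
# objects would be sent for labeling.
done_new, todo_new = split_by_label_attribute(sample, "Streetscenes-Job2")
```

Running a check like this on a parent job's output manifest before chaining gives a rough idea of how many objects the chained job still has to label.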

If you are using the API or SDK, you need to properly configure these fields, which this post describes later.

When you enable automated data labeling, Amazon SageMaker Ground Truth uses LabelAttributeName to decide which existing labels to use to start automated data labeling mode and whether you are eligible to train early. You get the most benefit from machine learning when labels already exist: reusing them reduces labeling cost because those objects are not sent to human labelers again.

Solution overview

The following diagram shows the workflow of this solution.

Step 1: Building the initial unlabeled dataset

Step 2: Launching a labeling job and stopping it (to simulate a stopped or failed status)

Step 3: Chaining your first job

Step 1: Building the initial unlabeled dataset

The first step is to build the initial unlabeled dataset. For more information about this process, see Step 1 in Creating hierarchical label taxonomies using Amazon SageMaker Ground Truth.

This post uses the CBCL StreetScenes dataset, which contains approximately 3547 images. The full dataset is approximately 2 GB; you may choose to upload some or all of the dataset to S3 for labeling. Complete the following steps:

  1. Download the zip file.
  2. Extract the .zip archive to a folder. By default, the folder is Output.
  3. Create a small sample dataset to work with, or use the entire dataset.

For more information about creating an input manifest, see Step 2 in Creating hierarchical label taxonomies using Amazon SageMaker Ground Truth.

The lines in the manifest appear as the following code:

{"source-ref":"s3://bucket_name/datasets/streetscenes/SSDB00001.JPG"}
{"source-ref":"s3://bucket_name/datasets/streetscenes/SSDB00006.JPG"}
{"source-ref":"s3://bucket_name/datasets/streetscenes/SSDB00016.JPG"}
... ...
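As a rough sketch of how such a manifest could be generated, the following Python builds one JSON line per image. The function name `build_input_manifest` and the list of file names are assumptions made for illustration; only the `source-ref` key and the S3 URI layout come from the lines shown above.

```python
import json


def build_input_manifest(bucket, prefix, file_names):
    """Return manifest text with one {"source-ref": ...} object per line."""
    lines = []
    for name in file_names:
        uri = f"s3://{bucket}/{prefix.strip('/')}/{name}"
        # Compact separators reproduce the formatting used in the post.
        lines.append(json.dumps({"source-ref": uri}, separators=(",", ":")))
    return "\n".join(lines) + "\n"


manifest_text = build_input_manifest(
    "bucket_name",
    "datasets/streetscenes",
    ["SSDB00001.JPG", "SSDB00006.JPG"],
)
print(manifest_text, end="")
```

The resulting text would then be saved as a file and uploaded to S3 so that the labeling job can use it as its input manifest.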

Step 2: Launching a labeling job and stopping it

From the console, start a labeling job using the Image classification task type to classify pictures as a vehicle, traffic signal, or pedestrian. Use the previously created manifest file as the input and Streetscenes-Job1 as the job name. For more information about starting a labeling job, see Amazon SageMaker Ground Truth – Build Highly Accurate Datasets and Reduce Labeling Costs by up to 70%.

To simulate the stopped or failed state, this post manually stopped the job after 1000 labels.

The output of the labeling job is written to an augmented manifest with the corresponding label augmented in each of the JSON lines in the manifest. Some of these have labels and some do not. See the following code:

{
  "source-ref": "s3://bucket_name/datasets/streetscenes/SSDB00001.JPG",
  "Streetscenes-Job1": 0,
  "Streetscenes-Job1-metadata": {
    "confidence": 0.95,
    "job-name": "labeling-job/streetscenes-job1",
    "class-name": "vehicles",
    "human-annotated": "yes",
    "creation-date": "2019-04-09T21:13:37.730999",
    "type": "groundtruth/image-classification"
  }
}
{"source-ref":"s3://bucket_name/datasets/streetscenes/SSDB00002.JPG"}
{
  "source-ref": "s3://bucket_name/datasets/streetscenes/SSDB00003.JPG",
  "Streetscenes-Job1": 1,
  "Streetscenes-Job1-metadata": {
    "confidence": 0.95,
    "job-name": "labeling-job/streetscenes-job1",
    "class-name": "traffic signals",
    "human-annotated": "yes",
    "creation-date": "2019-04-09T21:25:51.111094",
    "type": "groundtruth/image-classification"
  }
}
{"source-ref":"s3://bucket_name/datasets/streetscenes/SSDB00004.JPG"}
{"source-ref":"s3://bucket_name/datasets/streetscenes/SSDB00005.JPG"}
{"source-ref":"s3://bucket_name/datasets/streetscenes/SSDB00006.JPG"}
{"source-ref":"s3://bucket_name/datasets/streetscenes/SSDB00007.JPG"}
{
  "source-ref": "s3://bucket_name/datasets/streetscenes/SSDB00008.JPG",
  "Streetscenes-Job1": 0,
  "Streetscenes-Job1-metadata": {
    "confidence": 0.95,
    "job-name": "labeling-job/streetscenes-job1",
    "class-name": "vehicles",
    "human-annotated": "yes",
    "creation-date": "2019-04-09T21:28:54.752427",
    "type": "groundtruth/image-classification"
  }
}
...
...
...

For more information about the format for different modalities, see Output Data.

Step 3: Chaining your first job

You can now chain Streetscenes-Job1. In Labeling jobs, from the Actions dropdown, choose Chain.

The console pre-populates the input dataset location as it fetches the output manifest from the previous stopped job. The label attribute name remains the same as the previous job.

After the job starts, the console shows the counter as 1000, which reflects the data already labeled.

After the job is complete, all labels are generated.

The following code is from the output manifest. Every line in the output manifest now has a label:

{
  "source-ref": "s3://bucket_name/datasets/streetscenes/SSDB00006.JPG",
  "Streetscenes-Job1": 3,
  "Streetscenes-Job1-metadata": {
    "confidence": 0.59,
    "job-name": "labeling-job/streetscenes-job1-chain",
    "class-name": "None",
    "human-annotated": "yes",
    "creation-date": "2019-04-10T01:37:07.663801",
    "type": "groundtruth/image-classification"
  }
}

{
  "source-ref": "s3://bucket_name/datasets/streetscenes/SSDB00007.JPG",
  "Streetscenes-Job1": 0,
  "Streetscenes-Job1-metadata": {
    "job-name": "labeling-job/streetscenes-job1-chain",
    "confidence": 0.99,
    "class-name": "vehicles",
    "type": "groundtruth/image-classification",
    "creation-date": "2019-04-10T01:23:05.309990",
    "human-annotated": "no"
  }
}

...
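To confirm that a chained job really left nothing unlabeled, and to see how many labels came from humans versus automated labeling, the output manifest can be summarized with a short script. This is a sketch under assumptions: the helper name `summarize_output_manifest` is hypothetical, each manifest entry is assumed to sit on a single line, and any labeled entry whose `human-annotated` value is not `"yes"` is counted as automatically labeled.

```python
import json
from collections import Counter


def summarize_output_manifest(manifest_lines, label_attribute_name):
    """Count labeled, unlabeled, human- and auto-annotated entries."""
    metadata_key = f"{label_attribute_name}-metadata"
    summary = {"total": 0, "unlabeled": 0, "human": 0, "auto": 0}
    classes = Counter()
    for line in manifest_lines:
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        summary["total"] += 1
        if label_attribute_name not in entry:
            summary["unlabeled"] += 1
            continue
        metadata = entry.get(metadata_key, {})
        # Assumption: anything other than "yes" means an automated label.
        if metadata.get("human-annotated") == "yes":
            summary["human"] += 1
        else:
            summary["auto"] += 1
        classes[metadata.get("class-name")] += 1
    return summary, classes


# Compact versions of the two entries shown above.
sample = [
    json.dumps({
        "source-ref": "s3://bucket_name/datasets/streetscenes/SSDB00006.JPG",
        "Streetscenes-Job1": 3,
        "Streetscenes-Job1-metadata": {
            "class-name": "None", "human-annotated": "yes"},
    }),
    json.dumps({
        "source-ref": "s3://bucket_name/datasets/streetscenes/SSDB00007.JPG",
        "Streetscenes-Job1": 0,
        "Streetscenes-Job1-metadata": {
            "class-name": "vehicles", "human-annotated": "no"},
    }),
]

summary, classes = summarize_output_manifest(sample, "Streetscenes-Job1")
print(summary)
print(dict(classes))
```

An `unlabeled` count of zero corresponds to the completed state described above.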

Chaining in a series

The previous scenarios only showed one level of chaining. Chaining is a powerful feature in which you can feed the output of one job as input to another.

Scenarios for chaining

The following table shows some of the scenarios with which you can experiment with chaining. AL indicates that automated data labeling mode is enabled. Non-AL indicates that automated data labeling mode is not enabled. For more information, see Annotate data for less with Amazon SageMaker Ground Truth and automated data labeling.

# | Parent labeling job | Chained labeling job | Details
1 | Non-AL | Non-AL | You started a labeling job in Non-AL mode and it failed or stopped before labeling all the objects. You want to resume the job in Non-AL mode so that humans label the remaining unlabeled objects.
2 | Non-AL | AL | You started a labeling job in Non-AL mode and it failed or stopped before labeling all the objects. You want to resume the job in AL mode to label the remaining unlabeled objects automatically based on the existing labels.
3 | AL | Non-AL | You started a labeling job in AL mode and it failed or stopped before labeling all the objects. You want to resume the job in Non-AL mode so that humans label the remaining unlabeled objects.
4 | AL | AL | You started a labeling job in AL mode and it failed or stopped before labeling all the objects. You want to resume the job in AL mode to label the remaining unlabeled objects automatically based on the existing labels or pre-trained models.
5 | Third-party labels | Non-AL | You acquired some labels through other sources (Amazon SageMaker Ground Truth or a third party) and have a manifest with labeled objects and unlabeled data. You want to start a new job in Non-AL mode so that humans label the remaining unlabeled objects.
6 | Third-party labels | AL | You acquired some labels through other sources (Amazon SageMaker Ground Truth or a third party) and have a manifest with labeled objects and unlabeled data. You want to start a new job in AL mode to label the remaining unlabeled objects automatically based on the existing labels.

In some of these scenarios, if you are in AL mode and the job stops after a model is generated, the subsequent AL job uses the model from the first step, which reduces training time. For more information, see Amazon SageMaker Ground Truth: Using a Pre-Trained Model for Faster Data Labeling.

Additionally, if enough pre-labeled objects are available, you can bootstrap these labels to be the training set for your automated labeling loop. This method saves on time and cost by not fetching labels from human annotators.

Using third-party labels

This section elaborates on the final two scenarios in the previous table. You can bring in third-party labels as long as they adhere to the Amazon SageMaker Ground Truth label format. For more information, see Output Data.

For example, assume you have a job in which the manifest has 989/3450 third-party labels. You can start the labeling job with the following code, which contains third-party labels:

{
  "source-ref": "s3://bucket-name/datasets/streetscenes/SSDB03295.JPG",
  "third-party-label": 0,
  "third-party-label-metadata": {
    "confidence": 0.95,
    "job-name": "labeling-job/third-party-label",
    "class-name": "vehicles",
    "human-annotated": "yes",
    "creation-date": "2019-04-09T21:25:51.110794",
    "type": "groundtruth/image-classification"
  }
}
...

After the job starts, it automatically updates the counter.
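If your third-party labels live in some other structure, a small conversion step can emit manifest lines shaped like the example above. The sketch below is only an illustration: the function name `to_ground_truth_line` is hypothetical, the default confidence of 0.95 is copied from the sample rather than derived from anything, and the field set simply mirrors the example. Check the Output Data documentation for what your task type expects.

```python
import json
from datetime import datetime, timezone


def to_ground_truth_line(source_ref, class_index, class_name,
                         label_attribute_name="third-party-label",
                         confidence=0.95, created=None):
    """Return one manifest line shaped like the third-party example."""
    if created is None:
        # Naive UTC timestamp, matching the format in the sample.
        created = datetime.now(timezone.utc).replace(tzinfo=None)
    entry = {
        "source-ref": source_ref,
        label_attribute_name: class_index,
        f"{label_attribute_name}-metadata": {
            "confidence": confidence,  # assumed value, copied from the sample
            "job-name": f"labeling-job/{label_attribute_name}",
            "class-name": class_name,
            "human-annotated": "yes",
            "creation-date": created.isoformat(),
            "type": "groundtruth/image-classification",
        },
    }
    return json.dumps(entry)


line = to_ground_truth_line(
    "s3://bucket-name/datasets/streetscenes/SSDB03295.JPG",
    0,
    "vehicles",
    created=datetime(2019, 4, 9, 21, 25, 51, 110794),
)
print(line)
```

Objects without a third-party label keep the plain `source-ref` form, so the job treats them as unlabeled when the same label attribute name is used.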

Time and cost savings

Chaining offers many time- and cost-saving benefits.

First, objects that are already labeled aren’t processed again. Additionally, if automated data labeling is enabled, auto labeling is attempted as soon as possible. If your data is already partially labeled, a validation set is collected by sending work to a human workforce, after which the partially labeled input data is bootstrapped as the training set, and Amazon SageMaker Ground Truth performs automated labeling depending on the number of existing labels. This expedites automated data labeling: training starts sooner, which reduces the training job’s overall time.

Furthermore, skipping labeled objects reduces costs. Training costs are also reduced by using the ML model generated from your existing data.

Chaining using the API

You can also use the API or AWS CLI to do chaining. For more information, see create-labeling-job.

If you have a failed job and want to resume it, you need to enter the same create-labeling-job information as the failed job, with the same LabelAttributeName as the previous job, and use the output manifest file as the input in your chained job.

Similarly, if you want to chain the job for labeling all the objects with a different kind of label, you need to use a different LabelAttributeName than the one in the previous labeling job.

The following code is an example CLI for chaining:

aws sagemaker create-labeling-job \
  --labeling-job-name "Streetscenes-Job1-chain" \
  --label-attribute-name "Streetscenes-Job1" \
  --input-config DataSource={S3DataSource={ManifestS3Uri="s3://<bucket_name>/streetscenes/output/Streetscenes-Job1/manifests/output/output.manifest"}},DataAttributes={ContentClassifiers=["FreeOfPersonallyIdentifiableInformation"]} \
  --output-config S3OutputPath="s3://<bucket_name>/streetscenes/output/Streetscenes-Job1-chain/" \
  --role-arn "arn:aws:iam::accountID:role/<rolename>" \
  --label-category-config-s3-uri "s3://<path_to_label_category_file>/labelcategory.json" \
  --stopping-conditions MaxPercentageOfInputDatasetLabeled=100 \
  --human-task-config WorkteamArn="arn:aws:sagemaker:region:394669845002:workteam/public-crowd/default",UiConfig={UiTemplateS3Uri="s3://<bucket_name>/template.liquid"},PreHumanTaskLambdaArn="arn:aws:lambda:us-west-2:081040173940:function:PRE-ImageMultiClass",TaskKeywords="Images","classification",TaskTitle="Image Categorization",TaskDescription="Categorize images into specific classes",NumberOfHumanWorkersPerDataObject=3,TaskTimeLimitInSeconds=300,TaskAvailabilityLifetimeInSeconds=21600,MaxConcurrentTaskCount=1000,AnnotationConsolidationConfig={AnnotationConsolidationLambdaArn="arn:aws:lambda:us-west-2:081040173940:function:ACS-ImageMultiClass"}

This code uses the same label attribute name (label-attribute-name) as the first job, Streetscenes-Job1.
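If you use the AWS SDK for Python instead of the CLI, the same idea can be expressed as a small helper that derives the chained request from the parent job's request. This is a hedged sketch: `build_chained_request` is a hypothetical helper, the parent request is trimmed to the fields the helper touches (a real request also needs RoleArn, HumanTaskConfig, and the other settings shown in the CLI command), and the parent's input manifest path is a made-up placeholder.

```python
import copy


def build_chained_request(parent_request, parent_output_manifest_uri,
                          chained_job_name, chained_output_path,
                          label_attribute_name=None):
    """Derive a chained labeling job request from the parent's request.

    By default the parent's LabelAttributeName is kept, so existing labels
    are reused. Pass a different label_attribute_name to relabel everything.
    """
    request = copy.deepcopy(parent_request)  # leave the parent untouched
    request["LabelingJobName"] = chained_job_name
    s3_source = request["InputConfig"]["DataSource"]["S3DataSource"]
    s3_source["ManifestS3Uri"] = parent_output_manifest_uri
    request["OutputConfig"]["S3OutputPath"] = chained_output_path
    if label_attribute_name is not None:
        request["LabelAttributeName"] = label_attribute_name
    return request


# Trimmed parent request; the input manifest path is a placeholder.
parent = {
    "LabelingJobName": "Streetscenes-Job1",
    "LabelAttributeName": "Streetscenes-Job1",
    "InputConfig": {"DataSource": {"S3DataSource": {
        "ManifestS3Uri": "s3://<bucket_name>/streetscenes/input.manifest"}}},
    "OutputConfig": {
        "S3OutputPath": "s3://<bucket_name>/streetscenes/output/"},
}

chained = build_chained_request(
    parent,
    "s3://<bucket_name>/streetscenes/output/Streetscenes-Job1/manifests/output/output.manifest",
    "Streetscenes-Job1-chain",
    "s3://<bucket_name>/streetscenes/output/Streetscenes-Job1-chain/",
)

# With a complete request you could then call, for example:
# boto3.client("sagemaker").create_labeling_job(**chained)
```

Keeping the label attribute name as the default makes resuming a failed or stopped job the easy path, while relabeling with a new attribute name has to be requested explicitly.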

Conclusion

This post demonstrated how the Amazon SageMaker Ground Truth chaining feature saves time and reduces cost. It is a powerful feature, and this post merely scratches the surface of what chaining can do. Let us know what you think in the comments. You can get started with Amazon SageMaker Ground Truth by visiting the Getting Started page in the documentation.


About the authors

Priyanka Gopalakrishna is a software engineer at Amazon AI. She works on building scalable solutions using distributed systems for machine learning. In her spare time, she loves to hike, catch up on things related to space sciences, or read good old strips of Calvin and Hobbes.

Zahid Rahman is an SDE in AWS AI, where he builds large-scale distributed systems to solve complex machine learning problems. He is primarily focused on innovating technologies that can ‘Divide and Conquer’ Big Data problems.

[Discussion] Advice needed: Feeling trapped by lack of management/strategy, no implemented models.

(also posted to r/datascience but I realized this community is almost 10x bigger)

Hey all,

I’m a data scientist and looking to reddit for some advice… I’ve been in this role for about two years and have been the only data scientist that entire time – this was also my first data scientist role. Since entering into the new ‘data’ team about a year ago, we’ve been continuously plagued with issues like:

  • strange and distracting projects by our boss (who has no analytics experience, but a long career of software developer management) – and a lack of him being able to understand the scope and true effort required for these random projects
  • lack of interest in hearing from Sr DAs and me on what our ideal working environments would be (warehouse design, what we can put on our VMs, etc)
  • lack of him working with leadership to build an actual understanding of business needs
  • exerting random/arbitrary control over how things get done (I’ve never seen him do this and it end up being a benefit to a project)

I’ve stayed in the role mainly because it was my foot in the door to this industry (which I am very grateful for), and at the beginning (and to this day, really) the number of possibilities here is huge and exciting… if they could ever be executed properly as a team. And, to make it even more difficult, my boss is an overall great guy – I just don’t think he has the mental horsepower for such a huge change this late into his career.

My main predicament is that I’ve been tasked with building out a customer engagement ‘engine’ of sorts – so attempting to predict individual customers likely to leave, and also understand customer cohorts that are more engaged. I’m approaching it similar to a customer churn model, but with a few differences. This has been hyped for a year – the board of directors is aware of it, and so is everyone in leadership. To say the least: the hype around the team he was supposed to build is huge. That pendulum is beginning to swing back in.

The problem is that because of my manager’s disjointed priorities, we have had no progress in building out a warehouse or pipeline that helps a data scientist/me in any way. So, I’m spending time crafting ETL around extremely messy and unreliable system data which has cost me a few months just to implement that – and it isn’t done, of course. He has made very little progress in figuring out that his mental model of what a data scientist does is mostly not true and that the slowness of this project is a result of the past year’s lack of a decent strategy.

And just for a quick, very cringe, example: a few weeks ago he was sweet-talked by a Harvard MBA type into trialing yet-another-vendor’s autoML solution, thinking that the reason why my project has taken months to get off of the ground was simply because I was having trouble building decent models (I haven’t been able to train a model, here, for months because the data didn’t exist in any usable way!). And this is *not* because I’ve been quiet about the real challenges – he simply does not listen to other points of view unless it’s coming down to him from his boss.

But – to be fair – I made several mistakes when joining his team- my #1 one being that I doubted my intuition around the team’s strategy. If someone has 10+ years experience managing software teams, he’s this confident, and I’m this new to the role, then I need to stop challenging the status quo he’s putting together.

Recently I learned that – finally – the screws are coming down on him from his boss and he’s been told to do several of the things I suggested months ago (petty for me to mention, but it feels good and is validating). But rather than him reaching out to me for advice on changing strategy, or what we can do to accelerate the project, it’s more of the same. It’s also too late at this point for me to give suggestions that our small team could do before the end of the year.

Skip to here if you don’t wanna read:

Previously, I’ve said to myself that I will stay at this company until I can put into production at least one model. The question I’m coming to is, what do I do when the timeline for that keeps extending indefinitely and you’ve lost faith in management to be where you need them to be? What do I do in my job hunt when they see I’ve been in a DS role for ~2 years and never got to implement a model into production? If I was a hiring manager, I’d assume to some extent that this person wasn’t in a real data scientist role and would doubt my skills/abilities. Of course I could just lie in an interview- but that feels extremely gross.

My solution so far is to do a few ambitious personal projects that flex on modeling, python ability, and creativity. But we all know that (at least traditionally), your professional experience is the most important factor.

So, if anyone has words of encouragement, discouragement, suggestions, whatever- I’d love to hear it. I doubt my situation is truly unique – and I also know things could be much worse. I am thankful to have gotten my foot in the door, which can be quite hard.

Thanks for reading

submitted by /u/low_life_walrus

NVIDIA and Microsoft Team Up to Aid AI Startups

NVIDIA and Microsoft are teaming up to provide the world’s most innovative young companies with access to their respective accelerator programs for AI startups.

Members of NVIDIA Inception and Microsoft for Startups can now receive all the benefits of both programs — including technology, training, go-to-market support and NVIDIA GPU credits in the Azure cloud — to continue growing and solving some of the world’s most complex problems.

The announcement was made at Slush, a startup event taking place this week in Helsinki.

With a variety of tools, technology and resources — including NVIDIA GPU cloud instances on Azure — AI startups can move into production and deployment faster.

NVIDIA and Microsoft will evaluate what startups in the joint program need, and how NVIDIA Inception and Microsoft for Startups can help them achieve their goals.

NVIDIA Inception members are eligible for the following benefits from Microsoft for Startups:

  • Free access to specific Microsoft technologies suited to every startup’s needs, including up to $120,000 in free credits in the Azure cloud
  • Go-to-market resources to help startups sell alongside Microsoft’s global sales channels

Microsoft for Startups members can access the following benefits from NVIDIA Inception:

  • Technology expertise on implementing GPU applications and hardware
  • Free access to NVIDIA Deep Learning Institute online courses, such as “Fundamentals of Deep Learning for Computer Vision” and “Accelerating Data Science”
  • Unlimited access to DevTalk, a forum for technical inquiries and community engagement
  • Go-to-market assistance and hardware discounts across the NVIDIA portfolio, from NVIDIA DGX AI systems to NVIDIA Jetson embedded computing platforms

Microsoft for Startups is a global program designed to support startups as they create and expand their companies. Since its launch in 2018, thousands of startups have applied and are active in the program. Microsoft for Startups members are on course to drive $1 billion in pipeline opportunity by the end of 2020.

NVIDIA Inception is a virtual accelerator program that supports startups harnessing GPUs for AI and data science applications during critical stages of product development, prototyping and deployment. Since its launch in 2016, the program has expanded to over 5,000 companies.

The post NVIDIA and Microsoft Team Up to Aid AI Startups appeared first on The Official NVIDIA Blog.

[R] Research Guide: Model Distillation Techniques for Deep Learning

Knowledge distillation is a model compression technique whereby a small network (student) is taught by a larger trained neural network (teacher). The smaller network is trained to behave like the large neural network. This enables the deployment of such models on small devices such as mobile phones or other edge devices. In this guide, we’ll look at a couple of papers that attempt to tackle this challenge.

https://heartbeat.fritz.ai/research-guide-model-distillation-techniques-for-deep-learning-4a100801c0eb

submitted by /u/mwitiderrick

[P] Machine Learning Flight Rules

A guide for astronauts (now, people doing machine learning) about what to do when things go wrong.

GitHub: https://github.com/bkkaggle/machine-learning-flight-rules

Product Hunt: https://www.producthunt.com/posts/machine-learning-flight-rules

There’s a lot of “hidden knowledge” online on places like Stack Overflow, Kaggle, and the PyTorch discussion forums that is really useful but not easily accessible to people who are just getting started with machine learning. This is why I made Machine Learning Flight Rules. This GitHub repo compiles all of the things I have learned over the last two years about best practices, common mistakes, and little-known tricks when training neural networks.

I’ve tried to make sure that all the information in this repository is accurate, but if you find something that you think is wrong, please let me know by opening an issue. This repository is still a work in progress, so if you find a bug, think there is something missing, or have any suggestions for new features, feel free to open an issue or a pull request. Feel free to use the library or code from it in your own projects, and if you feel that some code used in this project hasn’t been properly credited, please open an issue.

I named this project after the awesome Git Flight Rules project (https://github.com/k88hudson/git-flight-rules). I took a lot of tips from both Andrej Karpathy’s blog post on a recipe for training neural networks (https://karpathy.github.io/2019/04/25/recipe/) and the Amid Fish blog post on lessons learned when reproducing a deep reinforcement learning paper (http://amid.fish/reproducing-deep-rl).

submitted by /u/16yoMLDev