Author: torontoai

Saved by the Spell: Serkan Piantino’s Company Makes AI for Everyone

Spell, founded by Serkan Piantino, is making machine learning as easy as ABC.

Piantino, CEO of the New York-based startup and former director of engineering for Facebook AI Research, explained to AI Podcast host Noah Kravitz how he’s bringing compute power to those that don’t have easy access to GPU clusters.

Spell provides access to hardware as well as a software interface that accelerates execution. Piantino reported that a wide variety of industries has shown interest in Spell, from healthcare to retail, as well as researchers and academia.

Key Points From This Episode

  • Spell’s basic tool is a command line, which has users type “spell run” before code that they previously would’ve run locally. Spell will then snapshot the code, find any necessary data and move that computation onto relevant hardware in the cloud.
  • Spell’s platform provides a collaborative workspace in which clients within an organization can work together on their Jupyter Notebooks and Labs.
  • Users can choose what type of GPU they require for their machine learning experiment, and Spell will run it on the corresponding hardware in the cloud.

Tweetables

“You know there’s some upfront cost to running an experiment, but if you get that cost down low enough, it disappears mentally” — Serkan Piantino [11:52]

“Providing access to hardware and making things easier — giving everybody the same sort of beautiful compute cluster that giant research organizations work on — was a really powerful idea” — Serkan Piantino [18:36]

You Might Also Like

NVIDIA Chief Scientist Bill Dally on How GPUs Ignited AI, and Where His Team’s Headed Next

Deep learning icon and NVIDIA Chief Scientist Bill Dally reflects on his career in AI and offers insight into the AI revolution made possible by GPU-driven deep learning. He shares his predictions on where AI is going next: more powerful algorithms for inference, and neural networks that can train on less data.

Speed Reader: Evolution AI Accelerates Data Processing with AI

Across industries, employees spend valuable time processing mountains of paperwork. Evolution AI, a U.K. startup and NVIDIA Inception member, has developed an AI platform that extracts and understands information rapidly. Evolution AI Chief Scientist Martin Goodson explains the variety of problems that the company can solve.

Striking a Chord: Anthem Helps Patients Navigate Healthcare with Ease

Health insurance company Anthem helps patients personalize and better understand their healthcare information through AI. Rajeev Ronanki, senior vice president and chief digital officer at Anthem, explains how the company gives users the opportunity to schedule video consultations and book doctor’s appointments virtually.

Tune in to the AI Podcast

Get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

  

Make Our Podcast Better

Have a few minutes to spare? Fill out this short listener survey. Your answers will help us make a better podcast.

The post Saved by the Spell: Serkan Piantino’s Company Makes AI for Everyone appeared first on The Official NVIDIA Blog.

Building a custom classifier using Amazon Comprehend

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to find insights and relationships in texts. Amazon Comprehend identifies the language of the text; extracts key phrases, places, people, brands, or events; and understands how positive or negative the text is. For more information about everything Amazon Comprehend can do, see Amazon Comprehend Features.

You may need NLP capabilities tailored to your business without having to run a research phase. For example, you may want to recognize entity types and classify documents in ways that are unique to your business, such as recognizing industry-specific terms and triaging customer feedback into different categories.

Amazon Comprehend is a perfect match for these use cases. In November 2018, Amazon Comprehend added the ability for you to train it to recognize custom entities and perform custom classification. For more information, see Build Your Own Natural Language Models on AWS (no ML experience required).

This post demonstrates how to build a custom text classifier that can assign a specific label to a given text. No prior ML knowledge is required.

About this blog post
  Time to complete: 1 hour for the reduced dataset; 2 hours for the full dataset
  Cost to complete: ~$50 for the reduced dataset; ~$150 for the full dataset. These include training, inference, and model management; see Amazon Comprehend pricing for more details.
  Learning level: Advanced (300)
  AWS services: Amazon Comprehend, Amazon S3, AWS Cloud9

Prerequisites

To complete this walkthrough, you need an AWS account and access to create resources in AWS IAM, Amazon S3, Amazon Comprehend, and AWS Cloud9 within that account.

This post uses the Yahoo answers corpus cited in the paper Text Understanding from Scratch by Xiang Zhang and Yann LeCun. This dataset is available on the AWS Open Data Registry.

You can also use your own dataset. It is recommended that you train your model with up to 1,000 training documents for each label, and that you choose labels that are clear and don’t overlap in meaning. For more information, see Training a Custom Classifier.

Solution overview

The walkthrough includes the following steps:

  1. Preparing your environment
  2. Creating an S3 bucket
  3. Setting up IAM
  4. Preparing data
  5. Training the custom classifier
  6. Gathering results

For more information about how to build a custom entity recognizer to extract information such as people and organization names, locations, time expressions, and numerical values from a document, see Build a custom entity recognizer using Amazon Comprehend.

Preparing your environment

In this post, you use the AWS CLI as much as possible to speed up the experiment.

AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with a browser. It includes a code editor, debugger, and terminal. AWS Cloud9 comes pre-packaged with essential tools for popular programming languages and the AWS CLI pre-installed, so you don’t need to install files or configure your laptop for this workshop.

Your AWS Cloud9 environment has access to the same AWS resources as the user with which you logged in to the AWS Management Console.

To prepare your environment, complete the following steps:

  1. On the console, under Services, choose AWS Cloud9.
  2. Choose Create environment.
  3. For Name, enter CustomClassifier.
  4. Choose Next step.
  5. Under Environment settings, change the instance type to t2.large.
  6. Leave other settings at their defaults.
  7. Choose Next step.
  8. Review the environment settings and choose Create environment.

It can take up to a few minutes for your environment to be provisioned and prepared. When the environment is ready, your IDE opens to a welcome screen, which contains a terminal prompt.

You can run AWS CLI commands in this prompt the same as you would on your local computer.

  1. To verify that your user is logged in, enter the following command:
Admin:~/environment $ aws sts get-caller-identity

You get the following output which indicates your account and user information:

{
    "Account": "123456789012",
    "UserId": "AKIAI53HQ7LSLEXAMPLE",
    "Arn": "arn:aws:iam::123456789012:user/Colin"
}
  2. Record the account ID to use in the next step.

Keep your AWS Cloud9 IDE opened in a tab throughout this walkthrough.

Creating an S3 bucket

Use the account ID from the previous step to create a globally unique bucket name, such as 123456789012-comprehend. Enter the following command in your AWS Cloud9 terminal prompt:

Admin:~/environment $ aws s3api create-bucket --acl private --bucket '123456789012-comprehend' --region us-east-1

The output shows the name of the bucket you created:

{
    "Location": "/123456789012-comprehend"
}
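The naming convention above can be sketched in a few lines of Python. This is only an illustration: the helper name bucket_name_for and the default suffix are assumptions made for this example, and the check covers only the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; a letter or digit at each end).

```python
import re


def bucket_name_for(account_id: str, suffix: str = "comprehend") -> str:
    """Build '<account-id>-<suffix>' and check it against basic S3 naming rules.

    Hypothetical helper for this walkthrough, not part of any AWS SDK.
    """
    name = f"{account_id}-{suffix}".lower()
    # 3-63 characters, lowercase letters, digits, dots and hyphens,
    # starting and ending with a letter or a digit.
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name):
        raise ValueError(f"not a valid S3 bucket name: {name!r}")
    return name


print(bucket_name_for("123456789012"))
```

Prefixing the name with the account ID is a cheap way to make a collision with someone else's bucket unlikely, because bucket names share one global namespace.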

Setting up IAM

To authorize Amazon Comprehend to read from and write to your bucket during training and inference, you must grant Amazon Comprehend access to the S3 bucket that you created. To do so, you create a data access role in your account that trusts the Amazon Comprehend service principal.

To set up IAM, complete the following steps:

  1. On the console, under Services, choose IAM.
  2. Choose Roles.
  3. Choose Create role.
  4. Select AWS service as the type of trusted entity.
  5. Choose Comprehend as the service that uses this role.
  6. Choose Next: Permissions.

The Policy named ComprehendDataAccessRolePolicy is automatically attached.

  7. Choose Next: Review.
  8. For Role name, enter ComprehendBucketAccessRole.
  9. Choose Create role.
  10. Record the Role ARN.

You use this ARN when you launch the training of your custom classifier.
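For readers who want to see what the console steps above amount to, here is a minimal sketch of the trust relationship and of the ARN format you record. The trust policy is the standard shape for a role assumed by a service principal; the helper role_arn is a name made up for this example.

```python
import json

# Trust policy that lets the Amazon Comprehend service principal assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "comprehend.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def role_arn(account_id: str, role_name: str) -> str:
    """Hypothetical helper: format the ARN of an IAM role in the given account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


print(json.dumps(trust_policy, indent=2))
print(role_arn("123456789012", "ComprehendBucketAccessRole"))
```

Note that the IAM ARN has an empty region field, which is why two colons appear back to back with no space between them and the account ID.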

Preparing data

In this step, you download the corpus and prepare the data to match Amazon Comprehend’s expected formats for both training and inference. This post provides a script to help you achieve the data preparation for your dataset.

Alternatively, for even more convenience, you can download the prepared data by entering the following two command lines:

Admin:~/environment $ aws s3 cp s3://aws-ml-blog/artifacts/comprehend-custom-classification/comprehend-test.csv . 

Admin:~/environment $ aws s3 cp s3://aws-ml-blog/artifacts/comprehend-custom-classification/comprehend-train.csv .

If you follow the preceding step, skip the next steps and go directly to the upload part at the end of this section.

If you want to go through the dataset preparation for this walkthrough, or if you are using your own data, follow the next steps:

Enter the following command in your AWS Cloud9 terminal prompt to download the corpus from the AWS Open Data registry:

Admin:~/environment $ aws s3 cp s3://fast-ai-nlp/yahoo_answers_csv.tgz .

You see a progress bar and then the following output:

download: s3://fast-ai-nlp/yahoo_answers_csv.tgz to ./yahoo_answers_csv.tgz

Uncompress it with the following command:

Admin:~/environment $ tar xvzf yahoo_answers_csv.tgz

You should delete the archive because you are limited in available space in your AWS Cloud9 environment. Use the following command:

Admin:~/environment $ rm -f yahoo_answers_csv.tgz

You get a folder yahoo_answers_csv, which contains the following four files:

  • classes.txt
  • readme.txt
  • test.csv
  • train.csv

The files train.csv and test.csv contain the training samples as comma-separated values. There are four columns in them, corresponding to class index (1 to 10), question title, question content, and best answer. The text fields are escaped using double quotes ("), and any internal double quote is escaped by two double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".

The following lines are a truncated sample of the file content:

"5","why doesn't an optical mouse work on a glass table?","or even on some surfaces?","Optical mice use an LED
"6","What is the best off-road motorcycle trail ?","long-distance trail throughout CA","i hear that the mojave
"3","What is Trans Fat? How to reduce that?","I heard that tras fat is bad for the body.  Why is that? Where ca
"7","How many planes Fedex has?","I heard that it is the largest airline in the world","according to the www.fe
"7","In the san francisco bay area, does it make sense to rent or buy ?","the prices of rent and the price of b
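The escaping rules described above map directly onto Python's csv module, which handles doubled quotes by default. The sketch below is illustrative only: the sample row is made up for this example (it is not taken from the corpus), and read_rows is a hypothetical helper name.

```python
import csv
import io

# Made-up row in the same four-column layout; the raw string keeps "\n" as
# the two literal characters backslash + n, as in the corpus files.
sample = r'"5","What is a ""trans"" fat?","I heard it is bad.","First line\nSecond line"'


def read_rows(handle):
    """Yield (class_index, title, content, answer) with escapes undone."""
    for class_index, title, content, answer in csv.reader(handle):
        # csv.reader already turns "" into a single double quote;
        # the backslash-n sequences still have to be converted by hand.
        yield (
            int(class_index),
            title,
            content.replace("\\n", "\n"),
            answer.replace("\\n", "\n"),
        )


rows = list(read_rows(io.StringIO(sample)))
print(rows[0][1])
print(rows[0][3])
```

For the real files, you would pass an open file handle instead, for example open("yahoo_answers_csv/train.csv", newline="", encoding="utf-8").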

The file classes.txt contains the available labels.

The train.csv file contains 1,400,000 lines and test.csv contains 60,000 lines. Amazon Comprehend uses between 10% and 20% of the documents submitted for training to test the custom classifier.

The following command indicates that the data is evenly distributed:

Admin:~/environment $ awk -F '","' '{print $1}'  yahoo_answers_csv/train.csv | sort | uniq -c
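If you prefer Python to awk, the same per-class count can be sketched with collections.Counter. The three-row input below is made up so the example runs anywhere; class_counts is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def class_counts(handle):
    """Count rows per class index (the first CSV column)."""
    return Counter(row[0] for row in csv.reader(handle))


# Tiny stand-in for train.csv; for the real file, use
# open("yahoo_answers_csv/train.csv", newline="", encoding="utf-8").
demo = io.StringIO('"1","t","c","a"\n"2","t","c","a"\n"1","t","c","a"\n')
counts = class_counts(demo)

for class_index, count in sorted(counts.items(), key=lambda item: int(item[0])):
    print(class_index, count)
```

On an evenly distributed corpus, every class index is reported with the same count.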

You should train your model with up to 1,000 training documents for each label and no more than 1,000,000 documents.

With 20% of 1,000,000 used for testing, that is still plenty of data to train your custom classifier.

Use a shortened version of train.csv to train your custom Amazon Comprehend model, and use test.csv to perform your validation and see how well your custom model performs.

For training, the file format must conform to the following requirements:

  • File must contain one label and one text per line – 2 columns
  • No header
  • Format UTF-8, with lines terminated by "\n".

Labels must be uppercase. They can be multi-token, contain white space, consist of multiple words connected by underscores or hyphens, or even contain a comma, as long as it is correctly escaped.

The following table contains the formatted labels proposed for the training.

Index  Original                 For training
1      Society & Culture        SOCIETY_AND_CULTURE
2      Science & Mathematics    SCIENCE_AND_MATHEMATICS
3      Health                   HEALTH
4      Education & Reference    EDUCATION_AND_REFERENCE
5      Computers & Internet     COMPUTERS_AND_INTERNET
6      Sports                   SPORTS
7      Business & Finance       BUSINESS_AND_FINANCE
8      Entertainment & Music    ENTERTAINMENT_AND_MUSIC
9      Family & Relationships   FAMILY_AND_RELATIONSHIPS
10     Politics & Government    POLITICS_AND_GOVERNMENT
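The mapping in the table follows a simple rule: replace "&" with "and", join words with underscores, and uppercase the result. A minimal sketch of that rule, with to_training_label as a name invented for this example:

```python
import re


def to_training_label(original: str) -> str:
    """Turn a label such as 'Society & Culture' into 'SOCIETY_AND_CULTURE'."""
    label = original.replace("&", "and")
    # Collapse every run of white space into a single underscore.
    label = re.sub(r"\s+", "_", label.strip())
    return label.upper()


for name in ("Society & Culture", "Health", "Politics & Government"):
    print(name, "->", to_training_label(name))
```

The rule is one convention among several that the format allows; hyphens or plain spaces would also satisfy the label requirements described above.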

When you want your custom Amazon Comprehend model to determine which label corresponds to a given text in an asynchronous way, the file format must conform to the following requirements:

  • File must contain one text per line
  • No header
  • Format UTF-8, with lines terminated by "\n".

This post includes a script to speed up the data preparation. Enter the following command to copy the script to your local AWS Cloud9 environment:

Admin:~/environment $ aws s3 cp s3://aws-ml-blog/artifacts/comprehend-custom-classification/prepare_data.py .

To launch data preparation, enter the following commands:

Admin:~/environment $ sudo pip-3.6 install pandas tqdm
Admin:~/environment $ python3 prepare_data.py

This script is tied to the Yahoo corpus and uses the pandas library to format the training and testing datasets to match your Amazon Comprehend expectations. You may adapt it to your own dataset or change the number of items in the training dataset and validation dataset.

When the script is finished (it should run for approximately 11 minutes on a t2.large instance for the full dataset, and in under 5 minutes for the reduced dataset), you have the following new files in your environment:

  • comprehend-train.csv
  • comprehend-test.csv
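If you adapt the preparation to your own data, the core transformation is small. The sketch below is not the logic of prepare_data.py, which this post does not reproduce; it only illustrates one plausible way to turn a four-column Yahoo row into the two-column training format. The function name and the choice to join title, content, and answer with spaces are assumptions.

```python
import csv
import io

LABELS = [
    "SOCIETY_AND_CULTURE", "SCIENCE_AND_MATHEMATICS", "HEALTH",
    "EDUCATION_AND_REFERENCE", "COMPUTERS_AND_INTERNET", "SPORTS",
    "BUSINESS_AND_FINANCE", "ENTERTAINMENT_AND_MUSIC",
    "FAMILY_AND_RELATIONSHIPS", "POLITICS_AND_GOVERNMENT",
]


def to_training_row(class_index, title, content, answer):
    """Hypothetical helper: build [label, text] for one training line."""
    # Assumption: concatenate the non-empty text fields into one document.
    text = " ".join(part for part in (title, content, answer) if part)
    # Keep each document on a single line by dropping the escaped newlines.
    text = text.replace("\\n", " ")
    return [LABELS[class_index - 1], text]


row = to_training_row(5, "Why does it fail, exactly?", "", r"First line\nSecond line")

# csv.writer quotes the text when it contains a comma or a double quote.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(row)
print(buffer.getvalue(), end="")
```

For the inference file, the same text would be written without the label column, one document per line.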

Upload the prepared data (either the one you downloaded or the one you prepared) to the bucket you created with the following commands:

Admin:~/environment $ aws s3 cp comprehend-train.csv s3://123456789012-comprehend/

Admin:~/environment $ aws s3 cp comprehend-test.csv s3://123456789012-comprehend/

Training the custom classifier

You are ready to launch the custom text classifier training. Enter the following command, and replace the role ARN and bucket name with your own:

Admin:~/environment $ aws comprehend create-document-classifier --document-classifier-name "yahoo-answers" --data-access-role-arn arn:aws:iam::123456789012:role/ComprehendBucketAccessRole --input-data-config S3Uri=s3://123456789012-comprehend/comprehend-train.csv --output-data-config S3Uri=s3://123456789012-comprehend/TrainingOutput/ --language-code en

You get the following output that names the custom classifier ARN:

{
    "DocumentClassifierArn": "arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers"
}

It is an asynchronous call. You can then track the training progress with the following command:

Admin:~/environment $ aws comprehend describe-document-classifier --document-classifier-arn arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers

You get the following output:

{
    "DocumentClassifierProperties": {
        "DocumentClassifierArn": "arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers",
        "Status": "TRAINING",
        "LanguageCode": "en",
        "DataAccessRoleArn": "arn:aws:iam::123456789012:role/ComprehendBucketAccessRole",
        "InputDataConfig": {
            "S3Uri": "s3://123456789012-comprehend/comprehend-train.csv"
        },
        "SubmitTime": 1571657958.503, 
        "OutputDataConfig": {
            "S3Uri": "s3://123456789012-comprehend/TrainingOutput/123456789012-CLR-b205910479f3a195124bea9b70c4e2a9/output/output.tar.gz"
        }
    }
}
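Because the call is asynchronous, you may prefer a small polling loop to rerunning the command by hand. The sketch below assumes you pass in a describe function such as boto3.client("comprehend").describe_document_classifier; to stay runnable without AWS credentials, the demo uses a fake in its place. The names wait_for_classifier and make_fake_describe, the polling interval, and the attempt limit are all choices made for this example.

```python
import time

# Statuses after which the classifier description stops changing.
FINAL_STATUSES = {"TRAINED", "IN_ERROR", "STOPPED"}


def wait_for_classifier(describe, classifier_arn, poll_seconds=60,
                        max_polls=240, sleep=time.sleep):
    """Poll describe() until the classifier reaches a final status."""
    for _ in range(max_polls):
        response = describe(DocumentClassifierArn=classifier_arn)
        properties = response["DocumentClassifierProperties"]
        if properties["Status"] in FINAL_STATUSES:
            return properties
        sleep(poll_seconds)
    raise TimeoutError(f"{classifier_arn} did not finish after {max_polls} polls")


def make_fake_describe(statuses):
    """Stand-in for the real API call; replays the given statuses in order."""
    remaining = iter(statuses)

    def describe(DocumentClassifierArn):
        return {"DocumentClassifierProperties": {
            "DocumentClassifierArn": DocumentClassifierArn,
            "Status": next(remaining),
        }}

    return describe


ARN = "arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers"
result = wait_for_classifier(
    make_fake_describe(["SUBMITTED", "TRAINING", "TRAINED"]),
    ARN,
    sleep=lambda seconds: None,  # skip the real waiting in this demo
)
print(result["Status"])
```

Injecting the describe and sleep functions keeps the loop testable; with real credentials you would pass the boto3 method and leave sleep at its default.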

When the training is finished, you get the following output:

{
    "DocumentClassifierProperties": {
        "DocumentClassifierArn": "arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers",
        "Status": "TRAINED",
        "LanguageCode": "en",
        "DataAccessRoleArn": "arn:aws:iam::123456789012:role/ComprehendBucketAccessRole",
        "InputDataConfig": {
            "S3Uri": "s3://123456789012-comprehend/comprehend-train.csv"
        },
        "OutputDataConfig": {
            "S3Uri": "s3://123456789012-comprehend/TrainingOutput/123456789012-CLR-b205910479f3a195124bea9b70c4e2a9/output/output.tar.gz"
        },
        "SubmitTime": 1571657958.503,
        "EndTime": 1571661250.482,
        "TrainingStartTime": 1571658140.277,
        "TrainingEndTime": 1571661207.915,
        "ClassifierMetadata": {
            "NumberOfLabels": 10,
            "NumberOfTrainedDocuments": 989873,
            "NumberOfTestDocuments": 10000,
            "EvaluationMetrics": {
                "Accuracy": 0.7235,
                "Precision": 0.722,
                "Recall": 0.7235,
                "F1Score": 0.7219
            }
        }
    }
}

The training duration may vary; in this case, the training took approximately one hour for the full dataset (20 minutes for the reduced dataset).
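The duration can be read straight off the timestamps in the output above, which are seconds since the Unix epoch. A short sketch of the arithmetic, with minutes_between as a name chosen for this example:

```python
# Timestamps copied from the describe-document-classifier output above.
properties = {
    "SubmitTime": 1571657958.503,
    "EndTime": 1571661250.482,
    "TrainingStartTime": 1571658140.277,
    "TrainingEndTime": 1571661207.915,
}


def minutes_between(start: float, end: float) -> float:
    """Elapsed minutes between two epoch timestamps given in seconds."""
    return (end - start) / 60


training_minutes = minutes_between(
    properties["TrainingStartTime"], properties["TrainingEndTime"])
total_minutes = minutes_between(properties["SubmitTime"], properties["EndTime"])

print(f"training: {training_minutes:.1f} min")  # about 51.1
print(f"submit to end: {total_minutes:.1f} min")  # about 54.9
```

The gap between the two figures is the time the job spent queued and finalizing around the training run itself.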

The output for the training on the full dataset shows that your model has a recall of 0.72—in other words, it correctly identifies 72% of given labels.
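To make the recall figure concrete, here is a sketch of how recall can be computed per label and then averaged. It assumes the reported number is a plain average over labels (macro averaging); the toy labels and predictions are made up, and the function names are chosen for this example.

```python
from collections import Counter


def per_label_recall(true_labels, predicted_labels):
    """For each label, the share of its documents that were predicted correctly."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def macro_recall(true_labels, predicted_labels):
    """Unweighted mean of the per-label recalls (an assumption, see above)."""
    recalls = per_label_recall(true_labels, predicted_labels)
    return sum(recalls.values()) / len(recalls)


# Made-up data: 2 HEALTH documents and 4 SPORTS documents.
truth = ["HEALTH", "HEALTH", "SPORTS", "SPORTS", "SPORTS", "SPORTS"]
preds = ["HEALTH", "SPORTS", "SPORTS", "SPORTS", "SPORTS", "HEALTH"]

print(per_label_recall(truth, preds))
print(macro_recall(truth, preds))
```

In this toy case HEALTH has a recall of 0.5 and SPORTS has 0.75, so the macro average is 0.625.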

The following screenshot shows the view from the console (Comprehend > Custom Classification > yahoo-answers).

Gathering results

You can now launch an inference job to test how the classifier performs. Enter the following commands:

Admin:~/environment $ aws comprehend start-document-classification-job --document-classifier-arn arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers --input-data-config S3Uri=s3://123456789012-comprehend/comprehend-test.csv,InputFormat=ONE_DOC_PER_LINE --output-data-config S3Uri=s3://123456789012-comprehend/InferenceOutput/ --data-access-role-arn arn:aws:iam::123456789012:role/ComprehendBucketAccessRole

You get the following output:

{
    "JobStatus": "SUBMITTED", 
    "JobId": "cd5a6ee7f490a353e33f50d866d0317a"
}

Just as you did for the training progress tracking, because the inference is asynchronous, you can check the progress of the newly launched job with the following command:

Admin:~/environment $ aws comprehend describe-document-classification-job --job-id cd5a6ee7f490a353e33f50d866d0317a

You get the following output:

{
    "DocumentClassificationJobProperties": {
        "InputDataConfig": {
            "S3Uri": "s3://123456789012-comprehend/comprehend-test.csv", 
            "InputFormat": "ONE_DOC_PER_LINE"
        }, 
        "DataAccessRoleArn": "arn:aws:iam::123456789012:role/ComprehendBucketAccessRole",
        "DocumentClassifierArn": "arn:aws:comprehend:us-east-1:123456789012:document-classifier/yahoo-answers",
        "JobStatus": "IN_PROGRESS", 
        "JobId": "cd5a6ee7f490a353e33f50d866d0317a", 
        "SubmitTime": 1571663022.503, 
        "OutputDataConfig": {
            "S3Uri": "s3://123456789012-comprehend/InferenceOutput/123456789012-CLN-cd5a6ee7f490a353e33f50d866d0317a/output/output.tar.gz"
        }
    }
}

When the job finishes, JobStatus changes to COMPLETED. This takes a few minutes.

Download the results using OutputDataConfig.S3Uri path with the following command:

Admin:~/environment $ aws s3 cp s3://123456789012-comprehend/InferenceOutput/123456789012-CLN-cd5a6ee7f490a353e33f50d866d0317a/output/output.tar.gz .

When you uncompress the output (tar xvzf output.tar.gz), you get a .jsonl file. Each line represents the result of the requested classification for the corresponding line of the document you submitted.
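You can also read the predictions without unpacking the archive on disk. The sketch below uses Python's tarfile module and loads every .jsonl member it finds; the member name predictions.jsonl used in the demo is an assumption, as is the helper name read_predictions. So that it runs anywhere, the demo builds a one-line stand-in archive in a temporary directory.

```python
import io
import json
import os
import tarfile
import tempfile


def read_predictions(archive_path):
    """Return one dict per non-empty line of every .jsonl file in the archive."""
    predictions = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not (member.isfile() and member.name.endswith(".jsonl")):
                continue
            with archive.extractfile(member) as handle:
                for raw_line in handle:
                    if raw_line.strip():
                        predictions.append(json.loads(raw_line))
    return predictions


# Demo: build a tiny stand-in for output.tar.gz, then read it back.
with tempfile.TemporaryDirectory() as workdir:
    payload = (b'{"File": "comprehend-test.csv", "Line": "0", '
               b'"Classes": [{"Name": "HEALTH", "Score": 0.9}]}\n')
    info = tarfile.TarInfo(name="predictions.jsonl")  # assumed member name
    info.size = len(payload)
    demo_path = os.path.join(workdir, "output.tar.gz")
    with tarfile.open(demo_path, "w:gz") as archive:
        archive.addfile(info, io.BytesIO(payload))
    demo_predictions = read_predictions(demo_path)

print(demo_predictions[0]["Classes"][0]["Name"])
```

With the real download, you would call read_predictions("output.tar.gz") directly.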

For example, the following code is one line from the predictions:

{"File": "comprehend-test.csv", "Line": "9", "Classes": [{"Name": "ENTERTAINMENT_AND_MUSIC", "Score": 0.9685}, {"Name": "EDUCATION_AND_REFERENCE", "Score": 0.0159}, {"Name": "BUSINESS_AND_FINANCE", "Score": 0.0102}]}

This means that your custom model predicted with a 96.8% confidence score that the following text was related to the “Entertainment and music” label.

"What was the first Disney animated character to appear in color? \n Donald Duck was the first major Disney character to appear in color, in his debut cartoon, "The Wise Little Hen" in 1934.\n\nFYI: Mickey Mouse made his color debut in the 1935 'toon, "The Band Concert," and the first color 'toon from Disney was "Flowers and Trees," in 1932."

Each line of results also provides the second and third most likely labels. You might use these scores in your application, for example by applying every label with a score above 40%, or by revisiting the model if no single score is above 70%.
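The thresholding idea can be sketched against the prediction line shown earlier. The 40% and 70% cut-offs come from the text; the function name decide and the shape of its return value are choices made for this example.

```python
import json

# The prediction line shown above, as it appears in the .jsonl output.
line = (
    '{"File": "comprehend-test.csv", "Line": "9", "Classes": ['
    '{"Name": "ENTERTAINMENT_AND_MUSIC", "Score": 0.9685}, '
    '{"Name": "EDUCATION_AND_REFERENCE", "Score": 0.0159}, '
    '{"Name": "BUSINESS_AND_FINANCE", "Score": 0.0102}]}'
)


def decide(prediction, accept_above=0.4, confident_above=0.7):
    """Return (labels to apply, whether the document needs a second look)."""
    classes = prediction["Classes"]
    accepted = [c["Name"] for c in classes if c["Score"] > accept_above]
    needs_review = max(c["Score"] for c in classes) <= confident_above
    return accepted, needs_review


labels, needs_review = decide(json.loads(line))
print(labels, needs_review)
```

For this line only ENTERTAINMENT_AND_MUSIC clears the 40% bar, and because its score is above 70% the document is not flagged for review.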

Summary

With the full dataset for training and validation, in less than two hours, you used Amazon Comprehend to learn 10 custom categories—and achieved a 72% recall on the test—and to apply that custom model to 60,000 documents.

Try custom categories now from the Amazon Comprehend console. For more information, see Custom Classification. You can discover other Amazon Comprehend features and get inspiration from other AWS blog posts about how to use Amazon Comprehend beyond classification.

Amazon Comprehend can help you power your application with NLP capabilities in almost no time. Happy experimentation!


About the Author

Hervé Nivon is a Solutions Architect who helps startup customers grow their business on AWS. Before joining AWS, Hervé was the CTO of a company generating business insights for enterprises from commercial unmanned aerial vehicle imagery. Hervé has also served as a consultant at Accenture.

 

 

 

[D] Should autoencoders really be symmetric?

I always find myself wanting to make the decoder side of an autoencoder as symmetric as possible with respect to the encoder side, because it feels like an “elegant” design decision. But I suspect that it’s not optimal. And I’m not finding any direct discussions of this topic via google.

In most of mathematics, complex functions tend to have even more complex inverses. With respect to CNNs, convolutions are not strictly invertible, so it seems like the Conv2DTranspose operations could benefit from a higher complexity and parameter count to approximate it better. I’m curious if anyone has direct experience studying this, or if there are conventions for “optimizing” the decoder side of an autoencoder (or maybe it’s the encoder side needs more parameters…?).

My first inclination is to just double some numbers on the decoder side to give it twice as many parameters. But maybe including extra layers is better, since it more significantly increases the complexity of functions it can approximate. Or maybe none of this is theoretically necessary/relevant…?

Here’s an almost perfectly-symmetric reference network. Obviously I could experiment with it to come up with ideas, but I’m more interested in the general theory and if there’s any established ideas on the topic (and not just for CNNs, but all types of autoencoders).

Encoder:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 48, 48, 3)]       0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 24, 24, 32)        2432
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 12, 12, 64)        32832
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 6, 6, 128)         73856
_________________________________________________________________
flatten_3 (Flatten)          (None, 4608)              0
_________________________________________________________________
dense_1 (Dense)              (None, 256)               1179904
_________________________________________________________________
dense_2 (Dense)              (None, 64)                16448
=================================================================
Total params: 1,305,472

Decoder:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         [(None, 64)]              0
_________________________________________________________________
dense_3 (Dense)              (None, 256)               16640
_________________________________________________________________
dense_4 (Dense)              (None, 4608)              1184256
_________________________________________________________________
reshape_1 (Reshape)          (None, 6, 6, 128)         0
_________________________________________________________________
conv2d_transpose_3 (Conv2DTr (None, 12, 12, 64)        73792
_________________________________________________________________
conv2d_transpose_4 (Conv2DTr (None, 24, 24, 32)        32800
_________________________________________________________________
conv2d_transpose_5 (Conv2DTr (None, 48, 48, 3)         2403
=================================================================
Total params: 1,309,891

For reference, the above computation graph was produced with the following code fragment:

# Assumed imports; the original fragment did not show them.
from tensorflow import keras
from tensorflow.keras import layers as L

# Encoder: three strided convolutions (48 -> 24 -> 12 -> 6), then two dense layers
enc_input = L.Input(shape=(48, 48, 3))
enc0 = L.Conv2D(filters=32, kernel_size=5, strides=2, padding='same', activation='relu')(enc_input)
enc1 = L.Conv2D(filters=64, kernel_size=4, strides=2, padding='same', activation='relu')(enc0)
enc2 = L.Conv2D(filters=128, kernel_size=3, strides=2, padding='same', activation='relu')(enc1)
enc_flat = L.Flatten()(enc2)
enc_dense = L.Dense(256, activation='tanh')(enc_flat)
enc_out = L.Dense(64, activation='linear')(enc_dense)
encoder = keras.Model(inputs=enc_input, outputs=enc_out, name='Encoder')

# Decoder: mirror image of the encoder (6 -> 12 -> 24 -> 48)
dec_input = L.Input(shape=(64,))
dec_dense1 = L.Dense(256, activation='tanh')(dec_input)
dec_dense2 = L.Dense(6*6*128, activation='relu')(dec_dense1)
dec_reshape = L.Reshape((6, 6, 128))(dec_dense2)
dec2 = L.Conv2DTranspose(filters=64, kernel_size=3, strides=2, padding='same', activation='relu')(dec_reshape)
dec1 = L.Conv2DTranspose(filters=32, kernel_size=4, strides=2, padding='same', activation='relu')(dec2)
dec0 = L.Conv2DTranspose(filters=3, kernel_size=5, strides=2, padding='same', activation='linear')(dec1)
decoder = keras.Model(inputs=dec_input, outputs=dec0, name='Decoder')

encoder.summary()
decoder.summary()
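As a side note on the "double some numbers" idea, the parameter counts in the summaries can be reproduced by hand, which shows where the parameters actually sit. The sketch below uses the standard counts for convolution and dense layers with biases; the "wide" decoder that doubles the filters of the two intermediate transposed convolutions is a hypothetical variant, not something the post tested.

```python
def conv_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights plus biases of a square Conv2D or Conv2DTranspose layer."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(in_units: int, out_units: int) -> int:
    """Weights plus biases of a Dense layer."""
    return in_units * out_units + out_units


encoder_total = sum([
    conv_params(5, 3, 32), conv_params(4, 32, 64), conv_params(3, 64, 128),
    dense_params(6 * 6 * 128, 256), dense_params(256, 64),
])
decoder_total = sum([
    dense_params(64, 256), dense_params(256, 6 * 6 * 128),
    conv_params(3, 128, 64), conv_params(4, 64, 32), conv_params(5, 32, 3),
])
# Hypothetical variant: 128 and 64 filters instead of 64 and 32.
wide_decoder_total = sum([
    dense_params(64, 256), dense_params(256, 6 * 6 * 128),
    conv_params(3, 128, 128), conv_params(4, 128, 64), conv_params(5, 64, 3),
])

print(encoder_total, decoder_total, wide_decoder_total)
print(f"wide / symmetric decoder: {wide_decoder_total / decoder_total:.2f}")
```

In this network the dense layers next to the bottleneck hold roughly 90% of the parameters on each side, so doubling the decoder's convolution filters grows its total by only about 13%. Whether that extra capacity helps reconstruction is a separate empirical question.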

submitted by /u/etotheipi_

[D] Decision Tree Splitting strategy


I have a dataset with 4 categorical features (Cholesterol, Systolic Blood pressure, diastolic blood pressure, and smoking rate). I use a decision tree classifier to find the probability of stroke. I am trying to verify my understanding of the splitting procedure done by Python Sklearn. Since it is a binary tree, there are three possible ways to split the first feature which is either to group categories {0 and 1 to a leaf, 2 to another leaf} or {0 and 2, 1}, or {0, 1 and 2}. What I know (please correct me here) is that the chosen split is the one with the highest information gain.

I have calculated the information gain for each of the three grouping scenarios:

{0 + 1 , 2} –> 0.17

{0 + 2 , 1} –> 0.18

{1 + 2 , 0} –> 0.004

However, sklearn’s decision tree chose the first scenario instead of the third (please check the picture).

Can anyone please help clarify the reason for selecting the first scenario? Is there a priority for splits that result in pure nodes, so that such a scenario is selected even though it has less information gain?
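For anyone wanting to check numbers like the ones above, here is a minimal sketch of entropy-based information gain for a two-way grouping of categories. The six-row dataset is made up and the function names are chosen for this example. Two caveats apply when comparing against scikit-learn: DecisionTreeClassifier treats every feature as numeric and splits on a threshold, so with categories coded 0, 1, 2 it can only produce {0} vs {1, 2} or {0, 1} vs {2}, never {0, 2} vs {1}; and its default criterion is Gini impurity, so entropy has to be requested with criterion="entropy".

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels, left_categories):
    """Gain from sending rows whose category is in left_categories to one child."""
    left = [y for x, y in zip(feature, labels) if x in left_categories]
    right = [y for x, y in zip(feature, labels) if x not in left_categories]
    if not left or not right:
        return 0.0  # not a real split
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Made-up example: one feature with categories 0, 1, 2 and a binary target.
feature = [0, 0, 1, 1, 2, 2]
target = [0, 0, 0, 1, 1, 1]

for left in ({0}, {0, 1}, {0, 2}):
    print(sorted(left), round(information_gain(feature, target, left), 4))
```

On this toy data the two threshold-style groupings tie at about 0.459 bits, while {0, 2} vs {1} gains nothing.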

https://preview.redd.it/mkve4teopk641.jpg?width=1319&format=pjpg&auto=webp&s=fe487bedf67bc812d720ae2fe595fc41d9589dda

submitted by /u/elmsha

[D] Classifying malware based on API calls

Hi guys,

I am new to machine learning, and after trying out TensorFlow’s tutorial on how to create a classifier based on IMDb reviews, I want to create my own classifier to do a binary classification (malicious/benign) of maybe .exe or .apk files.

I was wondering if I can do the same thing as TensorFlow’s IMDb tutorial did, i.e. train on a set of texts and give each text a label (pos/neg).

So in the context of classifying malware, those texts are actually system API calls, e.g.

Set 1 [ func1() func2() func3() func4() func5() func6()…etc] Label -> Malicious

Set 2 [func1() func3() func4() func5()] Label -> benign

The sequence of the API calls matters btw, and I heard that to handle that I will need to use an RNN such as an LSTM.
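To make the analogy with the IMDb tutorial concrete: just as words are mapped to integer IDs there, each API call name can be mapped to an ID, and sequences padded to a common length before they reach an embedding layer and an LSTM. The sketch below covers only that encoding step; the function names, the reserved IDs for padding and unknown calls, and the choice to pad at the end are assumptions made for illustration.

```python
PAD_ID, UNKNOWN_ID = 0, 1


def build_vocabulary(sequences):
    """Assign an integer ID to every API call name seen in the training set."""
    vocabulary = {"<pad>": PAD_ID, "<unk>": UNKNOWN_ID}
    for sequence in sequences:
        for call in sequence:
            vocabulary.setdefault(call, len(vocabulary))
    return vocabulary


def encode(sequence, vocabulary, max_len):
    """Map call names to IDs in order, then truncate or pad to max_len."""
    ids = [vocabulary.get(call, UNKNOWN_ID) for call in sequence][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# The two example sets from the post, with labels 1 = malicious, 0 = benign.
set_1 = ["func1()", "func2()", "func3()", "func4()", "func5()", "func6()"]
set_2 = ["func1()", "func3()", "func4()", "func5()"]
labels = [1, 0]

vocabulary = build_vocabulary([set_1, set_2])
encoded = [encode(s, vocabulary, max_len=6) for s in (set_1, set_2)]
print(encoded)
```

The order of calls is preserved in the encoded lists, which is what a recurrent model needs; a bag-of-words encoding would throw that information away.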

I would love to hear from you guys if this is the correct way to do things…would most likely target Android applications…

submitted by /u/yourspeaker317

[Project] Curated list of computational narratology papers

Hi!

You might remember me from the blog post I shared a few days ago.

I received an absolute onslaught of emails (close to 30 emails!!!). The main question I got was “Wow computational narratology seems pretty cool! Where do I get started? I’ve only seen paper XYZ”

As such, I decided that rather than answering every email independently, I would create a curated list of papers!

https://github.com/LouisCastricato/Narratology-Papers

Feel free to contribute (PRs are welcome!) I’ll be working on this for the next few hours, so it should be a couple dozen papers by tomorrow 🙂

submitted by /u/FerretDude

[N] 4 Months after Siraj was caught scamming he has still not refunded any victims based in India, Philippines, or any other countries with no legal recourse. He makes an apology video, and when his victims ask for their refund, his followers respond with “Be kind. He’s asking for your forgiveness”

This is fucking sick..

People based in India, the Philippines, and other countries that do not have the resources to go after Siraj legally are those who need the money the most. $200 could be a month’s worth of salary, or several months’. And the types of people who get caught up in the scams are those who are genuinely looking to improve their financial situation and work hard for it. This is fucking cruel.

I’m having a hard time believing Siraj’s followers are that brainwashed. Most likely alt accounts controlled by Siraj.

https://i.imgur.com/6cUhQDO.png

https://i.imgur.com/TDx5ELA.png

submitted by /u/RelevantMarketing