Category: Vimarsh Karbhari

How to evaluate ML models using confusion matrix?

Model Evaluation using Confusion Matrix

Model evaluation is a very important aspect of data science. Evaluating a data science model adds colour to our hypothesis and helps us compare different models to find the one that gives better results on our data.

What Big-O is to coding, validation and evaluation is to Data Science Models.

Photo by Leon Koye on Unsplash

When we implement a multi-class classifier, we have multiple classes, and each class typically has a different number of data entries. During testing, we need to know whether the classifier performs equally well for all the classes or whether it is biased towards some of them. This analysis can be done using the confusion matrix, which counts how many data entries are correctly classified and how many are misclassified.

Let’s take an example. Suppose ten data entries belong to a class labelled “Class 1”. When we generate predictions from our ML model, we check how many of those ten entries receive the predicted label “Class 1”. Suppose six do. For these six entries, the predicted label and the true (actual) label are the same, so the accuracy for this class is 60%. The ML model misclassifies the remaining four entries, predicting class labels other than “Class 1”. As this example shows, the confusion matrix tells us how many data entries are classified correctly and how many are misclassified, which lets us explore the class-wise accuracy of the classifier.
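The ten-entry example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source does not say which wrong labels the model predicted, so the split of the four misclassified entries across “Class 2” and “Class 3” is made up, and `confusion_matrix` is a hypothetical helper written for this sketch rather than a library function.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Return a nested dict where matrix[actual][predicted] is a count."""
    counts = Counter(zip(y_true, y_pred))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}

labels = ["Class 1", "Class 2", "Class 3"]

# Ten entries whose true label is "Class 1". The predictions are made up:
# six are correct and four are spread over the other two classes.
y_true = ["Class 1"] * 10
y_pred = ["Class 1"] * 6 + ["Class 2"] * 3 + ["Class 3"] * 1

matrix = confusion_matrix(y_true, y_pred, labels)

# Class-wise accuracy: correct predictions for the class / entries in the class.
row = matrix["Class 1"]
class_1_accuracy = row["Class 1"] / sum(row.values())

print(row)               # {'Class 1': 6, 'Class 2': 3, 'Class 3': 1}
print(class_1_accuracy)  # 0.6
```

Reading one row of the matrix at a time gives the class-wise view described above: the diagonal cell holds the correctly classified entries, and the other cells show where the misclassified entries went.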

Source: ML Solutions

For more learning on similar topics, the ML solutions book provides good explanations.

For more such answers to important Data Science concepts, please visit Acing AI.

Subscribe to our Acing AI newsletter. I promise not to spam, and it’s FREE!

Acing AI Newsletter – Revue

Thanks for reading! 😊 If you enjoyed it, test how many times can you hit 👏 in 5 seconds. It’s great cardio for your fingers AND will help other people see the story.


How to evaluate ML models using confusion matrix? was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Shopify Data Science Interview Questions

Shopify powers 800,000 businesses in approximately 175 countries.

The first iteration of Shopify (before it was called that) was an online store that sold snowboards. Eventually, it pivoted to become an e-commerce platform. It has been named Canada’s “smartest” company, among myriad other well-earned accolades. Shopify was the third-largest e-commerce CMS in 2018, with a market share of 10.03% among the top million websites. In 2018, the Shopify platform did over $1.5 billion in sales on Cyber Monday alone.

Source: https://mobilesyrup.com/2018/05/08/shopify-new-retail-features-chip-reader/

Interview Process

The first step is a phone screen with an HR person. The next step is a three-part in-person interview (‘life story’ and technical interview). Once those are cleared, there is an onsite interview, which consists of two more technical interviews and three more interviews with prospective team leads.

Important Reading

surviving-flashes-of-high-write-traffic-using-scriptable-load-balancers

Data Science Related Interview Questions

  • Go through a previously completed project and explain it. Why did you make the choices in the project that you did?
  • What’s the difference between Type I and Type II error?
  • Explain the difference between L1 and L2 regularization.
  • Write a program to solve a simulation of Conway’s game of life.
  • What is the difference between supervised and unsupervised machine learning?
  • What’s the difference between a generative and discriminative model?
  • What’s the F1 score? How would you use it?
  • What is your experience working on big data technologies?
  • Do you have experience with Spark or big data tools for machine learning?
  • How do you ensure you are not overfitting with a model?
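As an illustration of the Conway’s Game of Life question in the list above, here is one possible sketch of a single generation step. It assumes an unbounded grid and represents the board as a set of live `(row, col)` cells; both choices, and the function name `step`, are assumptions made for this example and not part of the original question.

```python
from collections import Counter

def step(live):
    """Advance one generation; `live` is a set of (row, col) live cells."""
    # Count, for every cell adjacent to a live cell, how many live neighbours it has.
    neighbour_counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A cell is alive next generation if it has exactly 3 live neighbours,
    # or if it is currently alive and has exactly 2.
    return {
        cell
        for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }

# A "blinker" oscillates between a horizontal and a vertical bar.
blinker = {(1, 0), (1, 1), (1, 2)}
print(sorted(step(blinker)))  # [(0, 1), (1, 1), (2, 1)]
```

Storing only the live cells keeps the work proportional to the number of live cells rather than to the size of a fixed grid, which is a common talking point when this question comes up.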

Reflecting on the Question

The 800,000 businesses that Shopify powers generate massive amounts of data. The Data Science team at Shopify asks basic, fundamental data science questions. Sometimes, the questions revolve around your resume and the problems you have solved in your past career. A good grip on fundamentals can surely land you a job with one of the world’s largest e-commerce platforms!

The sole motivation of this blog article is to learn about Shopify and its technologies, and to help people get into the company. All data is sourced from online public sources. I aim to make this a living document, so any updates and suggested changes can always be included. Please provide relevant feedback.


Shopify Data Science Interview Questions was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

What is TF-IDF in Feature Engineering?

Basic concept of TF-IDF in NLP

TF-IDF stands for term frequency-inverse document frequency. It is a numerical statistic that lets us decide how important a word is to a given document in a dataset or corpus.

Frequency

What is TF-IDF?

TF-IDF indicates how important a word is for understanding a document or dataset. Let us understand this with an example. Suppose you have a dataset where students write an essay on the topic My House. In this dataset, the word a appears many times; it is a high-frequency word compared to other words in the dataset. The dataset contains other words like home, house, rooms and so on that appear less often, so their frequencies are lower and they carry more information than the word a. This is the intuition behind TF-IDF.

Let us dive deep into the mathematical aspect of TF-IDF. It has two parts: Term Frequency(TF) and Inverse Document Frequency(IDF). The term frequency indicates the frequency of each of the words present in the document or dataset.

So, its equation is given as follows:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

The second part is inverse document frequency. IDF tells us how important the word is to the document. We need it because when we calculate TF, we give equal importance to every single word. If a word appears frequently in the dataset, its term frequency (TF) value is high even though the word is not that important to the document.

So, if the word the appears in the document 100 times, it carries less information than words that are less frequent in the dataset. Thus, we need to weigh down the frequent terms while scaling up the rare ones, which decides the importance of each word. We will achieve this with the following equation:

IDF(t) = log10(Total number of documents / Number of documents with term t in it).

Hence, the equation to calculate TF-IDF is as follows.

TF * IDF = [ (Number of times term t appears in a document) / (Total number of terms in the document) ] * log10(Total number of documents / Number of documents with term t in it).

In other words, TF-IDF is simply the product of TF and IDF.

Now, let’s take an example where you have two sentences and are considering those sentences as different documents in order to understand the concept of TF-IDF:

Document 1: This is a sample.

Document 2: This is another example.

Source: Python NLP

In summary, to calculate TF-IDF, we will follow these steps:

1. We first calculate the term frequency (TF) of each word for each document.

2. We calculate IDF.

3. We multiply TF and IDF.
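The three steps can be sketched in Python for the two example documents. This is a minimal illustration of the TF and IDF equations given earlier, not the code from the referenced book; the lowercase, letters-only tokenizer and the helper names `tokenize`, `tf`, `idf` and `tf_idf` are assumptions made for this sketch.

```python
import math
import re

documents = {
    "Document 1": "This is a sample.",
    "Document 2": "This is another example.",
}

def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

tokens = {name: tokenize(text) for name, text in documents.items()}

def tf(term, doc_tokens):
    # Step 1: times the term appears in the document / total terms in the document.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    # Step 2: log10(total documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    docs_with_term = sum(1 for doc_tokens in tokens.values() if term in doc_tokens)
    return math.log10(len(tokens) / docs_with_term)

def tf_idf(term, name):
    # Step 3: multiply TF and IDF.
    return tf(term, tokens[name]) * idf(term)

for name, doc_tokens in tokens.items():
    for term in sorted(set(doc_tokens)):
        print(name, term, round(tf_idf(term, name), 4))
```

Words that occur in both documents, such as this and is, get an IDF of log10(2/2) = 0 and therefore a TF-IDF of 0, while sample scores 1/4 * log10(2), roughly 0.075, in Document 1. With only two documents, the word a also looks rare and gets the same score as sample, which shows why TF-IDF needs a reasonably large corpus to be informative.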

Reference: Python NLP


What is TF-IDF in Feature Engineering? was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

What is AUC?

Data Science Interview Questions based on AUC.

A few weeks ago, I wrote about ROC curves. The purpose was to provide a basic primer on ROC curves. As a follow-up, this article talks about AUC.

Photo by Zbysiu Rodak on Unsplash

AUC stands for Area Under the Curve. ROC can be quantified using AUC. The way it is done is to see how much area has been covered by the ROC curve. If we obtain a perfect classifier, then the AUC score is 1.0. If the classifier is random in its guesses, then the AUC score is 0.5. In the real world, we don’t expect an AUC score of 1.0, but if the AUC score for the classifier is in the range of 0.6 to 0.9, then it is considered to be a good classifier.

AUC for the ROC curve

In the preceding figure, the area covered under the curve becomes our AUC score. This gives us an indication of how well our classifier is performing. ROC and AUC are the two indicators that can provide us with insights on how our classifier performs.
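To make the idea of measuring the area under the ROC curve concrete, here is a small sketch that builds the ROC points by sweeping the decision threshold and then applies the trapezoidal rule. The labels and scores are made up for illustration, and `roc_points` and `auc` are hypothetical helper names, not functions from a particular library. It assumes binary labels (1 for positive, 0 for negative) and at least one entry of each class.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points from sweeping the threshold from high to low."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last entry in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up example: higher score means "more likely positive".
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(round(auc(roc_points(y_true, scores)), 3))  # 0.889
```

In this sketch, a classifier that ranks every positive above every negative reaches an AUC of 1.0, and one that gives every entry the same score gets 0.5, matching the perfect and random cases described above.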

Reference: ML Solutions


What is AUC? was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Expedia Data Science Interview Questions

There are 37 million Expedia members across 32 countries.

Expedia has covered 534 billion miles in air travel (in passenger miles flown), which is enough for 72 round trips from the sun to Pluto and back. Expedia is a travel company like Booking.com, which we have covered at Acing AI previously. It has sold enough hotel room nights in the last 20 years to account for every person living in the United States. The amount of data Expedia accumulates by having so many travellers every year leads to huge investment in technology. Expedia has invested over $850M trailing year over year in tech spend. A mature tech stack helps Data Scientists at their job. This is a great opportunity for any Data Scientist to build their career.

Photo by Vincent Versluis on Unsplash

Interview Process

If you are short-listed after resume screening, there is a first interview with the manager of the data science team. This includes technical questions about machine learning and statistics. After clearing that, there is a technical coding interview. The third round is an interview with HR, a more classic and typical job interview.

Important Reading

Source: Streaming Data Ecosystems

Data Science Related Interview Questions

  • What is the process of cross validation?
  • How can we do price optimization for properties on Expedia?
  • Predict Hotel prices in a given dataset.
  • Explain a Machine Learning project on your resume.
  • Develop a recommendation system based on a provided dataset.
  • Which flight path is more profitable: London-Lisbon or London-Milan?
  • Should we invest in buying more property in X city?
  • Explain linear and logistic regression.
  • Give pros and cons of SVM.
  • Explain the meaning of overfitting to non technical people.

Reflecting on the Question

The data science team at Expedia is geographically dispersed. The technical team has built a very mature data science architecture that enables the Data Science team. The interview questions are based on the questions the data science team at Expedia answers day to day. Great product sense about the Expedia product and its business can surely land you a job at one of the world’s largest travel sites!

The sole motivation of this blog article is to learn about Expedia and its technologies, and to help people get into the company. All data is sourced from online public sources. I aim to make this a living document, so any updates and suggested changes can always be included. Please provide relevant feedback.


Expedia Data Science Interview Questions was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Lyft Data Science Interview Questions

As of January 2018, Lyft could count 23 million users.

Lyft currently offers services in 350 US cities, and Toronto and Ottawa in Canada. It was launched in 2012, as a part of long-distance car-pooling business Zimride — the largest such app in the US (named for transportation culture in Zimbabwe). It was renamed as Lyft later. Launched in Silicon Valley, Lyft spread from 60 US cities in April 2014 to 300 in January 2017, to 350 today — plus the two aforementioned Canadian cities. With 350 cities, millions of users and billions of rides the data generated at Lyft is huge. The product achieves economies of scale deploying Data Science. Hence, data science is a core part of the product and not just an added feature.

Photo by Austin Distel on Unsplash

Interview Process

The interview process starts with a phone interview with a Data Scientist. It is an in-depth conversation about your resume and past projects. That interview is followed by a take-home test, which is usually based on a ride-sharing data set. As part of the take-home test, you create a presentation for the onsite interview. The onsite interview consists of 4–5 interviews. One of those is the presentation of the take-home test. It also includes a SQL test, stats and probability, and a business case. There is a final core values interview to see whether you fit within the Lyft culture. The interview is challenging, but the reward when you clear it is totally worth it.

Important Reading

Source: From shallow to deep learning in fraud

Data Science Related Interview Questions

  • Find expectations of a random variable with basic distribution. How would you construct a confidence interval? How would you estimate a probability of ordering a ride? What assumptions do you need in order to estimate this probability?
  • What optimization techniques are you familiar with and how do they work? How would you find the optimal price given a linear demand function?
  • Coin got x heads during y flips. How can we test if this is a fair coin?
  • What are some metrics for monitoring supply and demand in Lyft market?
  • Explain correlation and variance.
  • What is the lifetime value of a driver?
  • Implement k nearest neighbour using a quad tree.
  • What are the different factors that could influence a rise in average wait time of a driver?
  • Explain what are the best ways to achieve pool matching?
  • How do you reduce churn on the supply side?
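For the fair-coin question in the list above, one standard approach is an exact two-sided binomial test. The sketch below is just one possible answer; the function name `fair_coin_p_value` and the example numbers are assumptions made for illustration.

```python
from math import comb

def fair_coin_p_value(heads, flips):
    """Two-sided exact binomial p-value under H0: P(heads) = 0.5."""
    # Probability of each possible head count if the coin is fair.
    probs = [comb(flips, k) / 2 ** flips for k in range(flips + 1)]
    observed = probs[heads]
    # Add up every outcome that is at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))

print(fair_coin_p_value(8, 10))  # 0.109375
```

A small p-value (for example below a pre-chosen 0.05 threshold) would be evidence against fairness. With 8 heads in 10 flips, the p-value of about 0.11 is not enough to reject a fair coin at that level.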

Reflecting on the Questions

The Data Science team at Lyft moves very quickly. The data sets are huge and the problems so wide-ranging that the team explores different types of models which can provide higher precision for the same recall and feature set. The questions reflect the tough problems which the team faces day to day. There is a mix of model building along with complex coding questions. As I mentioned before, the interviews are tough, but they are well worth it for getting to work in an excellent team. Hard work can surely get you a job in one of the world’s largest transportation companies!

The sole motivation of this blog article is to learn about Lyft and its technologies, and to help people get into the company. All data is sourced from online public sources. I aim to make this a living document, so any updates and suggested changes can always be included. Please provide relevant feedback.


Lyft Data Science Interview Questions was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Booking.com Data Science Interview Questions

There are 28 Million+ listings to stay on Booking.com.

Booking.com is the travel e-commerce part of Booking Holdings. They have over 140,000 destinations in 230 countries all over the world. They also have over 1.5 million nights reserved every day on their platform. From a data science perspective, this translates into over 300 TB of data. A robust data engineering infrastructure coupled with huge amounts of data makes Booking.com one of the best places for a Data Scientist to build their career.

Photo by John Matychuk on Unsplash

Interview Process

The interview process starts with an MCQ-based test on machine learning and statistics. That is followed by the HR phone interview. Once you clear both of those, there is a technical phone interview with data scientists. This is based around your projects and also includes a case study discussion. Finally, there is an onsite interview, which consists of technical interviews, a behavioural interview and a hiring manager interview.

Important Reading

Booking.com streaming ecosystem

Data Science Related Interview Questions

  • What is the difference between L1 and L2 regularization?
  • What is gradient descent?
  • Why did you use Random Forests instead of Clustering on a particular problem?(case study)
  • How to deal with new hotels that do not have an official rating?
  • If the training error and the testing error are both high, as the number of data points increase, what measures will you take to fix the model?
  • How would you optimize the advertising that directs people to your site? How do you evaluate how much to spend on each channel?
  • What do you do to make sure your model is not over fitting?
  • Given a business case as such, how would you handle this with a Machine Learning solution?
  • How did you validate your model?
  • What are the parameters of decision trees and random forests, and how would you choose them?

Reflecting on the Questions

Booking.com is headquartered in Amsterdam but has offices all over the globe. Data dictates the spending and drives efficiency in their business. It is a critical component of their product. The questions are about deep data science fundamentals and also about the different situations within their business where they deploy data science. A good knowledge of Data Science fundamentals coupled with know-how about their business can surely land you a job with one of the world’s largest booking sites!

The sole motivation of this blog article is to learn about Booking.com and its technologies, and to help people get into the company. All data is sourced from online public sources. I aim to make this a living document, so any updates and suggested changes can always be included. Please provide relevant feedback.


Booking.com Data Science Interview Questions was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Dropbox Data Science Interview Questions

1.2 billion files are uploaded to Dropbox every day.

Dropbox has over 500 million users. It has users in over 200 countries and supports over 20 languages. This explains the scale of data within the company. 4000 files are edited every second on Dropbox. All this contributes to gigantic amounts of data. Syncing data and keeping everything up to date for so many files is a daunting task in itself, and Dropbox manages all this very efficiently. Another interesting thing within Dropbox is the heavy use of Python, which is the language of choice when it comes to Data Science. The product itself uses Python, which makes it even better when it comes to building data science applications. This is great for any Data Scientist to build on top of. Dropbox promises an ML-heavy inclination for a Data Scientist, which may be very interesting for many of them.

Source: https://images.app.goo.gl/6PHNTLP4sXpwX5XM8

Interview Process

The interview process starts with a recruiter screen. In this interview, the recruiter goes through your resume and chats with you over the phone to determine if you are a fit for the role. This is followed by a phone interview with the hiring manager. If you clear this interview, the next round is an onsite interview with team members. The onsite interviews might be ML heavy depending on your team and the composition of the interview panel.

Important Reading

Source: Machine learning model v1

Data Science Related Interview Questions

  • What is a propensity model?
  • How would you set up a propensity model for the SMB team looking at companies between 5–200 employees?
  • How will you up-sell to a customer based on data?
  • Find out which employee reports to which manager using SQL?
  • How will you maintain a data metric?
  • Given a table with a series of values how will you determine if there are missing values and what are those?
  • Given a root directory, return all file paths grouped by duplicate files.
  • Describe how MD5 algorithm works.
  • From a user perspective how can you determine that the search experience is good or bad?
  • How can you analyze order data to determine churn?

Reflecting on the Questions

The data science team at Dropbox works in two areas. The first is the BI analytics space, using data to help improve renewal rate and reduce churn. The second is improving the product itself, for example by trying to predict which file will be accessed next. The interview questions reflect this dichotomy within Dropbox. A data scientist should decide where they would fit and interview accordingly. Deep ML knowledge or knowledge of how to improve customer retention via data can help you land a job with one of the world’s largest document databases!

The sole motivation of this blog article is to learn about Dropbox and its technologies, and to help people get into the company. All data is sourced from online public sources. I aim to make this a living document, so any updates and suggested changes can always be included. Please provide relevant feedback.


Dropbox Data Science Interview Questions was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Sprint Data Science Interview Questions

Sprint had more than 55 million customers at the end of 2013.

Sprint Corporation headquarters are located in Overland Park, Kansas. The company is widely recognized for developing, engineering and deploying innovative technologies, including the first wireless 4G service from a national carrier in the United States. The American Customer Satisfaction Index rated Sprint as the most improved company in customer satisfaction, across all 47 industries, during the last five years. The merger of T-Mobile and Sprint, the third- and fourth-largest carriers in the U.S., was announced in 2018. The combined company would have more than 126 million customers. One cannot imagine the amount of data that resides within a telecom company, let alone two companies after merging. Sprint established the subsidiary Pinsight Media to investigate ways of capitalizing on that data. Since then it has gone from serving zero to six billion ad impressions per month, based on “authenticated first party data” which it alone has access to. This kind of data is a huge advantage for any Data Scientist and provides tremendous potential to grow their career.

Source: Bizjournal

Interview Process

The interview process starts with an HR interview. The next interview is a take-home ML and coding assessment. The assessment is statistics heavy and requires in-depth knowledge of probability distributions and ML algorithms. The assessment is followed by a case study around predictions. The case study provides a problem statement and requires you to come up with predictions based on the dataset provided. The case study is followed by the technical interview and finally a hiring manager interview. The interview process is intense, consisting of five rounds, but the company is well worth it.

Important Reading

Source: Slideshare

Data Science Related Interview Questions

  • Describe Ridge and Lasso Regression.
  • Explain SVMs and how they could be used in telecom.
  • What are the differences between RDBMS and NoSQL?
  • What are the different Data Structures used in Spark?
  • Which Data Structure is apt for Geolocation Analysis?
  • What is standard deviation? Why do we need it?
  • Given n samples from a uniform distribution[0,d]. How do you estimate d?
  • In an A/B test, how can you check if assignment to the various buckets was truly random?
  • How do you optimize model parameters during model building?
  • How does regularization reduce over fitting?
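For the question above about estimating d from n samples of a uniform distribution on [0, d], two classic estimators are the sample maximum (the maximum likelihood estimate, which tends to underestimate d) and the maximum scaled by (n + 1) / n, which corrects that bias. The sketch below is one possible answer; the function name `estimate_d` and the sample values are made up for illustration.

```python
def estimate_d(samples):
    """Return (MLE, bias-corrected) estimates of d for Uniform[0, d] samples."""
    n = len(samples)
    mle = max(samples)             # the maximum likelihood estimate is the largest sample
    unbiased = (n + 1) / n * mle   # E[max] = n / (n + 1) * d, so rescale to remove the bias
    return mle, unbiased

print(estimate_d([1.0, 4.0, 2.0, 3.0]))  # (4.0, 5.0)
```

Another estimator worth mentioning in an interview is twice the sample mean (the method of moments), which is also unbiased but typically has higher variance than the rescaled maximum.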

Reflecting on the Questions

The data science team at Sprint, which is now merged with T-Mobile, has some of the best data sets in the world. Their stack is Hadoop and Spark based. Their questions reflect the kind of work they do, where data insights could be employed for ads. A decent knowledge of how ML can be applied to telecom can surely land you a job with one of the world’s largest telecom giants!

The sole motivation of this blog article is to learn about Sprint and its technologies, and to help people get into the company. All data is sourced from online public sources. I aim to make this a living document, so any updates and suggested changes can always be included. Please provide relevant feedback.


Sprint Data Science Interview Questions was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Citibank Data Science Interview Questions

Citicorp and Travelers’ Group has total assets of $1.7 trillion.

In America, Citibank is one of the four main firms that account for half of the nation’s total mortgages and two-thirds of the total credit cards. Although this institution isn’t necessarily the largest in America, it is often considered to be the largest banking facility across the globe. Citibank serves over 200 million customers across 160 countries. Its main location functions through Citibank Europe, stationed in the Czech Republic. The Spend Tracker, which is quite common at banks all over the world today, was first introduced by Citibank. Citibank, which provides such heterogeneous financial products, showcases a wide variety of information on its spend tracker for its customers: sign-on bonuses, bonus amounts and expiration dates, bonus miles, and even how much you have to spend in order to get certain rewards. Such varied information across multiple products and 200 million customers makes it one of the best companies for Data Scientists to work at.

Photo by Anthony Ginsbrook on Unsplash

Interview Process

The interview process starts with a phone interview, which is a basic Data Science Q&A. It is followed by an onsite interview, which consists of interviews with team leads, team members and SVPs. There may or may not be an online SQL assessment before the onsite. The SQL assessment is usually a difficult one.

Important Reading

Source: ML and Cognitive Computing

Data Science Related Interview Questions

  • Given a list of integers, find all combinations that sum to a given integer.
  • Segment a long string into a set of valid words using a dictionary. Return false if the string cannot be segmented. What is the complexity of your solution?
  • Write a SQL query to find the repeated items in a column.
  • How do you use the Q data structure to maintain the state in Spark?
  • Design a Trading system with high throughput and low latency.
  • How do you describe a financial planning process?
  • What would you prefer, being attacked by a giant chicken or 100 small ones?
  • What problems did you encounter in your project (resume based), and how did you solve them?
  • Explain your thesis in layman’s terms.
  • Find the second maximum value of a column in a Database table.
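As an example of how the string segmentation question in the list above might be approached, here is a dynamic programming sketch. It is one possible solution, not an official answer; the function name `segment` and the sample dictionary are assumptions made for illustration.

```python
def segment(text, dictionary):
    """Return a list of dictionary words that concatenate to `text`, or False."""
    n = len(text)
    # prev[i] is the start index of a dictionary word ending at position i,
    # set only when text[:i] can be fully segmented (None otherwise).
    prev = [None] * (n + 1)
    prev[0] = 0  # the empty prefix is trivially segmentable
    for end in range(1, n + 1):
        for start in range(end):
            if prev[start] is not None and text[start:end] in dictionary:
                prev[end] = start
                break
    if prev[n] is None:
        return False
    # Walk backwards through the stored start indices to recover the words.
    words = []
    end = n
    while end > 0:
        start = prev[end]
        words.append(text[start:end])
        end = start
    return words[::-1]

words = {"cat", "cats", "and", "sand", "dog"}
print(segment("catsanddog", words))  # ['cat', 'sand', 'dog']
print(segment("catsandog", words))   # False
```

With a hash-based set for the dictionary, this performs O(n^2) substring lookups for a string of length n; counting the cost of slicing and hashing each substring, the worst case is O(n^3) time, with O(n) extra space.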

Reflecting on the Questions

The data science team at Citigroup uses Hadoop and Spark. They have a geographically diverse team located in the US, Europe and India. Their questions are a mix of questions related to coding, SQL, systems design, Hadoop and Spark. They are based on foundational and deep aspects of Data Science. If you work hard on your basics, you can surely land a job at one of the largest banks in the world!

The sole motivation of this blog article is to learn about Citibank and its technologies, and to help people get into the company. All data is sourced from online public sources. I aim to make this a living document, so any updates and suggested changes can always be included. Please provide relevant feedback.


Citibank Data Science Interview Questions was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.