Category: Toronto People

What is TF-IDF in Feature Engineering?

Written on July 8, 2019. Posted in Vimarsh Karbhari.

Basic concept of TF-IDF in NLP

TF-IDF stands for term frequency-inverse document frequency. It is a numerical statistic that tells us how important a word is to a given document in a dataset or corpus.

What is TF-IDF?

TF-IDF indicates how important a word is for understanding a document or dataset. Let us understand with an example. Suppose you have a dataset where students write an essay on the topic My House. In this dataset, the word a appears many times; it’s a high-frequency word compared to other words in the dataset. The dataset contains other words like home, house, rooms and so on that appear less often, so their frequencies are lower and they carry more information than the word a. This is the intuition behind TF-IDF.

Let us dive deep into the mathematical aspect of TF-IDF. It has two parts: Term Frequency (TF) and Inverse Document Frequency (IDF). The term frequency indicates how often each word appears in the document or dataset.

So, its equation is given as follows:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

The second part is inverse document frequency. IDF tells us how important the word is to the document. We need it because when we calculate TF, we give equal importance to every single word. If a word appears in the dataset very frequently, its term frequency (TF) value is high even though the word is not that important to the document.

So, if the word the appears in the document 100 times, it carries little information compared to words that are less frequent in the dataset. Thus, we need to weigh down the frequent terms while scaling up the rare ones, which decides the importance of each word. We achieve this with the following equation:

IDF(t) = log10(Total number of documents / Number of documents with term t in it).

Hence, the equation to calculate TF-IDF is as follows.

TF * IDF = [ (Number of times term t appears in a document) / (Total number of terms in the document) ] * log10(Total number of documents / Number of documents with term t in it).

In short, TF-IDF is simply the product of TF and IDF.

Now, let’s take an example where you have two sentences and are considering those sentences as different documents in order to understand the concept of TF-IDF:

Document 1: This is a sample.

Document 2: This is another example.

In summary, to calculate TF-IDF, we will follow these steps:

1. We first calculate the frequency of each word for each document.

2. We calculate IDF.

3. We multiply TF and IDF.
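
The three steps can be sketched in plain Python for the two example documents. This is a minimal illustration under a few assumptions that are not in the original post: the tokenizer (lowercasing and stripping the period) and the helper names tf, idf and tf_idf are made up for this sketch, and idf is only called for terms that occur in at least one document.

```python
import math

# The two example sentences, each treated as its own document.
documents = {
    "Document 1": "This is a sample.",
    "Document 2": "This is another example.",
}

def tokenize(text):
    """Assumed tokenizer: lowercase, drop the period, split on whitespace."""
    return text.lower().replace(".", "").split()

tokens = {name: tokenize(text) for name, text in documents.items()}

def tf(term, doc_tokens):
    """Step 1: occurrences of the term / total number of terms in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    """Step 2: log10(total documents / documents containing the term)."""
    containing = sum(1 for doc_tokens in tokens.values() if term in doc_tokens)
    return math.log10(len(tokens) / containing)

def tf_idf(term, name):
    """Step 3: multiply TF and IDF."""
    return tf(term, tokens[name]) * idf(term)

for name, doc_tokens in tokens.items():
    for term in doc_tokens:
        print(f"{name}  {term:<8} TF-IDF = {tf_idf(term, name):.4f}")
```

Words that appear in both documents ("this", "is") get an IDF of log10(2/2) = 0 and therefore a TF-IDF of 0, while words unique to one document ("sample", "another", "example", "a") score 0.25 * log10(2), roughly 0.0753.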

Subscribe to our Acing AI newsletter, I promise not to spam and its FREE!

Acing AI Newsletter – Revue

Thanks for reading! 😊 If you enjoyed it, test how many times can you hit 👏 in 5 seconds. It’s great cardio for your fingers AND will help other people see the story.

Reference: Python NLP

What is TF-IDF in Feature Engineering? was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Explain NLP models with LIME & SHAP

Written on July 3, 2019. Posted in Susan Li.

Interpretation for Text Classification

Last week, I gave a talk on “Hands-on Feature Engineering for NLP” at QCon New York. As a very small part of the presentation, I gave a brief demo on how LIME & SHAP work in terms of text classification explainability.

I decided to write a blog post about them because they are fun, easy to use and visually compelling.

Any machine learning model that operates in more dimensions than the human mind can directly visualize can be called a black box model, which is a statement about the interpretability of the model. In the field of NLP in particular, the feature dimension is almost always very large, which makes explaining feature importance much more complicated.

LIME & SHAP help us explain how an NLP model works, not only to end users but also to ourselves.

Using the Stack Overflow question tags classification data set, we are going to build a multi-class text classification model, then apply LIME & SHAP separately to explain the model. Because we have done text classification many times before, we will quickly build the NLP models and focus on model interpretability.

Data Pre-processing, Feature Engineering and Logistic Regression

<a href="https://medium.com/media/ab22a2d62c16dcaac3702aa0f35e403c/href">https://medium.com/media/ab22a2d62c16dcaac3702aa0f35e403c/href</a>

Our objective here is not to produce the best possible results. I wanted to dive into LIME & SHAP as soon as possible, and that’s what happened next.

Interpreting text predictions with LIME

From now on, it’s the fun part. The following code snippets were largely borrowed from the LIME tutorial.

<a href="https://medium.com/media/732f8dfa670d5df756fd450e3e9c972d/href">https://medium.com/media/732f8dfa670d5df756fd450e3e9c972d/href</a>

We randomly select a document in the test set; it happens to be a document labeled as sql, and our model predicts it as sql as well. Using this document, we generate explanations for label 4, which is sql, and label 8, which is python.

print('Explanation for class %s' % class_names[4])
print('\n'.join(map(str, exp.as_list(label=4))))

print('Explanation for class %s' % class_names[8])
print('\n'.join(map(str, exp.as_list(label=8))))

It is obvious that this document has the strongest explanation for label sql. We also notice that the positive and negative signs are with respect to a particular label: the word “sql” is positive towards class sql while negative towards class python, and vice versa.

We are going to generate explanations for the top 2 classes for this document.

exp = explainer.explain_instance(X_test[idx], c.predict_proba, num_features=6, top_labels=2)
print(exp.available_labels())

It gives us sql and python.

exp.show_in_notebook(text=False)

Let me try to explain this visualization:

For this document, the word “sql” has the highest positive score for class sql.
Our model predicts this document should be labeled as sql with a probability of 100%.
If we remove the word “sql” from the document, we would expect the model to predict label sql with a probability of 100% - 65% = 35%.
On the other hand, the word “sql” is negative for class python, and our model has learned that the word “range” has a small positive score for class python.

We may want to zoom in and study the explanations for class sql, as well as the document itself.

exp.show_in_notebook(text=y_test[idx], labels=(4,))

Interpreting text predictions with SHAP

The following process was learned from this tutorial.

<a href="https://medium.com/media/812fb43dae8c46ba8322804a2afe40ea/href">https://medium.com/media/812fb43dae8c46ba8322804a2afe40ea/href</a>

After the model is trained, we use the first 200 training documents as our background data set to integrate over, and to create a SHAP explainer object.
We get the attribution values for individual predictions on a subset of the test set.
Transform the index to words.
Use SHAP’s summary_plot method to show the top features impacting model predictions.

attrib_data = X_train[:200]
explainer = shap.DeepExplainer(model, attrib_data)
num_explanations = 20
shap_vals = explainer.shap_values(X_test[:num_explanations])

words = processor._tokenizer.word_index
word_lookup = list()
for i in words.keys():
  word_lookup.append(i)

word_lookup = [''] + word_lookup
shap.summary_plot(shap_vals, feature_names=word_lookup, class_names=tag_encoder.classes_)

The word “want” is the biggest signal word used by our model, contributing most to class jquery predictions.
The word “php” is the 4th biggest signal word used by our model, contributing most to class php, of course.
On the other hand, the word “php” is likely to be a negative signal for the other classes, because the word “php” is unlikely to appear in a python document.

There is a lot to learn in terms of machine learning interpretability with LIME & SHAP. I have only covered a tiny piece for NLP. The Jupyter notebook can be found on Github. Enjoy the fun!

Explain NLP models with LIME & SHAP was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Anomaly Detection for Dummies

Written on July 1, 2019. Posted in Susan Li.

Unsupervised Anomaly Detection for Univariate & Multivariate Data.

Anomaly detection is the process of identifying unexpected items or events in data sets, which differ from the norm. It is often applied to unlabeled data, which is known as unsupervised anomaly detection. Anomaly detection has two basic assumptions:

Anomalies only occur very rarely in the data.
Their features differ from the normal instances significantly.

Univariate Anomaly Detection

Before we get to multivariate anomaly detection, I think it’s necessary to work through a simple example of a univariate anomaly detection method, in which we detect outliers from a distribution of values in a single feature space.

We are using the Super Store Sales data set that can be downloaded from here, and we are going to find patterns in Sales and Profit separately that do not conform to expected behavior. That is, spotting outliers for one variable at a time.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
from sklearn.ensemble import IsolationForest

Distribution of the Sales

df = pd.read_excel("Superstore.xls")
df['Sales'].describe()

plt.scatter(range(df.shape[0]), np.sort(df['Sales'].values))
plt.xlabel('index')
plt.ylabel('Sales')
plt.title("Sales distribution")
sns.despine()

sns.distplot(df['Sales'])
plt.title("Distribution of Sales")
sns.despine()

print("Skewness: %f" % df['Sales'].skew())
print("Kurtosis: %f" % df['Sales'].kurt())

The Superstore’s sales distribution is far from a normal distribution. It has a long, thin positive tail, and the mass of the distribution is concentrated on the left of the figure. The tail of the sales distribution far exceeds the tails of the normal distribution.

There is one region where the data has a low probability of appearing, which is on the right side of the distribution.
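
To make the skewness and kurtosis numbers less of a black box, here is a small sketch of the plain moment-based definitions. It is an illustration only: the function name and the toy data are made up, and pandas applies small-sample bias corrections in skew() and kurt(), so its values will differ slightly from this uncorrected version.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skewness, excess_kurtosis

# Toy data (hypothetical): a symmetric sample and one with a single large value.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 1, 1, 1, 1, 1, 50]

print(skewness_and_kurtosis(symmetric))
print(skewness_and_kurtosis(right_tailed))
```

A symmetric sample has zero skewness, while a sample with one very large value, like a long right tail of sales, has positive skewness and positive excess kurtosis.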

Distribution of the Profit

df['Profit'].describe()

plt.scatter(range(df.shape[0]), np.sort(df['Profit'].values))
plt.xlabel('index')
plt.ylabel('Profit')
plt.title("Profit distribution")
sns.despine()

sns.distplot(df['Profit'])
plt.title("Distribution of Profit")
sns.despine()

print("Skewness: %f" % df['Profit'].skew())
print("Kurtosis: %f" % df['Profit'].kurt())

The Superstore’s Profit distribution has both a positive tail and a negative tail. However, the positive tail is longer than the negative tail, so the distribution is positively skewed, and the data are heavy-tailed, with a profusion of outliers.

There are two regions where the data has a low probability of appearing: one on the right side of the distribution, another on the left.

Univariate Anomaly Detection on Sales

Isolation Forest is an outlier detection algorithm that returns an anomaly score for each sample. It is based on the fact that anomalies are data points that are few and different. Isolation Forest is a tree-based model. In these trees, partitions are created by first randomly selecting a feature and then selecting a random split value between the minimum and maximum value of the selected feature.
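
The random-split idea can be illustrated with a toy, single-feature sketch in plain Python. This is not Sklearn’s IsolationForest (it skips subsampling and the score normalization, and all names and data here are hypothetical); it only shows why points that are few and different get isolated after fewer random splits.

```python
import random

def isolation_depth(x, values, rng, depth=0, max_depth=12):
    """Number of random splits needed to separate x from the other values."""
    if len(values) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(values), max(values)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)  # random split between the minimum and maximum
    # Keep only the values that fall on the same side of the split as x.
    same_side = [v for v in values if (v < split) == (x < split)]
    return isolation_depth(x, same_side, rng, depth + 1, max_depth)

def mean_isolation_depth(x, values, n_trees=200, seed=0):
    """Average isolation depth of x over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(x, values, rng) for _ in range(n_trees)) / n_trees

# Hypothetical sales figures: 100 ordinary orders plus one very large order.
sales = [float(v) for v in range(100)] + [10000.0]

typical_depth = mean_isolation_depth(50.0, sales)
outlier_depth = mean_isolation_depth(10000.0, sales)
print(typical_depth, outlier_depth)
```

The very large order is usually cut off by the first split, so its average depth is close to 1, while an ordinary order needs several more splits. A short average path is what an isolation forest turns into a high anomaly score.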

The following process shows how IsolationForest behaves in the case of the Superstore’s sales. The algorithm is implemented in Sklearn, and the code was largely borrowed from this tutorial.

Train IsolationForest using the Sales data.
Store the Sales in a NumPy array for use in our models later.
Compute the anomaly score for each observation. The anomaly score of an input sample is computed as the mean anomaly score of the trees in the forest.
Classify each observation as an outlier or non-outlier.
The visualization highlights the regions where the outliers fall.

<a href="https://medium.com/media/76083426ae267360986b1ed5abcc3731/href">https://medium.com/media/76083426ae267360986b1ed5abcc3731/href</a>

According to the above results and visualization, it seems that Sales exceeding 1000 would definitely be considered an outlier.

Visually investigate one anomaly

df.iloc[10]

This purchase seems normal to me, except that it was a larger amount of sales compared with the other orders in the data.

Univariate Anomaly Detection on Profit

Train IsolationForest using the Profit variable.
Store the Profit in a NumPy array for use in our models later.
Compute the anomaly score for each observation. The anomaly score of an input sample is computed as the mean anomaly score of the trees in the forest.
Classify each observation as an outlier or non-outlier.
The visualization highlights the regions where the outliers fall.

<a href="https://medium.com/media/5307a6dbdae6131bd2a3de04fc342150/href">https://medium.com/media/5307a6dbdae6131bd2a3de04fc342150/href</a>

Visually investigate some of the anomalies

According to the above results and visualization, it seems that Profit below -100 or above 100 would be considered an outlier. Let’s visually examine one example of each, as determined by our model, and see whether they make sense.

df.iloc[3]

Any negative profit would be an anomaly and should be further investigated; this goes without saying.

df.iloc[1]

Our model determined that this order with a large profit is an anomaly. However, when we investigate this order, it could be just a product that has a relatively high margin.

The above two visualizations show the anomaly scores and highlight the regions where the outliers are. As expected, the anomaly score reflects the shape of the underlying distribution and the outlier regions correspond to low probability areas.

However, univariate analysis can only get us so far. We may realize that some of the anomalies determined by our models are not the anomalies we expected. When our data is multidimensional as opposed to univariate, the approaches to anomaly detection become more computationally intensive and more mathematically complex.

Multivariate Anomaly Detection

Most of the analyses that we end up doing are multivariate due to the complexity of the world we are living in. In multivariate anomaly detection, an outlier is a combined unusual score on at least two variables.

So, using the Sales and Profit variables, we are going to build an unsupervised multivariate anomaly detection method based on several models.

We are using PyOD which is a Python library for detecting anomalies in multivariate data. The library was developed by Yue Zhao.

Sales & Profit

When we are in business, we expect Sales & Profit to be positively correlated. If some of the Sales data points and Profit data points are not positively correlated, they would be considered outliers and need to be further investigated.

sns.regplot(x="Sales", y="Profit", data=df)
sns.despine();

From the above correlation chart, we can see that some of the data points are obvious outliers, such as extremely low and extremely high values.

Cluster-based Local Outlier Factor (CBLOF)

CBLOF calculates the outlier score based on the cluster-based local outlier factor. An anomaly score is computed as the distance of each instance to its cluster center multiplied by the number of instances belonging to its cluster. The PyOD library includes a CBLOF implementation.

The following code is borrowed from the PyOD tutorial combined with this article.

Scale Sales and Profit to between zero and one.
Arbitrarily set the outliers fraction to 1% based on trial and best guess.
Fit the data to the CBLOF model and predict the results.
Use the threshold value to decide whether a data point is an inlier or an outlier.
Use the decision function to calculate the anomaly score for every point.

<a href="https://medium.com/media/86cc8e3e1b8bcb4cf8c0250afa65d257/href">https://medium.com/media/86cc8e3e1b8bcb4cf8c0250afa65d257/href</a>
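
The scaling and thresholding steps are independent of the detector, so they can be sketched without PyOD. The following is a rough standard-library illustration with hypothetical data and helper names; the anomaly score used here (distance from the mean of the scaled data) is only a stand-in and is not CBLOF.

```python
import math

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)  # assumes the values are not all equal
    return [(v - lo) / (hi - lo) for v in values]

def flag_outliers(scores, outliers_fraction=0.01):
    """Label the highest-scoring fraction of points 1 (outlier), the rest 0."""
    n_outliers = max(1, round(len(scores) * outliers_fraction))
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    labels = [1 if s >= threshold else 0 for s in scores]
    return labels, threshold

# Hypothetical data: 99 ordinary orders and one large loss-making order.
sales = [float(s) for s in range(1, 100)] + [5000.0]
profit = [0.1 * s for s in range(1, 100)] + [-2000.0]

scaled = list(zip(min_max_scale(sales), min_max_scale(profit)))

# Stand-in anomaly score (NOT CBLOF): distance from the mean of the scaled data.
centre = tuple(sum(column) / len(scaled) for column in zip(*scaled))
scores = [math.dist(point, centre) for point in scaled]

labels, threshold = flag_outliers(scores)
print(sum(labels), labels[-1])
```

With 100 points and an outliers fraction of 1%, exactly one point is flagged, and it is the loss-making order. A real detector would replace the stand-in score with its own decision function.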

Histogram-based Outlier Detection (HBOS)

HBOS assumes feature independence and calculates the degree of anomaly by building histograms. In multivariate anomaly detection, a histogram for each single feature can be computed, scored individually and combined at the end. When using the PyOD library, the code is very similar to that for CBLOF.
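
The histogram idea can be sketched in a few lines of plain Python. This is a simplified illustration, not PyOD’s HBOS implementation: it assumes equal-width bins, normalizes each histogram so that its tallest bin has height 1, and sums log(1 / height) over the features. The function name and the data are hypothetical.

```python
import math

def hbos_scores(rows, n_bins=10):
    """Histogram-based outlier score: sum over features of log(1 / bin height)."""
    scores = [0.0] * len(rows)
    for column in zip(*rows):  # one histogram per feature, scored independently
        lo, hi = min(column), max(column)
        width = (hi - lo) / n_bins or 1.0  # avoid zero width for a constant feature
        bins = [min(int((v - lo) / width), n_bins - 1) for v in column]
        counts = [0] * n_bins
        for b in bins:
            counts[b] += 1
        tallest = max(counts)
        for i, b in enumerate(bins):
            height = counts[b] / tallest  # normalised so the tallest bin is 1
            scores[i] += math.log(1.0 / height)
    return scores

# Hypothetical (Sales, Profit) rows: 50 ordinary points and one extreme point.
rows = [(i % 10, i % 7) for i in range(50)] + [(100, 100)]
scores = hbos_scores(rows)
print(scores[0], scores[-1])
```

Points that sit in crowded bins score 0, while the extreme point sits alone in a sparse bin for both features and gets the largest score.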

<a href="https://medium.com/media/69304d021a8ec189a14d49483322b314/href">https://medium.com/media/69304d021a8ec189a14d49483322b314/href</a>

Isolation Forest

Isolation Forest is similar in principle to Random Forest and is built on the basis of decision trees. Isolation Forest isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that selected feature.

The PyOD Isolation Forest module is a wrapper of Scikit-learn Isolation Forest with more functionalities.

<a href="https://medium.com/media/86715c01859ecc0c4c19a71ad36a36cd/href">https://medium.com/media/86715c01859ecc0c4c19a71ad36a36cd/href</a>

K-Nearest Neighbors (KNN)

KNN is one of the simplest methods in anomaly detection. For a data point, its distance to its kth nearest neighbor could be viewed as the outlier score.
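
As a rough illustration of that definition, the kth-nearest-neighbor distance can be computed by brute force in plain Python. The data and the function name are hypothetical, and a library such as PyOD would use faster neighbor searches than this quadratic loop.

```python
import math

def knn_outlier_scores(points, k=3):
    """Outlier score of each point = distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores

# Hypothetical data: a 5 x 5 grid of ordinary points plus one far-away point.
points = [(float(x), float(y)) for x in range(5) for y in range(5)] + [(50.0, 50.0)]
scores = knn_outlier_scores(points)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(points[most_anomalous], scores[most_anomalous])
```

Grid points have close neighbors and therefore small scores, while the far-away point has the largest kth-neighbor distance and is ranked as the strongest outlier.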

<a href="https://medium.com/media/49bca04ed9fd70173c0a909037fbe850/href">https://medium.com/media/49bca04ed9fd70173c0a909037fbe850/href</a>

The anomalies predicted by the above four algorithms were not very different.

Visually investigate some of the anomalies

We may want to investigate each of the outliers determined by our model. For example, let’s look in detail at a couple of outliers determined by KNN, and try to understand what makes them anomalies.

df.iloc[1995]

For this particular order, a customer purchased 5 products with a total price of 294.62 and a profit below -766, at an 80% discount. It seems like a clearance. We should be aware of the loss for each product we sell.

df.iloc[9649]

For this purchase, it seems to me that the profit of around 4.7% is too small, and the model determined that this order is an anomaly.

df.iloc[9270]

For the above order, a customer purchased 6 products at a total price of 4305; after a 20% discount, we still get over 33% profit. We would love to have more of these kinds of anomalies.

Jupyter notebook for the above analysis can be found on Github. Enjoy the rest of the week.

Anomaly Detection for Dummies was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Churn Analytics: from data analysis to Machine learning

Written on June 26, 2019. Posted in Mohammad Chowdhury.

Churn Analytics: Data Analysis to Machine learning

Customers are one of the most precious resources in any business, and acquiring clients can be time consuming and expensive. Retaining the most profitable clients can be one of the best strategies a business can have. Identifying the clients before they leave is crucial. That’s where churn analysis comes in very handy in Data Science.

Businesses and organizations are interested in knowing the cluster/segment/group of clients who are likely to leave. Retention is more cost-effective than acquiring a new customer; there is always a cost & risk involved in acquiring a new client. Here is an example of churn analytics & applied Machine Learning on a banking client dataset.

Data

The dataset comes from Kaggle, and it is related to European banking clients of countries like France, Germany, and Spain. The classification goal is to predict whether the client will churn (1) or stay (0). The dataset can be downloaded from here.

Input Variables

RowNumber: each row consists of one client’s information (numeric)

CustomerId: unique identifier for customers (numeric)

Surname: last name of the client (categorical)

CreditScore: Credit score of the client(numeric)

Geography: the territory of the customers (categorical)

Gender: male or female (categorical)

Age: age of the client (numeric)

Tenure: the time with the bank as a client (numeric)

Balance: balance (numeric)

NumOfProducts: how many accounts or bank-affiliated products the person has (numeric)

HasCrCard: the person has a credit card or not (categorical)

IsActiveMember: active product user with transaction vs no activity or transaction (categorical)

EstimatedSalary: estimated salary income of each client (numeric)

Exited: attrition; did they leave the bank after all? Yes (1), No (0) (categorical)

Predict variable (desired target):

Exited: has the client churned? (binary: “1” means “Yes”, “0” means “No”)

Data Preprocessing

I have used pandas for data preprocessing. The data set came with column labels, and each row represents a single client’s data. In terms of missing values or duplicates, the data came pretty clean (a rare case in real-world data).
Besides Python, pandas, and sk-learn, AWS cloud services (S3, EC2), Linux, Excel & Tableau Public are used for this end-to-end project.

AWS Steps:

Start an EC2 instance, install all relevant libs with anaconda distribution & Jupyter notebook (use Linux CentOS)
Open S3 bucket
Export the Data to S3
Mount Data on EC2
Clean, explore, analyse, and model the data using Python
Connect Tableau to S3 for dashboarding and reporting (alternatively, AWS Quicksight can be used)

Pandas Dataframe

Snapshot of the Data

Statistical Summary

Feature Engineering:

Finding Correlated Features

It shows that none of the features are highly correlated with each other

Some of the features (Geography, Gender, Surname) came in as pandas objects; others (RowNumber, CustomerId, CreditScore, Age, Tenure, NumOfProducts, HasCrCard, IsActiveMember) came in as integers. Those columns need to be feature engineered for machine learning. I transformed the object & int features into floats & also created new encoded features for Geography and Gender.

Insights & Analytics:

Here are some of the Insights drawn from the dataset (using Tableau public)

1. The proportion of Churn to Non-Churn

20% Churn /Attrition

2. Gender Proportion to Churn

**Female churn is higher: the mean churn for female customers is 0.250715, whereas for male customers it is 0.164559**

**Female customers are more likely to churn compared to male customers**

3. Countrywise churn

**Mean of country wise churn shows Germany has a higher churn compared to France and Spain**

4. Does the Number of Products affect Churn?

**Clients with multiple products are less likely to churn, whereas single-product holders have the highest churn**

5. Does the Balance have any influence on Churn?

**Customers with higher balances show a lower likelihood of churn**

Dashboard:

The dashboard shows an overall summary of the features influencing the attrition rate. Some of the most influential features affecting churn are number of products, credit card, inactivity, country, credit score, balance, gender, and age range.

The link to the public interactive dashboard

https://public.tableau.com/profile/tests#!/vizhome/ProductsVSCHURN/ChurnAnalyticsDashboard

Model Building

Training the model

Training set uses 80% of the data, rest for test set

Testing the model

20% of the data is used for test set
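
An 80/20 split like this can be sketched with the standard library alone. This is only an illustration: the helper name, the fixed seed and the placeholder rows are assumptions, and the original project may well have used a library helper instead.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and hold out test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder row ids standing in for the client records.
rows = list(range(1000))
train, test = holdout_split(rows)
print(len(train), len(test))
```

Shuffling before splitting avoids any ordering in the file leaking into the split, and the two parts never share a row.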

Prediction using Machine Learning

Logistic Regression

It is a classification algorithm that is used to predict the probability of a categorical dependent variable in Machine Learning. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, churn) or 0 (no churn). In other words, the logistic regression model predicts P(Y=1) as a function of X.

We are trying to predict whether the clients are likely to leave or stay, so the outcome is binary. Here the logistic regression algorithm statistically analyzes the features to determine whether a client will churn or not.
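
To make “P(Y=1) as a function of X” concrete, here is a minimal sketch of how a fitted logistic regression turns features into a churn probability. The two features, the weights and the bias below are hand-picked, hypothetical values for illustration; they are not coefficients from the model in this project.

```python
import math

def churn_probability(features, weights, bias):
    """P(Y=1 | X): the logistic (sigmoid) function of a linear score."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_churn(features, weights, bias, threshold=0.5):
    """Return 1 (churn) if the predicted probability reaches the threshold."""
    return 1 if churn_probability(features, weights, bias) >= threshold else 0

# Hypothetical coefficients for two features (is_active_member, num_products);
# a fitted model would learn these from the training set.
weights = [-1.0, -0.5]
bias = 0.5

print(churn_probability([0.0, 1.0], weights, bias))  # inactive, one product
print(churn_probability([1.0, 2.0], weights, bias))  # active, two products
```

With these made-up coefficients, the inactive single-product client sits exactly at the 0.5 decision boundary, while the active client with two products gets a lower churn probability and is predicted to stay.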

Here is the application of the algorithm

DecisionTree

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Here the decision tree represents a boolean function (Y/N): whether the client will churn or not.

Gradient Boosting | Gradient Boosting Machine (GBM)

Here gradient boosting classifies whether a client will churn or not. It is a predictive model in the form of an ensemble of decision trees.

Model Performance:

Multiple machine learning algorithms were applied with fine-tuning. Here are some of the outcomes of the models in terms of accuracy scores:

•Logistic Regression 78.65%

•Decision Tree 79.45%

•Random Forest 84.85%

•SVM accuracy 79.80%

•Gradient Boosting 86.35%

•AdaBoost 86.35%

The algorithms that gave the highest accuracy scores are Gradient Boosting and AdaBoost, compared to Decision Tree & Logistic Regression.

What is AUC?

Written on June 24, 2019. Posted in Vimarsh Karbhari.

Data Science Interview Questions based on AUC.

A few weeks ago, I wrote about ROC curves. The purpose was to provide a basic primer on ROC curves. As a follow up, this article talks about AUC.

AUC stands for Area Under the Curve. ROC can be quantified using AUC. The way it is done is to see how much area has been covered by the ROC curve. If we obtain a perfect classifier, then the AUC score is 1.0. If the classifier is random in its guesses, then the AUC score is 0.5. In the real world, we don’t expect an AUC score of 1.0, but if the AUC score for the classifier is in the range of 0.6 to 0.9, then it is considered to be a good classifier.

In the preceding figure, the area under the curve which has been covered becomes our AUC score. This gives us an indication of how good or bad our classifier is performing. ROC and AUC are the two indicators that can provide us with insights on how our classifier performs.
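
A small sketch can make the score tangible. The area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half, so it can be computed without drawing the curve. The function name and the toy labels and scores below are made up for illustration.

```python
def auc_score(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties = 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
print(auc_score(labels, [0.1, 0.2, 0.8, 0.9]))   # every positive outranks every negative
print(auc_score(labels, [0.5, 0.5, 0.5, 0.5]))   # scores carry no information
print(auc_score(labels, [0.1, 0.4, 0.35, 0.8]))  # one pair is ranked the wrong way
```

The three calls give 1.0 for the perfect ranking, 0.5 for the uninformative scores, and 0.75 when one of the four pairs is ranked the wrong way.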

Reference: ML Solutions

What is AUC? was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Expedia Data Science Interview Questions

Written on June 24, 2019. Posted in Vimarsh Karbhari.

There are 37 million Expedia members across 32 countries.

Expedia has covered 534 billion miles in air travel, which is enough for 72 round trips (in passenger miles flown) from the sun to Pluto and back. Expedia is a travel company like Booking.com, which we have covered at Acing AI previously. It has sold enough hotel room nights in the last 20 years to account for every person living in the United States. The amount of data Expedia accumulates by having so many travellers every year leads to huge investment in technology. Expedia has invested over $850M trailing year over year in tech spend. A mature tech stack helps Data Scientists do their job. This is a great opportunity for any Data Scientist to build their career.

Interview Process

If you are short-listed after resume screening, there is a first interview with the manager of the data science team. This includes technical questions about machine learning and statistics. After clearing that, there is a technical coding interview. The third round is an interview with HR, a more classic and typical job interview.

Important Reading

Streaming Data Ecosystems: Slideshare
Spark Streaming + Kafka Best Practices at Expedia: Slideshare
Tips by an Insider on Data Science Interviewing: Sixty People at Expedia

Data Science Related Interview Questions

What is the process of cross validation?
How can we do price optimization for properties on Expedia?
Predict Hotel prices in a given dataset.
Explain a Machine Learning project on your resume.
Develop a recommendation system based on a provided dataset.
Which flight path is more profitable: London-Lisbon or London-Milan?
Should we invest in buying more property in X city?
Explain linear and logistic regression.
Give pros and cons of SVM.
Explain the meaning of overfitting to non-technical people.

Reflecting on the Question

The data science team at Expedia is geographically dispersed. The technical team has built a very mature data science architecture that enables the Data Science team. The questions are based on the questions the data science team at Expedia answers day to day. Great product sense about the Expedia product and its business can surely land you a job at one of the world’s largest travel sites!

The sole motivation of this blog article is to learn about Expedia and its technologies helping people to get into it. All data is sourced from online public sources. I aim to make this a living document, so any updates and suggested changes can always be included. Please provide relevant feedback.

Expedia Data Science Interview Questions was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Lyft Data Science Interview Questions

Written on June 10, 2019. Posted in Vimarsh Karbhari.

As of January 2018, Lyft could count 23 million users.

Lyft currently offers services in 350 US cities, and in Toronto and Ottawa in Canada. It was launched in 2012 as a part of the long-distance car-pooling business Zimride, the largest such app in the US (named for transportation culture in Zimbabwe). It was later renamed Lyft. Launched in Silicon Valley, Lyft spread from 60 US cities in April 2014 to 300 in January 2017, to 350 today, plus the two aforementioned Canadian cities. With 350 cities, millions of users and billions of rides, the data generated at Lyft is huge. The product achieves economies of scale by deploying Data Science. Hence, data science is a core part of the product and not just an added feature.

Interview Process

The interview process starts with a phone interview with a Data Scientist. It is an in-depth conversation about your resume and past projects. That interview is followed by a take-home test, which is usually built around a ride-sharing data set. As part of the take-home test, there is a presentation which has to be created for the onsite interview. The onsite interview consists of 4–5 interviews. One of those is the presentation of the take-home test. It also includes a SQL test, stats and probability, and a business case. There is a final core values interview to see if you fit within the Lyft culture. The interview is challenging, but the reward when you clear it is totally worth it.

Important Reading

Source: From shallow to deep learning in fraud

Predicting Seasonality: How to deal with the seasonality of a market
Interactions in fraud experiments: A case study in multi-variable testing
Personalization among drivers and passengers: Personalized marketing with machine learning

Data Science Related Interview Questions

Find expectations of a random variable with basic distribution. How would you construct a confidence interval? How would you estimate a probability of ordering a ride? What assumptions do you need in order to estimate this probability?
What optimization techniques are you familiar with and how do they work? How would you find the optimal price given a linear demand function?
A coin came up heads x times in y flips. How can we test if this is a fair coin?
What are some metrics for monitoring supply and demand in a Lyft market?
Explain correlation and variance.
What is the lifetime value of a driver?
Implement k nearest neighbour using a quad tree.
What are the different factors that could influence a rise in average wait time of a driver?
What are the best ways to achieve pool matching?
How do you reduce churn on the supply side?
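
One possible way to approach the fair-coin question above is an exact two-sided binomial test. The sketch below is only an illustration, not an answer taken from the interviews: the function name fair_coin_p_value is made up for this example, and it assumes the flips are independent.

```python
from math import comb


def fair_coin_p_value(heads: int, flips: int) -> float:
    """Two-sided exact binomial p-value under H0: P(heads) = 0.5.

    Hypothetical helper for illustration; assumes independent flips.
    """
    if not 0 <= heads <= flips:
        raise ValueError("heads must be between 0 and flips")
    # Probability of every possible head count if the coin is fair.
    probs = [comb(flips, k) / 2 ** flips for k in range(flips + 1)]
    observed = probs[heads]
    # Add up all outcomes that are at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    total = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    # A small p-value (for example below 0.05) is evidence against a fair coin.
    print(fair_coin_p_value(heads=60, flips=100))
```

For large y, a normal approximation to the binomial is a common alternative, and in an interview it is worth stating the significance level you would use before looking at the result.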

Reflecting on the Questions

The Data Science team at Lyft moves very quickly. The data sets are huge and the problems so wide-ranging that the team explores different types of models that can provide higher precision for the same recall and feature set. The questions reflect the tough problems the team faces day to day. There is a mix of model building along with complex coding questions. As I mentioned before, the interviews are tough, but they are well worth it for getting to work in an excellent team. Hard work can surely get you a job in one of the world’s largest transportation companies!

Subscribe to our Acing AI newsletter, I promise not to spam and it’s FREE!

Acing AI Newsletter – Revue

Thanks for reading! 😊 If you enjoyed it, test how many times you can hit 👏 in 5 seconds. It’s great cardio for your fingers AND will help other people see the story.

The sole motivation of this blog article is to learn about Lyft and its technologies, and to help people get into it. All data is sourced from online public sources. I aim to make this a living document, so any updates and suggested changes can always be included. Please provide relevant feedback.

Lyft Data Science Interview Questions was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

https://github.com/susanli2016/SMS-Message-Spam-Detector

Written on June 9, 2019. Posted in Susan Li.

Booking.com Data Science Interview Questions

Written on June 3, 2019. Posted in Vimarsh Karbhari.

There are 28 Million+ listings to stay on Booking.com.

Booking.com is the travel e-commerce part of Booking Holdings. They have 140,000+ destinations in 230 countries all over the world. They also have 1.5 Million+ nights reserved every day on their platform. From a data science perspective, this translates into over 300 TB of data. A robust data engineering infrastructure coupled with huge amounts of data makes Booking.com one of the best places for a Data Scientist to build their career.

Interview Process

The interview process starts with an MCQ-based test on machine learning and statistics. That is followed by the HR phone interview. Once you clear both of those, there is a technical phone interview with data scientists. This is based around your projects and also includes a case study discussion. Finally, there is an onsite interview which consists of technical interviews, a behavioural interview and a hiring manager interview.

Important Reading

Data Streaming Ecosystem: Booking.com Platform
Productionizing of Behavioral Features: ML with Apache Spark
Druid: Interactive Analytics at (big) Scale

Data Science Related Interview Questions

What is the difference between L1 and L2 regularization?
What is gradient descent?
Why did you use Random Forests instead of Clustering on a particular problem? (case study)
How to deal with new hotels that do not have an official rating?
If the training error and the testing error are both high as the number of data points increases, what measures will you take to fix the model?
How would you optimize the advertising that directs people to your site? How do you evaluate how much to spend on each channel?
What do you do to make sure your model is not overfitting?
Given a business case as such, how would you handle this with a Machine Learning solution?
How did you validate your model?
What are the parameters of decision trees and random forests, and how would you choose them?
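
For the gradient descent question above, a tiny worked example can help anchor the explanation. The sketch below is only an illustration under simple assumptions: the function name gradient_descent, the learning rate and the step count are all chosen for this example, and the objective is a one-dimensional quadratic rather than a real model loss.

```python
from typing import Callable


def gradient_descent(
    grad: Callable[[float], float],
    start: float,
    learning_rate: float = 0.1,
    steps: int = 200,
) -> float:
    """Minimise a 1-D function by repeatedly stepping against its gradient.

    Hypothetical helper for illustration; defaults are arbitrary choices.
    """
    x = start
    for _ in range(steps):
        # Move a small step in the direction that decreases the function.
        x = x - learning_rate * grad(x)
    return x


# Example objective: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
# Its minimum is at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)

if __name__ == "__main__":
    print(minimum)
```

The same update rule applies to model training, where x becomes the parameter vector and grad is the gradient of the loss. A learning rate that is too large makes the iterates diverge, while one that is too small makes convergence slow.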

Reflecting on the Questions

Booking.com is headquartered in Amsterdam but has offices all over the globe. Data dictates the spending and drives efficiency in their business. It is a critical component of their product. The questions are about deep data science fundamentals and also about the different situations within their business where they deploy data science. A good knowledge of data science fundamentals coupled with know-how about their business can surely land you a job with one of the world’s largest booking sites!

The sole motivation of this blog article is to learn about Booking.com and its technologies, and to help people get into it. All data is sourced from online public sources. I aim to make this a living document, so any updates and suggested changes can always be included. Please provide relevant feedback.

Booking.com Data Science Interview Questions was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

By myself only. As you have seen, I am not an expert on everything.

Written on June 3, 2019. Posted in Susan Li.
