Category: Toronto People

When Topic Modeling is Part of the Text Pre-processing

Written on October 4, 2019. Posted in Susan Li.

How to effectively and creatively pre-process text data

A few months ago, we built a content based recommender system using a relative clean text data set. Because I collected the hotel descriptions my self, I made sure that the descriptions were useful for the goals we were going to accomplish. However, the real-world text data is never clean and there are different pre-processing ways and steps for different goals.

Topic modeling in NLP is rarely my final goal in an analysis, I use it often to either explore data or as a tool to make my final model more accurate. Let me show you what I meant.

The Data

We are still using the Seattle Hotel description data set I collected earlier, and I made it a bit more messier this time. We are going to skip all the EDA processes and I want to make recommendations as quickly as possible.

If you have read my previous post, I am sure you understand the following code script. Yes, we are looking for top 5 most similar hotels with “Hilton Garden Inn Seattle Downtown” (except itself), according to hotel description texts.

Make Recommendations

https://medium.com/media/aafe45c62a3f78bdd069868868db2b66/href

Our model returns the above 5 hotels and thinks they are top 5 most similar hotels to “Hilton Garden Inn Seattle Downtown”. I am sure you don’t agree, neither do I. Let’s say why the model thinks they are similar by looking at these descriptions.

df.loc['Hilton Garden Inn Seattle Downtown'].desc

df.loc["Mildred's Bed and Breakfast"].desc

df.loc["Seattle Airport Marriott"].desc

Found anything interesting? Yes, there are indeed somethings in common in these three hotel descriptions, they all have the same check in and check out time, and they all have the similar smoking policies. But are they important? Can we declare two hotels are similar just because they are all “non-smoking”? Of course not, these are not important characteristics and we shouldn’t measure similarity in vector space of these texts.

We need to find a way to safely remove these texts programmatically, while not removing any other useful characteristics.

Topic modeling comes to our rescue. But before that, we need to wrangle the data to make it in the right shape.

Split each description into sentences. Hilton Garden Seattle Downtown’s entire description will be split into 7 sentences.

https://medium.com/media/57f6877b6638b27c9e142fa3a5bb7c63/href

Topic Modeling

We are going to build topic model for all the sentences together. I decided to have 40 topics after several experiments.

https://medium.com/media/0478e76d0041ce7510ce5092c27e42b9/href

Not too bad, there were not too much overlapping.

To understand better, you may want to investigate top 20 words in each topic.

https://medium.com/media/3667d3da9af7aee9b9e721ffd1ae853b/href

We shall have 40 topics, and each topic shows 20 keywords. Its very hard to print out the entire table, I will only show a small part of it.

By staring at the table, we can guess that at least topic 12 should be one of the topics we would like to dismiss, because it contains several words that meaningless for our purpose.

In the following code scripts, we:

Create document-topic matrix.
Create a data frame where each document is a row, and each column is a topic.
The weight of each topic is assigned to each document.
The last column is the dominant topic for that document, in which it carries the most weight.
When we merge this data frame to the previous sentence data frame. We are able to find the the weight of each topic in every sentence, and the dominant topic for each sentence.

https://medium.com/media/4d76caf58479fd78fe5beaae6256fabd/href

Now we can visually examine dominant topics assignment of each sentence for “Hilton Garden Inn Seattle Downtown”.

df_sent_topic.loc[df_sent_topic['name'] == 'Hilton Garden Inn Seattle Downtown'][['sentence', 'dominant_topic']]

By staring at the above table, my assumption is that if a sentence’s dominant topic is topic 4 or topic 12, that sentence is likely to be useless.
Let’s see a few more example sentences that have topic 4 or topic 12 as their dominant topic.

df_sent_topic.loc[df_sent_topic['dominant_topic'] == 4][['sentence', 'dominant_topic']].sample(20)

df_sent_topic.loc[df_sent_topic['dominant_topic'] == 12][['sentence', 'dominant_topic']].sample(10)

After reviewing the above two tables, I decided to remove all the sentences that have topic 4 or topic 12 as their dominant topic.

print('There are', len(df_sent_topic.loc[df_sent_topic['dominant_topic'] == 4]), 'sentences that belong to topic 4 and we will remove')
print('There are', len(df_sent_topic.loc[df_sent_topic['dominant_topic'] == 12]), 'sentences that belong to topic 12 and we will remove')

df_sent_topic_clean = df_sent_topic.drop(df_sent_topic[(df_sent_topic.dominant_topic == 4) | (df_sent_topic.dominant_topic == 12)].index)

Next, we will join the clean sentence together in to a descriptions. That is, making it back to one description per hotel.

df_description = df_sent_topic_clean[['sentence','name']]
df_description = df_description.groupby('name')['sentence'].agg(lambda col: ' '.join(col)).reset_index()

Let’s see what left for our “Hilton Garden Inn Seattle Downtown”

df_description['sentence'][45]

There is only one sentence left and it is about the location of the hotel and this is what I had expected.

Make Recommendations

Using the same cosine similarity measurement, we are going to find the top 5 most similar hotels with “Hilton Garden Inn Seattle Downtown” (except itself), according to the cleaned hotel description texts.

https://medium.com/media/2338ec051f9d736f4062aa769eace360/href

Nice! Our method worked!

Jupyter notebook can be found on Github. Have a great weekend!

Correct.

Written on September 26, 2019. Posted in Susan Li.

Correct.

Thank you for your feedback. I did not try this time.

Written on September 23, 2019. Posted in Susan Li.

Thank you for your feedback. I did not try this time.

Classify Toxic Online Comments with LSTM and GloVe

Written on September 22, 2019. Posted in Susan Li.

Deep learning, text classification, NLP

This article shows how to use a simple LSTM and one of the pre-trained GloVe files to create a strong baseline for the toxic comments classification problem.

The article consist of 4 main sections:

Preparing the data
Implementing a simple LSTM (RNN) model
Training the model
Evaluating the model

The Data

In the following steps, we will set the key model parameters and split the data.

“MAX_NB_WORDS” sets the maximum number of words to consider as features for tokenizer.
“MAX_SEQUENCE_LENGTH” cuts off texts after this number of words (among the MAX_NB_WORDS most common words).
“VALIDATION_SPLIT” sets a portion of data for validation and not used in training.
“EMBEDDING_DIM” defines the size of the “vector space”.
“GLOVE_DIR” defines the GloVe file directory.
Split the data into the texts and the labels.

https://medium.com/media/e404f87284057d710e0e0f3967897354/href

Text Pre-processing

In the following step, we remove stopwords, punctuation and make everything lowercase.

https://medium.com/media/eadbd6070963e87f8225525edc19beff/href

Have a look a sample data.

print('Sample data:', texts[1], y[1])

We create a tokenizer, configured to only take into account the MAX_NB_WORDS most common words.
We build the word index.
We can recover the word index that was computed.

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print('Vocabulary size:', len(word_index))

Turns the lists of integers into a 2D integer tensor of shape (samples, maxlen)
Pad after each sequence.

data = pad_sequences(sequences, padding = 'post', maxlen = MAX_SEQUENCE_LENGTH)

print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', y.shape)

Shuffle the data.

indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = y[indices]

Create the train-validation split.

num_validation_samples = int(VALIDATION_SPLIT*data.shape[0])
x_train = data[: -num_validation_samples]
y_train = labels[: -num_validation_samples]
x_val = data[-num_validation_samples: ]
y_val = labels[-num_validation_samples: ]

print('Number of entries in each category:')
print('training: ', y_train.sum(axis=0))
print('validation: ', y_val.sum(axis=0))

This is what the data looks like:

print('Tokenized sentences: n', data[10])
print('One hot label: n', labels[10])

Create the model

We will use pre-trained GloVe vectors from Stanford to create an index of words mapped to known embeddings, by parsing the data dump of pre-trained embeddings.
Then load word embeddings into an embeddings_index

https://medium.com/media/4a58a99c42bc4a3760b0155be6d63f59/href

Create the embedding layers.
Specifies the maximum input length to the Embedding layer.
Make use of the output from the previous embedding layer which outputs a 3-D tensor into the LSTM layer.
Use a Global Max Pooling layer to to reshape the 3D tensor into a 2D one.
We set the dropout layer to drop out 10% of the nodes.
We define the Dense layer to produce a output dimension of 50.
We feed the output into a Dropout layer again.
Finally, we feed the output into a “Sigmoid” layer.

https://medium.com/media/6d48566a561c43069f9da57f9ad9e800/href

Its time to Compile the model into a static graph for training.

Define the inputs, outputs and configure the learning process.
Set the model to optimize our loss function using “Adam” optimizer, define the loss function to be “binary_crossentropy” .

model = Model(sequence_input, preds)
model.compile(loss = 'binary_crossentropy',
             optimizer='adam',
             metrics = ['accuracy'])

Training

Feed in a list of 32 padded, indexed sentence for each batch. The validation set will be used to assess whether the model has overfitted.
The model will run for 2 epochs, because even 2 epochs is enough to overfit.

print('Training progress:')
history = model.fit(x_train, y_train, epochs = 2, batch_size=32, validation_data=(x_val, y_val))

Evaluate the model

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(loss)+1)

plt.plot(epochs, loss, label='Training loss')
plt.plot(epochs, val_loss, label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show();

accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']

plt.plot(epochs, accuracy, label='Training accuracy')
plt.plot(epochs, val_accuracy, label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epochs')
plt.legend()
plt.show();

Jupyter notebook can be found on Github. Happy Monday!

Classify Toxic Online Comments with LSTM and GloVe was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Visa, Inc. Data Science Interviews

Written on September 22, 2019. Posted in Vimarsh Karbhari.

Visa Data Science Interviews

In 2018, Visa had $20.61 billion in revenue.

In 2015, the Nilson Report, a publication that tracks the credit card industry, found that Visa’s global network (known as VisaNet) processed 100 billion transactions during 2014 with a total volume of US$6.8 trillion. VisaNet data centers can handle up to 30,000 simultaneous transactions and up to 100 billion computations every second. Visa is a very household name all over the world. If you ever owned a credit card, you will surely know what Visa is. With a 100 billion transactions, the scale of data in the company is beyond compare. It could be a highlight of a Data professionals’ career.

Interview Process

A senior data scientist from the team reaches out for the first telephonic interview after the resume is selected. The interview involves resume based questions, SQL, and or a business case study. After the first round, there is another telephonic technical interview. Eventually, there are five on-site interviews. On-site interviews are with top level personnel, directors and VPs. Each of those interviews is 45 minutes long.

Important Reading

Data Science at Visa — Youtube
A Chat With Amitava Dutta Of Visa — Blog article
Docker at Visa — Dockerconf 2017

Data Science Related Interview Questions

Invert an integer without the use of strings.
Write a sorting algorithm.
How do you estimate a customer’s location based on Visa transaction data?
Write a code for a Fibonacci sequence.
What functions can I perform using a spreadsheet?
Who would be your first line of contact to report a missing data you’re keeping record of?
Give the top three employee salaries in each department in a company.
What is Node.js?
What is MVC?
What is synchronous vs asynchronous Javascript?

Reflecting on the Interviews

The data science interview at Visa, Inc. is a rigorous process which involves many different interviews. The team is top notch and they are looking for similar candidates to hire. Most interviews look for fundamentals in SQL, coding, probability and statistics as well as ML. A decent amount of hard work can surely get you a job with the world’s largest credit transaction processing company!

Subscribe to our Acing AI newsletter, I promise not to spam and its FREE!

Newsletter

Thanks for reading! 😊 If you enjoyed it, test how many times can you hit 👏 in 5 seconds. It’s great cardio for your fingers AND will help other people see the story.

The sole motivation of this blog article is to learn about Visa Inc. and its technologies and help people to get into it. All data is sourced from online public sources. I aim to make this a living document, so any updates and suggested changes can always be included. Please provide relevant feedback.

Visa, Inc. Data Science Interviews was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building A Collaborative Filtering Recommender System with TensorFlow

Written on September 12, 2019. Posted in Susan Li.

Collaborative Filtering is a technique widely used by recommender systems when you have a decent size of user — item data. It makes recommendations based on the content preferences of similar users.

Therefore, collaborative filtering is not a suitable model to deal with cold start problem, in which it cannot draw any inference for users or items about which it has not yet gathered sufficient information.

But once you have relative large user — item interaction data, then collaborative filtering is the most widely used recommendation approach. And we are going to learn how to build a collaborative filtering recommender system using TensorFlow.

The Data

We are again using booking crossing dataset that can be found here. The data pre-processing steps does the following:

Merge user, rating and book data.
Remove unused columns.
Filtering books that have had at least 25 ratings.
Filtering users that have given at least 20 ratings. Remember, collaborative filtering algorithms often require users’ active participation.

https://medium.com/media/defd12f7c924869652fef81d9795e6c6/href

So, our final dataset contains 3,192 users for 5,850 books. And each user has given at least 20 ratings and each book has received at least 25 ratings. If you do not have a GPU, this would be a good size.

The collaborative filtering approach focuses on finding users who have given similar ratings to the same books, thus creating a link between users, to whom will be suggested books that were reviewed in a positive way. In this way, we look for associations between users, not between books. Therefore, collaborative filtering relies only on observed user behavior to make recommendations — no profile data or content data is necessary.

Our technique will be based on the following observations:

Users who rate books in a similar manner share one or more hidden preferences.
Users with shared preferences are likely to give ratings in the same way to the same books.

The Process in TensorFlow

First, we will normalize the rating feature.

scaler = MinMaxScaler()
combined['Book-Rating'] = combined['Book-Rating'].values.astype(float)
rating_scaled = pd.DataFrame(scaler.fit_transform(combined['Book-Rating'].values.reshape(-1,1)))
combined['Book-Rating'] = rating_scaled

Then, build user, book matrix with three features:

combined = combined.drop_duplicates(['User-ID', 'Book-Title'])
user_book_matrix = combined.pivot(index='User-ID', columns='Book-Title', values='Book-Rating')
user_book_matrix.fillna(0, inplace=True)

users = user_book_matrix.index.tolist()
books = user_book_matrix.columns.tolist()

user_book_matrix = user_book_matrix.as_matrix()

tf.placeholder only available in v1, so I have to work around like so:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

In the following code scrips

We set up some network parameters, such as the dimension of each hidden layer.
We will initialize the TensorFlow placeholder.
Weights and biases are randomly initialized.
The following code are taken from the book: Python Machine Learning Cook Book — Second Edition

https://medium.com/media/b9d3ea77cc85bc45921c16d0d53e2ffb/href

Now, we can build the encoder and decoder model.

https://medium.com/media/7eefae55b835e0f730c1a2f1dd21c16d/href

Now, we construct the model and the predictions

encoder_op = encoder(X)
decoder_op = decoder(encoder_op)

y_pred = decoder_op

y_true = X

In the following code, we define loss function and optimizer, and minimize the squared error, and define the evaluation metrics.

loss = tf.losses.mean_squared_error(y_true, y_pred)
optimizer = tf.train.RMSPropOptimizer(0.03).minimize(loss)
eval_x = tf.placeholder(tf.int32, )
eval_y = tf.placeholder(tf.int32, )
pre, pre_op = tf.metrics.precision(labels=eval_x, predictions=eval_y)

Because TensorFlow uses computational graphs for its operations, placeholders and variables must be initialized before they have values. So in the following code, we initialize the variables, then create an empty data frame to store the result table, which will be top 10 recommendations for every user.

init = tf.global_variables_initializer()
local_init = tf.local_variables_initializer()
pred_data = pd.DataFrame()

We can finally start training our model.

We split training data into batches, and we feed the network with them.
We train our model with vectors of user ratings, each vector represents a user and each column a book, and entries are ratings that the user gave to books.
After a few trials, I discovered that training model for 100 epochs with a batch size of 35 would be consuming enough memories. This means that the entire training set will feed our neural network 100 times, every time using 35 users.
At the end, we must make sure to remove user’s ratings in the training set. That is, we must not recommend books to a user in which he (or she) has already rated.

https://medium.com/media/fcd7f1ee4eb1aa50cc9c45ecd402244c/href

Finally, let’s see how our model works. I randomly selected a user, to see what books we should recommended to him (or her).

top_ten_ranked.loc[top_ten_ranked['User-ID'] == 278582]

The above are the top 10 results for this user, sorted by the normalized predicted ratings.

Let’s see what books he (or she) has rated, sorted by ratings.

book_rating.loc[book_rating['User-ID'] == 278582].sort_values(by=['Book-Rating'], ascending=False)

The types of the books this user liked are: historical mystery novel, thriller and suspense novel, science and fiction novel, fantasy novel and so on.

The top 10 results for this user are: murder fantasy novel, mystery thriller novel and so on.

The results were not disappointing.

The Jupyter notebook can be found on Github. Happy Friday!

References:

Python Machine Learning Cook Book — Second Edition

https://cloud.google.com/solutions/machine-learning/recommendation-system-tensorflow-overview

Building A Collaborative Filtering Recommender System with TensorFlow was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Analysis of Data Science Interview Questions

Written on September 9, 2019. Posted in Vimarsh Karbhari.

Breaking down data science interview questions by category

tl;dr: Unless the data science role is nuanced, most data science roles require fundamental knowledge about the basics of data science. (SQL, Coding, Probability and Stats, Data Analysis)

We analyzed hundreds of data science interview questions to find trends, patterns and topics that are the core of a data science interview.

At a high level, we divided these questions into different categories. We added a weight to each category. Weight of a category is simply the number of times we found a question occurring or repeating in the bucket from a random corpus of 100 questions.

From the pie chart above, categories like SQL, coding are non ambiguous. Machine learning basics consists of Linear/Logistic regression and related ML algorithms. Advanced ML consists of comparisons between multiple approaches, algorithms and nuanced techniques. Big Data technology includes big data concepts such as Hadoop, Spark and may include the infrastructure side/data engineering/deployment side of data science models. In essence, data science fundamentals are asked 70% of the time in a data science interview.

While SQL and coding based questions might be part of the initial online assessment, data analysis questions tend to be a take-home assessment. The remaining categories are usually covered during the phone/in-person interview and vary based on the role, company, years of experience and team composition.

Considering all this data, we designed a data science interview course to help people Ace Data Science Interviews. All the categories mentioned above will be covered in this course. The current cohort starts September 16, 2019. It will be a small group of 15 people. Sign up here!

Acing Data Science Interviews

Subscribe to the Acing AI/Data Science Newsletter. It is FREE!

Newsletter

Analysis of Data Science Interview Questions was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Acing Data Science Interviews

Written on August 19, 2019. Posted in Vimarsh Karbhari.

According to Indeed, there is a 344% increase in demand for data scientists year over year.

In January 2018, I started the Acing AI blog with a goal to help people get into data science. My first article was about “The State of Autonomous Transportation”. As I wrote more, I realized people were interested in acing data science interviews. This led me to start my articles covering various companies’ data science interview questions and processes. The Acing AI blog will continue to have interesting articles as always. This journey continues with today’s exciting announcements.

First, we are launching the Acing Data Science/Acing AI newsletter. This newsletter will always be free. We will be sharing interesting data science articles, interview tips and more via the newsletter. Some of you are already subscribed to this newsletter and will continue to get emails on it.

Through my first newsletter, I also wanted to share the next evolution of the Acing AI blog, Acing Data Science Interviews.

I partnered with Johnny to come up with an amazing course to help people ace data science interviews. Everything we have learned from conducting interviews, giving interviews, writing these blogs and learning from the best people in the data science, we packaged that into this course. Think about the collective six plus years of learning condensed into a three month course. That would be Acing Data Science Interviews.

At a high level, we will cover different topics from a data science interview perspective. These include SQL, coding, probability and statistics, data analysis, machine learning algorithms, advanced machine learning, machine learning system design, deep learning, neural networks, big-data concepts and finally approaching a data science interviews. The first few topics provide the foundation aspects of data science. They are followed by the data science application topics. Collectively, all these should encompass everything that could be asked in a data science interview.

The first sessions will start in the second half of September 2019. We are aiming to have a small group of 15 people. The original course will be only 199$. We are focused on quality and would like to provide the best experience and hence, we want to keep the small group size.

Acing Data Science Interviews

Thank you for reading!

Acing Data Science Interviews was originally published in Acing AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Thanks very much for pointing out. I have updated.

Written on August 18, 2019. Posted in Susan Li.

Thanks very much for pointing out. I have updated.

Bayesian Strategy for Modeling Retail Price with PyStan

Written on August 17, 2019. Posted in Susan Li.

Statistical modeling, partial pooling, Multilevel modeling, hierarchical modeling

Pricing is a common problem faced by any e-commerce business, and one that can be addressed effectively by Bayesian statistical methods.

The Mercari Price Suggestion data set from Kaggle seems to be a good candidate for the Bayesian models I wanted to learn.

If you remember, the purpose of the data set is to build a model that automatically suggests the right price for any given product for Mercari website sellers. I am here to attempt to see whether we can solve this problem by Bayesian statistical methods, using PyStan.

And the following pricing analysis replicates the case study of home radon levels from Professor Fonnesbeck. In fact, the methodology and code were largely borrowed from his tutorial.

The Data

In this analysis, we will estimate parameters for individual product price that exist within categories. And the measured price is a function of the shipping condition (buyer pays shipping or seller pays shipping), and the overall price.

At the end, our estimate of the parameter of product price can be considered a prediction.

Simply put, the independent variables we are using are: category_name & shipping. And the dependent variable is: price.

from scipy import stats
import arviz as az
import numpy as np
import matplotlib.pyplot as plt
import pystan
import seaborn as sns
import pandas as pd
from theano import shared
from sklearn import preprocessing
plt.style.use('bmh')

df = pd.read_csv('train.tsv', sep = 't')
df = df.sample(frac=0.01, random_state=99)
df = df[pd.notnull(df['category_name'])]

df.category_name.nunique()

To make things more interesting, I will model all of these 689 product categories. If you want to produce better results quicker, you may want to model the top 10 or top 20 categories, to start.

shipping_0 = df.loc[df['shipping'] == 0, 'price']
shipping_1 = df.loc[df['shipping'] == 1, 'price']
fig, ax = plt.subplots(figsize=(10,5))
ax.hist(shipping_0, color='#8CB4E1', alpha=1.0, bins=50, range = [0, 100],
       label=0)
ax.hist(shipping_1, color='#007D00', alpha=0.7, bins=50, range = [0, 100],
       label=1)
plt.xlabel('price', fontsize=12)
plt.ylabel('frequency', fontsize=12)
plt.title('Price Distribution by Shipping Type', fontsize=15)
plt.tick_params(labelsize=12)
plt.legend()
plt.show();

“shipping = 0” means shipping fee paid by buyer, and “shipping = 1” means shipping fee paid by seller. In general, price is higher when buyer pays shipping.

Modeling

For construction of a Stan model, it is convenient to have the relevant variables as local copies — this aids readability.

category: index code for each category name
price: price
category_names: unique category names
categories: number of categories
log_price: log price
shipping: who pays shipping
category_lookup: index categories with a lookup dictionary

le = preprocessing.LabelEncoder()
df['category_code'] = le.fit_transform(df['category_name'])
category_names = df.category_name.unique()
categories = len(category_names)
category = df['category_code'].values
price = df.price
df['log_price'] = log_price = np.log(price + 0.1).values
shipping = df.shipping.values
category_lookup = dict(zip(category_names, range(len(category_names))))

We should always explore the distribution of price (log scale) in the data:

df.price.apply(lambda x: np.log(x+0.1)).hist(bins=25)
plt.title('Distribution of price (log scale)')
plt.xlabel('log (price)')
plt.ylabel('Frequency');

Conventional Approaches

There are two conventional approaches to modeling price represent the two extremes of the bias-variance tradeoff:

Complete pooling:

Treat all categories the same, and estimate a single price level, with the equation:

To specify this model in Stan, we begin by constructing the data block, which includes vectors of log-price measurements (y) and who pays shipping covariates (x), as well as the number of samples (N).

The complete pooling model is:

https://medium.com/media/51a3101eb6118ee17ae87ea25bc4edb0/href

When passing the code, data, and parameters to the Stan function, we specify sampling 2 chains of length 1000:

https://medium.com/media/c57be4f2144e862403f9dc722036fd0e/href

Inspecting the fit

Once the fit has been run, the method extract and specifying permuted=True extracts samples into a dictionary of arrays so that we can conduct visualization and summarization.

We are interested in the mean values of these estimates for parameters from the sample.

b0 = alpha = mean price across category
m0 = beta = mean variation in price with change on who pays shipping

We can now visualize how well this pooled model fits the observed data.

pooled_sample = pooled_fit.extract(permuted=True)
b0, m0 = pooled_sample['beta'].T.mean(1)

plt.scatter(df.shipping, np.log(df.price+0.1))
xvals = np.linspace(-0.2, 1.2)
plt.xticks([0, 1])
plt.plot(xvals, m0*xvals+b0, 'r--')
plt.title("Fitted model")
plt.xlabel("Shipping")
plt.ylabel("log(price)");

Observations:

The fitted line runs through the centre of the data, and it describes the trend.
However, the observed points vary widely about the fitted model, and there are multiple outliers indicating that the original price varies quite widely.
We might expect different gradients if we chose different subsets of the data.

Unpooling

When unpooling, we model price in each category independently, with the equation:

where j = 1, … , 689

The unpooled model is:

https://medium.com/media/17665e2ecce7247121bde96788c0f169/href

When running the unpooled model in Stan, We again map Python variables to those used in the Stan model, then pass the data, parameters and the model to Stan. We again specify 1000 iterations of 2 chains.

https://medium.com/media/0255b96493f2197f098e598785e6bb50/href

Inspecting the fit

To inspect the variation in predicted price at category level, we plot the mean of each estimate with its associated standard error. To structure this visually, we’ll reorder the categories such that we plot categories from the lowest to the highest.

unpooled_estimates = pd.Series(unpooled_fit['a'].mean(0), index=category_names)

order = unpooled_estimates.sort_values().index

plt.figure(figsize=(18, 6))
plt.scatter(range(len(unpooled_estimates)), unpooled_estimates[order])
for i, m, se in zip(range(len(unpooled_estimates)), unpooled_estimates[order], unpooled_se[order]):
    plt.plot([i,i], [m-se, m+se], 'b-') 
    plt.xlim(-1,690); 
plt.ylabel('Price estimate (log scale)');plt.xlabel('Ordered category');plt.title('Variation in category price estimates');

Observations:

There are multiple categories with relatively low predicted price levels, and multiple categories with a relatively high predicted price levels. Their distance can be large.
A single all-category estimate of all price level could not represent this variation well.

Comparison of pooled and unpooled estimates

We can make direct visual comparisons between pooled and unpooled estimates for all categories, and we are going to show several examples, and I purposely select some categories with many products, and other categories with very few products.

https://medium.com/media/2470950a8e5c243beeb1acebee519701/href

Let me try to explain what the above visualizations tell us:

The pooled models (red dashed line) in every category are the same, meaning all categories are modeled the same, indicating pooling is useless.
For categories with few observations, the fitted estimates track the observations very closely, suggesting that there has been overfitting. So that we can’t trust the estimates produced by models using few observations.

Multilevel and Hierarchical Models

Partial Pooling — simplest

The simplest possible partial pooling model for the e-commerce price data set is one that simply estimates prices, with no other predictors (i.e. ignoring the effect of shipping). This is a compromise between pooled (mean of all categories) and unpooled (category-level means), and approximates a weighted average (by sample size) of unpooled category estimates, and the pooled estimates, with the equation:

The simplest partial pooling model:

https://medium.com/media/66824a405e17c27e56dcb4bb92ff8830/href

Now we have two standard deviations, one describing the residual error of the observations, and another describing the variability of the category means around the average.

https://medium.com/media/73b49d1cd60ce693f146761352779f89/href

We’re interested primarily in the category-level estimates of price, so we obtain the sample estimates for “a”:

https://medium.com/media/e17f1f478ee3f3b2d6785917db1e316b/href

Observations:

There are significant differences between unpooled and partially-pooled estimates of category-level price, The partially pooled estimates looks way less extreme.

Partial Pooling — Varying Intercept

Simply put, the multilevel modeling shares strength among categories, allowing for more reasonable inference in categories with little data, with the equation:

The varying intercept model:

https://medium.com/media/7e6b2af5c74e7d7e9ee8c34e0135ff2e/href

Fitting the model:

https://medium.com/media/b1ba4fcf65194d4cc1eebad8a1dddd0a/href

There is no way to visualize all of these 689 categories together, so I will visualize 20 of them.

a_sample = pd.DataFrame(varying_intercept_fit['a'])
plt.figure(figsize=(20, 5))
g = sns.boxplot(data=a_sample.iloc[:,0:20], whis=np.inf, color="g")
# g.set_xticklabels(df.category_name.unique(), rotation=90)  # label counties
g.set_title("Estimates of log(price), by category")
g;

Observations:

There are quite some variations in prices by category, and we can see that for example, category Beauty/Fragrance/Women (index at 9) with a large number of samples (225) also has one of the tightest range of estimated values.
While category Beauty/Hair Care/Shampoo Plus Conditioner (index at 16) with the smallest number of sample (one only) also has one of the widest range of estimates.

We can visualize the distribution of parameter estimates for 𝜎 and β.

az.plot_trace(varying_intercept_fit, var_names = ['sigma_a', 'b']);

varying_intercept_fit['b'].mean()

The estimate for the shipping coefficient is approximately -0.27, which can be interpreted as products which shipping fee paid by seller at about 0.76 of (exp(−0.27)=0.76) the price of those shipping paid by buyer, after accounting for category.

Visualize the fitted model

plt.figure(figsize=(12, 6))
xvals = np.arange(2)
bp = varying_intercept_fit['a'].mean(axis=0) # mean a (intercept) by category
mp = varying_intercept_fit['b'].mean()      # mean b (slope/shipping effect)
for bi in bp:
    plt.plot(xvals, mp*xvals + bi, 'bo-', alpha=0.4)
plt.xlim(-0.1,1.1)
plt.xticks([0, 1])
plt.title('Fitted relationships by category')
plt.xlabel("shipping")
plt.ylabel("log(price)");

Observations:

It is clear from this plot that we have fitted the same shipping effect to each category, but with a different price level in each category.
There is one category with very low fitted price estimates, and several categories with relative lower fitted price estimates.
There are multiple categories with relative higher fitted price estimates.
The bulk of categories form a majority set of similar fits.

We can see whether partial pooling of category-level price estimate has provided more reasonable estimates than pooled or unpooled models, for categories with small sample sizes.

Partial Pooling — Varying Slope model

We can also build a model that allows the categories to vary according to shipping arrangement (paid by buyer or paid by seller) influences the price. With the equation:

The varying slope model:

https://medium.com/media/410f626d1b3edc5f64a95b8d5d6c5875/href

Fitting the model:

https://medium.com/media/c46f1284c4d604f1a809af03d0920c50/href

Following the process earlier, we will visualize 20 categories.

b_sample = pd.DataFrame(varying_slope_fit['b'])
plt.figure(figsize=(20, 5))
g = sns.boxplot(data=b_sample.iloc[:,0:20], whis=np.inf, color="g")
# g.set_xticklabels(df.category_name.unique(), rotation=90)  # label counties
g.set_title("Estimate of shipping effect, by category")
g;

Observations:

From the first glance, we may not see any difference between these two boxplots. But if you look deeper, you will find that the variation in median estimates between categories in varying slope model becomes smaller than those in varying intercept model, though the range of uncertainty is still greatest in the categories with fewest products, and least in the categories with the most products.

Visualize the fitted model:

plt.figure(figsize=(10, 6))
xvals = np.arange(2)
b = varying_slope_fit['a'].mean()
m = varying_slope_fit['b'].mean(axis=0)
for mi in m:
    plt.plot(xvals, mi*xvals + b, 'bo-', alpha=0.4)
plt.xlim(-0.2, 1.2)
plt.xticks([0, 1])
plt.title("Fitted relationships by category")
plt.xlabel("shipping")
plt.ylabel("log(price)");

Observations:

It is clear from this plot that we have fitted the same price level to every category, but with a different shipping effect in each category.
There are two categories with very small shipping effects, but the majority bulk of categories form a majority set of similar fits.

Partial Pooling — Varying Slope and Intercept

The most general way to allow both slope and intercept to vary by category. With the equation:

The varying slope and intercept model:

https://medium.com/media/99f0024765a13153517815289c68ed0f/href

Fitting the model:

https://medium.com/media/2b2e35474baef1997aa31d66cf469bb7/href

Visualize the fitted model:

plt.figure(figsize=(10, 6))
xvals = np.arange(2)
b = varying_intercept_slope_fit['a'].mean(axis=0)
m = varying_intercept_slope_fit['b'].mean(axis=0)
for bi,mi in zip(b,m):
    plt.plot(xvals, mi*xvals + bi, 'bo-', alpha=0.4)
plt.xlim(-0.1, 1.1);
plt.xticks([0, 1])
plt.title("fitted relationships by category")
plt.xlabel("shipping")
plt.ylabel("log(price)");

While these relationships are all very similar, we can see that by allowing both shipping effect and price to vary, we seem to be capturing more of the natural variation, compare with varying intercept model.

Contextual Effects

In some instances, having predictors at multiple levels can reveal correlation between individual-level variables and group residuals. We can account for this by including the average of the individual predictors as a covariate in the model for the group intercept.

Contextual effect model:

https://medium.com/media/7c604f58453416f772d0c991ee56a12d/href

Fitting the model:

https://medium.com/media/6b84784f1d5a7f049d18afc5e84f18dd/href

Prediction

we wanted to make a prediction for a new product in “Women/Athletic Apparel/Pants, Tights, Leggings” category, which shipping paid by seller, we just need to sample from the model with the appropriate intercept.

category_lookup['Women/Athletic Apparel/Pants, Tights, Leggings']

The prediction model:

https://medium.com/media/a65424df0289f71c90c9e6bf88374c61/href

Making the prediction:

https://medium.com/media/7e5baafd8775536b65d5be80b0650606/href

The prediction:

contextual_pred_fit.plot('y_wa');

Observations:

The mean value sampled from this fit is ≈3, so we should expect the measured price in a new product in “Women/Athletic Apparel/Pants, Tights, Leggings” category, when shipping paid by seller, to be ≈exp(3) ≈ 20.09, though the range of predicted values is rather wide.

Jupyter notebook can be found on Github. Enjoy the rest of the weekend.

References:

Blog

Learn About Our Meetup

5000+ Members

MEETUPS

JOB POSTINGS

CONTACT

Category: Toronto People

How to effectively and creatively pre-process text data

The Data

Make Recommendations

Topic Modeling

Make Recommendations

Deep learning, text classification, NLP

The Data

Text Pre-processing

Create the model

Training

Evaluate the model

Visa Data Science Interviews

In 2018, Visa had $20.61 billion in revenue.

The Data

The Process in TensorFlow

Breaking down data science interview questions by category

According to Indeed, there is a 344% increase in demand for data scientists year over year.

Statistical modeling, partial pooling, Multilevel modeling, hierarchical modeling

The Data

Modeling

Conventional Approaches

Complete pooling:

Unpooling

Multilevel and Hierarchical Models

Partial Pooling — simplest

Partial Pooling — Varying Intercept

Partial Pooling — Varying Slope model

Partial Pooling — Varying Slope and Intercept

Contextual Effects

Prediction