[D] Machine learning approach to detecting whether company names refer to the same company

I have a very large dataset of shipment data in which the company names are not normalized: records that should refer to the same company are treated as different ones (e.g. Walmart Inc., Walmart Incorporated, Wallmart, WalmartInc.). Simple string normalization with regexes would not work well here.
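As a rough illustration of why regex normalization alone falls short, here is a minimal sketch. The suffix list and the `normalize` helper are hypothetical and only cover the Walmart examples above; a real list of legal suffixes would be much longer.

```python
import re

# Hypothetical, deliberately short list of legal suffixes to strip.
SUFFIX_RE = re.compile(r"\b(?:incorporated|inc|corp|co|ltd|llc)$")

def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and strip one trailing legal suffix."""
    s = name.lower()
    s = re.sub(r"[^a-z0-9 ]", " ", s).strip()   # punctuation -> spaces
    s = SUFFIX_RE.sub("", s)                    # only matches a standalone suffix word
    return re.sub(r"\s+", " ", s).strip()       # collapse leftover whitespace

names = ["Walmart Inc.", "Walmart Incorporated", "Wallmart", "WalmartInc."]
for n in names:
    print(repr(n), "->", repr(normalize(n)))
```

The first two names collapse to the same key, but the misspelling ("Wallmart") and the glued suffix ("WalmartInc.") survive as separate keys. Loosening the pattern to catch glued suffixes would start truncating legitimate names that merely end in those letters.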

I have thought about a text similarity approach (Levenshtein distance, Jaro-Winkler, etc.), which would work in theory but not well in practice. The main issue is that you have to set a threshold, the right threshold differs for each metric, and several problems arise:


  1. Long and short company names skew the threshold: compare (Walmart Inc vs Walmart Inc.) with (ABC Co. vs In behalf of ABC Group of Co.)
  2. Nearly identical company names that are actually different companies (ABC Company Thailand vs ABC Company Taiwan)

The problem in #1 is that choosing a single threshold for the similarity ratio is tricky, because the ratio depends heavily on the length of the names.

The problem in #2 is that these names have a high similarity ratio, yet they belong to different companies shipping different products (for example, ABC Thailand ships dresses while ABC Taiwan ships gadgets).
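Both problems can be reproduced with a few lines. This sketch uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for Levenshtein or Jaro-Winkler (it is a different metric, but it shows the same thresholding issue); the `similarity` helper is my own naming.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Problem 1: same company, but the long prefix drags the ratio down.
same_company = similarity("ABC Co.", "In behalf of ABC Group of Co.")

# Problem 2: different companies, but the names are almost identical.
different_companies = similarity("ABC Company Thailand", "ABC Company Taiwan")

print(f"same company:        {same_company:.2f}")
print(f"different companies: {different_companies:.2f}")
```

With this metric the pair that should be merged scores lower than the pair that should stay separate, so no single threshold on the name ratio separates the two cases.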

I have shipping data that looks like this:

| company name | products | company postal address | country | zip code |
|---|---|---|---|---|
| ABC Company Thailand | 1x dress pink | Bangkok Thailand | Thailand | 11100 |
| ABC Company Taiwan | 20x Phones | Taipei, Taiwan | Taiwan | 00291 |
| Walmart California Inc. | 100kgs banana | California | California | 9929 |
| In behalf of Walmart CaliforniaInc | 200kgs meat | California | California | 9929 |

I am thinking of a solution that applies text similarity metrics not only to the company name but across several fields that could indicate whether two records are the same company (such as country, zip code, or even products).
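A minimal sketch of the per-field comparison, using the rows from the sample data. The column keys (`company_name`, `address`, `country`, `zip_code`) and the helper names are assumptions for illustration, and `difflib.SequenceMatcher` again stands in for whichever similarity metric is actually chosen.

```python
from difflib import SequenceMatcher

FIELDS = ("company_name", "address", "country", "zip_code")  # assumed column keys

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def field_similarities(rec_a: dict, rec_b: dict) -> dict:
    """One similarity ratio per field instead of a single name score."""
    return {f: sim(rec_a[f], rec_b[f]) for f in FIELDS}

abc_th = {"company_name": "ABC Company Thailand", "address": "Bangkok Thailand",
          "country": "Thailand", "zip_code": "11100"}
abc_tw = {"company_name": "ABC Company Taiwan", "address": "Taipei, Taiwan",
          "country": "Taiwan", "zip_code": "00291"}
wal_a = {"company_name": "Walmart California Inc.", "address": "California",
         "country": "California", "zip_code": "9929"}
wal_b = {"company_name": "In behalf of Walmart CaliforniaInc", "address": "California",
         "country": "California", "zip_code": "9929"}

abc_feats = field_similarities(abc_th, abc_tw)
wal_feats = field_similarities(wal_a, wal_b)
print("ABC pair:    ", {k: round(v, 2) for k, v in abc_feats.items()})
print("Walmart pair:", {k: round(v, 2) for k, v in wal_feats.items()})
```

On name alone the ABC pair looks more alike than the Walmart pair, but the Walmart pair agrees exactly on address, country, and zip code while the ABC pair does not. That extra signal is what the other fields are meant to contribute.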

My proposed solution is:

– A new entry is compared against a constructed reference table whose columns are the distinguishing features (for example company name, zip code, country).

– The lookup itself uses only the company name, and the row with the highest name similarity is returned. Text similarity is then computed column by column between the new entry and that selected row.

– These similarity ratios are fed to a classifier that decides whether the two records are the same company. In other words, the classifier's input is the set of similarity ratios between the new entry and its nearest company name in the table.
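The three steps above could be sketched roughly as follows. Everything here is an assumption for illustration: the column keys, the helper names, and especially `placeholder_classifier`, which is a hand-written weighted rule standing in for a model that would really be trained on labeled pairs (for example a logistic regression). Appending unmatched entries to the table is my addition, to show how new entries could be handled.

```python
from difflib import SequenceMatcher

FIELDS = ("company_name", "country", "zip_code")  # assumed column keys

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def nearest_by_name(entry: dict, table: list) -> dict:
    """Step 2: pick the reference row whose company name is most similar."""
    return max(table, key=lambda row: sim(entry["company_name"], row["company_name"]))

def features(entry: dict, candidate: dict) -> list:
    """Per-column similarity ratios between the entry and the selected row."""
    return [sim(entry[f], candidate[f]) for f in FIELDS]

def placeholder_classifier(x: list) -> bool:
    """Stand-in for a trained model; weights and cutoff are arbitrary."""
    name, country, zip_code = x
    return 0.5 * name + 0.25 * country + 0.25 * zip_code >= 0.8

def resolve(entry: dict, table: list, classifier=placeholder_classifier) -> dict:
    """Return the matching reference row, or add the entry as a new company."""
    if table:
        candidate = nearest_by_name(entry, table)
        if classifier(features(entry, candidate)):
            return candidate
    table.append(entry)  # no match: the entry becomes a new reference row
    return entry

table = [
    {"company_name": "ABC Company Thailand", "country": "Thailand", "zip_code": "11100"},
    {"company_name": "Walmart California Inc.", "country": "California", "zip_code": "9929"},
]
new_entry = {"company_name": "In behalf of Walmart CaliforniaInc",
             "country": "California", "zip_code": "9929"}
print(resolve(new_entry, table)["company_name"])
```

Because `resolve` takes the classifier as a parameter, the placeholder rule can be swapped for a trained model without touching the lookup or feature code. Note that comparing a new entry against every row is linear in the table size, so a very large table would need some form of candidate pre-filtering first.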

Is there an easier approach? It should handle both the existing large dataset and new entries (deduplication alone, for example, does not seem to cover the addition of new entries). Thanks!

submitted by /u/sarmientoj24