[D] Machine Learning Approach in detecting if companies are the same

Written by torontoai on June 21, 2019. Posted in Reddit MachineLearning.

I have a very large dataset of shipment data where the company names are not normalized: names that should refer to the same company are treated as different (e.g. Walmart Inc., Walmart Incorporated, Wallmart, WalmartInc.). Simple string normalization, such as regex cleanup, would not work well on this.

I have thought of a text similarity approach (Levenshtein distance, Jaro-Winkler, etc.), which would work in theory but not well in practice. For one, you have to set a threshold, the right threshold differs for each metric, and several problems arise.

Examples:

1. Long and short company names skew the threshold: (Walmart Inc – Walmart Inc. vs ABC Co. – In behalf of ABC Group of Co.)
2. Nearly identical company names that are supposed to be different (ABC Company Thailand vs ABC Company Taiwan)

The problem with #1 is that choosing a threshold for the text similarity ratio is tricky.

The problem with #2 is that these names have a high similarity ratio but belong to different companies shipping different products (for example, ABC Thailand ships dresses while ABC Taiwan ships gadgets).
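To make the two problems concrete, here is a minimal sketch using a plain Levenshtein distance normalized by the longer string's length. The function names `levenshtein` and `similarity` and the choice of normalization are my own assumptions for illustration; other metrics and normalizations will give different numbers, but a similar pattern.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# Same company, very different lengths: scores low.
same = similarity("ABC Co.", "In behalf of ABC Group of Co.")
# Different companies, nearly identical names: scores high.
different = similarity("ABC Company Thailand", "ABC Company Taiwan")
print(f"same company: {same:.2f}, different companies: {different:.2f}")
```

With this normalization the pair that should be different scores higher than the pair that should be the same, so no single threshold on the name alone separates them.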

I have shipping data that looks like this:

company name	products	company postal address	country	zip code
ABC Company Thailand	1x dress pink	Bangkok Thailand	Thailand	11100
ABC Company Taiwan	20x Phones	Taipei, Taiwan	Taiwan	00291
Walmart California Inc.	100kgs banana	California	California	9929
In behalf of Walmart CaliforniaInc	200kgs meat	California	California	9929

I am thinking of a solution that uses text similarity metrics, but across several fields that could indicate whether two entries are the same company (such as country, zip code, even products).

My proposed solution is:

– A new entry is compared against a constructed table whose columns are distinguishing features (company name, zip code, country, for example).

– The new entry is first compared using only the company name, and the record with the highest name similarity is returned. Text similarity is then computed across the other columns of the new entry and the selected record.

– These per-field similarity ratios are fed to a classifier that decides whether the two are the same company or not. Basically, the classifier's input is the set of text similarity ratios between the new entry and its nearest company name from the table.
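The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses `difflib.SequenceMatcher` from the standard library as the similarity metric, the field names (`name`, `country`, `zip_code`) are hypothetical, and the "classifier" is a hand-weighted placeholder with made-up weights and threshold standing in for a model that would really be trained on labeled pairs.

```python
from difflib import SequenceMatcher

# Hypothetical distinguishing columns of the constructed table.
FIELDS = ("name", "country", "zip_code")

# Placeholder for a trained classifier: these numbers are assumptions.
HYPOTHETICAL_WEIGHTS = (0.5, 0.25, 0.25)
HYPOTHETICAL_THRESHOLD = 0.8


def ratio(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def nearest_by_name(entry: dict, known: list) -> dict:
    """Step 2a: pick the known record whose company name is most similar."""
    return max(known, key=lambda rec: ratio(entry["name"], rec["name"]))


def field_similarities(entry: dict, candidate: dict) -> list:
    """Step 2b: one similarity ratio per distinguishing column."""
    return [ratio(entry[f], candidate[f]) for f in FIELDS]


def score(features: list) -> float:
    """Weighted sum of the per-field ratios (stand-in for a model's output)."""
    return sum(w * x for w, x in zip(HYPOTHETICAL_WEIGHTS, features))


def is_same_company(features: list) -> bool:
    """Step 3: stand-in decision rule; a real classifier would replace this."""
    return score(features) >= HYPOTHETICAL_THRESHOLD


def resolve(entry: dict, known: list) -> dict:
    """Return the matching known record, or register the entry as new."""
    if known:
        candidate = nearest_by_name(entry, known)
        if is_same_company(field_similarities(entry, candidate)):
            return candidate
    known.append(entry)
    return entry


known = [
    {"name": "ABC Company Thailand", "country": "Thailand", "zip_code": "11100"},
    {"name": "Walmart California Inc.", "country": "California", "zip_code": "9929"},
]
new_entry = {"name": "In behalf of Walmart CaliforniaInc",
             "country": "California", "zip_code": "9929"}
print(resolve(new_entry, known)["name"])
```

Because `resolve` appends unmatched entries to the table, the same loop can be run once over the existing data and then reused for each new entry as it arrives.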

Is there an easier approach? The approach should handle both the existing large dataset and new entries (for example, deduplication does not seem to handle the addition of new entries). Thanks!

submitted by /u/sarmientoj24