[D] Machine learning approach for detecting whether company names refer to the same company
I have a very large dataset of shipment data in which company names are not normalized: names that should refer to the same company are treated as different ones (e.g. Walmart Inc., Walmart Incorporated, Wallmart, WalmartInc.). Simple string normalization, such as regex rules, would not work well here.
I have considered a text similarity approach (Levenshtein distance, Jaro-Winkler, etc.), which would work in theory but not well in practice. The main issue is that you have to set a threshold, the right threshold differs from metric to metric, and a fixed threshold runs into problems such as:
1. Long and short company names skew the threshold (Walmart Inc / Walmart Inc. vs. ABC Co. / In behalf of ABC Group of Co.).
2. Nearly identical company names that are supposed to be different (ABC Company Thailand vs. ABC Company Taiwan).

The problem with #1 is that choosing a threshold for the text similarity ratio is tricky.
The problem with #2 is that these names have a high similarity ratio but belong to different companies shipping different products (for example, ABC Thailand ships dresses while ABC Taiwan ships gadgets).
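To make the two problems concrete, here is a minimal sketch. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for Levenshtein or Jaro-Winkler (that substitution is my assumption; exact scores will differ by metric). The helper name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio after lowercasing and trimming whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


# Pairs taken from the examples above.
same_near_identical = similarity("Walmart Inc", "Walmart Inc.")
same_long_vs_short = similarity("ABC Co.", "In behalf of ABC Group of Co.")
different_companies = similarity("ABC Company Thailand", "ABC Company Taiwan")

for label, score in [
    ("same company, near-identical names", same_near_identical),
    ("same company, long vs. short name", same_long_vs_short),
    ("different companies, similar names", different_companies),
]:
    print(f"{label}: {score:.2f}")
```

With this metric the Thailand/Taiwan pair scores higher than the ABC Co. pair, so no single cutoff on the name alone can accept the second pair while rejecting the third.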
I have shipping data that looks like this:

| company name | products | company postal address | country | zip code |
|---|---|---|---|---|
| ABC Company Thailand | 1x dress pink | Bangkok Thailand | Thailand | 11100 |
| ABC Company Taiwan | 20x Phones | Taipei, Taiwan | Taiwan | 00291 |
| Walmart California Inc. | 100kgs banana | California | California | 9929 |
| In behalf of Walmart CaliforniaInc | 200kgs meat | California | California | 9929 |
I am thinking of a solution that applies text similarity metrics not only to the company name but also to other fields that could indicate whether two records belong to the same company (such as country, zip code, or even products).
My proposed solution is:
- A new entry is compared against a constructed table whose columns are the distinguishing features (for example company name, zip code, and country).
- The new entry is first matched using the company name only, and the row with the highest name similarity is returned. Text similarity is then computed between the new entry and that row for each of the other columns.
- These similarity ratios are fed to a classifier that decides whether the two records are the same company or not. In other words, the classifier's input is the set of text similarity ratios between the new entry and the nearest company name in the table.
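A rough sketch of the lookup and feature-construction steps, under some assumptions: `difflib.SequenceMatcher` stands in for whichever similarity metric is chosen, the reference table is a plain list of dicts built from the sample rows above, and the names `REFERENCE`, `FIELDS`, `best_candidate`, and `feature_vector` are all hypothetical. The classifier itself is left out; any binary classifier trained on labeled pairs could take the resulting vector as input.

```python
from difflib import SequenceMatcher

# Hypothetical reference table of known companies (rows from the sample data).
REFERENCE = [
    {"company name": "ABC Company Thailand", "country": "Thailand", "zip code": "11100"},
    {"company name": "ABC Company Taiwan", "country": "Taiwan", "zip code": "00291"},
    {"company name": "Walmart California Inc.", "country": "California", "zip code": "9929"},
]
# Distinguishing columns; the order here is the order of the feature vector.
FIELDS = ["company name", "country", "zip code"]


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio after lowercasing and trimming whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def best_candidate(entry: dict, reference: list) -> dict:
    """Pick the reference row whose company name is most similar to the entry's."""
    return max(
        reference,
        key=lambda row: similarity(entry["company name"], row["company name"]),
    )


def feature_vector(entry: dict, candidate: dict, fields=FIELDS) -> list:
    """One similarity score per field; this list is what the classifier would see."""
    return [similarity(entry[f], candidate[f]) for f in fields]


new_entry = {
    "company name": "In behalf of Walmart CaliforniaInc",
    "country": "California",
    "zip code": "9929",
}
candidate = best_candidate(new_entry, REFERENCE)
features = feature_vector(new_entry, candidate)
print(candidate["company name"], features)
```

Note that a linear scan over the whole table for every new entry will not scale to a very large dataset; some form of blocking (for instance, only comparing within the same country or zip code) would probably be needed in practice.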
Is there an easier approach? It should handle both the existing large dataset and new entries (for example, one-off deduplication does not seem to handle the addition of new entries). Thanks!