soweego: an entity resolution system for large-scale databases
I’d like to share some insights about a Wikimedia Foundation project I’ve been contributing to.
Specifically, to classify candidate links between Wikidata items and external catalog records, we leveraged Bernoulli Naïve Bayes, Linear Support Vector Machines, Single-layer Perceptrons, and Multi-layer Perceptrons. An interesting finding: Single-layer Perceptron models performed best on our input datasets, namely Discogs, IMDb, and MusicBrainz.
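To make the comparison concrete, here is a minimal sketch of how these four classifier families can be benchmarked with scikit-learn. The data is synthetic: the binary feature vectors stand in for pairwise record-comparison features (e.g. name similarity, date match), and none of this reflects soweego's actual pipeline or hyperparameters.

```python
# Hedged sketch: compare the four model families on synthetic
# binary match/non-match features. Not soweego's real setup.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in for record-pair comparison features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X = (X > 0).astype(int)  # binarize: Bernoulli NB expects 0/1 features

models = {
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Linear SVM": LinearSVC(),
    "Single-layer Perceptron": Perceptron(),
    "Multi-layer Perceptron": MLPClassifier(hidden_layer_sizes=(16,),
                                            max_iter=500, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```

In a real linkage setting the features would come from comparing record fields (names, dates, identifiers), and the F1 scores would drive the choice of production classifier.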
soweego partners with Mix’n’match, which mainly deals with small catalogs. soweego is currently uploading 255 k confident identifiers to Wikidata (see its activity), while 126 k medium-confidence links are being sent to Mix’n’match for curation.
The soweego team has also worked hard to address the following community requests:
- keep Wikidata in sync with external databases and check them to spot inconsistencies in Wikidata;
- import new databases with reasonable effort.
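The sync-and-check idea boils down to comparing two identifier mappings. Here is a toy sketch with hypothetical data (the QIDs, identifiers, and variable names are made up and do not come from soweego's API): links present on only one side are candidates for import, and disagreements are inconsistencies to curate.

```python
# Hypothetical mappings: Wikidata QID -> external catalog identifier.
wikidata_links = {"Q1": "mb-111", "Q2": "mb-222", "Q3": "mb-999"}
catalog_links = {"Q1": "mb-111", "Q2": "mb-333", "Q4": "mb-444"}

# Items the catalog knows about but Wikidata does not, and vice versa.
missing_in_wikidata = set(catalog_links) - set(wikidata_links)
missing_in_catalog = set(wikidata_links) - set(catalog_links)

# Items linked on both sides but to different identifiers.
conflicts = {
    qid: (wikidata_links[qid], catalog_links[qid])
    for qid in set(wikidata_links) & set(catalog_links)
    if wikidata_links[qid] != catalog_links[qid]
}

print("to import into Wikidata:", missing_in_wikidata)  # {'Q4'}
print("to push to the catalog:", missing_in_catalog)    # {'Q3'}
print("inconsistencies to curate:", conflicts)          # {'Q2': ('mb-222', 'mb-333')}
```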
If you like the project, please consider starring it on GitHub: https://github.com/Wikidata/soweego