[P] entity resolution system for large-scale databases
Hello everyone,
I’d like to share some insights about a Wikimedia Foundation project I’ve been contributing to.
soweego is an entity resolution system that links the Wikidata knowledge base to large external databases through a set of supervised algorithms: https://soweego.readthedocs.io/
Specifically, we used Bernoulli Naïve Bayes, linear Support Vector Machines, single-layer perceptrons, and multi-layer perceptrons. Interestingly, the single-layer perceptron models work best on our input datasets, namely Discogs, IMDb, and MusicBrainz.
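For anyone unfamiliar with the idea, here is a toy sketch of a single-layer perceptron that classifies record pairs as match or non-match. It is not soweego's code: the three binary features, the training pairs, and the function names are all invented for illustration, and it uses the classic perceptron update rule in plain Python rather than any particular library.

```python
# Toy single-layer perceptron for pairwise record matching.
# ASSUMPTION: features, data, and names are hypothetical, not taken from soweego.


def predict(weights, bias, features):
    """Return 1 (match) if the weighted sum is positive, else 0 (non-match)."""
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Fit weights and bias with the classic perceptron update rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            # error is +1, 0, or -1; weights move only on misclassified pairs
            error = label - predict(weights, bias, features)
            bias += learning_rate * error
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights, bias


# Each vector compares one Wikidata item with one catalog record:
# [name_match, birth_year_match, shared_url] (hypothetical features)
TRAIN_X = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
TRAIN_Y = [1, 1, 1, 0, 0, 0]  # 1 = same entity, 0 = different entities

WEIGHTS, BIAS = train_perceptron(TRAIN_X, TRAIN_Y)
```

The toy data is linearly separable, so the update rule converges and the trained model reproduces the training labels. Real pair features would typically be similarity scores rather than 0/1 flags.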
soweego partners with Mix’n’match, which mainly deals with small catalogs. soweego is currently uploading 255 k confident identifiers to Wikidata (see its activity), while 126 k medium-confident links go to Mix’n’match for curation instead.
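The split between confident and medium-confident links can be pictured as a simple threshold rule on the classifier score. The sketch below is only an illustration: the threshold values, the destination labels, and the function name are assumptions, not soweego's actual settings.

```python
# Route a predicted link by its confidence score.
# ASSUMPTION: thresholds and destination labels are hypothetical.


def route_link(score, upload_threshold=0.8, curation_threshold=0.5):
    """Return where a link with the given confidence score should go."""
    if score >= upload_threshold:
        return "wikidata"      # confident: upload directly
    if score >= curation_threshold:
        return "mixnmatch"     # medium-confident: send for human curation
    return "discard"           # too uncertain to propose


ROUTED = [route_link(s) for s in (0.95, 0.6, 0.2)]
```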
The soweego team has also worked hard to address the following community requests:
- sync Wikidata to external databases and check them to spot inconsistencies in Wikidata;
- import new databases with reasonable effort.
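As a rough idea of what the sync check in the first request could look like, here is a minimal sketch that flags Wikidata items whose stored identifier no longer appears in a catalog dump. The data structures, the placeholder item and catalog IDs, and the function name are hypothetical and not how soweego implements it.

```python
# Flag identifiers stored in Wikidata that are absent from the external catalog.
# ASSUMPTION: inputs are simplified stand-ins; all IDs below are placeholders.


def find_invalid_identifiers(wikidata_links, catalog_ids):
    """Return {item: identifier} for identifiers missing from the catalog."""
    return {
        item: ext_id
        for item, ext_id in wikidata_links.items()
        if ext_id not in catalog_ids
    }


# Identifiers currently stored on Wikidata items (placeholder values)
WIKIDATA_LINKS = {"Q1": "artist/100", "Q2": "artist/200", "Q3": "artist/300"}
# Identifiers present in the latest catalog dump (placeholder values)
CATALOG_IDS = {"artist/100", "artist/300", "artist/400"}

INVALID = find_invalid_identifiers(WIKIDATA_LINKS, CATALOG_IDS)
```

Here `INVALID` holds the one link whose target is missing from the dump, which is the kind of inconsistency a curator would want to review.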
If you like the project, please consider starring it on GitHub: https://github.com/Wikidata/soweego
submitted by /u/tupini07