[P] Using protein sequences to make better classifiers in bioinformatics
As a data scientist in bioinformatics, I have often found it useful to add features describing proteins to my models. These features were usually manually engineered or based on heuristics and alignments, and they carried little information about the structure of the protein, because structural data is relatively sparse.
Recently I found a paper by Bepler and Berger, published at ICLR 2019, in which they built a set of models that use weak supervision to create protein embeddings. In this blog post I look at the theory behind the paper and present an intermediate-level tutorial for people who want to include these embeddings in their own models. I also include a comprehensive analysis of the predictive power of these embeddings.