[Project] Model Based Byte Pair Encoding
Summary : Idea to generate Byte pair encodings, not based on frequency in the dataset, but on the quality of the prediction of our model. This enables us to predict multi word tokens like “New York” and address languages that don’t use spaces to split words.
Author here : I’m not a researcher, and could not find any paper related to that idea, if you know about any research in that direction please let me know. Or any comments on the post.