[P] Simple and effective phrase finding across multiple languages?
Dealing with out-of-vocabulary words or phrases has long been a problem in NLP, and using deep learning for it sometimes costs too much.
Maybe we can try a simple statistical approach first: finding potential phrases based on word boundaries.
There is a drop in frequency at the boundary of a phrase in a sentence. For example, take one sentence from "Attention Is All You Need":
…multi-head attention in three different ways…
multi-head: frequency 10
multi-head attention: frequency 8
multi-head attention in: frequency 1 <- drop!
multi-head attention in three: frequency 1
Capturing this drop can give us some potential phrases, so I created a library to help with this.
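A rough sketch of the frequency-drop idea in plain Python, to make the example above concrete. This is not the actual Phraseg code: the function names and the `min_freq` and `drop_ratio` thresholds are assumptions made up for illustration, and the toy token list is built only to reproduce the counts from the example (10, 8, 1).

```python
from collections import Counter

def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def candidate_phrases(tokens, max_n=4, min_freq=2, drop_ratio=0.5):
    """Return n-grams (n >= 2) whose frequency drops sharply
    when they are extended by one more word to the right.

    min_freq and drop_ratio are hypothetical thresholds, not Phraseg's.
    """
    counts = ngram_counts(tokens, max_n + 1)

    # For each n-gram, remember the highest frequency of any
    # one-word extension to the right.
    best_extension = Counter()
    for gram, freq in counts.items():
        if len(gram) >= 2:
            prefix = gram[:-1]
            best_extension[prefix] = max(best_extension[prefix], freq)

    phrases = []
    for gram, freq in counts.items():
        if 2 <= len(gram) <= max_n and freq >= min_freq:
            # A "drop": no extension keeps more than drop_ratio of the frequency.
            if best_extension[gram] <= freq * drop_ratio:
                phrases.append((" ".join(gram), freq))
    return sorted(phrases, key=lambda pair: -pair[1])

# Toy corpus: "multi-head attention" appears 8 times with a different
# next word each time, and "multi-head" appears 2 more times on its own.
tokens = []
for follower in ["in", "is", "can", "uses", "has", "was", "with", "for"]:
    tokens += ["multi-head", "attention", follower]
tokens += ["multi-head", "layers", "multi-head", "model"]

result = candidate_phrases(tokens)
print(result)  # [('multi-head attention', 8)]
```

Here "multi-head attention" is kept because every three-word extension of it occurs only once, while the phrase itself occurs 8 times. A real implementation would also need proper tokenization for each language.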
GitHub project – Phraseg
phraseg = Phraseg(''' The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU , ByteNet  and ConvS2S , all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions . In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2. Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22]. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence- aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks . To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence- aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and . ''')
result = phraseg.extract()
The result will be:
[('the Transformer', 3), ('of the', 2), ('ConvS 2 S', 2), ('input and output', 2), ('output positions', 2), ('number of operations', 2), ('In the', 2), ('attention mechanism', 2), ('to compute', 2)]
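Some of these results, such as 'of the' and 'In the', are just function words. One possible post-processing step, sketched here with plain Python on the list above, is to drop phrases made up entirely of stop words. The stop-word set and the helper name are hypothetical and not part of Phraseg.

```python
# Result copied from the output above: a list of (phrase, frequency) tuples.
result = [('the Transformer', 3), ('of the', 2), ('ConvS 2 S', 2),
          ('input and output', 2), ('output positions', 2),
          ('number of operations', 2), ('In the', 2),
          ('attention mechanism', 2), ('to compute', 2)]

# Hypothetical, deliberately tiny stop-word list; extend it per language.
STOP_WORDS = {"the", "of", "in", "to", "and", "a", "an"}

def drop_stopword_only(pairs):
    """Keep phrases that contain at least one non-stop word."""
    kept = []
    for phrase, freq in pairs:
        words = phrase.lower().split()
        if all(word in STOP_WORDS for word in words):
            continue  # e.g. 'of the', 'In the'
        kept.append((phrase, freq))
    return kept

filtered = drop_stopword_only(result)
print(filtered)
```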
We can also use this to explore the daily trending GitHub repos:
Details on how it works: