[Research] Universal Adversarial Triggers for Attacking and Analyzing NLP (EMNLP 2019)
Hi, I am one of the authors of this EMNLP 2019 paper.
We create Universal Adversarial Triggers: phrases that cause a specific model prediction when concatenated to 𝘢𝘯𝘺 input. For example, triggers cause:
– GPT-2 to generate racist text
– SQuAD models to predict “to kill american people” for 72% of “why” questions
– Text classifier accuracy to drop from 90% to 1%
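
To make the "concatenated to any input" part concrete, here is a minimal sketch of how a universal trigger is applied and how its effect on accuracy could be measured. Everything in it is an assumption for illustration: `toy_model`, the word lists, the examples, and the hand-picked trigger are hypothetical and not from the paper. In particular, the paper searches for its triggers, which this sketch does not attempt; it only shows the same fixed phrase being prepended to every input.

```python
from typing import Callable, List, Tuple

# Hypothetical word lists for a toy sentiment model (illustration only).
POSITIVE_WORDS = {"good", "great", "fun"}
NEGATIVE_WORDS = {"bad", "awful", "boring"}


def toy_model(text: str) -> str:
    """Toy bag-of-words classifier: counts positive minus negative words."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(
        t in NEGATIVE_WORDS for t in tokens
    )
    return "positive" if score > 0 else "negative"


def apply_trigger(trigger: str, text: str) -> str:
    """Prepend the same trigger phrase to an input, regardless of its content."""
    return f"{trigger} {text}"


def accuracy(
    model: Callable[[str], str],
    examples: List[Tuple[str, str]],
    trigger: str = "",
) -> float:
    """Fraction of examples classified correctly, optionally with a trigger."""
    correct = 0
    for text, label in examples:
        model_input = apply_trigger(trigger, text) if trigger else text
        correct += int(model(model_input) == label)
    return correct / len(examples)


# Hypothetical evaluation set: all examples are labeled positive.
examples = [
    ("a great and fun movie", "positive"),
    ("good acting , great plot", "positive"),
    ("fun from start to finish", "positive"),
]

# Hand-picked trigger for the toy model (an assumption, not a searched trigger).
trigger = "awful awful boring"

clean_acc = accuracy(toy_model, examples)
attacked_acc = accuracy(toy_model, examples, trigger)
print(f"clean accuracy: {clean_acc:.2f}, with trigger: {attacked_acc:.2f}")
```

The point of the sketch is that the trigger is input-agnostic: one fixed phrase is reused across the whole evaluation set, and the attack is judged by the accuracy drop over all examples rather than on a single input.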