First Language Attribution without Genre Specificity (FLAGS)

D. Guarrera, T. Mckinnon
Quantitative Scientific Solutions, LLC, United States

Keywords: natural language processing, linguistic forensics, linguistic fingerprinting, native language attribution

Quantitative Scientific Solutions (“QS-2”) is developing a system for First Language Attribution without Genre Specificity (FLAGS). Powered by natural language processing algorithms, FLAGS integrates with existing analyst tools to generate a linguistic profile of anonymously authored text, enabling analysts to identify and understand cyber-threats and online manipulation from a small text sample. Our current prototype, a metaclassifier built by stacking primitive machine learning models, identifies an author's native language from English text with ~80% accuracy on a relatively small corpus of samples drawn from TOEFL essays. For comparison, a baseline model that randomly assigns each text to one of the eleven native languages in the corpus achieves less than 10% accuracy. This work demonstrates that high-accuracy identification of linguistic communities is possible.
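
As a rough check on the baseline figure, a uniform random guesser over eleven classes is expected to be correct 1/11 of the time (about 9.1%), which is consistent with the reported "less than 10%". The sketch below illustrates this arithmetic with a small simulation. It is not the authors' baseline implementation: the label names, the synthetic balanced label list, and the helper functions `chance_accuracy` and `simulate_random_baseline` are all assumptions made for illustration, and no real corpus data is used.

```python
import random

# Placeholder labels (assumption): the abstract states only that there are
# eleven native languages in the corpus, not which ones.
LANGUAGES = [f"L1_{i:02d}" for i in range(11)]


def chance_accuracy(n_classes: int) -> float:
    """Expected accuracy of a guesser that picks uniformly among n_classes."""
    return 1.0 / n_classes


def simulate_random_baseline(true_labels, classes, seed=0):
    """Assign each text a uniformly random label and return the hit rate."""
    rng = random.Random(seed)
    hits = sum(rng.choice(classes) == label for label in true_labels)
    return hits / len(true_labels)


# Synthetic, balanced stand-in for the gold labels (assumption, not corpus data).
true_labels = [LANGUAGES[i % len(LANGUAGES)] for i in range(11_000)]

expected = chance_accuracy(len(LANGUAGES))
observed = simulate_random_baseline(true_labels, LANGUAGES)
print(f"expected chance accuracy:  {expected:.4f}")
print(f"simulated random baseline: {observed:.4f}")
```

Because the guesses are uniform, the expected value of 1/11 holds regardless of how the true labels are distributed, so the ~80% figure sits well above chance under either a balanced or an unbalanced corpus.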