On the Applicability of Deep Metric Learning to Address Source Code Authorship Attribution Problem under Simulated Real-world Constraints
Abstract
Source code authorship attribution is a widely studied research topic in the information security domain. In this dissertation, we develop and evaluate models that address source code authorship attribution using deep metric learning. First, we simulate a real-world setting. Second, we train neural network models with a number of loss functions from the deep metric learning domain. Third, we evaluate these models on a benchmark and determine whether there is a quantifiable performance difference between the loss functions. Lastly, we demonstrate how our proposed methodology can be extended to address the open-world scenario. We argue that these models, and the techniques they take advantage of, are a stepping stone towards real-world source code authorship attribution that works across multiple programming languages and even in large-scale obfuscated settings.