Some programmers don’t follow any code standards. If you know their style, you can usually spot them from a code snippet they’ve written. A lack of indentation, a comment on every line, and curly braces kept on the same line are all signatures that programmers leave behind. You probably have a signature of your own that makes your code snippets recognizable to a discerning eye.
Even without the help of AI, it is possible to pick out a coder whose style is known from a small group of people. The more interesting question is whether identities can be distinguished among programmers who do follow conventions. Does clean, elegant, seemingly expressionless code still hide bits of the programmer’s personality?
Researchers at Princeton and Drexel University say ‘Yes.’ They have also been able to identify authors with startling accuracy.
Your Code Snippet is Almost Like a Digital Fingerprint
A team at Princeton recently conducted a study using 100,000 contest submissions to Google Code Jam. They collected code written by different authors trying to solve the same set of problems, and split the large sample into datasets of 250. Then they ran their method, which is based on machine learning, on the datasets. The method has two steps. In the first, every input program is converted into a vector of numerical features. In the second, a classifier learns the patterns in each programmer’s vectors and then assigns an author to a new vector that it hasn’t seen before.
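The two-step idea can be sketched in a few lines of Python. This is only an illustration, not the researchers' implementation: the feature list, the function names, and the nearest-centroid classifier are all assumptions made to keep the example short, and the article does not say which classifier the team used.

```python
import math
from collections import defaultdict

# Hypothetical feature set: how often a few characters and tokens appear,
# relative to the length of the snippet.
FEATURES = ["\t", " ", "{", "for", "while", "//"]

def to_vector(source):
    """Step 1: turn a source snippet into a fixed-length numeric vector."""
    length = max(len(source), 1)
    return [source.count(feature) / length for feature in FEATURES]

def train(samples):
    """Step 2a: learn one centroid (mean vector) per author."""
    grouped = defaultdict(list)
    for author, source in samples:
        grouped[author].append(to_vector(source))
    return {
        author: [sum(column) / len(vectors) for column in zip(*vectors)]
        for author, vectors in grouped.items()
    }

def predict(model, source):
    """Step 2b: assign an unseen snippet to the closest author centroid."""
    vector = to_vector(source)
    return min(model, key=lambda author: math.dist(vector, model[author]))

# Toy training data for two made-up authors with different habits.
samples = [
    ("alice", "for (int i = 0; i < n; i++)\n{\n\tsum += i;\n}\n"),
    ("alice", "for (auto x : v)\n{\n\tcout << x;\n}\n"),
    ("bob", "int i = 0; // counter\nwhile (i < n) {\n    sum += i; // add\n    i++;\n}\n"),
    ("bob", "while (!q.empty()) { // drain\n    q.pop();\n}\n"),
]
model = train(samples)
unseen = "for (int j = 0; j < m; j++)\n{\n\ttotal += j;\n}\n"
print(predict(model, unseen))
```

With real data you would want far more features and samples per author, and a stronger classifier than a nearest-centroid rule.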
The highlight of this study is the level of detail the team achieved in the features that identified each code author. The team extracted features from layout and lexical attributes in the source code. They also extracted syntactic features – in other words, the unique language and grammar of your code – from abstract syntax trees.
Earlier results had been striking. When the team applied their method to a dataset of 250, they could link an anonymous program to its author with 95 percent accuracy.
In the case of the Google Code Jam study, the Princeton team’s algorithm was able to identify anonymous C++ code from 100,000 authors with 20 percent precision. That is a big number when you think about it: picking one author out of 100,000 at random would be right only about 0.001 percent of the time.
According to Drexel University’s Aylin Caliskan-Islam, every coder has a unique style, just like artists and writers. Her team, along with contributors from Princeton, Germany’s University of Göttingen, and the University of Maryland, developed the algorithm used in the study.
What Identifies Your Code Snippet?
If the Princeton and Drexel team’s algorithm were to come across your code in the mix, one of the things it would look at is quirks in the layout. The pattern of whitespace you use can be a giveaway. Do you leave a blank line after every statement? Do you like to space out your code and keep it clean? Lexical attributes can also be identifiers. You may tend to use a few tokens more often than others, and a token count may bring the AI one step closer to linking the patterns it sees in your code to new code that you’ve written.
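To make layout and lexical features concrete, here is a small Python sketch. The feature names and the crude regular-expression tokenizer are assumptions chosen for illustration; the article does not list the exact features the study measured.

```python
import re
from collections import Counter

def layout_features(source):
    """Measure a few whitespace and brace-placement habits."""
    lines = source.splitlines()
    total_lines = max(len(lines), 1)
    total_chars = max(len(source), 1)
    return {
        "blank_line_ratio": sum(1 for line in lines if not line.strip()) / total_lines,
        "whitespace_ratio": sum(1 for ch in source if ch.isspace()) / total_chars,
        "tab_indented_ratio": sum(1 for line in lines if line.startswith("\t")) / total_lines,
        "brace_on_own_line_ratio": sum(1 for line in lines if line.strip() == "{") / total_lines,
    }

def token_counts(source):
    """Count words, numbers, and single punctuation characters."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|[^\w\s]", source))

snippet = "int main()\n{\n\n\treturn 0;\n}\n"
print(layout_features(snippet))
print(token_counts(snippet).most_common(3))
```

Two programmers solving the same problem will often produce noticeably different numbers here, even when their logic is identical.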
Syntax tree analysis can parse the code that you write and look for elements and patterns that go beyond your surface writing style.
The word ‘abstract’ tells you that this analysis is not about the phrases or variable names that you use in your code. Even if you change the variable names, spacing, or comments, the abstract set of features and patterns the algorithm identifies will not change. In other words, these features can be your unique digital fingerprint.
This tree analysis is a complicated process that breaks the code down into a tree of nested syntactic elements, layer by layer.
It looks at details like the depths at which you nest functions in the code, the order in which you place commands, and similar features.
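A rough feel for this can be had with Python's built-in `ast` module. The study looked at C++ code, so this is a stand-in in a different language, and the two features computed here (maximum tree depth and node-type counts) are assumptions picked for illustration. The sketch shows that renaming variables and adding comments leaves the tree-based features unchanged, while restructuring the logic does not.

```python
import ast
from collections import Counter

def ast_features(source):
    """Extract structural features that ignore names, spacing, and comments."""
    tree = ast.parse(source)

    def depth(node):
        # Depth of the deepest branch below (and including) this node.
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    node_types = Counter(type(node).__name__ for node in ast.walk(tree))
    return {"max_depth": depth(tree), "node_types": node_types}

original = (
    "def total(values):\n"
    "    result = 0\n"
    "    for v in values:\n"
    "        if v > 0:\n"
    "            result += v\n"
    "    return result\n"
)
# Same structure, but with new names and an added comment.
renamed = (
    "def f(a):\n"
    "    # sum the positives\n"
    "    b = 0\n"
    "    for c in a:\n"
    "        if c > 0:\n"
    "            b += c\n"
    "    return b\n"
)
# Same behavior, written with a different structure.
restructured = (
    "def total(values):\n"
    "    return sum(v for v in values if v > 0)\n"
)

print(ast_features(original) == ast_features(renamed))
print(ast_features(original) == ast_features(restructured))
```

The first comparison holds because the parser discards comments and the features never look at identifier names; only the shape of the tree matters.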
What are the Applications?
Researchers are interested in this area of code analysis, which they call code stylometry, because of security concerns. If they can parse code and find hidden clues about its author, they can trace it back to hackers and cyber criminals.
There is hope that, with the tree analysis going beyond surface features such as naming and formatting, hackers won’t be able to disguise themselves with the digital equivalent of plastic surgery on the thumb!
What do you think are the features that set your code snippet apart from the work of other authors?