Source Code Plagiarism Detection using Information Retrieval Approach

Plagiarism is a growing problem in the academic community. Several plagiarism detection tools are used to detect similar source code. However, exhaustively checking pairwise similarity between documents is not efficient enough for a large collection of source code documents. Hence, to make source code similarity detection fast and scalable, an information retrieval based approach is used.

Fig. 1

In this approach, every document is issued as a query to fetch a ranked list of similar documents, in decreasing order of their similarity to the query. Then, to report a set of documents as potential cases of code reuse, a threshold is applied to the relative drop in similarity down the ranked list. The terms are extracted from the annotated parse tree of the source code, and a field-based language model is used for indexing and retrieving source code documents. As a result, source code parsing improves the accuracy of plagiarism prediction.

Programmers tend to reuse source code fragments that are available on the internet. Because programs share the same language-specific keywords and constructs, a bag-of-words representation of source code produces a large number of hits, which makes source code plagiarism detection challenging. Moreover, in a huge corpus it is very inefficient to compute pairwise similarity values between all source code documents, since n documents require n(n−1)/2 comparisons.

We can avoid the pairwise computation by using an information retrieval based approach. First, every document is added to an inverted index. Then each document in turn is treated as a pseudo-query and used to fetch a ranked list of matching documents from the corpus. Finally, the plagiarised documents are selected from this retrieved list.
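As a rough illustration of the index-then-query idea, here is a minimal sketch. It is not the system described in this post: it assumes documents are already tokenized, and it ranks with a plain tf-idf sum instead of the field-based language model, purely to keep the example short. The names `build_index` and `retrieve` and the three toy documents are made up for the example.

```python
from collections import Counter, defaultdict
import math

def build_index(docs):
    """Map each term to {doc_id: term frequency} (an inverted list)."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index

def retrieve(index, query_tokens, n_docs, exclude=None):
    """Rank documents that share terms with the query by a simple tf-idf sum.

    Assumption: tf-idf stands in for the language model used in the post.
    """
    scores = defaultdict(float)
    for term in set(query_tokens):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            if doc_id != exclude:  # a document should not retrieve itself
                scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy corpus (hypothetical): a.java and b.java share structure, c.java does not.
docs = {
    "a.java": "int sum for i sum add i return sum".split(),
    "b.java": "int total for j total add j return total".split(),
    "c.java": "string name print name".split(),
}
index = build_index(docs)

# Every document is used once as a pseudo-query against the shared index.
ranked = {d: retrieve(index, toks, len(docs), exclude=d) for d, toks in docs.items()}
```

Only documents that appear in the inverted lists of the query terms are ever scored, which is what removes the need to compare every pair of documents.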

Retrieval of Documents using Pseudo Query

Here, every document in the corpus is treated as a pseudo-query, and in response to the query we fetch the top-ranked, most similar documents.

From the retrieved ranked list, we then obtain a candidate set of documents. We truncate the ranked list at a specific point because the documents near the bottom of the list are less relevant. In the cut-off strategy, we keep taking documents from the ranked list until the relative decrease in similarity of the nth document with respect to the (n−1)th document exceeds a threshold value (say, t). The first relative drop above this limit probably marks the beginning of the documents that are not plagiarised.

Plag(Q) = {Di : (sim(Q, Di−1) − sim(Q, Di)) / sim(Q, Di−1) ≤ t}
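The cut-off rule can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `plagiarism_candidates` and the similarity values are invented, and the top-ranked document is always kept because the relative drop is undefined for it.

```python
def plagiarism_candidates(ranked, t):
    """Keep documents until the relative similarity drop first exceeds t.

    `ranked` is a list of (doc_id, similarity) pairs in decreasing order
    of similarity to the pseudo-query.
    """
    if not ranked:
        return []
    kept = [ranked[0]]  # assumption: the top hit is always a candidate
    for prev, cur in zip(ranked, ranked[1:]):
        prev_sim, cur_sim = prev[1], cur[1]
        if prev_sim <= 0:
            break  # avoid dividing by zero on degenerate scores
        drop = (prev_sim - cur_sim) / prev_sim
        if drop > t:
            break  # first large drop: the rest are treated as not plagiarised
        kept.append(cur)
    return kept

# Hypothetical ranked list for one pseudo-query.
ranked = [("d1", 0.90), ("d2", 0.85), ("d3", 0.40), ("d4", 0.38)]
candidates = plagiarism_candidates(ranked, t=0.2)
```

With these made-up values, the drop from d2 to d3 is about 0.53, which exceeds t = 0.2, so only d1 and d2 are kept even though the later drop from d3 to d4 is small.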

Representation of documents

The bag-of-words representation of a source code document does not handle well the situations where only a piece of the source code is copied into another one. To address this problem, the structure of the source code should be taken into account when representing the document as a vector. From every source code document in our corpus, an annotated syntax tree is constructed using a Java parser. Terms are then extracted from this annotated syntax tree and indexed in separate fields. This field-based representation exploits the structure of the documents more effectively than the flat representation.
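To make the field idea concrete, here is a small sketch that walks a syntax tree and collects terms into separate fields. Note the substitutions: the post parses Java source with a Java parser, whereas this example uses Python's standard `ast` module on Python source as a stand-in, and the field names (`classes`, `methods`, `calls`, `identifiers`) are assumptions chosen for illustration.

```python
import ast
from collections import defaultdict

def extract_fields(source):
    """Walk the syntax tree and collect terms into separate named fields."""
    fields = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            fields["classes"].append(node.name)
        elif isinstance(node, ast.FunctionDef):
            fields["methods"].append(node.name)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            fields["calls"].append(node.func.id)  # plain calls such as list(...)
        elif isinstance(node, ast.Name):
            fields["identifiers"].append(node.id)
    return dict(fields)

# Hypothetical input document.
source = """
class Stack:
    def push(self, item):
        items = list(self.data)
        items.append(item)
        return items
"""
fields = extract_fields(source)
```

Each field can then be indexed on its own, so a match on a method name is not drowned out by matches on common identifiers.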

Representation of Query

Since only a portion of the source code is typically copied into another one, the entire document should not be used as a query. Hence, to build a query, a fixed number of terms is extracted from the different fields of a document. The language modeling score of a term t in field f of document d is given as,

LM(t, f, d) = λ (tf(t, f, d) / len(f, d)) + (1 − λ) (cf(t) / cs)

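Using the above function, we score each term of the document ‘d’ and select the top ‘k’ terms to construct a pseudo-query. Here tf(t, f, d) is the frequency of term t in field f of document d, len(f, d) is the length of that field, cf(t) is the frequency of t in the whole collection, and cs is the collection size. ‘λ’ controls the relative significance of term frequency over collection frequency.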
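The scoring and top-k selection can be sketched as follows. This is a minimal example under assumptions: the function names, the toy fields, the choice of k per field, and λ = 0.5 are all illustrative, and ties are broken alphabetically only to make the output deterministic.

```python
from collections import Counter

def lm_score(term, field_tokens, collection_freq, collection_size, lam=0.5):
    """LM(t, f, d) = lam * tf(t, f, d) / len(f, d) + (1 - lam) * cf(t) / cs."""
    tf = field_tokens.count(term)
    return (lam * tf / len(field_tokens)
            + (1 - lam) * collection_freq[term] / collection_size)

def pseudo_query(doc_fields, collection_freq, collection_size, k=2, lam=0.5):
    """Pick the k highest-scoring terms from each field of one document."""
    query = {}
    for field, tokens in doc_fields.items():
        if not tokens:
            continue
        scored = {t: lm_score(t, tokens, collection_freq, collection_size, lam)
                  for t in set(tokens)}
        # Sort by descending score, then alphabetically to break ties.
        query[field] = sorted(scored, key=lambda t: (-scored[t], t))[:k]
    return query

# Two hypothetical documents, already split into fields.
doc = {"methods": ["push", "pop", "push"],
       "identifiers": ["items", "item", "items", "self"]}
other = {"methods": ["main"], "identifiers": ["self", "args"]}

# Collection statistics: cf(t) per term and the total collection size cs.
all_tokens = [t for d in (doc, other) for toks in d.values() for t in toks]
collection_freq = Counter(all_tokens)
collection_size = len(all_tokens)

query = pseudo_query(doc, collection_freq, collection_size, k=2, lam=0.5)
```

Raising λ toward 1 makes the selection depend almost entirely on how often a term occurs in the document's own field, while lowering it gives more weight to the collection frequency term.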

In summary, the information retrieval based approach shows that the structure of the program is very important for detecting source code plagiarism.

References

1) Fig. 1: Cartoon used under Creative Commons from BLAUGH.com

2) http://www.uni-weimar.de/medien/webis/events/pan-at-fire-14/pan14-papers-final/pan14-source-code-reuse-detection/ganguly14-notebook.pdf

3) http://ceur-ws.org/Vol-1176/CLEF2010wn-PAN-Costa-jussaEt2010.pdf

IIITD IR MELANAGE
