Source code plagiarism detection using siamese BLSTM network and embedding models

Bibliographic Details
Main Authors: Manahi, Mohammed; Sulaiman, Suriani; Awang Abu Bakar, Normi Sham
Format: Book Chapter
Language: en
Published: Springer Singapore 2022
Online Access: http://irep.iium.edu.my/97680/7/97680_Source%20code%20plagiarism%20detection.pdf
http://irep.iium.edu.my/97680/
https://doi.org/10.1007/978-981-16-8515-6
Description
Summary: Source code plagiarism is a severe ongoing problem that threatens academic integrity and intellectual rights. Students from computing disciplines commit plagiarism through diverse channels, of which direct in-class plagiarism is the most common. Programming instructors struggle to manually inspect large volumes of submissions for plagiarism. Thus, many detection approaches have been proposed to avoid prolonged manual inspection. In this article, we present a deep learning framework that leverages a Siamese BLSTM network and character-based embeddings to detect source code plagiarism. The goal of this research is to determine which character-based embedding architecture produces the most accurate plagiarism detection scores. The proposed framework uses Word2Vec and fastText models to obtain various pre-trained source code embedding sequences as input to the network. Subsequently, we use Manhattan distance to compute a plagiarism score from the two outputs produced by the network. To the best of our knowledge, this is the first research work to utilise various embedding models for source code plagiarism detection. Experimental results showed that the embeddings from the Word2Vec Skip-Gram and Negative Sampling (W2V-SGNS) architecture produce the most accurate detection scores.
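
Illustrative note: the scoring step described in the summary can be sketched as follows. This is a minimal sketch, not the authors' implementation. The two input vectors stand in for the outputs of the two Siamese branches, and the function names are hypothetical. The summary states only that Manhattan distance is used; mapping the distance to a score with exp(-distance) is an assumption borrowed from common Siamese LSTM practice.

```python
import math


def manhattan_distance(u, v):
    """Return the L1 (Manhattan) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def similarity_score(u, v):
    """Map the Manhattan distance to a score in (0, 1].

    Assumption: exp(-distance) is used here for illustration only;
    the summary does not say how distance becomes a plagiarism score.
    """
    return math.exp(-manhattan_distance(u, v))


# Hypothetical branch outputs for two source files.
encoding_a = [0.25, -0.5, 1.0]
encoding_b = [0.25, -0.5, 1.0]
encoding_c = [1.25, 0.5, -1.0]

print(similarity_score(encoding_a, encoding_b))  # identical encodings
print(similarity_score(encoding_a, encoding_c))  # distant encodings
```

Under this assumed mapping, identical encodings give a score of exactly 1.0, and the score falls towards 0 as the encodings move apart.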