Source code plagiarism detection using Siamese BLSTM network and embedding models
| Main Authors: | , , |
|---|---|
| Format: | Book Chapter |
| Language: | en |
| Published: | Springer Singapore, 2022 |
| Subjects: | |
| Online Access: | http://irep.iium.edu.my/97680/7/97680_Source%20code%20plagiarism%20detection.pdf http://irep.iium.edu.my/97680/ https://doi.org/10.1007/978-981-16-8515-6 |
| Summary: | Source code plagiarism is a severe, ongoing problem that threatens academic integrity and intellectual rights. Students in computing disciplines commit plagiarism through diverse channels, of which direct in-class plagiarism is the most common. Programming instructors struggle to manually inspect large volumes of submissions for plagiarism. Thus, many detection approaches have been proposed to overcome prolonged manual inspection. In this article, we present a deep learning framework that leverages a Siamese BLSTM network and character-based embeddings to detect source code plagiarism. The goal of this research is to determine which character-based embedding architecture produces the most accurate plagiarism detection scores. The proposed framework uses Word2Vec and fastText models to obtain various pre-trained source code embedding sequences as input to the network. Subsequently, we utilise Manhattan distance to measure the plagiarism score between the two outputs produced by the network. To the best of our knowledge, this is the first research work to utilise various embedding models for source code plagiarism detection. Experimental results showed that the embeddings from the Word2Vec Skip-Gram with Negative Sampling (W2V-SGNS) architecture produce the most accurate detection scores. |

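The summary states that a Manhattan distance is taken between the two outputs of the Siamese network to obtain a plagiarism score. A minimal sketch of that scoring step is given below, under stated assumptions: the encoding vectors are made-up stand-ins for the two BLSTM outputs, the function names are hypothetical, and the `exp(-distance)` mapping to a score in (0, 1] is a common choice in Siamese recurrent networks rather than something this record confirms.

```python
import math


def manhattan_distance(a, b):
    """Return the L1 (Manhattan) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def similarity_score(a, b):
    """Map the L1 distance to a score in (0, 1].

    Assumption: exp(-distance) is used here for illustration only;
    the record does not state how the distance is turned into a score.
    """
    return math.exp(-manhattan_distance(a, b))


# Hypothetical encodings standing in for the two network outputs.
enc_a = [0.2, -0.1, 0.4]
enc_b = [0.2, -0.1, 0.4]   # identical to enc_a
enc_c = [0.9, 0.5, -0.3]   # clearly different from enc_a

score_same = similarity_score(enc_a, enc_b)
score_diff = similarity_score(enc_a, enc_c)
```

With this mapping, identical encodings score exactly 1 and the score decays towards 0 as the encodings move apart, so a higher score would indicate a more likely plagiarised pair.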