akuraHT: What is plagiarism detection?

With ever-increasing electronic content and easy access to the World Wide Web, plagiarism in academia, research, journalism and literature has become a major issue. But do you know what plagiarism is and how to prevent or detect it? If you're a university student or a content writer, this article will be useful to you.

What is Plagiarism?

It is hard to give an exact definition of the word plagiarism, but according to the Merriam-Webster dictionary, its simple meaning is “to use the words or ideas of another person as if they were your own words or ideas”. Plagiarism also includes:

Turning in someone else's work as your own.
Copying words or ideas from someone else without giving credit.
Failing to put a quotation in quotation marks.
Giving incorrect information about the source of a quotation.
Changing words but copying the sentence structure of a source without giving credit.
Copying so many words or ideas from a source that it makes up the majority of your work, whether you give credit or not.

There are two main classes of methods used to reduce plagiarism.

Plagiarism Prevention:
Punishment routines and procedures that explain the drawbacks of plagiarism. These take a long time to implement but have a long-term positive effect.
Plagiarism Detection:
Manual methods and software tools. These are easy to implement but have only a momentary positive effect.

Plagiarism Detection

Plagiarism detection can be done manually or with an automated process. The automated process is very similar to natural language processing, visual identification and biometric processes; all of these are founded on pattern recognition. An automated process doesn't give 100% accuracy, so manual checking is still needed.

Internal Plagiarism Detection

Internal plagiarism detection means finding plagiarized passages within a document without access to the potential original text. It is also called intrinsic plagiarism detection.

External Plagiarism Detection

External plagiarism detection consists of comparing a suspicious document against potential original documents.

Plagiarism Detection in source code

Detecting plagiarism in source code is relatively easy compared with natural language plagiarism detection, because there is neither ambiguity nor interference between words in programming languages. In natural language, by contrast, every word may have many synonyms and different meanings. Some plagiarism detection methods are language independent and some are language dependent.

Plagiarism Detection in natural language

This means detecting plagiarism in written documents. The methods can be divided into two categories: language independent plagiarism detection and language dependent plagiarism detection.

Language Independent Plagiarism Detection

Language independent methods are based on evaluating text characteristics that are common to all languages, such as the number of special characters and the average length of a sentence. Paraphrasing techniques can be used to mislead language independent systems.
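To make the idea concrete, here is a minimal Python sketch that computes the two characteristics mentioned above. The function name and the rough sentence-splitting rule are my own assumptions for illustration; a real system would use many more features.

```python
import re

def language_independent_features(text: str) -> dict:
    """Compute simple features that do not depend on any particular language."""
    # Rough heuristic (an assumption): a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Special characters: anything that is not a letter, a digit or whitespace.
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    words_per_sentence = [len(s.split()) for s in sentences]
    avg_len = sum(words_per_sentence) / len(sentences) if sentences else 0.0
    return {
        "special_characters": special,
        "sentence_count": len(sentences),
        "avg_sentence_length": avg_len,
    }

features = language_independent_features("Hello, world! This is a short test.")
print(features)
```

Two documents (or two parts of one document) can then be compared by looking at how far apart these numbers are.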

Language Dependent Plagiarism Detection

These methods are based on evaluating text characteristics that are specific to one language, such as the frequency of a special word in a particular language. Language dependent plagiarism detection is more effective than language independent plagiarism detection.

Stylometry-based methods

Stylometry is a statistical approach used for authorship attribution. Stylometry-based methods are inspired by authorship attribution methods and basically consist of classifying the writing styles of authors to identify similarity. They are based on the assumption that every author has a unique style. The writing style can be analyzed by using factors within the same document, or by comparing two documents by the same author. This is done by dividing the documents into parts such as paragraphs and sentences. The style features are then extracted and analyzed. The main linguistic stylometric features are:

Text statistics, which operate at the character level (number of commas, question marks, word lengths, etc.).
Syntactic features, which measure writing style at the sentence level (sentence lengths, use of function words, etc.).
Closed-class word sets, which count special words (number of stop words, foreign words, "difficult" words, etc.).
Structural features, which reflect text organization (paragraph lengths, chapter lengths, etc.).

Using these features, formulas can be derived to identify the writing style of an author. Stylometry-based methods can be used in both internal and external plagiarism detection.
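As a rough illustration of the internal (intrinsic) case, the Python sketch below computes a small style vector for each paragraph and flags paragraphs whose style is far from the document average. The three features, the tiny function-word list and the threshold of 1.5 standard deviations are all assumptions I chose for the example, not values from any published method.

```python
import statistics

# A tiny, illustrative set of English function words (an assumption, not a standard list).
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}

def style_features(paragraph: str) -> list[float]:
    """Return [average word length, commas per word, function-word ratio]."""
    words = paragraph.split()
    n = max(len(words), 1)
    stripped = [w.strip(".,;:!?\"'()").lower() for w in words]
    avg_word_len = sum(len(w) for w in stripped) / n
    commas_per_word = paragraph.count(",") / n
    function_ratio = sum(1 for w in stripped if w in FUNCTION_WORDS) / n
    return [avg_word_len, commas_per_word, function_ratio]

def style_outliers(paragraphs: list[str], threshold: float = 1.5) -> list[int]:
    """Return indices of paragraphs whose style deviates from the document average."""
    vectors = [style_features(p) for p in paragraphs]
    outliers = []
    for j in range(len(vectors[0])):
        column = [v[j] for v in vectors]
        mean = statistics.mean(column)
        stdev = statistics.pstdev(column)
        if stdev == 0:
            continue  # every paragraph has the same value for this feature
        for i, value in enumerate(column):
            # Flag a paragraph when any feature lies more than `threshold` deviations away.
            if abs(value - mean) / stdev > threshold and i not in outliers:
                outliers.append(i)
    return sorted(outliers)
```

A flagged paragraph is only a hint that the style changes there; it still needs manual checking.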

Content-Based methods

Content-based methods analyze the logical structure of texts to discover similarity. They can be used only in external plagiarism detection.

Fingerprinting technique

A fingerprint is a set of integers, created by hashing subsets of a document, that represents its key content. The method measures the similarity of two documents by comparing their fingerprints. Techniques for generating fingerprints are mainly based on k-grams (a k-gram is a contiguous substring of length k), which serve as the basis for most fingerprinting methods.
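Here is a small Python sketch of k-gram fingerprinting, under a few assumptions of my own: the text is normalised by dropping everything except letters and digits, SHA-256 is used as the hash, an optional `modulus` keeps only a subset of the hashes, and the two fingerprints are compared with Jaccard similarity. Real tools differ in all of these choices.

```python
import hashlib

def kgrams(text: str, k: int = 5) -> list[str]:
    """Return all contiguous substrings of length k after light normalisation."""
    normalised = "".join(ch.lower() for ch in text if ch.isalnum())
    return [normalised[i:i + k] for i in range(len(normalised) - k + 1)]

def fingerprint(text: str, k: int = 5, modulus: int = 1) -> set[int]:
    """Hash every k-gram; keep hashes divisible by `modulus` (1 keeps them all)."""
    hashes = set()
    for gram in kgrams(text, k):
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")  # first 64 bits as an integer
        if value % modulus == 0:
            hashes.add(value)
    return hashes

def jaccard(a: set[int], b: set[int]) -> float:
    """Share of fingerprint values the two documents have in common."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

original = fingerprint("Plagiarism detection is an interesting problem.")
suspect = fingerprint("Plagiarism detection is a very interesting problem!")
print(jaccard(original, suspect))
```

A larger k makes matches more specific, while a larger `modulus` gives smaller fingerprints at the cost of missing some overlaps.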

Latent Semantic Analysis (LSA)

In this technique, words that are close in meaning are assumed to occur in similar pieces of text. A matrix is constructed in which rows represent words and columns represent documents. Every document contains only a subset of all words, so the matrix is sparse. Singular Value Decomposition (SVD), a factorization method for real or complex matrices, is used to reduce the number of columns while preserving the similarity structure among rows. This decomposition is time consuming because the matrix is large and sparse. Words are compared by taking the cosine of the angle between the vectors formed by any two rows. Values close to 1 represent very similar words, while values close to 0 represent very dissimilar words.
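The steps above can be written compactly as follows. The symbols are my own notation, not taken from a particular paper: X is the word-by-document matrix, k is the number of dimensions kept after the reduction, and t_i is the reduced vector for word i (row i of U_k Σ_k).

```latex
X = U \Sigma V^{\mathsf{T}}, \qquad
X_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
\cos\theta_{ij} = \frac{t_i \cdot t_j}{\lVert t_i \rVert \, \lVert t_j \rVert}
```

Here U_k, Σ_k and V_k keep only the k largest singular values and their vectors, so each word is described by k numbers instead of one number per document.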

Stanford Copy Analysis Mechanism (SCAM)

SCAM is based on a registration-based copy detection scheme. Documents are registered in a repository, and new documents are then compared with the pre-registered documents. The architecture of the copy detection server consists of a repository and a chunker. Chunking breaks a document up into units such as sentences, words or overlapping sentences. Documents are chunked before being registered, and a new document must be chunked into the same unit before it is compared with the pre-registered documents. An inverted index is used for storing the chunks of registered documents. Each index entry for a chunk points to the documents in which that chunk occurs (its postings). Each posting has two parts: the document name and the number of occurrences of the chunk in that document. A small chunk unit increases the probability of finding similarity between documents. The chunk unit in SCAM is a word. Documents are compared using the Relative Frequency Model (RFM), which mainly consists of computing the set of words that occur with a similar frequency in the two documents.
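The Python sketch below shows the general shape of such a scheme: a word chunker, an inverted index whose postings are (document name, occurrence count) pairs, and a comparison that counts the words with similar frequencies. It is a simplified illustration, not SCAM's actual implementation or its exact similarity formula; the `Repository` class, the closeness test and the tolerance `epsilon` are my own assumptions.

```python
from collections import Counter, defaultdict

PUNCTUATION = ".,;:!?\"'()"

def chunk(text: str) -> list[str]:
    """Chunk a document into lowercase words (the chunk unit is a word)."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

class Repository:
    def __init__(self) -> None:
        # Inverted index: chunk -> list of postings (document name, occurrence count).
        self.index: dict[str, list[tuple[str, int]]] = defaultdict(list)

    def register(self, name: str, text: str) -> None:
        """Chunk a document and add one posting per distinct word."""
        for word, count in Counter(chunk(text)).items():
            self.index[word].append((name, count))

    def compare(self, text: str, epsilon: float = 2.5) -> dict[str, int]:
        """For each registered document, count the words whose frequency is close to the query's."""
        query = Counter(chunk(text))
        close_words: dict[str, int] = defaultdict(int)
        for word, q_count in query.items():
            for name, r_count in self.index.get(word, []):
                # Assumed closeness test: equal counts give 2.0, so 2.5 allows a small difference.
                if q_count / r_count + r_count / q_count < epsilon:
                    close_words[name] += 1
        return dict(close_words)

repo = Repository()
repo.register("essay-1", "The cat sat on the mat.")
repo.register("essay-2", "Dogs bark loudly.")
print(repo.compare("The cat sat on the mat!"))
```

Because only the postings of words in the new document are visited, the comparison does not need to scan every registered document in full.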

Natural Language Processing and Machine Learning for plagiarism detection

NLP is used in pre-processing stages such as sentence segmentation, tokenization, stop-word removal, synonym replacement, stemming, number replacement and punctuation removal when identifying plagiarized texts. These pre-processing techniques improve the accuracy and efficiency of plagiarism detection algorithms. Plagiarism detection can also be addressed effectively through a machine learning approach. There is ongoing research on doing this with ML, neural networks and deep learning.
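A minimal Python sketch of such a pre-processing pipeline, using only the standard library, could look like this. The stop-word list, the `num` placeholder and the very naive suffix stemmer are assumptions for illustration; a real system would normally rely on an NLP library, and synonym replacement is left out because it needs a thesaurus.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "it"}

def naive_stem(word: str) -> str:
    """Very rough suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[list[str]]:
    """Return one list of normalised tokens per sentence."""
    # Sentence segmentation: split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    result = []
    for sentence in sentences:
        # Number replacement: map every number to the placeholder token "num".
        sentence = re.sub(r"\d+(?:\.\d+)?", " num ", sentence)
        # Punctuation removal.
        sentence = sentence.translate(str.maketrans("", "", string.punctuation))
        # Tokenization on whitespace, then stop-word removal and stemming.
        tokens = [t for t in sentence.lower().split() if t not in STOP_WORDS]
        result.append([naive_stem(t) for t in tokens])
    return result

print(preprocess("The students copied 3 essays. It is detected!"))
```

The normalised tokens can then be passed to any of the comparison methods described earlier.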

Popular software tools for plagiarism detection.

The detection of plagiarism is not a new research area. Various approaches have been developed to deal with source code and natural language plagiarism detection.

plagiarism.org and turnitin.com are popular tools for addressing web-based plagiarism. Glatt Plagiarism Services, Inc. offers a user-end, software-based approach to preventing and detecting plagiarism. You can find more details about these technologies here.

There are a number of software tools available for plagiarism detection, but most of them are not popular because of their low accuracy. The methods used for plagiarism detection so far are limited to a very superficial level, so plagiarism detection technologies still need to grow.

2 comments:

Surin Athukorala, October 4, 2018 at 10:39 AM
Nicely written article. Very informative. :)
Unknown, February 20, 2022 at 3:50 PM
Hello madam, I am a final-year undergraduate at the University of Moratuwa. I need your help with how to measure the accuracy of the word2vec method on a trained data set in Sinhala. Could you do me a favor?
Thank you.

Thursday, October 4, 2018
