Thursday, October 4, 2018

What is plagiarism detection?

Because of the ever-increasing volume of electronic content and easy access to the World Wide Web, plagiarism in academia, research, journalism and literature has become a major issue. But do you know what plagiarism is and how to prevent or detect it? If you're a university student or a content writer, this article will be useful to you.

What is Plagiarism? 

It is hard to give an exact definition of plagiarism, but according to the Merriam-Webster dictionary, the simple meaning of plagiarism is “To use the words or ideas of another person as if they were your own words or ideas”. Plagiarism also includes:
  1. Turning in someone else's work as your own.
  2. Copying words or ideas from someone else without giving credit.
  3. Failing to put a quotation in quotation marks.
  4. Giving incorrect information about the source of a quotation.
  5. Changing words but copying the sentence structure of a source without giving credit.
  6. Copying so many words or ideas from a source that it makes up the majority of your work, whether you give credit or not.

There are two main classes of methods used to reduce plagiarism.

  1. Plagiarism Prevention:
    Punishment routines and procedures that explain the drawbacks of plagiarism. These take a long time to implement but have a long-term positive effect.

  2. Plagiarism Detection:
    Manual methods and software tools. These are easy to implement but have only a short-term positive effect.

Plagiarism Detection


Plagiarism detection can be done manually or with an automated process. The automated process is closely related to natural language processing, visual identification and biometric processes; all of these are founded on pattern recognition. An automated process does not give 100% accuracy, so manual checking is still needed.


Internal Plagiarism Detection


Internal plagiarism detection, also called intrinsic plagiarism detection, means finding plagiarized passages within a document without access to the potential original text.


External Plagiarism Detection


External plagiarism detection consists of comparing a suspicious document against potential original documents.

Plagiarism Detection in source code


Detecting plagiarism in source code is relatively easier than detecting it in natural language, because programming languages have neither ambiguity nor interference between words. In natural language, every word may have many synonyms and different meanings. Some plagiarism detection methods are language independent and some are language dependent.

Plagiarism Detection in natural language


This means detecting plagiarism in written documents. These methods can be divided into two categories: language independent plagiarism detection and language dependent plagiarism detection.

Language Independent Plagiarism Detection


Language independent methods are based on evaluating text characteristics that are common to all languages, such as the number of special characters and the average length of a sentence. Paraphrasing techniques can be used to mislead language independent systems.

Language Dependent Plagiarism Detection


These methods are based on evaluating text characteristics that are specific to one language, such as the frequency of a special word in a particular language. Language dependent plagiarism detection is more effective than language independent plagiarism detection.

Stylometry-Based methods

 

Stylometry is a statistical approach used for authorship attribution. Stylometry-based methods are inspired by authorship attribution methods and consist basically of classifying the writing styles of authors to identify similarity. They rest on the assumption that every author has a unique style. The writing style can be analyzed by using factors within the same document, or by comparing two documents by the same author. This is done by dividing the documents into parts such as paragraphs and sentences. The style features are then extracted and analyzed. The main linguistic stylometric features are:
  • Text statistics which operate at the character level (number of commas, question marks, word lengths, etc.).
  • Syntactic features to measure writing style at the sentence level (sentence lengths, use of function words, etc.).
  • Closed-class word sets to count special words (number of stop words, foreign words, "difficult" words, etc.).
  • Structural features which reflect text organization (paragraph lengths, chapter lengths, etc.).

Using these features, formulas can be derived to identify the writing style of an author. Stylometry-based methods can be used in internal and external plagiarism detection.
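
A few of these features can be computed with very little code. The following is only a minimal sketch in Python: the function name style_features, the tiny stop-word list and the sample paragraphs are all assumptions made for illustration, and a real system would use many more features.

```python
import re
from statistics import mean

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}

def style_features(text):
    """Return a few simple stylometric features for one chunk of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        # Character-level text statistics
        "commas": text.count(","),
        "question_marks": text.count("?"),
        "avg_word_length": mean(len(w) for w in words) if words else 0.0,
        # Sentence-level feature (words per sentence)
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        # Closed-class word feature (share of stop words)
        "stop_word_ratio": (
            sum(w.lower() in STOP_WORDS for w in words) / len(words) if words else 0.0
        ),
    }

paragraphs = [
    "The cat sat on the mat. It was warm, and the cat slept.",
    "Notwithstanding considerable methodological heterogeneity, "
    "longitudinal investigations demonstrate substantial convergence.",
]
for paragraph in paragraphs:
    print(style_features(paragraph))
```

For internal detection, a paragraph whose feature values differ sharply from the rest of the document would be a candidate for closer manual checking.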

Content-Based methods


Content-based methods analyze the logical structure of texts to discover similarity. They can be used only in external plagiarism detection.

Fingerprinting technique


A fingerprint is a set of integers, created by hashing subsets of a document, that represents its key content. The method measures the similarity of two documents by comparing their fingerprints. Most techniques for generating fingerprints are based on k-grams (a k-gram is a contiguous substring of length k).
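
The idea can be sketched in a few lines of Python. This is a simplified illustration, not a production algorithm: it hashes every k-gram, whereas real systems usually keep only a selected subset of the hashes. The choice of SHA-1, k = 5 and Jaccard similarity for the comparison are assumptions made for this example.

```python
import hashlib
import re

def kgrams(text, k=5):
    """Return all contiguous substrings of length k after normalising the text."""
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    return [cleaned[i:i + k] for i in range(len(cleaned) - k + 1)]

def fingerprint(text, k=5):
    """Hash every k-gram to an integer; the set of hashes is the fingerprint."""
    return {
        int.from_bytes(hashlib.sha1(g.encode("utf-8")).digest()[:4], "big")
        for g in kgrams(text, k)
    }

def similarity(a, b, k=5):
    """Jaccard similarity between the fingerprints of two documents."""
    fa, fb = fingerprint(a, k), fingerprint(b, k)
    if not fa and not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)

original = "Plagiarism is using the words of another person as your own."
suspect = "Plagiarism is using the words of another person, as if your own."
print(similarity(original, suspect))
print(similarity(original, "REST is an architectural style."))
```

Because the text is normalised before hashing, changes in case, spacing and punctuation do not affect the fingerprint, but rewording does.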

Latent Semantic Analysis (LSA)

In this technique, words that are close in meaning are assumed to occur in similar pieces of text. A matrix is constructed in which rows represent words and columns represent documents. Every document contains only a subset of all words, so the matrix is sparse. Singular Value Decomposition (SVD), a factorization method for real or complex matrices, is used to reduce the number of columns while preserving the similarity structure among rows. This decomposition is time consuming when the matrix is large. Words are compared by taking the cosine of the angle between the two vectors formed by any two rows. Values close to 1 represent very similar words, while values close to 0 represent very dissimilar words.
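
The matrix and the cosine comparison can be sketched as follows. Note that this example deliberately leaves out the SVD step, so it only shows how the word-by-document matrix is built and how two rows are compared; a full LSA pipeline would apply the decomposition (normally with a numerical library) before comparing rows. The three sample documents are invented for illustration.

```python
import math

# Invented sample documents (assumption for illustration).
documents = {
    "d1": "the cat chased the mouse",
    "d2": "the cat ate the mouse",
    "d3": "stocks fell as markets closed",
}

doc_names = sorted(documents)
vocabulary = sorted({w for text in documents.values() for w in text.split()})

# Term-document matrix: one row per word, one column per document.
matrix = {
    word: [documents[d].split().count(word) for d in doc_names]
    for word in vocabulary
}

def cosine(u, v):
    """Cosine of the angle between two row vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# "cat" and "mouse" occur in the same documents; "cat" and "stocks" never do.
print(cosine(matrix["cat"], matrix["mouse"]))
print(cosine(matrix["cat"], matrix["stocks"]))
```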

Stanford Copy Analysis Mechanism (SCAM)

This is based on a registration-based copy detection scheme. Documents are registered in a repository, and new documents are then compared with the pre-registered documents. The architecture of the copy detection server consists of a repository and a chunker. Chunking breaks a document up into sentences, words or overlapping sentences. Documents are chunked before being registered, and a new document must be chunked into the same unit before it is compared with pre-registered documents. An inverted index is used for storing the chunks of registered documents. Each entry for a chunk is a pointer to a document in which that chunk occurs (a posting). Each posting has two parts: the document name and the number of occurrences of the chunk in that document. A small chunk unit increases the probability of finding similarity between documents. The chunk unit in SCAM is a word. Documents are compared using the Relative Frequency Model (RFM), which consists mainly of computing a set of words that occur with the same frequency in two documents.
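
The registration and comparison steps described above can be sketched like this. It is a toy illustration and not SCAM's actual implementation: the class and method names are hypothetical, and the comparison uses strict equality of word counts as a simplification of the Relative Frequency Model.

```python
import re
from collections import Counter, defaultdict

def chunk(text):
    """Chunk a document into words, the chunk unit described for SCAM."""
    return re.findall(r"[a-z0-9]+", text.lower())

class Repository:
    """Hypothetical repository backed by an inverted index."""

    def __init__(self):
        # chunk -> {document name: occurrence count}; each item is a posting
        self.index = defaultdict(dict)

    def register(self, name, text):
        """Chunk a document and add one posting per distinct word."""
        for word, count in Counter(chunk(text)).items():
            self.index[word][name] = count

    def same_frequency_words(self, text):
        """For each registered document, collect the words whose count
        matches the count in the new document."""
        matches = defaultdict(set)
        for word, count in Counter(chunk(text)).items():
            for name, registered_count in self.index.get(word, {}).items():
                if registered_count == count:
                    matches[name].add(word)
        return dict(matches)

repo = Repository()
repo.register("a", "the quick brown fox the end")
repo.register("b", "market prices fell")
print(repo.same_frequency_words("the fox saw the market"))
```

The larger the set of matching words for a registered document, the more likely it is that the new document overlaps with it.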

Natural Language Processing and Machine Learning for plagiarism detection

NLP is used in pre-processing stages such as sentence segmentation, tokenization, stop-word removal, synonym replacement, stemming, number replacement and punctuation removal to help identify plagiarized text. These pre-processing techniques improve the accuracy and efficiency of plagiarism detection algorithms. Plagiarism detection can also be addressed effectively through a machine learning approach. There is ongoing research on doing this with ML, neural networks and deep learning.
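
Most of these pre-processing stages can be sketched with the Python standard library. The stop-word list and the crude suffix-stripping "stemmer" below are stand-ins invented for this example (a real system would use a proper NLP library), and synonym replacement is left out because it needs a thesaurus.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Return one list of normalised tokens per sentence."""
    # Sentence segmentation: split on whitespace that follows . ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    result = []
    for sentence in sentences:
        sentence = re.sub(r"\d+(\.\d+)?", "NUM", sentence)   # number replacement
        tokens = re.findall(r"[A-Za-z]+", sentence)           # tokenization, drops punctuation
        tokens = [t.lower() for t in tokens]
        tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
        result.append([crude_stem(t) for t in tokens])        # stemming
    return result

print(preprocess("The students copied 3 essays. Copying is cheating!"))
```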

Popular software tools for plagiarism detection


The detection of plagiarism is not a new research area. Various approaches have been developed to deal with source code and natural language plagiarism detection.

plagiarism.org and turnitin.com are popular tools for addressing web-based plagiarism. Glatt Plagiarism Services, Inc. offers a user-end, software-based approach to preventing and detecting plagiarism. You can find more details about these technologies here.

There are a number of software tools available for plagiarism detection, but most of them are not popular because of their low accuracy. The methods used for plagiarism detection so far are limited to a very superficial level, so plagiarism detection technologies still need to grow.







Thursday, January 18, 2018

RESTful Web Services

What is REST?

REST stands for REpresentational State Transfer. It was first introduced by Roy Fielding in 2000.

REST is a web-standards-based architectural style for networked hypermedia applications. It is primarily used to build web services that are lightweight, maintainable and scalable. REST does not depend on any particular protocol, but almost every RESTful service uses HTTP as its underlying protocol.

In the REST architectural style, data and functionality are considered resources and are accessed using Uniform Resource Identifiers (URIs).

What is RESTful Web Service?

A web service is a collection of open protocols and standards used for exchanging data between applications or systems. Web services based on the REST architecture are known as RESTful web services. These web services use HTTP methods to implement the concept of the REST architecture. A RESTful web service usually defines a URI (Uniform Resource Identifier) for the service, a resource representation such as JSON, and a set of HTTP methods.


Features of RESTful Services

In general, RESTful services should have the following properties and features.

1. Representations
2. Messages
3. URIs
4. Uniform interface
5. Stateless
6. Links between resources
7. Caching

Representations


The focus of a RESTful service is on resources and how to provide access to these resources. A resource can easily be thought of as an object, as in OOP. Designing the representations involves two steps:
1. Identify the resources and determine how they are related to each other. (This is similar to the first step of designing a database: identifying entities and relations.)
2. Find a way to represent these resources in the system. We can use any format, such as JSON or XML. We can also support more than one format and decide which one to use for a response depending on the type of client or some request parameters.

A good representation should have the following qualities.

  1. Both client and server should be able to comprehend the format of the representation.
  2. A representation should be able to completely represent a resource. If there is a need to partially represent a resource, then you should think about breaking this resource into child resources. (Smaller representations mean less time is required to create and transfer them.)
  3. The representation should be capable of linking resources to each other. This can be done by placing the URI or unique ID of the related resource in a representation (more on this in the coming sections).
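
As a small illustration of serving one resource in more than one format, here is a sketch in Python. The book resource, its fields and the represent function are hypothetical, and choosing the format from the Accept header value is just one possible way to decide.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical resource: a book that links to its author by URI.
book = {"id": 42, "title": "REST Basics", "author": "/authors/7"}

def to_json(resource):
    """Represent the resource as a JSON document."""
    return json.dumps(resource, sort_keys=True)

def to_xml(resource, tag="book"):
    """Represent the resource as an XML element with one child per field."""
    root = ET.Element(tag)
    for key, value in resource.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

def represent(resource, accept):
    """Pick a representation from the client's Accept header value.
    Returns (content type, body); JSON is the default."""
    if "application/xml" in accept:
        return "application/xml", to_xml(resource)
    return "application/json", to_json(resource)

print(represent(book, "application/json"))
print(represent(book, "application/xml"))
```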

Messages

The client and service talk to each other via messages. Clients send a request to the server, and the server replies with a response.

HTTP request


Method: one of the HTTP methods, such as GET, PUT, DELETE, POST or OPTIONS.
URI: the URI of the resource on which the operation is going to be performed.
Version: the version of HTTP.
Request Header: the metadata, as a collection of key-value pairs of headers and their values.
Entity Body/Request Body: the actual message content. In a RESTful service, this is where the representations of resources sit in a message.

HTTP response


Version: the version of HTTP.
Status Code: the 3-digit HTTP status code of the response.
Phrase: a short text describing the status of the request.
Response Header: the metadata and settings about the response message.
Entity Body/Response Body: the representation, if the request was successful.
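
To make the parts of the two messages concrete, the sketch below assembles a request as text and splits a response back into its parts. The helper names build_request and parse_response are made up for this example, and the code handles only simple, well-formed HTTP/1.1 messages.

```python
def build_request(method, uri, headers, body=""):
    """Assemble method, URI, version, headers and body into a request message."""
    lines = [f"{method} {uri} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # A blank line separates the headers from the entity body.
    return "\r\n".join(lines) + "\r\n\r\n" + body

def parse_response(raw):
    """Split a raw response into version, status code, phrase, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    version, code, phrase = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return {"version": version, "status": int(code), "phrase": phrase,
            "headers": headers, "body": body}

request = build_request("GET", "/books/42",
                        {"Host": "example.com", "Accept": "application/json"})
print(request)

response = parse_response(
    "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n{\"id\": 42}"
)
print(response)
```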

URI


REST requires each resource to have at least one URI. A RESTful service uses human-readable URIs that resemble a directory hierarchy to address its resources. The job of a URI is to identify a resource or a collection of resources. The actual operation is determined by an HTTP verb. The URI should not say anything about the operation or action. This enables us to call the same URI with different HTTP verbs to perform different operations.

Important Recommendations for a well-structured URI

  • Use plural nouns for naming your resources.
  • Avoid using spaces as they create confusion. Use an _ (underscore) or - (hyphen) instead.
  • The scheme and host of a URI are case insensitive, but the rest of a URI should be treated as case sensitive, so use one letter case consistently.
  • You can have your own conventions, but stay consistent throughout the service. Make sure your clients are aware of this convention. It becomes easier for your clients to construct the URIs programmatically if they are aware of the resource hierarchy and the URI convention you follow.
  • A cool URI never changes.
  • Avoid verbs in your resource names unless your resource is actually an operation or a process.

Uniform Interfaces

RESTful systems should have a uniform interface. HTTP/1.1 provides a set of methods, called verbs, for this purpose. Among these, the more important verbs are:

  • GET – Provides read-only access to a resource.
  • PUT – Used to replace (update) an existing resource, or to create a resource at a URI chosen by the client.
  • DELETE – Used to remove a resource.
  • POST – Used to create a new resource; it is also sometimes used to update an existing one.
  • OPTIONS – Used to get the supported operations on a resource.
  • HEAD – Returns only the response headers and no response body.
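
The sketch below shows how the same URI can be called with different verbs to perform different operations. It is a hypothetical in-memory service written in plain Python, not a real web framework, and it follows one common convention: POST to a collection creates a resource, and PUT to an item URI creates or replaces it.

```python
class BookService:
    """Hypothetical in-memory service; handle() returns (status code, body)."""

    def __init__(self):
        self.books = {}
        self.next_id = 1

    def handle(self, method, uri, body=None):
        parts = [p for p in uri.split("/") if p]
        if not parts or parts[0] != "books" or len(parts) > 2:
            return 404, None
        if len(parts) == 2 and not parts[1].isdigit():
            return 404, None
        book_id = int(parts[1]) if len(parts) == 2 else None

        if book_id is None:                      # collection URI: /books
            if method == "OPTIONS":
                return 200, "GET, POST, OPTIONS"
            if method == "GET":
                return 200, list(self.books.values())
            if method == "POST":                 # server chooses the new URI
                new_id = self.next_id
                self.next_id += 1
                self.books[new_id] = body
                return 201, f"/books/{new_id}"
            return 405, None

        if method == "OPTIONS":                  # item URI: /books/{id}
            return 200, "GET, HEAD, PUT, DELETE, OPTIONS"
        if method == "PUT":                      # create or replace at this URI
            created = book_id not in self.books
            self.books[book_id] = body
            self.next_id = max(self.next_id, book_id + 1)
            return (201 if created else 200), body
        if book_id not in self.books:
            return 404, None
        if method == "GET":
            return 200, self.books[book_id]
        if method == "HEAD":                     # same status as GET, no body
            return 200, None
        if method == "DELETE":
            del self.books[book_id]
            return 204, None
        return 405, None

service = BookService()
print(service.handle("POST", "/books", {"title": "A"}))
print(service.handle("GET", "/books/1"))
print(service.handle("DELETE", "/books/1"))
```

Note that the URI itself never names the action; only the verb changes.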

Statelessness

A RESTful service is stateless and does not maintain the application state for any client. A request cannot be dependent on a past request and a service treats each request independently. HTTP is a stateless protocol by design and you need to do something extra to implement a stateful service using HTTP. 

Links Between Resources

A resource representation can contain links to other resources, just as an HTML page contains links to other pages.
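
For example, a representation of an order could carry the URIs of its customer and items. The resource names and the "links" field below are hypothetical; they only illustrate the idea of embedding URIs in a representation.

```python
import json

def order_representation(order_id, customer_id, item_ids):
    """Build a representation whose 'links' let a client reach related resources."""
    return {
        "id": order_id,
        "links": {
            "self": f"/orders/{order_id}",
            "customer": f"/customers/{customer_id}",
            "items": [f"/items/{i}" for i in item_ids],
        },
    }

print(json.dumps(order_representation(10, 3, [1, 2]), indent=2))
```

A client that receives this representation can follow the URIs in "links" without having to construct them itself.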

Caching

Caching is the concept of storing a generated result and using the stored result, instead of generating it repeatedly, if the same request arrives in the near future.
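
A very small cache along these lines might look like the sketch below. The class name, the max_age_seconds parameter and the injectable clock are assumptions made for this example; in a real HTTP service, caching is normally controlled through response headers rather than hand-written code like this.

```python
import time

class ResponseCache:
    """Keep generated responses for a limited time, keyed by method and URI."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self.max_age = max_age_seconds
        self.clock = clock             # injectable so the cache can be tested
        self.entries = {}              # (method, uri) -> (stored_at, response)

    def get_or_generate(self, method, uri, generate):
        """Return a fresh cached response, or call generate() and store the result."""
        key = (method, uri)
        now = self.clock()
        cached = self.entries.get(key)
        if cached is not None and now - cached[0] < self.max_age:
            return cached[1]
        response = generate()
        self.entries[key] = (now, response)
        return response

def expensive_lookup():
    print("generating response")
    return "list of books"

cache = ResponseCache(max_age_seconds=60)
cache.get_or_generate("GET", "/books", expensive_lookup)   # generates the result
cache.get_or_generate("GET", "/books", expensive_lookup)   # served from the cache
```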

Documenting a RESTful Service

RESTful services do not necessarily require a document to help clients discover them. Thanks to URIs, links and a uniform interface, it is extremely simple to discover RESTful services at runtime. A client only needs to know the base address of the service, and from there it can discover the service on its own by traversing the resources using links. The OPTIONS method can be used effectively in the process of discovering a service.

This does not mean that RESTful services require no documentation at all. There is no excuse for not documenting your service. You should document every resource and URI for client developers. 

I hope you now understand the basic concepts of RESTful web services. In the next article I will explain how to build a RESTful web service with Spring Boot.




