Differences in Plagiarism Detection between Persian and English Languages
Caution! Plagiarism is on the rise, and it has never caused as many problems as it does now. Information is everywhere, and almost any data can be accessed online. It is natural that some writers cannot resist the temptation and lay their hands on another person's work. Using somebody's ideas without citing them is an offense, and such authors make every effort to hide the evidence so that others do not catch them.
Authors take credit for the thoughts of others through direct copy-pasting, rewriting or paraphrasing the original text, and translational plagiarism. There are also combinations of those categories that allow unscrupulous writers to hide the truth. Cross-lingual, or translational, plagiarism is common because it is really complicated to detect. By blending their own thoughts with borrowed ideas translated from another language, writers cover their tracks and create a challenging task for those who need to evaluate the originality of writing.
Take the English and Persian language pair as an example. When passages of text are reused across the two languages, the lack of proper citation of the source-language text in the target-language text is what constitutes plagiarism. It is crucially important to have a detection system that identifies text similarities reliably and evaluates them precisely. Still, Persian is a language with specific features, and that should be taken into account.
What Makes the Persian Language So Peculiar in Terms of Plagiarism Detection?
First of all, Persian is a low-resource language, and its representation on the Internet is extremely poor. New techniques and NLP algorithms have to be developed to make up for the insufficient amount of online resources. Unfortunately, even when plagiarism is evident to a human reader, tools based on machine translation may miss it.
Secondly, Persian is an Indo-European language, yet it is written in a modified Arabic script, and Arabic itself belongs to the Semitic language family. Thus, even common pre-processing tasks become more complicated: stemming, normalization, and word recognition are all harder to perform.
Thirdly, Persian and English are distant languages with different scripts, so comparing character n-grams, a typical technique in cross-lingual studies of related languages, does not work for this pair.
Fourthly, a combination of translation and paraphrasing can be challenging to detect. The writer may summarize the text, merge sentences, split the ideas, or do careful paraphrasing of the sentences in the target language.
By replacing words with synonyms and varying the sentence structure, authors manage to embed passages taken from Persian (Farsi) and Arabic sources in a way that language-independent tools cannot trace.
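The third point above, about character n-grams, can be illustrated with a minimal sketch. The function names and the sample sentences here are hypothetical and chosen only for illustration; this is not the method of any particular detection tool. The sketch compares the sets of character trigrams of two texts with the Jaccard index.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Illustrative sentences (assumptions, not taken from any corpus).
english_source = "plagiarism detection is difficult"
english_copy = "plagiarism detection is very difficult"
persian_translation = "تشخیص سرقت ادبی دشوار است"

same_language = jaccard(char_ngrams(english_source), char_ngrams(english_copy))
cross_language = jaccard(char_ngrams(english_source), char_ngrams(persian_translation))

print(f"English vs. English copy: {same_language:.2f}")
# The two scripts share no letters, so no trigram can match.
print(f"English vs. Persian:      {cross_language:.2f}")
```

The near-copy in the same language scores high, while the Persian translation scores zero even though it carries the same meaning. That is why cross-language detection for this pair has to rely on translation or semantic comparison rather than on surface characters.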
The Persian and Arabic character sets overlap to a large extent, but there are also important differences.
- Persian text is complicated to process because Persian and Arabic character codes are mixed in it. Persian has its own dedicated Unicode characters, yet texts often contain the Arabic code points for letters that look the same. For example, the letter Yeh may be encoded as Persian U+06CC or Arabic U+064A, and the letter Kaf as Persian U+06A9 or Arabic U+0643.
- In addition, the Persian language has a peculiar word-internal boundary. It is supposed to be written with a pseudo-space, the zero-width non-joiner (U+200C), but typists may either leave it out or type an ordinary white space instead. Because the pseudo-space is optional in practice, the same word can appear in several forms, which makes processing Persian sentences rather confusing.
Thus, these issues require careful pre-processing: normalizing the text so that letters are unified and the zero-width non-joiner is used consistently.
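A minimal sketch of such normalization is shown below. It is an assumption-laden illustration and not a complete normalizer: the character map covers only Yeh, Kaf, and the digits, and the pseudo-space rules handle only the verbal prefixes "mi-" / "nemi-" and the plural suffix "-ha" / "-haye". The names `CHAR_MAP`, `PREFIX_RE`, `SUFFIX_RE`, and `normalize_persian` are hypothetical.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, the Persian pseudo-space

# Map Arabic code points to their Persian counterparts.
CHAR_MAP = {
    0x064A: 0x06CC,  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    0x0643: 0x06A9,  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}
# Arabic-Indic digits (U+0660..U+0669) -> Persian digits (U+06F0..U+06F9).
CHAR_MAP.update({0x0660 + i: 0x06F0 + i for i in range(10)})

# Simplified rules (assumption): a space after the prefix "mi"/"nemi" and a
# space before the suffix "ha"/"haye" should have been a pseudo-space.
PREFIX_RE = re.compile(r"(?<!\S)(\u0646?\u0645\u06cc) (?=\S)")
SUFFIX_RE = re.compile(r" (\u0647\u0627\u06cc?)(?!\S)")


def normalize_persian(text):
    """Unify letters and digits, then make pseudo-space usage consistent."""
    text = text.translate(CHAR_MAP)
    text = re.sub(ZWNJ + "+", ZWNJ, text)      # collapse repeated pseudo-spaces
    text = PREFIX_RE.sub("\\1" + ZWNJ, text)   # "mi ravam"  -> "mi<ZWNJ>ravam"
    text = SUFFIX_RE.sub(ZWNJ + "\\1", text)   # "ketab ha"  -> "ketab<ZWNJ>ha"
    return text


# "mi ravam" typed with an Arabic Yeh and an ordinary space.
typed = "\u0645\u064a \u0631\u0648\u0645"
print(normalize_persian(typed) == "\u0645\u06cc\u200c\u0631\u0648\u0645")
```

After this step, two texts that differ only in keyboard layout or spacing habits produce the same character sequence, so later comparison steps are not misled by encoding differences.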
Asghari et al. proposed a cross-language method of plagiarism detection for papers in Persian and English. They developed a dedicated corpus for bilingual plagiarism search, where plagiarism in a Persian document is detected by finding the relevant source in English. In general, a high-quality plagiarism search tool has to pre-process Persian texts with the corresponding optimizations:
- Normalization of the text. Persian authors write and type in different ways, which means the encoding differs as well. During processing, the input texts have to be converted to a standard form: numerals, Arabic letters, and other variant characters are changed into Persian characters.
- Removing stop words. These are the most frequent words that are common to all texts, such as conjunctions, prepositions, and relative pronouns; punctuation marks are usually removed at this step as well.
- Stemming. Morphological analysis in Persian differs for verbs and nouns, which adds to the complexity of plagiarism search. Stemming means stripping word endings and affixes.
- Replacing synonyms. When borrowing someone's ideas, a plagiarizer removes parts of a phrase or inserts some fresh ideas along with paraphrasing. This step requires an algorithm that checks the possible synonyms of every separate word.
- Tokenization at the word level.
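The steps listed above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stop-word list, the suffix list, and the synonym table are tiny hypothetical samples, the stemmer merely strips two suffixes, and the input is assumed to be normalized already. It is not the procedure of Asghari et al. or of any specific tool.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner (pseudo-space)

# Tiny hypothetical resources; real systems use much larger lists.
STOP_WORDS = {"و", "در", "به", "از", "را", "است"}
SUFFIXES = ("ها", "تر")            # plural and comparative endings
SYNONYMS = {"دشوار": "سخت"}        # variant -> canonical form


def tokenize(text):
    """Split into word tokens, keeping pseudo-space-joined parts together."""
    return re.findall(r"\w+(?:\u200c\w+)*", text)


def stem(token):
    """Strip one known suffix (and the pseudo-space before it), if any."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)].rstrip(ZWNJ)
    return token


def preprocess(text):
    """Tokenize, drop stop words, stem, and map synonyms to one form."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    stems = [stem(t) for t in tokens]
    return [SYNONYMS.get(s, s) for s in stems]


# "Reading texts is difficult" (illustrative sentence).
sentence = "خواندن متن" + ZWNJ + "ها دشوار است"
print(preprocess(sentence))
```

Mapping every synonym to one canonical form is a design choice made here for brevity; checking all synonyms of each word, as described in the list, would compare sets of alternatives instead of single forms.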
The lack of sufficient Persian corpora still makes efficient automatic plagiarism detection challenging. At the same time, the dramatic increase in the volume of electronic resources in Persian, as well as their accessibility, is making plagiarism a burning problem in the scientific and research community. Exact rewriting and plagiarizing can be traced only on the basis of both semantic and structural analysis. Few automatic plagiarism detection systems work effectively with the Persian language because it is not adequately supported in them, and that requires deliberate attention.
PlagiarismSearch.com is a checker that handles plagiarism detection tasks in the Persian language. Taking the uniqueness of all papers seriously, it ensures consistent checking of originality in every written work. The tool processes texts in Persian smoothly and is upgraded consistently to keep detection accurate.
Our experience of detecting plagiarism in Persian (Farsi) texts, as well as in texts in Arabic, Hebrew, Kurdish, Urdu, and other languages that use right-to-left scripts, enables us to claim that we do it in a manner that is effective and convenient for the users. PlagiarismSearch.com takes into consideration the specific features of each language, even those with a sophisticated linguistic structure. It identifies potential similarities in documents with a reliable algorithm that compares texts at a number of levels, which helps to avoid inconsistencies in checking. PlagiarismSearch.com is capable of identifying both direct and indirect copying, including substitution of words with synonyms and reordering of sentences from the original text. Its advantages over other plagiarism detection tools include ease of use and convenience.
NB: Ticking the option "Use right direction of the text" switches the report settings to right-to-left scripts. The change applies both to the pages of the main report and to the PDF report (HTML + PDF).