2
Department of Computer Science and Engineering,
Iran University of Science and Technology, Tehran, Iran
analoui@iust.ac.ir
Abstract: A large number of articles and papers in various languages are available owing to the expansion of information technology and information retrieval. Some authors translate these papers from their original languages into other languages and then claim them as their own. In this paper, a method for detecting exactly translated papers from English to Persian is proposed. Document classification is performed by the corner classification neural network (CC4) method to limit the set of investigated documents before the main process commences. Bi-lingual text processing requires language unification, which is performed with a bi-lingual dictionary. After language unification, the suspicious text fragments are rephrased according to the original language; the rephrasing resembles a crystallization process. The similarity is computed from the positional distance between equivalent words in English and Persian; in effect, the paragraphs of a suspicious text are probed in the original text. The proposed system is tested under two conditions: first with rephrasing and the adopted distance-based similarity, and second with the traditional method. Comparing the outcomes of the two experiments shows that the proposed method discriminates well.

Keywords: plagiarism detection, document classification, bi-lingual text processing

1. Introduction and motivation

At first, there are several questions that will be discussed:
i. What is research?
ii. What is plagiarism?
iii. How is an essay criticized?
iv. What is scientific misconduct?
These and similar questions are ambiguous: they have no clear answer, or their scopes overlap. These terms are defined in Table 1. Some of the paradigms mentioned belong to the domain of academic dishonesty and some do not. As mentioned above, these fields have overlapping scopes, and the borderline between correct and incorrect work must be made clear. Plagiarism is prevalent because of the growth of digital information, especially text, on the internet and other devices. The simple definition of plagiarism is appropriating another author's writing or compilation. Plagiarism may be done by manipulating a resource, such as text, film, speech and other literary compilations or artifacts, and by manner, for which Maurer [1] listed four broad categories of plagiarism as follows:
i. Accidental: due to lack of knowledge about plagiarism.
ii. Unintentional: probably the initiation of the same idea at the same time.
iii. Intentional: a deliberate act of copying someone's work without any credit or reference.
iv. Self-plagiarism: republishing one's own published work.
Jalalian and Latiff [2] believe that the first two cases occur in non-English countries because authors are not competent enough in English to avoid plagiarism. They suggest that plagiarism should be avoided by a combination of measures such as explicit warnings, plagiarism detection software, dissemination of knowledge, and improvement of academic and writing skills.

In this paper we discuss text plagiarism, which can occur within one language (monolingual) or across languages (cross-lingual). In the first case, a person copies, rewrites or rephrases another's text in the same language, and plagiarism can be detected by traditional document similarity algorithms. In the second case, the plagiarist translates the text from a source language into a target language, and detection needs extra knowledge bases such as a dictionary, a wordnet and a word correlation table. Plagiarism detection is time consuming because many large text corpora must be compared with the suspected document, though most of them are irrelevant. If we eliminate the irrelevant documents from the comparison corpora, we save a considerable amount of time. This is realized by classification methods that will be explained later.

The paper is organized as follows: the next section reviews related work, and document classification is discussed in Section 3. Section 4 explains the proposed approach; the datasets and experimental results are then analyzed.
126 (IJCNS) International Journal of Computer and Network Security,
Vol. 2, No. 6, June 2010
method. There are three main tasks in the proposed system:
i. Document classification: as explained in the previous section, classification determines which documents must be processed.
ii. Document representation (translation): documents in different languages must be unified, as illustrated in the next section.
iii. Similarity calculation: this part of the system uses a word-distance-based method.

[Figure: block diagram of the proposed system. Box labels: Persian document, English documents, Classifier (twice), Sorted documents, Translate, Dictionary, Similarity evaluating, Rate of similarity.]

4.2. Similarity evaluating

As mentioned before, each vector of English words is changed into a jagged matrix of Persian words. Traditional vector space methods such as the Euclidean, cosine, Dice and Jaccard similarity measures [15], which use the inner product of vectors, cannot be used for the similarity calculation, because one side of the problem is a vector and the other side is a matrix. Let t1, t2, ..., tn be the fragments of a suspicious text T in Persian, and let w1, w2, ..., wm be the collection W of original fragments in English. Because of the structural differences between the two languages, the smallest piece of text that must be processed is a paragraph. Sentence length differs between English and Persian [16], and one English sentence may be rendered as two or more sentences in translation. Furthermore, structural differences exist: for example, the verb is placed in the middle of an English sentence but at the end of a Persian sentence. Moreover, sentence boundary detection is an extra task that can then be skipped.

First, the jagged matrix TW is created from the meanings of the words in W, as in Figure 3. The main aim is to compute the similarity between T and TW. For this purpose, each word in T is probed in the TW matrix, and a vector (ind_i) of the indices of the matching words is created, as in (1).
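The jagged matrix TW and the index vector described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy dictionary, the transliterated Persian words, and the function names `build_jagged_matrix` and `index_vector` are hypothetical, and marking unmatched words with -1 is a convention chosen here.

```python
from typing import Dict, List


def build_jagged_matrix(english_words: List[str],
                        dictionary: Dict[str, List[str]]) -> List[List[str]]:
    """Row j holds every Persian equivalent of the j-th English word.

    Rows differ in length, which is why the matrix is jagged.
    """
    return [dictionary.get(word.lower(), []) for word in english_words]


def index_vector(persian_words: List[str], tw: List[List[str]]) -> List[int]:
    """For each word of the suspicious fragment T, return the index of the
    first row of TW containing it, or -1 when no row matches
    (the -1 marker is an assumption of this sketch)."""
    result = []
    for word in persian_words:
        match = -1
        for row_index, equivalents in enumerate(tw):
            if word in equivalents:
                match = row_index
                break
        result.append(match)
    return result


# Hypothetical toy dictionary with transliterated Persian equivalents.
toy_dictionary = {
    "book": ["ketab", "daftar"],
    "good": ["khub", "nik", "behtar"],
    "is": ["ast"],
}

W = ["good", "book", "is"]           # original English fragment
T = ["ketab", "khub", "ast", "sib"]  # suspicious Persian fragment

TW = build_jagged_matrix(W, toy_dictionary)
ind = index_vector(T, TW)
print(TW)
print(ind)
```

In this toy case the Persian word order differs from the English one, yet the matched indices stay close together, which is the property the distance-based similarity relies on.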
[Figure 3: jagged matrix TW. Row $W_j$ lists the Persian equivalents $p_j^1, p_j^2, \ldots$ of the $j$-th English word, and the rows differ in length: in the figure, $W_1$ has 3 entries, $W_2$ has 2, $W_3$ has 5, $W_j$ has 7 and $W_m$ has 4.]

$$L(D_1, D_2) = \sum_i \left| d_{1i} - d_{2i} \right| \qquad (3)$$

If most of the words of a fragment in T are neighbors in CS, it is most probable that the two fragments are the same. To compute the amount of similarity we use relation (4).
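As a rough illustration of the two ideas above, the following Python sketch computes relation (3), read here as a sum of absolute coordinate-wise differences, and a simple neighbor ratio over a vector of matched indices. The function names, the `max_gap` parameter and the -1 marker for unmatched words are assumptions of this sketch; in particular, `neighbor_ratio` is only a stand-in for the neighborhood intuition and is not the paper's relation (4).

```python
from typing import Sequence


def l1_distance(d1: Sequence[float], d2: Sequence[float]) -> float:
    """Relation (3), read as the sum of absolute coordinate-wise differences."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(d1, d2))


def neighbor_ratio(indices: Sequence[int], max_gap: int = 2) -> float:
    """Share of adjacent positions whose matched indices lie within max_gap.

    Illustrative stand-in only; NOT the paper's relation (4).
    Unmatched words are assumed to be marked with -1 and never count.
    """
    pairs = list(zip(indices, indices[1:]))
    if not pairs:
        return 0.0
    close = sum(1 for a, b in pairs
                if a >= 0 and b >= 0 and abs(a - b) <= max_gap)
    return close / len(pairs)


print(l1_distance([1, 4, 2], [2, 1, 2]))
print(neighbor_ratio([1, 0, 2, -1]))   # matched words stay close together
print(neighbor_ratio([0, 9, 3, 14]))   # matched words are scattered
```

A fragment whose matched indices cluster together scores high, while a fragment whose matches are scattered across the original text scores low.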
[Figure 4. Effect of distance on similarity with sl = 15 and p = 8 (horizontal axis: distance, 1 to 49).]

It matters how far the neighboring words of the original paragraph are from each other in the suspicious paragraph. Suppose tk and tk+1 are adjacent in T, and wi and wi+d are their equivalents in W. If d, the distance between the two words, is small, we can consider wi and wi+d to lie in the same sentence. Generalizing this fact, the W and T fragments are equal if most of the equivalents of the W words are neighbors in T. Conversely, we restate the words of T in CS by the reconstruction method illustrated earlier. If the contents of CS come from different parts of the text or from various W, no unique match can be found for T. The total similarity between two documents with T and W fragments is computed by relation (5), where Pdoc and Edoc are the Persian document and the corresponding English document, respectively.

$$\mathrm{similarity}(Pdoc, Edoc) = \frac{\sum_i \sum_j \mathrm{sim}(T_i, W_j)}{\min(|Pdoc|, |Edoc|)} \qquad (5)$$

5. Experimental result

Gathering a dataset is the main and most time-consuming challenge in bi-lingual research, especially for beginners in the domain. To evaluate the proposed approach we used a parallel corpus of 200 documents in Persian and English that were collected and translated from the internet and then converted into .txt format. The corpus contains English documents and their exact Persian translations from various genres such as social short

[...] and all English documents of the same class are extracted for use in the similarity computation. For each Persian document, the most similar document and its similarity rate are determined. We expect the similarity coefficient to be close to 0 for non-parallel documents and close to 1 for the 200 parallel documents. For the final result, the mean of all similarities in the parallel and non-parallel groups is calculated, as Table 2 shows.

Table 2: Mean similarities in parallel and non-parallel documents with sl = 15 and p = 8

               Tf-based method   Proposed method
Parallel       0.76              0.81
Non-parallel   0.35              0.09

The first column of Table 2 gives the results of the traditional method, in which the similarity of two documents is computed from the occurrences of one document's words in the other. In other words, this is a term-frequency-based method; rephrasing is not performed in this case. The tf-based method matches parallel documents with high precision, but its error rate on non-parallel documents is excessive (a mean similarity of 0.35 where a value near 0 is expected). The second column gives the results of the proposed method, in which term frequency has no influence on similarity and similarity is computed from distance. As the results show, the similarity rates of the two groups are clearly separated (0.81 for parallel against 0.09 for non-parallel documents).

6. Conclusion

Plagiarism is a challenging problem in research and academic conduct. To distinguish the plagiarist from the researcher, automatic detection of plagiarism is necessary. In this paper we developed a system for detecting exact translations from English to Persian. In bi-lingual plagiarism the system deals with two different languages, and each word in one language has several equivalents in the other, so selecting a suitable meaning is essential. In the proposed method this is realized by a rephrasing task inspired by crystallization. Structural
differences between the investigated languages are another challenge in bi-lingual tasks; this problem is solved by using a distance-based similarity measure. The outcomes obtained are very promising for detecting exactly translated texts. Using all valid words in the similarity computation is one of the best properties of the proposed method.

7. References

[1] H. Maurer, F. Kappe, B. Zaka, 'Plagiarism - a survey', Journal of Universal Computer Science, 2006, 12, (8), pp. 1050-1084
[2] M. Jalalian, L. Latiff, 'Medical researchers in non-English countries and concerns about unintentional plagiarism', Journal of Medical Ethics and History of Medicine, 2009, 2, (2), pp. 1-2
[3] H.H. Huang, H.C. Yang, Y.H. Kuo, 'A Sense Based Similarity Measure for Cross-Lingual Documents', Proc. Eighth International Conference on Intelligent Systems Design and Applications (ISDA'08), 2008
[4] O. Uzuner, B. Katz, T. Nahnsen, 'Using syntactic information to identify plagiarism', Proc. 2nd Workshop on Building Educational Applications Using NLP, 2005, pp. 37-44
[5] A. Barron-Cedeno, P. Rosso, D. Pinto, A. Juan, 'On cross-lingual plagiarism analysis using a statistical model', Proc. of PAN-08, 2008
[6] C.H. Lee, C.H. Wu, H.C. Yang, 'A Platform Framework for Cross-Lingual Text Relatedness Evaluation and Plagiarism Detection', Proc. 3rd International Conference on Innovative Computing Information and Control (ICICIC'08), 2008, pp. 303-307
[7] A. Selamat, H.H. Ismail, 'Finding English and translated Arabic documents similarities using GHSOM', 2008, pp. 460-465
[8] N. Gustafson, M.S. Pera, Y.K. Ng, 'Nowhere to hide: Finding plagiarized documents based on sentence similarity', Proc. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT'08), 2008, pp. 690-696
[9] D. Zhang, W.S. Lee, 'Question classification using support vector machines', Proc. 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003
[10] C. Enhong, Z. Zhenya, A. Kazuyuki, W. Xu-fa, 'An extended corner classification neural network based document classification approach', Journal of Software, 2002, 13, (5), pp. 871-878
[11] M. Mohammadi, H. Alizadeh, B. Minaei-Bidgoli, 'Neural Network Ensembles Using Clustering Ensemble and Genetic Algorithm', Proc. Third International Conference on Convergence and Hybrid Information Technology (ICCIT'08), 2008, pp. 761-766
[12] M. Mohammadi, B. Minaei-Bidgoli, 'Using CC4 neural networks for Persian document classification', Proc. 2nd Iran Data Mining Conference (IDMC008), Tehran, 2008
[13] K. Sheykh Esmaili, A. Rostami, 'A list of Persian stopwords', Technical Report No. 2006-03, Semantic Web Research Laboratory, Sharif University of Technology, Tehran, Iran, 2006
[14] G. Salton, A. Wong, C.S. Yang, 'A vector space model for information retrieval', Journal of the American Society for Information Science, 1975, 18, (11), pp. 613-620
[15] H.F. Ma, Q. He, Z.Z. Shi, 'Geodesic distance based approach for sentence similarity computation', Proc. 2008 International Conference on Machine Learning and Cybernetics, 2008, pp. 2552-2557
[16] C.D. Manning, H. Schütze, 'Foundations of statistical natural language processing', MIT Press, 2002
[17] E. Greengrass, 'Information retrieval: A survey', University of Maryland, Baltimore County, 2000