
(IJCNS) International Journal of Computer and Network Security, Vol. 2, No. 6, June 2010

Using Word Distance Based Measurement for Cross-lingual Plagiarism Detection

Moslem Mohammadi¹, Morteza Analouei²

¹ Payam Nour University of Miandoab, Miandoab, Iran
Mo_mohammadi@comp.iust.ac.ir

² Department of Computer Science and Engineering,
Iran University of Science and Technology, Tehran, Iran
analoui@iust.ac.ir

Abstract: A large number of articles and papers in various languages are available due to the expansion of information technology and information retrieval. Some authors translate these papers from the original language into another language and then claim them as their own work. In this paper, a method for detecting papers translated verbatim from English to Persian is proposed. Before the main process begins, document classification is performed with a corner classification (CC4) neural network to limit the set of documents that must be investigated. For bi-lingual text processing, language unification is necessary; it is performed with a bi-lingual dictionary. After language unification, the suspicious text fragments are rephrased according to the original language. Rephrasing is performed like a crystallization process. Similarity is then computed based on the positional distance of equivalent words in English and Persian; in effect, the paragraphs of a suspicious text are probed in the original text. The proposed system is tested under two conditions, first with rephrasing and the adapted distance-based similarity, and second with the traditional method. Comparing the outcomes of the two experiments shows that the discriminability of the proposed method is good.

Keywords: plagiarism detection, document classification, bi-lingual text processing

1. Introduction and motivation

At first, there are several questions that should be discussed:
i. What is research?
ii. What is plagiarism?
iii. How is an essay criticized?
iv. What is scientific misconduct?
These and similar questions are ambiguous: they have no single clear answer and their scopes overlap. The relevant terms are defined in Table 1.
Some of the concepts mentioned fall within the domain of academic dishonesty and some do not. As noted above, these fields have overlapping scopes, and the border between correct and incorrect practice must be made clear. Plagiarism is prevalent due to the rapid growth of digital information, especially text, on the internet and other media. The simplest definition of plagiarism is appropriating another author's writing or compilation. Plagiarism may involve any kind of resource, such as text, film, speech and other literary compilations or artifacts, and it may be committed in different manners; Maurer [1] lists four broad categories of plagiarism:
i. Accidental: due to lack of knowledge about plagiarism.
ii. Unintentional: probably the initiation of the same idea at the same time.
iii. Intentional: a deliberate act of copying someone's work without any credit or reference.
iv. Self-plagiarism: republishing one's own previously published work.
Jalalian et al. [2] believe that the first two cases occur in non-English-speaking countries because authors are not competent enough in English to avoid plagiarism. They suggest that plagiarism should be prevented by a combination of measures such as explicit warnings, plagiarism detection software, disseminating knowledge, and improving academic and writing skills.
In this paper we discuss text plagiarism, which can occur within the same language (monolingual) or across languages (cross-lingual). In the first case, a person copies, rewrites or rephrases another's text in the same language, and plagiarism detection can in such cases be performed with traditional document similarity algorithms. In the second case, a plagiarizing author translates from a source language to a target language; for detection, extra knowledge bases such as a dictionary, a wordnet or a word correlation table are needed. Plagiarism detection is time consuming because there are many large text corpora that must be compared with the suspected document, although most of them are irrelevant. For this reason, if we eliminate irrelevant documents from the comparison corpora, we save a considerable amount of time. This is achieved by the classification methods explained later.
The paper is organized as follows: the next section reviews related work, and document classification is discussed in Section 3. Section 4 explains the proposed approach; thereafter the datasets and experimental results are analyzed.

Table 1: Plagiarism-related concepts

1. Research: The search for knowledge, or any systematic investigation to establish facts.
2. Plagiarism: Use or close imitation of the language and thoughts of another author and the representation of them as one's own original work.
3. Criticism: The judgment (using analysis and evaluation) of the merits and faults of the actions or work of another individual.
4. Assemblage: A work built primarily and explicitly from existing texts in order to solve a writing or communication problem in a new context.
5. Essay mill: An essay mill (or paper mill) is a ghostwriting service that sells essays and other homework writing to university and college students.
6. Ghostwriter: A professional writer who is paid to write books, articles, stories, reports, or other texts that are officially credited to another person.
7. Scientific misconduct: The violation of the standard codes of scholarly conduct and ethical behavior in professional scientific research (fabrication, plagiarism, ghostwriting).

2. Related works

The roots of plagiarism detection lie in document similarity. Huang et al. [3] proposed a sense-based similarity measure for cross-lingual documents that uses senses for document representation and adapts fuzzy set functions for computing resemblance. Uzuner et al. [4] identify plagiarism when works are paraphrased by using syntactic information related to the creative aspects of writing, such as sentence-initial and sentence-final phrase structure, the semantic class of verbs and the syntactic class of verbs. Barron-Cedeno et al. [5] proposed cross-lingual plagiarism analysis with a probabilistic method which estimates a bilingual statistical dictionary, on the basis of the IBM-1 model, from English and Spanish plagiarized examples; their proposal calculates the probabilistic association between two terms in two different languages.
In [6], an approach based on a support vector machine (SVM) classifier was proposed to determine the similarity between English and Chinese text; subsequently, the semantic similarity among texts is measured by means of a language-neutral clustering technique based on Self-Organizing Maps (SOM). Selamat et al. [7] used a Growing Hierarchical Self-Organizing Map (GHSOM) to detect English documents translated into Arabic. Gustafson et al. [8] proposed a method that relies on pre-computed word-correlation factors for determining sentence-to-sentence similarity and that handles plagiarism techniques based on substitution, addition, and deletion of words in sentences; the degree of resemblance of any two documents is then computed from the sentence similarities to detect the plagiarized copy, and a visual representation of matching sentences is generated.

3. Document classification

Document classification has an important place in text mining tasks because it limits the amount of text that must be processed. There are several classification methods, such as k-Nearest Neighbors (KNN), Naïve Bayes, Support Vector Machines (SVM) [9], the Corner Classification neural network (CC4) [10] and ensemble classifiers [11]. In this system, document classification is performed by CC4, which is a three-layered feed-forward neural network [12], as shown in Figure 1. The X vector is a dictionary that includes all the words of the documents; in other words, it is the feature set of the documents to be classified. To save time, after stopwords are eliminated, we only use the 20 most frequent words of each document for classification. Each entry of the X vector is 1 if its corresponding word is among the 20 most frequent words of the document, and 0 otherwise. The Y vector represents the class tags, and the value k is the number of classes, which in our system is 3. There are distinct classifiers for each language.
The non-iterative training phase is one of the important benefits of these networks. Incremental training is another benefit, which makes them suitable for classifying very large document collections; in fact, adding new training data to an existing CC4 network is very easy. Furthermore, when the documents are of roughly the same size, the CC4 neural network is an effective document classification algorithm.

Figure 1. CC4 structure
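A minimal sketch of the X-vector encoding described in this section is given below, assuming a simple word-list representation of a document; the function and variable names are illustrative only and are not part of the original system.

```python
from collections import Counter

def build_x_vector(doc_words, vocabulary, stopwords, top_k=20):
    """Binary X vector: an entry is 1 if the vocabulary word is among the
    document's top_k most frequent non-stopword words, otherwise 0."""
    counts = Counter(w for w in doc_words if w not in stopwords)
    frequent = {w for w, _ in counts.most_common(top_k)}
    return [1 if w in frequent else 0 for w in vocabulary]

# Toy usage: a tiny vocabulary and a short "network"-class document.
vocab = ["packet", "router", "network", "story", "king", "mining"]
doc = "the router forwards a packet across the network".split()
x = build_x_vector(doc, vocab, stopwords={"the", "a", "across"})
# x == [1, 1, 1, 0, 0, 0]; a separate CC4 classifier per language
# would then be trained on such vectors together with their class labels.
```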
4. Proposed method

Bi-lingual plagiarism can be regarded as a translation problem. Clumsy translators or plagiarists use a literal translation; in this case the sentence lengths in the two texts are very probably equal, although the innate features and structures of the two languages differ, so automatic detection of such awkward translation seems easy. A professional plagiarist, however, can change the order of sentences, split a long sentence into two or more sentences, or do the reverse. For these reasons, and for other motivations discussed below, we use a distance-based method. Figure 2 shows the schematic view of the proposed method.

There are three main tasks in the proposed system:
i. Document classification: as explained in the previous section, classification determines which documents must be processed.
ii. Document representation (translation): documents in different languages must be unified; this is described in the next subsection.
iii. Similarity calculation: this part of the system uses a word-distance-based method.

[Figure 2 depicts the pipeline: Persian and English documents are classified by separate classifiers; the sorted documents are translated using the bilingual dictionary; and similarity evaluation produces the rate of similarity.]
Figure 2. Architecture of the proposed system
4.1. Document representation

In this research we deal with two kinds of documents, in Persian and in English, and the processing phase for similarity detection is carried out in Persian. After eliminating the non-verbal stopwords [13], the Persian documents are represented in the vector space model [14]; in this adapted model, each paragraph is placed in a vector. For plagiarism detection from English documents to Persian, the languages must first be unified. Since each word in the English text can have several meanings and equivalents in Persian, each vector of English words is changed into a jagged matrix of Persian words according to the bilingual dictionary. A schematic description of the details is given in Figure 3.

[Figure 3 depicts a Persian paragraph as a vector T = (t1, t2, ..., tn), and an English paragraph as the list of English words W = (W1, ..., Wm) together with its Persian-equivalence matrix TW, in which each English word Wj is expanded into its list of Persian equivalents pj1, pj2, ...; the rows have different lengths.]
Figure 3. Document representation

4.2. Similarity evaluating

As mentioned before, each vector of English words is changed into a jagged matrix of Persian words, so the traditional vector space measures such as the Euclidean, cosine, Dice and Jaccard similarity measures [15], which use the inner product of vectors, cannot be used for the similarity calculation: one side of the problem is a vector and the other side is a matrix. Let t1, t2, ..., tn be the fragments forming a suspicious text T in Persian, and let w1, w2, ..., wm be the collection of original fragments W in English. Because of the structural differences between the investigated languages, the smallest piece of text that is processed is a paragraph.
The sentence length in English and Persian is different [16], and one English sentence may be rendered as two or more sentences during its translation. Furthermore, structural differences exist; for example, the verb is placed in the middle of an English sentence but at the end of a Persian sentence. Moreover, sentence boundary detection is an extra task that can then be ignored.
First, the jagged matrix TW is created from the meanings of the words of W, as in Figure 3. The main aim is to compute the similarity between T and TW. For this purpose, each word in T is probed in the TW matrix and a vector (ind_i), based on the indices of matching words, is created as in relation (1):

ind_i = { j | t_i ∈ TW_j }        (1)
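The construction of the jagged TW matrix (Figure 3) and of the ind vectors of relation (1) can be sketched as follows, under the assumption that the bilingual dictionary is available as a mapping from each English word to a list of its Persian equivalents; the function names and the toy data are illustrative only.

```python
def build_tw_matrix(english_fragment, bilingual_dict):
    """Jagged matrix TW: row j holds the Persian equivalents of the
    j-th English word w_j (rows may have different lengths)."""
    return [bilingual_dict.get(w, []) for w in english_fragment]

def build_ind_vectors(persian_fragment, tw_matrix):
    """Relation (1): ind_i is the set of row indices j such that the
    Persian word t_i occurs among the equivalents in row TW_j."""
    return [{j for j, row in enumerate(tw_matrix) if t in row}
            for t in persian_fragment]

# Toy usage with a hypothetical two-word English fragment.
bilingual_dict = {"book": ["ketab"], "pen": ["qalam", "khodkar"]}
tw = build_tw_matrix(["book", "pen"], bilingual_dict)   # [["ketab"], ["qalam", "khodkar"]]
ind = build_ind_vectors(["qalam", "ketab"], tw)         # [{1}, {0}]
```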
The similarity measurement is inspired by crystallization. The crystallization process consists of two major events, nucleation and crystal growth. The main idea is to build up the fragments of text like a crystal: the ind vectors correspond to the solvent and the words in them are the solute. For crystal growth we need nuclei; here the nuclei are the words in T whose ind vectors contain only one element. After defining the nuclei, the crystal growth process is performed over the T vector, as described by relation (2):

CS_i = ind_i,                           if t_i is a nucleus
       0,                               if t_i has no equivalent in W
       the nearest index to a nucleus,  otherwise        (2)

CS is the vector that resembles the T vector and contains the nearest equivalent position for each word of T. The amount of similarity of this resembled vector must then be measured. For this we use an adapted Euclidean distance [17]. The Euclidean distance of two vectors is computed as in relation (3), but in the adapted form we want the similarity to reflect the positional distance of words in the CS vector:

L(D1, D2) = ( Σ_i (d_1i − d_2i)² )^(1/2)        (3)
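Below is a sketch of the nucleation and crystal-growth step of relation (2). The paper does not specify how the "nearest index to a nucleus" is selected when a word has several candidate positions, so this sketch adopts one plausible interpretation: the candidate in ind_i that is closest to the position already assigned to the nearest nucleus word. That tie-breaking rule, the fallback when no nucleus exists, and the function names are assumptions.

```python
def build_cs_vector(ind_vectors):
    """Relation (2): nuclei are words whose ind set has exactly one
    element; words with an empty ind set keep 0 (the paper notes, but
    does not detail, special handling of zero entries); the remaining
    words take the candidate index closest to the nearest nucleus."""
    n = len(ind_vectors)
    cs = [0] * n
    nuclei = [i for i, ind in enumerate(ind_vectors) if len(ind) == 1]
    for i in nuclei:
        cs[i] = next(iter(ind_vectors[i]))
    for i, ind in enumerate(ind_vectors):
        if i in nuclei or not ind:
            continue
        if not nuclei:
            cs[i] = min(ind)   # no nucleus at all: fall back to the smallest index
            continue
        # Crystal growth: anchor on the nearest nucleus (by position in T)
        # and pick the candidate index closest to that nucleus's value.
        anchor = min(nuclei, key=lambda k: abs(k - i))
        cs[i] = min(ind, key=lambda j: abs(j - cs[anchor]))
    return cs

# Toy usage: word 0 is a nucleus at position 4, word 1 is ambiguous,
# word 2 has no equivalent in W.
print(build_cs_vector([{4}, {1, 5, 20}, set()]))   # -> [4, 5, 0]
```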



If most of the words of a text fragment of T are neighbors in CS, it is very probable that the two fragments are the same. To compute the amount of similarity we use relation (4):

sim(T, W) = Σ_{i=1}^{|T|} 1 / ( 1 + ( |CS_{i+1} − CS_i| / sl )^p )        (4)

The T and W vectors have already been described. CS_i indicates whether the word t_i of T exists in the text fragment W and, if it exists, gives the location of that word in W. sl is the average sentence length: if the distance between two words equals sl, the similarity contribution equals 0.5. p is the slope of the curve; a large value produces a sharp slope and diminishes the effect of two distant words on the similarity. Figure 4 shows the behavior of relation (4) for various values of |CS_{i+1} − CS_i| (the distance between two words). There are several details of the computation and implementation, such as the treatment of words whose CS_i value is zero, that are not explained here.

[Figure 4 is a plot of the per-word contribution to similarity (vertical axis, from 0 to 1.2) against the distance |CS_{i+1} − CS_i| (horizontal axis, from 1 to 49).]
Figure 4. Effect of distance on similarity with sl = 15 and p = 8
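Relation (4) can be sketched as follows. The reconstruction of the formula from the printed layout, and the choice to leave the fragment score unnormalized (so that relation (5) performs the normalization at the document level), are assumptions of this sketch; if the original formula also averages over |T|, a division by len(cs) would be added.

```python
def fragment_similarity(cs, sl=15, p=8):
    """Relation (4): each pair of consecutive CS entries contributes
    1 / (1 + (|CS[i+1] - CS[i]| / sl)**p); sl is the average sentence
    length (a gap of exactly sl contributes 0.5), and a large p sharpens
    the curve so that distant words contribute almost nothing."""
    total = 0.0
    for a, b in zip(cs, cs[1:]):
        total += 1.0 / (1.0 + (abs(b - a) / sl) ** p)
    return total

# Toy usage: nearly consecutive positions score high per term,
# scattered positions score close to zero.
print(fragment_similarity([3, 4, 5, 6]))      # ≈ 3.0 (three near-perfect terms)
print(fragment_similarity([3, 40, 80, 120]))  # ≈ 0.0
```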
It matters how far apart, in the suspicious paragraph, the words that are neighbors in the original paragraph end up. Suppose t_k and t_{k+1} are adjacent in T, and w_i and w_{i+d} are their equivalents in W: if d, in other words the distance between the two words, is small, we can consider w_i and w_{i+d} to belong to the same sentence. Generalizing this fact, the fragments W and T are equal if most of the equivalents of the words of W are neighbors in T. On the other side of the coin, we restate the words of T by the reconstruction method illustrated by CS; if the contents of CS come from different parts of the text, or from various fragments of W, we cannot find a unique counterpart for T. The total similarity between two documents consisting of the T and W fragments is computed by relation (5), where Pdoc and Edoc are the Persian document and the corresponding English document respectively:

similarity(Pdoc, Edoc) = Σ_i Σ_j sim(T_i, W_j) / min(|Pdoc|, |Edoc|)        (5)
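A corresponding sketch of relation (5), reusing the helper functions sketched earlier; interpreting |Pdoc| and |Edoc| as the total word counts of the two documents is an assumption, since the paper does not define these sizes explicitly.

```python
def document_similarity(persian_fragments, english_fragments,
                        bilingual_dict, sl=15, p=8):
    """Relation (5): sum the fragment-level similarities over all fragment
    pairs and normalise by min(|Pdoc|, |Edoc|), taken here as the smaller
    total word count of the two documents (an assumption of this sketch)."""
    total = 0.0
    for t in persian_fragments:          # paragraphs of the Persian document
        for w in english_fragments:      # paragraphs of the English document
            tw = build_tw_matrix(w, bilingual_dict)
            ind = build_ind_vectors(t, tw)
            cs = build_cs_vector(ind)
            total += fragment_similarity(cs, sl, p)
    p_len = sum(len(t) for t in persian_fragments)
    e_len = sum(len(w) for w in english_fragments)
    return total / min(p_len, e_len)
```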
5. Experimental results

Dataset gathering is the main and most time-consuming challenge in bi-lingual research, especially for beginners in the domain. In order to evaluate the proposed approach we used a parallel corpus containing 200 documents in Persian and English that were collected and translated from the internet and then converted into .txt format. This corpus contains documents in English and their exact translations in Persian, from various genres such as social short stories and computer science texts in the scope of computer networks and text mining. Furthermore, 100 Persian and 100 English documents of the same genres, without a corresponding translation in the corpus, were added; these added documents were used for false positive evaluation.
For the classification task, we assumed three types of documents, called "story", "network" and "text".
While there is no lack of reports on measuring the semantic similarity of concepts, the evaluation of the performance of semantic similarity measures between cross-lingual documents has not yet been standardized, because there is no universally recognized benchmark [3]. However, a simple rule of thumb is that a document should be very similar to its high-quality translation. With this assumption, the evaluation criteria adopted in this paper are the rate of successful matches between the Persian documents and their parallel English documents (true positive results) and, furthermore, the maximum similarity rate of documents that have no corresponding translation in the corpus (false positive results).
Each Persian document (DP_i) is classified by the Persian classifier, and all English documents of the same class are extracted for use in the similarity computation. For each Persian document, the most similar English document and its similarity rate are determined. It is rational to expect that the similarity coefficient of unparallel documents would be close to 0 and that of the 200 parallel documents close to 1. For the final result, the means of all similarities in the parallel and unparallel groups are calculated, as Table 2 shows.

Table 2: Mean of similarities in parallel and unparallel documents with sl = 15 and p = 8

              TF-based method    Proposed method
Parallel           0.76               0.81
Unparallel         0.35               0.09

The first column of Table 2 represents the results of the traditional method, in which the similarity of two documents is computed based on the occurrences of the words of one document in the other; in other words, it is a term-frequency-based method, and in this case rephrasing is not performed. Although the TF-based method has high precision in matching parallel documents, its error rate on unparallel documents is excessive. The second column shows the results of the proposed method, in which term frequency has no influence on the similarity and the similarity is computed based on distance. As the results show, the similarity rates of the two groups are clearly separated.
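The evaluation procedure described above could be sketched roughly as follows; the classifier interface, the corpus structure and the function names are assumptions of this sketch rather than the authors' actual implementation.

```python
def evaluate(persian_docs, english_docs_by_class, classify, parallel_of,
             bilingual_dict):
    """For each Persian document: classify it, compare it only against
    English documents of the same class, keep the best match, and
    accumulate the mean similarity of the parallel and unparallel groups."""
    parallel_scores, unparallel_scores = [], []
    for dp in persian_docs:
        label = classify(dp)                       # "story", "network" or "text"
        best_score = 0.0
        for de in english_docs_by_class[label]:
            # dp.paragraphs / de.paragraphs: assumed lists of word lists.
            s = document_similarity(dp.paragraphs, de.paragraphs, bilingual_dict)
            best_score = max(best_score, s)
        if parallel_of.get(dp) is not None:
            parallel_scores.append(best_score)     # true-positive side
        else:
            unparallel_scores.append(best_score)   # false-positive side
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(parallel_scores), mean(unparallel_scores)
```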

6. Conclusion

Plagiarism is a challenging problem in the research domain and in academic conduct. To distinguish plagiarists from researchers, automatic detection of plagiarism is necessary. In this paper we developed a system for detecting exact translations from English to Persian. In bi-lingual plagiarism the system deals with two different languages, and each word in one language has several equivalents in the other, so selecting a suitable meaning is essential. In the proposed method, this is realized by a rephrasing task inspired by crystallization. Structural differences between the investigated languages are another challenge in bi-lingual tasks; this problem is addressed by using a distance-based similarity measurement. The obtained outcomes are very promising for detecting exactly translated texts. Using all valid words in the similarity computation is one of the best properties of the proposed method.
7. References

[1] H. Maurer, F. Kappe, B. Zaka, 'Plagiarism – a survey', Journal of Universal Computer Science, 2006, 12, (8), pp. 1050-1084.
[2] M. Jalalian, L. Latiff, 'Medical researchers in non-English countries and concerns about unintentional plagiarism', Journal of Medical Ethics and History of Medicine, 2009, 2, (2), pp. 1-2.
[3] H.H. Huang, H.C. Yang, Y.H. Kuo, 'A Sense Based Similarity Measure for Cross-Lingual Documents', Proc. Eighth International Conference on Intelligent Systems Design and Applications (ISDA'08), 2008.
[4] O. Uzuner, B. Katz, T. Nahnsen, 'Using syntactic information to identify plagiarism', Proc. 2nd Workshop on Building Educational Applications Using NLP, 2005, pp. 37-44.
[5] A. Barron-Cedeno, P. Rosso, D. Pinto, A. Juan, 'On cross-lingual plagiarism analysis using a statistical model', Proc. of PAN-08, 2008.
[6] C.H. Lee, C.H. Wu, H.C. Yang, 'A Platform Framework for Cross-Lingual Text Relatedness Evaluation and Plagiarism Detection', Proc. 3rd International Conference on Innovative Computing Information and Control (ICICIC'08), 2008, pp. 303-307.
[7] A. Selamat, H.H. Ismail, 'Finding English and translated Arabic documents similarities using GHSOM', 2008, pp. 460-465.
[8] N. Gustafson, M.S. Pera, Y.K. Ng, 'Nowhere to hide: Finding plagiarized documents based on sentence similarity', Proc. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT'08), 2008, pp. 690-696.
[9] D. Zhang, W.S. Lee, 'Question classification using support vector machines', Proc. 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003.
[10] C. Enhong, Z. Zhenya, A. Kazuyuki, W. Xu-fa, 'An extended corner classification neural network based document classification approach', Journal of Software, 2002, 13, (5), pp. 871-878.
[11] M. Mohammadi, H. Alizadeh, B. Minaei-Bidgoli, 'Neural Network Ensembles Using Clustering Ensemble and Genetic Algorithm', Proc. Third International Conference on Convergence and Hybrid Information Technology (ICCIT'08), 2008, pp. 761-766.
[12] M. Mohammadi, B. Minaei-Bidgoli, 'Using CC4 neural networks for Persian document classification', Proc. 2nd Iran Data Mining Conference (IDMC08), Tehran, 2008.
[13] K. Sheykh Esmaili, A. Rostami, 'A list of Persian stopwords', Technical Report No. 2006-03, Semantic Web Research Laboratory, Sharif University of Technology, Tehran, Iran, 2006.
[14] G. Salton, A. Wong, C.S. Yang, 'A vector space model for information retrieval', Journal of the American Society for Information Science, 1975, 18, (11), pp. 613-620.
[15] H.F. Ma, Q. He, Z.Z. Shi, 'Geodesic distance based approach for sentence similarity computation', Proc. 2008 International Conference on Machine Learning and Cybernetics, 2008, pp. 2552-2557.
[16] C.D. Manning, H. Schütze, 'Foundations of Statistical Natural Language Processing', MIT Press, 2002.
[17] E. Greengrass, 'Information retrieval: A survey', University of Maryland, Baltimore County, 2000.
