
Plagiarism Meets Paraphrasing:
Insights for the Next Generation in
Automatic Plagiarism Detection
Alberto Barrón-Cedeño∗†
Universitat Politècnica de Catalunya
Marta Vila∗∗†
Universitat de Barcelona
M. Antònia Martí‡
Universitat de Barcelona
Paolo Rosso§
Universitat Politècnica de València
Although paraphrasing is the linguistic mechanism underlying many plagiarism cases, little
attention has been paid to its analysis in the framework of automatic plagiarism detection.
Therefore, state-of-the-art plagiarism detectors find it difficult to detect cases of paraphrase
plagiarism. In this article, we analyze the relationship between paraphrasing and plagiarism,
paying special attention to which paraphrase phenomena underlie acts of plagiarism and which
of them are detected by plagiarism detection systems. With this aim in mind, we created the P4P
corpus, a new resource that uses a paraphrase typology to annotate a subset of the PAN-PC-10
corpus for automatic plagiarism detection. The results of the Second International Competition
on Plagiarism Detection were analyzed in the light of this annotation.
The presented experiments show that (i) more complex paraphrase phenomena and a high
density of paraphrase mechanisms make plagiarism detection more difficult, (ii) lexical
substitutions are the paraphrase mechanisms used the most when plagiarizing, and (iii) paraphrase
mechanisms tend to shorten the plagiarized text. For the first time, the paraphrase mechanisms
behind plagiarism have been analyzed, providing critical insights for the improvement of
automatic plagiarism detection systems.
∗ TALP Research Center, Jordi Girona Salgado 1-3, 08034 Barcelona, Spain. E-mail: albarron@lsi.upc.es.
∗∗ CLiC, Department of Linguistics, Gran Via 585, 08007 Barcelona, Spain. E-mail: marta.vila@ub.edu.
† Both authors contributed equally to this work.
‡ CLiC, Department of Linguistics, Gran Via 585, 08007 Barcelona, Spain. E-mail: amarti@ub.edu.
§ NLE Lab-ELiRF, Department of Information Systems and Computation, Camino de Vera s/n,
46022 Valencia, Spain. E-mail: prosso@dsic.upv.es.
Submission received: 13 March 2012; revised submission received: 17 October 2012; accepted for publication:
7 November 2012.
doi:10.1162/COLI_a_00153
© 2013 Association for Computational Linguistics
Computational Linguistics
Volume 39, Number 4
1. Introduction
Plagiarism is the re-use of someone else’s prior ideas, processes, results, or words
without explicitly acknowledging the original author and source (IEEE 2008). Although
plagiarism may occur incidentally, it is often the outcome of a conscious process.
Independently of the vocabulary or channel through which an idea is communicated, a
person who fails to provide its corresponding source is suspected of plagiarism. The
amount of text available in electronic media nowadays has caused cases of plagiarism
to increase. In the academic domain, some surveys estimate that around 30% of student
reports include plagiarism (Association of Teachers and Lecturers 2008), and a more
recent study increases this percentage to more than 40% (Comas et al. 2010). As a result,
its manual detection has become infeasible. Models for automatic plagiarism detection
are being developed as a countermeasure. Their main objective is assisting people in the
task of detecting plagiarism; as a side effect, plagiarism is discouraged.
The linguistic phenomena underlying plagiarism have barely been analyzed in the
design of these systems, which we consider to be a key issue for their improvement.
Martin (2004) identifies different kinds of plagiarism: of ideas, of references, of
authorship, word by word, and paraphrase plagiarism. In the first case, ideas, knowledge,
or theories from another person are claimed without proper citation. In plagiarism
of references and authorship, citations and entire documents are included without
any mention of their authors. Word by word plagiarism, also known as copy–paste
or verbatim copy, consists of the exact copy of a text (fragment) from a source into
the plagiarized document. Regarding paraphrase plagiarism, in order to conceal the
plagiarism act, a different form expressing the same content is often used.
Paraphrasing, generally understood as sameness of meaning between different wordings, is the
linguistic mechanism underlying many plagiarism acts and the linguistic process on
which plagiarism is based.
In this article, the relationship between plagiarism and paraphrasing, a largely
unexplored problem, is analyzed, and the potential of such a relationship
in automatic plagiarism detection is set out. We aim not only to investigate
how difficult it is for state-of-the-art plagiarism detectors to detect paraphrase cases, but
also to understand which types of paraphrases underlie plagiarism acts and which are the
most difficult to detect.
For this purpose, we created the Paraphrase for Plagiarism corpus (P4P) annotating
a portion of the PAN-PC-10 corpus for plagiarism detection (Potthast et al. 2010) on the
basis of a paraphrase typology, and we mapped the annotation results with those of the
Second International Competition on Plagiarism Detection (Pan-10 competition,
hereafter).¹ The results obtained provide critical insights for the improvement of automatic
plagiarism detection systems.
The rest of the article is structured as follows. Section 2 sets out the paraphrase
typology used in this research work. Section 3 describes the construction of the P4P
corpus. Section 4 gives an overview of the state of the art in automatic plagiarism
detection; special attention is given to the systems participating in the Pan-10
competition. Section 5 discusses our experiments and the findings derived from mapping
the P4P corpus and the Pan-10 competition results. Section 6 draws some conclusions
and offers insights for future research.
1 http://www.webis.de/research/events/pan-10.
Barrón-Cedeño et al.
Plagiarism Meets Paraphrasing
2. Paraphrase Typology
Typologies are a precise and efficient way to draw the boundaries of a certain
phenomenon, identify its different manifestations, and, in short, go into its characterization
in depth. Also, typologies constitute the basis of many corpus annotation processes,
which have their own effects on the typologies themselves: The annotation process
tests the adequacy of the typology for the analysis of the data, and allows for the
identification of new types and the revision of the existing ones. Moreover, an annotated
corpus following a typology is a powerful resource for the development and evaluation
of computational linguistics systems. In this section, after setting out a brief state of the
art on paraphrase typologies and the weaknesses they present, the typology used for
the annotation of the P4P corpus is described.
Paraphrase typologies have been addressed in different fields, including discourse
analysis, linguistics, and computational linguistics, which has given rise to typologies
that are very different in nature. Typologies coming from discourse analysis classify
paraphrases according to the reformulation mechanisms or communicative intention
behind them (Gülich 2003; Cheung 2009), but without focusing on the linguistic
nature of paraphrases themselves, which, in contrast, is our main focus of interest.
From the perspective of linguistic analysis, some typologies are strongly tied to
concrete theoretical frameworks, as is the case of Meaning–Text Theory (Mel’čuk 1992;
Milićević 2007). In this field, typologies of transformations and diathesis alternations
can be considered indirect approaches to paraphrasing in the sense that they deal
with equivalent expressions (Chomsky 1957; Harris 1957; Levin 1993). They do
not cover paraphrasing as a whole, however, but focus on lexical and syntactic
phenomena. Other typologies come from linguistics-related fields like editing (Faigley
and Witte 1981), which is interesting in our analysis because it is strongly tied to
paraphrasing.
A number of paraphrase typologies have been built from the perspective of
computational linguistics. Some of these typologies are simple lists of paraphrase types
useful for a specific system or application, or the most common types found in a corpus.
They are specific-work oriented and far from being comprehensive: Barzilay, McKeown,
and Elhadad (1999), Dorr et al. (2004), and Dutrey et al. (2011), among others. Other
typologies classify paraphrases in a very generic way, setting out only two or three
types (Barzilay 2003; Shimohata 2004); these classifications do not reach the category of
typologies sensu stricto. Finally, there are more comprehensive typologies, such as the
ones by Dras (1999), Fujita (2005), and Bhagat (2009). They usually take the shape of very
fine-grained lists of paraphrase types grouped into bigger classes following different
criteria. They generally focus on these lists of specific paraphrase mechanisms, which
will always be endless.
Our paraphrase typology is based on the paraphrase concept defined in
Recasens and Vila (2010) and Vila, Martí, and Rodríguez (2011), and consists of an
upgraded version of the one presented in the latter. Our paraphrase concept is based
on the idea that paraphrases should have the same or an equivalent propositional
content, that is, the same core meaning. This conception opens the door to
paraphrases sometimes disregarded in the literature, mainly focused on lexical and syntactic
mechanisms.
The paraphrase typology attempts to capture the general linguistic phenomena
of paraphrasing, rather than presenting a long, fine-grained, and inevitably incom-
plete list of concrete mechanisms. In this sense, it also attempts to be comprehen-
sive of paraphrasing as a whole: It was contrasted with, and sometimes inspired by,
[...] grouped in classes according to the nature of such trigger linguistic mechanism: (i) those
types where the paraphrase phenomenon arises at the morpholexicon level, (ii) those
that are the result of a different structural organization, and (iii) those types arising
at the semantic level. Classes inform about the origin of the paraphrase phenomenon,
but such paraphrase phenomenon can involve changes in other parts of the sentence.
For instance, a morpholexicon-based change (derivational) like the one in Example (1),
where the nominal form failure is exchanged for the verb failed, has obvious syntactic
implications; the paraphrase phenomenon, however, is triggered by the morpholexical
change.⁵ A structure-based change (diathesis) like the one in Example (2) involves
an inflectional change in heard/hear among others, but the trigger change is syntactic.
Finally, paraphrases in semantics are based on a different distribution of semantic
content across the lexical units involving multiple and varied formal changes, as in
Example (3). Miscellaneous changes comprise types not directly related to one single
class. Finally, the subclasses follow the classical organization in formal linguistic levels
from morphology to discourse and simply establish an intermediate grouping between
some classes and their types.
(1) a. the comical failure of the headmaster’s attempt at a “Parents’ Committee”
    b. how the headmaster failed at the attempt at a “Parent’s Committee”
(2) a. the report of a gun on shore was still heard at intervals
    b. We were able to hear the report of a gun on shore intermittently
(3) a. I’ve got a hunch that we’re not through with that game yet
    b. I’m guessing we won’t be done for some time
Although the types in our typology are presented in isolation, they can be combined:
in Example (4), changes of order of the subject (β) and the adverb (γ), and two
same-polarity substitutions (said/answered [α] and cautiously/carefully [γ]) can be observed. A
difference between cases such as Example (4) and, for example, Example (1) should
be noted: In Example (1), the derivational change implies the syntactic one, so only
one single paraphrase phenomenon is considered; in Example (4), same-polarity
substitutions and changes of order are independent and can take place in isolation, so four
paraphrase phenomena are considered.
(4) a. “Yes,” [said]α [I]β [cautiously]γ
    b. “Yes,” [I]β [carefully]γ [answered]α
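The count of four phenomena in Example (4) can be reproduced with a small sketch. It assumes that the aligned units are already labeled (α, β, γ), that the verb unit serves as a fixed reference point for detecting changes of order, and that a substitution is any aligned unit whose surface form differs. The function name `count_phenomena`, the `anchor` parameter, and the anchoring heuristic are assumptions made for illustration; they are not taken from the P4P annotation guidelines.

```python
from typing import Dict, List, Tuple

Unit = Tuple[str, str]  # (alignment label, surface form)


def count_phenomena(a: List[Unit], b: List[Unit], anchor: str) -> Dict[str, int]:
    """Count substitutions and order changes between two aligned fragments.

    `anchor` is the label of the unit used as reference point (here the verb);
    a unit counts as reordered when it switches sides relative to the anchor.
    This heuristic is an illustrative assumption, not the annotators' rule.
    """
    forms_a, forms_b = dict(a), dict(b)
    pos_a = {label: i for i, (label, _) in enumerate(a)}
    pos_b = {label: i for i, (label, _) in enumerate(b)}
    shared = forms_a.keys() & forms_b.keys()

    # A substitution: same alignment label, different surface form.
    substitutions = sum(
        1 for label in shared if forms_a[label].lower() != forms_b[label].lower()
    )
    # An order change: the unit moves from one side of the anchor to the other.
    order_changes = sum(
        1
        for label in shared
        if label != anchor
        and (pos_a[label] < pos_a[anchor]) != (pos_b[label] < pos_b[anchor])
    )
    return {
        "substitutions": substitutions,
        "order_changes": order_changes,
        "total": substitutions + order_changes,
    }


# Example (4): [said]α [I]β [cautiously]γ  vs.  [I]β [carefully]γ [answered]α
example_4a = [("alpha", "said"), ("beta", "I"), ("gamma", "cautiously")]
example_4b = [("beta", "I"), ("gamma", "carefully"), ("alpha", "answered")]
print(count_phenomena(example_4a, example_4b, anchor="alpha"))
# {'substitutions': 2, 'order_changes': 2, 'total': 4}
```

Under these assumptions the sketch yields two same-polarity substitutions and two changes of order, matching the four independent phenomena described for Example (4).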
In what follows, types in our typology are briefly described.
Inflectional changes consist of changing inflectional affixes of words. In Example (5),
a plural/singular alternation (streets/street) can be observed.
(5) a. it was with difficulty that the course of streets could be followed
    b. You couldn’t even follow the path of the street
5 All the examples in this article are extracted from the P4P corpus. In some of them, only the fragment
we are referring to appears; in others, its context is also displayed (with the fragment in focus in italics).
Neither the fragment set out nor italics necessarily refer to the annotated scope (cf. Section 3), although
they sometimes coincide. These fragments are not complete cases of plagiarism. Refer to Table 4 to see
some entire instances of plagiarism in the P4P corpus.
Modal verb changes are changes of modality using modal verbs, like might and could
in Example (6).
(6) a. I [...] was still lost in conjectures who they might be
    b. I was pondering who they could be
Derivational changes consist of changes of category with or without using derivational
affixes. These changes imply a syntactic change in the sentence in which they occur.
In Example (7), the verbal form differing is changed to the adjective different, with the
consequent structural reorganization.
(7) a. I have heard many accounts of him [...] all differing from each other
    b. I have heard many different things about him
Spelling and format changes comprise changes in the spelling and format of lexical
(or functional) units, such as case changes, abbreviations, or digit/letter alternations.
In Example (8), case changes occur (Peace/PEACE).
(8) a. And yet they are calling for Peace!–Peace!!
    b. Yet still they shout PEACE! PEACE!
Same-polarity substitutions change one lexical (or functional) unit for another with
approximately the same meaning.⁶ Among the linguistic mechanisms of this type,
we find synonymy, general/specific substitutions, or exact/approximate alternations.
In Example (9), very little is more general than a teaspoonful of.
(9) a. a teaspoonful of vanilla
    b. very little vanilla
Synthetic/analytic substitutions consist of changing synthetic structures for analytic
structures, and vice versa. This type comprises mechanisms such as
compounding/decomposition, light element, or lexically emptied specifier additions/deletions, or
alternations affecting genitives and possessives. In Example (10b), a (lexically emptied)
specifier (a sequence of) has been deleted: it did not add new content to the lexical unit,
but emphasized its plural nature.
(10) a. A sequence of ideas
     b. ideas
Opposite-polarity substitutions. Two phenomena are considered within this type.
First, there is the case of double change of polarity, when a lexical unit is changed
for its antonym or complementary and another change of polarity has to occur within
the same sentence in order to maintain the same meaning. In Example (11), failed is
replaced by its antonym succeed and a negation is added. Second, there is the case
of change of polarity and argument inversion, where an adjective is changed for its
antonym in comparative structures. Here an inversion of the compared elements has to
occur. In Example (12), the adjectival phrases far deeper and more general change to the
opposite-polarity ones less serious and less common. To maintain the same meaning, the
order of the compared elements (i.e., what the Church considers and what is perceived
by the population) has to be inverted.
(11) a. Leicester [...] failed in both enterprises
     b. he did not succeed in either case
(12) a. the sense of scandal given by this is far deeper and more general than the
        Church thinks
     b. the Church considers that this scandal is less serious and less common than it
        really is
6 The object of study of both paraphrasing and lexical semantics fields converge in lexicon-based changes
in general and same-polarity substitutions in particular. In this sense, many works and tasks in lexical
semantics are also relevant for our purposes. By way of illustration, the lexical substitution task within
SemEval-2007 aimed to produce a substitute word (or phrase), that is, a paraphrase, for a word in context
(McCarthy and Navigli 2009).
Converse substitutions take place when a lexical unit is changed for its converse pair.
In order to maintain the same meaning, an argument inversion has to occur. In
Example (13), awarded to is changed to receiving [...] from, and the arguments the Geological
Society in London and him are inverted.
(13) a. the Geological Society of London in 1855 awarded to him the Wollaston
        medal
     b. resulted in him receiving the Wollaston medal from the Geological Society
        in London in 1855
Diathesis alternation type gathers those diathesis alternations in which verbs can
participate, such as the active/passive alternation (Example (14)).
(14) a. the guide drew our attention to a gloomy little dungeon
     b. ou[r] attention was drawn by our guide to a little dungeon⁷
Negation switching consists of changing the position of the negation within a sentence.
In Example (15), no changes to does not.
(15) a. In order to move us, it needs no reference to any recognized original
     b. One does not need to recognize a tangible object to be moved by its artistic
        representation
Ellipsis includes linguistic ellipsis (i.e., those cases in which the elided fragments can be
recovered through linguistic mechanisms). In Example (16b), the subject he appears in
both clauses; in Example (16a), it is only displayed in the first one.
(16) a. In the scenes with Iago he equaled Salvini, yet did not in any one point
        surpass him
     b. He equaled Salvini, in the scenes with Iago, but he did not in any point
        surpass him or imitate him
7 Typos in the examples are also present in the original corpus. When there was any modification of the
original, this is indicated with square brackets.
Coordination changes consist of changes in which one of the members of the pair
contains coordinated linguistic units, and this coordination is not present or changes
its position and/or form in the other member of the pair. The juxtaposed sentences with
a full stop in Example (17a) are coordinated with the conjunction and in (17b).
(17) a. It is estimated that he spent nearly £10,000 on these works. In addition he
        published a large number of separate papers
     b. Altogether these works cost him almost £10,000 and he wrote a lot of small
        papers as well
Subordination and nesting changes consist of changes in which one of the members
of the pair contains a subordination or nested element, which is not present, or changes
its position and/or form within the other member of the pair. What is a relative clause
in Example (18a) (which limits the percentage of Jewish pupils in any school) is part of the
main clause in Example (18b).
(18) a. the Russian law, which limits the percentage of Jewish pupils in any school,
        barred his admission
     b. the Russian law had limits for Jewish students so they barred his admission
Punctuation and format changes consist of any change in the punctuation or format
of a sentence (not of a lexical unit, cf. lexicon-based changes). In Example (19a), the list
appears numbered and, in Example (19b), it does not.
(19) a. At Victoria Station you will purchase (1) a return ticket to Streatham
        Common, (2) a platform ticket
     b. You will purchase a return ticket to Streatham Common and a platform
        ticket at Victoria station
Direct/indirect style alternations consist of changing direct style for indirect style,
and vice versa. The direct style can be seen in Example (20a) and the indirect in Example
(20b).
(20) a. “She is mine,” said the Great Spirit
     b. The Great Spirit said that she is her[s]
Sentence modality changes are those cases in which there is a change of modality (not
provoked by modal verbs, cf. modal verb changes), but the illocutive value is
maintained. In Example (21a), interrogative sentences can be observed; they are changed to
an affirmative sentence in Example (21b).
(21) a. The real question is, will it pay? will it please Theophilus P. Polk or vex
        Harriman Q. Kunz?
     b. He do it just for earning money or to please Theophilus P. Polk or vex
        Hariman Q. Kunz
Syntax/discourse structure changes gather a wide variety of syntax/discourse
reorganizations not covered by the types in the syntax and discourse subclasses above. An
example can be seen in Example (22).
(22) a. How he would stare!
     b. He would surely stare!
Semantics-based changes are those that involve a different lexicalization of the same
content units.⁸ These changes affect more than one lexical unit and a clear-cut division
of these units in the mapping between the two members of the paraphrase pair is not
possible. In Example (23), the content units TROPICAL-LIKE ASPECT (scenery was [...]
tropical/tropical appearance) and INCREASE OF THIS ASPECT (more/added) are present in
both fragments, but there is not a clear-cut mapping between the two.
(23) a. The scenery was altogether more tropical
     b. which added to the tropical appearance
Change of order includes any type of change of order from the word level to the
sentence level. In Example (24), first changes its position in the sentence.
(24) a. First we came to the tall palm trees
     b. We got to some rather biggish palm trees first
Addition/deletion. This type consists of all additions/deletions of lexical and functional
units. In Example (25b), one day is deleted.
(25) a. One day she took a hot flat-iron, removed my clothes, and held it on my
        naked back until I howled with pain
     b. As a proof of bad treatment, she took a hot flat-iron and put it on my back
        after removing my clothes
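The classes and types described in this section can be held in a simple lookup structure, which is convenient when tallying annotations per class. The sketch below is only illustrative: the type names follow the descriptions above, but the assignment of each type to a class is inferred from the order of presentation and from the cross-references in the text, so it should be read as an assumption rather than as the authors' table. The names `TYPOLOGY`, `TYPE_TO_CLASS`, and `class_distribution` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Classes and types as described in this section. The grouping of types into
# classes is an assumption inferred from the order of presentation.
TYPOLOGY: Dict[str, List[str]] = {
    "morpholexicon-based": [
        "inflectional changes",
        "modal verb changes",
        "derivational changes",
        "spelling and format changes",
        "same-polarity substitutions",
        "synthetic/analytic substitutions",
        "opposite-polarity substitutions",
        "converse substitutions",
    ],
    "structure-based": [
        "diathesis alternation",
        "negation switching",
        "ellipsis",
        "coordination changes",
        "subordination and nesting changes",
        "punctuation and format changes",
        "direct/indirect style alternations",
        "sentence modality changes",
        "syntax/discourse structure changes",
    ],
    "semantics-based": ["semantics-based changes"],
    "miscellaneous": ["change of order", "addition/deletion"],
}

# Reverse index: paraphrase type -> class.
TYPE_TO_CLASS: Dict[str, str] = {
    ptype: pclass for pclass, ptypes in TYPOLOGY.items() for ptype in ptypes
}


def class_distribution(tags: Iterable[str]) -> Counter:
    """Count how many annotated phenomena fall into each class."""
    return Counter(TYPE_TO_CLASS[tag] for tag in tags)


# Example (4) carries two same-polarity substitutions and two changes of order.
print(class_distribution([
    "same-polarity substitutions",
    "same-polarity substitutions",
    "change of order",
    "change of order",
]))
```

A flat dictionary keeps the sketch short; the intermediate subclasses mentioned in the text (morphology, lexicon, syntax, discourse) could be added as a further nesting level.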
3. Building the P4P Corpus
This section describes how P4P, a new paraphrase corpus with paraphrase type
annotation, was built.⁹ First, we will set out a brief state of the art on paraphrase corpora.
Paraphrase corpora in existence are rather few. One of the most widely used is the
MSRP corpus (Dolan and Brockett 2005), which contains 5,801 English sentence pairs
from news articles hand-labeled with a binary judgment indicating whether human
raters considered them to be paraphrases (67%) or not (33%). Cohn, Callison-Burch,
and Lapata (2008), in turn, built a corpus of 900 paraphrase sentence pairs aligned
at word or phrase level.¹⁰ The pairs were compiled from three different types of
corpora: (i) sentence pairs judged equivalent from the MSRP corpus, (ii) the
Multiple-Translation Chinese corpus, and (iii) the monolingual parallel corpus used by Barzilay
and McKeown (2001). The WRPA corpus (Vila, Rodríguez, and Martí Submitted) is a
corpus of relational paraphrases extracted from Wikipedia. It comprises paraphrases
expressing relations like person–date_of_birth in English and author–work in Spanish.
Moreover, Max and Wisniewski (2010) built the Wikipedia Correction and Paraphrase
Corpus from the Wikipedia revision history.¹¹ Apart from paraphrases, the corpus
includes spelling corrections and other local text transformations. In the paper, the
authors set out a typology of these revisions and classify them as meaning-preserving
or meaning-altering. There also exist works where the focus is not to build a paraphrase
corpus, but to create a paraphrase extraction or generation system, which ends up in
also building a paraphrase collection, such as Barzilay and Lee (2003).
8 This type is based on the ideas of Talmy (1985).
9 The P4P corpus and guidelines used for its annotation are available at
http://clic.ub.edu/corpus/en/paraphrases-en. The subsets of the MSRP and WRPA corpora
annotated with the same typology are also available at this Web site.
10 http://staffwww.dcs.shef.ac.uk/people/T.Cohn/paraphrase_corpus.html.
11 http://wicopaco.limsi.fr/.
Plagiarism detection experts are starting to turn their attention to paraphrasing.
Burrows, Potthast, and Stein (2012) built the Webis Crowd Paraphrase Corpus by
crowd-sourcing more than 4,000 manually simulated samples of paraphrase
plagiarism.¹² In order to create feasible mechanisms for crowd-sourcing paraphrase
acquisition, they built a classifier to reject bad instances of paraphrase plagiarism (e.g., cases
of verbatim plagiarism). These crowd-sourced instances are similar to the cases of
simulated plagiarism in the PAN-PC-10 corpus, and hence the P4P (see the following).
P4P was built upon the PAN-PC-10 corpus, from the International Competition on
Plagiarism Detection.¹³ The PAN competition appeared with the aim of creating the
first large-scale evaluation framework for plagiarism detection. It relies on two main
resources: a corpus with cases of plagiarism and a set of evaluation measures specially
suited to the problem of automatic plagiarism detection (cf. Section 4) (Potthast et al.
2010). We focus on the Pan-10 plagiarism detection competition. The corpus used in
this edition, known as PAN-PC-10, was composed of a set of suspicious documents
$D_q$ that may or may not contain plagiarized fragments, together with a set of potential
source documents $D$. In order to build it, text fragments were extracted randomly from
documents $d \in D$ and inserted into some $d_q \in D_q$. The PAN-PC-10 contains circa 70,000
cases of plagiarism; 40% of them are exact copies, and the rest involved some kind of
obfuscation (paraphrasing). Most of the obfuscated cases were generated artificially,
that is, rewriting operations were imitated by a computational process.¹⁴ The rest (6%)
were created by humans who aimed at simulating paraphrase cases of plagiarism.
These cases were generated through Amazon Mechanical Turk, with clear instructions
to rewrite text fragments to simulate the act of plagiarizing. According to Potthast
et al. (2010), most of the turkers had attended college and 62% identified themselves
as native English speakers.¹⁵ Cases in this subset of the corpus are hereafter referred to
as simulated plagiarism.¹⁶
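To make the idea of artificial obfuscation concrete, the following sketch illustrates one of the strategies listed in footnote 14: shuffling the words of a fragment while preserving its part-of-speech (POS) sequence. It is a toy illustration under stated assumptions, not the code used to generate the PAN-PC-10 corpus: the input is assumed to be already POS-tagged, the tags in the usage example are hand-assigned, and the function name `pos_preserving_shuffle` is hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

TaggedToken = Tuple[str, str]  # (word, POS tag)


def pos_preserving_shuffle(
    tagged: List[TaggedToken], rng: random.Random
) -> List[TaggedToken]:
    """Shuffle words among positions that share the same POS tag.

    The output has exactly the same tag sequence as the input; only the
    words filling each tag slot are permuted.
    """
    # Collect the words available for each tag.
    words_by_tag: Dict[str, List[str]] = defaultdict(list)
    for word, tag in tagged:
        words_by_tag[tag].append(word)

    # Permute the words within each tag group.
    for words in words_by_tag.values():
        rng.shuffle(words)

    # Refill the original tag sequence with the permuted words.
    return [(words_by_tag[tag].pop(), tag) for _, tag in tagged]


# Hand-tagged toy input (the tags are an assumption for this example).
fragment = [
    ("the", "DET"), ("guide", "NOUN"), ("drew", "VERB"),
    ("our", "DET"), ("attention", "NOUN"),
]
obfuscated = pos_preserving_shuffle(fragment, random.Random(0))
print(" ".join(word for word, _ in obfuscated))
```

Because words are only exchanged with other words of the same tag, the result keeps the grammatical skeleton of the source while often losing its meaning, which is one reason such artificial cases differ from the manually simulated ones.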
The P4P corpus was built using cases of simulated plagiarism in the PAN-PC-10
($plg_{sim}$). They consist of pairs of source and plagiarized fragments, where the latter
was manually created reformulating the former. From this set, we selected those cases
containing 50 words or fewer ($|plg_{sim}| \le 50$); 847 paraphrase pairs met these conditions
and were selected as our working subset. The decision was taken for the sake of
simplicity and efficiency, and is backed by state-of-the-art paraphrase corpora. By
way of illustration, the MSRP contains 28 words per case on average and the Barzilay
and Lee (2003) collection includes examples of about 20 words in length only.
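The length-based selection can be sketched as a simple filter. The text does not specify how words were counted or whether the 50-word limit applies to the source fragment, the plagiarized fragment, or both; the sketch below assumes whitespace tokenization and applies the limit to both fragments. The names `word_count` and `select_short_pairs` are hypothetical.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source fragment, plagiarized fragment)


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed notion of 'word')."""
    return len(text.split())


def select_short_pairs(pairs: Iterable[Pair], max_words: int = 50) -> List[Pair]:
    """Keep the pairs in which both fragments have at most `max_words` words."""
    return [
        (source, plagiarized)
        for source, plagiarized in pairs
        if word_count(source) <= max_words and word_count(plagiarized) <= max_words
    ]


# Toy data: the first pair is short, the second exceeds the limit.
candidates = [
    ("First we came to the tall palm trees",
     "We got to some rather biggish palm trees first"),
    ("word " * 51, "a short rewrite"),
]
working_subset = select_short_pairs(candidates)
print(len(working_subset))  # 1
```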
The tagset and the scope. After tokenization of the working corpus, the annotation
was performed by, on the one hand, tagging the paraphrase phenomena present in
12 http://www.uni-weimar.de/cms/medien/webis/research/corpora/corpus-webis-cpc-11.html.
13 http://www.uni-weimar.de/cms/medien/webis/research/corpora/corpus-pan-pc-10.html.
14 The strategies include: (i) randomly shuffling, removing, inserting, or replacing short phrases from
the source to the plagiarized fragment, (ii) randomly substituting a word for its synonym, hyponym,
or antonym, and (iii) randomly shuffling the words, but preserving the POS sequence of the source
text (Potthast et al. 2010a,b).
15 Turkers aimed at finishing the cases as soon as possible in order to get paid for the task, hence facing a
similar time constraint to that of people tempted to take the plagiarism shortcut.
16 In contrast to simulated plagiarism, paraphrase plagiarism is a more general term referring to plagiarism
based on paraphrase mechanisms.