The Crúbadán Project: Corpus building for
under-resourced languages
Kevin P. Scannell¹
Saint Louis University
Abstract
We present an overview of the Crúbadán project, the aim of which is the creation of text corpora for a
large number of under-resourced languages by crawling the web.
Keywords: web crawling, corpus, corpora, minority languages, under-resourced languages, spell
checking, language recognition.
1. Introduction
1.1. Background and Goals
Only a very small number (perhaps thirty) of the world's 6000+ languages currently
enjoy the benefits of modern language technologies such as speech recognition and
machine translation. A slightly larger number (less than 100) have managed to assemble
the basic resources needed as a foundation for advanced end-user technologies:
monolingual and bilingual corpora, machine-readable dictionaries, thesauri, part-of-speech
taggers, morphological analyzers, parsers, etc. (in short, the elements of a so-called
Basic Language Resource Kit (BLARK) as in (Krauwer 2003)). The remainder (certainly
more than 98% of the world's living languages) lack most, and usually all, of these tools,
and we therefore refer to these as under-resourced languages.
Since 2000, the author and his collaborators have been engaged in the development of
freely-available language kits for a large number of under-resourced languages. The
lack of funding opportunities or commercial interest in this work has led to an approach
based on certain principles that offer maximal “bang for the buck”: monolingual and
parallel corpora harvested by web-crawling, language-independent tools when possible,
an open-source development model that leverages volunteer labor by language
enthusiasts, and unsupervised machine learning algorithms.
The present paper discusses The Crúbadán Project,¹ which implements the first of these
principles. The value of the web as a source of linguistic data has been widely
recognized for nearly a decade (Resnik 1999), (Kilgarriff 2001), (Kilgarriff & Grefenstette
2003), and several authors have addressed the particular importance that “Web as
Corpus” (WAC) research holds for under-resourced languages (Ghani et al. 2001), (Ghani
et al. 2005), (de Schryver 2002).
¹ Department of Mathematics and Computer Science, Saint Louis University, Missouri, USA, .
¹ See .
Cahiers du Cental, 5 (2007), 1
In (Kilgarriff & Grefenstette 2003), it is argued that the notion of a corpus should be
“reclaimed” from, among other things, the connotation of representativeness that it has
acquired over the years. We believe this is particularly so for the under-resourced
languages that form the focus of our work, as the rules of the game are completely different
in this case. Indeed, for many of the Crúbadán languages, the few dozen documents
retrieved by the crawler may very well represent the totality of all electronic documents
in existence, and therefore the notion of assembling a representative corpus (without
resorting to old-fashioned methods such as keyboarding or scanning) for such languages
is absurd.
In any case, having accepted this inclusive definition of “corpus”, we can claim to have
created corpora for more than 400 languages.² More significantly, we have used these
corpora, in collaboration with native speakers, in the creation of new language
technologies for almost 30 languages. Most of the 400+ corpora lack any linguistic annotation
for the simple reason that the tools for performing such annotations do not yet exist (see
(Rayson et al. 2006) and (Baroni & Kilgarriff 2006) for recent work on annotating web
corpora for major languages). We have, however, succeeded in bootstrapping part-of-speech
taggers for a small number of languages; this is discussed below in §3.1. We
should also mention that, in addition to our own work, the Crúbadán data have been
provided to many other research groups and individuals working independently on open
source projects on behalf of under-resourced languages.
Many of the core ideas in this paper are well-known and we hold no pretensions to
originality. In technical terms as well, much of the functionality of our crawler is now
implemented and easily available via the open source BootCaT tools (Baroni & Bernardini
2004). Therefore, while we will sketch the basic design of the crawler and offer some
implementation details that we hope will be found useful by others working in this area,
we skirt over many of the complex issues involved in WAC research (evaluating corpus
composition and representativeness, random generation of seed URLs, duplicate
stripping, dissemination issues) and recommend (Ciaramita & Baroni 2006), (Evert 2006),
(Sharoff 2006b), (Sharoff 2006a) as starting points for readers interested in exploring
these issues more deeply. Our focus will instead be on topics particularly pertinent to
under-resourced languages, including the sociological aspects of the project which make
it somewhat unique in language-processing circles. We expect that community-based
approaches like ours will be broadly applicable in trying to break the data bottleneck in
NLP applications, especially for minority and under-resourced languages.
² See .
1.2. Brief History
The roots of the project stretch back to 1999 when the author began creating the first
spell checker for the Irish language. The original version of the “crawler” could
recursively download complete web sites (or the documents below a specified root directory),
convert them to plain text, tokenize, and create a frequency list for use in enhancing the
spell checking database.
By 2003, this had evolved into a true web crawler, with a language identification module
trained for the six Celtic languages. In the summer of 2004, many new language models
were trained (using the techniques discussed below in §2.1.) and a major web-crawl was
undertaken that targeted 144 under-resourced languages. At this point the project was
dubbed An Crúbadán.³
In early 2007, in preparation for the present conference, an additional 200 models were
trained (bringing the total to 416 languages) and all of the corpora were recrawled. The
focus on under-resourced languages means that the amount of data in question is
surprisingly small; in this latest crawl we have visited about two million URLs, resulting in
the addition of approximately 350000 documents to the corpora.
2. The Crawler
2.1. Training New Languages
As we discuss in §2.3. below, the default behavior of the crawler is to use simple
character trigrams for language recognition. Therefore, training a new language model
amounts to nothing more than collecting a sufficient amount of plain text from which
reliable trigram statistics can be gathered. The amount of text required varies greatly from
language to language, depending primarily on whether or not there are other languages
that have similar trigram profiles.
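As an illustration of what gathering trigram statistics from plain text can look like, here is a minimal Python sketch. It is not the project's code: the function names, the lowercasing, the whitespace normalization, and the padding with a single space at each end are all assumptions made for the example.

```python
from collections import Counter


def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams in a piece of plain text.

    Assumed preprocessing (not specified in the paper): lowercase the text,
    collapse runs of whitespace, and pad with one space at each end so that
    word-initial and word-final trigrams are counted too.
    """
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def relative_frequencies(profile: Counter) -> dict[str, float]:
    """Turn raw trigram counts into relative frequencies."""
    total = sum(profile.values())
    return {t: n / total for t, n in profile.items()} if total else {}


if __name__ == "__main__":
    profile = trigram_profile("béal agus béal")
    print(profile.most_common(3))
```

In this sketch, training on more text simply means adding the counters for each new document, for example `total_profile.update(trigram_profile(doc))`.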
Each language has some additional metadata that must be provided manually: the name
of the language in English, the ISO 639-3 code, a flag indicating whether the language is
under-resourced, and a list of “polluting” languages (languages one might expect to see
frequently in boilerplate text in documents that are otherwise written in the target
language; French is a polluter of Lingala, Spanish is a polluter of Basque, etc., and English
is set as a polluting language by default). Marking a language as “under-resourced” is
mostly impressionistic⁴ and is used primarily for reporting purposes on the project web
site. After the above-mentioned fields are set, the ISO 639-3 code is used to gather,
automatically, additional metadata by screen-scraping the Ethnologue web site⁵ (for
alternate language names, linguistic classification, countries in which the language is
spoken, etc.).
³ The name means “the crawler” or “the crawling thing” in Irish. The root word is crúb (“paw”),
which lends the appropriate connotation of unwanted pawing, as in Ná leag do chrúba orm, roughly “Get
your paws off me”.
⁴ See (Streiter et al. 2007), (Maxwell & Hughes 2006), or (Berment 2004) for discussions of how this
notion might be quantified.
⁵ See .
Many languages have not fully embraced the use of Unicode, and this can be either a
minor annoyance (when a standard 8-bit encoding is used but is not indicated correctly
in HTML metadata) or a major annoyance (when special 8-bit encodings are used in
conjunction with language-specific fonts). In the latter case, we have relied on input
from native speakers in order to map the encodings that exist in the wild to standard
UTF-8.
Mongolian provides a nice, simple example. Most documents on the web are encoded,
at least according to the metadata, as CP-1251. But according to CP-1251, decimal byte
values 170, 175, 186, and 191 correspond to Unicode codepoints U+0404, U+0407,
U+0454, and U+0457, respectively. In Mongolian documents, however, these bytes are
intended to represent codepoints U+04E8, U+04AE, U+04E9, and U+04AF, and the
conversion is handled simply by having an appropriate Mongolian font installed when
reading CP-1251 documents.
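The remapping just described can be sketched in a few lines of Python. This is only an illustration built from the four code point pairs given above, not the crawler's actual conversion code, and the names `MONGOLIAN_FIXUPS` and `decode_mongolian_cp1251` are hypothetical. It assumes the document really is Mongolian; applied to a Ukrainian CP-1251 text, it would corrupt legitimate letters.

```python
# What standard CP-1251 decoding yields -> what Mongolian documents intend,
# using only the four code points listed in the text.
MONGOLIAN_FIXUPS = str.maketrans({
    "\u0404": "\u04E8",  # byte 170
    "\u0407": "\u04AE",  # byte 175
    "\u0454": "\u04E9",  # byte 186
    "\u0457": "\u04AF",  # byte 191
})


def decode_mongolian_cp1251(raw: bytes) -> str:
    """Decode bytes as CP-1251, then remap the four reassigned code points.

    Raises UnicodeDecodeError on bytes that CP-1251 leaves undefined.
    """
    return raw.decode("cp1251").translate(MONGOLIAN_FIXUPS)


if __name__ == "__main__":
    text = decode_mongolian_cp1251(bytes([170, 175, 186, 191]))
    utf8_bytes = text.encode("utf-8")  # ready to store as standard UTF-8
    print([f"U+{ord(c):04X}" for c in text])
```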
Consider also Polynesian languages such as Hawaiian that have macrons over vowels
and the “okina” (glottal stop). Many legacy texts either leave out these special characters
entirely, or simply encode them in Latin-1, with no expectation that the macrons will be
rendered correctly on the screen. We have seen each of Á, À, Â, Ã, Ä used in this
way, as well as a bevy of different characters for the okina. It is interesting to note
that several Hawaiian-speaking contacts have suggested that documents not encoded
correctly in UTF-8 be left out of the corpus, even though they could easily be converted;
the expectation is that these will not be carefully-edited texts, and are more likely to
contain misspellings, poor grammar, etc.
Irish offers the best (and an almost absurd) example. In the days before 8-bit email, Irish
speakers used to indicate acute accents on vowels with a forward slash following the
vowel: “be/al” for “béal”, etc., and this habit persisted well into the 2000's. As it turns
out, certain mailing list archives hosted at
form the single largest
repository of Irish language text on the web (and therefore, presumably, the largest Irish
text collection in history), but these texts are basically invisible to web crawlers and
search engines like Google that do not take these conventions into account (Google
indexes a word like “be/al” as two words: “be” and “al”).
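A preprocessing step for the slash convention might look like the following Python sketch. It is a naive heuristic written for illustration, not the project's converter: the function name `restore_fadas` is hypothetical, and the rule (any vowel directly followed by a single slash) will misfire on English fragments such as “he/she”, so a real converter would presumably check candidates against an Irish word list.

```python
import re

# Plain vowel -> vowel with acute accent.
ACUTE = dict(zip("aeiouAEIOU", "áéíóúÁÉÍÓÚ"))

# A vowel followed by exactly one slash; "//" is skipped to leave URLs alone.
SLASH_ACCENT = re.compile(r"([aeiouAEIOU])/(?!/)")


def restore_fadas(text: str) -> str:
    """Rewrite 'be/al'-style spellings with real acute accents."""
    return SLASH_ACCENT.sub(lambda m: ACUTE[m.group(1)], text)


if __name__ == "__main__":
    print(restore_fadas("be/al"))
```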
The vast number of undocumented encoding schemes of this kind illustrates the
importance of collaboration with native speakers for a project of this kind. Indeed, we claim
that any effort to crawl the web for a large number of languages without attempting to
harness the collective knowledge of many language experts, either via direct
collaboration or through a large database in the style of XNLRDF (Streiter & Stuflesser 2005), is
doomed to failure.
The majority of training texts come from three sites: the Wikipedia,⁶ the Jehovah's
Witnesses web site,⁷ and the United Nations' Universal Declaration of Human Rights
site.⁸ The training texts from these three sites were cleaned using ad hoc methods suited
to these sites. Many other languages were trained using texts provided directly by
native-speaking contributors. In cases where an open source spell checking package (hence
a word list) was already available, it was possible to generate search engine queries
directly (see §2.2. below), and when the spell checker was known to be sufficiently
reliable, it could be used directly for language identification purposes, bypassing the
trigram approach (and the need for training data) completely.
⁶ See , which lists 251 languages as of 22 April 2007. Standard practice on the Wikipedia site is to encode all
documents as UTF-8, but this is not always the case, even when the HTML metadata indicates as much,
so care is needed when using these texts for training purposes.
Next, instead of using these texts directly for the trigram statistics, we perform some
additional processing. A word frequency list is generated, and then several filters are
applied in an attempt to produce a clean word list. For example, we remove words
containing characters not usually appearing in the target language, words with no vowels
(when this makes sense), words with the same character appearing three or more times
in a row, words with a capital or titlecase character appearing after the first character,
words that appear in the word list for a polluting language, and words that contain
improbable trigrams (at later stages, after the statistics are available). Also, since it is
extremely common in web corpora for diacritics to be omitted, we have found it useful
to remove ASCII-only words (like “beal”) if a version with diacritics (“béal”) appears
with higher frequency. Additional language-specific filters can be applied when a
native-speaking contact is available, and these can be very powerful – e.g. Hawaiian does not
allow two consecutive consonants and Malagasy has similar constraints that allow for
very efficient filtering. The end result is a word list we call (imprecisely) the lexicon.
The trigram statistics used for language recognition are collected from the subcorpus of
words appearing in the lexicon.
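Most of the filters listed above are simple enough to sketch in Python. The code below is an illustration only: `clean_lexicon`, its parameters, and the toy data are assumptions, the improbable-trigram filter is omitted because it depends on statistics gathered later, and the diacritic comparison relies on Unicode decomposition, which will not cover every writing system.

```python
import re
import unicodedata


def strip_diacritics(word):
    """Return the word with combining marks removed (béal -> beal)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def clean_lexicon(freq, alphabet, vowels, polluting_words):
    """Filter a {word: count} frequency list into a cleaner word list.

    alphabet and vowels are sets of lowercase characters for the target
    language; polluting_words is the word list of the polluting languages.
    """
    # Highest count of any diacritic-bearing word, keyed by its stripped form.
    best_accented = {}
    for word, count in freq.items():
        if not word.isascii():
            base = strip_diacritics(word)
            best_accented[base] = max(best_accented.get(base, 0), count)

    kept = {}
    for word, count in freq.items():
        lower = word.lower()
        if not set(lower) <= alphabet:
            continue  # characters not usual in the target language
        if not any(c in vowels for c in lower):
            continue  # no vowels
        if re.search(r"(.)\1\1", lower):
            continue  # same character three or more times in a row
        if any(c.isupper() or c.istitle() for c in word[1:]):
            continue  # capital or titlecase after the first character
        if lower in polluting_words:
            continue  # appears in a polluting language's word list
        if word.isascii() and best_accented.get(word, 0) > count:
            continue  # ASCII-only variant of a more frequent accented word
        kept[word] = count
    return kept


if __name__ == "__main__":
    toy = {"béal": 10, "beal": 3, "agus": 50, "the": 20, "iPod": 4}
    irish_letters = set("abcdefghilmnoprstuáéíóú")
    print(clean_lexicon(toy, irish_letters, set("aeiouáéíóú"), {"the"}))
```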
Three final bits of language metadata are gathered, based on the training texts. First, the
trigram vector for the language is compared with every other language in the database,
and a list of nearby languages is created. Second, one or two “stopwords” are extracted
from the frequency list to be used in search engine queries as the crawler runs (as
described in §2.2.). When no native-speaking contact is available, this is done
automatically by selecting the highest frequency words that do not appear as a high frequency
word in another language in the database (in cases where it is difficult to find good
stopwords, one can restrict to nearby languages plus the sixty or so that are not marked
as under-resourced). Third, a list of characters appearing in the texts is created to be
used for tokenization purposes. Getting the tokenization correct is very much
language-dependent and we often rely on native speaker input for refining this part of the software.
⁷ See , with 310 languages as of 22 April 2007. The
documents for some languages are given in PDF, presumably when there is a concern that visitors to
the site will not have the necessary fonts installed to view UTF-8-encoded HTML. Various Crúbadán
contributors have also reported quality issues with the translations, and while these do not seem to be
serious enough to affect their usefulness for language recognition, one should be cautious when dealing
with languages for which these texts make up the majority of the web presence.
⁸ See , which has 331 languages listed as of 22
April 2007, though, like the Watchtower site, many of these are given as PDF files and cannot easily be
converted to plain text. Some of these are now available from .
2.2. Basic Design
The crawler focusses on one language at a time. A reasonable alternative would have
been to crawl the web very broadly and categorize each downloaded document using the
language recognizer, but this is clearly inefficient if one cares primarily about finding
texts in languages that do not have a large presence on the web.
Search engine queries are generated by OR'ing together randomly chosen words from
the so-called “lexicon” (discussed above), and then AND'ing at least one “stopword”.
A typical query for Irish might look like this:
where “agus” (En. “and”) is the stopword. It is the fourth most common word in Irish
and so appears in any document of non-trivial size, yet it does not appear commonly in
any other language with the exception of Scottish Gaelic.
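The query construction described above (random lexicon words joined with OR, plus a stopword joined with AND) can be sketched as follows. This is a hypothetical illustration: the function name `build_query`, the number of words per query, and the literal `(... OR ...) AND ...` syntax are assumptions, and the exact query format used by the crawler is not reproduced here.

```python
import random


def build_query(lexicon, stopwords, n_words=3, rng=None):
    """OR together randomly chosen lexicon words, then AND one stopword.

    The concrete query syntax is an assumption; a given search engine
    may expect different operators.
    """
    if not lexicon or not stopwords:
        raise ValueError("need at least one lexicon word and one stopword")
    rng = rng or random.Random()
    chosen = rng.sample(sorted(lexicon), k=min(n_words, len(lexicon)))
    stop = rng.choice(sorted(stopwords))
    return f"({' OR '.join(chosen)}) AND {stop}"


if __name__ == "__main__":
    # Toy Irish lexicon, for illustration only.
    print(build_query({"béal", "crúb", "teanga", "focal"}, {"agus"}))
```

Passing a seeded `random.Random` instance as `rng` makes the generated queries reproducible, which is convenient when comparing crawls.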
Using stopwords in this way leads to very high precision in terms of retrieving
documents that are actually written in the target language. Extensive tests for Irish confirm
this, with queries of the above form returning Irish documents with precision exceeding
98%. Over the long term, the recall is excellent as well, which is not surprising since,
given any particular Irish language document you might hope to retrieve, one can easily
imagine producing a large number of queries of the above form such that the desired
document appears in the top ten results returned by Google. Note that the high precision
for Irish is really a measure of the effectiveness of the particular stopword “agus”, and
for other languages it is sometimes more difficult to find suitable candidates for
stopwords. For example, for 121 of the 416 Crúbadán languages (29%), none of the top 10
most frequent words have four or more letters.
The randomly generated queries are passed to the Google API⁹ which returns a list of
URLs of documents potentially written in the target language. These are downloaded
(using the standard Linux tool ) and converted into plain text, encoded as UTF-8.
For the conversion to plain text, we have had the most success with the open source
programs ,¹⁰ ,¹¹ and .¹² As discussed above, for certain
languages the correct conversion to UTF-8 requires some pre- or post-processing.
⁹ Since it appears Google is no longer offering new keys for its search API, finding a reliable
alternative has become a more pressing issue for WAC research. We have experimented with other search
engines, via the  Perl modules, with mixed success.
¹⁰ For converting HTML to plain text. It is available from .
¹¹ For converting PDF files. This is part of the  package; see . We also use  for PostScript files.
¹² For converting Microsoft Word documents. See ; the  library is now integrated into the  program.
After this is complete, the language recognizer (§2.3.) is applied to the plain-text
candidate document. If it is deemed to have been written in the target language, then it is
added to the corpus, and all URLs appearing in the document (either as hypertext links
or in running text) are added to the list of “pending” URLs. If it is deemed to have been
written in a nearby language, the URL is added to a list of seed URLs for that language,
to be used later, when the crawler is targeting that nearby language. In all other cases,
the document is simply discarded.
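The per-document decision just described amounts to a small loop over a queue of pending URLs. The Python sketch below is a simplified illustration, not the crawler itself: `fetch_text` and `identify` are hypothetical callables standing in for the download-and-convert step and the language recognizer, and links are extracted only from the converted plain text, whereas the crawler also follows hypertext links.

```python
import re
from collections import deque

# Rough pattern for URLs appearing in running text.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def crawl(seed_urls, fetch_text, identify, target, nearby):
    """Process pending URLs until none remain.

    fetch_text(url) -> str or None : hypothetical download + plain-text step
    identify(text)  -> str or None : hypothetical language recognizer
    target : language code being crawled; nearby : codes of nearby languages
    """
    pending = deque(seed_urls)
    seen = set(seed_urls)
    corpus = {}
    seeds_for_nearby = {lang: [] for lang in nearby}

    while pending:
        url = pending.popleft()
        text = fetch_text(url)
        if text is None:
            continue
        lang = identify(text)
        if lang == target:
            corpus[url] = text
            for link in URL_RE.findall(text):
                if link not in seen:
                    seen.add(link)
                    pending.append(link)
        elif lang in seeds_for_nearby:
            seeds_for_nearby[lang].append(url)  # saved for a later crawl
        # in all other cases the document is discarded
    return corpus, seeds_for_nearby
```

The `seen` set keeps the loop from revisiting pages that link to each other, so the crawl terminates once the pending queue is empty.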
For languages flagged as “under-resourced”, this process continues until the collection
of pending URLs is depleted, at which time the crawl can be terminated, or else a new set
of search engine queries can be generated from the new, larger corpus. One important
note for under-resourced languages is that true crawling (i.e., following links versus
relying only on URLs from search engines) is absolutely essential in order to maximize
the size of the corpus. For Irish we have found well over 125000 documents online, and
searching for a random sample of these with Google suggests that only about 90% are
indexed by Google.
When the crawl is complete, some housecleaning is performed: duplicate documents
are removed from the corpus, a list of “unproductive” top-level domains (many hits but
no documents in the target language) is produced, the frequency list is rerun, the filters
discussed above are applied to generate a new lexicon, and, finally, the trigram statistics
are updated.
2.3. Language Recognition
The software measures the similarity between documents A and B (where one or both of
the documents might consist of the existing corpus for a language) using the cosine of
the angle between vectors representing the documents in the space of character trigrams,
which we denote $c_\theta(A, B)$. Just this simple approach is sufficient for distinguishing the
vast majority of language pairs in our database;¹³ a nice survey of alternate approaches
is found in (Hughes et al. 2006).
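For readers who want to experiment, the cosine measure over character trigram counts can be sketched as below. This is a minimal illustration rather than the project's implementation; the helper names and the preprocessing (lowercasing and padding with a space) are assumptions, and the vectors here use raw counts.

```python
import math
from collections import Counter


def trigrams(text):
    """Raw character trigram counts (assumed preprocessing: lowercase, pad)."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine of the angle between two sparse count vectors (dict-like)."""
    dot = sum(n * b.get(t, 0) for t, n in a.items())
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc = trigrams("agus béal agus crúb")
    reference = trigrams("béal agus teanga agus focal")
    print(round(cosine(doc, reference), 3))
```

A candidate document would then be accepted when its cosine against the target language's profile exceeds a per-language cutoff.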
There are some subtle questions regarding language recognition that we will not treat in
detail for reasons of space. First is the granularity at which language recognition should
be performed. Generally speaking, we work at the document level, but for certain
languages of special interest (Irish) we have extracted paragraphs from HTML documents
(see also (Zuraw 2006) for interesting remarks, in the context of an under-resourced
language, on the value of retaining documents containing even small snippets in the
target language). Second, the language recognition threshold is very much
language-dependent and requires occasional tuning based on a number of factors. The most
important factor, of course, is whether there are languages with very similar trigram profiles
in the database. One also has the ability to filter out “low quality” documents by setting
the threshold at a high level (say, more than 0.85), but this is counter to our goals when
working with under-resourced languages, and we generally set the cutoff to the lowest
value possible so that misclassifications are avoided. As an example, for Yoruba, the
closest language in the database (Sango) has cosine measure 0.460, so we are able to
use a cutoff of 0.50, and this is low enough that a large number of Yoruba documents
(e.g. those missing diacritical marks) are found which would not otherwise have been
included in the corpus.
¹³ The complete table of $c_\theta(A, B)$ values can be found at .
In certain problematic cases, we augment the language recognizer with a naive Bayes
classifier that works at the level of words. Examples where additional help has been
required are the dialects of Ladin (Badiot, Fascian, Gherdina, and Standard Ladin) and
Occitan (Languedocien, Provençal, Gascon, Limousin). In these and similar cases, we
had originally trained the crawler to recognize the language in the broad sense (“Ladin”,
or “Occitan”). Then a list of URLs (on the order of 100-500) of harvested documents
was provided to an expert who manually classified them according to dialect, and these
were used to bootstrap the Bayesian classifier. Dialects are not the only issue: Cornish,
as an example, has at least three competing orthographies and it would be useless for any
computational purpose to mix corpora for the three. And of course certain language pairs
are as difficult to distinguish as some dialects, or even more so, for example
Zulu–Xhosa, Danish–Norwegian Bokmål, and Indonesian–Malay.
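A word-level naive Bayes classifier of the kind mentioned above can be written compactly. The sketch below is a generic textbook version (multinomial model with add-one smoothing), offered as an assumption about what such a classifier might look like and not as the project's code; the class name, the whitespace tokenization, and the placeholder training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class WordNaiveBayes:
    """Multinomial naive Bayes over words, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, label, text):
        """Add one manually classified document to the model."""
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the most probable label, or None if nothing was trained."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            n_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / (n_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    nb = WordNaiveBayes()
    # Placeholder tokens standing in for expert-classified documents.
    nb.train("dialect_a", "alpha beta gamma")
    nb.train("dialect_b", "delta epsilon gamma")
    print(nb.classify("alpha gamma"))
```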
3. Applications
3.1. Corpus to Spelling and Grammar Checking
The most satisfying applications of the Crúbadán corpora have been to the most severely
under-resourced languages, in particular, those languages lacking even a simple word
list.
In §2.1. above, we discussed our algorithm for filtering a raw frequency list in order to
generate a (mostly) clean “lexicon” from which our trigram statistics are gathered. To
create a completely clean (not just statistically clean) word list, we must rely on human
editing. Our approach is to first provide the statistically-cleaned lexicon to a
native-speaking volunteer – since this list generally contains few errors, the editing goes quite
quickly. Then, excerpts from the output of the various filters are examined, and any
correctly-spelled words are added to the official cleaned list. New trigram statistics are
created based on this editing, and new excerpts are produced for editing. This process
continues until the word list is large enough for reasonable spell checking (a recall of
85% of words in typical documents is a reasonable target, but this varies widely
according to the morphological complexity of the language).
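One way to track progress toward such a target is to measure what fraction of the word tokens in a sample document appear in the current word list. The sketch below reads “recall” as simple token coverage, which is an interpretation on our part; the function name and the regular-expression tokenizer are assumptions, and as noted in §2.1. real tokenization is language-dependent.

```python
import re

# Runs of letters, optionally joined by an apostrophe or hyphen.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def coverage(word_list, document):
    """Fraction of word tokens in the document found in the word list."""
    tokens = TOKEN_RE.findall(document.lower())
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in word_list)
    return known / len(tokens)


if __name__ == "__main__":
    words = {"tá", "an", "lá", "go"}
    print(coverage(words, "Tá an lá go breá"))
```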
We have created new open source spell checkers for the following languages using
this approach: Azerbaijani, Chichewa, Frisian, Hiligaynon, Kashubian, Kinyarwanda,
Kurdish, Malagasy, Manx Gaelic, Mongolian, Scottish Gaelic, Setswana, Tagalog, and
Tetum.¹⁴
Once a clean word list is in place, the next step is to work on morphological analysis,
at least to the extent that it is supported by existing open source tools like Hunspell.¹⁵
Creating an “affix file” for Hunspell is quite easy, and while the result is not as
powerful as a full transducer, the construction can be done easily by an individual with no
linguistic training. The affix file allows simple morphological analysis, and also allows
the construction of a (partially) part-of-speech tagged word list. Volunteers can finish
tagging the word list manually. Finally, Brill's unsupervised learning algorithm (Brill
1995) can then be used to train a reasonably reliable part-of-speech tagger.
¹⁴ The truly hard work was done by our collaborators; see the Acknowledgements below.
¹⁵ See .
3.2. Lexicography
During 2004 we collected over 100 million words of Welsh, and about half of this text
was provided to the University of Wales Welsh Dictionary project.¹⁶ Andrew Hawke
emphasized to us at this early stage the value of inclusiveness when corpora are collected
for lexicographical purposes (for fear that interesting words might be discarded when
boilerplate text or near-duplicates are stripped), and this has guided our actions since.
One benefit of working with under-resourced languages is that they are only rarely the
target of “WAC spam” – documents not written by humans who speak the target
language but instead generated automatically by a computer one way or another. We
encountered a small amount of WAC spam while developing the Welsh corpus (apparently
generated by a dim-witted word-for-word machine translation program) and we have
seen some in Irish also (an n-gram word model). In each case it was a simple matter
to write a language-specific filter to detect these, but creating a language-independent
filter, or filters for 400+ languages, will be a major obstacle.
3.3. Other Applications
We have provided data (sentences, frequency lists, language identification data) to
several dozen other projects. These projects involve everything from lexicography,
morphology, and diacritic replacement (Wagacha et al. 2006) to machine translation, word
sense disambiguation, and thesaurus construction. We will continue to share the data
with research groups that release their own software under an approved open source
license.¹⁷
Acknowledgements
The Crúbadán project owes its success to the volunteer efforts of more than 75
contributors from over 40 countries; the full list is available from the project web site.¹⁸ We
hope that relegating their names to the web site in this way does not appear to diminish
our gratitude for their tireless work and enthusiasm. The project was undertaken (and
continues) with no external funding, though the author is grateful to his home institution
for a well-timed sabbatical during which the majority of the development was done.
¹⁶ See .
¹⁷ See  for a complete list.
¹⁸ See .
References
BARONI M. et BERNARDINI S. (2004), “BootCaT: Bootstrapping Corpora and Terms from the Web”,
in Proceedings of LREC 2004, Lisbon, Portugal.
BARONI M. et KILGARRIFF A. (2006), “Large linguistically-processed Web corpora for multiple
languages”, in Proceedings of the 11th EACL Conference, Trento, Italy.
BERMENT V. (2004), Méthodes pour informatiser des langues et des groupes de langues peu dotées, PhD
thesis, Université Joseph Fourier.
BRILL E. (1995), “Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging”,
in Yarowsky D. & Church K. (Eds), Proceedings of the Third Workshop on Very Large
Corpora, Cambridge, Massachusetts.
CIARAMITA M. et BARONI M. (2006), “Measuring Web-Corpus Randomness: A Progress Report”, in
Baroni M. & Bernardini S. (Eds), WaCky! Working Papers on the Web as Corpus, GEDIT, Bologna.
DE SCHRYVER G.-M. (2002), “Web for/as Corpus: A Perspective for the African Languages”, in Nordic
Journal of African Studies, n° 2, vol. 11.
EVERT S. (2006), “How Random is a Corpus? The Library Metaphor”, in Zeitschrift für Anglistik und
Amerikanistik, n° 2, vol. 54.
GHANI R., JONES R. et MLADENIĆ D. (2001), “Mining the Web to Create Minority Language
Corpora”, in Proceedings of the 10th international conference on Information and knowledge
management, Athens, Georgia: 279–286.
GHANI R., JONES R. et MLADENIĆ D. (2005), “Building Minority Language Corpora by Learning to
Generate Web Search Queries”, in Knowledge and Information Systems, n° 1, vol. 7.
HUGHES B., BALDWIN T., BIRD S., NICHOLSON J. et MACKINLAY A. (2006), “Reconsidering
Language Identification for Written Language Resources”, in Proceedings of the 5th International
Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy: 485–488.
KILGARRIFF A. (2001), “Web as corpus”, in Proceedings of the Corpus Linguistics 2001 Conference,
Lancaster University: 342–344.
KILGARRIFF A. et GREFENSTETTE G. (2003), “Introduction to the Special Issue on the Web as
Corpus”, in Computational Linguistics, n° 3, vol. 29.
KRAUWER S. (2003), “The Basic Language Resource Kit (BLARK) as the First Milestone for the
Language Resources Roadmap”, in Proceedings of the International Workshop “Speech and Computer”,
SPECOM 2003, Moscow, Russia.
MAXWELL M. et HUGHES B. (2006), “Frontiers in Linguistic Annotation for Lower-Density
Languages”, in Proceedings of the COLING/ACL 2006 workshop “Frontiers in Linguistically Annotated
Corpora”, Sydney: 29–37.
RAYSON P., WALKERDINE J., FLETCHER W. H. et KILGARRIFF A. (2006), “Annotated web as
corpus”, in Kilgarriff A. & Baroni M. (Eds), Proceedings of the 2nd International Workshop on Web as
Corpus (EACL06), Trento, Italy: 27–34.
RESNIK P. (1999), “Mining the web for bilingual text”, in Proceedings of the 37th Annual Meeting of
the Association for Computational Linguistics (ACL'99), College Park, Maryland: 527–534.
SHAROFF S. (2006a), “Creating General-Purpose Corpora Using Automated Search Engine Queries”, in
Baroni M. & Bernardini S. (Eds), WaCky! Working Papers on the Web as Corpus, GEDIT, Bologna.
SHAROFF S. (2006b), “Open-source corpora: Using the net to fish for linguistic data”, in International
Journal of Corpus Linguistics, n° 4, vol. 11.
STREITER O., SCANNELL K. et STUFLESSER M. (2007), Implementing NLP Projects for Non-Central
Languages: Instructions for Funding Bodies, Strategies for Developers, To appear in Machine
Translation.
STREITER O. et STUFLESSER M. (2005), “XNLRDF, the Open Source Framework for Multilingual
Computing”, in Ties I. (Ed), Proceedings of the conference “Lesser Used Languages and Computer
Linguistics”: European Academy, Bozen-Bolzano, Italy: 189–207.
WAGACHA P. W., DE PAUW G. et GITHINJI P. W. (2006), “A Grapheme-Based Approach for Accent
Restoration in Gĩkũyũ”, in Proceedings of the 5th International Conference on Language Resources
and Evaluation (LREC 2006), Genoa, Italy: 1937–1940.
ZURAW K. (2006), “Using the Web as a Phonological Corpus: a case study from Tagalog”, in
Kilgarriff A. & Baroni M. (Eds), Proceedings of the 2nd International Workshop on Web as Corpus
(EACL06), Trento, Italy: 59–66.