
Improved Part-of-Speech Tagging for Online Conversational Text
with Word Clusters

Olutobi Owoputi*  Brendan O'Connor*  Chris Dyer*  Kevin Gimpel†  Nathan Schneider*  Noah A. Smith*

*School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
†Toyota Technological Institute at Chicago, Chicago, IL 60637, USA
Corresponding author: brenocon@cs.cmu.edu
Abstract

We consider the problem of part-of-speech tagging for informal, online conversational text. We systematically evaluate the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy. With these features, our system achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks; Twitter tagging is improved from 90% to 93% accuracy (more than 3% absolute). Qualitative analysis of these word clusters yields insights about NLP and linguistic phenomena in this genre. Additionally, we contribute the first POS annotation guidelines for such text and release a new dataset of English language tweets annotated using these guidelines. Tagging software, annotation guidelines, and large-scale word clusters are available at: http://www.ark.cs.cmu.edu/TweetNLP

This paper describes release 0.3 of the "CMU Twitter Part-of-Speech Tagger" and annotated data.

[This paper is forthcoming in Proceedings of NAACL 2013; Atlanta, GA, USA.]
1 Introduction

Online conversational text, typified by microblogs, chat, and text messages,[1] is a challenge for natural language processing. Unlike the highly edited genres that conventional NLP tools have been developed for, conversational text contains many nonstandard lexical items and syntactic patterns. These are the result of unintentional errors, dialectal variation, conversational ellipsis, topic diversity, and creative use of language and orthography (Eisenstein, 2013). An example is shown in Fig. 1. As a result of this widespread variation, standard modeling assumptions that depend on lexical, syntactic, and orthographic regularity are inappropriate. There is preliminary work on social media part-of-speech (POS) tagging (Gimpel et al., 2011), named entity recognition (Ritter et al., 2011; Liu et al., 2011), and parsing (Foster et al., 2011), but accuracy rates are still significantly lower than for traditional well-edited genres like newswire. Even web text parsing, which is a comparatively easier genre than social media, lags behind newspaper text (Petrov and McDonald, 2012), as does speech transcript parsing (McClosky et al., 2010).

[1] Also referred to as computer-mediated communication.

ikr/! smh/G he/O asked/V fir/P yo/D last/A name/N so/P he/O can/V add/V u/O on/P fb/^ lololol/!

Figure 1: Automatically tagged tweet showing nonstandard orthography, capitalization, and abbreviation. Ignoring the interjections and abbreviations, it glosses as He asked for your last name so he can add you on Facebook. The tagset is defined in Appendix A. Refer to Fig. 2 for word clusters corresponding to some of these words.

To tackle the challenge of novel words and constructions, we create a new Twitter part-of-speech tagger, building on previous work by Gimpel et al. (2011), that includes new large-scale distributional features. This leads to state-of-the-art results in POS tagging for both Twitter and Internet Relay Chat (IRC) text. We also annotated a new dataset of tweets with POS tags, improved the annotations in the previous dataset from Gimpel et al., and developed annotation guidelines for manual POS tagging of tweets. We release all of these resources to the research community:

• an open-source part-of-speech tagger for online conversational text (§2);
• unsupervised Twitter word clusters (§3);
• an improved emoticon detector for conversational text (§4);
• POS annotation guidelines (§5.1); and
• a new dataset of 547 manually POS-annotated tweets (§5).
2 MEMM Tagger

Our tagging model is a first-order maximum entropy Markov model (MEMM), a discriminative sequence model for which training and decoding are extremely efficient (Ratnaparkhi, 1996; McCallum et al., 2000).[2] The probability of a tag y_t is conditioned on the input sequence x and the tag to its left y_{t-1}, and is parameterized by a multiclass logistic regression:

    p(y_t = k | y_{t-1}, x, t, θ) ∝ exp( θ^(trans)_{y_{t-1},k} + Σ_j θ^(obs)_{j,k} f_j(x, t) )

We use transition features for every pair of labels, and extract base observation features from token t and neighboring tokens, and conjoin them against all K = 25 possible outputs in our coarse tagset (Appendix A). Our feature sets will be discussed below in detail.
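The local distribution above can be sketched as a softmax over a transition weight plus a weighted sum of observation features. This is an illustrative toy re-implementation (the weight dictionaries and feature encoding are invented for the example), not the released tagger's code:

```python
import math

def memm_local_probs(theta_trans, theta_obs, feats, prev_tag, tags):
    """Local MEMM distribution p(y_t = k | y_{t-1}, x, t): softmax over a
    transition weight plus a weighted sum of observation features."""
    scores = {}
    for k in tags:
        s = theta_trans.get((prev_tag, k), 0.0)
        s += sum(theta_obs.get((f, k), 0.0) * v for f, v in feats.items())
        scores[k] = s
    m = max(scores.values())  # subtract the max score for numerical stability
    exps = {k: math.exp(s - m) for k, s in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}
```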
Decoding. For experiments reported in this paper, we use the O(|x|K²) Viterbi algorithm for prediction; K is the number of tags. This exactly maximizes p(y | x), but the MEMM also naturally allows a faster O(|x|K) left-to-right greedy decoding:

    for t = 1 ... |x|:   ŷ_t ← argmax_k p(y_t = k | ŷ_{t-1}, x, t, θ)

which we find is 3 times faster and yields similar accuracy as Viterbi (an insignificant accuracy decrease of less than 0.1% absolute on the DAILY547 test set discussed below). Speed is paramount for social media analysis applications, which often require the processing of millions to billions of messages, so we make greedy decoding the default in the released software.

[2] Although when compared to CRFs, MEMMs theoretically suffer from the "label bias" problem (Lafferty et al., 2001), our system substantially outperforms the CRF-based taggers of previous work; and when comparing to the Gimpel et al. system with similar feature sets, we observed little difference in accuracy. This is consistent with conventional wisdom that the quality of lexical features is much more important than the parametric form of the sequence model, at least in our setting: part-of-speech tagging with a small labeled training set.
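The greedy decoder above amounts to a single left-to-right pass that commits to the best tag at each position. A minimal sketch, where `local_probs` stands in for the MEMM's local distribution (names and signatures are illustrative, not the released implementation):

```python
def greedy_decode(local_probs, sentence, start="START"):
    """O(|x|K) left-to-right greedy decoding: commit to the best tag at each
    position given the previous predicted tag. local_probs(prev_tag, x, t)
    must return a dict mapping each tag to its local probability."""
    pred, prev = [], start
    for t in range(len(sentence)):
        dist = local_probs(prev, sentence, t)
        prev = max(dist, key=dist.get)  # greedy commitment, no backtracking
        pred.append(prev)
    return pred
```

Unlike Viterbi, this never revisits an earlier decision, which is what makes it linear in K rather than quadratic.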
This greedy tagger runs at 800 tweets/sec. (10,000 tokens/sec.) on a single CPU core, about 40 times faster than Gimpel et al.'s system. The tokenizer by itself (§4) runs at 3,500 tweets/sec.[3]
Training and regularization. During training, the MEMM log-likelihood for a tagged tweet ⟨x, y⟩ is the sum over the observed token tags y_t, each conditional on the tweet being tagged and the observed previous tag (with a start symbol before the first token in x),

    ℓ(x, y; θ) = Σ_{t=1}^{|x|} log p(y_t | y_{t-1}, x, t, θ).

We optimize the parameters θ with OWL-QN, an L1-capable variant of L-BFGS (Andrew and Gao, 2007; Liu and Nocedal, 1989) to minimize the regularized objective

    argmin_θ  -(1/N) Σ_{⟨x,y⟩} ℓ(x, y; θ) + R(θ)

where N is the number of tokens in the corpus and the sum ranges over all tagged tweets ⟨x, y⟩ in the training data. We use elastic net regularization (Zou and Hastie, 2005), which is a linear combination of L1 and L2 penalties; here j indexes over all features:

    R(θ) = λ1 Σ_j |θ_j| + (λ2/2) Σ_j θ_j²

Using even a very small L1 penalty eliminates many irrelevant or noisy features.
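The elastic net penalty itself is straightforward to compute; a minimal sketch (parameter and variable names are illustrative):

```python
def elastic_net_penalty(theta, lam1, lam2):
    """R(theta) = lam1 * sum_j |theta_j| + (lam2 / 2) * sum_j theta_j**2:
    a linear combination of the L1 (sparsity-inducing) and L2 penalties."""
    l1 = sum(abs(w) for w in theta)
    l2 = sum(w * w for w in theta)
    return lam1 * l1 + 0.5 * lam2 * l2
```

The L1 term is what drives many weights exactly to zero, which is why even a small lam1 prunes noisy features.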
3 Unsupervised Word Clusters

Our POS tagger can make use of any number of possibly overlapping features. While we have only a small amount of hand-labeled data for training, we also have access to billions of tokens of unlabeled conversational text from the web. Previous work has shown that unlabeled text can be used to induce unsupervised word clusters which can improve the performance of many supervised NLP tasks (Koo et al., 2008; Turian et al., 2010; Täckström et al., 2012, inter alia). We use a similar approach here to improve tagging performance for online conversational text. We also make our induced clusters publicly available in the hope that they will be useful for other NLP tasks in this genre.

[3] Runtimes observed on an Intel Core i5 2.4 GHz laptop.
Cluster  Binary path      Top words (by frequency)
A1   111010100010    lmao lmfao lmaoo lmaooo hahahahaha lool ctfu rofl loool lmfaoo lmfaooo lmaoooo lmbo lololol
A2   111010100011    haha hahaha hehe hahahaha hahah aha hehehe ahaha hah hahahah kk hahaa ahah
A3   111010100100    yes yep yup nope yess yesss yessss ofcourse yeap likewise yepp yesh yw yuup yus
A4   111010100101    yeah yea nah naw yeahh nooo yeh noo noooo yeaa ikr nvm yeahhh nahh nooooo
A5   11101011011100  smh jk #fail #random #fact smfh #smh #winning #realtalk smdh #dead #justsaying
B    011101011       u yu yuh yhu uu yuu yew y0u yuhh youh yhuu iget yoy yooh yuo yue juu dya youz yyou
C    11100101111001  w fo fa fr fro ov fer fir whit abou aft serie fore fah fuh w/her w/that fron isn agains
D    111101011000    facebook fb itunes myspace skype ebay tumblr bbm flickr aim msn netflix pandora
E1   0011001         tryna gon finna bouta trynna boutta gne fina gonn tryina fenna qone trynaa qon
E2   0011000         gonna gunna gona gna guna gnna ganna qonna gonnna gana qunna gonne goona
F    0110110111      soo sooo soooo sooooo soooooo sooooooo soooooooo sooooooooo soooooooooo
G1   11101011001010  ;) :p :-) xd ;-) ;d (; :3 ;p =p :-p =)) ;] xdd #gno xddd >:) ;-p >:d 8-) ;-d
G2   11101011001011  :) (: =) :)) :] :') =] ^_^ :))) ^.^ [: ;)) ((: ^__^ (= ^-^ :))))
G3   1110101100111   :( :/ -_- -.- :-( :'( d: :| :s -__- =( =/ >.< -___- :-/ </3 :\ -____- ;( /: :(( >_< =[ :[ #fml
G4   111010110001    <3 xoxo <33 xo <333 #love s2 <URL-twitition.com> #neversaynever <3333

Figure 2: Example word clusters (HMM classes): we list the most probable words, starting with the most probable, in descending order. Boldfaced words appear in the example tweet (Figure 1). The binary strings are root-to-leaf paths through the binary cluster tree. For example usage, see e.g. search.twitter.com, bing.com/social and urbandictionary.com.
3.1 Clustering Method

We obtained hierarchical word clusters via Brown clustering (Brown et al., 1992) on a large set of unlabeled tweets.[4] The algorithm partitions words into a base set of 1,000 clusters, and induces a hierarchy among those 1,000 clusters with a series of greedy agglomerative merges that heuristically optimize the likelihood of a hidden Markov model with a one-class-per-lexical-type constraint. Not only does Brown clustering produce effective features for discriminative models, but its variants are better unsupervised POS taggers than some models developed nearly 20 years later; see comparisons in Blunsom and Cohn (2011). The algorithm is attractive for our purposes since it scales to large amounts of data.

When training on tweets drawn from a single day, we observed time-specific biases (e.g., numerical dates appearing in the same cluster as the word tonight), so we assembled our unlabeled data from a random sample of 100,000 tweets per day from September 10, 2008 to August 14, 2012, and filtered out non-English tweets (about 60% of the sample) using langid.py (Lui and Baldwin, 2012).[5] Each tweet was processed with our tokenizer and lowercased. We normalized all at-mentions to ⟨@MENTION⟩ and URLs/email addresses to their domains (e.g. http://bit.ly/dP8rR8 → ⟨URL-bit.ly⟩). In an effort to reduce spam, we removed duplicated tweet texts (this also removes retweets) before word clustering. This normalization and cleaning resulted in 56 million unique tweets (847 million tokens). We set the clustering software's count threshold to only cluster words appearing 40 or more times, yielding 216,856 word types, which took 42 hours to cluster on a single CPU.

[4] As implemented by Liang (2005), v. 1.3: https://github.com/percyliang/brown-cluster
[5] https://github.com/saffsd/langid.py
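The normalization steps described above might be sketched roughly as follows; this is an approximation for illustration (the released tokenizer's actual rules are more elaborate, and the placeholder strings are assumptions):

```python
import re

def normalize_for_clustering(token):
    """Rough sketch of the pre-clustering normalization: lowercase, collapse
    at-mentions to a single symbol, and reduce URLs to their domain."""
    token = token.lower()
    if token.startswith("@") and len(token) > 1:
        return "<@MENTION>"
    m = re.match(r"https?://([^/\s]+)", token)
    if m:
        return "<URL-%s>" % m.group(1)
    return token
```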
3.2 Cluster Examples

Fig. 2 shows example clusters. Some of the challenging words in the example tweet (Fig. 1) are highlighted. The term lololol (an extension of lol for "laughing out loud") is grouped with a large number of laughter acronyms (A1: "laughing my (fucking) ass off," "cracking the fuck up"). Since expressions of laughter are so prevalent on Twitter, the algorithm creates another laughter cluster (A1's sibling A2), that tends to have onomatopoeic, non-acronym variants (e.g., haha). The acronym ikr ("I know, right?") is grouped with expressive variations of "yes" and "no" (A4). Note that A1–A4 are grouped in a fairly specific subtree; and indeed, in this message ikr and
lololol are both tagged as interjections.

smh ("shaking my head," indicating disapproval) seems related, though is always tagged in the annotated data as a miscellaneous abbreviation (G); the difference between acronyms that are interjections versus other acronyms may be complicated. Here, smh is in a related but distinct subtree from the above expressions (A5); its usage in this example is slightly different from its more common usage, which it shares with the other words in its cluster: message-ending expressions of commentary or emotional reaction, sometimes as a metacomment on the author's message; e.g., Maybe you could get a guy to date you if you actually respected yourself #smh or There is really NO reason why other girls should send my boyfriend a good morning text #justsaying.

We observe many variants of categories traditionally considered closed-class, including pronouns (B: u = "you") and prepositions (C: fir = "for").

There is also evidence of grammatical categories specific to conversational genres of English; clusters E1–E2 demonstrate variations of single-word contractions for "going to" and "trying to," some of which have more complicated semantics.[6]

Finally, the HMM learns about orthographic variants, even though it treats all words as opaque symbols; cluster F consists almost entirely of variants of "so," their frequencies monotonically decreasing in the number of vowel repetitions, a phenomenon called "expressive lengthening" or "affective lengthening" (Brody and Diakopoulos, 2011; Schnoebelen, 2012). This suggests a future direction to jointly model class sequence and orthographic information (Clark, 2003; Smith and Eisner, 2005; Blunsom and Cohn, 2011).

We have built an HTML viewer to browse these and numerous other interesting examples.[7]
[6] One coauthor, a native speaker of the Texan English dialect, notes "finna" (short for "fixing to", cluster E1) may be an immediate future auxiliary, indicating an immediate future tense that is present in many languages (though not in standard English). To illustrate: "She finna go" approximately means "She will go," but sooner, in the sense of "She is about to go."
[7] http://www.ark.cs.cmu.edu/TweetNLP/cluster_viewer.html

3.3 Emoticons and Emoji

We use the term emoticon to mean a face or icon constructed with traditional alphabetic or punctuation symbols, and emoji to mean symbols rendered in software as small pictures, in line with the text.
Since our tokenizer is careful to preserve emoticons and other symbols (see §4), they are clustered just like other words. Similar emoticons are clustered together (G1–G4), including separate clusters of happy [[ :) =) ^_^ ]], sad/disappointed [[ :/ :( -_- </3 ]], love [[ xoxo ]] and winking [[ ;) (^_-) ]] emoticons. The clusters are not perfectly aligned with our POS annotation guidelines; for example, the "sad" emoticon cluster included emotion-bearing terms that our guidelines define as non-emoticons, such as #ugh, #tear, and #fml ("fuck my life"), though these seem potentially useful for sentiment analysis.

One difficult task is classifying different types of symbols in tweets: our annotation guidelines differentiate between emoticons, punctuation, and garbage (apparently non-meaningful symbols or tokenization errors). Several Unicode character ranges are reserved for emoji-style symbols (including the three Unicode hearts in G4); however, depending on the user's software, characters in these ranges might be rendered differently or not at all. We have found instances where the clustering algorithm groups proprietary iOS emoji symbols along with normal emoticons; for example, the character U+E056, which is interpreted on iOS as a smiling face, is in the same G2 cluster as smiley face emoticons. The symbol U+E12F, which represents a picture of a bag of money, is grouped with the words cash and money.
3.4 Cluster-BasedFeatures
Since Brown clusters are hierarchical in a binary
tree, each word is associated with a tree path rep-
resentedasabitstringwithlength16;weusepre-
fixesofthebitstringasfeatures(forallprefixlengths
2f2;4;6;:::;16g). Thisallows sharingof statisti-
calstrength between similar clusters. Using prefix
featuresofhierarchicalclustersinthiswaywassim-
ilarly foundto be effective for named-entity recog-
nition (Turian et al., 2010) and Twitter POS tag-
ging(Ritter etal.,2011).
Whencheckingtoseeifawordisassociatedwith
acluster, the tagger firstnormalizes thewordusing
the same techniques as described in §3.1, thencre-
ates a priority list of fuzzy match transformations
of the word by removing repeated punctuation and repeated characters. If the normalized word is not in a cluster, the tagger considers the fuzzy matches. Although only about 3% of the tokens in the development set (§6) did not appear in a clustering, this method resulted in a relative error decrease of 18% among such word tokens.
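Extracting prefix features from a cluster bit string is simple; a sketch (the feature-string format is invented for illustration):

```python
def cluster_prefix_features(bitstring, lengths=(2, 4, 6, 8, 10, 12, 14, 16)):
    """One feature per prefix of the word's root-to-leaf path: shorter
    prefixes are shared by nearby clusters (pooling statistical strength),
    while longer prefixes are more specific."""
    return ["cpath=" + bitstring[:n] for n in lengths if n <= len(bitstring)]
```

For example, a word on path 111010100010 would share its length-6 prefix feature (cpath=111010) with every cluster in that subtree.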
3.5 Other Lexical Features

Besides unsupervised word clusters, there are two other sets of features that contain generalized lexical class information. We use the tag dictionary feature from Gimpel et al., which adds a feature for a word's most frequent part-of-speech tag.[8] This can be viewed as a feature-based domain adaptation method, since it gives lexical type-level information for standard English words, which the model learns to map from PTB tags to the desired output tags. Second, since the lack of consistent capitalization conventions on Twitter makes it especially difficult to recognize names (Gimpel et al. and Foster et al. (2011) found relatively low accuracy on proper nouns), we added a token-level name list feature, which fires on (non-function) words from names from several sources: Freebase lists of celebrities and video games (Google, 2012), the Moby Words list of US Locations,[9] and lists of male, female, family, and proper names from Mark Kantrowitz's name corpus.[10]
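The rank-valued tag dictionary feature (each PTB-derived tag for a word gets a value based on its frequency rank, so the top tag scores 1, the second 2/3 when there are three tags, and so on) can be sketched as follows; the dictionary contents and feature encoding here are illustrative assumptions, not the released feature templates:

```python
def tagdict_rank_features(word, tag_dict):
    """Rank-valued tag dictionary features: for a word with m possible tags
    (most frequent first), the r-th tag (r = 1..m) gets value (m - r + 1)/m."""
    tags = tag_dict.get(word.lower(), [])  # coarse tags, most frequent first
    m = len(tags)
    return {("tagdict", t): (m - r) / m for r, t in enumerate(tags)}
```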
4 Tokenization and Emoticon Detection

Word segmentation on Twitter is challenging due to the lack of orthographic conventions; in particular, punctuation, emoticons, URLs, and other symbols may have no whitespace separation from textual words (e.g. no:-d,yes should parse as four tokens), and internally may contain alphanumeric symbols that could be mistaken for words: a naive split (/[^a-zA-Z0-9]+/) tokenizer thinks the words "p" and "d" are among the top 100 most common words on Twitter, due to misanalysis of :p and :d. Traditional Penn Treebank–style tokenizers are hardly better, often breaking a string of punctuation characters into a single token per character.

[8] Frequencies came from the Wall Street Journal and Brown corpus sections of the Penn Treebank. If a word has multiple PTB tags, each tag is a feature with value for the frequency rank; e.g. for three different tags in the PTB, this feature gives a value of 1 for the most frequent tag, 2/3 for the second, etc. Coarse versions of the PTB tags are used (Petrov et al., 2011). While 88% of words in the dictionary have only one tag, using rank information seemed to give a small but consistent gain over only using the most common tag, or using binary features conjoined with rank as in Gimpel et al.
[9] http://icon.shef.ac.uk/Moby/mwords.html
[10] http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/0.html
We rewrote twokenize (O'Connor et al., 2010), a rule-based tokenizer, emoticon, and URL detector, for use in the tagger. Emoticons are especially challenging, since they are open-class and productive. We revise O'Connor et al.'s regular expression grammar that describes possible emoticons, adding a grammar of horizontal emoticons (e.g. -_-), known as "Eastern-style,"[11] though we observe high usage in English-speaking Twitter (Fig. 2, G2–G3). We also add a number of other improvements to the patterns. Because this system was used as preprocessing for the word clustering experiment in §3, we were able to infer the emoticon clusters in Fig. 2. Furthermore, whether a token matches the emoticon pattern is also used as a feature in the tagger (§2).

URL recognition is also difficult, since the http:// is often dropped, resulting in protocol-less URLs like about.me. We add recognition patterns for these by using a list of top-level and country domains.
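Patterns of this flavor can be sketched as follows; these are simplified stand-ins for illustration only, not the released twokenize rules, and the domain list is truncated:

```python
import re

# Western emoticons (eyes, optional nose, mouth, in either order) and
# Eastern-style horizontal faces.
EMOTICON = re.compile(r"""^(?:
      [<>]?[:;=8][\-o\*']?[\)\](\[dDpP/\\:}{@|]+     # eyes first, e.g. :-) ;d
    | [\)\](\[dDpP/\\:}{@|]+[\-o\*']?[:;=8][<>]?     # mouth first, e.g. d:
    | [\^\-oO~][_.\-]+[\^\-oO~]                      # horizontal, e.g. -_- ^_^
)$""", re.VERBOSE)

# Protocol-less URLs recognized via a small top-level-domain list.
BARE_URL = re.compile(
    r"^[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:com|org|net|me|ly|co|uk)(?:/\S*)?$",
    re.IGNORECASE)
```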
5 Annotated Dataset

Gimpel et al. (2011) provided a dataset of POS-tagged tweets consisting almost entirely of tweets sampled from one particular day (October 27, 2010). We were concerned about overfitting to time-specific phenomena; for example, a substantial fraction of the messages are about a basketball game happening that day.

We created a new test set of 547 tweets for evaluation. The test set consists of one random English tweet from every day between January 1, 2011 and June 30, 2012. In order for a tweet to be considered English, it had to contain at least one English word other than a URL, emoticon, or at-mention. We noticed biases in the outputs of langid.py, so we instead selected these messages completely manually (going through a random sample of one day's messages until an English message was found).

[11] http://en.wikipedia.org/wiki/List_of_emoticons
5.1 Annotation Methodology

Gimpel et al. provided a tagset for Twitter (shown in Appendix A), which we used unmodified. The original annotation guidelines were not published, but in this work we recorded the rules governing tagging decisions and made further revisions while annotating the new data.[12] Some of our guidelines reiterate or modify rules made by Penn Treebank annotators, while others treat specific phenomena found on Twitter (refer to the next section).

Our tweets were annotated by two annotators who attempted to match the choices made in Gimpel et al.'s dataset. The annotators also consulted the POS annotations in the Penn Treebank (Marcus et al., 1993) as an additional reference. Differences were reconciled by a third annotator in discussion with all annotators.[13] During this process, an inconsistency was found in Gimpel et al.'s data, which we corrected (concerning the tagging of this/that, a change to 100 labels, 0.4%). The new version of Gimpel et al.'s data (called OCT27), as well as the newer messages (called DAILY547), are both included in our data release.
5.2 Compounds in Penn Treebank vs. Twitter

Ritter et al. (2011) annotated tweets using an augmented version of the PTB tagset and presumably followed the PTB annotation guidelines. We wrote new guidelines because the PTB conventions are inappropriate for Twitter in several ways, as shown in the design of Gimpel et al.'s tagset. Importantly, "compound" tags (e.g., nominal+verbal and nominal+possessive) are used because tokenization is difficult or seemingly impossible for the nonstandard word forms that are commonplace in conversational text.

For example, the PTB tokenization splits contractions containing apostrophes: I'm → I/PRP 'm/VBP. But conversational text often contains variants that resist a single PTB tag (like im), or even challenge traditional English grammatical categories (like imma or umma, which both mean "I am going to"). One strategy would be to analyze these forms into a PTB-style tokenization, as discussed in Forsyth (2007), who proposes to analyze doncha as do/VBP ncha/PRP, but notes it would be difficult. We think this is impossible to handle in the rule-based framework used by English tokenizers, given the huge (and possibly growing) number of large compounds like imma, gonna, w/that, etc. These are not rare: the word clustering algorithm discovers hundreds of such words as statistically coherent classes (e.g. clusters E1 and E2 in Fig. 2); and the word imma is the 962nd most common word in our unlabeled corpus, more frequent than cat or near.

[12] The annotation guidelines are available online at http://www.ark.cs.cmu.edu/TweetNLP/
[13] Annotators are coauthors of this paper.
We do not attempt to do Twitter "normalization" into traditional written English (Han and Baldwin, 2011), which we view as a lossy translation task. In fact, many of Twitter's unique linguistic phenomena are due not only to its informal nature, but also a set of authors that heavily skews towards younger ages and minorities, with heavy usage of dialects that are different than the standard American English most often seen in NLP datasets (Eisenstein, 2013; Eisenstein et al., 2011). For example, we suspect that imma may implicate tense and aspect markers from African-American Vernacular English.[14] Trying to impose PTB-style tokenization on Twitter is linguistically inappropriate: should the lexico-syntactic behavior of casual conversational chatter by young minorities be straightjacketed into the stylistic conventions of the 1980s Wall Street Journal? Instead, we would like to directly analyze the syntax of online conversational text on its own terms.

Thus, we choose to leave these word forms untokenized and use compound tags, viewing compositional multiword analysis as challenging future work.[15] We believe that our strategy is sufficient for many applications, such as chunking or named entity recognition; many applications such as sentiment analysis (Turney, 2002; Pang and Lee, 2008, §4.2.3), open information extraction (Carlson et al., 2010; Fader et al., 2011), and information retrieval (Allan and Raghavan, 2002) use POS

[14] See "Tense and aspect" examples in http://en.wikipedia.org/wiki/African_American_Vernacular_English
[15] For example, wtf has compositional behavior in "Wtf just happened??", but only debatably so in "Huh wtf".
Dataset            #Msg.   #Tok.  Tagset    Dates
OCT27              1,827  26,594  App. A    Oct 27–28, 2010
DAILY547             547   7,707  App. A    Jan 2011–Jun 2012
NPSCHAT           10,578  44,997  PTB-like  Oct–Nov 2006
  (w/o sys. msg.)  7,935  37,081
RITTERTW             789  15,185  PTB-like  unknown

Table 1: Annotated datasets: number of messages, tokens, tagset, and date range. More information in §5, §6.3, and §6.2.
patterns that seem quite compatible with our approach. More complex downstream processing like parsing is an interesting challenge, since contraction parsing on traditional text is probably a benefit to current parsers. We believe that any PTB-trained tool requires substantial retraining and adaptation for Twitter due to the huge genre and stylistic differences (Foster et al., 2011); thus tokenization conventions are a relatively minor concern. Our simple-to-annotate conventions make it easier to produce new training data.
6 Experiments

We are primarily concerned with performance on our annotated datasets described in §5 (OCT27, DAILY547), though for comparison to previous work we also test on other corpora (RITTERTW in §6.2, NPSCHAT in §6.3). The annotated datasets are listed in Table 1.
6.1 Main Experiments

We use OCT27 to refer to the entire dataset described in Gimpel et al.; it is split into training, development, and test portions (OCT27TRAIN, OCT27DEV, OCT27TEST). We use DAILY547 as an additional test set. Neither OCT27TEST nor DAILY547 were extensively evaluated against until final ablation testing when writing this paper.

The total number of features is 3.7 million, all of which are used under pure L2 regularization; but only 60,000 are selected by elastic net regularization with (λ1, λ2) = (0.25, 2), which achieves nearly the same (but no better) accuracy as pure L2,[16] and we use it for all experiments. We observed that it was

[16] We conducted a grid search for the regularizer values on part of DAILY547, and many regularizer values give the best or nearly the best results. We suspect a different setup would have yielded similar results.
[Figure 3 here: two plots of OCT27 development set tagging accuracy (roughly 75–90%) and token coverage (roughly 0.60–0.70) against the number of unlabeled tweets (1e+03 to 1e+07).]

Figure 3: OCT27 development set accuracy using only clusters as features.
Model           In dict.      Out of dict.
Full                93.4              85.0
No clusters   92.0 (-1.4)       79.3 (-5.7)
Total tokens       4,808             1,394

Table 3: DAILY547 accuracies (%) for tokens in and out of a traditional dictionary, for models reported in rows 1 and 3 of Table 2.
possible to get radically smaller models with only a slight degradation in performance: (λ1, λ2) = (4, 0.06) has 0.5% worse accuracy but uses only 1,632 features, a small enough number to browse through manually.

First, we evaluate on the new test set, training on all of OCT27. Due to DAILY547's statistical representativeness, we believe this gives the best view of the tagger's accuracy on English Twitter text. The full tagger attains 93.2% accuracy (final row of Table 2).

To facilitate comparisons with previous work, we ran a series of experiments training only on OCT27's training and development sets, then report test results on both OCT27TEST and all of DAILY547, shown in Table 2. Our tagger achieves substantially higher accuracy than Gimpel et al. (2011).[17]

Feature ablation. A number of ablation tests indicate the word clusters are a very strong source of lexical knowledge. When dropping the tag dictionaries and name lists, the word clusters maintain most of the accuracy (row 2). If we drop the clusters and rely only on tag dictionaries and name lists, accuracy decreases significantly (row 3). In fact, we can remove all observation features except for word clusters: no word features, orthographic fea-

[17] These numbers differ slightly from those reported by Gimpel et al., due to the corrections we made to the OCT27 data, noted in Section 5.1. We retrained and evaluated their tagger (version 0.2) on our corrected dataset.
   Feature set                                       OCT27TEST  DAILY547  NPSCHATTEST
1  All features                                          91.60     92.80        91.19
2  with clusters; without tag dicts, name lists          91.15     92.38        90.66
3  without clusters; with tag dicts, name lists          89.81     90.81        90.00
4  only clusters (and transitions)                       89.50     90.54        89.55
5  without clusters, tag dicts, name lists               86.86     88.30        88.26
6  Gimpel et al. (2011) version 0.2                      88.89     89.17
7  Inter-annotator agreement (Gimpel et al., 2011)                 92.2
8  Model trained on all OCT27                                      93.2

Table 2: Tagging accuracies (%) in ablation experiments. OCT27TEST and DAILY547 95% confidence intervals are roughly ±0.7%. Our final tagger uses all features and also trains on OCT27TEST, achieving 93.2% on DAILY547.
tures, affix n-grams, capitalization, emoticon pat-
terns, etc.—and the accuracy is in fact still better
thanthepreviouswork(row 4).
18
We also wanted to know whether to keep the tag dictionary and name list features, but the splits reported in Fig. 2 did not show statistically significant differences; so to better discriminate between ablations, we created a lopsided train/test split of all data with a much larger test portion (26,974 tokens), having greater statistical power (tighter confidence intervals of ±0.3%).[19] The full system got 90.8% while the no-tag-dictionary, no-name-list ablation had 90.0%, a statistically significant difference. Therefore we retain these features.
Compared to the tagger in Gimpel et al., most of our feature changes are in the new lexical features described in §3.5.[20] We do not reuse the other lexical features from the previous work, including a phonetic normalizer (Metaphone), a name list consisting of words that are frequently capitalized, and distributional features trained on a much smaller unlabeled corpus; they are all worse than our new lexical features described here. (We did include, however, a variant of the tag dictionary feature that uses phonetic normalization for lookup; it seemed to yield a small improvement.)
[18] Furthermore, when evaluating the clusters as unsupervised (hard) POS tags, we obtain a many-to-one accuracy of 89.2% on DAILY547. Before computing this, we lowercased the text to match the clusters and removed tokens tagged as URLs and at-mentions.
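The many-to-one evaluation used in this footnote can be sketched in a few lines: each induced cluster is mapped to the gold tag it co-occurs with most often, and that mapping is scored as if it were a hard tagger. The helper name and toy data below are illustrative, not from the paper.

```python
from collections import Counter, defaultdict

def many_to_one_accuracy(clusters, gold_tags):
    """Many-to-one accuracy: map each cluster to its majority gold tag,
    then score the resulting deterministic tagging."""
    assert len(clusters) == len(gold_tags)
    # Count (cluster, gold tag) co-occurrences.
    counts = defaultdict(Counter)
    for c, t in zip(clusters, gold_tags):
        counts[c][t] += 1
    # Each cluster votes for the gold tag it co-occurs with most often.
    mapping = {c: tags.most_common(1)[0][0] for c, tags in counts.items()}
    correct = sum(1 for c, t in zip(clusters, gold_tags) if mapping[c] == t)
    return correct / len(gold_tags)

# Toy example: cluster c1 is impure (2 N's, 1 V), c2 is pure.
clusters = ["c1", "c1", "c1", "c2", "c2"]
gold     = ["N",  "N",  "V",  "V",  "V"]
print(many_to_one_accuracy(clusters, gold))  # 0.8
```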
[19] Reported confidence intervals in this paper are 95% binomial normal approximation intervals for the proportion of correctly tagged tokens: 1.96·√(p(1−p)/n_tokens) ≈ 1/√n.
[20] Details on the exact feature set are available in a technical report (Owoputi et al., 2012), also available on the website.
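The interval formula in footnote 19 is simple to compute directly. The accuracy and token count below are illustrative, not figures quoted from the paper; note that the 1/√n shorthand corresponds to the conservative p = 0.5 case, where 1.96·√(p(1−p)) ≈ 1.

```python
import math

def binomial_ci_halfwidth(p, n, z=1.96):
    """95% normal-approximation half-width for a proportion p over n tokens."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical test set: 93.2% accuracy over 5,000 tokens.
hw = binomial_ci_halfwidth(0.932, 5000)
print(f"93.2% +/- {100 * hw:.2f}%")  # prints: 93.2% +/- 0.70%
```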
Non-traditional words. The word clusters are especially helpful with words that do not appear in traditional dictionaries. We constructed a dictionary by lowercasing the union of the ispell 'American', 'British', and 'English' dictionaries, plus the standard Unix words file from Webster's Second International dictionary, totalling 260,985 word types. After excluding tokens defined by the gold standard as punctuation, URLs, at-mentions, or emoticons,[21] 22% of DAILY547's tokens do not appear in this dictionary. Without clusters, they are very difficult to classify (only 79.2% accuracy), but adding clusters generates a 5.7 point improvement—much larger than the effect on in-dictionary tokens (Table 3).
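The out-of-dictionary analysis above can be sketched as follows; the tiny dictionary and token list are toy stand-ins for the 260,985-type dictionary and DAILY547, and the function name is illustrative.

```python
def out_of_dictionary_rate(tokens, dictionary, excluded_tags, gold_tags):
    """Fraction of tokens absent from a lowercased dictionary, after
    excluding tokens whose gold tag marks them as punctuation, URL,
    at-mention, or emoticon (tags ',', 'U', '@', 'E' in this tagset)."""
    kept = [w for w, t in zip(tokens, gold_tags) if t not in excluded_tags]
    oov = [w for w in kept if w.lower() not in dictionary]
    return len(oov) / len(kept)

# Toy stand-in dictionary and tweet tokens with gold tags.
dictionary = {"so", "tired", "of", "the", "rain"}
tokens    = ["soooo", "tired", "of", "da", "rain", ":("]
gold_tags = ["R",     "A",     "P",  "D",  "N",    "E"]
# ":(" is excluded as an emoticon; "soooo" and "da" are OOV -> 2/5.
print(out_of_dictionary_rate(tokens, dictionary, {",", "U", "@", "E"}, gold_tags))  # 0.4
```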
Varying the amount of unlabeled data. A tagger that only uses word clusters achieves an accuracy of 88.6% on the OCT27 development set.[22] We created several clusterings with different numbers of unlabeled tweets, keeping the number of clusters constant at 800. As shown in Fig. 3, there was initially a logarithmic relationship between number of tweets and accuracy, but accuracy (and lexical coverage) levels out after 750,000 tweets. We use the largest clustering (56 million tweets and 1,000 clusters) as the default for the released tagger.
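One plausible way such clusterings are consumed at tagging time is to map each word to its hierarchical cluster bit-string and emit bit-prefix features, giving coarse-to-fine word classes. The paths-file format and feature names below are assumptions based on common Brown clustering tools, not specifics from this paper.

```python
def load_clusters(lines):
    """Parse cluster 'paths' lines of the assumed form
    '<bit-string>\t<word>\t<count>' into a word -> bit-string map."""
    return {word: bits for bits, word, _count in
            (line.rstrip("\n").split("\t") for line in lines)}

def cluster_features(word, clusters, prefixes=(2, 4, 6)):
    """Emit hierarchical features from bit-string prefixes for a token;
    lowercase lookup mirrors clusters trained on lowercased text."""
    bits = clusters.get(word.lower())
    if bits is None:
        return []
    return [f"cluster_prefix_{k}={bits[:k]}" for k in prefixes]

# Two hypothetical paths-file lines.
paths = ["111010\tlol\t968\n", "111011\tlmao\t422\n"]
clusters = load_clusters(paths)
print(cluster_features("LOL", clusters))
```

Short prefixes act like coarse part-of-speech classes shared by many words, while longer prefixes distinguish finer-grained groups such as interjection variants.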
6.2 Evaluation on RITTERTW

Ritter et al. (2011) annotated a corpus of 787 tweets[23] with a single annotator, using the PTB tagset plus several Twitter-specific tags, referred to in Table 1 as RITTERTW. Linguistic concerns notwithstanding (§5.2), for a controlled comparison, we train and test our system on this data with the same 4-fold cross-validation setup they used, attaining 90.0% (±0.5%) accuracy. Ritter et al.'s CRF-based tagger had 85.3% accuracy, and their best tagger, trained on a concatenation of PTB, IRC, and Twitter, achieved 88.3% (Table 4).

[21] We retain hashtags since by our guidelines a #-prefixed token is ambiguous between a hashtag and a normal word, e.g. #1 or going #home.
[22] The only observation features are the word clusters of a token and its immediate neighbors.
[23] https://github.com/aritter/twitter_nlp/blob/master/data/annotated/pos.txt

Tagger                                       Accuracy
This work                                    90.0 ±0.5
Ritter et al. (2011), basic CRF tagger       85.3
Ritter et al. (2011), trained on more data   88.3

Table 4: Accuracy comparison on Ritter et al.'s Twitter POS corpus (§6.2).

Tagger           Accuracy
This work        93.4 ±0.3
Forsyth (2007)   90.8

Table 5: Accuracy comparison on Forsyth's NPSCHAT IRC POS corpus (§6.3).
6.3 IRC: Evaluation on NPSCHAT

IRC is another medium of online conversational text, with similar emoticons, misspellings, abbreviations, and acronyms as Twitter data. We evaluate our tagger on the NPS Chat Corpus (Forsyth and Martell, 2007),[24] a PTB part-of-speech-annotated dataset of Internet Relay Chat (IRC) room messages from 2006.

First, we compare to a tagger in the same setup as experiments on this data in Forsyth (2007), training on 90% of the data and testing on 10%; we average results across 10-fold cross-validation.[25] The full tagger model achieved 93.4% (±0.3%) accuracy, significantly improving over the best result they report, 90.8% accuracy with a tagger trained on a mix of several POS-annotated corpora.
We also perform the ablation experiments on this corpus, with a slightly different experimental setup: we first filter out system messages, then split data into 5,067 training and 2,868 test messages. Results show a similar pattern as the Twitter data (see final column of Table 2). Thus the Twitter word clusters are also useful for language in the medium of text chat rooms; we suspect these clusters will be applicable for deeper syntactic and semantic analysis in other online conversational text mediums, such as text messages and instant messages.

[24] Release 1.0: http://faculty.nps.edu/cmartell/NPSChat.htm
[25] Forsyth actually used 30 different 90/10 random splits; we prefer cross-validation because the same test data is never repeated, thus allowing straightforward confidence estimation of accuracy from the number of tokens (via binomial sample variance, footnote 19). In all cases, the models are trained on the same amount of data (90%).
7 Conclusion

We have constructed a state-of-the-art part-of-speech tagger for the online conversational text genres of Twitter and IRC, and have publicly released our new evaluation data, annotation guidelines, open-source tagger, and word clusters at http://www.ark.cs.cmu.edu/TweetNLP.
Acknowledgements

This research was supported in part by the National Science Foundation (IIS-0915187 and IIS-1054319).
A Part-of-Speech Tagset

N  common noun
O  pronoun (personal/WH; not possessive)
^  proper noun
S  nominal + possessive
Z  proper noun + possessive
V  verb including copula, auxiliaries
L  nominal + verbal (e.g. i'm), verbal + nominal (let's)
M  proper noun + verbal
A  adjective
R  adverb
!  interjection
D  determiner
P  pre- or postposition, or subordinating conjunction
&  coordinating conjunction
T  verb particle
X  existential there, predeterminers
Y  X + verbal
#  hashtag (indicates topic/category for tweet)
@  at-mention (indicates a user as a recipient of a tweet)
~  discourse marker, indications of continuation across multiple tweets
U  URL or email address
E  emoticon
$  numeral
,  punctuation
G  other abbreviations, foreign words, possessive endings, symbols, garbage

Table 6: POS tagset from Gimpel et al. (2011) used in this paper, and described further in the released annotation guidelines.
References

J. Allan and H. Raghavan. 2002. Using part-of-speech patterns to reduce query ambiguity. In Proc. of SIGIR.

G. Andrew and J. Gao. 2007. Scalable training of L1-regularized log-linear models. In Proc. of ICML.

P. Blunsom and T. Cohn. 2011. A hierarchical Pitman-Yor process HMM for unsupervised part of speech induction. In Proc. of ACL.

S. Brody and N. Diakopoulos. 2011. Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!: using word lengthening to detect sentiment in microblogs. In Proc. of EMNLP.

P. F. Brown, P. V. de Souza, R. L. Mercer, V. J. Della Pietra, and J. C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4).

A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr., and T. M. Mitchell. 2010. Toward an architecture for never-ending language learning. In Proc. of AAAI.

A. Clark. 2003. Combining distributional and morphological information for part of speech induction. In Proc. of EACL.

J. Eisenstein, N. A. Smith, and E. P. Xing. 2011. Discovering sociolinguistic associations with structured sparsity. In Proc. of ACL.

J. Eisenstein. 2013. What to do about bad language on the internet. In Proc. of NAACL.

A. Fader, S. Soderland, and O. Etzioni. 2011. Identifying relations for open information extraction. In Proc. of EMNLP.

E. N. Forsyth and C. H. Martell. 2007. Lexical and discourse analysis of online chat dialog. In Proc. of ICSC.

E. N. Forsyth. 2007. Improving automated lexical and discourse analysis of online chat dialog. Master's thesis, Naval Postgraduate School.

J. Foster, O. Cetinoglu, J. Wagner, J. L. Roux, S. Hogan, J. Nivre, D. Hogan, and J. van Genabith. 2011. #hardtoparse: POS tagging and parsing the Twitterverse. In Proc. of AAAI-11 Workshop on Analysing Microtext.

K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proc. of ACL.

Google. 2012. Freebase data dumps. http://download.freebase.com/datadumps/.

B. Han and T. Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proc. of ACL.

T. Koo, X. Carreras, and M. Collins. 2008. Simple semi-supervised dependency parsing. In Proc. of ACL.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML.

P. Liang. 2005. Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology.

D. C. Liu and J. Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1).

X. Liu, S. Zhang, F. Wei, and M. Zhou. 2011. Recognizing named entities in tweets. In Proc. of ACL.

M. Lui and T. Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proc. of ACL.

M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2).

A. McCallum, D. Freitag, and F. Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In Proc. of ICML.

D. McClosky, E. Charniak, and M. Johnson. 2010. Automatic domain adaptation for parsing. In Proc. of NAACL.

B. O'Connor, M. Krieger, and D. Ahn. 2010. TweetMotif: exploratory search and topic summarization for Twitter. In Proc. of AAAI Conference on Weblogs and Social Media.

O. Owoputi, B. O'Connor, C. Dyer, K. Gimpel, and N. Schneider. 2012. Part-of-speech tagging for Twitter: Word clusters and other advances. Technical Report CMU-ML-12-107, Carnegie Mellon University.

B. Pang and L. Lee. 2008. Opinion mining and sentiment analysis. Now Publishers.

S. Petrov and R. McDonald. 2012. Overview of the 2012 shared task on parsing the web. Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).

S. Petrov, D. Das, and R. McDonald. 2011. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086.

A. Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Proc. of EMNLP.

A. Ritter, S. Clark, Mausam, and O. Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proc. of EMNLP.

T. Schnoebelen. 2012. Do you smile with your nose? Stylistic variation in Twitter emoticons. University of Pennsylvania Working Papers in Linguistics, 18(2):14.

N. A. Smith and J. Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proc. of ACL.

O. Täckström, R. McDonald, and J. Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proc. of NAACL.