Generating Sequences With Recurrent Neural Networks

Alex Graves
Department of Computer Science
University of Toronto
graves@cs.toronto.edu
Abstract

This paper shows how Long Short-term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time. The approach is demonstrated for text (where the data are discrete) and online handwriting (where the data are real-valued). It is then extended to handwriting synthesis by allowing the network to condition its predictions on a text sequence. The resulting system is able to generate highly realistic cursive handwriting in a wide variety of styles.
1 Introduction
Recurrent neural networks (RNNs) are a rich class of dynamic models that have been used to generate sequences in domains as diverse as music [6, 4], text [30] and motion capture data [29]. RNNs can be trained for sequence generation by processing real data sequences one step at a time and predicting what comes next. Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network's output distribution, then feeding in the sample as input at the next step; in other words, by making the network treat its inventions as if they were real, much like a person dreaming. Although the network itself is deterministic, the stochasticity injected by picking samples induces a distribution over sequences. This distribution is conditional, since the internal state of the network, and hence its predictive distribution, depends on the previous inputs.
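To make the generation loop concrete, here is a minimal sketch (not code from the paper; the model interface is a hypothetical placeholder) of sampling a sequence from a trained next-step predictor:

import numpy as np

def generate(model, num_steps, input_dim, rng=np.random.default_rng(0)):
    # Hypothetical interface: model.step(x, state) returns the parameters of the
    # predictive distribution over the next input plus the updated recurrent state;
    # model.sample(params, rng) draws one data point from that distribution.
    x = np.zeros(input_dim)                    # start from a null input
    state = model.initial_state()
    outputs = []
    for _ in range(num_steps):
        params, state = model.step(x, state)   # deterministic network update
        x = model.sample(params, rng)          # stochasticity comes only from sampling
        outputs.append(x)                      # the sample is fed back as the next input
    return outputs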
RNNs are 'fuzzy' in the sense that they do not use exact templates from the training data to make predictions, but rather, like other neural networks, use their internal representation to perform a high-dimensional interpolation between training examples. This distinguishes them from n-gram models and compression algorithms such as Prediction by Partial Matching [5], whose predictive distributions are determined by counting exact matches between the recent history and the training set.
The result, which is immediately apparent from the samples in this paper, is that RNNs (unlike template-based algorithms) synthesise and reconstitute the training data in a complex way, and rarely generate the same thing twice. Furthermore, fuzzy predictions do not suffer from the curse of dimensionality, and are therefore much better at modelling real-valued or multivariate data than exact matches.
In principle a large enough RNN should be sufficient to generate sequences of arbitrary complexity. In practice however, standard RNNs are unable to store information about past inputs for very long [15]. As well as diminishing their ability to model long-range structure, this 'amnesia' makes them prone to instability when generating sequences. The problem (common to all conditional generative models) is that if the network's predictions are only based on the last few inputs, and these inputs were themselves predicted by the network, it has little opportunity to recover from past mistakes. Having a longer memory has a stabilising effect, because even if the network cannot make sense of its recent history, it can look further back in the past to formulate its predictions. The problem of instability is especially acute with real-valued data, where it is easy for the predictions to stray from the manifold on which the training data lies. One remedy that has been proposed for conditional models is to inject noise into the predictions before feeding them back into the model [31], thereby increasing the model's robustness to surprising inputs. However we believe that a better memory is a more profound and effective solution.
Long Short-term Memory (LSTM) [16] is an RNN architecture designed to be better at storing and accessing information than standard RNNs. LSTM has recently given state-of-the-art results in a variety of sequence processing tasks, including speech and handwriting recognition [10, 12]. The main goal of this paper is to demonstrate that LSTM can use its memory to generate complex, realistic sequences containing long-range structure.
Section 2 defines a 'deep' RNN composed of stacked LSTM layers, and explains how it can be trained for next-step prediction and hence sequence generation. Section 3 applies the prediction network to text from the Penn Treebank and Hutter Prize Wikipedia datasets. The network's performance is competitive with state-of-the-art language models, and it works almost as well when predicting one character at a time as when predicting one word at a time. The highlight of the section is a generated sample of Wikipedia text, which showcases the network's ability to model long-range dependencies. Section 4 demonstrates how the prediction network can be applied to real-valued data through the use of a mixture density output layer, and provides experimental results on the IAM Online Handwriting Database. It also presents generated handwriting samples proving the network's ability to learn letters and short words directly from pen traces, and to model global features of handwriting style. Section 5 introduces an extension to the prediction network that allows it to condition its outputs on a short annotation sequence whose alignment with the predictions is unknown. This makes it suitable for handwriting synthesis, where a human user inputs a text and the algorithm generates a handwritten version of it. The synthesis network is trained on the IAM database, then used to generate cursive handwriting samples, some of which cannot be distinguished from real data by the
naked eye. A method for biasing the samples towards higher probability (and greater legibility) is described, along with a technique for 'priming' the samples on real data and thereby mimicking a particular writer's style. Finally, concluding remarks and directions for future work are given in Section 6.

Figure 1: Deep recurrent neural network prediction architecture. The circles represent network layers, the solid lines represent weighted connections and the dashed lines represent predictions.
2 Prediction Network
Fig. 1 illustrates the basic recurrent neural network prediction architecture used in this paper. An input vector sequence x = (x_1, ..., x_T) is passed through weighted connections to a stack of N recurrently connected hidden layers to compute first the hidden vector sequences h^n = (h^n_1, ..., h^n_T) and then the output vector sequence y = (y_1, ..., y_T). Each output vector y_t is used to parameterise a predictive distribution Pr(x_{t+1} | y_t) over the possible next inputs x_{t+1}. The first element x_1 of every input sequence is always a null vector whose entries are all zero; the network therefore emits a prediction for x_2, the first real input, with no prior information. The network is 'deep' in both space and time, in the sense that every piece of information passing either vertically or horizontally through the computation graph will be acted on by multiple successive weight matrices and nonlinearities.
Note the 'skip connections' from the inputs to all hidden layers, and from all hidden layers to the outputs. These make it easier to train deep networks,
by reducing the number of processing steps between the bottom of the network and the top, and thereby mitigating the 'vanishing gradient' problem [1]. In the special case that N = 1 the architecture reduces to an ordinary, single layer next-step prediction RNN.

The hidden layer activations are computed by iterating the following equations from t = 1 to T and from n = 2 to N:
h^1_t = H(W_{i h^1} x_t + W_{h^1 h^1} h^1_{t-1} + b^1_h)                                      (1)

h^n_t = H(W_{i h^n} x_t + W_{h^{n-1} h^n} h^{n-1}_t + W_{h^n h^n} h^n_{t-1} + b^n_h)          (2)
where the W terms denote weight matrices (e.g. W_{i h^n} is the weight matrix connecting the inputs to the n-th hidden layer, W_{h^1 h^1} is the recurrent connection at the first hidden layer, and so on), the b terms denote bias vectors (e.g. b_y is the output bias vector) and H is the hidden layer function.
Given the hidden sequences, the output sequence is computed as follows:

\hat{y}_t = b_y + \sum_{n=1}^{N} W_{h^n y} h^n_t                                              (3)

y_t = Y(\hat{y}_t)                                                                            (4)
where Y is the output layer function. The complete network therefore defines a function, parameterised by the weight matrices, from input histories x_{1:t} to output vectors y_t.
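As a concrete illustration, the following sketch (my own simplification, not the paper's code) computes Eqs. (1)-(4) for a stack of recurrent layers, with tanh standing in for the hidden layer function H and the identity for the output function Y; the skip connections from the input to every hidden layer and from every hidden layer to the output are included:

import numpy as np

def forward(x_seq, W_ih, W_hh, W_below, b_h, W_hy, b_y):
    # x_seq: sequence of input vectors; W_ih[n], W_hh[n], b_h[n] parameterise hidden
    # layer n as in Eqs. (1)-(2); W_below[n] connects layer n-1 to layer n (n >= 1);
    # W_hy[n] and b_y give the summed output of Eq. (3).
    T, N = len(x_seq), len(W_ih)
    h = [np.zeros_like(b) for b in b_h]        # h^n_{t-1}, initialised to zero
    y_seq = []
    for t in range(T):
        below = None
        for n in range(N):
            a = W_ih[n] @ x_seq[t] + W_hh[n] @ h[n] + b_h[n]   # input skip connection
            if n > 0:
                a = a + W_below[n] @ below                     # connection from the layer below
            h[n] = np.tanh(a)                                  # H taken to be tanh here
            below = h[n]
        y_hat = b_y + sum(W_hy[n] @ h[n] for n in range(N))    # output skip connections, Eq. (3)
        y_seq.append(y_hat)                                    # Y taken to be the identity, Eq. (4)
    return y_seq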
The output vectors y_t are used to parameterise the predictive distribution Pr(x_{t+1} | y_t) for the next input. The form of Pr(x_{t+1} | y_t) must be chosen carefully to match the input data. In particular, finding a good predictive distribution for high-dimensional, real-valued data (usually referred to as density modelling) can be very challenging.
The probability given by the network to the input sequence x is

Pr(x) = \prod_{t=1}^{T} Pr(x_{t+1} | y_t)                                                     (5)

and the sequence loss L(x) used to train the network is the negative logarithm of Pr(x):

L(x) = -\sum_{t=1}^{T} \log Pr(x_{t+1} | y_t)                                                 (6)
The partial derivatives of the loss with respect to the network weights can be efficiently calculated with backpropagation through time [33] applied to the computation graph shown in Fig. 1, and the network can then be trained with gradient descent.
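For discrete data the sequence loss of Eq. (6) is simply a sum of negative log-probabilities; a minimal sketch (my own illustration, assuming the predictive distributions have already been computed as arrays over classes):

import numpy as np

def sequence_loss(probs, targets):
    # probs[t]: the predictive distribution Pr(x_{t+1} | y_t) over classes at step t;
    # targets[t]: the index of the class actually observed at step t+1.
    # Returns L(x) = -sum_t log Pr(x_{t+1} | y_t), as in Eq. (6).
    return -sum(np.log(p[k]) for p, k in zip(probs, targets))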
2.1 Long Short-Term Memory
In most RNNs the hidden layer function H is an elementwise application of a sigmoid function. However we have found that the Long Short-Term Memory (LSTM) architecture [16], which uses purpose-built memory cells to store information, is better at finding and exploiting long range dependencies in the data.
Figure 2: Long Short-term Memory Cell

Fig. 2 illustrates a single LSTM memory cell. For the version of LSTM used in this paper [7] H is implemented by the following composite function:
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)                              (7)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)                              (8)

c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)                              (9)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)                                  (10)

h_t = o_t \tanh(c_t)                                                                          (11)
whereisthelogisticsigmoidfunction,andi,f,oandcarerespectivelythe
inputgate,forgetgate,outputgate,cellandcellinputactivationvectors,allof
whicharethesamesizeasthehiddenvectorh. Theweightmatrixsubscripts
havetheobviousmeaning, for r example W
hi
is the hidden-input gate matrix,
W
xo
is the input-output gate matrix etc. . The e weight matrices from the cell
togatevectors(e.g.W
ci
)arediagonal,soelementmineachgatevectoronly
receives input from element mof the cellvector. . The e bias terms (whichare
addedtoi,f,cando)havebeenomittedforclarity.
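The composite function of Eqs. (7)-(11) translates directly into code; the sketch below (an illustration of the equations rather than the paper's actual implementation) performs one LSTM step, with the diagonal cell-to-gate weights stored as vectors and applied elementwise:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # W is a dict of weight matrices; W['ci'], W['cf'], W['co'] are the diagonal
    # cell-to-gate weights, kept as vectors; b holds the bias vectors.
    i = sigmoid(W['xi'] @ x + W['hi'] @ h_prev + W['ci'] * c_prev + b['i'])   # input gate,   Eq. (7)
    f = sigmoid(W['xf'] @ x + W['hf'] @ h_prev + W['cf'] * c_prev + b['f'])   # forget gate,  Eq. (8)
    c = f * c_prev + i * np.tanh(W['xc'] @ x + W['hc'] @ h_prev + b['c'])     # cell state,   Eq. (9)
    o = sigmoid(W['xo'] @ x + W['ho'] @ h_prev + W['co'] * c + b['o'])        # output gate,  Eq. (10)
    h = o * np.tanh(c)                                                        # hidden output, Eq. (11)
    return h, c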
The original LSTM algorithm used a custom designed approximate gradient calculation that allowed the weights to be updated after every timestep [16]. However the full gradient can instead be calculated with backpropagation through time [11], the method used in this paper. One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large,
leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.^1
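In effect this is an elementwise clamp on the derivatives flowing into each LSTM layer; a minimal sketch (my own illustration; exactly where it sits in the backward pass depends on the implementation):

import numpy as np

def clip_lstm_input_derivatives(grad, lo=-1.0, hi=1.0):
    # grad: derivative of the loss with respect to the inputs of an LSTM layer
    # (before the sigmoid and tanh functions are applied); values outside the
    # predefined range are clamped to avoid numerical problems.
    return np.clip(grad, lo, hi)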
3 Text Prediction
Text data is discrete, and is typically presented to neural networks using 'one-hot' input vectors. That is, if there are K text classes in total, and class k is fed in at time t, then x_t is a length K vector whose entries are all zero except for the k-th, which is one. Pr(x_{t+1} | y_t) is therefore a multinomial distribution, which can be naturally parameterised by a softmax function at the output layer:
Pr(x_{t+1} = k | y_t) = y^k_t = \frac{\exp(\hat{y}^k_t)}{\sum_{k'=1}^{K} \exp(\hat{y}^{k'}_t)}   (12)
Substituting into Eq. (6) we see that

L(x) = -\sum_{t=1}^{T} \log y^{x_{t+1}}_t                                                     (13)

and therefore

\frac{\partial L(x)}{\partial \hat{y}^k_t} = y^k_t - \delta_{k, x_{t+1}}                      (14)
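Eqs. (12)-(14) can be computed per timestep as follows; this sketch (my own illustration) returns the softmax output, the loss contribution and the gradient with respect to the pre-softmax vector \hat{y}_t:

import numpy as np

def softmax_step(y_hat, target_k):
    # y_hat: length-K vector of unnormalised outputs; target_k: index of x_{t+1}.
    y = np.exp(y_hat - y_hat.max())          # subtract the max for numerical stability
    y /= y.sum()                             # Eq. (12): Pr(x_{t+1} = k | y_t)
    loss = -np.log(y[target_k])              # Eq. (13): this step's contribution to L(x)
    grad = y.copy()
    grad[target_k] -= 1.0                    # Eq. (14): dL/dy_hat_k = y_k - delta(k, x_{t+1})
    return y, loss, grad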
The only thing that remains to be decided is which set of classes to use. In most cases, text prediction (usually referred to as language modelling) is performed at the word level. K is therefore the number of words in the dictionary. This can be problematic for realistic tasks, where the number of words (including variant conjugations, proper names, etc.) often exceeds 100,000. As well as requiring many parameters to model, having so many classes demands a huge amount of training data to adequately cover the possible contexts for the words. In the case of softmax models, a further difficulty is the high computational cost of evaluating all the exponentials during training (although several methods have been devised to make training large softmax layers more efficient, including tree-based models [25, 23], low rank approximations [27] and stochastic derivatives [26]). Furthermore, word-level models are not applicable to text data containing non-word strings, such as multi-digit numbers or web addresses.
Character-level language modelling with neural networks has recently been considered [30, 24], and found to give slightly worse performance than equivalent word-level models. Nonetheless, predicting one character at a time is more interesting from the perspective of sequence generation, because it allows the network to invent novel words and strings. In general, the experiments in this paper aim to predict at the finest granularity found in the data, so as to maximise the generative flexibility of the network.
^1 In fact this technique was used in all my previous papers on LSTM, and in my publicly available LSTM code, but I forgot to mention it anywhere; mea culpa.
3.1 Penn Treebank Experiments
TherstsetoftextpredictionexperimentsfocusedonthePennTreebankpor-
tionoftheWallStreetJournalcorpus[22]. Thiswasapreliminarystudywhose
mainpurposewastogaugethepredictivepowerofthenetwork,ratherthanto
generateinterestingsequences.
Although a relatively small text corpus (a little over a million words in total), the Penn Treebank data is widely used as a language modelling benchmark. The training set contains 930,000 words, the validation set contains 74,000 words and the test set contains 82,000 words. The vocabulary is limited to 10,000 words, with all other words mapped to a special 'unknown word' token. The end-of-sentence token was included in the input sequences, and was counted in the sequence loss. The start-of-sentence marker was ignored, because its role is already fulfilled by the null vectors that begin the sequences (c.f. Section 2).
The experiments compared the performance of word and character-level LSTM predictors on the Penn corpus. In both cases, the network architecture was a single hidden layer with 1000 LSTM units. For the character-level network the input and output layers were size 49, giving approximately 4.3M weights in total, while the word-level network had 10,000 inputs and outputs and around 54M weights. The comparison is therefore somewhat unfair, as the word-level network had many more parameters. However, as the dataset is small, both networks were easily able to overfit the training data, and it is not clear whether the character-level network would have benefited from more weights. All networks were trained with stochastic gradient descent, using a learn rate of 0.0001 and a momentum of 0.99. The LSTM derivatives were clipped in the range [-1, 1] (c.f. Section 2.1).
Neural networks are usually evaluated on test data with fixed weights. For prediction problems however, where the inputs are the targets, it is legitimate to allow the network to adapt its weights as it is being evaluated (so long as it only sees the test data once). Mikolov refers to this as dynamic evaluation. Dynamic evaluation allows for a fairer comparison with compression algorithms, for which there is no division between training and test sets, as all data is only predicted once.
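In outline, dynamic evaluation interleaves prediction and weight updates over the test stream; a minimal sketch (hypothetical model interface, not the paper's code):

def dynamic_evaluation(model, test_seq, learn_rate=1e-4):
    # Each symbol is scored before the weights are adapted on it, so every test
    # symbol is still only predicted once and the loss remains an honest estimate.
    total_loss = 0.0
    for t in range(len(test_seq) - 1):
        loss = model.step_loss(test_seq[t], test_seq[t + 1])  # -log Pr(x_{t+1} | history)
        total_loss += loss
        model.apply_gradient_step(learn_rate)                 # then adapt the weights
    return total_loss / (len(test_seq) - 1)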
Since both networks overfit the training data, we also experiment with two types of regularisation: weight noise [18] with a std. deviation of 0.075 applied to the network weights at the start of each training sequence, and adaptive weight noise [8], where the variance of the noise is learned along with the weights using a Minimum Description Length (or equivalently, variational inference) loss function. When weight noise was used, the network was initialised with the final weights of the unregularised network. Similarly, when adaptive weight noise was used, the weights were initialised with those of the network trained with weight noise. We have found that retraining with iteratively increased regularisation is considerably faster than training from random weights with regularisation. Adaptive weight noise was found to be prohibitively slow for the word-level network, so it was regularised with fixed-variance weight noise only.
Table 1: Penn Treebank Test Set Results. 'BPC' is bits-per-character. 'Error' is next-step classification error rate, for either characters or words.

Input | Regularisation    | Dynamic | BPC  | Perplexity | Error (%) | Epochs
char  | none              | no      | 1.32 | 167        | 28.5      | 9
char  | none              | yes     | 1.29 | 148        | 28.0      | 9
char  | weight noise      | no      | 1.27 | 140        | 27.4      | 25
char  | weight noise      | yes     | 1.24 | 124        | 26.9      | 25
char  | adapt. wt. noise  | no      | 1.26 | 133        | 27.4      | 26
char  | adapt. wt. noise  | yes     | 1.24 | 122        | 26.9      | 26
word  | none              | no      | 1.27 | 138        | 77.8      | 11
word  | none              | yes     | 1.25 | 126        | 76.9      | 11
word  | weight noise      | no      | 1.25 | 126        | 76.9      | 14
word  | weight noise      | yes     | 1.23 | 117        | 76.2      | 14
One advantage of adaptive weight noise is that early stopping is not needed (the network can safely be stopped at the point of minimum total 'description length' on the training data). However, to keep the comparison fair, the same training, validation and test sets were used for all experiments.
The results are presented with two equivalent metrics: bits-per-character (BPC), which is the average value of -\log_2 Pr(x_{t+1} | y_t) over the whole test set; and perplexity, which is two to the power of the average number of bits per word (the average word length on the test set is about 5.6 characters, so perplexity is approximately 2^{5.6 BPC}). Perplexity is the usual performance measure for language modelling.
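The conversion between the two metrics is a one-line calculation; for example, with the stated average word length of about 5.6 characters:

def perplexity_from_bpc(bpc, chars_per_word=5.6):
    # perplexity = 2 ** (average bits per word) = 2 ** (chars_per_word * BPC)
    return 2.0 ** (chars_per_word * bpc)

# e.g. perplexity_from_bpc(1.24) is roughly 123, consistent with the values in Table 1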
Table 1 shows that the word-level RNN performed better than the character-level network, but the gap appeared to close when regularisation is used. Overall the results compare favourably with those collected in Tomas Mikolov's thesis [23]. For example, he records a perplexity of 141 for a 5-gram with Kneser-Ney smoothing, 141.8 for a word level feedforward neural network, 131.1 for the state-of-the-art compression algorithm PAQ8 and 123.2 for a dynamically evaluated word-level RNN. However by combining multiple RNNs, a 5-gram and a cache model in an ensemble, he was able to achieve a perplexity of 89.4. Interestingly, the benefit of dynamic evaluation was far more pronounced here than in Mikolov's thesis (he records a perplexity improvement from 124.7 to 123.2 with word-level RNNs). This suggests that LSTM is better at rapidly adapting to new data than ordinary RNNs.
3.2 Wikipedia Experiments
In 2006 Marcus Hutter, Jim Bowery and Matt Mahoney organised the following challenge, commonly known as the Hutter prize [17]: to compress the first 100 million bytes of the complete English Wikipedia data (as it was at a certain time on March 3rd 2006) to as small a file as possible. The file had to include not only the compressed data, but also the code implementing the compression algorithm. Its size can therefore be considered a measure of the minimum description length [13] of the data using a two part coding scheme.
Wikipedia data is interesting from a sequence generation perspective because
it contains not only a huge range of dictionary words, but also many character sequences that would not be included in text corpora traditionally used for language modelling: for example foreign words (including letters from non-Latin alphabets such as Arabic and Chinese), indented XML tags used to define meta-data, website addresses, and markup used to indicate page formatting such as headings, bullet points etc. An extract from the Hutter prize dataset is shown in Figs. 3 and 4.
Therst96Mbytesinthedatawereevenlysplitintosequencesof100bytes
andusedtotrainthenetwork,withtheremaining4Mwereusedforvalidation.
Thedatacontainsatotalof205one-byteunicodesymbols. Thetotalnumber
ofcharactersismuchhigher,sincemanycharacters(especiallythosefromnon-
Latinlanguages) aredenedasmulti-symbolsequences. . Inkeepingwiththe
principleofmodellingthe smallest meaningfulunits inthedata,thenetwork
predictedasinglebyteatatime,andthereforehadsize205inputandoutput
layers.
Wikipedia contains long-range regularities, such as the topic of an article, which can span many thousand words. To make it possible for the network to capture these, its internal state (that is, the output activations h_t of the hidden layers, and the activations c_t of the LSTM cells within the layers) was only reset every 100 sequences. Furthermore the order of the sequences was not shuffled during training, as it usually is for neural networks. The network was therefore able to access information from up to 10K characters in the past when making predictions. The error terms were only backpropagated to the start of each 100 byte sequence, meaning that the gradient calculation was approximate. This form of truncated backpropagation has been considered before for RNN language modelling [23], and found to speed up training (by reducing the sequence length and hence increasing the frequency of stochastic weight updates) without affecting the network's ability to learn long-range dependencies.
A much larger network was used for this data than the Penn data (reflecting the greater size and complexity of the training set) with seven hidden layers of 700 LSTM cells, giving approximately 21.3M weights. The network was trained with stochastic gradient descent, using a learn rate of 0.0001 and a momentum of 0.9. It took four training epochs to converge. The LSTM derivatives were clipped in the range [-1, 1].
As with the Penn data, we tested the network on the validation data with and without dynamic evaluation (where the weights are updated as the data is predicted). As can be seen from Table 2, performance was much better with dynamic evaluation. This is probably because of the long range coherence of Wikipedia data; for example, certain words are much more frequent in some articles than others, and being able to adapt to this during evaluation is advantageous. It may seem surprising that the dynamic results on the validation set were substantially better than on the training set. However this is easily explained by two factors: firstly, the network underfit the training data, and secondly some portions of the data are much more difficult than others (for example, plain text is harder to predict than XML tags).
Table 2: Wikipedia Results (bits-per-character)

Train | Validation (static) | Validation (dynamic)
1.42  | 1.67                | 1.33

To put the results in context, the current winner of the Hutter Prize (a variant of the PAQ-8 compression algorithm [20]) achieves 1.28 BPC on the same data (including the code required to implement the algorithm), mainstream compressors such as zip generally get more than 2, and a character level RNN applied to a text-only version of the data (i.e. with all the XML, markup tags etc. removed) achieved 1.54 on held-out data, which improved to 1.47 when the RNN was combined with a maximum entropy model [24].
A four page sample generated by the prediction network is shown in Figs. 5 to 8. The sample shows that the network has learned a lot of structure from the data, at a wide range of different scales. Most obviously, it has learned a large vocabulary of dictionary words, along with a subword model that enables it to invent feasible-looking words and names: for example "Lochroom River", "Mughal Ralvaldens", "submandration", "swalloped". It has also learned basic punctuation, with commas, full stops and paragraph breaks occurring at roughly the right rhythm in the text blocks.
Being able to correctly open and close quotation marks and parentheses is a clear indicator of a language model's memory, because the closure cannot be predicted from the intervening text, and hence cannot be modelled with short-range context [30]. The sample shows that the network is able to balance not only parentheses and quotes, but also formatting marks such as the equals signs used to denote headings, and even nested XML tags and indentation.
The network generates non-Latin characters such as Cyrillic, Chinese and Arabic, and seems to have learned a rudimentary model for languages other than English (e.g. it generates "es:Geotniaslago" for the Spanish 'version' of an article, and "nl:Rodenbaueri" for the Dutch one). It also generates convincing looking internet addresses (none of which appear to be real).
The network generates distinct, large-scale regions, such as XML headers, bullet-point lists and article text. Comparison with Figs. 3 and 4 suggests that these regions are a fairly accurate reflection of the constitution of the real data (although the generated versions tend to be somewhat shorter and more jumbled together). This is significant because each region may span hundreds or even thousands of timesteps. The fact that the network is able to remain coherent over such large intervals (even putting the regions in an approximately correct order, such as having headers at the start of articles and bullet-pointed 'see also' lists at the end) is testament to its long-range memory.
As with all text generated by language models, the sample does not make sense beyond the level of short phrases. The realism could perhaps be improved with a larger network and/or more data. However, it seems futile to expect meaningful language from a machine that has never been exposed to the sensory