
Chapter 2
Existing users
(including everyone who uses a browser)
2.1 What do I have to do to use XML?
To read it: use any modern web browser; to create it: use an XML
editor.
For the average user of the Web, you don’t need anything except a browser
which works with XML (see the question about browsers). Remember new
XML-related facilities are being invented or implemented all the time (see the
W3C website), so some recent features may not work in all browsers yet.
You can use XML-conformant browsers to look at some of the stable XML
material, such as Jon Bosak’s Shakespeare plays and the molecular
experiments of the Chemical Markup Language (CML). There are some
more example sources listed at
http://xml.coverpages.org/xml.html#examples, and you will find XML
(particularly in the guise of XHTML) being introduced in places where it
won’t break older browsers.
If you want to start preparations for creating your own XML files, see the
questions in the Authors’ Section and the Developers’ Section, particularly
Question 4.10 on page 82.
2.2 What does XML look like (inside)?
Pointy brackets like HTML
The basic structure of XML is similar to other applications of SGML,
including HTML. The basic components can be seen in the following
examples. An XML document starts with an optional Prolog, which can have
two (optional) parts:
1. The XML Declaration:
<?xml version="1.0" encoding="utf-8"?>
This specifies that this is an XML document and that it uses the UTF-8
character encoding (the default; others are available, but support is
only mandated for UTF-8 and UTF-16);
2. A Document Type Declaration if you are using a DTD:
<!DOCTYPE report SYSTEM "http://sales.acme.corp/dtds/salesrep.dtd">
which identifies the type of document (here, ‘report’) and says where
the Document Type Definition (DTD) is stored;
The Prolog is followed by the Document Instance:
1. A root element, which is the outermost (top level) element (start-tag
plus end-tag) which encloses everything else: in the examples below the
root elements are conversation and titlepage;
2. A structured mix of descriptive or prescriptive elements enclosing the
character data content (text), and optionally any attributes
(‘name="value"’ pairs) inside some start-tags.
XML documents can be very simple, with straightforward nested markup of
your own design:
<?xml version="1.0" standalone="yes"?>
<conversation>
<greeting>Hello, world!</greeting>
<response>Stop the planet, I want to get off!</response>
</conversation>
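To see the root element and its nested children as a program sees them, here is a minimal sketch in Python, assuming only the standard library's xml.etree.ElementTree module. The function name describe is made up for illustration and is not part of any XML tool.

```python
import xml.etree.ElementTree as ET

# The sample document from the text, held as bytes so the parser
# reads the XML declaration itself.
DOC = b"""<?xml version="1.0" standalone="yes"?>
<conversation>
<greeting>Hello, world!</greeting>
<response>Stop the planet, I want to get off!</response>
</conversation>"""


def describe(xml_bytes):
    """Return the root element name and a list of (child name, text) pairs."""
    root = ET.fromstring(xml_bytes)
    return root.tag, [(child.tag, child.text) for child in root]


root_name, children = describe(DOC)
print(root_name)
for name, text in children:
    print(f"  {name}: {text}")
```

The root element (conversation) encloses everything else, and each child element carries its character data as text.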
Or they can be more complicated, with a Schema or DTD, and maybe an
internal subset (local DTD changes in [square brackets] within the
Document Type Declaration like the ENTITY declaration below); and an
arbitrarily complex nested structure:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE titlepage
SYSTEM "http://www.foo.bar/dtds/typo.dtd"
[<!ENTITY % active.links "INCLUDE">]>
<titlepage xml:id="BG12273624">
<white-space type="vertical" amount="36"/>
<title font="Baskerville" alignment="centered"
size="24/30">Hello, world!</title>
<white-space type="vertical" amount="12"/>
<!-- In some copies the following
decoration is hand-colored, presumably
by the author -->
<image location="http://www.foo.bar/fleuron.eps"
type="URI" alignment="centered"/>
<white-space type="vertical" amount="24"/>
<author font="Baskerville" size="18/22"
style="italic">Vitam capias</author>
<white-space type="vertical" role="filler"/>
</titlepage>
Or they can be anywhere between: a lot will depend on how you want to
define your document type (or whose you use) and what it will be used for.
Database-generated or program-generated XML documents used in
e-commerce are usually unformatted because they are for machine
consumption, not for human reading, and they may use very long names or
values, with multiple redundancy and sometimes no character data content at
all, just values in attributes:
<?xml version="1.0"?>
<ORDER-UPDATE AUTHMD5="4baf7d7cff5faa3ce67acf66ccda8248"
ORDER-UPDATE-ISSUE="193E22C2-EAF3-11D9-9736-CAFC705A30B3"
ORDER-UPDATE-DATE="2005-07-01T15:34:22.46"
ORDER-UPDATE-DESTINATION="6B197E02-EAF3-11D9-85D5-997710D9978F"
ORDER-UPDATE-ORDERNO="8316ADEA-EAF3-11D9-9955-D289ECBC99F3">
<ORDER-UPDATE-DELTA-MODIFICATION-DETAIL ORDER-UPDATE-XML:ID="BAC352437484">
<ORDER-UPDATE-DELTA-MODIFICATION-VALUE ORDER-UPDATE-ITEM="56"
ORDER-UPDATE-QUANTITY="2000"/>
</ORDER-UPDATE-DELTA-MODIFICATION-DETAIL>
</ORDER-UPDATE>
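Machine-oriented documents like this are read by pulling values out of attributes rather than out of text. The following Python sketch assumes an abbreviated copy of the order update: only one root attribute is kept, the intermediate DETAIL element is dropped, and the ORDER-UPDATE-XML:ID attribute is left out because its prefix is not declared in the fragment and a namespace-aware parser such as ElementTree would be expected to reject it. The function name read_order and the dictionary keys are illustrative only.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the order update shown above (see the assumptions
# in the lead-in); this is not the complete original document.
DOC = b"""<?xml version="1.0"?>
<ORDER-UPDATE ORDER-UPDATE-DATE="2005-07-01T15:34:22.46">
<ORDER-UPDATE-DELTA-MODIFICATION-VALUE ORDER-UPDATE-ITEM="56"
ORDER-UPDATE-QUANTITY="2000"/>
</ORDER-UPDATE>"""


def read_order(xml_bytes):
    """Collect the attribute values; the elements carry no character data."""
    root = ET.fromstring(xml_bytes)
    value = root.find("ORDER-UPDATE-DELTA-MODIFICATION-VALUE")
    return {
        "date": root.get("ORDER-UPDATE-DATE"),
        "item": int(value.get("ORDER-UPDATE-ITEM")),
        "quantity": int(value.get("ORDER-UPDATE-QUANTITY")),
        # An empty element has no text at all.
        "has_text": bool((value.text or "").strip()),
    }


print(read_order(DOC))
```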
2.3 Should I use XML instead of HTML?
Yes, if you need robustness, accuracy, and persistence.
XML allows authors and providers to design their own document markup
instead of being limited by HTML. Document types can be explicitly tailored
to an application, so the cumbersome fudging and poodlefaking that has to
take place with HTML becomes a thing of the past: your markup can always
say what it means. Trivial example:
<date YYYY-MM-DD="2005-12-26">last Monday</date>
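Because the markup says what it means, a program can use the exact date in the attribute while a human reader sees the informal wording. This is a minimal Python sketch using only the standard library; the function name read_date is an assumption made for illustration.

```python
import datetime
import xml.etree.ElementTree as ET

FRAGMENT = '<date YYYY-MM-DD="2005-12-26">last Monday</date>'


def read_date(xml_text):
    """Return the machine-readable date and the human-readable label."""
    elem = ET.fromstring(xml_text)
    machine = datetime.date.fromisoformat(elem.get("YYYY-MM-DD"))
    return machine, elem.text


when, label = read_date(FRAGMENT)
print(label, "->", when.isoformat())
```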
• Information content can be richer and easier to use, because the
descriptive and hypertext linking abilities of XML are much greater
than those available in HTML.
• XML can provide more and better facilities for browser presentation
and performance, using XSLT and CSS stylesheets.
• It removes many of the underlying complexities of SGML-format
HTML (which led to them being ignored and broken) in favour of a
more flexible model, so writing programs to handle XML is much
easier than doing the same for all the old broken HTML.
• Information becomes more accessible and reusable, because the more
flexible markup of XML can be used by any XML software instead of
being restricted to specific manufacturers as has become the case with
HTML.
• XML files can be used outside the Web as well, in existing
document-handling environments (eg publishing).
If your information is transient, or completely static and unreferenced, or
very short and simple, and unlikely to need updating, HTML may be all you
need.
2.4 Someone sent me an XML file. How do I read it?
Open it in an XML browser or XML editor.
If the file is well-formed or valid XML, you can just open it with any
XML-conformant browser (see Question 2.1 on page 21 and Question 2.6 on
page 28). This will display the file in an unformatted view, showing all the
markup in a format that lets you fold up or unfold the nested hierarchy
(click on the little plus and minus symbols), which will at least let you read
something.
If the file contains a link to an XSLT or CSS stylesheet (and the stylesheet
was provided or is web-accessible) then the browser should format the file in
a readable manner (but beware that in-browser formatting is not robust).
If you want to edit the file, you need an XML editor (see Question 4.10 on
page 82). Unless you are very skilled with pointy-bracket markup, do not try
to edit XML files with non-XML editors.
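If a file will not open, a quick well-formedness check often shows why. The sketch below assumes Python's standard xml.etree.ElementTree parser, which checks well-formedness only and does not validate against a DTD or Schema. The function name check_well_formed and the two sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if the text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


good = "<conversation><greeting>Hello, world!</greeting></conversation>"
# The greeting element is never closed, so this is not well-formed.
bad = "<conversation><greeting>Hello, world!</conversation>"

for sample in (good, bad):
    print(check_well_formed(sample))
```

The error message reports the line and column where the parser gave up, which is usually close to the unclosed or mismatched tag.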
2.5 How do I control the formatting of XML?
Use CSS or an XSLT2 stylesheet.
In HTML, default styling was built into the browsers because the tagset of
HTML was predefined and hardwired into browsers. This is still true for
XHTML and HTML5 to some extent. In other XML, where you can define
your own tagset, browsers cannot possibly be expected to guess or know in
advance what names you are going to use and what they will mean, so you
need a stylesheet if you want to display formatted text.
Browsers which read XML will accept and use a CSS stylesheet at a
minimum, but you can also use the more powerful XSLT stylesheet language
to transform your XML into HTML, which browsers, of course, already
know how to display (and that HTML can still use a CSS stylesheet). This way
you get all the document management benefits of using XML, but you don’t
have to worry about your readers needing XML smarts in their browsers.
This transformation is usually done by the document owner, on their server,
so you just get the HTML anyway, possibly unaware that it was XML
originally. But it is also possible to use the (rather limited) built-in XSLT 1.0
transformer in some browsers, and server operators can now also use
Saxon CE, which is a downloadable in-browser version of XSLT2.
Mike Brown writes:
XSLT is an XML document processing language that uses source code that
happens to be written in XML. An XSLT document declares a set of rules
for an XSLT processor to use when interpreting the contents of an XML
document. These rules tell the XSLT processor how to generate a new
XML-like data structure and how that data should be emitted: as an XML
document, as an HTML document, as plain text, or perhaps in some other
format.
This transformation can be done either inside the browser, or by the
server before the file is sent. Transformation in the browser offloads the
processing from the server, but may introduce browser dependencies,
leading to some of your readers being excluded. Transformation in the
server makes the process browser-independent, but places a heavier
processing load on the server.
As with any system where files can be viewed at random by arbitrary users,
the author cannot know what resources (such as fonts) are on the user’s
system, so the same care is needed as with HTML using fonts. To invoke a
stylesheet from an XML file for standalone processing in the browser,
include one of the stylesheet declarations:
<?xml-stylesheet href="foo.xsl" type="text/xsl"?>
<?xml-stylesheet href="foo.css" type="text/css"?>
(substituting the URI of your stylesheet, of course). See
http://www.w3.org/TR/xml-stylesheet/ for the full details. The Cascading
Stylesheet Specification (CSS) provides a simple syntax for assigning styles to
elements, and has been implemented in most browsers.
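A stylesheet declaration is an ordinary processing instruction placed before the root element, so software can find it without rendering anything. The following Python sketch assumes the standard library's xml.dom.minidom. The sample document and the function name stylesheet_declarations are made up for illustration, reusing the foo.xsl name from the declaration above.

```python
from xml.dom import Node, minidom

DOC = """<?xml version="1.0"?>
<?xml-stylesheet href="foo.xsl" type="text/xsl"?>
<conversation><greeting>Hello, world!</greeting></conversation>"""


def stylesheet_declarations(xml_text):
    """Return the raw data of each xml-stylesheet processing instruction."""
    doc = minidom.parseString(xml_text)
    return [
        node.data
        for node in doc.childNodes  # top-level nodes, including the prolog
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
        and node.target == "xml-stylesheet"
    ]


print(stylesheet_declarations(DOC))
```

The href and type values are pseudo-attributes inside the instruction's data, which is why they come back as a single string rather than as parsed attributes.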
Dave Pawson maintains a comprehensive XSL FAQ at
http://www.dpawson.co.uk/xsl/, and his book [Pawson, 2002] (the
Fox book) is available from O’Reilly. XSL uses XML syntax (an XSL
stylesheet is just an XML file) and has widespread support from several
major browser vendors (see the questions on browsers and other software).
XSL comes in two flavours:
• XSL itself, which is a pure formatting language, outputting a Formatting
Objects (FO) file, which needs a text formatter like FOP, XEP, or others
to create printable (PDF) output (but see Question 2.5). Currently I am
not aware of any Web browsers which support direct XSL rendering to
PDF;
• XSLT (T for Transformation), which is a language to specify
transformations of XML into HTML either inside the browser or at the
server before transmission. It can also specify transformations from
one vocabulary of XML to another, and from XML to plain text (which
can be any format, including RTF and LaTeX).
All current versions of Microsoft Internet Explorer, Firefox, Chrome,
Mozilla, Safari, and Opera handle XSLT 1.0 inside the browser. Beware
obsolete browsers like MSIE 5.5, which needs some post-installation surgery
to remove the long-obsolete WD-xsl and replace it with the current
XSL-Transform processor.
WYSIWYG for XSL
There have been attempts to produce pseudo-WYSIWYG editors for
creating XSL[T] stylesheets, but they have mostly been restricted to simple
mapping between input elements and output elements (eg a DocBook
para to a HTML p). Anything beyond this seems likely to fail because of the
infinite complexity of what people want to do with their information. If
you have access to the ACM database, see the paper by Pietriga, Vion-Dury,
and Quint on VXT, from the ACM DocEng ’01 (Atlanta) Proceedings.
Generating HTML on the server
There is a growing use of server-side processors like Cocoon and others,
which let you create, store, and manage your information in XML but serve
it auto-converted to HTML or some other format, thus allowing the output
to be used by any browser. XSLT is also widely used to transform XML into
non-SGML formats for input to other systems (for example to transform
XML into LaTeX for typesetting).
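The idea of storing XML and serving HTML can be illustrated without an XSLT processor. The Python standard library does not include one, so the sketch below is not XSLT: it is a hand-written element-to-element mapping, a stand-in for what a very simple server-side transformation does. The RULES table, the fallback to span, and the function name to_html are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source element names to HTML element names.
RULES = {"conversation": "div", "greeting": "h1", "response": "p"}


def to_html(xml_text):
    """Rebuild the tree with HTML element names and serialise it as HTML."""

    def convert(elem):
        # Unknown elements fall back to a neutral inline container.
        out = ET.Element(RULES.get(elem.tag, "span"))
        out.text = elem.text
        out.tail = elem.tail
        for child in elem:
            out.append(convert(child))
        return out

    html_root = convert(ET.fromstring(xml_text))
    return ET.tostring(html_root, encoding="unicode", method="html")


SOURCE = (
    "<conversation><greeting>Hello, world!</greeting>"
    "<response>Stop the planet, I want to get off!</response></conversation>"
)
print(to_html(SOURCE))
```

A real XSLT stylesheet expresses the same kind of mapping declaratively as template rules, and can also reorder, filter, and compute content, which this sketch does not attempt.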
Alternatives to XSL:FO
Instead of generating PDF via an FO processor, it is possible to use XSLT2
to transform XML to LaTeX for typesetting PDF (as is done for the print
versions of this FAQ, from DocBook to LaTeX). This has the advantage of
being able to make use of LaTeX’s extensive library of prewritten formatting
modules (‘packages’), which avoids much of the wheel-reinventing
currently required with XSL:FO.
Alternatively, David Carlisle’s xmltex reads XML directly, offering
another practical if experimental solution to typesetting XML. One use of
a TeX system that can typeset XML files is as a backend processor for
XSL:FO, serialised as XML. Sebastian Rahtz’s PassiveTeX uses xmltex to
achieve this end.
The TeX FAQ is at http://www.tex.ac.uk/faq. Silmaril maintains the
online version of Peter Flynn’s book on LaTeX, Formatting Information,
which has some examples of XSLT2 conversion [Flynn, 2014].
SGML systems used a similar stylesheet mechanism: some of the common
ones were the FOSI (Formatted Output Specification Instance), which was
standard in defence and industrial engineering applications, especially when
using the Arbortext editor (Adept, then Epic, probably something else next
week); the DynaText/DynaWeb stylesheet used in SGML publishing to the
web; and the Synex stylesheet used in browsers based on the Synex engine
(eg Panorama, whose styling interface was partly adopted in XMetaL), the
expertise of whose designers persists in the DocZilla browser.
2.6 Where can I get an XML browser?
All modern browsers support XML
Current state of existing browser support for XML (1 August 2014):
• Current versions of Microsoft Internet Explorer, Firefox, Safari,
Chrome, Mozilla, and Opera all appear to support XML with CSS
and/or XSLT 1.0 stylesheets. The editor would welcome additional
information and corrections.
• Don’t use Netscape (any version), Internet Explorer 6 or earlier, or any
early versions of Mozilla if you want XML support: they either don’t
have it or were hopelessly broken. Upgrade to a modern browser as
soon as possible.
The remainder of this list is of historical interest only.
• Microsoft Internet Explorer 5.0 and 5.5 handled XML, processing it by
default using a built-in stylesheet written in a Microsoft-specific,
obsolete predecessor of XSLT called XSL (not to be confused with the
real XSLT). The output of the stylesheet is DHTML, which, when
rendered in the browser, shows a coloured, syntax-highlighted version
of the XML document, with collapsible views. If the XML document
references a stylesheet, that stylesheet will be used instead, within the
limitations of MSIE’s incomplete implementation of CSS. MSIE 5.0 and
5.5 can also use stylesheets in another obsolete syntax called WD-xsl,
which should be avoided. These versions can be upgraded to support
real XSLT: see the MSXML FAQ.
MSIE 6.0 and later use real XSLT 1.0, but can use both the obsolete
syntaxes as well.
• Mozilla Firefox 0.9 up, Netscape 6 and 7 (there is no Netscape 5), and
Galeon all have full XML support with XSLT and CSS. In general,
Firefox is more robust than MSIE, and provides better standards
adherence.
I have a user report that Netscape 4.6 and 4.8 support XML, but no
independent verification.
• The authors of the former MultiDoc Pro SGML browser, CITEC (whose
engine was also used in Panorama and other browsers), joined forces
with Mozilla to produce a multi-everything browser called DocZilla,
which read HTML, XML, and SGML, with XSLT and CSS stylesheets.
This ran under Windows and Linux and was at release 1.0 at the time it
became unavailable. This was by far the most ambitious browser
project, and was backed by very solid markup-handling expertise.
I have less information on the XML capabilities of the Mac OS X browser
Safari, which is based on the KHTML engine used in Konqueror. Konqueror
itself does not appear to support XML or XSLT (at least in KDE under
Fedora Core, for example). Safari 1.3.2 (v312.6) under OS 10.3 did
provide partial support for XML, but did not honour an external DTD
modified by an internal subset (thanks to John Haynie for testing this).
Mike Brown writes:
The concept of ‘browsing’ is primarily the result of HTML having the
semantics that it does. In an HTML document there are sections of text
called anchors that are ‘hyperlinked’ to other documents that might be at
remote locations on a network or filesystem. HTML documents provide
cues to a web browser regarding how the document should be displayed
and what kind of behaviours are expected of the browser when the user
interacts with it. The HTML specification provides many suggestions and
requirements for the browser, and provides specific meanings for many
different examples of markup, such as the fact that an <img> element
refers to an image that should be retrieved by the browser and rendered
inline with the adjacent text.
Unlike HTML, XML does not have such inherent semantics at all. There
is no prescribed method for rendering XML documents. Therefore, what it
means to ‘browse’ XML is open to interpretation. For example, an XML
document describing the characteristics of a machine part does not carry
any information about how that information should be presented to a
user. An application is free to use the data to produce an image of the
part, generate a formatted text listing of the information, display the XML
document’s markup with a pretty color scheme, or restructure the data
into a format for storage in a database, transmission over a network, or
input to another program.
However, despite the fact that XML documents are purely descriptive
data files, it is possible to ‘browse’ them in a sense, by rendering them with
stylesheets. A stylesheet is a separate document that provides hints and
algorithms for rendering or transforming the data in the XML document.
HTML users may be familiar with Cascading Style Sheets (CSS). The CSS
stylesheet language is general and powerful enough to be applied to XML
documents, although it is oriented toward visual rendering of the
document and does not allow for complex processing of the document’s
data. By associating an XML document with a CSS stylesheet, it may be
possible to load an XML document in a CSS-aware web browser, and the
browser may be able to provide some kind of rendering of it, even if the
browser does not otherwise know how to read and process XML
documents. However, not all web browsers will load an XML document
correctly, and they are not required to recognise the XML markup that
associates the document with a stylesheet, so one cannot assume that
XML documents can be opened with just any web browser.
A more complex and powerful stylesheet language is XSLT, the