C# PDF Text Reader Library
How to read and extract text from a PDF file using C#
C# demo code to read and extract text from an Adobe PDF document
In this C# tutorial, you will learn how to read and extract text from a PDF file in ASP.NET MVC web and Windows applications.
- Read text from all pages, from specified pages, or from a region of a PDF page
- Extract text line by line
- Read and extract specially formatted text, such as highlighted text content in a PDF
Read text content from a PDF page region using C#
The C# source code below shows how to use the PDFTextMgr class to read text from a region on a PDF page in ASP.NET MVC web and Windows applications.
- Get a PDFTextMgr object from the method PDFTextHandler.ExportPDFTextManager() with a PDF file loaded
- Use the method PDFTextMgr.SelectChar() with a PointF to get the text character at a specified position on the first PDF page
- Also use the method PDFTextMgr.SelectChar() with a RectangleF to get all text characters within a specified region on the first PDF page
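// requires System, System.Collections.Generic and System.Drawing, plus the PDF library's own namespaces
// open a document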
String inputFilePath = Program.RootPath + "\\" + "2.pdf";
PDFDocument doc = new PDFDocument(inputFilePath);
// get a text manager from the document object
PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);
// get the first page from the document
int pageIndex = 0;
PDFPage page = (PDFPage)doc.GetPage(pageIndex);
// select char at position (245F, 155F)
PointF cursor = new PointF(245F, 155F);
PDFTextCharacter aChar = textMgr.SelectChar(page, cursor);
if (aChar == null)
{
    Console.WriteLine("No character has been found.");
}
else
{
    Console.WriteLine("Value: " + aChar.GetChar() + "; Boundary: " + aChar.GetBoundary().ToString());
}
// select chars in the region (250F, 150F, 100F, 100F)
RectangleF region = new RectangleF(250F, 150F, 100F, 100F);
List<PDFTextCharacter> chars = textMgr.SelectChar(page, region);
foreach (PDFTextCharacter obj in chars)
{
    Console.WriteLine("Value: " + obj.GetChar() + "; Boundary: " + obj.GetBoundary().ToString());
}
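To make the two selections above easier to compare, here is a small sketch that uses only the standard System.Drawing types and no PDF library calls. It prints the edges of the region (250F, 150F, 100F, 100F) and checks whether a point falls inside it. The class name RegionDemo and the helper IsInside are hypothetical names for illustration; how the PDF library maps these values to page coordinates (units and origin) is not stated here, so treat this only as a check of the rectangle arithmetic.

```csharp
using System;
using System.Drawing;

public static class RegionDemo
{
    // Hypothetical helper: true when the point lies inside the rectangle.
    // RectangleF.Contains treats the left/top edges as inclusive and the
    // right/bottom edges as exclusive.
    public static bool IsInside(RectangleF region, PointF point)
    {
        return region.Contains(point);
    }

    public static void Main()
    {
        // RectangleF(x, y, width, height): spans x 250..350 and y 150..250.
        RectangleF region = new RectangleF(250F, 150F, 100F, 100F);
        Console.WriteLine("Left=" + region.Left + ", Top=" + region.Top
            + ", Right=" + region.Right + ", Bottom=" + region.Bottom);

        // The cursor position from the first selection lies just left of the region.
        Console.WriteLine(IsInside(region, new PointF(245F, 155F))); // False

        // A point in the middle of the region.
        Console.WriteLine(IsInside(region, new PointF(300F, 200F))); // True
    }
}
```

The position (245F, 155F) used for the single-character selection sits 5 units to the left of this region, so the two calls in the tutorial code look at different parts of the page.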
Read line text from a PDF page region in C# code
// reuses textMgr and page from the previous example
// select a line at 150F from the top of the page
PDFTextLine aLine = textMgr.SelectLine(page, 150F);
if (aLine == null)
{
    Console.WriteLine("No line has been found.");
}
else
{
    Console.WriteLine("Line: " + aLine.GetContent());
}
How to read and extract highlighted text from a PDF using C#
The code below applies only to text markup annotations:
- PDFAnnotHighlight
- PDFAnnotUnderLine
- PDFAnnotDeleteLine
- PDFAnnotTextReplace
String inputFilePath = Program.RootPath + "\\" + "1.pdf";
// Open the PDF file.
PDFDocument doc = new PDFDocument(inputFilePath);
// Retrieve all annotations in the document.
List<IPDFAnnot> annots = PDFAnnotHandler.GetAllAnnotations(doc);
foreach (IPDFAnnot annot in annots)
{
    // For PDFAnnotHighlight, PDFAnnotUnderLine, PDFAnnotDeleteLine and PDFAnnotTextReplace.
    if (annot is IPDFMarkupAnnot)
    {
        // Get the parent page of the annotation.
        PDFPage page = (PDFPage)doc.GetPage(annot.PageIndex);
        // Extract text from the target text markup annotation.
        String[] text = PDFAnnotHandler.ExtractText(page, (IPDFMarkupAnnot)annot);
        // Show the markup text related to the annotation.
        Console.WriteLine("Content: ");
        foreach (String line in text)
        {
            Console.WriteLine(line);
        }
    }
}
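The extraction call above returns a String[] that the sample prints one entry at a time. If a single string per annotation is more convenient, the entries can be joined with standard .NET calls. The sketch below is independent of the PDF library: the class name MarkupTextDemo, the helper JoinLines, and the sample strings are hypothetical. It also assumes that each array entry is one line of marked-up text, which the loop variable name in the code above suggests but does not confirm.

```csharp
using System;
using System.Linq;

public static class MarkupTextDemo
{
    // Hypothetical helper: trims each entry, drops empty ones, and joins
    // the rest with a single space.
    public static string JoinLines(string[] lines)
    {
        if (lines == null)
        {
            return string.Empty;
        }
        return string.Join(" ", lines
            .Where(l => l != null)
            .Select(l => l.Trim())
            .Where(l => l.Length > 0));
    }

    public static void Main()
    {
        // Sample data standing in for the String[] produced per annotation.
        string[] sample = { "highlighted text that ", "", " wraps onto a second line" };
        Console.WriteLine(JoinLines(sample));
        // prints "highlighted text that wraps onto a second line"
    }
}
```

Joining with a space suits highlights that wrap across lines; use Environment.NewLine as the separator instead if the original line breaks should be preserved.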