C# PDF Text Reader Library
How to read and extract text from a PDF file, with line and table position coordinates, using C# .NET. Free open source examples.
Use C# to Freely Extract Text from PDF Page, Page Region or the Whole PDF File with .NET PDF Control
In this tutorial, you will learn how to read and extract text from a PDF file using C# in ASP.NET MVC web and Windows applications.
- Read text characters, words, lines, paragraphs from PDF file
- Read, get text with coordinates
- Read specified text from PDF document
- Extract text from specified PDF pages
- Read text from target PDF page region coordinates
- Read text from PDF file annotations
- Scan, extract text from scanned PDF document
How to read, extract text from PDF file using C#
- C#.NET PDF editing SDK that supports extracting PDF text in Visual Studio .NET Framework projects
- Free library and component able to extract text from PDF in both .NET WinForms application and ASPX webpage
- Online C# source code for quickly extracting text from an Adobe PDF document in a C#.NET class
- Support .NET WinForms, ASP.NET MVC in IIS, ASP.NET Ajax, Azure cloud service, DNN (DotNetNuke), SharePoint
- Support extracting OCR text from PDF in C#.NET by working with .NET XImage.OCR SDK
- Able to extract all or part of the text content from a PDF file
- Supports text extraction from scanned PDF in .NET console application
- Enable extracting PDF text to another PDF file, or to TXT and SVG formats
Although users can extract text content from a source PDF document by copying and pasting, doing so is time-consuming and makes it difficult to obtain text information and edit PDF text content. With this C#.NET PDF text extraction library, you can instead extract all or part of the text content from a target PDF document, edit selected text content, and export the extracted text in a customized format.
Use text manager to read, extract text contents and information from a PDF page using C#
The PDF Text Manager class (PDFTextMgr) helps you read and extract text information from a PDF page.
You can read the following text information from a PDF document or from individual pages.
- Characters: use method ExtractTextCharacter() to get a list of PDFTextCharacter objects.
- Words: use method ExtractTextWord() to get a list of PDFTextWord objects.
- Lines: use method ExtractTextLine() to get a list of PDFTextLine objects.
- Paragraphs: use method ExtractTextParagraph() to get a list of PDFTextParagraph objects.
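// requires the System and System.Collections.Generic namespaces, plus the PDF library's own namespaces
// open a document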
String inputFilePath = Program.RootPath + "\\" + "2.pdf";
PDFDocument doc = new PDFDocument(inputFilePath);
// get text manager from the document
PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);
// extract different text content from the first page
int pageIndex = 0;
PDFPage page = (PDFPage)doc.GetPage(pageIndex);
// get all characters in the page
List<PDFTextCharacter> allChars = textMgr.ExtractTextCharacter(page);
// report characters
foreach (PDFTextCharacter obj in allChars)
{
    Console.WriteLine("Char: " + obj.GetChar() + "; Boundary: " + obj.GetBoundary().ToString());
}
// get all words in the page
List<PDFTextWord> allWords = textMgr.ExtractTextWord(page);
// report words
foreach (PDFTextWord obj in allWords)
{
    Console.WriteLine("Word: " + obj.GetContent() + "; Boundary: " + obj.GetBoundary().ToString());
}
// get all lines in the page
List<PDFTextLine> allLines = textMgr.ExtractTextLine(page);
// report lines
foreach (PDFTextLine obj in allLines)
{
    Console.WriteLine("Line: " + obj.GetContent() + "; Boundary: " + obj.GetBoundary().ToString());
}
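The feature list above mentions exporting extracted text to TXT. As a minimal sketch of doing that by hand, the helper below writes already-extracted line strings to a text file. TextExport and SaveLines are hypothetical names introduced for illustration; they are not part of the PDF library and use only standard .NET file APIs.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the PDF library.
public static class TextExport
{
    // Writes one extracted line per row to a UTF-8 text file (no byte order mark).
    // An existing file at outputPath is overwritten.
    public static void SaveLines(IEnumerable<string> lines, string outputPath)
    {
        File.WriteAllLines(outputPath, lines, new UTF8Encoding(false));
    }
}
```

With the sample above, the input could be built from the extracted lines, for example allLines.Select(l => l.GetContent()) with System.Linq imported, assuming GetContent() returns the line text as a string, as the Console.WriteLine calls suggest.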
Read, get text coordinates from PDF file using C#
After reading or extracting text content from a PDF document or PDF pages, you get
a list of PDFTextParagraph, PDFTextLine, PDFTextWord, or PDFTextCharacter objects.
All four classes share one method, GetBoundary(). Use it to get the text coordinates inside the PDF page.
List<PDFTextParagraph> textParagraphs = textMgr.ExtractTextParagraph(page);
foreach (PDFTextParagraph obj in textParagraphs)
{
    RectangleF textBoundary = obj.GetBoundary();
    Console.WriteLine("Text coordinates: left top point X: " + textBoundary.X);
    Console.WriteLine("Text coordinates: left top point Y: " + textBoundary.Y);
    Console.WriteLine("Text coordinates: area width: " + textBoundary.Width);
    Console.WriteLine("Text coordinates: area height: " + textBoundary.Height);
    Console.WriteLine("Boundary: " + textBoundary.ToString());
}
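Once every text item exposes a boundary rectangle, the coordinates can be used to arrange items in reading order. The sketch below groups items into rows by their top edge and sorts each row left to right. ReadingOrder, Sort and rowTolerance are hypothetical names, not part of the PDF library. The code assumes that Y grows downward from the top of the page, which matches the "left top point" wording above but should be verified against your own documents.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

// Hypothetical helper, not part of the PDF library.
public static class ReadingOrder
{
    // Orders items top-to-bottom, then left-to-right within a row.
    // Assumption: Y grows downward from the top of the page. Items whose top
    // edges lie within rowTolerance of the row's first item share a row.
    public static List<T> Sort<T>(IEnumerable<T> items,
                                  Func<T, RectangleF> getBoundary,
                                  float rowTolerance = 2f)
    {
        var ordered = new List<T>();
        var row = new List<T>();
        foreach (T item in items.OrderBy(i => getBoundary(i).Y))
        {
            // Start a new row once the top edge moves past the tolerance.
            if (row.Count > 0 &&
                getBoundary(item).Y - getBoundary(row[0]).Y > rowTolerance)
            {
                ordered.AddRange(row.OrderBy(i => getBoundary(i).X));
                row.Clear();
            }
            row.Add(item);
        }
        // Flush the last row.
        ordered.AddRange(row.OrderBy(i => getBoundary(i).X));
        return ordered;
    }
}
```

Because the boundary is supplied through a delegate, the helper can be applied to any of the four text classes, for example ReadingOrder.Sort(words, w => w.GetBoundary()) for a List<PDFTextWord> named words.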
C# read, extract text from pdf document
The following C# example source code shows how to read text information (chars, words, lines) from a pdf document.
#region extract text from pdf document
internal static void extractTextFromPdfFile()
{
    String inputFilePath = @"C:\demo.pdf";
    // Open a document.
    PDFDocument doc = new PDFDocument(inputFilePath);
    PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);
    // Get all lines in the document.
    List<PDFTextLine> lines = textMgr.ExtractTextLine();
    // Get all words in the document.
    List<PDFTextWord> words = textMgr.ExtractTextWord();
    // Get all characters in the document.
    List<PDFTextCharacter> allChar = textMgr.ExtractTextCharacter();
}
#endregion
C# extract text from specified pdf page
The following C# example source code shows how to read text data (chars, words, lines) from a pdf page.
#region extract text from specified pdf page
internal static void extractTextFromPdfPage()
{
    String inputFilePath = @"C:\demo.pdf";
    // Open a document.
    PDFDocument doc = new PDFDocument(inputFilePath);
    PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);
    // Extract text content from first page.
    int pageIndex = 0;
    PDFPage page = (PDFPage)doc.GetPage(pageIndex);
    // Get all lines in the page.
    List<PDFTextLine> lines = textMgr.ExtractTextLine(page);
    // Get all words in the page.
    List<PDFTextWord> words = textMgr.ExtractTextWord(page);
    // Get all characters in the page.
    List<PDFTextCharacter> allChar = textMgr.ExtractTextCharacter(page);
}
#endregion
C# extract PDF document text with coordinates
The following C# example source code shows how to read text data (chars, words, lines) from a pdf page region.
#region extract PDF document text with coordinates
internal static void extractTextFromPdfSpecifiedPosition()
{
    String inputFilePath = @"C:\demo.pdf";
    // Open a document.
    PDFDocument doc = new PDFDocument(inputFilePath);
    PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);
    // Extract text content from first page.
    int pageIndex = 0;
    PDFPage page = (PDFPage)doc.GetPage(pageIndex);
    // Define a 300 x 300 region whose top-left corner is at (200, 200).
    PointF location = new PointF(200f, 200f);
    SizeF size = new SizeF(300f, 300f);
    RectangleF area = new RectangleF(location, size);
    // Select all characters inside the region.
    List<PDFTextCharacter> chars = textMgr.SelectChar(page, area);
}
#endregion
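The sample above shows region selection for characters only. If you need words, lines or paragraphs from a region, one possible approach is to extract them for the whole page and then filter by boundary. The sketch below does that with the standard RectangleF.Contains and RectangleF.IntersectsWith methods. RegionFilter, Select and fullyInside are hypothetical names, not part of the PDF library.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

// Hypothetical helper, not part of the PDF library.
public static class RegionFilter
{
    // Returns the items whose boundary lies entirely inside the region.
    // With fullyInside = false, any overlap with the region is enough.
    public static List<T> Select<T>(IEnumerable<T> items,
                                    Func<T, RectangleF> getBoundary,
                                    RectangleF region,
                                    bool fullyInside = true)
    {
        var hits = new List<T>();
        foreach (T item in items)
        {
            RectangleF box = getBoundary(item);
            bool match = fullyInside ? region.Contains(box)
                                     : region.IntersectsWith(box);
            if (match)
            {
                hits.Add(item);
            }
        }
        return hits;
    }
}
```

For example, RegionFilter.Select(words, w => w.GetBoundary(), area) would keep only the items of a List<PDFTextWord> named words that lie entirely inside the area rectangle defined in the sample above.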
Select a text item in a PDF page
The following C# example source code shows how to select a text item in a PDF page and read its information, such as the text content and boundary data.
Select text characters
// open a document
String inputFilePath = Program.RootPath + "\\" + "2.pdf";
PDFDocument doc = new PDFDocument(inputFilePath);
// get a text manager from the document object
PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);
// get the first page from the document
int pageIndex = 0;
PDFPage page = (PDFPage)doc.GetPage(pageIndex);
// select char at position (245F, 155F)
PointF cursor = new PointF(245F, 155F);
PDFTextCharacter aChar = textMgr.SelectChar(page, cursor);
if (aChar == null)
{
    Console.WriteLine("No character has been found.");
}
else
{
    Console.WriteLine("Value: " + aChar.GetChar() + "; Boundary: " + aChar.GetBoundary().ToString());
}
// select chars in the region (250F, 150F, 100F, 100F)
RectangleF region = new RectangleF(250F, 150F, 100F, 100F);
List<PDFTextCharacter> chars = textMgr.SelectChar(page, region);
foreach (PDFTextCharacter obj in chars)
{
    Console.WriteLine("Value: " + obj.GetChar() + "; Boundary: " + obj.GetBoundary().ToString());
}
Select a text line
// select a line at 150F from the top of the page
PDFTextLine aLine = textMgr.SelectLine(page, 150F);
if (aLine == null)
{
    Console.WriteLine("No line has been found.");
}
else
{
    Console.WriteLine("Line: " + aLine.GetContent());
}
Read, extract text from markup annotations using C#
The code below is only for text markup annotations: highlight annotation, underline annotation, text delete annotation, text replace annotation.
- PDFAnnotHighlight
- PDFAnnotUnderLine
- PDFAnnotDeleteLine
- PDFAnnotTextReplace
String inputFilePath = Program.RootPath + "\\" + "1.pdf";
// Open the PDF file.
PDFDocument doc = new PDFDocument(inputFilePath);
// Retrieve all annotations in the document.
List<IPDFAnnot> annots = PDFAnnotHandler.GetAllAnnotations(doc);
foreach (IPDFAnnot annot in annots)
{
    // For PDFAnnotHighlight, PDFAnnotUnderLine, PDFAnnotDeleteLine and PDFAnnotTextReplace.
    if (annot is IPDFMarkupAnnot)
    {
        // Get the parent page of the annotation.
        PDFPage page = (PDFPage)doc.GetPage(annot.PageIndex);
        // Extract text from the target text markup annotation.
        String[] text = PDFAnnotHandler.ExtractText(page, (IPDFMarkupAnnot)annot);
        // Show the markup text related to the annotation.
        Console.WriteLine("Content: ");
        foreach (String line in text)
        {
            Console.WriteLine(line);
        }
    }
}
Read text from scanned PDF document using C#
If you want to read or extract text content from a scanned PDF document, you need XImage.OCR to process the images inside the PDF document.
For details, see the page "How to read, extract text from scanned PDF file using C#".