PM > Install-Package XDoc.PDF


C# PDF Text Reader Library
How to get, read, and extract text from a PDF file, with line and table position coordinates, using C# .NET. Free open source examples.


Use C# to Freely Extract Text from PDF Page, Page Region or the Whole PDF File with .NET PDF Control





In this tutorial, you will learn how to read and extract text from a PDF file using C# in ASP.NET MVC web and Windows applications.

  • Read text characters, words, lines, paragraphs from PDF file
  • Read, get text with coordinates
  • Read specified text from PDF document
  • Extract text from PDF specified pages
  • Read text from target PDF page region coordinates
  • Read text from PDF file annotations
  • Scan, extract text from scanned PDF document

How to read, extract text from PDF file using C#

  1. Download XDoc.PDF Text Reader C# library
  2. Install C# library to read text from PDF document
  3. Step by Step Tutorial


























  • C#.NET PDF editing SDK that supports extracting PDF text in Visual Studio .NET Framework projects
  • Free library and component that can extract text from PDF in both .NET WinForms applications and ASPX web pages
  • Online C# source code for quickly extracting text from Adobe PDF documents in a C#.NET class
  • Supports .NET WinForms, ASP.NET MVC in IIS, ASP.NET Ajax, Azure cloud service, DNN (DotNetNuke), SharePoint
  • Supports extracting OCR text from PDF in C#.NET by working with the .NET XImage.OCR SDK
  • Able to extract all or partial text content from a PDF file
  • Supports text extraction from scanned PDF in .NET console applications
  • Enables extracting PDF text to another PDF file, or to TXT and SVG formats


Although you can extract text from a source PDF document by copying and pasting, that approach is time-consuming and makes it difficult to obtain text information and edit PDF text content. With this C#.NET PDF text extraction library, you can instead extract all or part of the text content from a target PDF document, edit the selected text, and export the extracted text in a customized format.







Use text manager to read, extract text contents and information from a PDF page using C#


The PDF Text Manager class (PDFTextMgr) helps you read and extract text information from a PDF page. You can read all of the following text information from a PDF document or its pages.

  • Characters: use method ExtractTextCharacter() to get a list of PDFTextCharacter objects.
  • Words: use method ExtractTextWord() to get a list of PDFTextWord objects.
  • Lines: use method ExtractTextLine() to get a list of PDFTextLine objects.
  • Paragraphs: use method ExtractTextParagraph() to get a list of PDFTextParagraph objects.


//  open a document
String inputFilePath = Program.RootPath + "\\" + "2.pdf";
PDFDocument doc = new PDFDocument(inputFilePath);
//  get text manager from the document
PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);

//  extract different text content from the first page
int pageIndex = 0;
PDFPage page = (PDFPage)doc.GetPage(pageIndex);

//  get all characters in the page
List<PDFTextCharacter> allChars = textMgr.ExtractTextCharacter(page);
//  report characters
foreach (PDFTextCharacter obj in allChars)
{
    Console.WriteLine("Char: " + obj.GetChar() + "; Boundary: " + obj.GetBoundary().ToString());
}

//  get all words in the page
List<PDFTextWord> allWords = textMgr.ExtractTextWord(page);
//  report words
foreach (PDFTextWord obj in allWords)
{
    Console.WriteLine("Word: " + obj.GetContent() + "; Boundary: " + obj.GetBoundary().ToString());
}

//  get all lines in the page
List<PDFTextLine> allLines = textMgr.ExtractTextLine(page);
//  report lines
foreach (PDFTextLine obj in allLines)
{
    Console.WriteLine("Line: " + obj.GetContent() + "; Boundary: " + obj.GetBoundary().ToString());
}






Read, get text coordinates from PDF file using C#


After reading and extracting text content from a PDF document or PDF pages, you will get a list of PDFTextParagraph, PDFTextLine, PDFTextWord, or PDFTextCharacter objects.

All four classes share one common method, GetBoundary(). You can use it to get the text coordinates inside the PDF page.



            List<PDFTextParagraph> textParagraphs = textMgr.ExtractTextParagraph(page);
            foreach (PDFTextParagraph obj in textParagraphs)
            {

                RectangleF textBoundary = obj.GetBoundary();

                Console.WriteLine("Text coordinates: left top point X: " + textBoundary.X);
                Console.WriteLine("Text coordinates: left top point Y: " + textBoundary.Y);
                Console.WriteLine("Text coordinates: area width: " + textBoundary.Width);
                Console.WriteLine("Text coordinates: area height: " + textBoundary.Height);

                Console.WriteLine("Boundary: " + textBoundary.ToString());
            }
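
Once you have a boundary, the standard System.Drawing.RectangleF members give you the remaining edges. The sketch below does not call XDoc.PDF: the rectangle values are made up for illustration, the helper name DescribeBoundary is hypothetical, and it assumes (as the output labels above suggest) that X and Y describe the top-left corner of the text.

```csharp
using System;
using System.Drawing;
using System.Globalization;

public static class BoundaryDemo
{
    // Hypothetical helper: formats the four edges of a boundary rectangle.
    public static string DescribeBoundary(RectangleF b)
    {
        return string.Format(CultureInfo.InvariantCulture,
            "left={0}, top={1}, right={2}, bottom={3}",
            b.Left, b.Top, b.Right, b.Bottom);
    }

    public static void Main()
    {
        // Made-up values standing in for the result of GetBoundary().
        RectangleF boundary = new RectangleF(72f, 100f, 200f, 14f);
        // prints "left=72, top=100, right=272, bottom=114"
        Console.WriteLine(DescribeBoundary(boundary));

        // Check whether the boundary lies entirely inside a region of interest.
        RectangleF region = new RectangleF(50f, 90f, 300f, 50f);
        // prints "True"
        Console.WriteLine(region.Contains(boundary));
    }
}
```

Right and Bottom are simply X + Width and Y + Height, so you can also compute them by hand if you prefer not to depend on System.Drawing.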




C# read, extract text from pdf document


The following C# example source code shows how to read text information (characters, words, lines) from a whole PDF document.



        #region extract text from pdf document
        internal static void extractTextFromPdfFile()
        {
            String inputFilePath = @"C:\demo.pdf";
            // Open a document.
            PDFDocument doc = new PDFDocument(inputFilePath);
            PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);

            // Get all lines in the document.
            List<PDFTextLine> lines = textMgr.ExtractTextLine();

            // Get all words in the document.
            List<PDFTextWord> words = textMgr.ExtractTextWord();

            // Get all characters in the document.
            List<PDFTextCharacter> allChar = textMgr.ExtractTextCharacter();
        }
        #endregion




C# extract text from specified pdf page


The following C# example source code shows how to read text data (chars, words, lines) from a pdf page.



        #region extract text from specified pdf page
        internal static void extractTextFromPdfPage()
        {
            String inputFilePath = @"C:\demo.pdf";
            // Open a document.
            PDFDocument doc = new PDFDocument(inputFilePath);
            PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);
            // Extract text content from first page.
            int pageIndex = 0;
            PDFPage page = (PDFPage)doc.GetPage(pageIndex);

            // Get all lines in the page.
            List<PDFTextLine> lines = textMgr.ExtractTextLine(page);

            // Get all words in the page.
            List<PDFTextWord> words = textMgr.ExtractTextWord(page);

            // Get all characters in the page.
            List<PDFTextCharacter> allChar = textMgr.ExtractTextCharacter(page);
        }
        #endregion




C# extract PDF document text with coordinates


The following C# example source code shows how to read text data (chars, words, lines) from a pdf page region.



        #region extract PDF document text with coordinates
        internal static void extractTextFromPdfSpecifiedPosition()
        {
            String inputFilePath = @"C:\demo.pdf";
            // Open a document.
            PDFDocument doc = new PDFDocument(inputFilePath);
            PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);

            // Extract text content from first page.
            int pageIndex = 0;
            PDFPage page = (PDFPage)doc.GetPage(pageIndex);

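            // Define a 300 x 300 region whose top-left corner is at (200, 200).
            PointF location = new PointF(200f, 200f);
            SizeF size = new SizeF(300f, 300f);
            RectangleF area = new RectangleF(location, size);
            // Select the characters in the region.
            List<PDFTextCharacter> chars = textMgr.SelectChar(page, area);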
        }
        #endregion
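
If you have already extracted words or lines together with their boundaries, you can also filter them by region yourself. The following is a minimal sketch of that idea, not the library's own implementation: TextItem and ItemsInRegion are hypothetical names, and the sample words and coordinates are made up. It keeps every item whose boundary overlaps the region.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

// Hypothetical stand-in for an extracted word: its text and its boundary.
public class TextItem
{
    public string Content;
    public RectangleF Boundary;

    public TextItem(string content, RectangleF boundary)
    {
        Content = content;
        Boundary = boundary;
    }
}

public static class RegionFilterDemo
{
    // Returns the items whose boundary overlaps the given region.
    public static List<TextItem> ItemsInRegion(IEnumerable<TextItem> items, RectangleF region)
    {
        List<TextItem> hits = new List<TextItem>();
        foreach (TextItem item in items)
        {
            if (region.IntersectsWith(item.Boundary))
            {
                hits.Add(item);
            }
        }
        return hits;
    }

    public static void Main()
    {
        // Made-up sample data; real values would come from the extracted objects.
        List<TextItem> words = new List<TextItem>
        {
            new TextItem("Invoice", new RectangleF(210f, 220f, 60f, 12f)),
            new TextItem("Footer", new RectangleF(40f, 760f, 50f, 10f))
        };
        RectangleF area = new RectangleF(new PointF(200f, 200f), new SizeF(300f, 300f));

        // prints "Invoice"
        foreach (TextItem hit in ItemsInRegion(words, area))
        {
            Console.WriteLine(hit.Content);
        }
    }
}
```

Use RectangleF.Contains instead of IntersectsWith if you only want items that lie completely inside the region.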




Select a text item in a PDF page


The following C# example source code shows how to select a text item in a PDF page and get its information, such as text content and boundary data.



Select text characters



//  open a document
String inputFilePath = Program.RootPath + "\\" + "2.pdf";
PDFDocument doc = new PDFDocument(inputFilePath);
//  get a text manager from the document object
PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);

//  get the first page from the document
int pageIndex = 0;
PDFPage page = (PDFPage)doc.GetPage(pageIndex);


//  select char at position (245F, 155F)
PointF cursor = new PointF(245F, 155F);
PDFTextCharacter aChar = textMgr.SelectChar(page, cursor);
if (aChar == null)
{
    Console.WriteLine("No character has been found.");
}
else
{
    Console.WriteLine("Value: " + aChar.GetChar() + "; Boundary: " + aChar.GetBoundary().ToString());
}

//  select chars in the region (250F, 150F, 100F, 100F)
RectangleF region = new RectangleF(250F, 150F, 100F, 100F);
List<PDFTextCharacter> chars = textMgr.SelectChar(page, region);
foreach (PDFTextCharacter obj in chars)
{
    Console.WriteLine("Value: " + obj.GetChar() + "; Boundary: " + obj.GetBoundary().ToString());
}


Select a text line



//  select a line at 150F from the top of the page
PDFTextLine aLine = textMgr.SelectLine(page, 150F);
if (aLine == null)
{
    Console.WriteLine("No line has been found.");
}
else
{
    Console.WriteLine("Line: " + aLine.GetContent());
}







Read, extract text from markup annotations using C#


The code below applies only to text markup annotations: highlight, underline, text delete, and text replace annotations.


  • PDFAnnotHighlight
  • PDFAnnotUnderLine
  • PDFAnnotDeleteLine
  • PDFAnnotTextReplace


String inputFilePath = Program.RootPath + "\\" + "1.pdf";

//  Open the PDF file.
PDFDocument doc = new PDFDocument(inputFilePath);
//  Retrieve all annotations in the document.
List<IPDFAnnot> annots = PDFAnnotHandler.GetAllAnnotations(doc);
foreach (IPDFAnnot annot in annots)
{
    //  For PDFAnnotHighlight, PDFAnnotUnderLine, PDFAnnotDeleteLine and PDFAnnotTextReplace.
    if (annot is IPDFMarkupAnnot)
    {
        //  Get the parent page of the annotation.
        PDFPage page = (PDFPage)doc.GetPage(annot.PageIndex);

        //  Extract text from the target text markup annotation.
        String[] text = PDFAnnotHandler.ExtractText(page, (IPDFMarkupAnnot)annot);
        //  Show the markup text related to the annotation.
        Console.WriteLine("Content: ");
        foreach (String line in text)
        {
            Console.WriteLine(line);
        }
    }
}






Read text from scanned PDF document using C#


If you want to read and extract text content from a scanned PDF document, you need XImage.OCR to process the images inside the PDF document. For details, see the page "How to read, extract text from scanned PDF file using c#".










Frequently Asked Questions

How do I extract specific text from a PDF?

To extract specific text from a PDF file, first locate the target text, either manually or with a text search, and then extract it with a free online tool or an offline (free or paid) application. Using the C# PDF text library, you can run a text search on the target PDF document and extract all or some of the search results in your C# ASP.NET, MVC, and WinForms applications.
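
One simple way to find specific text is to search the content of the extracted lines. The sketch below works on plain strings, such as the values returned by GetContent() in the examples above. FindLines is a hypothetical helper name, the sample lines are made up, and this is only an illustration rather than the library's own search feature.

```csharp
using System;
using System.Collections.Generic;

public static class TextSearchDemo
{
    // Returns the lines that contain the keyword, ignoring case.
    public static List<string> FindLines(IEnumerable<string> lines, string keyword)
    {
        List<string> matches = new List<string>();
        foreach (string line in lines)
        {
            if (line.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matches.Add(line);
            }
        }
        return matches;
    }

    public static void Main()
    {
        // Made-up sample lines standing in for extracted line content.
        List<string> lines = new List<string>
        {
            "Invoice number: 1001",
            "Customer: Example Ltd.",
            "Total invoice amount: 250.00"
        };

        // prints the first and third sample lines
        foreach (string match in FindLines(lines, "invoice"))
        {
            Console.WriteLine(match);
        }
    }
}
```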

How do I export only text from a PDF?

You can save or export the PDF to a text file (.txt), or extract all text content from the PDF document, using a PDF reader or editor program. Using the RasterEdge PDF C# library, you can choose among several methods to export text-only content from a PDF file in a C# class:
  • Convert the PDF file to a text file
  • Convert a text-only PDF file to a Microsoft Word document with well-formatted text
  • Convert a text-only PDF file to web pages (HTML or SVG format) with formatted text
  • Read and extract all text content from the PDF document
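
For the last option, once the text has been extracted as strings you can write it to a .txt file with the standard .NET file API. This is a minimal sketch under stated assumptions: SaveLines is a hypothetical helper name, the sample lines are made up, and the output path in the temp folder is chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TextExportDemo
{
    // Writes each extracted line to the output file as UTF-8 text.
    public static void SaveLines(IEnumerable<string> lines, string outputPath)
    {
        File.WriteAllLines(outputPath, lines, Encoding.UTF8);
    }

    public static void Main()
    {
        // Made-up sample lines standing in for extracted line content.
        List<string> lines = new List<string> { "First line", "Second line" };

        // Illustrative output location in the user's temp folder.
        string outputPath = Path.Combine(Path.GetTempPath(), "extracted-text.txt");
        SaveLines(lines, outputPath);

        Console.WriteLine("Saved " + File.ReadAllLines(outputPath).Length + " lines to " + outputPath);
    }
}
```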

Why can't I extract text from a PDF?

There are several reasons why you may not be able to extract text from a PDF file.
  • The text content is stored as images in the PDF. You need an OCR tool to extract text from PDF images.
  • The PDF document's permission settings prevent you from copying text content. You need owner permission to unlock it.
Using the RasterEdge C# PDF and OCR libraries, you can extract text from a scanned PDF file in C# web and desktop applications.

Why can't I select text in a PDF file?

Here are the reasons why you may not be able to select text in a PDF document.
  • The text content is stored as images in the PDF. You need an OCR tool to convert the images to text content; then you can select the extracted text.
  • The PDF document's permission settings prevent you from selecting text content. You need to unlock the PDF document restriction.
Using the XDoc.PDF C# library, you can search, find, and select the extracted text from the PDF document in your C# Visual Studio .NET projects.

How to print only selected text in PDF?

In Acrobat, you can select a rectangular area of a PDF document and print the contents inside it. Using the RasterEdge C# PDF library, you can select one page, multiple pages, or the whole PDF document to print in your C# ASP.NET web and desktop applications.

How do I extract text from a PDF to Word?

You can copy text from PDF pages in Acrobat and paste it directly into Microsoft Word. You can also open the PDF file in the EdgePDF demo ASP.NET web app, copy the specified text content, and paste it into Microsoft Word. EdgePDF is an ASP.NET web control for building online PDF editor web applications like Acrobat, and it is built on the RasterEdge XDoc.PDF C# library.

How to unlock a PDF to copy text?

You need the owner password to unlock a PDF that restricts text copying. Using the C# PDF library, you can unlock the PDF permission and extract or edit PDF text content in your C# Windows Forms, WPF, ASP.NET, and MVC applications.