
C#.NET PDF SDK - Extract Text from PDF in C#.NET


Use C# to Freely Extract Text from PDF Page, Page Region or the Whole PDF File with .NET PDF Control




C#.NET PDF editing SDK that supports extracting PDF text in Visual Studio on the .NET Framework


Free library and component for extracting text from PDF in both .NET WinForms applications and ASPX web pages


Online C# source code for quickly extracting text from Adobe PDF documents in a C#.NET class


Supports .NET WinForms, ASP.NET MVC in IIS, ASP.NET Ajax, Azure cloud service, DNN (DotNetNuke) and SharePoint


Supports extracting OCR text from PDF in C#.NET by working with the .NET XImage.OCR SDK


Extracts all or part of the text content of a PDF file


Supports text extraction from scanned PDF in .NET console applications


Exports extracted PDF text to another PDF file, or to TXT and SVG formats


XDoc.PDF for .NET offers mature APIs for extracting text content from PDF documents in C#.NET class applications. PDF is a common choice for exchanging files across platforms, but its text often needs to be extracted for use in word processing, presentation and desktop publishing applications.


Users can extract text from a PDF document by copying and pasting, but doing so is time-consuming and makes it difficult to obtain and edit the text. With this C#.NET PDF text extraction library, you can instead extract all or part of the text content from a PDF document, edit the selected text, and export the extracted text in a customized format.






C# extract text from PDF document


Note: If you get the error "Could not load file or assembly 'RasterEdge.Imaging.Basic' or any other assembly or one of its dependencies. An attempt to load a program with an incorrect format", check your project configuration as follows:

       If you are using the x64 libraries/DLLs, right-click the project -> Properties -> Build -> Platform target: x64.

       If you are using the x86 libraries/DLLs, the platform target should be x86.
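
When diagnosing this error, it can help to confirm which bitness the process actually runs with. The following is a minimal sketch that uses only standard .NET members (Environment.Is64BitProcess, Environment.Is64BitOperatingSystem, IntPtr.Size). The names PlatformCheck and DescribeProcess are hypothetical and are not part of the RasterEdge SDK.

```csharp
using System;

internal static class PlatformCheck
{
    // Hypothetical helper, not part of XDoc.PDF.
    // Returns "64-bit" or "32-bit" depending on the current process.
    internal static string DescribeProcess()
    {
        return Environment.Is64BitProcess ? "64-bit" : "32-bit";
    }

    internal static void Main()
    {
        Console.WriteLine("Process: " + DescribeProcess());
        Console.WriteLine("64-bit OS: " + Environment.Is64BitOperatingSystem);
        // IntPtr.Size is 8 in a 64-bit process and 4 in a 32-bit process.
        Console.WriteLine("Pointer size: " + IntPtr.Size + " bytes");
    }
}
```

If the process reports 32-bit while the referenced RasterEdge DLLs are the x64 builds (or the reverse), the platform target is the setting to change.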




        #region extract text from pdf document
        internal static void extractTextFromPdfFile()
        {
            String inputFilePath = @"C:\demo.pdf";
            // Open a document.
            PDFDocument doc = new PDFDocument(inputFilePath);
            PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);

            // Get all lines in the document.
            List<PDFTextLine> lines = textMgr.ExtractTextLine();

            // Get all words in the document.
            List<PDFTextWord> words = textMgr.ExtractTextWord();

            // Get all characters in the document.
            List<PDFTextCharacter> allChar = textMgr.ExtractTextCharacter();
        }
        #endregion




C# extract text from specified PDF page





        #region extract text from specified pdf page
        internal static void extractTextFromPdfPage()
        {
            String inputFilePath = @"C:\demo.pdf";
            // Open a document.
            PDFDocument doc = new PDFDocument(inputFilePath);
            PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);
            // Extract text content from first page.
            int pageIndex = 0;
            PDFPage page = (PDFPage)doc.GetPage(pageIndex);

            // Get all lines in the page.
            List<PDFTextLine> lines = textMgr.ExtractTextLine(page);

            // Get all words in the page.
            List<PDFTextWord> words = textMgr.ExtractTextWord(page);

            // Get all characters in the page.
            List<PDFTextCharacter> allChar = textMgr.ExtractTextCharacter(page);
        }
        #endregion




C# extract PDF document text with coordinates





        #region extract PDF document text with coordinates
        internal static void extractTextFromPdfSpecifiedPosition()
        {
            String inputFilePath = @"C:\demo.pdf";
            // Open a document.
            PDFDocument doc = new PDFDocument(inputFilePath);
            PDFTextMgr textMgr = PDFTextHandler.ExportPDFTextManager(doc);

            // Extract text content from first page.
            int pageIndex = 0;
            PDFPage page = (PDFPage)doc.GetPage(pageIndex);

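            // Define the region by its location and size.
            PointF location = new PointF(200f, 200f);
            SizeF size = new SizeF(300f, 300f);
            RectangleF area = new RectangleF(location, size);

            // Select the characters in that region of the page.
            List<PDFTextCharacter> chars = textMgr.SelectChar(page, area);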
        }
        #endregion
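
The region passed to SelectChar is an ordinary System.Drawing.RectangleF, so its extent can be checked before calling the SDK. The sketch below uses only System.Drawing types. RegionDemo, BuildArea and Covers are hypothetical names. This page does not state the unit or the page origin XDoc.PDF uses for these coordinates, so treat the numbers as illustrative.

```csharp
using System;
using System.Drawing;

internal static class RegionDemo
{
    // Builds the same rectangle as the example above:
    // location (200, 200), size 300 x 300.
    internal static RectangleF BuildArea()
    {
        return new RectangleF(new PointF(200f, 200f), new SizeF(300f, 300f));
    }

    // True when the point (x, y) lies inside the rectangle.
    internal static bool Covers(RectangleF area, float x, float y)
    {
        return area.Contains(new PointF(x, y));
    }

    internal static void Main()
    {
        RectangleF area = BuildArea();
        // Right and Bottom are X + Width and Y + Height.
        Console.WriteLine("Right: " + area.Right + ", Bottom: " + area.Bottom);
        Console.WriteLine("Covers (250, 250): " + Covers(area, 250f, 250f));
        Console.WriteLine("Covers (100, 100): " + Covers(area, 100f, 100f));
    }
}
```

A location of (200, 200) with a size of 300 x 300 therefore spans 200 to 500 on both axes. Text outside that span, such as a point at (100, 100), falls outside the rectangle.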




.NET Class Namespace Required



Add necessary references:


  RasterEdge.Imaging.Basic.dll


  RasterEdge.Imaging.Basic.Codec.dll


  RasterEdge.Imaging.Drawing.dll


  RasterEdge.Imaging.Font.dll


  RasterEdge.Imaging.Processing.dll


  RasterEdge.XImage.Raster.dll


  RasterEdge.XImage.Raster.Core.dll


  RasterEdge.XDoc.PDF.dll


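Use the corresponding namespaces:


  using RasterEdge.Imaging.Basic;


  using RasterEdge.XDoc.PDF;


The examples on this page also use List<T> and PointF, SizeF and RectangleF, which require:


  using System.Collections.Generic;


  using System.Drawing;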





public List<PDFTextLine> ExtractTextLine()

Description:
     Extract all lines in the PDF file.

Return:
     A list of line objects.



public List<PDFTextLine> ExtractTextLine(PDFPage page)

Description:
     Extract all lines from one PDF page.

Return:
     A list of line objects.



public List<PDFTextWord> ExtractTextWord()

Description:
     Extract all words in the PDF file.

Return:
     A list of word objects.



public List<PDFTextWord> ExtractTextWord(PDFPage page)

Description:
     Extract all words from one PDF page.

Return:
     A list of word objects.



public List<PDFTextCharacter> ExtractTextCharacter()

Description:
     Extract all characters in the PDF file.

Return:
     A list of character objects.



public List<PDFTextCharacter> ExtractTextCharacter(PDFPage page)

Description:
     Extract all characters from one PDF page.

Return:
     A list of character objects.
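


The methods above return lists of SDK objects rather than strings, and this page does not show which member of PDFTextLine, PDFTextWord or PDFTextCharacter exposes the text. As a sketch under that limitation, the hypothetical helper below (TextExport.WriteLines, not part of XDoc.PDF) takes the accessor as a delegate. It compiles without the SDK and can write any extracted list to a TXT file once the correct accessor is known.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

internal static class TextExport
{
    // Hypothetical helper, not part of XDoc.PDF.
    // Converts each item to a string with the caller-supplied accessor,
    // writes one item per line, and returns the number of lines written.
    internal static int WriteLines<T>(IEnumerable<T> items, Func<T, string> getText, string outputPath)
    {
        List<string> lines = items.Select(getText).ToList();
        File.WriteAllLines(outputPath, lines);
        return lines.Count;
    }

    internal static void Main()
    {
        // Stand-in data. With the SDK, this would be the list returned by
        // ExtractTextLine(), plus an accessor that returns each line's text.
        var sample = new List<string> { "first line", "second line" };
        string path = Path.Combine(Path.GetTempPath(), "extracted.txt");
        int written = WriteLines(sample, s => s, path);
        Console.WriteLine(written + " lines written to " + path);
    }
}
```

Passing the accessor in keeps the sketch independent of the SDK's object model. The same call works for lines, words or characters by changing only the delegate.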