
C#: Extract Text from Scanned PDF Using OCR SDK


How to Extract Text from Adobe PDF Document Using OCR Library in C#





Overview



Besides extracting text from TIFF images, C# users can also run OCR on scanned PDF documents. Several options are available and user-configurable. For example, you can direct our .NET OCR SDK to recognize a single page of a PDF document and output its text content. The supported scenarios are listed below.


Recognize the whole PDF document and get all of its text content


Recognize a single page of a PDF document and extract its text content


Define a specific zone on a PDF page and run OCR on that zone only


Recognize a scanned PDF and output the OCR result to an Adobe PDF file


Recognize a scanned PDF and output the OCR result to an MS Word file


Please note that our OCR SDK does not support importing PDF files directly. In the following C# demos, PDF documents are therefore first converted to TIFF image files (both file path string and stream forms are supported) and then recognized.




C# Project DLLs: Extract Text from Scanned PDF Using OCR SDK



To run the following scanned PDF text extraction sample code successfully, please do the following:


Add References


  RasterEdge.XImage.OCR.dll


  RasterEdge.XImage.OCR.Tesseract.dll


  RasterEdge.Imaging.Basic.dll


  RasterEdge.Imaging.Basic.Codec.dll


  RasterEdge.Imaging.Drawing.dll


  RasterEdge.Imaging.Font.dll


  RasterEdge.Imaging.Processing.dll


  RasterEdge.XImage.AdvancedCleanup.Core.dll


  RasterEdge.XImage.Raster.Core.dll


  RasterEdge.XImage.Raster.dll


  RasterEdge.XDoc.PDF.dll


Using Namespaces


  using RasterEdge.XDoc.PDF;


  using RasterEdge.XImage.OCR;


  using RasterEdge.Imaging.Basic;


  using System.Drawing;


  using System.IO;


Note: If you get the error "Could not load file or assembly 'RasterEdge.Imaging.Basic' or one of its dependencies. An attempt was made to load a program with an incorrect format." for this or any other assembly, please check your configuration as follows:

       If you are using the x64 libraries/DLLs, right-click the project -> Properties -> Build -> Platform target: x64.

       If you are using the x86 libraries/DLLs, the platform target should be x86.
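
When diagnosing the "incorrect format" error, it can help to confirm which bitness the process is actually running with. The following is a minimal sketch using only standard .NET calls (Environment.Is64BitProcess and IntPtr.Size). The class and method names are hypothetical and are not part of the OCR SDK.

```csharp
using System;

public static class PlatformCheck
{
    // Returns "64-bit" or "32-bit" depending on the bitness of the running process.
    public static string ProcessBitness()
    {
        return Environment.Is64BitProcess ? "64-bit" : "32-bit";
    }

    public static void Main()
    {
        // A 64-bit process can only load 64-bit (or AnyCPU) DLLs, and vice versa.
        Console.WriteLine("Process bitness: " + ProcessBitness());
        Console.WriteLine("Pointer size: " + IntPtr.Size + " bytes");
    }
}
```

If the reported bitness does not match the DLL set you referenced, adjust the platform target as described above.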




Extract Text from Whole PDF Document in C#




            // Open a PDF file.
            String inputFilePath = @"C:\input.pdf";
            PDFDocument doc = new PDFDocument(inputFilePath);

            // The folder that contains '.traineddata' files.
            OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");

            // Set the output file path. The using block closes the writer even if recognition throws.
            String outputFilePath = @"C:\Output.txt";
            using (StreamWriter writer = new StreamWriter(outputFilePath))
            {
                for (int i = 0; i < doc.GetPageCount(); i++)
                {
                    BasePage page = doc.GetPage(i);
                    // The default resolution is 96. A larger value (192, 288, ...) helps
                    // recognition, but it should not be set too large.
                    Bitmap bmp = page.ConvertToImage(96);
                    OCRPage ocrPage = OCRHandler.Import(bmp);
                    ocrPage.Recognize();
                    // Append the recognized text of this page to the output file.
                    writer.WriteLine(ocrPage.GetText());
                }
            }
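
The sample's comment says a larger resolution helps recognition but should not be too large. One way to see why is to estimate the rendered image size. The sketch below is plain arithmetic and does not call the SDK. It assumes a US Letter page (8.5 x 11 inches) and an uncompressed bitmap with 4 bytes per pixel. Both are illustrative assumptions, and the class and method names are hypothetical.

```csharp
using System;

public static class ResolutionEstimate
{
    // Pixel length of a page side that is 'inches' long when rendered at 'dpi'.
    public static int ToPixels(double inches, int dpi)
    {
        return (int)Math.Round(inches * dpi);
    }

    // Approximate size in bytes of an uncompressed bitmap (assumes 4 bytes per pixel).
    public static long BitmapBytes(double widthInches, double heightInches, int dpi)
    {
        return (long)ToPixels(widthInches, dpi) * ToPixels(heightInches, dpi) * 4;
    }

    public static void Main()
    {
        // Assumed page size: US Letter, 8.5 x 11 inches.
        foreach (int dpi in new[] { 96, 192, 288 })
        {
            Console.WriteLine("{0} DPI: {1} x {2} px, about {3:F1} MB",
                dpi, ToPixels(8.5, dpi), ToPixels(11, dpi),
                BitmapBytes(8.5, 11, dpi) / (1024.0 * 1024.0));
        }
    }
}
```

Under these assumptions, tripling the resolution from 96 to 288 multiplies the pixel count, and therefore the memory per page, by nine.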





Extract Text from Specified PDF Page in C#




            // Open a PDF file.
            String inputFilePath = @"C:\input.pdf";
            PDFDocument doc = new PDFDocument(inputFilePath);
            BasePage page = doc.GetPage(0);
            // The default resolution is 96. A larger value (192, 288, ...) helps
            // recognition, but it should not be set too large.
            Bitmap bmp = page.ConvertToImage(96);
            // The folder that contains '.traineddata' files.
            OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");
            OCRPage ocrPage = OCRHandler.Import(bmp);
            ocrPage.Recognize();
            ocrPage.SaveTo(MIMEType.TXT, @"C:\output.txt");





Extract Text from Specified Zone in PDF Page in C#




            // Open a PDF file.
            String inputFilePath = @"C:\input.pdf";
            PDFDocument doc = new PDFDocument(inputFilePath);
            BasePage page = doc.GetPage(0);
            // The folder that contains '.traineddata' files.
            OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");
            OCRPage ocrPage = OCRHandler.Import(page);
            // Create a page zone starting at point (10, 10) with width 400 and height 300.
            OCRZone pageZone = ocrPage.CreateZone(new Rectangle(10, 10, 400, 300));

            // Run recognition on the zone only.
            pageZone.Recognize();

            // Output the result to a text file.
            pageZone.SaveTo(MIMEType.TXT, @"C:\output.txt");
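
If the page is rendered at a different resolution than the one the zone was measured at, the zone rectangle usually has to be scaled and kept inside the page bounds. The sketch below uses only System.Drawing.Rectangle and assumes that zone coordinates are expressed in pixels of the rendered page. The source does not state this, so treat it as an assumption. ZoneHelper and its methods are hypothetical helpers, not SDK calls.

```csharp
using System;
using System.Drawing;

public static class ZoneHelper
{
    // Scales a zone measured at 'fromDpi' to the equivalent zone at 'toDpi'.
    public static Rectangle ScaleZone(Rectangle zone, int fromDpi, int toDpi)
    {
        double factor = (double)toDpi / fromDpi;
        return new Rectangle(
            (int)Math.Round(zone.X * factor),
            (int)Math.Round(zone.Y * factor),
            (int)Math.Round(zone.Width * factor),
            (int)Math.Round(zone.Height * factor));
    }

    // Clips a zone to the page bounds; the result is empty if there is no overlap.
    public static Rectangle ClipToPage(Rectangle zone, Size pageSize)
    {
        return Rectangle.Intersect(zone, new Rectangle(Point.Empty, pageSize));
    }

    public static void Main()
    {
        // The zone from the sample above, assumed to be measured at 96 DPI.
        Rectangle zone = new Rectangle(10, 10, 400, 300);
        Rectangle scaled = ScaleZone(zone, 96, 192);
        Console.WriteLine(scaled);
        // Clip against a hypothetical 700 x 2112 pixel page.
        Console.WriteLine(ClipToPage(scaled, new Size(700, 2112)));
    }
}
```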





Recognize Scanned PDF and Output OCR Result to PDF in C#



Add Reference (Extra)


  RasterEdge.XDoc.PDF.dll




            // Open a PDF file.
            String inputFilePath = @"C:\input.pdf";
            PDFDocument doc = new PDFDocument(inputFilePath);

            // The folder that contains '.traineddata' files.
            OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");
            // Allocate one in-memory stream per page to hold that page's OCR output.
            Stream[] streams = new MemoryStream[doc.GetPageCount()];
            for (int i = 0; i < doc.GetPageCount(); i++)
            {
                BasePage page = doc.GetPage(i);
                streams[i] = new MemoryStream();
                // The default resolution is 96. A larger value (192, 288, ...) helps
                // recognition, but it should not be set too large.
                Bitmap bmp = page.ConvertToImage(96);
                OCRPage ocrPage = OCRHandler.Import(bmp);
                ocrPage.Recognize();
                ocrPage.SaveTo(MIMEType.PDF, streams[i]);
                streams[i].Seek(0, SeekOrigin.Begin);
            }
            PDFDocument.CombineDocument(streams, @"C:\output.pdf");
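
The Seek(0, SeekOrigin.Begin) call in the loop above matters because a MemoryStream's position sits at the end of the data after writing, so a consumer that reads from the current position would see nothing. The following minimal sketch shows this with standard .NET classes only. StreamRewindDemo is a hypothetical name and the code does not involve the SDK.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamRewindDemo
{
    // Writes sample bytes to a MemoryStream and returns how many bytes
    // can then be read from the current position.
    public static int BytesReadable(bool rewind)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            byte[] data = Encoding.UTF8.GetBytes("page content");
            ms.Write(data, 0, data.Length);

            // Without this, the position stays at the end of the written data.
            if (rewind) ms.Seek(0, SeekOrigin.Begin);

            byte[] buffer = new byte[64];
            return ms.Read(buffer, 0, buffer.Length);
        }
    }

    public static void Main()
    {
        Console.WriteLine("Without rewind: " + BytesReadable(false) + " bytes");
        Console.WriteLine("With rewind:    " + BytesReadable(true) + " bytes");
    }
}
```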





Recognize Scanned PDF and Output OCR Result to Word in C#




            // Open a PDF file.
            String inputFilePath = @"C:\input.pdf";
            PDFDocument doc = new PDFDocument(inputFilePath);

            // The folder that contains '.traineddata' files.
            OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");
            // Allocate one in-memory stream per page to hold that page's OCR output.
            Stream[] streams = new MemoryStream[doc.GetPageCount()];
            for (int i = 0; i < doc.GetPageCount(); i++)
            {
                BasePage page = doc.GetPage(i);
                streams[i] = new MemoryStream();
                // The default resolution is 96. A larger value (192, 288, ...) helps
                // recognition, but it should not be set too large.
                Bitmap bmp = page.ConvertToImage(96);
                OCRPage ocrPage = OCRHandler.Import(bmp);
                ocrPage.Recognize();
                ocrPage.SaveTo(MIMEType.DOCX, streams[i]);
                streams[i].Seek(0, SeekOrigin.Begin);
            }
            DOCXDocument.CombineDocument(streams, @"C:\output.docx");