C# OCR Library
How to extract text from scanned PDF using .NET OCR SDK


How to Extract Text from Adobe PDF Document Using OCR Library in C#








Overview



Besides extracting text from TIFF images, C# users can also run accurate OCR on scanned PDF documents. Several user-defined options are available. For example, you can direct our .NET OCR SDK to recognize a single page of a PDF document and output its text content. The supported options are listed below.

  • Recognize the whole PDF document and get all of its text content
  • Recognize a single page of a PDF document and extract its text content
  • Define a specific zone on a PDF page and perform OCR on that zone only
  • Recognize a scanned PDF and output the OCR result to an Adobe PDF file
  • Recognize a scanned PDF and output the OCR result to an MS Word file


Please note that our OCR SDK does not support importing PDF files directly. In the following C# demos, PDF documents are therefore first converted to TIFF image files (both file path strings and streams are supported) and then recognized.





C# Project DLLs: Extract Text from Scanned PDF Using OCR SDK







Extract Text from Whole PDF Document in C#



            // Requires 'using System.IO;' and 'using System.Drawing;' in addition to the SDK namespaces.
            // Open a PDF file.
            String inputFilePath = @"C:\input.pdf";
            PDFDocument doc = new PDFDocument(inputFilePath);

            // The folder that contains the '.traineddata' files.
            OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");

            // Set the output file path. 'using' closes the writer even if recognition throws.
            String outputFilePath = @"C:\Output.txt";
            using (StreamWriter writer = new StreamWriter(outputFilePath))
            {
                for (int i = 0; i < doc.GetPageCount(); i++)
                {
                    BasePage page = doc.GetPage(i);
                    // Render the page at the default resolution of 96. A larger value (192, 288, ...)
                    // helps recognition, but it should not be set too large.
                    using (Bitmap bmp = page.ConvertToImage(96))
                    {
                        OCRPage ocrPage = OCRHandler.Import(bmp);
                        ocrPage.Recognize();
                        // Append the recognized text of this page to the output file.
                        writer.WriteLine(ocrPage.GetText());
                    }
                }
            }
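
The resolution trade-off mentioned in the code comment can be estimated before rendering. The following is a minimal sketch, not part of the OCR SDK: it assumes the resolution argument is in dots per inch, that the PDF page uses the default 72 points per inch, and that the rendered bitmap is stored uncompressed at 32 bits per pixel. The class and method names (RenderSizeEstimate, ToPixels, EstimateBytes) are hypothetical.

```csharp
using System;

public static class RenderSizeEstimate
{
    // PDF user space uses 72 points per inch by default.
    private const double PointsPerInch = 72.0;

    // Converts a length in PDF points to pixels at the given resolution (DPI).
    public static int ToPixels(double points, int dpi)
    {
        return (int)Math.Round(points / PointsPerInch * dpi);
    }

    // Approximate size in bytes of an uncompressed bitmap of the rendered page.
    public static long EstimateBytes(double widthPoints, double heightPoints, int dpi)
    {
        long width = ToPixels(widthPoints, dpi);
        long height = ToPixels(heightPoints, dpi);
        return width * height * 4; // assumption: 4 bytes per pixel (32-bit ARGB)
    }

    public static void Main()
    {
        // US Letter page: 612 x 792 points (8.5 x 11 inches).
        foreach (int dpi in new[] { 96, 192, 288 })
        {
            Console.WriteLine("{0} DPI: {1} x {2} px, about {3:F1} MB",
                dpi, ToPixels(612, dpi), ToPixels(792, dpi),
                EstimateBytes(612, 792, dpi) / (1024.0 * 1024.0));
        }
    }
}
```

Under these assumptions a US Letter page is 816 x 1056 pixels at 96 DPI, and the memory needed grows with the square of the resolution, which is why very large values are best avoided.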




Extract Text from Specified PDF Page in C#



            // Open a PDF file.
            String inputFilePath = @"C:\input.pdf";
            PDFDocument doc = new PDFDocument(inputFilePath);
            // Get the first page (page indexes are zero-based).
            BasePage page = doc.GetPage(0);
            // The folder that contains the '.traineddata' files.
            OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");
            // Render the page at the default resolution of 96. A larger value (192, 288, ...)
            // helps recognition, but it should not be set too large.
            using (Bitmap bmp = page.ConvertToImage(96))
            {
                OCRPage ocrPage = OCRHandler.Import(bmp);
                ocrPage.Recognize();
                // Save the recognized text to a text file.
                ocrPage.SaveTo(MIMEType.TXT, @"C:\output.txt");
            }
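
The demo above passes a zero-based index to GetPage, whereas PDF viewers usually show one-based page numbers. A small bounds-checked conversion can guard against off-by-one errors. This is only a sketch: PageIndexHelper and ToZeroBasedIndex are hypothetical names, not part of the OCR SDK, and the page count is hard-coded here in place of doc.GetPageCount().

```csharp
using System;

public static class PageIndexHelper
{
    // Converts a 1-based page number (as shown in a PDF viewer) to a 0-based index,
    // throwing if the number falls outside the document.
    public static int ToZeroBasedIndex(int pageNumber, int pageCount)
    {
        if (pageCount < 0)
            throw new ArgumentOutOfRangeException(nameof(pageCount));
        if (pageNumber < 1 || pageNumber > pageCount)
            throw new ArgumentOutOfRangeException(nameof(pageNumber),
                "Page number must be between 1 and " + pageCount + ".");
        return pageNumber - 1;
    }

    public static void Main()
    {
        int pageCount = 5; // stands in for doc.GetPageCount()
        int index = ToZeroBasedIndex(3, pageCount);
        Console.WriteLine(index); // prints 2
    }
}
```

The returned index could then be passed to doc.GetPage in place of the literal 0.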




Extract Text from Specified Zone in PDF Page in C#



            // Open a PDF file.
            String inputFilePath = @"C:\input.pdf";
            PDFDocument doc = new PDFDocument(inputFilePath);
            BasePage page = doc.GetPage(0);
            // The folder that contains '.traineddata' files.
            OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");
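            // Render the page to an image first, as in the other demos,
            // because the OCR SDK does not import PDF content directly.
            Bitmap bmp = page.ConvertToImage(96);
            OCRPage ocrPage = OCRHandler.Import(bmp);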
            // Create a page zone starting at point (10, 10) with width 400 and height 300.
            OCRZone pageZone = ocrPage.CreateZone(new Rectangle(10, 10, 400, 300));

            // Apply recognizing.
            pageZone.Recognize();

            // Output the result to a text file.
            pageZone.SaveTo(MIMEType.TXT, @"C:\output.txt");
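
If a zone was measured at one resolution and the page is later rendered at another, the rectangle has to be scaled by the same factor. The sketch below assumes zone coordinates are pixels of the rendered image, which the original page does not state. ZoneScaler is a hypothetical helper, not an SDK class; it uses only System.Drawing.Rectangle.

```csharp
using System;
using System.Drawing;

public static class ZoneScaler
{
    // Scales a zone defined at one resolution to the equivalent zone at another.
    public static Rectangle Scale(Rectangle zone, int fromDpi, int toDpi)
    {
        if (fromDpi <= 0 || toDpi <= 0)
            throw new ArgumentException("Resolutions must be positive.");
        double factor = (double)toDpi / fromDpi;
        return new Rectangle(
            (int)Math.Round(zone.X * factor),
            (int)Math.Round(zone.Y * factor),
            (int)Math.Round(zone.Width * factor),
            (int)Math.Round(zone.Height * factor));
    }

    public static void Main()
    {
        // The zone from the demo above, assumed to be measured at 96 DPI.
        Rectangle zone = new Rectangle(10, 10, 400, 300);
        Rectangle scaled = Scale(zone, 96, 192);
        Console.WriteLine(scaled); // {X=20,Y=20,Width=800,Height=600}
    }
}
```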




Recognize Scanned PDF and Output OCR Result to PDF in C#



            // Open a PDF file.
            String inputFilePath = @"C:\input.pdf";
            PDFDocument doc = new PDFDocument(inputFilePath);

            // The folder that contains the '.traineddata' files.
            OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");
            // Allocate one in-memory stream per page to hold the intermediate PDF output.
            Stream[] streams = new MemoryStream[doc.GetPageCount()];
            for (int i = 0; i < doc.GetPageCount(); i++)
            {
                BasePage page = doc.GetPage(i);
                streams[i] = new MemoryStream();
                // Render the page at the default resolution of 96. A larger value (192, 288, ...)
                // helps recognition, but it should not be set too large.
                using (Bitmap bmp = page.ConvertToImage(96))
                {
                    OCRPage ocrPage = OCRHandler.Import(bmp);
                    ocrPage.Recognize();
                    ocrPage.SaveTo(MIMEType.PDF, streams[i]);
                }
                // Rewind the stream so that it can be read from the beginning when combining.
                streams[i].Seek(0, SeekOrigin.Begin);
            }
            // Merge the per-page results into one PDF file, then release the streams.
            PDFDocument.CombineDocument(streams, @"C:\output.pdf");
            foreach (Stream s in streams) s.Dispose();
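
The Seek call in the loop above matters because writing to a MemoryStream leaves its position at the end, so a reader that starts from the current position finds nothing to read. The sketch below shows this with plain .NET streams only. It does not use the OCR SDK, and StreamRewindDemo and its methods are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamRewindDemo
{
    // Writes some bytes into a new MemoryStream, optionally rewinding it afterwards.
    public static MemoryStream WritePage(string content, bool rewind)
    {
        MemoryStream stream = new MemoryStream();
        byte[] bytes = Encoding.UTF8.GetBytes(content);
        stream.Write(bytes, 0, bytes.Length);
        if (rewind)
            stream.Seek(0, SeekOrigin.Begin);
        return stream;
    }

    // Reads everything from the stream's current position to its end,
    // leaving the stream open for the caller.
    public static string ReadRemaining(Stream stream)
    {
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8, false, 1024, true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        using (MemoryStream notRewound = WritePage("page 1", false))
        using (MemoryStream rewound = WritePage("page 1", true))
        {
            Console.WriteLine(ReadRemaining(notRewound).Length); // 0: position is at the end
            Console.WriteLine(ReadRemaining(rewound));           // page 1
        }
    }
}
```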




Recognize Scanned PDF and Output OCR Result to Word in C#



            // Open a PDF file.
            String inputFilePath = @"C:\input.pdf";
            PDFDocument doc = new PDFDocument(inputFilePath);

            // The folder that contains the '.traineddata' files.
            OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");
            // Allocate one in-memory stream per page to hold the intermediate Word output.
            Stream[] streams = new MemoryStream[doc.GetPageCount()];
            for (int i = 0; i < doc.GetPageCount(); i++)
            {
                BasePage page = doc.GetPage(i);
                streams[i] = new MemoryStream();
                // Render the page at the default resolution of 96. A larger value (192, 288, ...)
                // helps recognition, but it should not be set too large.
                using (Bitmap bmp = page.ConvertToImage(96))
                {
                    OCRPage ocrPage = OCRHandler.Import(bmp);
                    ocrPage.Recognize();
                    ocrPage.SaveTo(MIMEType.DOCX, streams[i]);
                }
                // Rewind the stream so that it can be read from the beginning when combining.
                streams[i].Seek(0, SeekOrigin.Begin);
            }
            // Merge the per-page results into one Word file, then release the streams.
            DOCXDocument.CombineDocument(streams, @"C:\output.docx");
            foreach (Stream s in streams) s.Dispose();