C# OCR Library
How to extract text from scanned PDF using .NET OCR SDK
How to Extract Text from Adobe PDF Document Using OCR Library in C#
Overview
Besides extracting text from TIFF images, C# users can also run accurate OCR on scanned PDF documents. Several user-defined options are available. For example, you can direct our .NET OCR SDK to recognize a single page of a PDF document and then output its text content. The available options are listed below.
- Recognize the whole PDF document and get all of its text content
- Recognize only one page of a PDF document and extract its text content
- Define a specific zone on a PDF page and run OCR on that zone only
- Recognize a scanned PDF and output the OCR result to an Adobe PDF file
- Recognize a scanned PDF and output the OCR result to an MS Word file
Please note that our OCR SDK does not support importing a PDF file directly. In the following C# demos, PDF documents are therefore first converted to TIFF image files (both string and stream forms are supported) and then recognized.
C# Project DLLs: Extract Text from Scanned PDF Using OCR SDK
Extract Text from Whole PDF Document in C#
// Requires System.IO (StreamWriter) and System.Drawing (Bitmap) in addition to the SDK namespaces.
// Open a PDF file.
String inputFilePath = @"C:\input.pdf";
PDFDocument doc = new PDFDocument(inputFilePath);
// The folder that contains '.traineddata' files.
OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");
// Set output file path.
String outputFilePath = @"C:\Output.txt";
// The using block closes the writer even if recognition throws.
using (StreamWriter writer = new StreamWriter(outputFilePath))
{
    for (int i = 0; i < doc.GetPageCount(); i++)
    {
        BasePage page = doc.GetPage(i);
        // The default resolution is 96. A higher value (192, 288, ...) helps recognition, but it should not be set too high.
        Bitmap bmp = page.ConvertToImage(96);
        OCRPage ocrPage = OCRHandler.Import(bmp);
        ocrPage.Recognize();
        // Append the recognized text of this page to the output file.
        writer.WriteLine(ocrPage.GetText());
    }
}
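The resolution comment above says that a larger value helps recognition but should not be too large. The following minimal sketch shows why: the rendered image grows with the square of the resolution. It does not use the OCR SDK. It assumes the resolution value is in dots per inch, a US Letter page (8.5 x 11 inches), and an uncompressed 32-bit bitmap; the class and method names are hypothetical.

```csharp
using System;

public static class RenderSize
{
    // Pixels along one side of a page rendered at the given resolution.
    // Assumption: the resolution is expressed in dots per inch.
    public static int PixelsFor(double inches, int dpi)
    {
        return (int)Math.Round(inches * dpi);
    }

    // Approximate size in bytes of an uncompressed bitmap of the page,
    // assuming 4 bytes (32 bits) per pixel.
    public static long BitmapBytes(double widthInches, double heightInches, int dpi)
    {
        return (long)PixelsFor(widthInches, dpi) * PixelsFor(heightInches, dpi) * 4;
    }
}
```

Under these assumptions a Letter page is 816 x 1056 pixels at 96 and 2448 x 3168 pixels at 288, so tripling the resolution needs about nine times the memory per page.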
Extract Text from Specified PDF Page in C#
// Open a PDF file.
String inputFilePath = @"C:\input.pdf";
PDFDocument doc = new PDFDocument(inputFilePath);
BasePage page = doc.GetPage(0);
// The default resolution is 96. A higher value (192, 288, ...) helps recognition, but it should not be set too high.
Bitmap bmp = page.ConvertToImage(96);
// The folder that contains '.traineddata' files.
OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");
OCRPage ocrPage = OCRHandler.Import(bmp);
ocrPage.Recognize();
ocrPage.SaveTo(MIMEType.TXT, @"C:\output.txt");
Extract Text from Specified Zone in PDF Page in C#
// Open a PDF file.
String inputFilePath = @"C:\input.pdf";
PDFDocument doc = new PDFDocument(inputFilePath);
BasePage page = doc.GetPage(0);
// The folder that contains '.traineddata' files.
OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");
OCRPage ocrPage = OCRHandler.Import(page);
// Create a page zone that starts at point (10, 10) and is 400 wide and 300 high.
OCRZone pageZone = ocrPage.CreateZone(new Rectangle(10, 10, 400, 300));
// Apply recognizing.
pageZone.Recognize();
// Output the result to a text file.
pageZone.SaveTo(MIMEType.TXT, @"C:\output.txt");
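If you render the page to an image at a higher resolution before importing it, a zone measured at one resolution may need to be scaled to match. The sketch below shows only the arithmetic and does not call the OCR SDK. It assumes that zone coordinates are pixel coordinates of the imported image, which the SDK documentation above does not confirm; the class and method names are hypothetical.

```csharp
using System;

public static class ZoneScaling
{
    // Scales one coordinate from a source resolution to a target resolution.
    public static int Scale(int value, int fromDpi, int toDpi)
    {
        return (int)Math.Round(value * (double)toDpi / fromDpi);
    }

    // Scales a zone given as x, y, width, height and returns it in the same order.
    public static int[] ScaleZone(int x, int y, int width, int height, int fromDpi, int toDpi)
    {
        return new[]
        {
            Scale(x, fromDpi, toDpi),
            Scale(y, fromDpi, toDpi),
            Scale(width, fromDpi, toDpi),
            Scale(height, fromDpi, toDpi)
        };
    }
}
```

For example, the zone (10, 10, 400, 300) measured at 96 becomes (20, 20, 800, 600) at 192 under this assumption.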
Recognize Scanned PDF and Output OCR Result to PDF in C#
// Open a PDF file.
String inputFilePath = @"C:\input.pdf";
PDFDocument doc = new PDFDocument(inputFilePath);
// The folder that contains '.traineddata' files.
OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");
// One in-memory stream per page to hold that page's OCR result.
Stream[] streams = new Stream[doc.GetPageCount()];
for (int i = 0; i < doc.GetPageCount(); i++)
{
    BasePage page = doc.GetPage(i);
    streams[i] = new MemoryStream();
    // The default resolution is 96. A higher value (192, 288, ...) helps recognition, but it should not be set too high.
    Bitmap bmp = page.ConvertToImage(96);
    OCRPage ocrPage = OCRHandler.Import(bmp);
    ocrPage.Recognize();
    ocrPage.SaveTo(MIMEType.PDF, streams[i]);
    // Rewind the stream so it can be read from the start when the pages are combined.
    streams[i].Seek(0, SeekOrigin.Begin);
}
// Combine the per-page results into one output PDF file.
PDFDocument.CombineDocument(streams, @"C:\output.pdf");
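The Seek call in the loop above matters because a MemoryStream keeps a single position for both writing and reading. The following minimal sketch uses only System.IO, not the OCR SDK, to show what a reader gets with and without the rewind; the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class RewindDemo
{
    // Writes text into a MemoryStream, optionally rewinds it, and returns
    // how many bytes a subsequent read can retrieve.
    public static int ReadableBytes(string text, bool rewind)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            byte[] data = Encoding.UTF8.GetBytes(text);
            stream.Write(data, 0, data.Length);
            if (rewind)
            {
                // Move the position back to the first byte.
                stream.Seek(0, SeekOrigin.Begin);
            }
            byte[] buffer = new byte[data.Length];
            return stream.Read(buffer, 0, buffer.Length);
        }
    }
}
```

Without the rewind the position sits at the end of the written data, so the read returns 0 bytes; with the rewind the full content is available.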
Recognize Scanned PDF and Output OCR Result to Word in C#
// Open a PDF file.
String inputFilePath = @"C:\input.pdf";
PDFDocument doc = new PDFDocument(inputFilePath);
// The folder that contains '.traineddata' files.
OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");
// One in-memory stream per page to hold that page's OCR result.
Stream[] streams = new Stream[doc.GetPageCount()];
for (int i = 0; i < doc.GetPageCount(); i++)
{
    BasePage page = doc.GetPage(i);
    streams[i] = new MemoryStream();
    // The default resolution is 96. A higher value (192, 288, ...) helps recognition, but it should not be set too high.
    Bitmap bmp = page.ConvertToImage(96);
    OCRPage ocrPage = OCRHandler.Import(bmp);
    ocrPage.Recognize();
    ocrPage.SaveTo(MIMEType.DOCX, streams[i]);
    // Rewind the stream so it can be read from the start when the pages are combined.
    streams[i].Seek(0, SeekOrigin.Begin);
}
// Combine the per-page results into one output Word file.
DOCXDocument.CombineDocument(streams, @"C:\output.docx");