C# OCR Library
Extract Text from Tiff File Using OCR SDK


Online C# Guide for Text Extraction from Tiff File Using .NET OCR SDK






Overview



With XImage.OCR for .NET, C# programmers can run fast OCR on TIFF files, scanned PDFs and other image formats such as JPEG, BMP, PNG and GIF. This guide focuses on applying OCR to TIFF image files. Specifically, it covers the following tasks, and demo code for each is provided in the sections below.



  • Extract text from whole Tiff file
  • Extract text from specified Tiff page
  • Extract text from specified zone in Tiff page
  • Scan image and output OCR result to PDF document
  • Scan image and output OCR result to Word document


Before using the C# demo code below, first install XImage.OCR for .NET into your C# project. Note that the DLL for each file format you want to OCR must also be added as a project reference. For TIFF images, add RasterEdge.XDoc.Tiff.dll as well.





C# Project DLLs: Extract Text from Tiff File Using OCR SDK



C# OCR: Extract Text from Whole Tiff File



            // Requires System.IO (StreamWriter) and System.Drawing (Bitmap).

            // Open a TIFF file.
            String inputFilePath = @"C:\input.tif";
            TIFFDocument doc = new TIFFDocument(inputFilePath);

            // The folder that contains '.traineddata' files.
            OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");

            // Set output file path.
            String outputFilePath = @"C:\Output.txt";
            // The using block closes the writer even if recognition throws.
            using (StreamWriter writer = new StreamWriter(outputFilePath))
            {
                int pageCount = doc.GetPageCount();
                for (int i = 0; i < pageCount; i++)
                {
                    BasePage page = doc.GetPage(i);
                    // 96 is the default resolution. A larger value (192, 288, ...) can
                    // help recognition, but it should not be set too large.
                    using (Bitmap bmp = page.ConvertToImage(96))
                    {
                        OCRPage ocrPage = OCRHandler.Import(bmp);
                        ocrPage.Recognize();
                        // Append the recognized text of this page to the output file.
                        writer.WriteLine(ocrPage.GetText());
                    }
                }
            }
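
The sample above passes a folder to OCRHandler.SetTrainResourcePath and expects it to contain '.traineddata' files. As a minimal sketch, the folder can be checked before any OCR call is made. The helper below uses only System.IO. The class name TrainedDataCheck and the method name FindTrainedData are hypothetical names chosen for illustration, not part of XImage.OCR.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the XImage.OCR SDK.
static class TrainedDataCheck
{
    // Returns the names of the '.traineddata' files found directly in the folder.
    // Returns an empty array if the folder does not exist or holds no such files.
    public static string[] FindTrainedData(string folder)
    {
        if (!Directory.Exists(folder))
        {
            return new string[0];
        }

        // Only the top-level folder is searched, not its subfolders.
        string[] paths = Directory.GetFiles(folder, "*.traineddata");
        string[] names = new string[paths.Length];
        for (int i = 0; i < paths.Length; i++)
        {
            names[i] = Path.GetFileName(paths[i]);
        }
        return names;
    }
}
```

For example, calling TrainedDataCheck.FindTrainedData(@"D:\Alice\DLL\Source\") and checking that the returned array is not empty before calling OCRHandler.SetTrainResourcePath reports a wrong path early, rather than during recognition.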


C# OCR: Extract Text from Specified Tiff Page



            // Open a TIFF file.
            String inputFilePath = @"C:\input.tif";
            TIFFDocument doc = new TIFFDocument(inputFilePath);

            // The folder that contains '.traineddata' files.
            OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");

            // Get the first page (page indexes start at 0).
            BasePage page = doc.GetPage(0);
            // 96 is the default resolution. A larger value (192, 288, ...) can
            // help recognition, but it should not be set too large.
            using (Bitmap bmp = page.ConvertToImage(96))
            {
                OCRPage ocrPage = OCRHandler.Import(bmp);
                ocrPage.Recognize();
                // Output the result to a text file.
                ocrPage.SaveTo(MIMEType.TXT, @"C:\output.txt");
            }


C# OCR: Extract Text from Specified Zone in Tiff Page



// DefaultSourceFolder and RootFolder are placeholders for your own folder paths.

// The folder that contains '.traineddata' files.
OCRHandler.SetTrainResourcePath(DefaultSourceFolder);

// Set input file path.
String inputFilePath = RootFolder + "\\" + "Test.tif";

// Set output file path.
String outputFilePath = RootFolder + "\\" + "Output2.txt";

// Import the TIFF file.
OCRDocument doc = OCRHandler.Import(inputFilePath);

// Get the first page (page indexes start at 0).
OCRPage page = doc.GetPage(0);

// Create a page zone starting at point (10, 10) with width 400 and height 300.
OCRZone pageZone = page.CreateZone(new Rectangle(10, 10, 400, 300));

// Run recognition on the zone only.
pageZone.Recognize();

// Output the result to a text file.
pageZone.SaveTo(MIMEType.TXT, outputFilePath);


C# OCR: Scan Tiff and Output OCR Result to PDF



            // Open a TIFF file.
            String inputFilePath = @"C:\input.tif";
            TIFFDocument doc = new TIFFDocument(inputFilePath);

            // The folder that contains '.traineddata' files.
            OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");

            // One in-memory PDF per page; they are combined into one file at the end.
            int pageCount = doc.GetPageCount();
            Stream[] streams = new Stream[pageCount];
            for (int i = 0; i < pageCount; i++)
            {
                BasePage page = doc.GetPage(i);
                streams[i] = new MemoryStream();
                // 96 is the default resolution. A larger value (192, 288, ...) can
                // help recognition, but it should not be set too large.
                using (Bitmap bmp = page.ConvertToImage(96))
                {
                    OCRPage ocrPage = OCRHandler.Import(bmp);
                    ocrPage.Recognize();
                    ocrPage.SaveTo(MIMEType.PDF, streams[i]);
                }
                // Rewind the stream so it can be read from the start when combining.
                streams[i].Seek(0, SeekOrigin.Begin);
            }
            // Combine the per-page results into a single PDF file.
            PDFDocument.CombineDocument(streams, @"C:\output.pdf");


C# OCR: Scan Tiff and Output OCR Result to Word



            // Open a TIFF file.
            String inputFilePath = @"C:\input.tif";
            TIFFDocument doc = new TIFFDocument(inputFilePath);

            // The folder that contains '.traineddata' files.
            OCRHandler.SetTrainResourcePath(@"D:\Alice\DLL\Source\");

            // One in-memory Word document per page; they are combined into one file at the end.
            int pageCount = doc.GetPageCount();
            Stream[] streams = new Stream[pageCount];
            for (int i = 0; i < pageCount; i++)
            {
                BasePage page = doc.GetPage(i);
                streams[i] = new MemoryStream();
                // 96 is the default resolution. A larger value (192, 288, ...) can
                // help recognition, but it should not be set too large.
                using (Bitmap bmp = page.ConvertToImage(96))
                {
                    OCRPage ocrPage = OCRHandler.Import(bmp);
                    ocrPage.Recognize();
                    ocrPage.SaveTo(MIMEType.DOCX, streams[i]);
                }
                // Rewind the stream so it can be read from the start when combining.
                streams[i].Seek(0, SeekOrigin.Begin);
            }
            // Combine the per-page results into a single Word file.
            DOCXDocument.CombineDocument(streams, @"C:\output.docx");