Using the OptimizedLoad Functions to Speed Loading Large Files

Introduction

Certain multi-page file formats can be slow to load using the default implementation of RasterCodecs. This is especially true for document file formats such as DOCX, RTF, PDF, and TXT.

Most raster file formats, such as TIFF, have a simple header at the beginning of the file that indicates the number of pages as well as the location of each page in the file. Therefore, to read page 5 from a 10-page file, the software first reads the header and then jumps straight to page 5 and parses it. In this case, no other parts of the file (including the other pages) are touched. The penalty for reading the next page (page number 6) without any prior information is to read and parse the header again, which is typically a fast operation.

Some document file formats, such as PDF, also contain a header at the beginning of the file indicating the number and location of the pages. However, the page might have cross-references to other sections in the document. For example, the font data is almost always stored globally for PDF documents at a specific location. Therefore, to read page 5 from a 10-page file, the software must first read the header and then read the global font data (and then jump to page 5 to parse it). The penalty for reading the next page (page number 6) without any prior information is to read and parse the header as well as all the global cross-reference sections in the file, which will be repeated for each page.

Other types of document formats such as plain text do not contain a header at all. The total number and location of each page must be calculated by looping through each line of text and rendering it on a virtual page, incrementing the page count when the virtual page runs out of space. Every line in the file must be processed this way. Therefore, to read page 5 from a 10-page file, the software will render and discard pages 1 through 4 on a virtual page, then render page 5 into the physical page. In essence, RasterCodecs will create a "file format header" for a Text file that includes the number and location of each page in the physical file. The penalty for reading the next page (page number 6) without any prior information is to perform this whole operation all over again (this time rendering and discarding pages 1 through 5).

The solution is to use the optimized load mechanism in RasterCodecs. With optimized loading, RasterCodecs keeps internal state information about the document instead of rebuilding it for every page. For example, for the PDF file above, the global font table is kept in memory and re-used for subsequent pages. In the case of the text file, the virtual header is created and then used and updated as pages are loaded.
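
The "virtual header" idea for text files can be pictured with a small stand-alone sketch. This is not the RasterCodecs implementation: the TextPageIndex class and its Build and GetPage methods are hypothetical names, and the sketch assumes every virtual page holds a fixed number of lines with no word wrapping. It only shows why keeping the page index around avoids re-scanning the earlier pages.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of LEADTOOLS.
public static class TextPageIndex
{
    // Scans every line once and records the index of the first line of each page.
    // Assumption: each virtual page holds exactly linesPerPage lines (no wrapping).
    public static List<int> Build(IReadOnlyList<string> lines, int linesPerPage)
    {
        if (linesPerPage <= 0)
            throw new ArgumentOutOfRangeException(nameof(linesPerPage));

        var pageStarts = new List<int>();
        int linesOnPage = 0;
        for (int i = 0; i < lines.Count; i++)
        {
            // Start a new page for the first line, or when the current page is full
            if (i == 0 || linesOnPage == linesPerPage)
            {
                pageStarts.Add(i);
                linesOnPage = 0;
            }
            linesOnPage++;
        }
        return pageStarts;
    }

    // Returns the lines of a 1-based page by jumping to its recorded start,
    // without rendering and discarding the pages before it.
    public static List<string> GetPage(IReadOnlyList<string> lines, List<int> pageStarts, int pageNumber)
    {
        if (pageNumber < 1 || pageNumber > pageStarts.Count)
            throw new ArgumentOutOfRangeException(nameof(pageNumber));

        int start = pageStarts[pageNumber - 1];
        int end = pageNumber < pageStarts.Count ? pageStarts[pageNumber] : lines.Count;

        var page = new List<string>(end - start);
        for (int i = start; i < end; i++)
            page.Add(lines[i]);
        return page;
    }
}
```

With 25 lines and 10 lines per page, Build records three pages starting at lines 0, 10, and 20. Without the stored index, every call for a page would have to repeat the scan in Build first, which is the per-page penalty described above.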

Real-World Example

As an example, consider sample.pdf, a PDF file with 115 pages. The following example shows an implementation of loading this file using RasterCodecs:

C#
// Requires: using System; using System.Diagnostics; using Leadtools; using Leadtools.Codecs;
private static void TestOptimizeLoad(string imageFileName, bool useOptimizeLoad)
{
    using (RasterCodecs codecs = new RasterCodecs())
    {
        // Create a timer; it is restarted before each timed operation
        var stopwatch = new Stopwatch();

        // Page count (filled in by GetInformation below) and timing accumulators
        int pageCount;
        double thisTime = 0;
        double totalTime = 0;
 
        // See if we need to turn optimize load on. 
        if (useOptimizeLoad) 
            codecs.StartOptimizedLoad(); 
 
        stopwatch.Restart(); 
        CodecsImageInfo imageInfo = codecs.GetInformation(imageFileName, true); 
        thisTime = stopwatch.ElapsedMilliseconds; 
        totalTime += thisTime; 
        pageCount = imageInfo.TotalPages; 
        imageInfo.Dispose(); 
 
        Console.WriteLine($"GetInformation took {thisTime} ms"); 
 
        // Loop through every page and load it 
        for (int pageNumber = 1; pageNumber <= pageCount; pageNumber++) 
        { 
            stopwatch.Restart(); 
            RasterImage image = codecs.Load(imageFileName, pageNumber); 
            thisTime = stopwatch.ElapsedMilliseconds; 
            totalTime += thisTime; 
            image.Dispose(); 
 
            Console.WriteLine($"Loading page {pageNumber} took {thisTime} ms"); 
        } 
 
        // See if it is necessary to turn Optimized loading back off and free its data. 
        // Note that this is not really needed in this example since  
        // the RasterCodecs is disposed immediately after, which will take care of that 
        if (useOptimizeLoad) 
            codecs.StopOptimizedLoad(); 
 
        Console.WriteLine($"Total time {totalTime} ms"); 
    } 
} 

The code is very simple: first, it obtains the total number of pages from the file, and then it loads each page as a RasterImage object. It will also log the time it took for each operation in order to add up the total time spent.

Calling this method with TestOptimizeLoad("sample.pdf", false) results in the following on the test machine:

GetInformation took 114 ms 
Loading page 1 took 103 ms 
Loading page 2 took 61 ms 
Loading page 3 took 46 ms 
// Snip 
Loading page 113 took 43 ms 
Loading page 114 took 40 ms 
Loading page 115 took 50 ms 
Total time 6622 ms 

Calling this method with TestOptimizeLoad("sample.pdf", true) results in the following on the same test machine:

GetInformation took 114 ms 
Loading page 1 took 115 ms 
Loading page 2 took 31 ms 
Loading page 3 took 42 ms 
// Snip 
Loading page 113 took 43 ms 
Loading page 114 took 40 ms 
Loading page 115 took 48 ms 
Total time 5577 ms 

Compare the first at 6.6 seconds with the second at 5.6 seconds. Calling StartOptimizedLoad therefore reduces the total time by roughly 16%.

Another file is sample.txt. This is a large text file with 348 pages. Here are the results of TestOptimizeLoad("sample.txt", false):

GetInformation took 294 ms 
Loading page 1 took 193 ms 
Loading page 2 took 155 ms 
// Snip 
Loading page 346 took 161 ms 
Loading page 347 took 138 ms 
Loading page 348 took 146 ms 
Total time 51706 ms 

While TestOptimizeLoad("sample.txt", true) results in:

GetInformation took 259 ms 
Loading page 1 took 59 ms 
Loading page 2 took 34 ms 
Loading page 3 took 34 ms 
// Snip 
Loading page 346 took 34 ms 
Loading page 347 took 35 ms 
Loading page 348 took 33 ms 
Total time 12992 ms 

Compare the first at 52 seconds with the second at 13 seconds: the optimized run is almost four times as fast.
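
The two comparisons above can be checked with simple arithmetic on the logged totals. The helper below is a hypothetical illustration (the LoadTimings class is not part of LEADTOOLS); it separates the speedup factor from the percentage of time saved, two figures that are easy to confuse.

```csharp
using System;

// Hypothetical helper for comparing two total load times.
public static class LoadTimings
{
    // Ratio of the baseline time to the optimized time (2.0 means twice as fast)
    public static double SpeedupFactor(double baselineMs, double optimizedMs)
        => baselineMs / optimizedMs;

    // Percentage of the baseline time that was saved by the optimized run
    public static double TimeSavedPercent(double baselineMs, double optimizedMs)
        => (baselineMs - optimizedMs) / baselineMs * 100.0;
}
```

For sample.pdf (6622 ms versus 5577 ms), the speedup factor is about 1.19 and the time saved is about 15.8%. For sample.txt (51706 ms versus 12992 ms), the speedup factor is about 3.98 and the time saved is about 74.9%.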

Effects of Optimized Loading

Comparing TIFF, PDF, and Text load times shows the extreme effects of using optimized loading with RasterCodecs, as follows:

The above examples use optimized load data to convert all the pages of an input file. For PDF files, the speed improvement will be noticeable when large files (1000 pages or more) are converted. For other file formats, the speed improvement can be significant even with fewer pages (as few as 10 pages).

TIFF/BigTIFF files use a simpler mechanism (the IFD, or file offset of each page). For more information about using the IFD, refer to Loading and Saving Large TIFF/BigTIFF Files.

Supported File Formats

The following file formats are supported:

Usage

The simplest way to enable optimized loading in RasterCodecs is to call RasterCodecs.StartOptimizedLoad before getting information or loading the pages of a document and RasterCodecs.StopOptimizedLoad when the operation finishes. StartOptimizedLoad will inform the RasterCodecs object that all subsequent RasterCodecs.GetInformation, RasterCodecs.GetInformationAsync, RasterCodecs.Load, and RasterCodecs.LoadAsync methods will be performed on the same image file (or stream), until the RasterCodecs.StopOptimizedLoad method is called. Passing a different file or stream in the middle of this operation will result in undefined behavior. To load a different file in the middle of the operation, simply create a new instance of RasterCodecs and use it for this purpose.

While optimized loading is in use on a RasterCodecs object, the object creates, stores, and re-uses the internal data that the file format (such as PDF or TEXT) needs for the particular document file or stream, and it handles all the necessary communication. This internal data is freed when optimized loading is turned off using RasterCodecs.StopOptimizedLoad or when the RasterCodecs object is disposed.

More advanced usage is needed when the pages cannot all be loaded during the life span of one RasterCodecs instance, and the optimized load data must instead be re-used by different RasterCodecs objects (for example, when loading the pages through a web service). In this case, call RasterCodecs.GetOptimizedLoadData to obtain the current internal state data, and then call RasterCodecs.StartOptimizedLoad(CodecsOptimizedLoadData) to turn on the same optimized load in a different RasterCodecs object.

Note that not all file formats can use optimized load data. And, for the ones that do, not all file formats support getting/re-setting the optimized data. For such types of formats, RasterCodecs.GetOptimizedLoadData will return null.

For instance, the following example shows an action in an ASP.NET controller of a web service that allows the user to load a page as a PNG file from a URL:

C#
public async Task<ActionResult> GetPng(Uri uri, int pageNumber) 
{ 
    // New RasterCodecs 
    using (RasterCodecs codecs = new RasterCodecs()) 
    { 
        RasterImage image; 
 
        // Load the page as RasterImage 
        using (ILeadStream leadStream = await LeadStream.Factory.FromUri(uri)) 
        { 
            image = await codecs.LoadAsync(leadStream, pageNumber); 
        } 
 
        // Save it as PNG 
        MemoryStream memoryStream = new MemoryStream(); 
        using (ILeadStream leadStream = LeadStream.Factory.FromStream(memoryStream)) 
        { 
            await codecs.SaveAsync(image, leadStream, RasterImageFormat.Png, 0); 
            memoryStream.Position = 0; 
        } 
 
        image.Dispose(); 
 
        return File(memoryStream, "image/png"); 
    } 
} 

The goal is to use optimized loading when the URL points to the same file. Use a concurrent dictionary to store the optimized load data. For a real-world application, this should be replaced by a caching mechanism, as follows:

C#
private static ConcurrentDictionary<string, CodecsOptimizedLoadData> _urlOptimizeLoadData = new ConcurrentDictionary<string, CodecsOptimizedLoadData>();
public async Task<ActionResult> GetPng(Uri uri, int pageNumber) 
{ 
    // New RasterCodecs 
    using (RasterCodecs codecs = new RasterCodecs()) 
    { 
        RasterImage image; 
 
        // See if this URI is already optimized 
 
        // Will use the lower case of the URI as the key 
        string key = uri.ToString().ToLower(); 
        CodecsOptimizedLoadData uriOptimizedLoadData = null; 
        if (!_urlOptimizeLoadData.TryGetValue(key, out uriOptimizedLoadData)) 
        { 
            // There is no optimized data for this URI, so start a new one
            codecs.StartOptimizedLoad(); 
        } 
        else 
        { 
            // Re-use the optimized data already stored for this URI
            codecs.StartOptimizedLoad(uriOptimizedLoadData); 
        } 
 
        // Load the page as a RasterImage 
        using (ILeadStream leadStream = await LeadStream.Factory.FromUri(uri)) 
        { 
            image = await codecs.LoadAsync(leadStream, pageNumber); 
        } 
 
        // If new optimized data was started and RasterCodecs is using it, save
        // it in the dictionary
        if (uriOptimizedLoadData == null) 
        { 
            uriOptimizedLoadData = codecs.GetOptimizedLoadData(); 
 
            // This could be null, since not all file formats (a) support optimized load 
            // data (b) support getting or setting the data 
            if (uriOptimizedLoadData != null) 
            { 
                // Save it 
                _urlOptimizeLoadData.TryAdd(key, uriOptimizedLoadData); 
            } 
        } 
 
        // Optimized loading must be stopped here, because this RasterCodecs object is
        // used next for saving to a different location
        codecs.StopOptimizedLoad(); 
 
        // Save it as a PNG file 
        MemoryStream memoryStream = new MemoryStream(); 
        using (ILeadStream leadStream = LeadStream.Factory.FromStream(memoryStream)) 
        { 
            await codecs.SaveAsync(image, leadStream, RasterImageFormat.Png, 0); 
            memoryStream.Position = 0; 
        } 
 
        image.Dispose(); 
 
        return File(memoryStream, "image/png"); 
    } 
} 

CodecsOptimizedLoadData Class Data Members

The CodecsOptimizedLoadData class contains the following data members that can be serialized and re-constructed easily:

Member Description
CodecIndex Integer indicating the internal LEADTOOLS codec (file filter) index using the data.
GetData/SetData Gets or sets the internal state data as a byte array.

The example above stores CodecsOptimizedLoadData objects into a dictionary. However, the class can be easily re-constructed after being saved into a cache system. For example:

C#
// Get it from a raster codecs: 
CodecsOptimizedLoadData optimizedLoadData = codecs.GetOptimizedLoadData(); 
// Save it into the user-defined cache. The cache must store the integer (CodecIndex) and the byte[] (data)
if (optimizedLoadData != null) 
{ 
    int codecIndex = optimizedLoadData.CodecIndex; 
    byte[] data = optimizedLoadData.GetData(); 
    SaveToCache(key, codecIndex, data); 
} 

And the other way around:

C#
int codecIndex; 
byte[] data; 
// Load it from the user-defined cache
if (LoadFromCache(key, out codecIndex, out data)) 
{ 
    // Create CodecsOptimizedLoadData and set it into the RasterCodecs: 
    CodecsOptimizedLoadData optimizedLoadData = new CodecsOptimizedLoadData(); 
    optimizedLoadData.CodecIndex = codecIndex; 
    optimizedLoadData.SetData(data); 
    codecs.StartOptimizedLoad(optimizedLoadData); 
} 
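
SaveToCache and LoadFromCache in the two snippets above are user-defined. As one possible shape for them, the sketch below packs the codec index and the byte array into a single blob that could be handed to any cache that stores byte arrays. The OptimizedLoadCacheEntry class and its blob layout are assumptions made for illustration, not part of LEADTOOLS, and the approach only applies to the flat data returned by GetData.

```csharp
using System;
using System.IO;

// Hypothetical helper: turns (codecIndex, data) into one cacheable blob and back.
// Assumed layout: [int32 codecIndex][int32 data length][data bytes]
public static class OptimizedLoadCacheEntry
{
    public static byte[] Pack(int codecIndex, byte[] data)
    {
        if (data == null)
            throw new ArgumentNullException(nameof(data));

        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(codecIndex);
            writer.Write(data.Length);
            writer.Write(data);
            writer.Flush();
            return stream.ToArray();
        }
    }

    // Returns false if the blob is missing, too short, or its length field does not match.
    public static bool TryUnpack(byte[] blob, out int codecIndex, out byte[] data)
    {
        codecIndex = 0;
        data = null;
        if (blob == null || blob.Length < 8)
            return false;

        using (var stream = new MemoryStream(blob))
        using (var reader = new BinaryReader(stream))
        {
            int index = reader.ReadInt32();
            int length = reader.ReadInt32();
            if (length < 0 || length != blob.Length - 8)
                return false;

            codecIndex = index;
            data = reader.ReadBytes(length);
            return true;
        }
    }
}
```

A SaveToCache implementation could store the result of Pack under the key, and LoadFromCache could call TryUnpack on the stored blob before the values are passed to CodecIndex and SetData.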

CodecsOptimizedLoadData used to contain only managed members, so it was not disposable. It can now contain unmanaged pointers, so the class is disposable. To determine whether an instance contains unmanaged (non-flat) load data, check the CodecsOptimizedLoadData.UnmanagedData property.

CodecsOptimizedLoadData Class Behavior

The behavior of CodecsOptimizedLoadData depends on whether the filter data is flat, as follows:

Very Important Note

When you call RasterCodecs.GetOptimizedLoadData(true), you only get a peek at the optimized load data that is inside the RasterCodecs object. The object retrieved might have already been disposed under any of the following conditions:

Under any of the above conditions, the RasterCodecs object that created it still owns the unmanaged data in this class. So, even though you obtained a CodecsOptimizedLoadData object that is disposable, YOU SHOULD NOT DISPOSE IT. If you do dispose it, you will invalidate the filter data used by RasterCodecs. This can result in undefined behavior.

The user can get ownership of the optimized load data by calling RasterCodecs.DetachOptimizedLoadData. If you call RasterCodecs.DetachOptimizedLoadData, the optimized load data will be valid for the duration of your app until you dispose it.

Unmanaged filter data can contain file handles or internal pointers. Consequently, the data is valid only while the process is running. You cannot give it to another process and you cannot use it if you stop and restart the app (as in the case of a web server).

Other than for informational purposes, you use this optimized load data by passing it as an argument to StartOptimizedLoad(CodecsOptimizedLoadData).

The following two pieces of code are equivalent:

First Piece

CodecsOptimizedLoadData optimizedLoadData = rasterCodecs.GetOptimizedLoadData(true);
rasterCodecs.DetachOptimizedLoadData();

Second Piece

CodecsOptimizedLoadData optimizedLoadData = rasterCodecs.DetachOptimizedLoadData();

Important Note

If you call StartOptimizedLoad(CodecsOptimizedLoadData), you still have ownership of the load data. Calling RasterCodecs.StopOptimizedLoad or loading another file/stream DOES NOT invalidate your optimized load data. You still must dispose it.

Here is one way of using the load data:

CodecsOptimizedLoadData optimizedLoadData;
CodecsImageInfo info;
using (RasterCodecs codecs1 = new RasterCodecs())
{
   codecs1.StartOptimizedLoad();
   info = codecs1.GetInformation("sourcefile.docx", true);
   optimizedLoadData = codecs1.GetOptimizedLoadData(true); /* get the data */
   codecs1.DetachOptimizedLoadData(); /* take ownership */
   codecs1.StopOptimizedLoad();
}

The detached load data can then be re-used by other RasterCodecs objects, for example in the for loop that follows. An alternative to looping is to create several threads, each loading a page from the file. Note that this works with files, but NOT with streams, since multiple threads cannot read from different positions of the same stream.

for (int i = 1; i <= info.TotalPages; i++)
{
   using (RasterCodecs codecs2 = new RasterCodecs())
   {
      codecs2.StartOptimizedLoad(optimizedLoadData);
      using (RasterImage image = codecs2.Load("sourcefile.docx", i))
      {
         /* do something with the image */
      }
      codecs2.StopOptimizedLoad();
   }
}
optimizedLoadData.Dispose(); /* dispose the load data once you are done with it */

Help Version 20.0.2019.3.12
© 1991-2019 LEAD Technologies, Inc. All Rights Reserved.