Using the OptimizedLoad Functions to Speed Loading Large Files

Loading certain multi-page file formats can be slow with the default RasterCodecs implementation. This is especially true for document file formats such as DOCX, RTF, PDF, and TXT.

Most raster file formats, such as TIFF, have a simple header at the beginning of the file that indicates the number of pages as well as the location of each page in the file. Therefore, to read page 5 from a 10-page file, the software first reads the header and then jumps directly to page 5 and parses it. No other part of the file, including the other pages, is touched. The penalty for reading the next page (page 6) without any prior information is reading and parsing the header again, which is typically a fast operation.

Some document file formats, such as PDF, also contain a header at the beginning of the file indicating the number and location of the pages. However, a page might have cross-references to other sections in the document. For example, the font data for a PDF document is almost always stored globally at a specific location. Therefore, to read page 5 from a 10-page file, the software must first read the header, then read the global font data, and only then jump to page 5 and parse it. The penalty for reading the next page (page 6) without any prior information is reading and parsing the header as well as all the global cross-referenced sections again, and this work is repeated for each page.

Other document formats, such as plain text, do not contain a header at all. The total number of pages and the location of each page must be calculated by looping through each line of text and rendering it on a virtual page, incrementing the page count whenever the virtual page runs out of space. Every line in the file must be processed this way. Therefore, to read page 5 from a 10-page file, the software renders and discards pages 1 through 4 on a virtual page, then renders page 5 into the physical page. In essence, RasterCodecs creates a "file format header" for the text file that includes the number and location of each page in the physical file. The penalty for reading the next page (page 6) without any prior information is performing this whole operation all over again (this time rendering and discarding pages 1 through 5).

The solution is to use the optimized load mechanism in RasterCodecs. With optimized load, RasterCodecs can keep internal state information about the document being loaded. In the PDF example above, the global font table is kept in memory and re-used for subsequent pages. In the case of the text file, the virtual header is created once and then used and updated as pages are loaded.
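
To make the cost difference concrete, here is a toy model of the plain-text case. It is not the RasterCodecs implementation: it assumes a fixed number of lines per virtual page and reduces "rendering" a line to counting it. The names TextPager, LoadPage, ScanCount, and TotalScan are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Toy model only: a "page" is a fixed number of lines, and "rendering"
// a line is reduced to counting it in ScanCount.
public sealed class TextPager
{
    private readonly string[] _lines;
    private readonly int _linesPerPage;

    // Optional index: the first line of each page discovered so far.
    // Null when the pager keeps no state between calls.
    private readonly List<int> _pageStarts;

    // Total number of lines scanned by all LoadPage calls.
    public long ScanCount { get; private set; }

    public TextPager(string[] lines, int linesPerPage, bool keepIndex)
    {
        _lines = lines;
        _linesPerPage = linesPerPage;
        _pageStarts = keepIndex ? new List<int> { 0 } : null;
    }

    // Returns the lines of the requested 1-based page.
    public string[] LoadPage(int pageNumber)
    {
        int page = 1;
        int line = 0;

        if (_pageStarts != null)
        {
            // Resume from the closest page start that is already known.
            page = Math.Min(pageNumber, _pageStarts.Count);
            line = _pageStarts[page - 1];
        }

        // Walk over (and discard) every page before the requested one.
        while (page < pageNumber)
        {
            int skipped = Math.Min(_linesPerPage, _lines.Length - line);
            ScanCount += skipped;
            line += skipped;
            page++;
            if (_pageStarts != null && _pageStarts.Count < page)
                _pageStarts.Add(line);
        }

        // "Render" the requested page.
        int count = Math.Min(_linesPerPage, _lines.Length - line);
        ScanCount += count;
        var result = new string[count];
        Array.Copy(_lines, line, result, 0, count);

        // Remember where the next page starts so a sequential load is cheap.
        if (_pageStarts != null && _pageStarts.Count == pageNumber)
            _pageStarts.Add(line + count);

        return result;
    }

    // Loads pages 1..pageCount in order and reports how many lines were scanned.
    public static long TotalScan(int pageCount, int linesPerPage, bool keepIndex)
    {
        var lines = new string[pageCount * linesPerPage];
        for (int i = 0; i < lines.Length; i++)
            lines[i] = "line " + i;

        var pager = new TextPager(lines, linesPerPage, keepIndex);
        for (int page = 1; page <= pageCount; page++)
            pager.LoadPage(page);
        return pager.ScanCount;
    }
}
```

In this model, TextPager.TotalScan(10, 10, false) returns 550 and TextPager.TotalScan(10, 10, true) returns 100: without the index the work grows with the square of the page count, and with it the work grows linearly. Loading page 5 and then page 6 on a fresh indexed pager scans 50 lines and then 10, which mirrors the page 5 and page 6 scenario described above.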

Real-World Example

For instance, sample.pdf is a PDF file with 115 pages. Here is a sample implementation of loading this file using RasterCodecs:

C#
// Requires: using System; using System.Diagnostics; using Leadtools; using Leadtools.Codecs;
private static void TestOptimizeLoad(string imageFileName, bool useOptimizeLoad)
{ 
    using (RasterCodecs codecs = new RasterCodecs()) 
    { 
        // Stopwatch used to time each operation
        var stopwatch = new Stopwatch();

        int pageCount;
        double thisTime = 0;
        double totalTime = 0;

        // See if we need to turn optimize load on.
        if (useOptimizeLoad)
            codecs.StartOptimizedLoad();

        // Use GetInformation to get the total number of pages in the file
        stopwatch.Restart();
        CodecsImageInfo imageInfo = codecs.GetInformation(imageFileName, true);
        thisTime = stopwatch.ElapsedMilliseconds; 
        totalTime += thisTime; 
        pageCount = imageInfo.TotalPages; 
        imageInfo.Dispose(); 
 
        Console.WriteLine($"GetInformation took {thisTime} ms"); 
 
        // Loop through every page and load it 
        for (int pageNumber = 1; pageNumber <= pageCount; pageNumber++) 
        { 
            stopwatch.Restart(); 
            RasterImage image = codecs.Load(imageFileName, pageNumber); 
            thisTime = stopwatch.ElapsedMilliseconds; 
            totalTime += thisTime; 
            image.Dispose(); 
 
            Console.WriteLine($"Loading page {pageNumber} took {thisTime} ms"); 
        } 
 
        // See if we need to turn optimize load back off and free its data. 
        // Note that this is not really needed in this example since we 
        // are disposing the RasterCodecs right after which will take care of that 
        // for us. 
        if (useOptimizeLoad) 
            codecs.StopOptimizedLoad(); 
 
        Console.WriteLine($"Total time {totalTime} ms"); 
    } 
} 

The code is simple: it first obtains the total number of pages in the file and then loads each page as a RasterImage object. It also logs the time each operation took and accumulates the total time spent.

Calling this method with TestOptimizeLoad("sample.pdf", false) results in the following on the test machine:

GetInformation took 114 ms 
Loading page 1 took 103 ms 
Loading page 2 took 61 ms 
Loading page 3 took 46 ms 
// Snip 
Loading page 113 took 43 ms 
Loading page 114 took 40 ms 
Loading page 115 took 50 ms 
Total time 6622 ms 

Calling this method with TestOptimizeLoad("sample.pdf", true) results in the following on the same test machine:

GetInformation took 114 ms 
Loading page 1 took 115 ms 
Loading page 2 took 31 ms 
Loading page 3 took 42 ms 
// Snip 
Loading page 113 took 43 ms 
Loading page 114 took 40 ms 
Loading page 115 took 48 ms 
Total time 5577 ms 

That is 6.6 seconds versus 5.6 seconds. By calling StartOptimizedLoad, we reduced the total time by about 16%.

Another file is sample.txt. This is a large text file with 348 pages. Here are the results of TestOptimizeLoad("sample.txt", false):

GetInformation took 294 ms 
Loading page 1 took 193 ms 
Loading page 2 took 155 ms 
// Snip 
Loading page 346 took 161 ms 
Loading page 347 took 138 ms 
Loading page 348 took 146 ms 
Total time 51706 ms 

While TestOptimizeLoad("sample.txt", true) results in:

GetInformation took 259 ms 
Loading page 1 took 59 ms 
Loading page 2 took 34 ms 
Loading page 3 took 34 ms 
// Snip 
Loading page 346 took 34 ms 
Loading page 347 took 35 ms 
Loading page 348 took 33 ms 
Total time 12992 ms 

This is 13 seconds versus 52 seconds, or roughly 4 times faster. TIFF, PDF, and text files show the extremes of the effect of using optimized load with RasterCodecs:

TIFF files are not affected since the format contains all the information needed to quickly load any page in a simple to read and parse header.

PDF files are sped up by a margin depending on the file data (size and number of cross references between the pages).

Text files are sped up the most, with the increase in performance directly proportional to the number of pages in the file.
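
The totals logged in the runs above can be compared in two ways: as the share of time saved, or as a speed-up factor. The small helper below is only an arithmetic check on those logged numbers; the class and method names are illustrative.

```csharp
using System;

public static class Speedup
{
    // Share of the original time that was saved, as a percentage.
    public static double TimeSavedPercent(double beforeMs, double afterMs)
    {
        return (beforeMs - afterMs) / beforeMs * 100.0;
    }

    // How many times faster the second run is than the first.
    public static double Factor(double beforeMs, double afterMs)
    {
        return beforeMs / afterMs;
    }
}
```

For sample.pdf, Speedup.TimeSavedPercent(6622, 5577) is about 15.8. For sample.txt, Speedup.Factor(51706, 12992) is about 3.98, which corresponds to about 74.9% of the time saved.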

The above example uses optimized load data to convert all of the pages of an input file. For PDF files, the speed improvement will be noticeable when large files (1000 pages or more) are converted. For other file formats, the speed improvement can be significant even with fewer pages (as few as 10 pages). TIFF/BigTIFF files use a simpler mechanism (the IFD, or file offset of each page). For more information about using the IFD, refer to Loading and Saving Large TIFF/BigTIFF Files.

The following file formats are supported:

Usage

The simplest way to enable optimized load in RasterCodecs is to call RasterCodecs.StartOptimizedLoad() before getting information or loading the pages of a document, and RasterCodecs.StopOptimizedLoad() when the operation finishes. StartOptimizedLoad informs RasterCodecs that all subsequent calls to RasterCodecs.GetInformation, RasterCodecs.GetInformationAsync, RasterCodecs.Load, and RasterCodecs.LoadAsync will be performed on the same image file (or stream) until RasterCodecs.StopOptimizedLoad is called. Passing a different file or stream in the middle of this operation results in undefined behavior. To load a different file in the middle of the operation, create a new instance of RasterCodecs and use it for this purpose.

While optimized load is on, the RasterCodecs object creates, stores, and re-uses the internal data used by the file formats (such as PDF or text) for the particular document file or stream, handling all the necessary communication. This internal data is freed when optimized load is turned off using RasterCodecs.StopOptimizedLoad or when the RasterCodecs object is disposed.
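
A minimal sketch of this start/stop pattern follows. It uses only the RasterCodecs calls that appear elsewhere in this topic. The helper name LoadAllPages, the callback parameter, and the try/finally arrangement are choices made for this illustration, and the sketch assumes the Leadtools and Leadtools.Codecs assemblies are referenced and the toolkit license is already set.

```csharp
using System;
using Leadtools;
using Leadtools.Codecs;

public static class OptimizedLoadUsage
{
    // Hypothetical helper: visits every page of ONE file with optimized load on.
    // Returns the number of pages that were loaded.
    public static int LoadAllPages(string fileName, Action<RasterImage, int> onPage)
    {
        using (RasterCodecs codecs = new RasterCodecs())
        {
            // From here until StopOptimizedLoad, only fileName may be passed
            // to GetInformation and Load on this RasterCodecs instance.
            codecs.StartOptimizedLoad();
            try
            {
                int pageCount;
                using (CodecsImageInfo info = codecs.GetInformation(fileName, true))
                    pageCount = info.TotalPages;

                for (int pageNumber = 1; pageNumber <= pageCount; pageNumber++)
                {
                    using (RasterImage image = codecs.Load(fileName, pageNumber))
                        onPage(image, pageNumber);
                }

                return pageCount;
            }
            finally
            {
                // Free the internal state even if a page fails to load.
                codecs.StopOptimizedLoad();
            }
        }
    }
}
```

Because the RasterCodecs object is disposed right after the loop, the explicit StopOptimizedLoad call is not strictly required here. It matters when the same RasterCodecs instance goes on to work with a different file or stream.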

More advanced usage is needed when the pages cannot all be loaded during the life span of one RasterCodecs instance, and the optimized load data must instead be re-used by different RasterCodecs objects, for example, when loading the pages through a web service. In this case, RasterCodecs.GetOptimizedLoadData can be used to obtain the current internal state data, and RasterCodecs.StartOptimizedLoad(CodecsOptimizedLoadData) can then be used to turn on the same optimized load on a different RasterCodecs object.

Note that not all file formats use optimized load data, and of the ones that do, not all support getting and re-setting the optimized data. For both of these types of formats, RasterCodecs.GetOptimizedLoadData returns null.

For instance, here is an action in an ASP.NET controller of a web service that allows the user to load a page as PNG from a URL:

C#
public async Task<ActionResult> GetPng(Uri uri, int pageNumber) 
{ 
    // New RasterCodecs 
    using (RasterCodecs codecs = new RasterCodecs()) 
    { 
        RasterImage image; 
 
        // Load the page as RasterImage 
        using (ILeadStream leadStream = await LeadStream.Factory.FromUri(uri)) 
        { 
            image = await codecs.LoadAsync(leadStream, pageNumber); 
        } 
 
        // Save it as PNG 
        MemoryStream memoryStream = new MemoryStream(); 
        using (ILeadStream leadStream = LeadStream.Factory.FromStream(memoryStream)) 
        { 
            await codecs.SaveAsync(image, leadStream, RasterImageFormat.Png, 0); 
            memoryStream.Position = 0; 
        } 
 
        image.Dispose(); 
 
        return File(memoryStream, "image/png"); 
    } 
} 

The goal is to use optimized load when the URL points to the same file. We will use a concurrent dictionary to store the optimized load data. For a real-world application this should be replaced by a caching mechanism:

C#
private static ConcurrentDictionary<string, CodecsOptimizedLoadData> _urlOptimizeLoadData = new ConcurrentDictionary<string, CodecsOptimizedLoadData>();

public async Task<ActionResult> GetPng(Uri uri, int pageNumber)
{ 
    // New RasterCodecs 
    using (RasterCodecs codecs = new RasterCodecs()) 
    { 
        RasterImage image; 
 
        // See if we have this URI optimized already 
 
        // Will use the lower case of the URI as the key (culture-invariant, so the
        // key does not depend on the server's current culture)
        string key = uri.ToString().ToLowerInvariant();
        CodecsOptimizedLoadData uriOptimizedLoadData = null; 
        if (!_urlOptimizeLoadData.TryGetValue(key, out uriOptimizedLoadData)) 
        { 
            // We do not have optimize data for this URI, start a new one 
            codecs.StartOptimizedLoad(); 
        } 
        else 
        { 
            // We have optimize data for this URI, use it 
            codecs.StartOptimizedLoad(uriOptimizedLoadData); 
        } 
 
        // Load the page as RasterImage 
        using (ILeadStream leadStream = await LeadStream.Factory.FromUri(uri)) 
        { 
            image = await codecs.LoadAsync(leadStream, pageNumber); 
        } 
 
        // If we started a new optimized data, and it is being used by RasterCodecs, then save 
        // it in our dictionary 
        if (uriOptimizedLoadData == null) 
        { 
            uriOptimizedLoadData = codecs.GetOptimizedLoadData(); 
 
            // This could be null, since not all file formats (1) support optimized load
            // data or (2) support getting or setting the data
            if (uriOptimizedLoadData != null) 
            { 
                // Save it 
                _urlOptimizeLoadData.TryAdd(key, uriOptimizedLoadData); 
            } 
        } 
 
        // Must stop the optimize load here, we will use this RasterCodecs object for saving 
        // to a different location 
        codecs.StopOptimizedLoad(); 
 
        // Save it as PNG 
        MemoryStream memoryStream = new MemoryStream(); 
        using (ILeadStream leadStream = LeadStream.Factory.FromStream(memoryStream)) 
        { 
            await codecs.SaveAsync(image, leadStream, RasterImageFormat.Png, 0); 
            memoryStream.Position = 0; 
        } 
 
        image.Dispose(); 
 
        return File(memoryStream, "image/png"); 
    } 
} 

The CodecsOptimizedLoadData class contains the following data members that can be serialized and re-constructed easily:

Member            Description
CodecIndex        Integer indicating the internal LEADTOOLS codec (file filter) index using the data.
GetData/SetData   Gets or sets the internal state data as a byte array.

The example above stores CodecsOptimizedLoadData objects in a dictionary. However, the class can easily be re-constructed after being saved into a cache system. For example:

C#
// Get it from a raster codecs: 
CodecsOptimizedLoadData optimizedLoadData = codecs.GetOptimizedLoadData(); 
// Save it into our user-defined cache, we need to save the integer (codecIndex) and byte[] (Data) 
if (optimizedLoadData != null) 
{ 
    int codecIndex = optimizedLoadData.CodecIndex; 
    byte[] data = optimizedLoadData.GetData(); 
    SaveToCache(key, codecIndex, data); 
} 

And the other way around:

C#
int codecIndex; 
byte[] data; 
// Load it from our user-defined cache
if (LoadFromCache(key, out codecIndex, out data)) 
{ 
    // Create CodecsOptimizedLoadData and set it into the RasterCodecs: 
    CodecsOptimizedLoadData optimizedLoadData = new CodecsOptimizedLoadData(); 
    optimizedLoadData.CodecIndex = codecIndex; 
    optimizedLoadData.SetData(data); 
    codecs.StartOptimizedLoad(optimizedLoadData); 
} 
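
SaveToCache and LoadFromCache are placeholders in the snippets above. One possible shape for them, assuming the cache stores a single byte[] blob per key, is to pack the codec index and the state data together. The class name OptimizedLoadBlob and the layout (a 4-byte index, a 4-byte length, then the payload) are assumptions made for this illustration and are not part of LEADTOOLS.

```csharp
using System;
using System.IO;

public static class OptimizedLoadBlob
{
    // Size of the two Int32 fields written before the payload.
    private const int HeaderSize = 8;

    // Packs the codec index and the state data into one array for storage.
    public static byte[] Pack(int codecIndex, byte[] data)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));

        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(codecIndex);
            writer.Write(data.Length);
            writer.Write(data);
            writer.Flush();
            return stream.ToArray();
        }
    }

    // Reverses Pack. Returns false if the blob is missing or malformed.
    public static bool TryUnpack(byte[] blob, out int codecIndex, out byte[] data)
    {
        codecIndex = 0;
        data = null;
        if (blob == null || blob.Length < HeaderSize) return false;

        using (var reader = new BinaryReader(new MemoryStream(blob)))
        {
            int index = reader.ReadInt32();
            int length = reader.ReadInt32();

            // The declared length must match the bytes actually present.
            if (length < 0 || length != blob.Length - HeaderSize) return false;

            codecIndex = index;
            data = reader.ReadBytes(length);
            return true;
        }
    }
}
```

With this layout, a SaveToCache helper would store OptimizedLoadBlob.Pack(codecIndex, data) under the key, and a LoadFromCache helper would return the result of OptimizedLoadBlob.TryUnpack on the stored blob. Treating a malformed blob as a cache miss lets the caller fall back to the parameterless StartOptimizedLoad().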

Help Version 20.0.2018.1.19
© 1991-2018 LEAD Technologies, Inc. All Rights Reserved.