
DocumentMemoryCache Class

Summary

Document memory cache support.

Syntax
C#
public class DocumentMemoryCache

C++/CLI
public:
   ref class DocumentMemoryCache

Python
class DocumentMemoryCache:
Remarks

Certain multipage file formats can be slow to load using the default implementation of DocumentFactory. This is especially true for document file formats such as DOCX/DOC, XLSX/XLS, RTF, PDF, and TXT.

The Using the OptimizedLoad Functions to Speed Loading Large Files topic contains a detailed explanation of the optimization techniques used by LEADTOOLS to speed up parsing the images, SVG, and text data of the pages of these file types.

As explained in the topic, CodecsOptimizedLoadData can contain managed or unmanaged memory, depending on the file format. If a format supports managed data (such as TXT and AFP), the DocumentFactory stores this data into the cache itself and re-uses it whenever a document is loaded from the cache.

If a format can only support unmanaged data (such as DOCX and XLSX), the DocumentFactory is unable to store this type of data in a cache for re-use. The default behavior is for the factory to re-create this data from scratch whenever a document is loaded from the cache, which can degrade performance.

Performance is generally not a problem for desktop-based applications using the document library, such as the Windows Document Viewer or Document Converter. Here, the workflow is to load the document once from a file or URL, use the resulting LEADDocument, and only dispose of it when the user has closed the document.

This, however, can become a problem for applications that require saving the document into the cache and then re-loading it again (such as the Document Service). Here, the workflow is to load the document initially, generate its unmanaged optimized data, and then save it into the cache and dispose the object. Subsequent calls to get the images, SVG, and text data of the pages will have to first load the document from the cache, and thus regenerate the data with every call.

DocumentMemoryCache can be used by the factory to greatly enhance the performance of such documents by keeping the unmanaged data alive in memory, independent of the document. Subsequent calls to load the document from the cache will then use the memory-cached data instead of re-generating it from scratch. The result is quicker loading and parsing of the data of the pages. Naturally, keeping unmanaged data in memory increases the application's resource usage, so careful consideration must be given to how and when to store this data.

Document Service Workflow 1

A typical client/server application such as the LEADTOOLS Document Viewer/Document Service works as follows:

  1. The Document Viewer loads or uploads a document from an external resource, and eventually calls DocumentFactory.LoadFromUri or DocumentFactory.LoadFromFile on the Document Service.

  2. DocumentFactory calls RasterCodecs.StartOptimizedLoad and RasterCodecs.GetInformation to obtain information about the source document, such as format and number of pages. This can take some time if the document is very large and complex, which is especially true for file formats such as XLSX/XLS.

  3. A CodecsOptimizedLoadData object can be obtained at this point. If obtained, the factory stores it internally inside the LEADDocument object.

  4. The service then saves this document information in the cache. If CodecsOptimizedLoadData supports managed data, it is saved into the cache as well. This is true for file formats such as TXT and AFP.

  5. The service disposes the document with all its data, including the CodecsOptimizedLoadData, if available.

  6. The document ID in the cache is returned to the viewer.

  7. The viewer builds the skeleton of empty pages and thumbnails.

  8. The viewer sends one or more requests asynchronously to the service to obtain page images, SVG, or text data.

  9. Each one of these requests results in loading the document information from the cache using DocumentFactory.LoadFromCache (a minimal sketch of this per-request pattern follows this list).

  10. CodecsOptimizedLoadData is created for the document. If its managed data were stored in the cache, the managed data is loaded and used (the parsing of the complex document structure is not performed).

  11. If CodecsOptimizedLoadData supports only unmanaged data, then the data was not stored in the cache. Parsing of the complex document structure must be performed again.
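
The per-request pattern of steps 9 and 10 can be sketched as follows. This is a minimal outline based on the example at the end of this topic; the cache, documentId, and pageNumber variables are assumed to be supplied by the service:

// Service-side handling of a single page request (sketch)
using (LEADDocument document = DocumentFactory.LoadFromCache(new LoadFromCacheOptions
{
   Cache = cache,           // the same ObjectCache the document was saved into
   DocumentId = documentId  // the document ID previously returned to the viewer
}))
{
   // For formats whose CodecsOptimizedLoadData is unmanaged (such as DOCX and XLSX),
   // this is the point where that data must be re-created from scratch in Workflow 1
   using (RasterImage image = document.Pages[pageNumber - 1].GetImage())
   {
      // Render or encode the page image for the response
   }
}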

Document Service Workflow 2

DocumentFactory supports caching of the unmanaged data described above in memory. When enabled, the workflow is modified as follows:

  1. The Document Viewer loads or uploads a document from an external resource, and eventually calls DocumentFactory.LoadFromUri or DocumentFactory.LoadFromFile on the Document Service.

  2. DocumentFactory calls RasterCodecs.StartOptimizedLoad and RasterCodecs.GetInformation to obtain information about the source document, such as format and number of pages. This can take some time if the document is very large and complex, which is especially true for file formats such as XLSX/XLS.

    2.1 New behavior: The previous operation is timed and the duration is stored in LEADDocument.LoadDuration.

  3. A CodecsOptimizedLoadData object may be obtained at this point, and the factory stores it internally inside the LEADDocument object.

  4. The service then saves this document information in the cache. If CodecsOptimizedLoadData supports managed data, it is saved into the cache as well. This is true for file formats such as TXT and AFP.

    4.1 New behavior: If CodecsOptimizedLoadData supports only unmanaged data (such as DOCX and XLSX/XLS), then LoadDuration is compared against DocumentMemoryCacheStartOptions.MinimumLoadDuration. If it is greater, the unmanaged data is stored in an internal memory cache associated with the document.

  5. The service disposes of the document with all its data, including the CodecsOptimizedLoadData if available.

  6. The document ID in the cache is returned to the viewer.

  7. The viewer builds the skeleton of empty pages and thumbnails.

  8. The viewer sends one or more requests asynchronously to the service to obtain page images, SVG, or text data.

  9. Each one of these requests results in loading the document information from the cache using DocumentFactory.LoadFromCache.

  10. CodecsOptimizedLoadData is created for the document. If its managed data was stored in the cache, it is loaded and used (parsing of the complex document structure is not performed).

    10.1 New behavior: Otherwise, the internal memory cache is queried for a CodecsOptimizedLoadData associated with this document. If one is found, it is used and parsing of the complex document structure is not performed again. If not, the workflow continues with step 11.

  11. If CodecsOptimizedLoadData supports only unmanaged data, then it was not stored in the cache. Parsing of the complex document structure must be performed again.

    11.1 New behavior: If step 11 occurred, then step 4.1 is repeated and this CodecsOptimizedLoadData is potentially added to the internal memory cache. Subsequent calls to LoadFromCache will then use this data and the speed increase is regained. This scenario occurs when the data stored in the memory cache has expired after a specified period of inactivity.

Therefore, using the document memory cache increases the performance of loading and parsing pages from large and complex document formats, at the expense of keeping the unmanaged data in memory on the server side. LEADTOOLS internal testing showed an increase in speed of up to 50 times when loading very large XLSX files in the JavaScript document viewer with the document memory cache enabled.

Usage

DocumentFactory contains the static property DocumentFactory.DocumentMemoryCache that controls the usage of this feature in the document toolkit.

DocumentMemoryCache usage can be enabled as follows:
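
A minimal sketch, based on the example at the end of this topic (the option value shown is illustrative):

// Enable document memory caching for the document toolkit
DocumentFactory.DocumentMemoryCache.Start(new DocumentMemoryCacheStartOptions
{
   // Only memory-cache documents whose initial load takes longer than 2 seconds
   MinimumLoadDuration = TimeSpan.FromSeconds(2)
});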

This will initialize the internal memory cache and its cleanup timer; all subsequent LoadFromUri, LoadFromFile, and LoadFromCache calls will switch to "Document Service Workflow 2" described above.

The internal cache contains CodecsOptimizedLoadData objects associated with document IDs. Each of these items also contains a timestamp to mark its last usage. The timestamps are updated whenever the document is "touched". If a specific amount of time passes without any activity on the saved data of a document, it is considered expired and is removed from the cache.

The engine will automatically keep the data of a document alive between calls to LoadFromCache and dispose, thus preventing long-running operations (such as converting a large document to a different format) from triggering the expiry of the data.

The engine will also automatically purge the data of a document when it is deleted from the cache.
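
For example (a minimal sketch mirroring the cleanup code in the example below; cache and documentId are assumed to come from the calling context):

// Deleting the document from the cache also purges its memory-cached data
DocumentFactory.DeleteFromCache(new LoadFromCacheOptions
{
   Cache = cache,
   DocumentId = documentId
});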

DocumentMemoryCacheStartOptions contains the following options:

Member Description
MinimumLoadDuration The minimum amount of time the initial LoadFromUri/LoadFromFile of a document must take for the document to be considered for memory optimization. The default value is 2 seconds.
MaximumItems Maximum number of items to keep in the cache. The default value is 0, meaning there is no limit.
SlidingExpiration The amount of time a cache entry can go without being "touched" before it is deleted from the cache. The default value is 60 seconds.
TimerInterval Interval the timer uses to check for and remove expired items. The default value is 60 seconds.

DocumentMemoryCache contains the following members:

Member Description
IsStarted Checks whether document memory caching support has started.
Start Starts document memory caching support.
Stop Stops document memory caching support.
HasDocument Checks whether an entry associated with the specified document exists.
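
A minimal sketch of how these members fit together (assuming a documentId of a previously cached document):

// At service startup, enable memory caching if it is not already running
if (!DocumentFactory.DocumentMemoryCache.IsStarted)
   DocumentFactory.DocumentMemoryCache.Start(new DocumentMemoryCacheStartOptions());

// Check whether a cached document currently has memory-cached data associated with it
bool hasMemoryData = DocumentFactory.DocumentMemoryCache.HasDocument(documentId, false);

// At service shutdown, stop document memory caching support
DocumentFactory.DocumentMemoryCache.Stop();
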
Example

This example simulates a client loading a small and then a large document from a URI, and shows the times taken. The sample produces results similar to the following:

Using memory cache is False
Initial load from leadtools.pdf took ~0 seconds
Is using memory cache is False
Multi-threaded load of all pages took ~0 seconds
Initial load from complex.xlsx took ~10 seconds
Is using memory cache is False
Multi-threaded load of all pages took ~13 seconds
Using memory cache is True
Initial load from leadtools.pdf took ~0 seconds
Is using memory cache is False
Multi-threaded load of all pages took ~0 seconds
Initial load from complex.xlsx took ~10 seconds
Is using memory cache is True
Multi-threaded load of all pages took ~0 seconds

Notice how the time it took to load all the pages of complex.xlsx in a multi-threaded code decreased from 13 to almost 0 seconds when memory cache is used.

C#
using Leadtools; 
using Leadtools.Codecs; 
using Leadtools.Document.Writer; 
 
using Leadtools.Document; 
using Leadtools.Caching; 
using Leadtools.Annotations.Engine; 
using Leadtools.Ocr; 
using Leadtools.Barcode; 
using Leadtools.Document.Converter; 
 
public void DocumentMemoryCacheExample() 
{ 
   // The cache we are using 
   FileCache cache = new FileCache(); 
   cache.PolicySerializationMode = CacheSerializationMode.Json; 
   cache.DataSerializationMode = CacheSerializationMode.Json; 
   cache.CacheDirectory = @"c:\cache-dir"; 
 
   // The document files we are using 
   string[] documentFiles = 
   { 
      // PDF files are very fast to load and will not use memory cache 
      @"C:\LEADTOOLS22\Resources\Images\leadtools.pdf", 
      // Large Excel files are complex and loading may take some time, they could use memory cache 
      @"C:\LEADTOOLS22\Resources\Images\complex.xlsx", 
   }; 
 
   // First process without memory cache and obtain some times 
   LoadAndProcessDocument(documentFiles, cache); 
 
   // Then process with memory cache 
 
   // Document memory cache options to use 
   var documentMemoryCacheStartOptions = new DocumentMemoryCacheStartOptions 
   { 
      // Use for documents that take more than 2 seconds to load initially 
      MinimumLoadDuration = TimeSpan.FromSeconds(2), 
      // No maximum limit on the number of cache items to keep in memory 
      MaximumItems = 0, 
      // Purge items from the cache if not touched for 60 seconds
      SlidingExpiration = TimeSpan.FromSeconds(60), 
      // Check for expired items every 60 seconds 
      TimerInterval = TimeSpan.FromSeconds(60) 
   }; 
 
   // Use it 
   DocumentFactory.DocumentMemoryCache.Start(documentMemoryCacheStartOptions); 
 
   // Run again 
   // For the first document, times should be very close to the previous run since this is a PDF document and is very fast (less than MinimumLoadDuration)
   // For the second document, initial times should be the same, but loading all pages should be much faster 
   LoadAndProcessDocument(documentFiles, cache); 
 
   // Clean up 
   DocumentFactory.DocumentMemoryCache.Stop(); 
} 
 
private static void LoadAndProcessDocument(string[] documentFiles, ObjectCache cache) 
{ 
   Console.WriteLine($"Using memory cache is {DocumentFactory.DocumentMemoryCache.IsStarted}"); 
   string[] documentIds = new string[documentFiles.Length]; 
 
   var stopwatch = new Stopwatch(); 
   TimeSpan elapsed; 
 
   for (var i = 0; i < documentFiles.Length; i++) 
   { 
      string documentFile = documentFiles[i]; 
      int pageCount; 
 
      // First try without memory cache and obtain some times 
      stopwatch.Restart(); 
      using (LEADDocument document = DocumentFactory.LoadFromFile( 
         documentFile, 
         new LoadDocumentOptions 
         { 
            Cache = cache 
         })) 
      { 
         document.Images.MaximumImagePixelSize = 2048; 
         document.AutoSaveToCache = false; 
         document.AutoDeleteFromCache = false; 
         document.SaveToCache(); 
         documentIds[i] = document.DocumentId; 
         pageCount = document.Pages.Count; 
      } 
      elapsed = stopwatch.Elapsed; 
      Console.WriteLine($"Initial load from {Path.GetFileName(documentFile)} took ~{(int)elapsed.TotalSeconds} seconds"); 
      // Check if it's in the cache 
      Console.WriteLine($"Is using memory cache is {DocumentFactory.DocumentMemoryCache.HasDocument(documentIds[i], false)}"); 
 
      // Next call LoadFromCache and process a page in multiple threads 
      stopwatch.Restart(); 
      LoadAllPagesInThreads(documentIds[i], pageCount, cache); 
      elapsed = stopwatch.Elapsed; 
      Console.WriteLine($"Multi-threaded load of all pages took ~{(int)elapsed.TotalSeconds} seconds"); 
   } 
 
   // Clean up 
   DeleteDocumentsFromCache(documentIds, cache); 
} 
 
private static void LoadAllPagesInThreads(string documentId, int pageCount, ObjectCache cache) 
{ 
   System.Threading.Tasks.Parallel.For( 
      1, 
      pageCount + 1, 
      new System.Threading.Tasks.ParallelOptions { MaxDegreeOfParallelism = 4 }, 
      (int pageNumber) => 
      { 
         // Load the document from the cache 
         using (LEADDocument document = DocumentFactory.LoadFromCache( 
            new LoadFromCacheOptions 
            { 
               Cache = cache, 
               DocumentId = documentId 
            })) 
         { 
            // Simulates processing of the page 
            DocumentPage documentPage = document.Pages[pageNumber - 1]; 
            using (RasterImage image = documentPage.GetImage()) 
            { 
 
            } 
         } 
      }); 
} 
 
private static void DeleteDocumentsFromCache(string[] documentIds, ObjectCache cache) 
{ 
   foreach (string documentId in documentIds) 
   { 
      DocumentFactory.DeleteFromCache(new LoadFromCacheOptions 
      { 
         Cache = cache, 
         DocumentId = documentId, 
      }); 
   } 
} 
 
Java

import java.io.File;
import java.io.FileOutputStream; 
import java.io.IOException; 
import java.net.MalformedURLException; 
import java.net.URI; 
import java.net.URISyntaxException; 
import java.net.URL; 
import java.nio.file.Files; 
import java.nio.file.Paths; 
import java.util.ArrayList; 
import java.util.Calendar; 
import java.util.List; 
import java.util.concurrent.Callable; 
import java.util.concurrent.ExecutorService; 
import java.util.concurrent.Executors; 
import java.util.concurrent.Future; 
import java.util.regex.Pattern; 
 
import org.junit.*; 
import org.junit.runner.JUnitCore; 
import org.junit.runner.Result; 
import org.junit.runner.notification.Failure; 
import static org.junit.Assert.*; 
 
import leadtools.*; 
import leadtools.annotations.engine.*; 
import leadtools.barcode.*; 
import leadtools.caching.*; 
import leadtools.codecs.*; 
import leadtools.document.*; 
import leadtools.document.DocumentMimeTypes.UserGetDocumentStatusHandler; 
import leadtools.document.converter.*; 
import leadtools.document.writer.*; 
import leadtools.ocr.*; 
 
 
public void documentMemoryCacheExample() { 
   // The cache we are using 
   FileCache cache = new FileCache(); 
   cache.setPolicySerializationMode(CacheSerializationMode.JSON); 
   cache.setDataSerializationMode(CacheSerializationMode.JSON); 
   cache.setCacheDirectory("c:\\cache-dir"); 
 
   // The document files we are using 
   String[] documentFiles = { 
         // PDF files are very fast to load and will not use memory cache 
         "C:\\LEADTOOLS23\\Resources\\Images\\leadtools.pdf", 
         // Large Excel files are complex and loading may take some time, they could use 
         // memory cache 
         "C:\\LEADTOOLS23\\Resources\\Images\\large_sheet_5k.xlsx", 
   }; 
 
   // First process without memory cache and obtain some times 
   loadAndProcessDocument(documentFiles, cache); 
 
   // Then process with memory cache 
 
   // Document memory cache options to use 
   DocumentMemoryCacheStartOptions documentMemoryCacheStartOptions = new DocumentMemoryCacheStartOptions(); 
   // Use for documents that take more than 2 seconds to load initially
   documentMemoryCacheStartOptions.setMinimumLoadDuration(2);
   // No maximum limit on the number of cache items to keep in memory 
   documentMemoryCacheStartOptions.setMaximumItems(0); 
   // Purge items from the cache if not touched for 60 seconds
   documentMemoryCacheStartOptions.setSlidingExpiration(60); 
   // Check for expired items every 60 seconds 
   documentMemoryCacheStartOptions.setTimerInterval(60); 
 
   // Use it 
   DocumentFactory.getDocumentMemoryCache().start( 
         documentMemoryCacheStartOptions); 
 
   // Run again 
   // For the first document, times should be very close to the previous run since this is a PDF
   // document and is very fast (less than MinimumLoadDuration)
   // For the second document, initial times should be the same, but loading all 
   // pages should be much faster 
   loadAndProcessDocument(documentFiles, cache); 
 
   // Clean up 
   DocumentFactory.getDocumentMemoryCache().stop(); 
} 
 
private static void loadAndProcessDocument(String[] documentFiles, 
      ObjectCache cache) { 
   System.out.println("Using memory cache is " + 
         DocumentFactory.getDocumentMemoryCache().isStarted()); 
   String[] documentIds = new String[documentFiles.length]; 
 
   long stopwatch = System.currentTimeMillis(); 
   long elapsed; 
 
   for (int i = 0; i < documentFiles.length; i++) { 
      String documentFile = documentFiles[i]; 
      int pageCount; 
 
      // First try without memory cache and obtain some times 
      stopwatch = System.currentTimeMillis(); 
      LoadDocumentOptions loadOptions = new LoadDocumentOptions(); 
      loadOptions.setCache(cache); 
      LEADDocument document = DocumentFactory.loadFromFile(documentFile, 
            loadOptions); 
      document.getImages().setMaximumImagePixelSize(2048); 
      document.setAutoSaveToCache(false); 
      document.setAutoDeleteFromCache(false); 
      document.saveToCache(); 
      documentIds[i] = document.getDocumentId(); 
      pageCount = document.getPages().size(); 
 
      elapsed = (System.currentTimeMillis() - stopwatch) / 1000; 
      System.out.println("Initial load from " + documentFile + " took ~" + elapsed + " seconds"); 
      // Check if it's in the cache 
      System.out.println("Is using memory cache is " 
            + DocumentFactory.getDocumentMemoryCache().hasDocument(documentIds[i], 
                  false)); 
 
      // Next call LoadFromCache and process a page in multiple threads 
      stopwatch = System.currentTimeMillis(); 
      loadAllPagesInThreads(documentIds[i], pageCount, cache); 
      elapsed = (System.currentTimeMillis() - stopwatch) / 1000; 
      System.out.println("Multi-threaded load of all pages took ~" + elapsed 
            + " seconds"); 
   } 
   // Clean up 
   deleteDocumentsFromCache(documentIds, cache); 
} 
 
private static void loadAllPagesInThreads(String documentId, int pageCount, ObjectCache cache) { 
   int maxDegreeOfParallelism = 4; 
   ExecutorService executorService = Executors.newFixedThreadPool(maxDegreeOfParallelism); 
   try { 
      // Create a list to hold the Future objects representing the processing of each 
      // page 
      List<Future<Void>> futures = new ArrayList<>(); 
      for (int pageNumber = 1; pageNumber <= pageCount; pageNumber++) { 
         final int currentPage = pageNumber; 
         // Submit each page processing task to the executor 
         Future<Void> future = executorService.submit(new Callable<Void>() { 
            @Override 
            public Void call() throws Exception { 
               // Load the document from the cache 
               LoadFromCacheOptions options = new LoadFromCacheOptions(); 
               options.setCache(cache); 
               options.setDocumentId(documentId); 
               LEADDocument document = DocumentFactory.loadFromCache(options); 
               DocumentPage documentPage = document.getPages().get(currentPage - 1); 
               RasterImage image = documentPage.getImage(); 
               return null; 
            } 
         }); 
         futures.add(future); 
      } 
      // Wait for all tasks to complete 
      for (Future<Void> future : futures) { 
         future.get(); 
      } 
   } catch (Exception e) { 
      e.printStackTrace(); 
   } finally { 
      // Shut down the executor service 
      executorService.shutdown(); 
   } 
} 
 
private static void deleteDocumentsFromCache(String[] documentIds, 
      ObjectCache cache) { 
   for (String documentId : documentIds) { 
      LoadFromCacheOptions loadOptions = new LoadFromCacheOptions(); 
      loadOptions.setCache(cache); 
      loadOptions.setDocumentId(documentId); 
      DocumentFactory.deleteFromCache(loadOptions); 
   } 
} 
Requirements

Target Platforms

Help Version 23.0.2024.2.29

Leadtools.Document Assembly