Document Toolkit and Caching

Introduction

The document library can make use of a cache system for the following:

For example, the Desktop LEADTOOLS Document Viewer Demo uses a FileCache system. The application sets a temporary folder as the cache directory through FileCache.CacheDirectory and uses default CacheItemPolicy objects that do not expire for caching the document. The value of AutoSaveToCache is set to false and each document will delete its own cache entry when it is disposed of and no longer used.

Note that, in this example, the data is neither shared with other applications, nor is it necessary for it to persist between sessions.

- Persistence: Since web methods are session-less, the LEADTOOLS Document Web Service uses caching with the appropriate configuration to make the same document available to clients without the need to maintain session states.

The service also uses a FileCache system by default, though with a different configuration for persistence. A cache item is created after a document is loaded and contains the document's properties and ID.

A JavaScript client will be able to load the cache item and use it to re-construct the document and obtain the document data for viewing in the LEADTOOLS Document Viewer.

Document Caching

Each LEADDocument object contains an ID. This is a unique string value that can be generated automatically by the system (using a GUID generator), or provided by the user and stored in the LEADDocument.DocumentId field. The ID is all that is required to re-construct a document object from a cache using LoadFromCache.

LoadFromUri is used to create a LEADDocument object that represents a document such as PDF, TIFF, or DOCX stored at a remote URL. The LEADDocument is a data structure containing properties such as the MIME type, number of pages, size of each page, and other metadata. It does not contain any actual image, SVG, or text data of the pages. The original document data (the PDF, TIFF, or DOCX file) is still stored at the remote URL. By default, this data structure is all that is saved into the cache; therefore, saving and then re-loading a document from the cache is a very fast operation and does not require a large amount of memory.

When the user requests an image representation of a page, the document parses it from the original data. This data can also be cached if required using the appropriate value for DocumentCacheOptions.

Implementation

To use caching, an object that implements ObjectCache is initialized once at the start of the application and then passed to the DocumentFactory and LEADDocument methods that require caching. Any cache system that can persist data between application re-starts can be used.

ObjectCache is an abstract class. Derived classes can be implemented to add support for caching using external systems. Below are sample implementations.

Note: The .NET Framework contains ASP.NET caching, which does not persist between sessions and so is unsuitable for the Document library.

FileCache

FileCache is the default implementation of ObjectCache. It supports regions, external resources, and virtual directories.

MemoryCache

This MemoryCache example uses a simple in-memory cache implementation showing the basics of custom caching. It should not be used in a production environment.

RedisObjectCache

This RedisObjectCache example shows an implementation of Azure Redis Cache to be used with the LEADTOOLS Document Library.

RedisWithBlobsObjectCache

This RedisWithBlobsObjectCache example shows an implementation of Azure Redis Cache and Storage Blobs to be used with the LEADTOOLS Document Library.

Ehcache

This EhcacheObjectCache example shows an implementation of the popular Java Ehcache system to be used with the LEADTOOLS Document Library.

Under the Hood

Saving and Loading from Cache

When LEADDocument.SaveToCache is called, the items below are stored in the cache. Calling DocumentFactory.LoadFromCache will succeed if all the values are found in the cache.

| Key | Data | Notes |
|---|---|---|
| "downloaded_file" | Either the original URL or byte[] containing the original PDF, DOCX, TIFF, etc. data of the document being viewed. | LoadFromCache will check for this and return null if it cannot be found in the cache. |
| "annotations_file" | Either the original URL or byte[] containing the optional annotation data. | This is optional and can be null if no annotation file was served with the document. |
| "values" | String containing internal data to re-create the Document object. | |
| "pages" | String containing internal data to re-create the Document page objects. | |
| "bookmarks" | String containing internal data to re-create the Document bookmark objects. | |
| "rastercodecs_options" | String containing internal data to re-create the RasterCodecs used to load/save images and SVG. | |

All items are first set during DocumentFactory.LoadFromUri. The data for "downloaded_file" is from the URI passed to LoadFromUri. Similarly, the data for "annotations_file" is from the LoadDocumentOptions.AnnotationsUri passed to LoadFromUri, if any. All other values are created internally during LoadFromUri.

Saving and loading from cache is performed by calling ObjectCache.AddOrGetExisting with regionName equal to the document ID (LEADDocument.DocumentId) and key equal to the value described in the table. These cache items must always be in the cache for a document to be re-constructed (DocumentFactory.LoadFromCache). If the cache system does not support regions (or groups), then it can simply concatenate the value of regionName (the document ID) + key to create a unique cache ID.
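The concatenation approach for cache systems without region support can be sketched in Java as follows. This is a minimal illustration and not part of the LEADTOOLS API: the FlatKeyCache class, its methods, and the "_" separator are all assumptions made for the example (the text above only states that the two values are concatenated).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical flat key/value store standing in for a cache back end
// that has no notion of regions or groups.
class FlatKeyCache {
    // Assumed separator; chosen here only to keep the combined ID readable.
    private static final String SEPARATOR = "_";

    private final Map<String, Object> store = new HashMap<>();

    // Combine the region name (the document ID) and the item key
    // into a single unique cache ID.
    static String toCacheId(String regionName, String key) {
        return regionName + SEPARATOR + key;
    }

    void put(String regionName, String key, Object value) {
        store.put(toCacheId(regionName, key), value);
    }

    Object get(String regionName, String key) {
        return store.get(toCacheId(regionName, key));
    }
}
```

With this scheme, the "values" item of two different documents maps to two different cache IDs, so the items never collide even though the store itself is flat.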

Document Data in Cache

The original document data (PDF, DOCX, TIFF) is required to parse the document pages data after it has been loaded from the cache. The data is stored in the "DownloadedFile_CacheId" key described below and the value depends on whether the cache system supports external resources.

If the cache supports external resources (ObjectCache.DefaultCacheCapabilities contains ExternalResources), such as the default LEADTOOLS FileCache, which has access to a file system, then the original document data is downloaded to the physical disk file acting as the store for the cache item. This is performed by calling ObjectCache.GetItemExternalResource and writing the data directly to the file. This can reduce the memory footprint and increase performance.

If the cache does not support external resources (such as the Memory and Ehcache implementations), then the original data is stored as a byte[] into the cache item directly.
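The difference between the two storage strategies can be illustrated with a small Java sketch. It does not use the LEADTOOLS classes: DocumentDataStore, its boolean flag, and the ".data" file naming are hypothetical stand-ins for a cache that either can or cannot write to external resources.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical store: keeps either a file path (external resource)
// or the raw bytes (inline item) for each document ID.
class DocumentDataStore {
    private final boolean supportsExternalResources;
    private final Path directory;
    private final Map<String, Object> items = new HashMap<>();

    DocumentDataStore(boolean supportsExternalResources, Path directory) {
        this.supportsExternalResources = supportsExternalResources;
        this.directory = directory;
    }

    void save(String documentId, byte[] data) {
        if (supportsExternalResources) {
            // Write the data to a disk file and keep only the path in the cache item.
            Path file = directory.resolve(documentId + ".data");
            try {
                Files.write(file, data);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            items.put(documentId, file);
        } else {
            // No file system access: keep a copy of the bytes in the cache item itself.
            items.put(documentId, data.clone());
        }
    }

    byte[] load(String documentId) {
        Object item = items.get(documentId);
        if (item == null) {
            return null;
        }
        if (item instanceof Path) {
            try {
                return Files.readAllBytes((Path) item);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return ((byte[]) item).clone();
    }

    boolean isStoredOnDisk(String documentId) {
        return items.get(documentId) instanceof Path;
    }
}
```

In the file-backed mode only a path is held in memory per document, which is the reason the text above gives for the smaller memory footprint.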

The following cache items are added to the cache, depending on DocumentCacheOptions. The key format is:

String key = pageId + "_" + itemName

Where pageId is a GUID representing the page.

| Key | Data | Used when | Notes |
|---|---|---|---|
| "thumbnailImage" | RasterImage containing the thumbnail of this page | DocumentCacheOptions.PageThumbnailImage is set | |
| "text" | DocumentPageText serializer | DocumentCacheOptions.PageText is set | |
| "annotations" | String containing the XML representation of the annotations container for the page | DocumentCacheOptions.PageAnnotations is set | |
| "image_[number]" | RasterImage containing the image of this page | DocumentCacheOptions.PageImage is set | [number] is 0,1,2 or 4 depending on the value of Document.Images.MaximumImagePixelSize |
| "svgBackImage_[number]" | RasterImage containing only the image elements of the SVG document for this page | DocumentCacheOptions.PageSvgBackImage is set | [number] is 0,1,2 or 4 depending on the value of Document.Images.MaximumImagePixelSize |
| "svg_[number1]_[number2]" | SvgDocument containing the SVG representation of this page | DocumentCacheOptions.PageSvg is set | [number1] can be either 0 or 1 (depending on whether this SVG is to be used for viewing or conversion). [number2] can be 0,1,2,3 and up to 15 similar to the images above |
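The key format above can be expressed as a few helper methods. The following Java sketch only builds the key strings; the PageCacheKeys class and its method names are hypothetical and are not part of the toolkit.

```java
// Hypothetical helpers that build page-level cache keys in the
// documented format: pageId + "_" + itemName.
class PageCacheKeys {
    static String itemKey(String pageId, String itemName) {
        return pageId + "_" + itemName;
    }

    // Item name pattern "image_[number]" from the table.
    static String imageKey(String pageId, int number) {
        return itemKey(pageId, "image_" + number);
    }

    // Item name pattern "svg_[number1]_[number2]" from the table.
    static String svgKey(String pageId, int number1, int number2) {
        return itemKey(pageId, "svg_" + number1 + "_" + number2);
    }
}
```

For a page whose ID is the placeholder string "p1", PageCacheKeys.itemKey("p1", "thumbnailImage") yields "p1_thumbnailImage" and PageCacheKeys.svgKey("p1", 0, 3) yields "p1_svg_0_3". In the toolkit the page ID is a GUID rather than a short label.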

Note that the DocumentPage class will first try to get each item from the cache and then set its value during its respective "Get" method (for example: DocumentPage.GetThumbnail, DocumentPage.GetText, etc.).

All the possible cache IDs for a document can be obtained by calling LEADDocument.GetCacheKeys.

Deleting From Cache

When a document is deleted from the cache using DocumentFactory.DeleteFromCache, the cache checks if the system contains DefaultCacheCapabilities.CacheRegions.

Example

The following example loads a document from the cache and displays information on all of its cache items:

C#
private static void ShowDocumentCacheInfo(ObjectCache cache, string documentId) 
{ 
   // From "Under the Hood" here: 
   // https://www.leadtools.com/help/sdk/v21/dh/to/document-toolkit-and-caching.html 
   // We know that the document ID is the cache region ID 
   // We also know that a document may contain more than 1 cache item 
   // We also know that all the possible items that belong to a document can be obtained with LEADDocument.GetCacheKeys 
 
   // First load the document from the cache 
   var loadFromCacheOptions = new LoadFromCacheOptions(); 
   loadFromCacheOptions.Cache = cache; 
   loadFromCacheOptions.DocumentId = documentId; 
   using (var document = DocumentFactory.LoadFromCache(loadFromCacheOptions)) 
   { 
      // This is to demonstrate that it is not necessary to keep the global "cache" object around, GetCache 
      // returns it for the document 
      ObjectCache documentCache = document.GetCache(); 
      Assert.IsNotNull(documentCache); 
 
      // Get all the possible cache keys 
      // These keys may or may not exist in the cache, depending on which part of the document was cached. 
      // For instance, if DocumentCacheOptions.PageImage was set in the document, and the image was cached, then there is an item for it 
      Console.WriteLine($"document {documentId} cache policies:"); 
      ISet<string> cacheKeys = document.GetCacheKeys(); 
      foreach (string cacheKey in cacheKeys) 
      { 
         // Does it exist? 
         if (documentCache.Contains(cacheKey, document.DocumentId)) 
         { 
            // Get the policy 
            CacheItemPolicy itemPolicy = documentCache.GetPolicy(cacheKey, document.DocumentId); 
            // This demo sets an absolute expiration, but this is generic code
            // that can figure it out by examining the values:
            DateTime absoluteExpiration = itemPolicy.AbsoluteExpiration; 
            TimeSpan slidingExpiration = itemPolicy.SlidingExpiration; 
            DateTime localTimeExpiration; 
 
            if (slidingExpiration != TimeSpan.Zero) 
            { 
               // Has sliding expiration, therefore the expiration will be NOW + sliding 
               localTimeExpiration = DateTime.Now.Add(slidingExpiration); 
            } 
            else 
            { 
               // Absolute expiration is stored in UTC, convert to local 
               localTimeExpiration = absoluteExpiration.ToLocalTime(); 
            } 
 
            // Show the expiration 
            Console.WriteLine($" key:{cacheKey} expiry at {localTimeExpiration}"); 
         } 
      } 
   } 
} 

Cache Workflow in LEADTOOLS Document Web Service

Document Web Service Overview

This section will describe how the LEADTOOLS Document Web Service uses the cache. The service ships with full source code, and the process can be modified as needed. The project source code is located at:

The .NET service contains a folder for the .NET Core service (core), the .NET Framework service (fx), and a folder that contains components common to both (src).

Note: Both the core and fx service projects derive components from the src folder. Any changes to these shared components must therefore be made in the src folder.

The JavaScript client demo is located at:

You can ignore the source code dealing with "Pre-Cache". This deals with special code to pre-cache the LEADTOOLS sample documents used in the demo.

File Cache

The default sample implementation uses a LEADTOOLS FileCache object that stores cache items in the file system (local or, as recommended, a remote UNC path). The cache is a persistent system, which means that when the system is restarted, only the cache object is re-created and any non-expired items stored in the cache from previous sessions will still be available.

Hybrid (File and Memory) Caching

As an alternative to the simple file cache, a hybrid Memory/File or Memory/Ehcache cache can be used. To enable it, change the value of lt.Cache.CacheManagerConfigFile in appsettings.json (.NET) or dev.properties (Java) to use config-hybrid-memory-file-cache-manager.xml by commenting and uncommenting the relevant lines.

The hybrid memory/file cache works by setting a threshold for the maximum length of the data to be cached. Any data larger than this value will be placed in the file cache, otherwise it can be placed in a memory cache.
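The threshold rule can be sketched in Java as follows. This is not the hybrid cache manager shipped with the service: the HybridCache class is hypothetical, and both tiers are plain maps here, with the second one merely standing in for a disk-backed file cache.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-tier cache that routes items by data length.
class HybridCache {
    private final int maxMemoryItemLength;
    private final Map<String, byte[]> memoryCache = new HashMap<>();
    // Stand-in for a file cache; a real implementation would write to disk.
    private final Map<String, byte[]> fileCache = new HashMap<>();

    HybridCache(int maxMemoryItemLength) {
        this.maxMemoryItemLength = maxMemoryItemLength;
    }

    void put(String key, byte[] data) {
        // Drop any earlier copy so the key lives in exactly one tier.
        memoryCache.remove(key);
        fileCache.remove(key);
        if (data.length > maxMemoryItemLength) {
            fileCache.put(key, data);   // larger than the threshold: file tier
        } else {
            memoryCache.put(key, data); // otherwise: memory tier
        }
    }

    byte[] get(String key) {
        byte[] data = memoryCache.get(key);
        return data != null ? data : fileCache.get(key);
    }

    boolean isInMemory(String key) {
        return memoryCache.containsKey(key);
    }
}
```

With a threshold of 4, a 4-byte item stays in the memory tier while a 5-byte item goes to the file tier; callers read both through the same get method.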

For more details, check the source code for the hybrid cache manager:

Application_Start

The global cache object is created in the ServiceHelper.CreateCache method and stored in the static _cache variable (accessible through the ServiceHelper.Cache static property).

The cache is configured using the settings in appsettings.json (.NET) or dev.properties (Java). The cache eviction policy that determines how long items are kept in the cache is configured here or in the cache config XML.

The service can be modified by replacing the code inside ServiceHelper.CreateCache to initialize and set up a different cache system and set it in the _cache variable.

FactoryController.LoadFromUri

LoadFromUri is the entry point where the user loads a new document located at a remote URL. The document can be any file format supported by LEADTOOLS, such as PDF, TIFF, DOCX, PNG, XLSX, and many others. It is invoked from the JavaScript Document Viewer client using the "Open URL" menu item.

When the .NET DocumentFactory.LoadFromUri method is called, the cache object and the requested URL are passed in LoadDocumentOptions. LoadFromUri determines whether the data at the URL contains valid image or document data that is supported, parses the data to determine the number of pages and the size of each page, and returns a new LEADDocument object containing the information.

If not already defined by the user, a new unique ID is created by a GUID generator for each LEADDocument object. This ID is stored in the LEADDocument.DocumentId property. If the user wishes to use their own ID, then that value will be used and stored in the property. It is up to the user to guarantee the uniqueness of this ID.

LEADDocument.CacheUri is checked and if it is null, it is set to a value that can be used to obtain the original document data. Refer to the ".NET/Java Document Service and JavaScript" section below for more information.

The following cache properties of the document are then set:

In addition, AutoDisposeDocuments is set to true: This is useful when the document may contain other child documents in the future. Refer to Creating Documents with LEADTOOLS Document Library for more information.

Finally, SaveToCache is called and a new cache item is created from the document ID and the document's data structure and is then saved into the cache.

The JavaScript code will create an instance of a JS LEADDocument object from the saved cache item and set it in the viewer. The viewer has all the information needed to construct the skeleton required to view the document. This includes page holders in the correct size in both the view and thumbnails area, the bookmarks tab if supported, and annotation containers. All of these are created but with empty data since the LEADDocument object does not contain any. The viewer is fully functional and the user can scroll and click items that will trigger calls to other methods in the service to obtain the required data.

The service here can be additionally modified to include:

.NET/Java Document Service and JavaScript

The .NET/Java DocumentFactory.LoadFromUri method does not set the value of LEADDocument.CacheUri and leaves it at the default value of null prior to returning it to JavaScript. The JavaScript DocumentFactory.loadFromUri method will check if the value for LEADDocument.CacheUri is null and will then replace it with the HTTP GET URL required to call the service CacheController.GetDocumentData web method. Refer to the source code in the service for more information.

This is the default implementation of the .NET/Java Documents Service for the following reasons:

Alternatively, if using a cache system that stores the items in a virtual directory (such as the LEADTOOLS FileCache), then FileCache.CacheVirtualDirectory can be set to the full virtual directory path of the cache items and the .NET/Java DocumentFactory.LoadFromUri will set LEADDocument.CacheUri to the path of the document's original data. Finally, the JavaScript DocumentFactory.loadFromUri method will check for this value, and will not modify it since it is not null.

PageController methods

The document is constructed and the first page outline is visible but without content. The system determines whether the document supports SVG viewing; if it does, it requests the page content by calling the PageController.GetSvg service method using the document ID and page number.

The service will first try to load the document from ServiceHelper.Cache using DocumentFactory.LoadFromCache with the document ID. This method will only request the small data structure required to re-create the .NET/Java LEADDocument object, and is very fast. From previous discussion, this ID is the only value needed to reconstruct the .NET/Java object.

The DocumentPage.GetSvg method is called with the specified options and the resulting SVG data is streamed back to the JavaScript code, and the .NET/Java object is disposed.

When the LEADDocument object is constructed from the cache, it will use the same settings for AutoSaveToCache and AutoDeleteFromCache; therefore, the document will not save itself back into the cache upon disposal. PageController.GetSvg is considered a read-only method that does not modify the state of the cache object.

The DocumentViewer will generally only call this method a single time per page, and rely on the browser's own caching if requested again (since this is an HTTP GET operation). The method may be called again from the same session only when the browser cache is exhausted. This is performed automatically by the web browser and is outside the control of LEADTOOLS.

The value of LEADDocument.CacheOptions is set to DocumentCacheOptions.None, meaning that only those parts required to reconstruct the document are saved into the cache. The page image, SVG, and text data are not saved into the cache. This minimizes cache size since in almost all cases, the DocumentViewer will never call PageController.GetSvg for a page more than once and the resulting SVG data (which can be large) is never requested from the server again.

Multi-user systems that share the same document ID between different browsers can change the value of LEADDocument.CacheOptions from the default value of DocumentCacheOptions.None to store page image, SVG, and text data into cache to increase performance. For instance, setting the value to All (includes PageSvg) during FactoryController.LoadFromUri above before SaveToCache will instruct the library to store the page data into the cache upon request.

The workflow for DocumentPage.GetSvg is as follows:

  1. Always check whether the cache contains data for the key "documentID" + "value_of_pageNumber" + "svg". If found, return it. Naturally, the first time this method is called for this page, it will not find any data and will go to step 2.
  2. Extract the SVG data for the page from the original document (PDF, DOCX, etc.) data. This is almost always a more expensive operation than returning the data directly from the cache.
  3. If LEADDocument.CacheOptions of the owner document contains PageSvg, store the SVG data into the cache using the key above.
  4. Return the SVG data.

Therefore, subsequent calls from other user sessions (or browsers) to obtain the SVG data for the same document and same page will find the data in the cache at the first step and will never extract the data from the original document again.

It is possible the process could reset if the data is evicted from the cache manually or through automatic expiration. If the page SVG key is not found, steps 2-4 will be repeated and the data is re-generated when it is requested the next time.
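The four-step workflow, including the effect of eviction, can be sketched in Java as a read-through cache. This is an illustration under stated assumptions rather than the library's implementation: PageSvgWorkflow is a hypothetical class, the cachePageSvg flag stands in for LEADDocument.CacheOptions containing PageSvg, and the supplier stands in for the expensive extraction from the original document data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical read-through workflow for page SVG data.
class PageSvgWorkflow {
    final Map<String, String> cache = new HashMap<>();
    final boolean cachePageSvg;  // stands in for CacheOptions containing PageSvg
    int extractionCount = 0;     // counts how often the expensive step runs

    PageSvgWorkflow(boolean cachePageSvg) {
        this.cachePageSvg = cachePageSvg;
    }

    String getSvg(String cacheKey, Supplier<String> extractFromOriginal) {
        // Step 1: return the cached data if it exists.
        String cached = cache.get(cacheKey);
        if (cached != null) {
            return cached;
        }
        // Step 2: extract the SVG from the original document data.
        String svg = extractFromOriginal.get();
        extractionCount++;
        // Step 3: store the result only if page SVG caching is enabled.
        if (cachePageSvg) {
            cache.put(cacheKey, svg);
        }
        // Step 4: return the SVG data.
        return svg;
    }
}
```

When caching is enabled, a second request for the same key is served from step 1 and the extraction does not run again. Clearing the map, which plays the role of eviction here, sends the next request back through steps 2 to 4. When caching is disabled, every request runs the extraction.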

User-modification to the service can include:

Client-side PDF Rendering

If client-side PDF rendering support is used with the Documents Service, then direct HTTP access to the original image data is required and must be set in the JavaScript Document.cacheUri property. The PDF renderer will use this value to obtain the original data from the remote source and render the PDF pages directly into the viewer surface; DocumentPage.GetSvg and DocumentPage.GetImage are never called.

Other Page and DocumentController methods

The other methods of the Page and Document controllers work in a similar fashion to PageController.GetSvg. The document is loaded from the cache, the data is extracted using the .NET/Java LEADDocument object, and returned to JavaScript.

The following methods re-save the LEADDocument object into the cache since they modify the data:

See Also

Document Library Features

Loading Using LEADTOOLS Document Library

Creating Documents with LEADTOOLS Document Library

Uploading Using the Document Library

Document Library Coordinate System

Loading Encrypted Files Using the Document Library

Parsing Text with the Document Library

Barcode Processing with the Document Library

Document Toolkit History Tracking

Document Page Transformation

Using LEADTOOLS Document Viewer

Using LEADTOOLS Document Converters

Document View and Convert Redaction

Related Topics

MemoryCache Example
RedisObjectCache Example
RedisWithBlobsObjectCache Example
Ehcache Example
Help Version 21.0.2021.1.15
© 1991-2021 LEAD Technologies, Inc. All Rights Reserved.