LEADTOOLS Support
Document
Document SDK Questions
How to do a content-based scan of a Word document?

#1 Posted: Friday, June 10, 2016 10:14:37 AM (UTC)
Groups: Registered, Tech Support
Posts: 6

I want to build an application that can find duplicate Word files regardless of file name. By duplicate I mean that the files have exactly the same content, word for word. Does anyone have any idea how this can be done?

#2 Posted: Tuesday, July 5, 2016 6:11:31 AM (UTC)
Groups: Registered
Posts: 119
Was thanked: 4 time(s) in 4 post(s)

You can try doing this by reading the text from your DOC file and saving it in a string[] (array of strings) so that each element in the array contains one word. After this, open the same DOC file for reading, read the text word by word, and search for each word in the string[] array. If you find the same word in two different places in the array, delete the later element and keep the first one, then move on to the next word in the file. I think the functions required for this are explained in the following thread (as Priyaranjan described):
https://social.msdn.microsoft.com/Forums/en-US/ab821d14-bfbc-4c08-b44b-7a5d293ecb2c/compare-word-documents-c?forum=isvvba
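As a rough sketch of how the word-array idea could be applied to the original question, the Python snippet below compares two documents word for word. Note that this compares the word lists of two files rather than removing repeated words inside one file, so it is a variation on the approach above, not the same procedure. It assumes the text has already been read out of each DOC file into a plain string; `to_words` and `same_words` are hypothetical helper names, not part of any library.

```python
def to_words(text: str) -> list[str]:
    """Split text on any whitespace so spacing and line breaks are ignored."""
    return text.split()


def same_words(text_a: str, text_b: str) -> bool:
    """Return True when both texts contain the same words in the same order."""
    return to_words(text_a) == to_words(text_b)
```

Because the comparison is on the word lists, two documents that differ only in spacing or line breaks would be treated as duplicates, while any changed, added, or reordered word makes them different.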
If you prefer not to use the above approach, you could try converting the DOC file to text format using the document converter class from LEADTOOLS. This class can convert DOC files to text even if they contain images. Dealing with text files should be much easier than dealing with DOC files.
After this, you can use the same approach as above, reading the text from the text files (*.txt) instead of from the Word documents. You can find more details about converting your DOC files here:
https://www.leadtools.co...entconverters_using.html
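Once the documents have been converted to text files, finding duplicates could look something like the Python sketch below. It is only an illustration under stated assumptions: the conversion step itself is not shown, the converted files are assumed to be UTF-8 `*.txt` files in a single folder, and `content_key` and `find_duplicates` are hypothetical helper names. Each file's words are hashed, and files that share a hash are reported as one duplicate group.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def content_key(text: str) -> str:
    """Hash the whitespace-normalized words so equal content gives equal keys."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(folder: str) -> list[list[str]]:
    """Group the *.txt files in folder whose word content is identical."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: the converted files are UTF-8; undecodable bytes are replaced.
        text = path.read_text(encoding="utf-8", errors="replace")
        groups[content_key(text)].append(path.name)
    # Keep only groups with more than one file: those are the duplicates.
    return [names for names in groups.values() if len(names) > 1]
```

Calling `find_duplicates("converted_txt")` on a hypothetical folder of converted files would return a list of groups, each group holding the names of files with the same word-for-word content, regardless of what the files are called.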
Nick Villalobos
Developer Support Engineer
LEAD Technologies, Inc.