DocCat / DocCat PDF Option
High-Accuracy, Ultra High-Speed Text Extraction Software
DocCat is a filter program for Solaris/Linux/FreeBSD that extracts text from Windows document files, such as MS-Word files, with high accuracy and at ultra-high speed.
The program is supplied in executable form. It can be combined with Namazu (a full-text search engine) to build a full-text search system on an intranet, or with a mail server so that documents attached to mail messages can be read on cellular phones and similar devices.
Main Methods of Usage
1. Full-Text Search System using Namazu
In an age when large amounts of information are stored as digital files, full-text search systems are more important than ever. Namazu was developed as free software, is easy to install and features high-speed search. A full-text search system usually pre-processes its data in three steps: all files are scanned and their text is extracted, sentences are segmented into words (a process called morphological analysis), and an index is created that links each word to the files it appears in. By combining Namazu with DocCat, Word, Excel, PowerPoint, IchiTaro, OASYS, Lotus Word Pro and PDF files can be included in the scope of the search in addition to HTML and text files.
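The pre-processing steps described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not how Namazu or DocCat are implemented: the `extracted` dictionary is hypothetical sample data standing in for the output of the text-extraction step, and the `tokenize` function is a simple word split standing in for real morphological analysis.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Stand-in for morphological analysis: split into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(extracted):
    """Map each word to the set of files whose extracted text contains it."""
    index = defaultdict(set)
    for filename, text in extracted.items():
        for word in tokenize(text):
            index[word].add(filename)
    return index


def search(index, query):
    """Return the files that contain every word of the query, sorted by name."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical sample data: text already extracted from three files.
extracted = {
    "report.doc": "Quarterly sales report for the Kyoto office",
    "budget.xls": "Sales budget and forecast",
    "notes.txt": "Meeting notes from the Kyoto office",
}

index = build_index(extracted)
print(search(index, "sales"))         # ['budget.xls', 'report.doc']
print(search(index, "kyoto office"))  # ['notes.txt', 'report.doc']
```

Because the index is built once in advance, a query only has to look up each word and intersect the resulting file sets, which is why the extraction speed of the pre-processing step dominates the cost of keeping such a system up to date.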
2. Read Out of Attached Files in a Mail System
As a Cell Phone Server
The number of people exchanging mail on hand-held devices and cell phones has increased greatly in recent years. When a file such as an MS-Word document is attached to a mail message, a server with access to the Dehenken TF Library can process the attachment so that its text can be read on the device, albeit within the simple display capabilities of hand-held devices.
     
[Figure: display examples for PPT, Excel, Word and PDF attachments]
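A server-side flow of this kind might look like the following minimal sketch, which uses only Python's standard `email` package. The API of the Dehenken TF Library is not described here, so the extractor is passed in as a hypothetical callable (`extract_text`), and `fake_extract` and the `max_chars` limit are placeholders invented for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def attachment_texts(raw_message, extract_text, max_chars=200):
    """Return (filename, text) pairs for each attachment of a raw mail message.

    extract_text is a hypothetical callable(filename, data) -> str that stands
    in for the real extraction library. The text is cut to max_chars to suit
    a small display.
    """
    msg = message_from_bytes(raw_message, policy=policy.default)
    results = []
    for part in msg.iter_attachments():
        data = part.get_payload(decode=True)  # decoded attachment bytes
        text = extract_text(part.get_filename(), data)
        results.append((part.get_filename(), text[:max_chars]))
    return results


def fake_extract(filename, data):
    # Placeholder: a real extractor would parse the Word/Excel/PDF bytes.
    return f"{len(data)} bytes from {filename}"


# Build a demo message with one attachment (8 bytes of dummy data).
demo = EmailMessage()
demo["Subject"] = "Report"
demo.set_content("See attached.")
demo.add_attachment(b"\xd0\xcf\x11\xe0demo", maintype="application",
                    subtype="msword", filename="report.doc")

for name, text in attachment_texts(demo.as_bytes(), fake_extract):
    print(name, "->", text)  # report.doc -> 8 bytes from report.doc
```

Passing the extractor in as a parameter keeps the mail-handling code independent of whichever extraction library is actually installed on the server.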
As a Client for Portable Devices
On portable devices with comparatively large memory, the Dehenken TF Library can be installed on the client side to read attached files.
Evaluation of Speed
Results of Speed Evaluation with DocCat Ver4.0
(1) Software Evaluated
DocCat Ver3/PDF Option for Linux
DocCat Ver4/PDF Option for Linux
(2) Test Environment
OS: Red Hat Linux 7.2
CPU: Pentium 4 1.6 GHz
Memory: 256MB
[Test Data]
Category                             doc     xls     ppt     pdf     txt   Total
No. of Files                         205     372      49    2573    2114    5313
Total Size of Input Files (MB)      17.3    39.5    19.9   859.6   125.0  1061.3
Average Size of Input Files (KB)    86.4   108.8   415.4   342.1    60.6       -
[DocCat Ver. 3] (Unit: Seconds)
Category                             doc     xls     ppt     pdf     txt   Total
Total Processing Time (s)           20.8    37.6     5.3   686.3   265.1  1015.2
Processing Time per File (s)        0.10    0.10    0.11    0.27    0.13    0.70
[DocCat Ver. 4] (Unit: Seconds)
Category                             doc     xls     ppt     pdf     txt   Total
Total Processing Time (s)            6.0     5.3     0.7   351.5    73.9  437.34
Processing Time per File (s)        0.03    0.01    0.01    0.14    0.03    0.23
Overview of Test Results
The test data above show the results of a speed evaluation of DocCat Ver3 and Ver4 on 5,313 files totaling about 1 GB. Extraction took approx. 17 minutes (1015.2 s) with DocCat Ver3 and approx. 7 minutes (437.34 s) with DocCat Ver4, so Ver4 extracts text more than twice as fast as Ver3.
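The arithmetic behind this comparison can be checked with a short calculation. The sketch below uses only the totals quoted above (5,313 files, 1061.3 MB, 1015.2 s and 437.34 s); the throughput and overall per-file figures are derived values that do not appear in the tables. Note that the overall per-file average computed here is total time divided by total files, whereas the Total column of the per-file rows in the tables appears to be the sum of the per-category averages.

```python
files = 5313          # total number of test files
total_mb = 1061.3     # total input size in MB
v3_seconds = 1015.2   # total extraction time, DocCat Ver3
v4_seconds = 437.34   # total extraction time, DocCat Ver4

speedup = v3_seconds / v4_seconds
v3_rate = total_mb / v3_seconds   # MB per second
v4_rate = total_mb / v4_seconds

print(f"Speedup Ver3 -> Ver4: {speedup:.2f}x")
print(f"Ver3: {v3_seconds / 60:.1f} min, {v3_rate:.2f} MB/s, "
      f"{v3_seconds / files:.3f} s/file")
print(f"Ver4: {v4_seconds / 60:.1f} min, {v4_rate:.2f} MB/s, "
      f"{v4_seconds / files:.3f} s/file")
```

The ratio comes out at roughly 2.3, which is consistent with the statement that Ver4 is more than twice as fast.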

Dehenken
Yamachu-Bldg 1F, Uradeyama-cho 308, Nakagyo-ku, Kyoto, 604-8155 Japan
Phone: +81-75-254-8780 Fax: +81-75-254-8790
All company and product names are trademarks of the relevant companies. Copyright©2008, Dehenken Limited All Rights Reserved.