 |
High-Accuracy, Ultra High-Speed Text Extraction Software
DocCat is a filter program for Solaris, Linux and FreeBSD that extracts text from Windows document files, such as MS-Word documents, with high accuracy and at ultra-high speed.
DocCat is text extraction software supplied in executable form. It can be combined with Namazu (a full-text search engine) to build a full-text search system on an intranet, or with a mail server so that documents attached to mail messages can be read on cellular phones and similar devices. |
| [ Advantages/Features/Operating Environment ] [ Limitations ] [ Main Methods of Usage/Evaluation of Speed ] [ Price List/Yearly Maintenance Fee ] [ Purchase ] |
 |
| 1. Full-Text Search System using Namazu |
| In an age when large amounts of information are stored as digital files, full-text search systems are more important than ever. Namazu was developed as free software, is easy to install and features high-speed search capabilities. A full-text search system usually pre-processes its files in three steps: all files are scanned and their text is extracted, sentences are segmented into words (a process called morphological analysis), and indexes are created that link each word to the files in which it can be found. By combining Namazu and DocCat, Word, Excel, PowerPoint, IchiTaro, OASYS, Lotus Word Pro and PDF files can be included in the scope of the search in addition to HTML and text files. |
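The three pre-processing steps described above can be sketched in a few lines of Python. This is only an illustration of the general idea of an inverted index, not Namazu's or DocCat's actual implementation: the function names are hypothetical, the documents are assumed to have been converted to plain text already, and a simple lowercase word split stands in for real morphological analysis.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Naive stand-in for morphological analysis: lowercase word split."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each word to the sorted list of file names that contain it.

    `documents` maps a file name to its already-extracted plain text.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return {word: sorted(names) for word, names in index.items()}


def search(index, query):
    """Return the files that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)


# Hypothetical extracted text; in practice it would come from the filter.
docs = {
    "report.doc": "Quarterly sales report for the Tokyo office",
    "budget.xls": "Sales budget and forecast",
    "notes.txt": "Meeting notes from the Tokyo office",
}
index = build_index(docs)
print(search(index, "tokyo office"))  # ['notes.txt', 'report.doc']
print(search(index, "sales"))         # ['budget.xls', 'report.doc']
```

Because the index is built once in advance, a query only has to intersect a few word lists instead of rescanning every file, which is why the extraction speed of the filter mainly affects indexing time rather than search time.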
 |
| 2. Reading Attached Files in a Mail System |
As a Cell Phone Server
The number of people exchanging mail on hand-held devices and cell phones has increased greatly in recent years. When a file such as an MS-Word document is attached to a mail message and the message is processed on a server with access to the Dehenken TF Library, the attached file can be read on the device, albeit within the simple display capabilities available in hand-held devices. |
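As a rough sketch of the server-side flow, the following Python example walks the attachments of a mail message with the standard `email` package and passes each one to an extraction step. The interface of the Dehenken TF Library is not described here, so `extract_text` is a hypothetical placeholder that only decodes plain-text attachments; `attachments_as_text` and the `limit` parameter are likewise assumptions made for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_text(filename, data):
    """Hypothetical placeholder for the text-extraction step.

    Only .txt attachments are decoded here; a real server would hand
    other formats to its extraction library instead.
    """
    if filename.lower().endswith(".txt"):
        return data.decode("utf-8", errors="replace")
    return f"[{filename}: {len(data)} bytes, extraction not implemented]"


def attachments_as_text(raw_message, limit=200):
    """Return (filename, shortened text) pairs for each attachment."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    results = []
    for part in msg.iter_attachments():
        name = part.get_filename() or "unnamed"
        data = part.get_payload(decode=True) or b""
        # Truncate so the text suits a small hand-held display.
        results.append((name, extract_text(name, data)[:limit]))
    return results


# Build a small sample message with one attachment.
sample = EmailMessage()
sample["Subject"] = "Minutes"
sample.set_content("See the attached file.")
sample.add_attachment(b"Agenda: budget review", maintype="application",
                      subtype="octet-stream", filename="agenda.txt")
print(attachments_as_text(sample.as_bytes()))
# [('agenda.txt', 'Agenda: budget review')]
```

Truncating the extracted text on the server keeps the reply small, which suits the limited display and bandwidth of hand-held devices.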
|
As a Client for Portable Devices
On portable devices with a comparatively large memory, the Dehenken TF Library can be installed on the client side to read attached files directly. |
 |
 |
Results of Speed Evaluation with DocCat Ver4.0
(1) Software Evaluated
DocCat Ver3/PDF Option for Linux
DocCat Ver4/PDF Option for Linux
(2) Test Environment
OS: Red Hat Linux 7.2
CPU: Pentium 4, 1.6 GHz
Memory: 256 MB |
| [Test Data] |
| Category | doc | xls | ppt | pdf | txt | Total |
| No. of Files | 205 | 372 | 49 | 2573 | 2114 | 5313 |
| Total Size of Input Files (MB) | 17.3 | 39.5 | 19.9 | 859.6 | 125.0 | 1061.3 |
| Average Size of Input Files (KB) | 86.4 | 108.8 | 415.4 | 342.1 | 60.6 | |
| [DocCat Ver. 3] (Unit: Seconds) |
| Category | doc | xls | ppt | pdf | txt | Total |
| Total Processing Time (s) | 20.8 | 37.6 | 5.3 | 686.3 | 265.1 | 1015.2 |
| Processing Time per File (s) | 0.10 | 0.10 | 0.11 | 0.27 | 0.13 | 0.70 |
| [DocCat Ver. 4] (Unit: Seconds) |
| Category | doc | xls | ppt | pdf | txt | Total |
| Total Processing Time (s) | 6.0 | 5.3 | 0.7 | 351.5 | 73.9 | 437.34 |
| Processing Time per File (s) | 0.03 | 0.01 | 0.01 | 0.14 | 0.03 | 0.23 |
Overview of Test Results
The speed evaluation ran "DocCat Ver3" and "DocCat Ver4" over the test data above: 5313 files with a total size of approx. 1 GB. Extraction took approx. 17 minutes (1015.2 s) with "DocCat Ver3" and approx. 7 minutes (437.34 s) with "DocCat Ver4". The extraction speed of Ver4 is therefore more than double that of Ver3. |
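The comparison can be checked with a short calculation. The sketch below uses only the totals quoted in the test results (1061.3 MB of input, 1015.2 s for Ver3 and 437.34 s for Ver4); the function name `compare` and the throughput figures it derives are illustrative additions, not part of the original evaluation.

```python
def compare(total_mb, old_seconds, new_seconds):
    """Return the speedup factor and the throughput of both versions."""
    return {
        "speedup": old_seconds / new_seconds,
        "old_mb_per_s": total_mb / old_seconds,
        "new_mb_per_s": total_mb / new_seconds,
    }


# Totals taken from the test data and the overview above.
result = compare(total_mb=1061.3, old_seconds=1015.2, new_seconds=437.34)

print(f"Speedup Ver3 -> Ver4: {result['speedup']:.2f}x")
print(f"Ver3 throughput: {result['old_mb_per_s']:.2f} MB/s")
print(f"Ver4 throughput: {result['new_mb_per_s']:.2f} MB/s")
```

The resulting factor is a little above 2.3, consistent with the statement that Ver4 is more than twice as fast.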
 |