High-Accuracy, Ultra High-Speed Text Extraction Software
DocCat is a filter program for Solaris, Linux, and FreeBSD that extracts text from Windows document files, such as MS-Word documents, with high accuracy and at very high speed.
DocCat is distributed as an executable. It can be combined with NAMAZU (a full-text search engine) to build a full-text search system on an intranet, or with a mail server so that documents attached to mail messages can be read as text on cellular phones and similar devices.
 |
High Speed
DocCat combines high accuracy with outstanding speed. In a full-text search system, fast text extraction is essential for reducing indexing time. In "DocCat V4.0," a faster extraction process, among other improvements, has more than doubled the speed of the previous version, which speeds up the system as a whole. For full-text search systems that handle large amounts of document data, this can be expected to significantly reduce the time required to generate index data and to allow the data to be updated more frequently.
 |
Accuracy/Stability
The search accuracy of a full-text search system depends above all on how accurately text is extracted from the document files being searched. If the extraction routine misses expressions that should produce a hit, the whole benefit of introducing a full-text search system is lost. Dehenken's "DocCat" and "TF Library" do not rely on the file suffix (.doc etc.). Instead, they determine the document type from the file contents, which allows highly accurate text extraction.
| |
 |
 |
Screen Example (% doccat ms-word.doc > textfile.euc)
 |
 |
Easy To Use
"DocCat" is used in much the same way as the UNIX "cat" command. It is run from the command line and is simple to operate.
 |
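The cat-like usage described under "Easy To Use" can be driven from a script. The following Python sketch runs the filter on one file and captures whatever it writes to standard output. It assumes that a `doccat` executable is on the PATH and that it takes the file name as its only argument, as in the screen example. The helper name `extract_text` and its `command` parameter are hypothetical and exist only for this illustration.

```python
import subprocess
from pathlib import Path


def extract_text(path, command=("doccat",)):
    """Run a cat-style filter on one file and return its raw output bytes.

    Assumption: the filter takes the file name as its only argument and
    writes the extracted text to standard output.
    """
    result = subprocess.run(
        [*command, str(Path(path))],
        check=True,              # raise CalledProcessError on a non-zero exit
        stdout=subprocess.PIPE,  # capture the output instead of printing it
    )
    return result.stdout
```

Calling `extract_text("ms-word.doc")` and writing the returned bytes to `textfile.euc` mirrors the screen example. The bytes are deliberately not decoded, because this sketch makes no assumption about the output encoding.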
Character Set based on Unicode
DocCat uses a Unicode-based character set that conforms to Microsoft Windows, so it is highly compatible with the character codes used in MS-Office and IchiTaro text, and its output can be used with other applications.
 |
Output of Property Information
For files from MS-Office 97 or later and for PDF files, the property information stored in the file can also be output by setting the appropriate option.
 |
Normalizes Single-byte Katakana Variations
Single-byte (Hankaku) katakana can be converted automatically to 2-byte (Zenkaku) katakana, so that both spellings of a word are treated as the same text.
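To show what this kind of conversion involves, here is a minimal Python sketch based on Unicode NFKC normalization. It is not DocCat's implementation, which is not described here. The function name `to_fullwidth_katakana` is hypothetical, and restricting the conversion to the half-width katakana range is a design choice of this sketch.

```python
import re
import unicodedata

# Half-width katakana letters and sound marks occupy U+FF66..U+FF9F.
_HALFWIDTH_KATAKANA = re.compile("[\uFF66-\uFF9F]+")


def to_fullwidth_katakana(text):
    """Convert runs of half-width katakana to full-width katakana.

    NFKC normalization maps each half-width letter to its full-width form
    and merges a following voiced or semi-voiced sound mark into the letter.
    Applying it only to the matched runs leaves all other characters as-is.
    """
    return _HALFWIDTH_KATAKANA.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )
```

For example, `to_fullwidth_katakana("ﾃﾞｰﾀ")` returns `"データ"`: the separate voiced sound mark is merged into the preceding letter. With this normalization a search index sees one spelling regardless of how the source document was typed.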
 |
Automatic Detection of File Type based on File Content rather than File Extension
Because the file type is determined from the file content rather than the file name, file recognition is also suitable for complex client-server models.
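Content-based detection is commonly done by inspecting the leading bytes of a file. The Python sketch below illustrates the idea with a few well-known signatures. DocCat's actual detection logic is not documented here, so the signature list and the names `sniff_type` and `sniff_file` are assumptions made for illustration.

```python
# Leading byte signatures of a few document container formats.
_SIGNATURES = [
    (b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", "ole2"),  # OLE2 compound file (binary MS-Office documents)
    (b"PK\x03\x04", "zip"),                          # ZIP container (used by Office 2007 files)
    (b"%PDF-", "pdf"),                               # PDF header
    (b"{\\rtf", "rtf"),                              # RTF header
]


def sniff_type(data):
    """Return a format label based on the first bytes of the content."""
    for magic, name in _SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def sniff_file(path):
    """Read only the start of a file and classify it, ignoring its name."""
    with open(path, "rb") as f:
        return sniff_type(f.read(16))
```

A file renamed from `report.pdf` to `report.doc` is still classified as `pdf`, because the extension is never consulted.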
 |
HTML/XML Output
The output format can be set to HTML or XML. |
 |
Supported Documents
Microsoft Office (Windows)
Word 95 / 97 / 98 / 2000 / 2002(XP) / 2003 / 2007
Excel 95 / 97 / 2000 / 2002(XP) / 2003 / 2007
PowerPoint 95 / 97 / 2000 / 2002(XP) / 2003 / 2007
Microsoft Office (Mac)
Word 98 / 2001 for Mac
Excel 98 / 2001 for Mac
PowerPoint 98 / 2001 for Mac
IchiTaro V5 - V13 / IchiTaro 2004 / IchiTaro 2005 / IchiTaro 2006 / IchiTaro 2007
OASYS V6 / V7 / V8 / 2002
Lotus Word Pro 2001
Text Documents in JIS / EUC / SJIS / UCS-2 / UTF-8 / UTF-16 / RTF
HTML / XML / SGML
PDF Files (Support for PDF requires the separate program "DocCat PDF Option")
 |
Limitations with PDF Option
Acrobat: 4.0 / 5.0 / 6.0 / 7.0 / 8.0
PDF: 1.2 / 1.3 / 1.4 / 1.5 / 1.6 / 1.7 (PDF 1.1 is not supported)
Encoding: Text extraction is possible with PDF 1.3 encoded PDF files (1.2 / 1.4 are not supported)
Embedded Fonts: When an embedded font is created, character codes are allocated internally within the PDF in sequence from 1. When text is extracted from such a file with PDF Option, those numbers are output in sequence from 1. (The character codes are then meaningless data, and in such cases the extraction result cannot be controlled.)
Note 1: With some embedded fonts, text cannot be extracted from the PDF. The font types for which extraction is not possible are as follows:
(1) Text extraction is not possible if the subtype of the embedded font is set to Type0, the identity is set to CFF (Compact Font Format) and the CID Font Operator is set to Adobe-Identity.
(2) Text extraction is not possible if the subtype of the embedded font is set to Type0, the identity is set to TrueType and the Cmap Encoding table is not available for reference.
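The embedded-font limitation above can be illustrated with a deliberately simplified model. In the Python sketch below, a subset font assigns codes 1, 2, 3, ... to characters in order of first appearance. An extractor that sees only the codes and has no table mapping them back to characters can output nothing but the numbers. Real PDF fonts are more involved than this, and the function names are hypothetical.

```python
def allocate_codes(text):
    """Assign codes 1, 2, 3, ... to characters in order of first appearance."""
    code_of = {}
    for ch in text:
        if ch not in code_of:
            code_of[ch] = len(code_of) + 1
    return code_of


def encode(text, code_of):
    """Encode the text as the sequence of internal codes stored in the file."""
    return [code_of[ch] for ch in text]


def decode(codes, code_of):
    """Recover the text, which is possible only when the mapping is available."""
    char_of = {code: ch for ch, code in code_of.items()}
    return "".join(char_of[c] for c in codes)
```

With the text `"banana"`, `encode` yields `[1, 2, 3, 2, 3, 2]`. These numbers carry no meaning on their own, and the original string can be recovered with `decode` only if the mapping built by `allocate_codes` is stored alongside the codes.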
 |
 |
Required Memory/Free Disk Space
Memory: 256 MB or more (recommended)
Disk Space: 1 MB or more (software only)
 |
Supported OS
• Linux Series
Miracle Linux V2.1, V3.0
Red Hat Enterprise Linux (ES/AS/WS) Ver.2.1, Ver.3.0
SuSE Linux Enterprise Server 8
SUSE Linux Enterprise Server 9
• BSD Series
FreeBSD
• Solaris
Solaris 2.6 or later