 |
High-Accuracy, Ultra High-Speed Text Extraction Software
DocCat is a filter program for Solaris/Linux/FreeBSD that extracts text from Windows document files, such as MS-Word files, with high accuracy and at ultra-high speed.
The program is text extraction software supplied in executable form. It can be combined with NAMAZU (a full-text search engine) to build a full-text search system on an intranet, or with a mail server to extract the text of documents attached to mail messages received by cellular phones, etc. |
| [ Advantages/Features/Operating Environment ] [ Limitations ] [ Main Methods of Usage/Evaluation of Speed ] [ Price List/Yearly Maintenance Fee ] [ Purchase ] |
 |
| 1. MS-Word |
| Category | Specification (Dehenken) |
| Tables | Tables are broken down and the contents of individual cells are extracted as text. |
| Attached Files | Conversion is possible. |
| Protected Documents | "Protected" documents can be converted. |
| Limitation (1) | WordArt text cannot be converted. |
| Limitation (2) | Tables are output with a new line after each cell. |
| Limitation (3) | Password-protected files cannot be converted. |
|
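The table behaviour listed for MS-Word (cells extracted individually, with a new line after each cell) can be pictured with a short Python sketch. This is not DocCat code: the `flatten_table` helper and the sample rows are hypothetical and only mimic the documented shape of the output.

```python
def flatten_table(rows):
    """Return table text with every cell on its own line, in row order.

    Hypothetical helper: it only models the documented output shape
    (a new line after each cell); it is not DocCat's implementation.
    """
    return "".join(f"{cell}\n" for row in rows for cell in row)


if __name__ == "__main__":
    # Assumed sample data: a 2x2 table.
    sample = [["Name", "Qty"], ["Widget", "3"]]
    print(flatten_table(sample), end="")
```

Because row and column boundaries are lost in this kind of output, a downstream search index sees the cell text but not the table layout.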
| 2. PowerPoint |
| Category | Specification (Dehenken) |
| Extraction Overview | Text can be extracted from slides and notes. |
| Slide Numbers | Slide numbers are not extracted. |
| Tag Output | Tags are not output. |
| Limitations | Attached Word/Excel files cannot be extracted. |
|
| 3. Excel |
| Category | Specification (Dehenken) |
| Extraction Overview | Only strings and numbers are converted. |
| Attached Files | Conversion is possible. |
| CSV Format Output | Data is output to the text file in CSV format. |
| Book Protected | Files with Book Protected set cannot be converted. |
| Sheet Protected | Files with Sheet Protected set can be converted. |
| Worksheet | No newline is output as a separator. |
| Limitation (1) | Formulae and calculation information cannot be converted. |
| Limitation (2) | Password-protected files cannot be converted. |
| Limitation (3) | Specified page numbers, number of pages, dates, times, file names, and sheet names are not extracted. |
|
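As a rough picture of what CSV-format output of strings and numbers looks like, the following Python sketch writes a few cell values with the standard `csv` module. The `cells_to_csv` function and the sample values are hypothetical, and the quoting rules shown are those of Python's `csv` module, which may differ from what DocCat actually emits.

```python
import csv
import io


def cells_to_csv(rows):
    """Serialize rows of string/number cell values as CSV text.

    Hypothetical illustration only: quoting follows Python's csv module
    (fields containing commas are quoted), not necessarily DocCat.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Assumed sample sheet: one header row and one data row.
    print(cells_to_csv([["Item", "Price"], ["Pen, blue", 1.5]]), end="")
```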
| 4. PDF |
| Category | Specification (Dehenken) |
| Extraction Overview | Unencoded character information is converted to text. (PDF 1.3 encoded files can also be converted to text.) |
| Camp Characters | Cannot be extracted in some cases. |
| Symbol Characters | Characters are corrupted in some cases. |
| Limitation (1) | Characters that cannot be copied using the "Text Selection Tool" cannot be extracted. |
| Limitation (2) | LZW-compressed text cannot be extracted. |
| Limitation (3) | When a PDF file is created from data containing embedded fonts, character codes are allocated internally within the PDF in sequence from 1. When text is extracted from such a file, numbers are output in sequence from 1, and the extracted character codes cannot be controlled. |
|
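The embedded-font limitation is easier to see with a toy model. The Python sketch below assumes a much simplified picture of an embedded font in which each distinct character receives the next free code, starting at 1, in order of first use. The function names are hypothetical, and real PDF font encoding is more involved than this.

```python
def assign_subset_codes(text):
    """Give each distinct character a code 1, 2, 3, ... in order of first use.

    Simplified, assumed model of the internal allocation described above.
    """
    codes = {}
    for ch in text:
        if ch not in codes:
            codes[ch] = len(codes) + 1
    return codes


def encode(text, codes):
    """Replace each character with its internally allocated code."""
    return [codes[ch] for ch in text]


if __name__ == "__main__":
    codes = assign_subset_codes("banana")
    print(encode("banana", codes))
```

In this model the stored codes are unrelated to any standard character set, so an extractor that has no table mapping them back to characters can only report the numbers themselves.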
| 5. HTML |
| Category | Specification (Dehenken) |
| Extraction Overview | Strings other than tags and attributes are extracted. |
|
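Extracting "strings other than tags and attributes" from HTML can be sketched with Python's standard `html.parser` module. This is only an illustration of the idea, not DocCat's method; the `TextOnly` class, the `extract_text` function, and the choice to join text runs with single spaces are assumptions.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and ignore tags and their attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; skip whitespace-only runs.
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(extract_text('<p class="x">Hello <a href="u">world</a></p>'))
```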
| 6. XML |
| Category | Specification (Dehenken) |
| Extraction Overview | Strings other than tags and attributes are extracted. |
|
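The same idea for XML can be sketched with Python's standard `xml.etree.ElementTree` module, whose `itertext()` method yields element text and skips tag names and attribute values. The `extract_xml_text` function and the sample document are hypothetical and do not describe how DocCat parses XML.

```python
import xml.etree.ElementTree as ET


def extract_xml_text(xml):
    """Return the text content of an XML string, joined by single spaces.

    Tag names and attribute values are not included in the result.
    """
    root = ET.fromstring(xml)
    return " ".join(t.strip() for t in root.itertext() if t.strip())


if __name__ == "__main__":
    doc = '<doc id="1"><title lang="en">Report</title><body>Total: 42</body></doc>'
    print(extract_xml_text(doc))
```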
 |