The future is FREE - the rise of free text retrieval

The traditional mechanism for retrieving scanned documentation has been keyword search, typically via a standard database search engine. Although this remains a highly efficient method for certain groups of documentation (e.g. invoices, transmittal notes, delivery tickets, purchase orders), a growing number of applications could be better served by content-sensitive searching methods.


Free-Text Retrieval (FTR) systems have been available for a number of years from manufacturers such as ZyLab, Excalibur and Adobe. Increasingly, most of the larger suppliers of full-blown document management systems (e.g. Documentum, Filenet, Pafec) have recognised the growing importance of FTR technologies by integrating these capabilities into their products. Collectively, these products have a proven track record in many of the largest companies and are used extensively where knowledge needs to be shared and re-used throughout organisations. The nature of these products allows end users to organise unstructured data collections into a more structured form at relatively low cost and to publish them via a LAN or WAN, the World Wide Web, an intranet or CD / DVD.

The strength of the FTR approach is that it permits, in essence, a full keyword search without the need for user input at the capture stage (i.e. minimal typing is required at the initial stages of the scanning process). Once indexed and made available across the network, the user can search for any word or phrase that may be contained in a document or group of documents. The specific section of the identified document(s) can then be easily viewed and assessed by the user.
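The idea of searching for any word or phrase without manual keying can be illustrated with a small inverted index built from OCR text. This is a minimal sketch only, not the method used by any of the products named above: the document ids, sample texts and function names are all hypothetical, and a real FTR engine would add ranking, stemming and on-disk storage.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the documents and word positions where it occurs.

    `documents` is a dict of {document id: OCR text}; both are hypothetical.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word][doc_id].append(position)
    return index

def search_word(index, word):
    """Return the sorted ids of documents containing the word."""
    return sorted(index.get(word.lower(), {}))

def search_phrase(index, phrase):
    """Return the sorted ids of documents containing the phrase's words consecutively."""
    words = tokenize(phrase)
    if not words:
        return []
    hits = []
    for doc_id, positions in index.get(words[0], {}).items():
        for start in positions:
            # Every following word must appear at the next position in the same document.
            if all(start + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words[1:], start=1)):
                hits.append(doc_id)
                break
    return sorted(hits)

# Hypothetical OCR output for three scanned documents.
documents = {
    "invoice-001": "Delivery of one flatbed scanner to head office",
    "memo-017": "The scanner feed problems were caused by staples",
    "report-042": "Annual report on document management",
}
index = build_index(documents)
print(search_word(index, "scanner"))          # ['invoice-001', 'memo-017']
print(search_phrase(index, "feed problems"))  # ['memo-017']
```

Because the index records word positions as well as document ids, the same structure can support phrase and proximity searches and can point the viewer at the matching section of a document.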

It is the capacity of FTR systems to permit users to “trawl” for information that sets them apart from the traditional keyword-based systems. An additional benefit is that many of the products can also identify (and automatically index) other electronic file formats (e.g. word processor files, spreadsheets and now even e-mail mailboxes). With this combination of tools, there are now fewer reasons for documents to be “lost” on our ever larger hard disks (local or across the network).

Additional features of FTR systems can also include:
Robust text retrieval search engines. From simple word searches to Boolean, fuzzy, proximity and progressive searches, a wide range of documents can easily be located. The particular strength of fuzzy search technologies is that they can compensate for poor Optical Character Recognition (OCR) and spelling errors by allowing for variations in search terms. An example of their use could be a search for the word “scanner” where the OCR process has incorrectly recognised the word as “scamer” or “scarner”; both are perfectly reasonable errors when OCRing poor quality or old documents.
Hit highlighting. Although the search itself uses the product of the OCR process (i.e. raw text), the user is presented with the “electronic photocopy” of the original document. Hit highlighting simply superimposes the position of each text match on the original image. The use of the original image also allows for the direct viewing of complex graphics and hand-written / scribbled notes.
Automated alerting options that can monitor a document archive and e-mail users information about new documents that match their pre-configured search profiles.
Access to web-server technologies allowing for transparent access over the Internet or an intranet to the scanned / indexed information via any Internet browser. The HTML code is generated automatically from one or more indexed sources and TIFF images may be converted into lower quality images to reduce data transfer times. The HTML templates and CGI scripts generated are fully user-definable, allowing for a corporate identity to be maintained if required.
Integrated e-mail and printing functionality.
Generation of user configured bookmarks and content lists.
Standard and open file formats such as PDF or TIFF Group 4 image files, and standard ASCII text.
Options to use manual entry key fields (free text, list & date fields), automatic key fields and barcode recognition. The use of special patch pages can allow for documents to be automatically separated prior to indexing.
Support for a wide range of scanners and digital copiers.
Publishing capabilities to CD-R or DVD-RAM.
Encryption options for data, image and text files.
Multi-lingual support.
Combination of workstation and concurrent (floating) licences allowing for the most cost-efficient use of the system.
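The fuzzy search example given in the list above (“scanner” misread as “scamer” or “scarner”) can be sketched with the similarity ratio from Python's standard difflib module. This is an illustration under stated assumptions rather than the algorithm used by any particular FTR product: the function name, the word list and the 0.75 threshold are all hypothetical choices.

```python
from difflib import SequenceMatcher

def fuzzy_matches(query, words, threshold=0.75):
    """Return the words whose similarity to the query is at least `threshold`.

    The threshold is an illustrative value, not one taken from any product.
    """
    query = query.lower()
    return [w for w in words
            if SequenceMatcher(None, query, w.lower()).ratio() >= threshold]

# Hypothetical words extracted by OCR from a batch of scanned pages.
ocr_words = ["scanner", "scamer", "scarner", "scatter", "printer"]

print(fuzzy_matches("scanner", ocr_words))
# ['scanner', 'scamer', 'scarner']
```

Lowering the threshold finds more OCR errors but also admits more unrelated words, so in practice the tolerance would need to be tuned to the quality of the source material.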


A principal cost of developing any image archive is that of data capture. The scanning process itself, although potentially lengthy, can be dwarfed by the keying process. Data input staff need to record sufficient information to allow for documents to be readily retrieved. For complex documents the amount of information to be recorded can be extremely large, particularly if a user is to be able to easily locate the documentation in the future. In addition, the data to be recorded needs to be clearly identifiable to staff who may not be fully conversant with the content of the actual documentation.

A logical solution to the data input process is to make use of the actual content of the documents to be scanned. The new generation of Optical Character Recognition (OCR) engines, coupled with developments in the power of modern PCs, now makes it cost effective to consider the use of OCR as a matter of course for large documentation sets. On smaller systems, OCR throughputs of 5 - 10 pages per minute are now attainable. This rate is similar to the maximum throughputs achieved by many ad-hoc conventional scanning solutions and so would allow for the routine use of the FTR approach.

Scanning bureaux or large volume operations, by using the economies of scale available to the high throughput user, routinely batch process material to be scanned. In this way, a 24-hour processing operation could be maintained with 3 to 4 scanning workstations (each scanning an average of 8,000 page images per day) supplying one “OCR server”.
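As a rough check on what such a configuration asks of the OCR server, the arithmetic can be sketched as follows. The workstation count and daily page volume come from the text; the assumptions that every scanned page is passed to OCR and that the server runs continuously for the full 24 hours are simplifications, and the function name is hypothetical.

```python
def required_ocr_rate(workstations, pages_per_station_per_day, ocr_hours_per_day=24):
    """Pages per minute the OCR server must sustain to keep pace with scanning."""
    pages_per_day = workstations * pages_per_station_per_day
    return pages_per_day / (ocr_hours_per_day * 60)

for stations in (3, 4):
    rate = required_ocr_rate(stations, 8000)
    print(f"{stations} workstations: {stations * 8000} pages/day, "
          f"about {rate:.1f} pages/minute")
# 3 workstations: 24000 pages/day, about 16.7 pages/minute
# 4 workstations: 32000 pages/day, about 22.2 pages/minute
```

On these assumptions the server would need to sustain roughly 17 to 22 pages per minute, which is above the 5 - 10 pages per minute quoted for smaller systems, so a bureau OCR server would have to be correspondingly more capable.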

Bureau scanning costs for an FTR solution (and, by implication, the costs of handling the process in-house) can be divided into a number of main areas:
Preparation of the documentation to be scanned. In an ideal world, bureaux would receive the material in a condition ready for immediate scanning. To achieve the highest throughput scanning, the material should ideally be consistently single or double sided, have a regular paper size (e.g. A4) and weight (e.g. 80-100 gram paper). Non-standard paper types such as faxes, telexes and card can cause scanner feed problems. All paper should be removed from binders / files / envelopes and all staples should be removed. Patch pages (or blank sheets) should be inserted into the documentation identifying each new document. Additional costs to be recognised include those for any staff needed for the extraction process (from the client’s library), collection charges and the cost of storage boxes.
Scanning. With the requirement to OCR the images, a scanning resolution of 300 dpi is generally recommended. Higher resolutions can be used, but these generally add little to the quality of the OCR output while increasing the image sizes. Some FTR solutions now offer the option of scanning and OCRing colour material, although most organisations use this only in an ad-hoc manner. Similarly, large-format drawings can also be incorporated although, as with the colour option, this is generally an ad-hoc process.
OCR to generate text / Indexing of the free text database. Although it is usual to carry out both these functions as a batch process, offline from the immediate scanning process, they are frequently run under supervision thereby tying up both staff and valuable hardware resources.
Manual indexing (if required in addition to the free-text indexing). Costs for this process are usually levied as a charge per keystroke. The most significant issue, however, frequently remains that of clarity for the data entry staff, who need to be able to easily identify the key words and to read handwriting that is sometimes illegible.
Quality checking. Although the scanning process is carried out in real-time and under the direct supervision of the scanning operator, an element of quality control is necessary. Within the bureau environment this is usually carried out by a supervisor who typically performs this function as a routine part of any job.
Recompilation. The client may require that the scanned material be recompiled (re-stapled, re-inserted into folders or envelopes, or re-boxed) prior to return of the original documentation.
Delivery costs. These include the cost of the delivery media (typically CD-R or possibly DVD-RAM) and any delivery costs associated with the return of the original documentation. Storage capacities of CD-R and DVD-RAM (2.6GB) are typically 13,000 and 52,000 page images respectively (including corresponding text files and FTR database).
With the variables described above, it will be understood that pricing for every job is unique. That said, ballpark figures of between £70 and £150 per 1,000 page images for the complete scanning process are typical. An accurate quotation is generally dependent on a representative sample of the material to be scanned.
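The ballpark price and the media capacities quoted above can be combined into a first-pass estimate for a job of a given size. This is a sketch for illustration only: the function and constant names are hypothetical, the 100,000-page job is an invented example, and a real quotation would depend on a representative sample as noted above.

```python
import math

# Figures quoted in the text above.
PAGES_PER_CD_R = 13_000
PAGES_PER_DVD_RAM = 52_000
PRICE_PER_1000_PAGES_GBP = (70, 150)   # ballpark low and high

def estimate_job(page_images):
    """Return the ballpark cost range and media counts for a scanning job."""
    low, high = (page_images / 1000 * p for p in PRICE_PER_1000_PAGES_GBP)
    return {
        "cost_low_gbp": low,
        "cost_high_gbp": high,
        "cd_r_discs": math.ceil(page_images / PAGES_PER_CD_R),
        "dvd_ram_discs": math.ceil(page_images / PAGES_PER_DVD_RAM),
    }

print(estimate_job(100_000))
# {'cost_low_gbp': 7000.0, 'cost_high_gbp': 15000.0, 'cd_r_discs': 8, 'dvd_ram_discs': 2}
```

As a cross-check on the quoted capacities, 2.6GB divided by 52,000 page images works out at about 50 KB per page, including the corresponding text files and FTR database.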

Many of the costs outlined above also apply to conventional image cataloguing jobs (scanning with keyword indexing). The principal cost differences are those associated with the OCR process. When compared to the conventional approach, particularly where the individual documents are relatively small and relatively complex manual indexes need to be created, it can readily be argued that the cost of data capture could be lower using the FTR approach. As documents become larger, the cost benefits (with respect to the data capture portion) swing back in favour of the keyword approach. It can be argued, however, that this is offset by the added value of the other strengths of Free Text Retrieval systems.

In summary, it is proposed that Free Text Retrieval (FTR) solutions are becoming the logical choice for the generation of image catalogues, particularly in instances where the text content of documents is critical to the search process. The use of scanning bureaux is recommended where economies of scale are to be achieved, especially where a backlog of material needs to be converted.

Dr Adrian Shepherd
Technical Director
Scan-IT : A Division of Instant Library Ltd


