The future is FREE - the rise of free text retrieval
The traditional mechanism for retrieving scanned documentation has been
through the use of a keyword search typically via a standard database
search engine. Although this remains a highly efficient method for certain
groups of documentation (e.g. invoices, transmittal notes, delivery tickets,
purchase orders), a growing number of applications could be better served
by the use of content sensitive searching methods.
Free-Text Retrieval (FTR) Systems have been available for a number of
years from manufacturers such as ZyLab, Excalibur and Adobe. Increasingly,
most of the larger suppliers of full-blown document management systems
(e.g. Documentum, Filenet, Pafec) have recognised the growing importance
of FTR technologies by integrating these capabilities into their products.
Collectively, these products have a proven track record in many of the
largest companies and are used extensively where knowledge needs to be
shared and re-used throughout organisations. The nature of these products
allows the end users to organise unstructured data collections into a
more structured form at a relatively low cost and to publish them via
LAN / WAN networks, the World Wide Web, an Intranet or CD / DVD.
The strength of the FTR approach is that it permits, in essence, a full
keyword search without the need for user input at the capture stage (i.e.
minimal typing is required at the initial stages of the scanning process).
Once indexed and made available across the network, the user can search
for any word or phrase that may be contained in a document or group of
documents. The specific section of the identified document(s) can then
be easily viewed and assessed by the user.
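The indexing and search step described above can be illustrated with a minimal sketch of an inverted index, assuming the OCR text of each page is already available as a plain string. The function names and the sample page identifiers are hypothetical and are not taken from any particular FTR product; commercial engines will differ considerably in detail.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to {page_id: [word positions]} (an inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, word in enumerate(tokenize(text)):
            index[word][page_id].append(position)
    return index

def search_phrase(index, phrase):
    """Return sorted ids of pages that contain the words of `phrase` consecutively."""
    words = tokenize(phrase)
    if not words:
        return []
    hits = []
    for page_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Every following word must appear at the next position on the same page.
            if all(start + offset in index.get(word, {}).get(page_id, [])
                   for offset, word in enumerate(words)):
                hits.append(page_id)
                break
    return sorted(hits)

# Hypothetical OCR output for two scanned pages (no keying at capture time).
pages = {
    "invoice_0001": "Purchase order for one flatbed scanner",
    "memo_0002": "The scanner feed jammed on the delivery ticket",
}
index = build_index(pages)
```

Because word positions are stored alongside each page, a call such as `search_phrase(index, "scanner feed")` can report not only which page matched but also where on the page the match begins, which is the information a viewer needs in order to jump to the relevant section.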
It is the capacity of FTR systems to permit users to trawl for information
that sets them apart from traditional keyword-based systems. An additional
benefit is that many of the products can also identify (and automatically
index) other electronic file formats (e.g. word processor formats, spreadsheets
and now even e-mail mailboxes). With this combination of tools, there
are now fewer reasons for documents to be lost on our ever-larger
hard disks (local or across the network).
Additional features of FTR systems can also include :
Robust text retrieval search engines. From the use of simple word searches
to Boolean, fuzzy, proximity and progressive searches, a wide range of
documents can easily be located. The particular strength of fuzzy search
technologies is that they can compensate for poor Optical Character Recognition
(OCR) and spelling errors by allowing for variations in search terms.
An example of their use could be a search for the word "scanner"
where the OCR process has incorrectly recognised the word as "scamer"
or "scarner". Both are perfectly reasonable errors when OCR'ing
poor quality or old documents.
Hit highlighting. Although using the product of the OCR process (i.e.
raw text) for the search process, the user is presented with the electronic
photocopy of the original document. Hit highlighting simply superimposes
the position of text match on the original image. The use of the original
image also allows for the direct viewing of complex graphics and hand-written
/ scribbled notes.
Automated alerting options that can monitor a document archive and send
out e-mail notifications of new documents that match a user's
pre-configured search profile.
Access to web-server technologies allowing for transparent access over
the Internet or Intranet to the scanned / indexed information via any
Internet browser. The HTML code is generated automatically from one or
more indexed sources and TIFF images may be translated into lower quality
images in order that data transfer rates are increased. The HTML templates
and CGI scripts generated are fully user definable allowing for a corporate
identity to be maintained if required.
Integrated e-mail and printing functionality.
Generation of user configured bookmarks and content lists.
Standard and open file formats such as PDF or TIFF Group 4 image files,
and standard ASCII text.
Options to use manual entry key fields (free text, list & date fields),
automatic key fields and barcode recognition. The use of special patch
pages can allow for documents to be automatically separated prior to indexing.
Support for a wide range of scanners and digital copiers.
Publishing capabilities to CD-R or DVD-RAM.
Encryption options for data, image and text files.
Combination of workstation or concurrent (floating) licences allowing
for the most cost-efficient use of the system.
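The fuzzy search feature listed above can be illustrated with a short sketch based on edit (Levenshtein) distance, one common way of tolerating OCR and spelling errors. This is an assumption for illustration only: the text does not say which matching technique any given FTR product uses, and the function names and the threshold of two edits are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions,
    deletions or substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def fuzzy_matches(term, vocabulary, max_distance=2):
    """Return the indexed words within `max_distance` edits of the search term."""
    term = term.lower()
    return sorted(word for word in vocabulary
                  if edit_distance(term, word.lower()) <= max_distance)

# Hypothetical words found in an index built from imperfect OCR output.
vocabulary = ["scamer", "scarner", "printer", "scanner"]
matches = fuzzy_matches("scanner", vocabulary)
```

With this threshold, a search for "scanner" also retrieves the misrecognised forms "scamer" and "scarner" while ignoring "printer". A wider threshold recovers more OCR errors at the price of more false matches, so the tolerance would normally be left for the user to adjust.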
A principal cost of developing any image archive is that of data capture.
The scanning process itself, although potentially lengthy, can be dwarfed
by the keying process. Data input staff need to record sufficient information
to allow for documents to be readily retrieved. For complex documents
the amount of information to be recorded can be extremely large, particularly
if a user is to be able to easily locate the documentation in the future.
In addition, the data to be recorded needs to be clearly identifiable
to staff who may not be fully conversant with the content of the actual
documents.
A logical solution to the data input process is to make use of the actual
content of the documents to be scanned. The new generation of Optical
Character Recognition (OCR) engines, coupled with developments in the
power of modern PCs, now makes it cost effective to consider the
use of OCR as a matter of course for large documentation sets. On smaller
systems, OCR throughputs of 5 - 10 pages per minute are now attainable.
This rate is similar to the maximum throughputs achieved by many, ad-hoc,
conventional scanning solutions and so would allow for the routine use
of the FTR approach.
Scanning bureaux or large volume operations, by using the economies of
scale available to the high throughput user, routinely batch process material
to be scanned. In this way, a 24 hour processing operation could be maintained
with 3 - 4 scanning workstations (each scanning an average of 8,000
page images per day) supplying one OCR server.
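As a rough check on these figures, the sketch below works out the sustained rate that a single OCR server would need in order to keep pace with the scanning workstations. It assumes that the server runs around the clock and that the daily scanning total is spread over the full 24 hours; the function name is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60

def required_ocr_rate(workstations, pages_per_station_per_day=8000):
    """Pages per minute one OCR server must sustain to keep up with the scanners."""
    return workstations * pages_per_station_per_day / MINUTES_PER_DAY

# Three and four workstations, each averaging 8,000 page images per day.
rate_three = required_ocr_rate(3)
rate_four = required_ocr_rate(4)
```

Under these assumptions the server would need to sustain roughly 17 to 22 pages per minute, which is noticeably above the 5 - 10 pages per minute quoted for smaller systems.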
Bureaux scanning costs for an FTR solution (and, by implication, the costs
of handling the process in-house) can be divided into a number of main
areas:
Preparation of the documentation to be scanned. In an ideal world, bureaux
would receive the material in a condition ready for immediate scanning.
To achieve the highest throughput scanning, the material should ideally
be consistently single or double sided, have a regular paper size (e.g.
A4) and weight (e.g. 80-100 gram paper). Non-standard paper types such
as faxes, telexes and card can cause scanner feed problems. All paper
should be removed from binders / files / envelopes and all staples should
be removed. Patch pages (or blank sheets) should be inserted into the
documentation identifying each new document. Additional costs to be recognised
include those for any staff needed for the extraction process (from the
client's library), collection charges and the cost of storage boxes.
Scanning. With the requirement to OCR the images, a scanning resolution
of 300 dpi is generally recommended. Higher resolutions can be used but
these generally do not contribute greatly to the quality of the OCR output,
but only to the image sizes themselves. Some FTR solutions now offer the
option of scanning / OCR'ing colour material although this is typically
used by most organisations in an ad-hoc manner. Similarly, large format
drawings can also be incorporated although, as with the colour option,
this is generally an ad-hoc process.
OCR to generate text / Indexing of the free text database. Although it
is usual to carry out both these functions as a batch process, offline
from the immediate scanning process, they are frequently run under supervision
thereby tying up both staff and valuable hardware resources.
Manual indexing (if required in addition to the free-text indexing). Costs
for this process are usually levied on a charge per keystroke. The most
significant issue, however, frequently remains that of clarity to the
data entry staff who need to be able to easily identify the key words
and to read handwriting that is sometimes illegible.
Quality checking. Although the scanning process is carried out in real-time
and under the direct supervision of the scanning operator, an element
of quality control is necessary. Within the bureau environment this is
usually carried out by a supervisor who typically performs this function
as a routine part of any job.
Recompilation. The client may require that the scanned material be recompiled
(re-stapled, re-inserted into folders or envelopes, or re-boxed) prior
to return of the original documentation.
Delivery costs. These include the cost of the delivery media (typically
CD-R or possibly DVD-RAM) and any delivery costs associated with the return
of the original documentation. Storage capacities of CD-R and DVD-RAM
(2.6GB) are typically 13,000 and 52,000 page images respectively (including
corresponding text files and FTR database).
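The storage figures quoted above imply an average size per page image, which the following sketch derives. It assumes decimal gigabytes for the 2.6GB DVD-RAM figure; the resulting CD-R capacity is inferred from the page counts rather than stated in the text, and the function name is hypothetical.

```python
import math

DVD_RAM_BYTES = 2.6e9   # 2.6GB as quoted in the text (decimal gigabytes assumed)
DVD_RAM_PAGES = 52_000  # page images per DVD-RAM, as quoted
CD_R_PAGES = 13_000     # page images per CD-R, as quoted

# Average storage per page image, including its text file and share of the database.
bytes_per_page = DVD_RAM_BYTES / DVD_RAM_PAGES

# CD-R capacity implied by the same per-page figure (an inference, not a quoted value).
implied_cd_r_bytes = bytes_per_page * CD_R_PAGES

def discs_needed(page_count, pages_per_disc):
    """Number of discs needed to deliver `page_count` page images."""
    return math.ceil(page_count / pages_per_disc)
```

The per-page figure works out at about 50 KB, and the implied CD-R capacity of about 650 MB is in line with the common capacity of that medium. An archive of 100,000 page images, for example, would need `discs_needed(100_000, CD_R_PAGES)` CD-Rs or `discs_needed(100_000, DVD_RAM_PAGES)` DVD-RAM discs.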
With the variables described above, it will be understood that pricing
for every job is unique. That said, ballpark figures of between £70
and £150 per 1000 page images for the complete scanning process are
typical. An accurate quotation is generally dependent on a representative
sample of the material to be scanned.
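For budgeting purposes, the ballpark range above can be turned into a simple estimate. The sketch below is illustrative only: it assumes that cost scales linearly with the number of page images, which ignores the job-specific variables already described, and the function name is hypothetical.

```python
def estimate_cost_range(page_images, low_per_1000=70.0, high_per_1000=150.0):
    """Return (low, high) cost in pounds for scanning `page_images` page images."""
    thousands = page_images / 1000
    return thousands * low_per_1000, thousands * high_per_1000

# Example: a hypothetical backlog of 25,000 page images.
low, high = estimate_cost_range(25_000)
```

A range of this kind is only a starting point; a firm figure would still depend on a representative sample of the material.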
Many of the costs outlined above also apply to conventional image cataloguing
jobs (scanning with keyword indexing). The principal cost differences
are those associated with the OCR (Optical Character Recognition) process.
When compared to the conventional approach, particularly where the individual
documents are relatively small and relatively complex manual indexes need
to be created, it can quickly be argued that the cost of data capture
could be lower if using the FTR approach. As documents become larger,
the cost benefits (with respect to the data capture portion) swing back
in favour of the keyword approach. It can be argued, however, that this
is offset by the added value gained when the other strengths of
Free Text Retrieval systems are taken into account.
In summary, it is proposed that Free Text Retrieval
(FTR) solutions are becoming the logical choice for the generation of
image catalogues, particularly in instances where the text content of
documents is critical to the search process. The use of scanning bureaux
is recommended where economies of scale are to be achieved, especially
where a backlog of material needs to be converted.
Dr Adrian Shepherd
Scan-IT : A Division of Instant Library Ltd