|

Access control
Particularly in companies, information is often approved
for restricted distribution only. A search system must not
give unauthorised users knowledge of, let alone access to,
confidential information through a “back door”.
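
As a minimal sketch of this idea, a result list can be filtered
against per-document permissions before anything is shown to the
user. The names `visible_results` and `ACL` and the group labels
are hypothetical; documents without an entry are hidden by default.

```python
# Hypothetical access control list: document id -> groups allowed to see it.
ACL = {
    "doc1": {"staff", "management"},
    "doc2": {"management"},
}

def visible_results(results, user_groups, acl=ACL):
    """Keep only the hits the user may see (deny by default)."""
    groups = set(user_groups)
    return [doc_id for doc_id in results if acl.get(doc_id, set()) & groups]

# "doc3" has no ACL entry, so it is never shown.
hits_for_staff = visible_results(["doc1", "doc2", "doc3"], ["staff"])
```

Filtering before display also keeps titles and snippets of
restricted documents out of the result list.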

Boolean retrieval
The Boolean operators (AND, OR, NOT) are used to
explicitly define relationships between individual
search expressions.
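
A minimal sketch of how the three operators can map onto set
operations over an inverted index. The toy collection `DOCS` and
the helper names are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> text.
DOCS = {
    1: "information retrieval with boolean operators",
    2: "boolean algebra for beginners",
    3: "retrieval of secret information",
}

def build_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

INDEX = build_index(DOCS)
ALL_IDS = set(DOCS)

def term(word):
    """Return the ids of all documents containing the word."""
    return INDEX.get(word.lower(), set())

# AND -> intersection, OR -> union, NOT -> complement.
both = term("boolean") & term("retrieval")
either = term("boolean") | term("retrieval")
not_boolean = term("retrieval") & (ALL_IDS - term("boolean"))
```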

Clustering
Certain document collections contain a large number
of documents that differ only marginally from one another.
These are often different versions of the same document or
revisions/corrections of one document. Such documents are
grouped together and presented to the user in a more
compact form.

Concept sensors
Concept sensors permit the formulation of very complex relationships
that require extensive sets of rules. With their help, it is
possible to detect cases that arise only when several factors
occur in a particular combination.

Conversion - Document formats
Documents in all standard office formats (Word, Excel, PowerPoint,
Lotus, WordPerfect etc.) and the major presentation formats
(HTML, XML, SGML, PostScript and PDF) are converted in
such a way that they can be read into an information retrieval
system.

Conversion - Character coding
Different coding systems for textual information (ASCII, ANSI/Windows,
ISO Latin, KOI8 Cyrillic, etc.) are converted into a suitable
internal format so that they can be processed by an information
retrieval system.
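
A minimal sketch of such a conversion, assuming Unicode strings
as the internal format and that the source encoding of each
input is already known. The function name `to_internal` is
hypothetical; the codec names are those of the Python standard
library.

```python
def to_internal(raw: bytes, source_encoding: str) -> str:
    """Decode raw bytes from a known encoding into a Unicode string."""
    return raw.decode(source_encoding)

# The same internal text results whatever the original coding was.
latin = to_internal(b"caf\xe9", "latin-1")
windows = to_internal(b"caf\xe9", "cp1252")
cyrillic = to_internal("Привет".encode("koi8_r"), "koi8_r")
```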

Coordination level matching
The ranking list is subdivided into individual sections,
which are arranged according to the number of search terms
found.
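
The grouping can be sketched as follows. The function name
`coordination_levels` and the toy documents are assumptions for
illustration; ranking inside each section is left out.

```python
from collections import defaultdict

def coordination_levels(query_terms, docs):
    """Group document ids by how many distinct query terms they contain."""
    wanted = {t.lower() for t in query_terms}
    levels = defaultdict(list)
    for doc_id, text in docs.items():
        matched = wanted & set(text.lower().split())
        if matched:
            levels[len(matched)].append(doc_id)
    # Sections with the most matched terms come first.
    return [(n, sorted(levels[n])) for n in sorted(levels, reverse=True)]

docs = {
    "a": "fast search engine",
    "b": "search engine ranking list",
    "c": "ranking of football teams",
    "d": "cooking recipes",
}
sections = coordination_levels(["search", "engine", "ranking"], docs)
```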

Cross-language retrieval
In today’s world of globalisation and multinational
organisations and companies, it is becoming increasingly common
to index document collections that contain objects in many
different languages. It must be possible to access such
collections efficiently using only a single query formulated
in the language preferred by the user.

Decompounding
Some languages, such as German, permit the formation
of complex expressions by joining a number of simple words
together without any intermediate spaces. However, compound
words of this kind can often also be written as a phrase, and
a query frequently refers to only part of them. It is
therefore important to break them down into their constituent
parts.
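
A minimal dictionary-based sketch of the splitting step. The
lexicon and the function name `decompound` are hypothetical, and
linking elements between the parts of a compound are
deliberately not handled.

```python
def decompound(word, lexicon, min_len=3):
    """Return a list of lexicon words that concatenate to `word`, or None."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try every split point that leaves at least min_len characters per side.
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        if head in lexicon:
            rest = decompound(word[i:], lexicon, min_len)
            if rest is not None:
                return [head] + rest
    return None

# Hypothetical toy lexicon of simple German words.
LEXICON = {"daten", "bank", "system", "haus", "tür"}

parts = decompound("Datenbanksystem", LEXICON)
```

Indexing the parts alongside the full compound lets a query for
"Datenbank" or "System" still reach the document.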

Duplicate elimination
Certain document collections contain redundant information,
in particular many documents that appear in exactly the same
or almost identical form. These duplicates are grouped
together and displayed to the user in a more
compact form.
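
One possible way to find almost identical documents is to
compare overlapping word sequences ("shingles"). The sketch
below is an illustration under that assumption, not the method
of any particular system; the function names, the shingle size
and the threshold are hypothetical.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Share of common elements among all elements of two sets."""
    return len(a & b) / len(a | b)

def group_duplicates(docs, threshold=0.5):
    """Greedy grouping: a document joins the first group whose
    representative it resembles closely enough, else starts a new group."""
    groups = []  # list of (representative shingles, member ids)
    for doc_id, text in docs.items():
        sh = shingles(text)
        for representative, members in groups:
            if jaccard(sh, representative) >= threshold:
                members.append(doc_id)
                break
        else:
            groups.append((sh, [doc_id]))
    return [members for _, members in groups]

docs = {
    1: "the quarterly report shows strong growth in europe",
    2: "the quarterly report shows strong growth in asia",
    3: "minutes of the board meeting",
}
groups = group_duplicates(docs)
```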

Entity recognition
In many cases, background knowledge is necessary in
order to make optimal use of information. Entity recognition
identifies words as names (of persons, companies, localities)
and thereby makes it possible to link them
with additional information.

Fuzzy matching
Fuzzy matching enables robust retrieval of relevant information, especially
in the case of typing errors and alternative spellings. It
generates relevant matches regardless of whether it is the query or the
document that contains the misspellings or alternative transcriptions and
transliterations.
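
One common way to tolerate such errors is to compare strings by
edit distance (Levenshtein distance). The sketch below assumes
that approach; the function names, the vocabulary and the
maximum distance of 2 are illustrative choices.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def fuzzy_lookup(query, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance edits, closest first."""
    q = query.lower()
    hits = sorted((edit_distance(q, w), w) for w in vocabulary)
    return [w for d, w in hits if d <= max_distance]

matches = fuzzy_lookup("retreival", ["retrieval", "ranking", "index"])
```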

Language detection
Stemming and decompounding usually depend on the language.
If a system has to process documents in different languages,
it must first detect the language of
each document.
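
A very simple heuristic, shown here only as a sketch, counts how
many frequent function words of each language occur in the text.
The word lists in `PROFILES` are small hypothetical samples; a
practical detector would need far more evidence.

```python
# Hypothetical mini-profiles: a few frequent words per language.
PROFILES = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "de": {"der", "die", "und", "ist", "das", "nicht"},
    "fr": {"le", "la", "et", "est", "les", "des"},
}

def detect_language(text):
    """Return the language whose frequent words occur most often, or None."""
    tokens = text.lower().split()
    counts = {lang: sum(t in words for t in tokens)
              for lang, words in PROFILES.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

language = detect_language("Der Hund ist nicht in das Haus")
```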

Metadata
Even when they are only partly structured, many documents
contain metadata (date, author, etc.) that can considerably
facilitate access.

N-Gram
Indexing by words is suitable when a system processes
documents that contain few or no typing or grammatical errors
and that are written in a language known to the system
(necessary for stemming/decompounding).
If this is not the case, the system can break words down into
smaller units ("N-grams"), which makes error-tolerant
comparison possible.
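
A minimal sketch of this idea with character trigrams. The
padding character, the function names and the use of the Jaccard
coefficient as the similarity measure are assumptions made for
illustration.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, padded with '_'
    so that the beginning and end of the word are represented too."""
    pad = "_" * (n - 1)
    padded = f"{pad}{word.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the n-gram sets of two words (0.0 to 1.0)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

# A misspelling still shares many trigrams with the intended word.
close = ngram_similarity("retrieval", "retreival")
far = ngram_similarity("retrieval", "ranking")
```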

Noun phrase extraction
A combination of several words can often have a more
specific meaning than the sum of its individual components.
Phrases, i.e., expressions involving several words, are recognised
and are processed as a unit.

Parsing documents
The document is scanned to identify those parts that
contain information to be indexed. Other parts of the text
(certain formatting codes, etc.) are ignored.

Passage retrieval
In longer documents or information streams, it is
often the case that only short sections are relevant for the
answer to a search query. In order to deliver a good search
result, the system must be able to identify and correctly
weight such sections.

Probabilistic
Ranking lists are sorted on the basis of estimates of
the probability that an object is relevant. There are sophisticated
formulae for calculating these probabilities.
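
One widely used ranking function that grew out of the
probabilistic relevance model is Okapi BM25. The sketch below is
an illustration of that function, not necessarily the formula
meant here. The parameter values `k1=1.5` and `b=0.75` are
conventional assumptions, and the scores serve for ordering
only; they are not calibrated probabilities.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score every document against the query and sort best first."""
    tokenised = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenised)
    avg_len = sum(len(t) for t in tokenised.values()) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenised.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenised.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            term = term.lower()
            if tf[term] == 0:
                continue
            # Rare terms weigh more (a commonly used non-negative variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "a": "probabilistic ranking of documents",
    "b": "ranking ranking ranking lists",
    "c": "cooking recipes for beginners",
}
ranking = bm25_scores(["probabilistic", "ranking"], docs)
```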

Query extension
It is often difficult for the user to know the terminology
in which the fact being sought is expressed in the available
objects. Automatic query expansion adds further terms to
search queries and thereby helps to achieve more comprehensive
search results.
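
A minimal sketch, assuming that the additional terms come from a
synonym table. The table `SYNONYMS` and the function name
`expand_query` are hypothetical; other sources of expansion
terms are equally possible.

```python
# Hypothetical synonym table: term -> alternative expressions.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "buy": ["purchase"],
}

def expand_query(terms, synonyms=SYNONYMS):
    """Return the query terms followed by their synonyms, without repeats."""
    expanded = []
    for term in terms:
        for candidate in [term, *synonyms.get(term.lower(), [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

expanded = expand_query(["buy", "car"])
```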

Rule-based
Information objects are ordered with the help of
a number of rules, so that they can then be presented in a
ranking list. This procedure makes it simple to adapt the
ranking to customer-specific requirements.

Relevancy feedback
The user can judge search results for their relevance;
the system then automatically refines the query
and supplies better search results.
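
A classic way to refine a query from such judgements is the
Rocchio method, sketched here as one possible approach. Queries
and documents are represented as term-weight dictionaries; the
weights `alpha`, `beta` and `gamma` are illustrative values.

```python
from collections import Counter

def centroid(docs):
    """Average term weights over a list of term-weight dictionaries."""
    if not docs:
        return {}
    total = Counter()
    for doc in docs:
        total.update(doc)  # adds the weights term by term
    return {term: weight / len(docs) for term, weight in total.items()}

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards relevant and away from non-relevant documents."""
    rel, non = centroid(relevant), centroid(non_relevant)
    refined = {}
    for term in set(query) | set(rel) | set(non):
        weight = (alpha * query.get(term, 0)
                  + beta * rel.get(term, 0)
                  - gamma * non.get(term, 0))
        if weight > 0:  # terms with non-positive weight are dropped
            refined[term] = weight
    return refined

refined = rocchio(
    query={"jaguar": 1.0},
    relevant=[{"jaguar": 1, "car": 1}, {"jaguar": 1, "engine": 1}],
    non_relevant=[{"jaguar": 1, "cat": 1}],
)
```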

Statistical text categorisation
Increasingly, the task of an information retrieval
system is not only to search large quantities of documents
for search expressions, but also to assign documents
automatically to a hierarchy of categories that are defined
by complex criteria. Statistical procedures solve this
problem on the basis of training examples.

Stemming
In natural language, expressions occur in
various word forms, depending on their grammatical
context. To ensure that as much relevant information
as possible is found, words have to be normalised so that
words that do not appear in exactly the same form as the
search terms are still returned as hits.
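
The normalisation can be illustrated with a deliberately naive
suffix stripper for English. It is a toy sketch, not a real
stemming algorithm; the suffix list and the minimum stem length
are assumptions.

```python
# Illustrative English suffixes, longest first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if enough of the word remains."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[:-len(suffix)]
    return w

# All four word forms are reduced to the same index term.
stems = {naive_stem(w) for w in ["search", "searches", "searched", "searching"]}
```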

Stop word elimination
Certain very frequent words (articles, prepositions,
etc.) are eliminated. These words do not help in distinguishing
relevant from non-relevant information. As a result of the
elimination, the size of the index is also reduced and the
search is accelerated.
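
A minimal sketch of the filtering step. The stop word list shown
is a small hypothetical sample for English; real lists are
language-specific and considerably longer.

```python
# Hypothetical sample of English stop words.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "on", "and", "or", "to", "is"})

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop very frequent words that do not help to distinguish documents."""
    return [t for t in tokens if t.lower() not in stop_words]

kept = remove_stop_words("The size of the index is reduced".split())
```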

Structured documents
Structure in documents is detected and evaluated in
order to later allow specific access to information in certain
fields only.

Sub-collections
Information can be divided into sub-collections,
in particular according to its source. As a result,
a user can specifically include or exclude individual areas
of the document collection and focus the search
on individual areas.

Word segmentation
The individual words are taken from the document or
the information stream. In doing this, punctuation marks and
spaces, among other things, are removed.
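
For languages that separate words with spaces, this step can be
sketched with a regular expression. The pattern below is an
assumption for illustration: it keeps hyphenated words and
apostrophes inside a word together and drops everything else.

```python
import re

# A word is a run of word characters, optionally joined by "-" or "'".
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*")

def segment(text):
    """Return the words of a text without punctuation marks and spaces."""
    return TOKEN_RE.findall(text)

words = segment("Rule-based ranking, isn't it?")
```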

|