Glossary


Access control
Particularly in companies, information is often approved for restricted distribution only. A search system must not give unauthorised users knowledge of, or access to, secret information through a “back door”.
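
As a minimal sketch of this idea, hits can be filtered against an access list before anything is shown to the user. The Document class, the allowed_groups field and the visible_hits function are hypothetical names chosen for illustration; they do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    # Hypothetical access list: groups that may see this document.
    allowed_groups: set = field(default_factory=set)


def visible_hits(hits, user_groups):
    """Keep only hits whose access list shares at least one group with the user."""
    groups = set(user_groups)
    return [doc for doc in hits if doc.allowed_groups & groups]


hits = [
    Document("d1", "quarterly report", {"finance"}),
    Document("d2", "canteen menu", {"staff", "finance"}),
]
visible = visible_hits(hits, ["staff"])
print([doc.doc_id for doc in visible])
```

Filtering has to happen before hit counts, titles or text extracts are produced, otherwise these can leak the existence or content of a restricted document.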


Boolean retrieval
The Boolean operators (AND, OR, NOT) are used to define explicit relationships between individual search expressions.
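
The three operators map directly onto set operations over an inverted index. The following is an illustrative sketch only; the sample documents and the postings helper are invented for the example.

```python
docs = {
    1: "search engines rank documents",
    2: "boolean search uses operators",
    3: "documents contain words",
}

# Inverted index: term -> set of ids of the documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)


def postings(term):
    """Return the ids of all documents containing the term."""
    return index.get(term, set())


and_result = postings("search") & postings("documents")  # search AND documents
or_result = postings("search") | postings("documents")   # search OR documents
not_result = postings("search") - postings("boolean")    # search AND NOT boolean
print(and_result, or_result, not_result)
```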


Clustering
Certain document collections contain many documents that differ only marginally from one another. These are often different versions of the same document, or revisions and corrections of it. Such documents are grouped together and presented to the user in a more compact form.


Concept sensors
Concept sensors permit the formulation of very complex relationships that require the integration of extensive rules. With their help, it is possible to detect cases that arise only when several factors occur in a particular combination.

Conversion - Document formats
Documents in all standard Office formats (Word, Excel, PowerPoint, Lotus, WordPerfect etc.) and the major presentation formats (HTML, XML, SGML, Postscript and PDF) are converted so that they can be read into an information retrieval system.

Conversion - Character Coding
Different coding systems for textual information (ASCII, ANSI/Windows, ISO Latin, KOI8 Cyrillic, etc.) are converted into a suitable internal format so that they can be processed by an information retrieval system.
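
A minimal sketch, assuming the encoding of each input is already known and that Unicode strings serve as the internal format (the glossary does not say which internal format is meant). The to_internal function and the sample inputs are illustrative.

```python
def to_internal(raw: bytes, encoding: str) -> str:
    """Decode raw bytes from a known encoding into a Unicode string."""
    return raw.decode(encoding)


# Sample inputs in three different encodings.
samples = [
    ("Grüße".encode("latin_1"), "latin_1"),   # ISO Latin 1
    ("поиск".encode("koi8_r"), "koi8_r"),     # KOI8 Cyrillic
    (b"search", "ascii"),                     # ASCII
]

texts = [to_internal(raw, encoding) for raw, encoding in samples]
print(texts)
```

In practice the encoding is often not declared and has to be guessed, which is a separate problem from the conversion itself.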

Coordination Level Matching
The ranking list is subdivided into sections, which are ordered by the number of search terms found in the documents.
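
A small sketch of this grouping, assuming whitespace tokenisation and exact term matches. The function name and the sample documents are illustrative.

```python
from collections import defaultdict


def coordination_levels(query_terms, docs):
    """Group documents by how many distinct query terms they contain.

    Returns (number_of_matched_terms, [doc ids]) pairs, best level first.
    Documents that match no term are left out.
    """
    terms = set(query_terms)
    levels = defaultdict(list)
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        matched = len(terms & words)
        if matched:
            levels[matched].append(doc_id)
    return [(n, sorted(levels[n])) for n in sorted(levels, reverse=True)]


docs = {
    "a": "fuzzy search in text archives",
    "b": "text search",
    "c": "archives of printed maps",
    "d": "cooking recipes",
}
sections = coordination_levels(["text", "search", "archives"], docs)
print(sections)
```

Within each section, documents can then be ordered by any other ranking criterion.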

Cross-language retrieval
In today’s world of globalisation and multi-national organisations and companies, it is becoming increasingly common to index document collections that contain objects in many different languages. It must be possible to access such collections efficiently using a single query formulated in the language preferred by the user.

Decompounding
Some languages, such as German, permit the formation of complex expressions by joining a number of simple words together without any intermediate spaces. Such compounds can often also be written as a phrase, however, and queries frequently refer to only part of them. It is therefore important to break them down into their constituent parts.
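
One simple way to approach this is a dictionary-driven split, sketched below. This is an illustration under strong assumptions: the tiny lexicon is invented, and linking elements (such as the German "s" between parts) are not handled.

```python
def decompound(word, lexicon, min_len=3):
    """Split a compound into known words; return None if no full split exists."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest possible head first, then split the remainder recursively.
    for i in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = decompound(tail, lexicon, min_len)
            if rest is not None:
                return [head] + rest
    return None


# Hypothetical miniature lexicon for the example.
lexicon = {"donau", "dampf", "schiff", "fahrt"}
parts = decompound("Donaudampfschifffahrt", lexicon)
print(parts)
```

With the parts indexed alongside the full word, a query for "Schiff" or "Dampfschiff" can also find the compound.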


Duplicate elimination
Certain document collections contain redundant information, in particular many documents that appear in exactly the same or an almost identical form. These duplicates are grouped together and are displayed to the user in a more compact form.
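
Near-duplicates can be detected in several ways. The sketch below uses one common approach, word shingles compared by Jaccard similarity, and is not necessarily the method the glossary has in mind. The threshold of 0.5 and all names are assumptions made for the example.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b) if a | b else 1.0


def group_duplicates(docs, threshold=0.5):
    """Assign each document to the first group whose representative is similar enough."""
    groups = []
    for doc_id, text in docs.items():
        signature = shingles(text)
        for group in groups:
            if jaccard(signature, group["signature"]) >= threshold:
                group["ids"].append(doc_id)
                break
        else:
            # No similar group found: this document starts a new one.
            groups.append({"signature": signature, "ids": [doc_id]})
    return [group["ids"] for group in groups]


docs = {
    "v1": "the annual report for 2003 is now available online",
    "v2": "the annual report for 2003 is now available on paper",
    "x": "minutes of the board meeting held in march",
}
duplicate_groups = group_duplicates(docs)
print(duplicate_groups)
```

Each group can then be displayed as a single entry, with the remaining members available on request.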


Entity recognition
In many cases, background knowledge is necessary in order to make optimal use of information. Entity recognition identifies words as names (of persons, companies, localities) and thereby makes it possible to place them in connection with additional information.


Fuzzy matching
Fuzzy matching enables robust retrieval of relevant information, especially in the case of typing errors and alternative spellings. It produces relevant matches regardless of whether the misspelling, alternative transcription or transliteration occurs in the query or in the document.
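
A common building block for fuzzy matching is the Levenshtein edit distance, sketched below. The glossary does not name a specific technique, so this is one possible choice; the fuzzy_lookup helper and the distance limit of 2 are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_lookup(term, vocabulary, max_distance=2):
    """Return all vocabulary words within max_distance edits of the term."""
    term = term.lower()
    return sorted(w for w in vocabulary
                  if edit_distance(term, w.lower()) <= max_distance)


matches = fuzzy_lookup("retreival", {"retrieval", "relevance", "retrieve"})
print(matches)
```

Because the distance is symmetric, the same comparison works whether the misspelling is in the query or in the indexed text.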


Language detection
Stemming and decompounding are usually language-dependent. If a system has to process documents in different languages, it must first detect the language of each document.
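
A very simple detector counts how many frequent function words of each language occur in the text. This sketch is for illustration only: the word lists are tiny and hand-picked, and practical systems typically rely on larger statistical profiles.

```python
# Hypothetical miniature profiles: a few frequent function words per language.
PROFILES = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
    "fr": {"le", "la", "les", "et", "est", "des"},
}


def detect_language(text):
    """Return the language whose marker words occur most often, or None."""
    words = text.lower().split()
    counts = {lang: sum(w in markers for w in words)
              for lang, markers in PROFILES.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


print(detect_language("Die Sprache ist nicht bekannt und das ist schade"))
```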


Meta data
Even when they are only partly structured, many documents contain Meta data that can considerably facilitate access (date, author, etc.).


N-Gram
Indexing by words is suitable when the documents contain few or no typing or grammatical errors and are written in a language that is known to the system (which is necessary for stemming and decompounding). If this is not the case, the system can break words down into smaller units ("N-Grams"), which makes error-tolerant comparison possible.
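
The sketch below breaks words into character trigrams and compares two words by the share of trigrams they have in common (the Dice coefficient). The padding character, the choice of n = 3 and the function names are assumptions made for the example.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, padded with '_' at both ends."""
    padded = f"_{word.lower()}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over the two n-gram sets (1.0 means identical sets)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(sorted(char_ngrams("cat")))
print(ngram_similarity("retrieval", "retreival"))
print(ngram_similarity("retrieval", "relevance"))
```

The misspelt "retreival" still shares most of its trigrams with "retrieval", while an unrelated word shares only a few, and no knowledge of the language is needed.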


Noun phrase extraction
A combination of several words can often have a more specific meaning than the sum of its individual components. Phrases, i.e., expressions involving several words, are recognised and are processed as a unit.


Parsing documents
The document is scanned to identify those parts that contain information to be indexed. Other parts of the text (certain formatting codes, etc.) are ignored.


Passage retrieval
In longer documents or information streams, it is often the case that only short sections are relevant for the answer to a search query. In order to deliver a good search result, the system must be able to identify and correctly weight such sections.


Probabilistic
Ranking lists are sorted by the estimated probability that an object is relevant. Various formulae exist for calculating these probabilities.
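
The glossary does not say which formula is meant. One widely used scoring function that grew out of the probabilistic approach is Okapi BM25, sketched here with whitespace tokenisation and commonly used default parameters (k1 = 1.2, b = 0.75), both of which are assumptions.

```python
import math
from collections import Counter


def bm25_ranking(query_terms, docs, k1=1.2, b=0.75):
    """Return (doc_id, score) pairs sorted by descending BM25 score."""
    tokenised = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenised)
    avg_len = sum(len(tokens) for tokens in tokenised.values()) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenised.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenised.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "a": "probabilistic ranking of documents",
    "b": "ranking ranking lists",
    "c": "cooking recipes",
}
ranking = bm25_ranking(["probabilistic", "ranking"], docs)
print([doc_id for doc_id, _ in ranking])
```

Rare terms contribute more than frequent ones, and repeated occurrences of a term yield diminishing returns.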


Query extension
It is often difficult for the user to know which terminology the available objects use to express the fact being sought. Automatic query expansion adds the terms actually used in those objects to the search query, and thereby helps to achieve more comprehensive search results.


Rule-based
Information objects are ordered with the help of a number of rules, so that they can then be presented in a ranking list. This procedure makes it easy to adapt the ranking to customer-specific requirements.


Relevancy feedback
The user can judge search results for their relevance. The system then uses these judgements to refine the query automatically and to supply better search results.
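
A deliberately simplified sketch of this loop: terms that occur most often in the documents the user judged relevant are appended to the query. Fuller schemes also reweight the terms and take non-relevant judgements into account. The function name and parameters are illustrative.

```python
from collections import Counter


def refine_query(query_terms, relevant_docs, extra_terms=2, stop_words=frozenset()):
    """Append the most frequent new terms from the relevant documents to the query."""
    counts = Counter()
    for text in relevant_docs:
        counts.update(w for w in text.lower().split()
                      if w not in query_terms and w not in stop_words)
    # Most frequent first; ties are broken alphabetically for a stable result.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return list(query_terms) + ranked[:extra_terms]


refined = refine_query(
    ["jaguar"],
    ["jaguar big cat habitat", "jaguar cat rainforest habitat"],
)
print(refined)
```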


Statistical text categorisation
Increasingly, an information retrieval system is expected not only to search large quantities of documents for search expressions, but also to assign documents automatically to a hierarchy of categories that are defined by complex criteria. Statistical procedures solve this problem on the basis of training examples.
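
As one example of a statistical procedure learned from training examples, the sketch below implements a small multinomial naive Bayes classifier with add-one smoothing. It uses a flat set of categories rather than a hierarchy, and the training data is invented for the example.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect word and document counts per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, category in examples:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)  # prior
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the probability.
            score += math.log((word_counts[category][word] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


model = train([
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sport"),
    ("striker scores late goal", "sport"),
])
print(classify("interest rates and shares", model))
```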


Stemming
In natural language, expressions appear in various word forms, depending on their grammatical role. To ensure that as much relevant information as possible is found, words have to be normalised, so that a word in a document is still returned as a hit even when its form differs from that of the search term.
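
The following toy example strips a few English suffixes to show the principle. It is not a real stemming algorithm: the suffix list and the minimum stem length are assumptions, and it will mishandle many words that a proper language-specific stemmer treats correctly.

```python
# Suffixes are tried in this order, so longer endings are checked first.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


forms = ["search", "searches", "searched", "searching"]
print({form: naive_stem(form) for form in forms})
```

Applying the same normalisation to both the indexed words and the query terms lets all four forms match each other.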


Stop word elimination
Certain very frequent words (articles, prepositions, etc.) are eliminated. These words do not help in distinguishing relevant from non-relevant information. As a result of the elimination, the size of the index is also reduced and the search is accelerated.
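
In its simplest form this is a filter against a fixed word list, as sketched below. The list shown is a tiny illustrative sample, not a complete stop word list.

```python
# Hypothetical miniature stop word list for English.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


kept = remove_stop_words("The size of the index is reduced".split())
print(kept)
```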


Structured documents
Structure in documents is detected and evaluated in order to later allow specific access to information in certain fields only.


Sub-collections
Information can be divided into sub-collections, in particular when it comes from different sources. A user can then enable or disable individual areas of the document collection and focus the search on the areas of interest.


Word segmentation
The individual words are taken from the document or the information stream. In doing this, punctuation marks and spaces, among other things, are removed.
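
For languages that separate words with spaces, a regular expression is often enough for a first approximation, as in this sketch. Lower-casing the result is an additional assumption of the example, and languages written without spaces need a different approach.

```python
import re


def segment(text):
    """Return the words of a text, dropping punctuation and whitespace."""
    return re.findall(r"\w+", text.lower())


words = segment("Punctuation marks, spaces (and brackets) are removed.")
print(words)
```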