Email Threading for eDiscovery

When reviewing documents in an eDiscovery process, it is helpful to assess entire email conversations. Short messages without context may be difficult or even impossible to assess, e.g. when preceding questions are missing. Furthermore, it is simply more efficient to tag an entire thread rather than every single message in the thread.

Mathematically speaking, an email thread is a directed acyclic graph that relates messages to other messages. An email message is related to another by sending, replying and forwarding. A message in the sender’s sent box is related to the same message in the recipient’s inbox. Analogously, a received message in the inbox is related to the reply in the outbox.

Depending on the collection process, it may well be that not all email messages of a conversation have been collected. Thus, an email thread may not be a connected graph; in particular, the root message may be missing. Depending on the configuration of the mail client, the body of the root message may be included in a related mail. Such included text may have been edited afterwards, which must be taken into account when comparing it with the original.
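One way to cope with a missing root is to keep the referenced but uncollected message as a placeholder node, so that its replies still end up in the same thread. The following Python sketch illustrates this idea under simplifying assumptions: the function name `group_threads`, the `parent_of` mapping and the message ids are hypothetical, and each message is assumed to have at most one known parent.

```python
from collections import defaultdict


def group_threads(collected_ids, parent_of):
    """Group collected messages into threads (connected components).

    collected_ids: ids of the messages that were actually collected.
    parent_of: maps a collected message id to the id it replies to or
               forwards; the parent may be missing from the collection.
    """
    collected_set = set(collected_ids)
    neighbours = defaultdict(set)
    for child, parent in parent_of.items():
        # Keep the edge even if the parent was never collected, so that
        # siblings of a missing root still fall into the same thread.
        neighbours[child].add(parent)
        neighbours[parent].add(child)

    seen, threads = set(), []
    for start in sorted(collected_set | set(neighbours)):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbours[node] - seen)
        threads.append({
            "collected": sorted(component & collected_set),
            "missing": sorted(component - collected_set),
        })
    return threads


# Hypothetical example: root "a" was not collected, "e" is unrelated.
threads = group_threads(["b", "c", "e"], {"b": "a", "c": "a"})
```

In this example, "b" and "c" are grouped into one thread whose root "a" is reported as missing, while "e" forms a thread of its own. Without the placeholder, "b" and "c" would appear as two unrelated messages.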

The relations between email messages can be determined in different ways. Sometimes, but not always, thread identifiers are available. References may be available, pointing to the message identifier of a related message. If none of this information is available, hashing sender, subject and date may yield additional relations.
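The lookup order described above can be sketched in Python as follows. This is only an illustration, not a definitive implementation: the `Message` class, its field names and the choice to truncate the date to the minute are assumptions, and thread identifiers are left out for brevity.

```python
import hashlib
import re
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Message:
    key: str                          # hypothetical local key, e.g. a file path
    sender: str
    subject: str
    date: datetime
    message_id: Optional[str] = None  # may be absent after collection
    references: List[str] = field(default_factory=list)


def fingerprint(msg):
    """Hash sender, subject and date (to the minute) of a message."""
    subject = re.sub(r"\s+", " ", msg.subject).strip().lower()
    raw = "|".join([msg.sender.strip().lower(), subject,
                    msg.date.strftime("%Y-%m-%dT%H:%M")])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def find_relations(messages):
    """Return a set of (key, related_key, kind) triples."""
    by_id = defaultdict(list)
    by_hash = defaultdict(list)
    for m in messages:
        if m.message_id:
            by_id[m.message_id].append(m)
        by_hash[fingerprint(m)].append(m)

    relations = set()
    # 1. References pointing to the message identifier of a related message.
    for m in messages:
        for ref in m.references:
            for target in by_id.get(ref, []):
                relations.add((m.key, target.key, "reference"))
    # 2. Fallback: equal fingerprints suggest copies of the same message,
    #    e.g. one in the sender's sent box and one in a recipient's inbox.
    for group in by_hash.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                relations.add((a.key, b.key, "same-hash"))
    return relations
```

Hash-based relations are heuristic: two different messages with the same sender, subject and minute would be linked as well, so such relations are best marked separately from reference-based ones, as the `kind` field does here.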


Information Retrieval

The objective of Information Retrieval (IR) is to search large data collections for information relevant to a user’s information requirements. The term “information retrieval” was coined by Calvin Mooers in 1950. Like “research”, the word “retrieval” does not mean finding something again. Rather, it relates to the information retrieval paradox: “If I knew what I was searching for, I wouldn’t be searching for it.”

Information retrieval focuses on three dimensions: systems and applications, theory and models, and evaluation. Various retrieval models exist, such as the Vector Space Model (VSM), probabilistic models and language models. For evaluation, recall and precision are often used. SMART was an early retrieval system that dealt with all three aspects. RankBrain is a more recent retrieval system based on TensorFlow.
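As a small illustration of a retrieval model and its evaluation, the following Python sketch ranks documents by cosine similarity of raw term-frequency vectors, a very simplified form of the Vector Space Model, and then computes precision and recall. The function names, the whitespace tokenisation and the sample documents are assumptions made for this example; practical systems typically add term weighting such as tf-idf.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def search(query, docs, threshold=0.0):
    """Return ids of documents scoring above threshold, best first."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(text.lower().split())), doc_id)
              for doc_id, text in docs.items()]
    # Sort by descending score; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > threshold]


def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical mini collection and relevance judgements.
docs = {
    "d1": "email thread review",
    "d2": "library authority file",
    "d3": "review of email",
}
retrieved = search("email review", docs)
precision, recall = precision_recall(retrieved, relevant={"d1", "d2"})
```

Here the query retrieves "d1" and "d3". With "d1" and "d2" judged relevant, one of the two retrieved documents is relevant and one of the two relevant documents is retrieved, so precision and recall are both 0.5.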

WebGND

The Integrated Authority File (German: Gemeinsame Normdatei or GND) is an international authority file used and maintained by the German National Library (German: Deutsche Nationalbibliothek or DNB), all German-language library associations, the Zeitschriftendatenbank (ZDB) and many other institutions. WebGND is an online application that supports navigation and search within this large database, which consists of more than 11 million records covering personal names, corporate names, meeting names, geographic names, topical terms and uniform work titles.

