Deduplication for eDiscovery

Today’s tools support and encourage the duplication of data. Let’s assume user A obtains a document from the enterprise storage and sends it as an email attachment to user B, who stores it on a laptop. This everyday scenario shows how easily files are duplicated. The document file is now not only in the enterprise storage, but also in A’s sent folder, in B’s inbox, and on B’s laptop, possibly twice if it is in the target folder selected by B as well as in the download folder.

eDiscovery

In eDiscovery, it is desirable to group duplicates before reviewing. This grouping is often called deduplication, but it must not be confused with deduplication for saving storage space, where identical disk blocks of files are stored only once. Deduplication for eDiscovery is more challenging.

The notion of a duplicate in the context of eDiscovery is quite tricky. The copies of the file described above differ in a number of ways: file path, creation date, file owner, last modified and last access dates, etc. For certain forensic information needs, it may matter who accessed the document and when. In early case assessment, however, it is desirable to group as many duplicates as possible to speed up a first review of possibly relevant data. This is why eD-MCS supports different definitions of duplicates as well as relevance ranking, so that the most relevant documents can be reviewed first.
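One possible definition of a duplicate, shown here only as a minimal sketch, treats two files as duplicates when their bytes are identical, regardless of path, owner, or timestamps. This is not necessarily how eD-MCS defines duplicates; the function name `group_by_content` and the sample paths are hypothetical, and the file contents are held in memory to keep the example self-contained.

```python
import hashlib
from collections import defaultdict


def group_by_content(files):
    """Group paths whose content bytes are identical.

    `files` maps a path to the file's raw bytes. Path, owner, and
    timestamps are deliberately ignored, so copies that differ only
    in metadata end up in the same group.
    """
    groups = defaultdict(list)
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(path)
    # Keep only digests shared by more than one path.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


# Hypothetical copies, loosely following the email scenario.
copies = {
    "storage/report.docx": b"quarterly figures",
    "userA/sent/report.docx": b"quarterly figures",
    "userB/downloads/report.docx": b"quarterly figures",
    "userB/notes.txt": b"something else",
}
duplicate_groups = group_by_content(copies)
```

A stricter definition, for example one suited to forensic questions, could add the owner or the last access date to the grouping key, which would split these copies into separate groups again.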

Information Retrieval

The objective of Information Retrieval (IR) is to search large data collections for information relevant to a user’s information requirements. The term “information retrieval” was coined by Calvin Mooers in 1950. Like “research”, the word “retrieval” does not refer to refinding something. Rather, it relates to the information retrieval paradox: “If I knew what I was searching for, I wouldn’t be searching for it.”

Information retrieval focuses on three dimensions: systems and applications, theory and models, and evaluation. Various retrieval models exist, such as the Vector Space Model (VSM), probabilistic models, and language models. For evaluation, recall and precision are often used. SMART was an early retrieval system that dealt with all three aspects. RankBrain is a more recent retrieval system based on TensorFlow.
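To make the two evaluation measures concrete, here is a minimal sketch: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The function name `precision_recall` and the document identifiers are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    `retrieved` is the collection of document ids returned by the system,
    `relevant` the collection of ids judged relevant for the query.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents returned, three judged relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

In this example two of the four retrieved documents are relevant, and two of the three relevant documents were found, so returning more documents could raise recall while lowering precision.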

WebGND

The Integrated Authority File (German: Gemeinsame Normdatei or GND) is an international authority file used and maintained by the German National Library (German: Deutsche Nationalbibliothek or DNB), all German-language library associations, the Zeitschriftendatenbank (ZDB), and many other institutions. WebGND is an online application that supports navigation and search within this large database, which consists of more than 11 million records covering personal names, corporate names, meeting names, geographic names, topical terms, and uniform work titles.
