The techniques that underlie eDiscovery as it is practised today did not come out of nowhere: they lean heavily on the science of document classification that emerged in Germany in the late 1820s. The first American school for library science was founded at Columbia University in 1887.
Some of today’s hottest eDiscovery ‘trends’, such as Deep Learning and Artificial Intelligence, have been subjects of academic inquiry since the 1950s – when ‘data’ as we know it today hardly seemed to exist. What revolutionised the possibilities and capabilities of this body of knowledge is the enormous growth in computational power over the past 30 years.
A barrage of terminology
Along with the explosion of technology and data that made modern eDiscovery a necessity came a barrage of terminology, much of it imperfectly understood by (most) lawyers. There is a case to be made for eDiscovery solution suppliers to impose fixed definitions, but that is inherently problematic because (a) the technology is changing very rapidly, and (b) some eDiscovery vendors use the confusion over terms to paper over the cracks in their own technology.
So what are these terms and where are these cracks? When Assisted Review was ruled to be defensible under U.S. law in 2012, the state-of-the-art was the technology now referred to as TAR 1.0.
From TAR 1.0 to 2.0
TAR 1.0 carried out document review faster and generally with more precision than flesh-and-blood lawyers, but it was clunky. If you wanted to add a batch of documents to the data set, you had to start from scratch – no trivial task, because training TAR 1.0 requires a random sample of many hundreds or even thousands of documents to be reviewed by an expensive senior lawyer who must also be a subject-matter expert. This sample is called a Control or Validation Set. TAR 2.0 addressed both issues: the starting point of the training is no longer random, and a Control or Validation Set is no longer indispensable.
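To make the contrast concrete, here is a minimal sketch of the TAR 2.0 idea – continuous active learning – using scikit-learn. The mini-corpus, the judgmental seed, and the "oracle" labels standing in for a human reviewer are all hypothetical; a real review platform works the same way at the scale of millions of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical mini-corpus; a real matter has millions of documents.
docs = [
    "payment schedule for the acme contract",        # responsive
    "invoice attached per the acme contract terms",  # responsive
    "acme contract amendment and payment terms",     # responsive
    "team lunch is moved to friday",                 # not responsive
    "office weather is lovely today",                # not responsive
    "friday lunch menu and weather chat",            # not responsive
    "please review the acme payment invoice",        # responsive
    "parking garage closed on friday",               # not responsive
]
true_labels = [1, 1, 1, 0, 0, 0, 1, 0]  # oracle standing in for the reviewer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# TAR 2.0-style start: a small judgmental seed (one known responsive
# document, one known non-responsive), instead of TAR 1.0's large
# randomly sampled set.
reviewed = {0: 1, 3: 0}

# Continuous active learning: retrain after every review decision and
# always surface the highest-scoring unreviewed document next.
while len(reviewed) < len(docs):
    model = LogisticRegression()
    model.fit(X[list(reviewed)], [reviewed[i] for i in reviewed])
    unreviewed = [i for i in range(len(docs)) if i not in reviewed]
    scores = model.predict_proba(X[unreviewed])[:, 1]
    nxt = unreviewed[int(scores.argmax())]
    reviewed[nxt] = true_labels[nxt]  # the human's review decision

order = list(reviewed)  # the sequence in which documents were reviewed
```

Because the model retrains continuously, adding a new batch of documents simply means scoring them with the current model – no restart, and no upfront random Control Set.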
Lawyers immediately welcomed TAR 2.0 because it removed the need to disclose the initial training set to the opposing party in the run-up to the eDiscovery negotiation process. TAR 2.0 – as well as TAR 3.0, a variant that uses Topic Modelling as the starting condition for Assisted Review – became the industry standard.
Onwards and upwards
It’s great – but not perfect. What worries Legal is that TAR 2.0 does not deal well with data sets that contain a lot of noisy data: documents in languages other than the training language, very short or very long documents, multimedia files, files dominated by numerical data, and extremely unbalanced data sets. The risk is that such files are simply not picked up, or are misclassified as responsive or non-responsive – errors that could sink a case, or leave Legal in a weak position in settlement negotiations.
Fortunately, it is now possible to weed out these tricky files upfront and set them apart for review by other means. This is a significant technological refinement that will probably become known as TAR 5.0. Not that Legal is concerned with the label! Legal wants to look beyond the terminology to see where the gaps are, and where accuracy (and that means defensibility) can be improved.
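The upfront weeding-out described above can be pictured as a simple routing step that runs before TAR training. The sketch below is purely illustrative – the thresholds, the file-extension list, and the crude stopword-based language check are assumptions, not taken from any product – but it shows the principle: each document either goes to the Assisted Review pool or is diverted with a reason attached.

```python
import os

# Illustrative heuristics only; real platforms use proper language
# detection and richer file-type analysis.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "for", "that"}
MEDIA_EXTENSIONS = {".mp3", ".mp4", ".wav", ".png", ".jpg", ".jpeg"}

def route(filename, text, min_tokens=20, max_tokens=50_000,
          max_numeric_ratio=0.5):
    """Return 'tar' if the document is safe for Assisted Review,
    otherwise the reason it needs a separate review workflow."""
    if os.path.splitext(filename)[1].lower() in MEDIA_EXTENSIONS:
        return "multimedia"
    tokens = text.split()
    if len(tokens) < min_tokens:
        return "too_short"
    if len(tokens) > max_tokens:
        return "too_long"
    numeric = sum(t.replace(".", "").replace(",", "").isdigit()
                  for t in tokens)
    if numeric / len(tokens) > max_numeric_ratio:
        return "numeric_heavy"
    # Crude language proxy: text in the training language (here English)
    # almost always contains common function words.
    if not any(t.lower() in ENGLISH_STOPWORDS for t in tokens):
        return "non_training_language"
    return "tar"

# Hypothetical examples of each route:
print(route("call.mp4", ""))                                   # multimedia
print(route("note.txt", "short note"))                         # too_short
print(route("ledger.txt", " ".join(["4,200.50"] * 30)))        # numeric_heavy
print(route("brief.de.txt",
            " ".join(["Dieser Vertrag wurde gestern unterschrieben"] * 5)))
print(route("memo.txt",
            "the contract and the invoice for payment " * 5))  # tar
```

Only the documents routed to "tar" feed the classifier; everything else is reviewed separately, so the noisy files can no longer silently drag down accuracy.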
If you want to cut through the technological jargon and find out what you should be focusing on to get the most out of your Assisted Review, download our white paper, Assisted Review: What’s in a Name?