“AI-based eDiscovery Analytics”, “Augmented Intelligence”, “Advanced Predictive Analytics”, “The Robot-Lawyer”: we all hear the buzz. Artificial intelligence (AI) is steadily revolutionizing the legal profession, and advances in AI-based analytics now drastically increase the speed and improve the quality of the eDiscovery process. The use of AI and analytics in eDiscovery has, however, been good practice for years. In this blog, we will elaborate on new AI-based analytics techniques and how they compare to the analytics we have been using for years.
First some terminology. Artificial Intelligence is the umbrella term indicating the broad, complex field of research. AI includes areas such as reasoning, problem solving, knowledge representation, planning, machine learning, natural language processing, perception, motion, social intelligence, and even creativity. The ultimate goal of AI is the creation of some form of general intelligence.
“Analytics” is the discovery, interpretation, and communication of meaningful patterns in data. The terms “analytics” and “analysis” describe functions ranging from reporting and review metrics to sophisticated search, data and text mining, and machine learning applications.
“AI-based eDiscovery analytic technology” is a combination of all of the above. Analytics detect trends and patterns in data and AI provides the algorithms and evaluation methods.
In the legal profession, AI is helping in-house counsel and attorneys manage data-intensive tasks more productively and efficiently. The benefits are considerable: eDiscovery analytics allow a better understanding of the data, so legal professionals can make better strategic decisions and lawyers can focus on high-value initiatives that will benefit their clients and grow their business.
The use of AI-based tools to search through and organize massive volumes of litigation data reveals insights and information that can be used to predict factors like probable time to resolution, and to plan resources and budget accordingly.
In short, AI-based analytics make it possible to base decisions and (legal) strategies on facts rather than instinct or speculation.
In eDiscovery there are three main types of analytics. Structural, aka syntactic, analytics include techniques such as file, document, and forensic property extraction. Structured eDiscovery analytics are generally based on syntactic approaches that utilize character associations as the foundation for the analysis.
These techniques enable metadata filtering, saved (full-text) searches, email thread detection and reduction, identification of missing emails in a thread, (near-)duplicate detection, language identification, communication analysis, timeline visualizations, geo-mapping, and more.
In eDiscovery you never know in advance how much data you will have, what type of data it will be, or what type of processing is required. Using analytics to structure data allows for a better understanding of the data, the ability to make better strategic decisions, and the means to build and justify eDiscovery budgets, resources, and timelines.
Emails make up a large part of most data collections. Analytics enable users to quickly analyze complex email communications between custodians and reduce data volumes. The use of eDiscovery analytics to analyze email threads allows identification of missing emails. Comparing the gaps among custodians and restoring emails from backups makes it possible to fill the gaps in the collected emails.
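The idea behind thread-gap detection can be sketched in a few lines: every email carries a unique Message-ID, and replies reference the IDs of earlier messages, so any referenced ID that is absent from the collection points to a missing email. This is a minimal illustration; the message data below is hypothetical, and real tools parse these headers from processed PST/MBOX files.

```python
# Sketch: finding missing emails in a thread via the RFC 5322
# "Message-ID" and "References" headers. The sample data is made up;
# a real collection would come from processed mail stores.

def find_missing_messages(messages):
    """Return Message-IDs that are referenced by some email in the
    collection but not present in it -- likely gaps in the thread."""
    present = {m["message_id"] for m in messages}
    referenced = set()
    for m in messages:
        referenced.update(m.get("references", []))
    return referenced - present

collection = [
    {"message_id": "<a1@corp>", "references": []},
    # "<a2@corp>" (a reply to a1) was never collected
    {"message_id": "<a3@corp>", "references": ["<a1@corp>", "<a2@corp>"]},
]

print(find_missing_messages(collection))  # -> {'<a2@corp>'}
```

Running the same check per custodian shows whose mailbox is missing which messages, which is what makes targeted restores from backups possible.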
Conceptual, aka semantic or meaning based analytics, is based on semantic approaches which explore the meaning of the text contained within the data. Computer algorithms bring together similar documents based on the text contained in the documents and their metadata.
Conceptual term searching returns documents that contain concepts similar to the search terms or phrases used. Keyword expansion uses conceptual matches and contextual clues to return keywords that match the user’s search term.
This is a very effective way to expand the set of keywords. Other types of conceptual analytics are clustering (the conceptual identification and grouping of similar documents) and categorization, which uses a set of example documents as the basis for categorizing other conceptually similar documents. These techniques all depend on technology that builds and then applies a conceptual index of the data for analysis.
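The grouping step behind clustering can be illustrated with a toy similarity measure: documents are turned into term vectors, and pairs with a high cosine similarity are treated as conceptually related. This is a deliberately simplified bag-of-words sketch; production conceptual analytics build a true semantic index (for example via latent semantic analysis) rather than comparing raw word counts.

```python
# Sketch: grouping similar documents by cosine similarity over simple
# term-frequency vectors. Illustrative only -- real conceptual engines
# use a semantic index, not raw word overlap.
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "merger agreement between the parties",
    "agreement regarding the proposed merger",
    "lunch menu for the cafeteria",
]
vecs = [vectorize(d) for d in docs]
# Documents 0 and 1 share concepts; document 2 does not.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # -> True
```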
Sentiment and Emotion Mining is a technique that detects and analyzes human emotions towards events, people or other interests. Used on sentiments expressed in (electronically stored) communications like emails and chats, these results form a good starting point for further research, for example in (criminal) investigations (see also a previous blog on Emotion Mining and Topic Extraction on Song Lyrics).
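At its simplest, sentiment scoring counts emotionally charged words in a message. The word lists below are hypothetical stand-ins; production emotion-mining tools use much richer lexicons and machine-learned models, but the sketch shows why flagged messages make a good starting point for further investigation.

```python
# Sketch: a minimal lexicon-based sentiment scorer for messages.
# The word lists are illustrative placeholders, not a real lexicon.
POSITIVE = {"great", "agree", "pleased", "thanks"}
NEGATIVE = {"angry", "threat", "refuse", "destroy"}

def sentiment_score(message):
    """Positive minus negative word count; the sign hints at tone."""
    words = message.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("I am angry and refuse to agree"))  # -> -1
```

Messages with strongly negative scores could then be surfaced for an investigator to read first.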
In “predictive analytics” the system is trained to learn from the data. Machine learning techniques like Technology Assisted Review (TAR), contract clause detection and classification, and privilege detection teach the system what the human is looking for until the system can statistically “predict” how the human would code the rest of the collection.
To use a quote from Forbes magazine from 2016:
‘Very basically, a machine learning algorithm is given a ‘teaching set’ of data, then asked to use that data to answer a question. For example, you might provide a computer a teaching set of photographs, some of which say, ‘this is a cat’ and some of which say, ‘this is not a cat’. Then you could show the computer a series of new photos and it would begin to identify which photos were of cats.’
‘Machine learning then continues to add to its teaching set. Every photo that it identifies – correctly or incorrectly – gets added to the teaching set, and the program effectively gets ‘smarter’ and better at completing its task over time.’
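The “teaching set” idea from the quote can be sketched with a nearest-centroid classifier: each new item is labeled by the class whose examples it most resembles, and the labeled item is then added back to the teaching set. The feature vectors below are made up for illustration; real TAR systems work on text features with far more sophisticated models.

```python
# Sketch of the "teaching set" idea: classify by nearest class centroid,
# then add the labeled item back so the set grows over time.
# Feature vectors are hypothetical; real TAR uses rich text features.
def centroid(examples):
    """Element-wise mean of a list of equal-length vectors."""
    dims = len(examples[0])
    return [sum(e[d] for e in examples) / len(examples) for d in range(dims)]

def classify(item, teaching_set):
    """Label an item by its closest class centroid, then grow the set."""
    best = min(
        teaching_set,
        key=lambda label: sum(
            (a - b) ** 2 for a, b in zip(item, centroid(teaching_set[label]))
        ),
    )
    teaching_set[best].append(item)  # the set "gets smarter" over time
    return best

teaching_set = {
    "relevant": [[0.9, 0.8], [0.8, 0.9]],
    "not relevant": [[0.1, 0.2], [0.2, 0.1]],
}
print(classify([0.85, 0.75], teaching_set))  # -> relevant
```

Just as in the cat-photo example, every classified item (rightly or wrongly) becomes part of the teaching set, which is why human quality control on early predictions matters so much in TAR workflows.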
In eDiscovery, discussions about “artificial intelligence” used to focus on predictive coding—the machine learning process to reduce the time human reviewers need to spend on reading non-relevant information. But now, the initial use of predictive coding or TAR has expanded. Machine Learning techniques are nowadays more commonly used for purposes such as privilege review, preparation for GDPR and other data protection regulations, and discovering the “stories” within collections of documents.
The era of traditional keyword and Boolean search is over. Data volumes are too big. Even the most brilliant query results in too many hits. Reviewing these takes too much time and resources. People do not know exactly what to look for, what keywords to use or how to spell them.
This all means that the quality of traditional search is much lower than knowledge workers think. Most searchers achieve a recall of around 40%. Only highly skilled searchers who master all (advanced) query options are able to get close to 80%. Even then, they cannot be sure that they did in fact find 80% of all relevant documents. That is the other problem with measuring recall: you never know what you miss.
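One common way around the “you never know what you miss” problem is to draw a random sample from the un-reviewed documents and project how many relevant ones were left behind. The sketch below shows the arithmetic; all the numbers in the example are illustrative, not drawn from any real matter.

```python
# Sketch: estimating recall by sampling the un-reviewed ("null") set.
# Numbers are illustrative only.
def estimated_recall(found_relevant, null_set_size, sample_size, sample_relevant):
    """Estimate recall: found / (found + projected relevant left behind)."""
    projected_missed = null_set_size * (sample_relevant / sample_size)
    return found_relevant / (found_relevant + projected_missed)

# 800 relevant docs found; a 200-doc sample of the 10,000 unreviewed
# docs turned up 4 relevant ones -> ~200 projected missed.
print(round(estimated_recall(800, 10_000, 200, 4), 2))  # -> 0.8
```

The estimate is only as good as the sample, which is why defensible workflows document the sample size and report confidence intervals around figures like this.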
AI-based analytics go beyond traditional keyword search and offer countless benefits: