This is the last in a series of four blogs about Artificial Intelligence techniques such as text-mining, clustering, topic modeling and machine learning. These techniques take a more intelligent approach to finding what we are looking for, even if we do not know exactly what it is or how to formulate it as a query.
See the previous blogs for more insights on “Text-Mining for Information Extraction and Data Visualization” and “Topic Modeling and Clustering to find high-level concepts”.
Know what you miss - machine learning
The more unknown unknowns (relevant documents that you do not find), the higher the risk that you miss critical information. For this reason, reviewers often define broad Boolean keyword queries to pick up a wide range of potentially relevant documents for an eDiscovery. Unfortunately, such queries also pick up a lot of noise. Reviewing all these non-relevant documents leads to unnecessarily high review costs, extra time, and large review and management teams.
Investigators have the same problem with Boolean queries. They result either in a feast (too many results) or in a famine (very few results), and rarely return exactly the right results. Highly experienced analysts who have mastered all the query options might be able to reach recall levels of 70-80%, but most investigators lack that expertise. As a result, they often find only part of the answer.
In both cases, the reviewer, analyst or investigator does not know exactly how much they actually found and what is still missing.
We can tackle this problem by using machine learning to teach a computer system to understand the interest of the user. There are four ways to start this interactive process:
- Start with a random (validation) set that is reviewed for the topics of interest.
- Start with a full-text query to find a number of clearly relevant example documents.
- Run topic modeling and select the most relevant clusters.
- Use a combination of the three approaches above.
Once a number of relevant documents for a specific interest or issue have been identified, we can use machine learning to build an automatic classifier. With this classifier, we can then find the documents that match best. By reviewing these documents, the user refines the classifier's picture of their interest. After a few iterations, and without the need to define any complex queries, users have taught the computer to find almost exactly what they are looking for.
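To make the review loop concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the classifier any particular platform uses: it swaps in a very simple centroid-based (Rocchio-style) scorer over word counts, and the documents, IDs and function names are all hypothetical.

```python
from collections import Counter
import math

def vectorize(text):
    """Turn a document into a bag-of-words term-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Sum a group of vectors; cosine ignores the overall scale."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def rank_unreviewed(docs, labels):
    """Rank unreviewed documents, most likely relevant first.

    Score = similarity to the reviewed relevant documents
            minus similarity to the reviewed non-relevant ones.
    """
    vecs = {doc_id: vectorize(text) for doc_id, text in docs.items()}
    pos = centroid(vecs[d] for d, relevant in labels.items() if relevant)
    neg = centroid(vecs[d] for d, relevant in labels.items() if not relevant)
    scores = {d: cosine(v, pos) - cosine(v, neg)
              for d, v in vecs.items() if d not in labels}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy collection.
docs = {
    "d1": "merger price agreement with competitor",
    "d2": "lunch menu for friday",
    "d3": "confidential price agreement draft",
    "d4": "friday office party invitation",
    "d5": "competitor pricing call notes agreement",
}

# Iteration 1: the reviewer has judged two seed documents.
labels = {"d1": True, "d2": False}
first_ranking = rank_unreviewed(docs, labels)

# Iteration 2: the reviewer confirms the top suggestion, and we re-rank.
labels[first_ranking[0]] = True
second_ranking = rank_unreviewed(docs, labels)

print(first_ranking)
print(second_ranking)
```

Each pass through the loop adds reviewer judgments to `labels`, so the ranking of the remaining documents reflects everything the reviewer has decided so far. A production system would use a stronger classifier and better text features, but the shape of the interaction is the same: rank, review the top suggestions, retrain, repeat.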
During this process, several quantitative measures can be calculated, such as precision, recall, F-values and the precision of the return set. Based on these measurements, one can describe how much of the relevant information has been found at each point in the process.
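As a rough sketch of how these measures are computed, the Python below uses the standard definitions of precision, recall and the F1 value (the balanced F-value). The function name and document IDs are hypothetical. In a real review the complete set of relevant documents is not known, so these figures are typically estimated on a reviewed sample, such as the random validation set mentioned earlier.

```python
def review_metrics(retrieved, relevant):
    """Return (precision, recall, f1) for a set of retrieved document IDs.

    retrieved: IDs the classifier returned as relevant
    relevant:  IDs a reviewer judged relevant (for example, in a validation set)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: 4 documents returned, 6 truly relevant, 3 in common.
retrieved = ["d1", "d3", "d5", "d9"]
relevant = ["d1", "d3", "d5", "d6", "d7", "d8"]
precision, recall, f1 = review_metrics(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.75 recall=0.50 f1=0.60
```

Here three of the four returned documents are relevant (precision 0.75), but only three of the six relevant documents were found (recall 0.50). Tracking recall across iterations is what shows how much is still missing.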
To read more on secure, defensible and transparent use of Machine Learning and other techniques, please read the technical commentary on the responsible use of Artificial Intelligence techniques in our Trust Center.