Structured, Semi-Structured, and Unstructured Data in eDiscovery

The sheer volume of data involved in eDiscovery has challenged organizations and their legal professionals for years—and that was before the pandemic created a global tidal wave of online activity and data consumption. As COVID-19 lockdown measures caused people to stay home in unprecedented numbers, we saw massive changes in digital usage behavior. Working from home necessitated videoconferencing services, normalized online meetings, and accelerated the use of online chat platforms for collaboration and ongoing communication, all of which generated new business data. Meanwhile, isolated consumers increasingly turned to video content as their go-to source for information, entertainment, and distraction, resulting in the creation of yet more data. 

On the whole, global data consumption in 2020 grew more than 30 percent compared to 2019, according to a recent PwC report. And video content was a key driver of that global data consumption, accounting for more than three-quarters of the data consumed in 2020. 

Efficiently managing all of an organization’s electronically stored information (ESI) in eDiscovery requires that legal professionals find ways to sift through these masses of data to identify relevant data sources and pinpoint important facts and patterns. For eDiscovery practitioners, the flood of data generally—and video specifically—highlights a key difference in data types that must be understood and appreciated to mine data skillfully: the distinction between structured and unstructured data. 

Let’s look at both types of data—and the hybrid of semi-structured data—and then consider how eDiscovery tools can help legal professionals manage the onslaught of ESI. 

Contents: 

What is structured data?
What is unstructured data?
The middle ground: What is semi-structured data?
Why do these distinctions matter for eDiscovery?
Analyzing unstructured data for eDiscovery
Confidently assess both structured and unstructured eDiscovery data 

What is structured data? 

Structured data is data that has clearly defined internal parameters and relationships. Structured data generally resides within a relational database management system (RDBMS), with fields that contain length-delineated data, such as phone numbers, Social Security numbers, and employee ID numbers. 

Examples of structured data include database records, structured query language (SQL) records, and the data held in airline reservation systems and inventory control systems. Structured data is labeled to describe its attributes, and its relationships to other data (depicted in the cells of a database) are clearly defined. Because of this internal structure and these defined relationships, structured data is very easy to search and manipulate. 
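
To make the "easy to search" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and rows are invented for illustration and do not come from any real system; the point is only that labeled fields let a single query return exactly the records of interest.

```python
import sqlite3

# Hypothetical in-memory table; the schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id TEXT PRIMARY KEY, name TEXT, phone TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("E001", "Ana Ruiz", "555-0100", "Finance"),
        ("E002", "Ben Cho", "555-0101", "Legal"),
        ("E003", "Cara Diaz", "555-0102", "Finance"),
    ],
)


def employees_in(department):
    """Return (employee_id, name) pairs for one department, ordered by ID."""
    rows = conn.execute(
        "SELECT employee_id, name FROM employees "
        "WHERE department = ? ORDER BY employee_id",
        (department,),
    )
    return rows.fetchall()


print(employees_in("Finance"))
```

Because every value sits in a named column, the search needs no interpretation of free text, which is the property that unstructured sources lack.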

Structured data represents about 20 percent of all organizational data. 

What is unstructured data? 

Unstructured data includes all other data sources. It’s the majority of data that individuals generate and interact with, from text files to communications to video content. Unstructured data does have its own internal structure, but there are no clear relationships between individual data points as there would be in a database. 

Unstructured data may be generated by humans—as in text or chat messages, Word documents, photos, and videos—or by machines. Examples of unstructured data include: 

  • Text-based information like Word files, PDFs, presentations, emails, text and chat messages, and website content.
  • Audiovisual information including digital photos, generated image files, audio recordings, and video recordings.
  • Machine-generated information such as data from internet of things (IoT) devices, automatic door logs, or weather data. 

Eighty percent or more of data is unstructured, and the volume of that data is growing rapidly, at a rate of 55 to 65 percent per year. 

The middle ground: What is semi-structured data? 

Much of what we consider unstructured data could be better considered as a “middle ground” category known as semi-structured data. 

The text of an email, for example, is unstructured, but each email also includes structured data in the form of predefined metadata fields such as the sender, recipient, date and time, and subject. Likewise, Word documents have structured elements—their creation date, author, and so on—though the content of those documents is unstructured. Social media is also semi-structured; the content of a post is unstructured text or image data, but the date and time of posting and the number of likes and comments are structured data. 
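
The email case can be sketched with Python's standard email package. The message below is made up for illustration, and split_email is a hypothetical helper name; the sketch simply separates the predefined header fields (the structured part) from the free-text body (the unstructured part).

```python
from email import message_from_string

# A made-up message used only for illustration.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Q3 forecast\n"
    "Date: Mon, 02 Mar 2020 09:15:00 +0000\n"
    "\n"
    "Hi Bob, the numbers look off. Can we talk before the audit?\n"
)


def split_email(raw_text):
    """Separate structured header fields from the unstructured body text."""
    msg = message_from_string(raw_text)
    metadata = {field: msg[field] for field in ("From", "To", "Subject", "Date")}
    body = msg.get_payload()  # plain string for a simple, non-multipart message
    return metadata, body


metadata, body = split_email(raw)
print(metadata["From"], "|", metadata["Subject"])
print(body)
```

The metadata dictionary can be filtered or sorted like a database row, while the body still needs text analysis to interpret.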

Semi-structured data represents about 5 to 10 percent of all data. 

Why do these distinctions matter for eDiscovery? 

Structured data has been a source of discoverable ESI for years, so eDiscovery professionals are proficient in its analysis and management. Additionally, the very nature of structured data makes it easy to search. Tools that can parse and analyze structured data are well-established and widely accepted. 

Unstructured data, on the other hand, presents two key challenges. The first is that it can be very difficult to search or analyze, and attempting to assess it manually would be prohibitively time-consuming. With approximately 30,000 hours of video content uploaded to YouTube every hour as of February 2020, before the pandemic even began in earnest, no one trying to watch each video could ever catch up with the rate of production. Simply identifying which portion of unstructured data might be relevant to an eDiscovery matter, never mind collecting, analyzing, reviewing, and producing that data, is a daunting task. 

That brings us to the second key challenge around unstructured data: it’s growing at an unbelievable rate. Fortunately, the eDiscovery industry is developing technologies to help legal professionals wrangle the rising tide of unstructured data. 

Analyzing unstructured data for eDiscovery 

Innovative software allows eDiscovery practitioners to view, search, code, and analyze unstructured data using advanced techniques fueled by artificial intelligence (AI). Specialized technology makes everything searchable, including formats that are not natively searchable (e.g., TIF, PDF, BMP, and PNG), complex composite formats (e.g., PST and ZIP), and embedded objects (e.g., email attachments). Yet there are still challenges, and this is where preparation and careful planning become critical. 

There are four broad categories of eDiscovery tools that practitioners should use to analyze their data. 

Automation tools 

Techniques such as deNISTing, deduplication, data processing, email threading, and optical character recognition can quickly eliminate surplus, extraneous data and render visual data into searchable text. Automation tools can also help organize data so that the eDiscovery team can focus its efforts on the data that’s most likely to be relevant and responsive. When you are searching for the proverbial needle in the haystack, the ability to shrink the haystack by weeding out unhelpful and irrelevant data is useful indeed. 
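
As a rough illustration of how hash-based culling can shrink the haystack, the sketch below drops exact duplicates and files whose hashes appear on a list of known system files, which is the general idea behind deNISTing. The function names, file names, and the tiny hash list are assumptions made for this example; real tools rely on much larger reference lists and more careful handling.

```python
import hashlib


def sha256_of(content: bytes) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(content).hexdigest()


def cull(documents, known_system_hashes):
    """Keep the first copy of each unique file that is not a known system file.

    documents: dict mapping file name -> file bytes (hypothetical input shape).
    known_system_hashes: set of hex digests to exclude, standing in for a
    reference list of known software files.
    """
    seen = set()
    kept = []
    for name, content in documents.items():
        digest = sha256_of(content)
        if digest in known_system_hashes:
            continue  # deNISTing-style removal of known system files
        if digest in seen:
            continue  # exact duplicate of a file already kept
        seen.add(digest)
        kept.append(name)
    return kept


docs = {
    "memo.txt": b"Quarterly memo",
    "memo_copy.txt": b"Quarterly memo",
    "driver.dll": b"\x00system-binary",
    "notes.txt": b"Meeting notes",
}
known = {sha256_of(b"\x00system-binary")}
print(cull(docs, known))
```

Hashing compares content rather than file names, so a renamed copy is still caught as a duplicate.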

Context tools 

Powerful AI technologies built on natural language processing can detect and group concepts, making for more efficient investigations. These technologies include named entity extraction, basic entity extraction, foreign language extraction, language translation, and dark language detection. Named entity extraction can identify names and events within unstructured data sources and create a relationship map that links disparate facts and circumstances, giving a fuller picture of the facts, issues, and custodians involved in a matter. Dark language detection can help identify suspicious behavior, including the use of code words. These text mining tools go well beyond keyword searches to extract meaning and emotional tone, helping eDiscovery teams make sense of the vast sea of unstructured data. 
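
The relationship-map idea can be sketched in a few lines. This is a deliberately simplified stand-in: real named entity extraction uses trained language models, whereas the code below just matches a hypothetical list of known names and counts how often two names appear in the same document. All names and documents are invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical list of people of interest; a real tool would discover
# entities with a trained model rather than a fixed list.
KNOWN_NAMES = ["Alice Smith", "Bob Jones", "Carol White"]


def entities_in(text):
    """Return the known names that appear in the text, in sorted order."""
    return sorted(name for name in KNOWN_NAMES if name in text)


def relationship_map(documents):
    """Count how many documents mention each pair of names together."""
    pairs = Counter()
    for text in documents:
        for a, b in combinations(entities_in(text), 2):
            pairs[(a, b)] += 1
    return pairs


docs = [
    "Alice Smith emailed Bob Jones about the contract.",
    "Bob Jones met Alice Smith and Carol White on Friday.",
    "Carol White filed the report.",
]
links = relationship_map(docs)
for pair, count in links.most_common():
    print(pair, count)
```

Pairs with high counts suggest which individuals communicate or are discussed together, which can point reviewers toward additional custodians.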

Proactive intelligence tools 

Sophisticated AI tools use continuous active learning approaches to group documents and data into related sets so that human review teams can more easily evaluate relevance and importance. These tools include technology-assisted review (TAR), topic modeling, concept clustering, and document classification. TAR deploys algorithms that “watch” and learn from a human coder, then uses natural language processing to determine which words and phrases result in tags for relevance, privilege, and responsiveness. The technology learns continuously, sorting the remaining corpus of documents so that human reviewers see first those it determines most likely to be relevant. 
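
The learn-then-prioritize loop can be illustrated with a toy scorer. This is not how any particular TAR product works; it is a minimal sketch that assumes a simple smoothed log-odds weight per word, learned from documents a reviewer has already coded, and then sorts unreviewed documents by total weight. The function names and sample documents are invented.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(coded):
    """Learn a weight per word from (text, is_relevant) pairs.

    Each weight is a smoothed log-odds of the word appearing in relevant
    versus non-relevant documents.
    """
    rel, non = Counter(), Counter()
    for text, is_relevant in coded:
        (rel if is_relevant else non).update(set(tokens(text)))
    n_rel = sum(1 for _, is_relevant in coded if is_relevant)
    n_non = len(coded) - n_rel
    return {
        word: math.log((rel[word] + 1) / (n_rel + 2))
        - math.log((non[word] + 1) / (n_non + 2))
        for word in set(rel) | set(non)
    }


def rank(weights, unreviewed):
    """Sort unreviewed documents so the likeliest-relevant come first."""
    def score(text):
        return sum(weights.get(word, 0.0) for word in set(tokens(text)))
    return sorted(unreviewed, key=score, reverse=True)


coded = [
    ("Wire the payment offshore before the audit", True),
    ("Delete the audit records tonight", True),
    ("Lunch menu for Friday", False),
    ("Parking garage closed Friday", False),
]
unreviewed = [
    "Friday lunch is pizza",
    "Move the records before the audit starts",
    "Garage reopens Monday",
]
ranked = rank(train(coded), unreviewed)
print(ranked[0])
```

In a continuous workflow, each newly coded batch would be appended to the coded set and the weights retrained, so the ordering of the remaining documents keeps improving as review proceeds.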

Emerging intelligence and analytical tools 

Advanced analytical tools are the future of eDiscovery and the only practical way to sift through the data volumes of modern organizations. Using techniques like tone detection to assess language including idioms, slang, and abbreviations, these next-generation tools analyze communication patterns, sentiments, and anomalies to reveal unusual relationships and highlight issues. These tools allow eDiscovery professionals to rapidly discern the relationships between individuals and to identify additional custodians or parties of interest. Once the team has designated documents for production, auto-detection and auto-redaction of personal information are essential tactics to streamline production and help organizations maintain compliance with data privacy laws, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). 
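
A minimal sketch of pattern-based auto-redaction follows. It assumes only two kinds of personal information, US Social Security numbers and email addresses, and uses simple regular expressions; the patterns and the replacement label format are illustrative choices, and production tools detect far more categories with more robust methods.

```python
import re

# Illustrative patterns only; real detection covers many more data types.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text):
    """Replace each detected item with a labeled redaction marker."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text


sample = "Contact jane.doe@example.com, SSN 123-45-6789."
print(redact(sample))
```

Labeling each redaction with its category keeps the produced document readable and leaves a record of why the text was withheld.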

Confidently assess both structured and unstructured eDiscovery data 

Unstructured data is a growing challenge, but innovative eDiscovery tools provide a powerful remedy, enabling legal teams to ride the massive wave of structured and unstructured data that their organizations are generating. Teams can conduct reasonable, diligent, and defensible searches using advanced technologies that can organize unstructured data into data types or facets, discard unhelpful or irrelevant data sources, unearth hidden patterns, and bring to light important facts. 

Smarter searching leads to smarter decision-making. And technology can assess data—structured and unstructured—far more quickly and accurately than humans ever could, which means lower spend and reduced risk for the organization. 

ZyLAB ONE is an AI-powered, easy-to-master solution that manages both structured and unstructured data to give you critical answers to pressing questions in the heat of an emerging eDiscovery matter. With built-in advanced analytics tools, including TAR, topic modeling, entity extraction, and much more, ZyLAB ONE will accelerate your results and give you the confidence you need to move forward.