American Bar Association


Culling Data for Your Case

By George Bellas and Elizabeth Fogerty – November 4, 2010

Not so long ago, most litigators did not know what the term "e-discovery" meant, let alone how or why to cull data. But times have changed. Even the most technophobic attorneys are coming to terms with the fact that e-discovery is a reality of practicing law in the twenty-first century.

The days of sorting through boxes for relevant, responsive documents are fading. More than 93 percent of information now exists as electronically stored information (ESI), and the average business person produces 2.5 GB of ESI annually (most of which is automatically stored). Companies are realizing employees will say just about anything over email, which makes email a treasure trove of discoverable and often embarrassing evidence. All of this ESI must be organized electronically so that it can be reviewed and coded in that format. Astronomical amounts of information are created each year, and almost all of it is at least "potentially relevant" for discovery purposes. The advantage of electronic storage is that, with a few mouse clicks, one can sort through a large number of documents with incredible speed and precision using basic sort and filter technology.

The sheer volume of data causes two main difficulties for practitioners: cost and time. Processing potentially relevant data can be extremely expensive. During the recent economic downturn, this ultimately translated into clients being financially unable to pursue or defend meritorious litigation. Moreover, court-mandated deadlines rarely take into account the enormity of the task of complying with e-discovery obligations. The Federal Rules of Civil Procedure now require the parties to discuss issues relating to the exchange of ESI at the Rule 26(f) conference, and Rule 16(b) scheduling orders may include provisions for the exchange of ESI. As such, issues relating to the exchange of ESI are now among the first battles in any litigation.

Litigators need to understand the root of the ESI problem. The two major contributors to the problem are:

  • the sheer amount of data collected by clients and handed over to counsel; and
  • the amount of time lawyers and staff spend reviewing irrelevant documents.

Even if the cost to process all the data is manageable, the cost of reviewing it generally is not. The only solution is to reduce the amount of irrelevant data that reaches review. While culling the data to decrease costs sounds appealing, potentially fatal sanctions await those who do not choose defensible, court-approved culling methods. Litigators must tread carefully.

Culling Process
Within the discovery cycle, data can be culled at multiple stages. In a traditional model, all collected data is turned over to a service provider, which removes duplicates, culls the data by date- and time-range parameters, and then processes it. Next, all of the remaining data is reviewed by an attorney, which is not cheap. Because of the volume of data, this method is making litigation too costly. We need to find more efficient and economically feasible alternatives.

One option is to reduce how much data is actually collected. While this may seem the most obvious and effective approach, it carries serious risks, including sanctions and dismissal of pleadings. Typically, the client turns the initial culling over to its IT department, which is usually unfamiliar with the serious consequences of omitting potentially relevant data.

Software and Technology
A better option is to use a tool that can sort the data by date range, custodian, and file type before any processing takes place. Software applications are available to assist in this step. Using such software, practitioners can filter the data interactively, much as they apply connector and jurisdiction filters in electronic legal research. This permits practitioners to safely exclude data while producing the metrics that explain to the court why the exclusions were made. It also saves time and money by excluding data before any processing costs are incurred.
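To make the idea of pre-processing filters with supporting metrics concrete, here is a minimal sketch in Python. It is not how any particular commercial tool works; the `Item` record layout, the `cull` function, the reason labels, and the sample custodians and dates are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Item:
    """One collected file or message (hypothetical record layout)."""
    custodian: str
    created: date
    extension: str


def cull(items, custodians, start, end, extensions):
    """Return (kept items, Counter mapping exclusion reason -> count)."""
    kept = []
    excluded = Counter()
    for item in items:
        # Record only the first rule that excludes an item, so the
        # counts add up to the total number of excluded items.
        if item.custodian not in custodians:
            excluded["custodian not in scope"] += 1
        elif not (start <= item.created <= end):
            excluded["outside date range"] += 1
        elif item.extension.lower() not in extensions:
            excluded["file type not in scope"] += 1
        else:
            kept.append(item)
    return kept, excluded


# Hypothetical sample collection.
items = [
    Item("smith", date(2009, 3, 1), "msg"),
    Item("smith", date(2005, 1, 15), "msg"),
    Item("jones", date(2009, 6, 9), "EXE"),
    Item("brown", date(2009, 7, 2), "doc"),
]
kept, excluded = cull(
    items,
    custodians={"smith", "jones"},
    start=date(2008, 1, 1),
    end=date(2009, 12, 31),
    extensions={"msg", "doc", "pdf"},
)
print(f"kept {len(kept)} of {len(items)} items")
for reason, count in excluded.items():
    print(f"excluded {count}: {reason}")
```

The per-reason counts are the kind of metric the paragraph above refers to: they record not only how much was excluded but which agreed-upon parameter excluded it.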

The most effective stage for culling data is the initial search stage. Traditional tools process all the data and then reduce the data set by 40–50 percent by applying custodian filters, date ranges, and de-duplication technology. Unfortunately, that still leaves a great deal of data to review, and the review remains cost- and time-prohibitive. New technology is available to safely cull data at rates of 80–90 percent. Clearwell, for example, will cull data by the traditional methods but also by a number of other criteria, including domain, keywords, discussion threads, and sender/recipient name. Reducing your data sets to 10–20 percent of the original collection means that less time and fewer financial resources are required for reviewing potentially relevant data. This allows smaller firms to handle these matters for their clients, permits a more focused review, and is more economical for all clients, regardless of firm size.

Intelligent de-duplication is applied across the entire data set rather than simply at the individual custodian level. This means that an email that was sent to every employee in a 2,500-employee company can be reviewed once for relevance rather than 2,500 times. Domain-searching technology removes irrelevant data by applying a list of domains that will not contain relevant data but that are likely to greatly increase data volumes (e.g., emails from retail stores, newsletters, and newspaper and magazine subscriptions delivered electronically). Discussion-thread technology links together all related messages in a discussion, including replies, carbon copies, blind carbon copies, and forwards. This cuts review time dramatically because reviewers can quickly and accurately identify everyone involved and determine who knew what and when. It also prevents the errors that occur when more than one reviewer is involved in coding the same discussion.
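The three techniques described above can be sketched together in a few lines of Python. This is an illustrative simplification under stated assumptions, not a description of any vendor's implementation: the message fields, the `thread_id` value (real tools typically derive threads from email headers), the excluded-domain list, and the choice of SHA-256 over sender, subject, and body as the duplicate test are all hypothetical.

```python
import hashlib
from collections import defaultdict

# Hypothetical list of domains agreed to be irrelevant.
EXCLUDED_DOMAINS = {"newsletter.example.com", "store.example.com"}


def fingerprint(msg):
    """Hash the fields that make two copies of an email 'the same'."""
    text = "\n".join([msg["sender"].lower(), msg["subject"], msg["body"]])
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sender_domain(msg):
    return msg["sender"].rsplit("@", 1)[-1].lower()


def cull_and_thread(messages):
    """Drop excluded domains and duplicates, then group by thread."""
    seen = set()  # fingerprints seen across ALL custodians
    threads = defaultdict(list)
    stats = {"domain": 0, "duplicate": 0}
    for msg in messages:
        if sender_domain(msg) in EXCLUDED_DOMAINS:
            stats["domain"] += 1
            continue
        digest = fingerprint(msg)
        if digest in seen:
            stats["duplicate"] += 1
            continue
        seen.add(digest)
        threads[msg["thread_id"]].append(msg)
    return dict(threads), stats


memo = {"sender": "ceo@corp.example", "subject": "Q3 plan",
        "body": "See attached.", "thread_id": "t1"}
messages = [
    {**memo, "custodian": "a"},  # the same memo, collected from
    {**memo, "custodian": "b"},  # three different mailboxes
    {**memo, "custodian": "c"},
    {"custodian": "a", "sender": "cfo@corp.example", "subject": "RE: Q3 plan",
     "body": "Numbers look off.", "thread_id": "t1"},
    {"custodian": "b", "sender": "deals@store.example.com", "subject": "Sale",
     "body": "50% off", "thread_id": "t2"},
]
threads, stats = cull_and_thread(messages)
print(stats)
print({tid: len(msgs) for tid, msgs in threads.items()})
```

Because the set of fingerprints is shared across custodians, the memo collected from three mailboxes survives only once, which is the difference between global and per-custodian de-duplication.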

Near-duplicate detection software is also available. This technology can be applied to scanned documents during the processing stage, and the similarity threshold can be set to whatever percentage you are comfortable with. For example, you can set the threshold at 90 percent so that all documents that are at least 90 percent similar to another document are grouped together and can be easily coded. These tools also save time in the review stage and have the added advantage of preventing errors.

When determining which software to use, it is important to ensure that the tool offers both "opt-in" and "opt-out" techniques. Opt-out technology ensures the exclusion of irrelevant documents. Examples are date-range filters, file-type filters, and privileged and domain culling. Opt-in technology sorts through data to ensure that only relevant data is included for review. Examples of opt-in technology include keyword, custodian, and wild-card and stem culling.
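The 90 percent near-duplicate threshold mentioned above can be illustrated with a short Python sketch. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; commercial tools use their own methods, and scanned documents would first need to be converted to text. The document names, sample text, and the greedy grouping strategy are assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between two texts, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def group_near_duplicates(docs, threshold=0.90):
    """Greedy grouping: a document joins the first group whose first
    member it resembles at or above the threshold; otherwise it
    starts a new group of its own."""
    groups = []
    for name, text in docs.items():
        for group in groups:
            if similarity(docs[group[0]], text) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical documents: two drafts that differ only in the price,
# plus one unrelated document.
docs = {
    "draft_v1": "The parties agree that the purchase price shall be "
                "$500,000, payable at closing.",
    "draft_v2": "The parties agree that the purchase price shall be "
                "$525,000, payable at closing.",
    "menu": "Lunch menu for Friday: soup, salad, and sandwiches.",
}
groups = group_near_duplicates(docs)
print(groups)
```

A reviewer can then code the two drafts together while the unrelated document stays in its own group. Raising or lowering `threshold` changes how aggressively documents are grouped, which is the setting the text describes as customizable.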

This new technology offers clear metrics that can be presented in court to track what was culled and why, offering a cost-effective, defensible approach for all practitioners. It shows the court that reasonable and best efforts were used to implement an appropriate search and collection methodology.

George Bellas is a partner in the firm of Bellas & Wachowski located in suburban Chicago. Elizabeth Fogerty is an e-discovery consultant at C2Legal in Chicago.

Copyright © 2016, American Bar Association. All rights reserved. This information or any portion thereof may not be copied or disseminated in any form or by any means or downloaded or stored in an electronic database or retrieval system without the express written consent of the American Bar Association. The views expressed in this article are those of the author(s) and do not necessarily reflect the positions or policies of the American Bar Association, the Section of Litigation, this committee, or the employer(s) of the author(s).
