American Bar Association

Commercial & Business Litigation

Predictive Coding Goes Mainstream

By Paula M. Bagger – February 27, 2013

For several years, legal technology writers (and e-discovery vendors) have been touting “predictive coding” as a solution to the problem of skyrocketing e-discovery costs. If you do not read legal tech journals, you may have missed the earlier discussion, but chances are you heard about it in 2012: the use of computer technology to leverage human review of a sample set of electronically stored information (ESI) to a larger data set, which software analyzes but humans do not review. If you told yourself that you need to find out what “predictive coding” is all about, you were right. In contrast with 2011, when “trade” discussion abounded but no judicial decision addressed predictive coding, five reported cases in 2012 qualify as a flurry of activity. Predictive coding “firsts” arrived one after another last year:

February: United States Magistrate Judge Andrew Peck resolved disputes between parties that had agreed to use predictive coding but disagreed about implementation. Judge Peck declared, apparently for the first time in a written decision, that “[c]omputer-assisted review is an acceptable way to search for relevant ESI in appropriate cases.” Moore v. Publicis Groupe, No. 11-Civ.-1279 (AJP), 2012 WL 607412, at *1 (S.D.N.Y. Feb. 24, 2012), accepted by Moore v. Publicis Groupe, No. 11-Civ.-1279 (ALC), 2012 WL 1446534 (S.D.N.Y. Apr. 26, 2012).

March: Concluding that producing parties are best situated to evaluate procedures for production of their ESI, United States Magistrate Judge Nan Nolan declined to order predictive coding requested by the receiving party over the producing party’s objection, instead urging the parties to agree on a cooperative approach to discovery of ESI. Kleen Prods. LLC v. Packaging Corp. of Am., No. 10 C 5711, 2012 WL 4498465, at *5 (N.D. Ill. Sept. 28, 2012) (reporting order of March 28, 2012).

April: When a producing party asked to use predictive coding to review and produce ESI, Virginia Circuit Judge James Chamblin ordered that it might do so, in the first published instance of a court approving the use of predictive coding over the receiving parties’ objection. Global Aerospace Inc. v. Landow Aviation, L.P., No. CL 61040, 2012 WL 1431215 (Va. Cir. Ct. Apr. 23, 2012).

July: United States District Judge Rebecca Doherty issued a case management order memorializing a set of procedures for the production of ESI in a multidistrict pharmaceutical case, including a detailed “search methodology proof of concept” for the use of predictive coding and provision for multiple “meet and confer” opportunities throughout the process and upon its conclusion. Her order provided a first public glimpse at a completely consensual agreement to use predictive coding in a complex, multidistrict case. In re Actos Prods. Liab. Litig., No. 6:11-md-2299, 2012 WL 6061793 (W.D. La. July 27, 2012).

October: Vice Chancellor Laster of the Delaware Chancery Court raised the issue of predictive coding sua sponte, ordering the parties to a contractual indemnification dispute to show cause why they should not use it for review and production of their ESI. Transcript of Oral Argument at 66–67, EORHB, Inc. v. HOA Holdings, LLC, No. 7409-VCL (Del. Ch. Ct. Oct. 15, 2012).

Clearly, business litigators now need to learn enough about predictive coding to assess its utility in new cases. We need to understand what types of cases generally are good candidates, which tools or procedures might provide a benefit in a particular case, and how best to advocate in favor of predictive coding to our clients, opposing counsel, and the court. Even those who do not take the initiative will need to spot issues and respond intelligently when opposing counsel suggests it, amid the increasing likelihood that the court will raise the issue itself or even—as in EORHB—simply order its use. This article is intended as a high-level introduction for “the rest of us,” to whom computer-assisted document review is a new and perhaps mysterious development. It provides a basic explanation of the technology, reviews some of the vocabulary, and addresses some commonly observed benefits and drawbacks.

What Is Predictive Coding?
We all are familiar with the problem: The huge volume of ESI generated by our business clients defies our efforts to identify discoverable information cost-effectively. Predictive coding, offered as a solution, is the use of software that detects patterns in and infers rules from coding decisions a knowledgeable reviewing attorney makes about a sample set of documents and then applies what it has learned to “predict” how the human reviewer would have coded the larger document set. Use of the tool can dramatically decrease the number of documents subject to “eyes-on” review, with the expected cost savings. In Global Aerospace, the producing party argued that predictive coding would reduce the time needed to review over 2 million documents from 10 man-years to 2 man-weeks, at 1 percent of the cost of manual review. Moreover, some published studies (including those presented to Magistrate Judge Peck in Moore and Circuit Judge Chamblin in Global Aerospace) maintain that predictive coding also yields better results than manual review of ESI. (Other studies challenge that assertion.)

This supervised learning by a computer, or “predictive analytics,” is all around us. Think of spam filters, which decide which email is junk, based in part on what we have previously placed in our junk mail folders, or services like Netflix, which suggests movies for us based on reviews we have given other movies we have watched. Predictive coding, similarly, is no more than a tool to sort, organize, and select data. It has many uses in the field of civil discovery, including early case assessment, prioritization of review efforts, and post-production quality assurance. Using it to respond to document requests—sometimes in the absence of an attorney’s review—has generated controversy and, in 2012, resulted in some initial feedback from the bench.

First, a Little Terminology
Discussions of predictive coding tend to attract jargon. Understanding some of it will give you a decided advantage when it is your turn to join the conversation. Many of the concepts and activities actually are familiar to litigators experienced in manual document review, even if we have never put these labels on them.

“Proportionality” refers to the cost-benefit analysis that is applicable to all discovery but is frequently invoked in the context of e-discovery. Fed. R. Civ. P. 26(b)(2) permits the court to limit “the frequency or extent of discovery” where “the burden or expense of the proposed discovery outweighs its likely benefit,” taking into account the size of the case, the parties’ resources, the importance of the issues at stake, and the nexus between those issues and the discovery sought. The problem posed by ESI is largely a problem of proportionality, and the solution offered by predictive coding is viewed in this light.

“Defensibility” is the degree to which a proposed document review system complies with the producing party’s obligations under applicable rules. “Reasonableness,” not “perfection,” is what is required of all document production methodologies, manual or automated, in terms of the disclosure of relevant documents and the safeguarding of privileged ones. Manual review of documents does not always meet this standard; neither will some technology-assisted processes.

“Transparency” refers to the degree to which the receiving party and the court understand and participate in a document production methodology. With predictive coding, issues of transparency arise at two levels: (1) the degree to which the parties and the court are allowed to understand how the computer learns and applies its learning to new data, which is often low with algorithms that are proprietary to the e-discovery vendors; and (2) the degree to which the receiving party is allowed to review (or even participate in) the producing party’s production efforts at each step of this process. This second type of transparency, widely considered necessary to an effective predictive coding protocol, is discussed more fully below.

“Precision” is the ratio of responsive documents to the total number of documents selected by a particular selection procedure. Precision is “high” when a greater number of the documents selected as relevant are in fact relevant; that is, false positives are avoided. A high degree of “precision” is a fundamental goal of any document selection procedure, whether manual or automated.

“Recall,” a related term, means the ratio of responsive documents identified by a particular selection method to the number of responsive documents in the entire data set. Recall is “high” when fewer responsive documents are missed. Like high precision, high recall is a desired attribute of any document selection procedure, whether manual or automated. The ease with which these ratios can be calculated and validated when document review is automated increases their visibility and importance.
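Because precision and recall are simple ratios, they can be computed directly from a review's counts. The following sketch (plain Python, with hypothetical numbers, not drawn from any e-discovery tool) shows the arithmetic:

```python
def precision_recall(selected_responsive, selected_total, responsive_total):
    """Precision: share of selected documents that are responsive.
    Recall: share of all responsive documents that were selected."""
    precision = selected_responsive / selected_total
    recall = selected_responsive / responsive_total
    return precision, recall

# Hypothetical review: 10,000 documents selected, 8,000 of them responsive,
# out of 9,000 responsive documents in the entire collection.
p, r = precision_recall(8000, 10000, 9000)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.80, recall=0.89
```

Note the trade-off the two numbers capture: selecting every document in the set guarantees perfect recall but poor precision, while selecting only a handful of obviously responsive documents does the reverse.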

“Linear review” means document review as lawyers historically performed it: starting at the beginning of a collection (a box of paper or a custodian’s email) and reviewing through to the end. “Nonlinear” review refers to any number of interventions, typically automated, that organize or prioritize data prior to review to increase speed and accuracy.

“Keyword searches” ask a computer to look for words or phrases anywhere in a data set. “Boolean searches” refine keyword searching by allowing combinations of words or phrases with connectors such as AND, OR, and NOT. Internet search engines have made us all familiar with keyword searching; attorneys using Westlaw and Lexis understand rudimentary Boolean searching.
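A Boolean search reduces to combining keyword tests with logical connectors. This toy Python example (the documents and terms are invented; real search tools also handle stemming, phrases, and proximity) selects documents matching ("merger" OR "acquisition") AND NOT "newsletter":

```python
documents = {
    1: "board minutes discussing the proposed merger",
    2: "quarterly newsletter mentioning the acquisition",
    3: "email re acquisition financing terms",
    4: "cafeteria menu for the week",
}

def matches(text):
    """(merger OR acquisition) AND NOT newsletter, on whole words."""
    words = text.lower().split()
    return (("merger" in words or "acquisition" in words)
            and "newsletter" not in words)

hits = [doc_id for doc_id, text in documents.items() if matches(text)]
print(hits)  # [1, 3]
```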

How Does Predictive Coding Work?
Once the universe of documents to which predictive coding will be applied is identified, a subset of “sample” documents, or a “seed” set, is developed, typically in conjunction with the provider of the predictive coding tool. Because the seed set will be coded for relevance, confidentiality, privilege, or whatever metrics the software will be measuring in the larger data set, it must be adequately representative of the larger data set, including responsive and nonresponsive documents alike. Statistical tools are employed to provide assurance that the seed set will have the appropriate attributes; this is one of the bases on which the defensibility of the production may rise or fall. If the seed sample is too small—thus lacking enough responsive documents to “teach” the software effectively—or insufficiently representative of the entire data set, the results of the predictive coding may lack either the requisite precision (failing to cull nonresponsive documents) or sufficient recall (failing to identify responsive documents).
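At its simplest, drawing a seed set is a sampling exercise. The sketch below (plain Python; real protocols use stratified sampling and statistical validation of sample size, which this deliberately omits) illustrates the basic idea of pulling a random sample from a large collection:

```python
import random

def draw_seed_set(doc_ids, sample_size, seed=42):
    """Draw a simple random sample of document IDs to serve as the seed set.
    A fixed seed makes the draw reproducible for audit purposes."""
    rng = random.Random(seed)
    return sorted(rng.sample(doc_ids, sample_size))

collection = list(range(1, 2_000_001))   # e.g., a 2-million-document set
seed_set = draw_seed_set(collection, 2000)
print(len(seed_set))  # 2000
```

Whether 2,000 documents is enough depends on the richness of the collection (how many responsive documents it contains) and the confidence level the parties agree to, which is exactly the statistical question the article notes defensibility may turn on.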

Once the seed set is selected, it is coded by a human reviewer who is typically a senior attorney with a deep understanding of the case. This is not a job for the inexperienced or uninformed: “Garbage in, garbage out” applies here as in so many technical processes, and incorrect coding of the sample will result in faulty predictions about the larger data set.

The computer then analyzes the human reviewer’s coding of the seed set and applies the results of its analysis to another sampling of the data set, classifying or ranking those documents based on what it learned from the human reviewer—in essence, predicting how the human reviewer would have coded this second set. The human reviewer then manually audits the computer’s predictions to assess their accuracy; the computer reviews the results of the audit and adjusts its algorithm as needed; and another set of documents is reviewed by the computer and audited by the human reviewer. The process is repeated as often as needed until the computer’s decisions predict the human reviewer’s with a sufficient degree of accuracy.
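The train-audit-retrain loop just described can be sketched in Python. The "classifier" here is a stand-in (a naive word-overlap scorer, not any vendor's proprietary algorithm); the point is the loop structure, in which each human audit is fed back into the next round of training until the computer's predictions agree with the reviewer at a target rate:

```python
def train(coded):
    """Learn which words mark a document responsive, from human coding.
    Stand-in for a real machine-learning model."""
    responsive_words, other_words = set(), set()
    for text, is_responsive in coded:
        (responsive_words if is_responsive else other_words).update(text.split())
    return responsive_words - other_words

def predict(model, text):
    return any(word in model for word in text.split())

def training_loop(batches, human_code, target_accuracy=0.95):
    """Iterate: train on coded docs, predict a new batch, audit, retrain."""
    coded = []
    for batch in batches:
        model = train(coded)
        predictions = [predict(model, doc) for doc in batch]
        audited = [human_code(doc) for doc in batch]      # human audit step
        agreement = (sum(p == a for p, a in zip(predictions, audited))
                     / len(batch))
        coded += list(zip(batch, audited))                # feed audit back in
        if agreement >= target_accuracy:
            break
    return train(coded)

# Tiny hypothetical: the reviewer codes anything mentioning "merger" responsive.
docs = [["merger terms attached", "lunch order"],
        ["merger update", "pto request"]]
model = training_loop(docs, lambda d: "merger" in d, target_accuracy=1.0)
```

In a real deployment the model, the audit sampling, and the stopping criterion would all be far more sophisticated, but the control flow (predict, audit, adjust, repeat) is the same.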

When the iterative process of “training” the computer is complete and the predictive model validated, the actual document review and production can unfold in many different ways. The extent to which a producing party relies on predictive coding alone will depend on the types of documents at issue, agreements reached in negotiating an ESI protocol, and a party’s overall comfort level with the tool.

In some cases, predictive coding will not have a direct impact on review and production, which will be done manually or through keyword searching, but will be used internally, such as to prioritize documents for review (so that more responsive data sets are reviewed first) or to perform quality assurance on human reviewers or keyword searching. Predictive coding can also be used to confirm assumptions about which data sets are unlikely to include relevant documents, for instance, in preparation for a meet-and-confer session.

Even when predictive coding is used for document review and production, it can still be used in many different ways, with varying degrees of reliance on the computer’s assessment of the responsiveness of documents. Predictive coding can be used to cull a data set, so that documents with a computer-generated predictive score below a particular level are deemed nonresponsive and rejected with no human review, while the remaining data set, which has a higher computer-generated score and is now more manageable in size, is manually reviewed. The computer score awarded each document can be used to order the data set so that documents above a certain cutoff are deemed responsive, those below a certain score deemed nonresponsive, and only those in the middle reviewed by hand. Who decides how the results of the computer’s coding will be used?
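The three-band approach just described (auto-produce above one score, auto-reject below another, review the middle by hand) reduces to a simple partition of scored documents. The cutoff values in this sketch are purely illustrative; in practice they are negotiated or validated statistically:

```python
def partition_by_score(scored_docs,
                       responsive_cutoff=0.8,
                       nonresponsive_cutoff=0.2):
    """Split (doc_id, score) pairs into three bands: deemed responsive,
    queued for manual review, and deemed nonresponsive."""
    responsive, review, nonresponsive = [], [], []
    for doc_id, score in scored_docs:
        if score >= responsive_cutoff:
            responsive.append(doc_id)
        elif score <= nonresponsive_cutoff:
            nonresponsive.append(doc_id)
        else:
            review.append(doc_id)
    return responsive, review, nonresponsive

scored = [(1, 0.95), (2, 0.55), (3, 0.10), (4, 0.82)]
resp, mid, nonresp = partition_by_score(scored)
print(resp, mid, nonresp)  # [1, 4] [2] [3]
```

Where the cutoffs are set determines how much human review remains, which is why, as the next paragraph discusses, who decides these parameters matters so much.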

Kleen Products (in which the court would not order predictive coding) and Global Aerospace (in which it did) both stand for the proposition that the producing party is in the best position to decide how to produce its own documents. In Kleen Products, Magistrate Judge Nolan expressly relied on one of the fundamental discovery principles advanced by the Sedona Conference: that “[r]esponding parties are best situated to evaluate the procedures, methodologies, and technologies appropriate for preserving and producing their own electronically stored information.” 2012 WL 4498465, at *5 (citing “The Sedona Conference Best Practices Commentary on the Use of Search and Information Retrieval Methods in E–Discovery,” 8 Sedona Conf. J. 189, 193 (Fall 2007)). The producing party’s assessment should be entitled to substantial deference, so long as it is defensible.

If any categories of documents will be produced without human review, the producing party will need to negotiate effective anti-waiver and clawback provisions as part of the ESI protocol, to allow for recovery of inadvertently produced documents. The Actos protocol, contained in the form of two protective orders, addressed confidential and trade secret documents, assertions of the attorney-client privilege and work-product protection, and the inadvertent disclosure of either. Case Management Order, In re Actos, No. 6:11-md-2299 (July 27, 2012). Federal Rule of Evidence 502(e) sanctions clawback agreements, and Rule 502(d) authorizes the court to “order that the privilege or protection is not waived by disclosure connected with the litigation pending before the court—in which event the disclosure is also not a waiver in any other federal or state proceeding.” A producing party should take advantage of these provisions where applicable.

What Are “Appropriate Cases” for Predictive Coding?
While noting that “computer-assisted review is an acceptable way to search for relevant ESI in appropriate cases,” Magistrate Judge Peck also stressed that this “does not mean that computer-assisted review must be used in all cases, or that the exact ESI protocol approved here will be appropriate in all future cases that utilize computer-assisted review.” Moore, 2012 WL 607412, at *1. Predictive coding is a tool, nothing more, and it is important for counsel to understand when and how to apply it beneficially.

Reducing attorney time and thus cost is a principal benefit of predictive coding. In some applications of the process, only a fraction of the potentially responsive documents will be subject to manual review. Proportionality and defensibility are the twin considerations when deciding whether predictive coding is appropriate in a particular case. Does the use of predictive coding bring the cost of producing ESI in line with the value of that discovery to the case? And does it do so without jettisoning the producing party’s obligation to provide a “reasonable” response to discovery requests, that is, one that is reliably thorough and accurate? If the answer is yes, the case may be an appropriate one for predictive coding. Magistrate Judge Peck observed that:

computer-assisted review is an available tool and should be seriously considered for use in large-data-volume cases where it may save the producing party (or both parties) significant amounts of legal fees in document review. . . . As with keywords or any other technological solution to ediscovery, counsel must design an appropriate process, including use of available technology, with appropriate quality control testing, to review and produce relevant ESI while adhering to Rule 1 and Rule 26(b)(2)(C) proportionality.

2012 WL 607412, at *12.

The initial setup for computer-assisted review is typically more expensive than developing and running a set of keyword search terms on an electronic data set. Therefore, a case needs to be of a certain size—in terms of ESI to be reviewed—before predictive coding will become cost-effective. Moreover, the data set also needs to be large enough for an appropriate search protocol to be developed and validated. Too few documents mean that a sophisticated statistical approach will not be feasible.

As its name suggests, predictive coding allows a computer to be trained to predict which documents in a data set would be considered responsive by a human reviewer. There will always be some margin of error unless computer-assisted review is supplemented by eyes-on review. Predictive coding without follow-on human review may therefore be inappropriate for cases (or data sets) in which highly sensitive information could be scattered throughout: such material demands eyes-on review before production, and manual review layered on top of computer-assisted review can erase any cost savings or efficiencies predictive coding offers.

Similarly, a poorly designed or implemented program of computer-assisted review could wind up being more expensive and inefficient than none at all—particularly if the receiving party demands that the production be done over again with a more traditional production methodology—thereby eliminating any cost savings. That risk is especially high now, when predictive coding is in its infancy and best practices have not been codified. For example, in Global Aerospace, the court expressly provided that the receiving party was free to “rais[e] with the Court an issue as to completeness of the contents of the production or the ongoing use of predictive coding.” 2012 WL 1431215, at *1. An appropriate case, then, is one in which the producing party is able to implement and monitor active project management from day one. Moreover, because costs are incurred up front, it will rarely be efficient to adopt predictive coding after the costs of manual review or other automated processes have been incurred. In Kleen Products, the defendants had already invested much time and effort in keyword searching when the receiving party moved to compel the use of predictive coding, arguing that the keyword searching provided insufficient recall. After evidentiary hearings, the court declined to order predictive coding, urging the parties to find a compromise. 2012 WL 4498465, at *5.

The type of documents maintained by the producing party may influence how successful predictive coding will be. A case with a great many custodians, especially if they hold very different types of documents, may make the creation of seed sets and validation of results more complicated and expensive. It could be necessary to develop and code different seed sets for different custodians or different types of documents. Consistent document types make it easier to train the computer to predict with great precision which documents are responsive. Cases in which a great many documents are not text-based—including, for instance, audio files or image files—would also not be optimal candidates for predictive coding given current technology.

A producing party may be more likely to convince a receiving party to accept predictive coding if the receiving party is familiar with the types of ESI the producing party maintains and the types of internal jargon that employees of the producing party use. For instance, a former employee would be in a better position to assess whether it is receiving the documents it expects than a competitor with little information about the opposing party’s internal workings.

Finally, as further discussed below, most of the cases in which we have seen predictive coding protocols thus far are cases in which all parties are open to the use of computer-assisted review and are able to communicate well and work cooperatively. Cases in which a high degree of cooperation among counsel can be expected are the more likely candidates for the use of predictive coding.

Transparency and Cooperation Support the Use of Predictive Coding
The judicial decisions on predictive coding stress transparency and cooperation among counsel. Judge Peck stated explicitly in Moore that the transparency of the process proposed by the producing party was a major factor in his decision to allow predictive coding:

MSL’s transparency in its proposed ESI search protocol made it easier for the Court to approve the use of predictive coding. . . . While not all experienced ESI counsel believe it necessary to be as transparent as MSL was willing to be, such transparency allows the opposing counsel (and the Court) to be more comfortable with computer-assisted review, reducing fears about the so-called “black box” of the technology. This Court highly recommends that counsel in future cases be willing to at least discuss, if not agree to, such transparency in the computer-assisted review process.

2012 WL 607412, at *1.

In Kleen Products, Magistrate Judge Nolan likewise focused on the benefits of cooperation in discovery:

The [Sedona Conference] Cooperation Proclamation calls for a “paradigm shift” in how parties engage in the discovery process. . . . In some small way, it is hoped that this Opinion can be of some help to others interested in pursuing a cooperative approach. The Court commends the lawyers and their clients for conducting their discovery obligations in a collaborative manner.

2012 WL 4498465, at *19.

Even in EORHB, after summarily ordering the parties to use predictive coding, Vice Chancellor Laster further mandated the use of a single vendor to host both parties’ documents and provide e-discovery services. Transcript of Oral Argument at 67, EORHB, Inc., No. 7409-VCL.

The In re Actos process called for the parties to agree on what documents would constitute the seed set, allowed both sides to participate in coding the seed set, required the parties to agree on the relevance score that would serve as the cutoff for manual document review and production, and gave the receiving party access to a random sampling of irrelevant documents for quality-control purposes. Once the producing party’s manual review of the most responsive documents was complete, the receiving party would be given access to a random sample of documents that the human reviewer coded as nonresponsive.

In Moore, on the other hand, it was not contemplated that the receiving party would participate in document review, but the producing party’s protocol compensated for this by promising almost complete transparency in connection with its production activities. The Moore procedure called for the entire seed set, including the documents ultimately coded nonresponsive, to be turned over to the receiving party with the tags reflecting the coding by the producing party’s human reviewer.

It is too early to tell whether these concessions to transparency, agreed to by producing parties anxious to win agreement to and judicial approval of a novel technological tool, will become accepted parts of a computer-assisted review and production protocol. Certainly, the information being shared in these document productions would never be shared by parties conducting a manual review. Allowing opposing counsel to see the entire seed set and coding decisions may provide reassurance that the machine is being properly trained, but when a manual review is conducted by young associates or contract attorneys, does the producing party need to prove that the lawyers have been properly trained? When documents are manually reviewed, how many of the producing party’s nonresponsive documents is the receiving party entitled to review? As parties and the courts become more comfortable with predictive coding, producing parties may become less forthcoming, and it remains to be seen whether courts will continue to enforce this level of transparency.

Predictive coding moved closer to the mainstream of e-discovery processes and solutions in 2012. In 2013 and beyond, courts will fill in additional structure where it is needed, addressing the types of cases in which it will be allowed, whether it is the court’s prerogative to order its use when the parties have not requested it, and to what degree transparency is a necessary attribute of predictive coding protocols. Like e-discovery itself, predictive coding is an important process that is fast becoming integral to litigation practice.

Keywords: litigation, commercial, business, predictive coding, e-discovery, document review, computer-assisted review, court decisions, transparency

Paula M. Bagger is a partner in the business litigation firm Cooke Clancy & Gruenthal LLP, in Boston, Massachusetts.

Copyright © 2017, American Bar Association. All rights reserved. This information or any portion thereof may not be copied or disseminated in any form or by any means or downloaded or stored in an electronic database or retrieval system without the express written consent of the American Bar Association. The views expressed in this article are those of the author(s) and do not necessarily reflect the positions or policies of the American Bar Association, the Section of Litigation, this committee, or the employer(s) of the author(s).