Using Data Classification Solutions for Legal Regulation Compliance
Managing data storage can be a considerable challenge for IT departments. Compliant data classification tools must be affordable, efficient and user-friendly for both IT and business professionals. Check out the newest automated solutions.
No matter how large or small an organization, managing data storage is generally a challenge for the Information Technology (IT) department, and it's becoming more challenging every year. IT has always had the charter to manage the storage and protection of data, but in recent years new responsibilities have emerged to include data security, long-term archival strategy and the basics of regulatory compliance requirements. If that weren't enough, IT is also being called upon to manage detailed data discovery and compliance requests, which shift into high gear when legal discovery becomes a necessity. Compliance and legal discovery in civil cases are complex issues for IT because they usually involve large amounts of data and require IT to interface with not only multiple internal departments but also external personnel such as opposing IT departments, general counsels or sometimes even representatives of the court. The intent of this article is to advise and guide IT professionals on how to select tools that will help them meet the new legal discovery and compliance requirements.
Compliance in these cases became much more complex with last year's changes to the Federal Rules of Civil Procedure. Addressing these new rules requires IT to discover the actual content and overall themes of the millions of files and emails within an enterprise's data storage complex. The rules require all files and emails with content pertaining to the suit to be submitted during the discovery phase.
Rules 26 and 34 require that IT departments of companies involved in a civil suit conduct a pre-meeting to disclose each party's data storage procedures and the technology used to access that data. The implication of these rules is that IT must be able to determine the actual data content in millions or even billions of files. For example, IT may be asked to pull every email and file pertaining to specific individuals involved in the case. Or, IT may need to search and discover all emails containing "suspect" keywords or phrases in sexual harassment cases or even specific patterns in computer source code for patent infringement suits.
The only way IT can support this increased level of discovery is to deploy automated solutions. Fortunately, many new solutions in the data discovery arena have recently been announced, but IT will be challenged to match the most affordable solution to its specific needs. The first step in this process is to determine exactly what information must be discovered. The broader and more complex this data, and the more deeply the solution must probe into file content, the more sophisticated the solution will have to be. In addition to understanding data requirements, IT will need to understand where the information to be discovered is physically located. Will a solution have to handle a single storage volume, or will the data set be widely distributed across laptops and multiple servers in multiple locations?
Why traditional classification solutions fall short
At the lowest level are traditional classification solutions, usually called "Basic Classification" solutions, which are based on simple file system metadata. They lack comprehensive content visibility, but may be acceptable if only basic data needs to be discovered. The problem with basic solutions is that they generally can only help users sort through files based on file system attributes, also called "metadata": file name, directory name, file size, file type and modified/access dates. They can be used to find blocks of files if it's known precisely in which directory the files reside, or in which date ranges the files were generated or modified. Unfortunately, even these basic solutions are generally cost prohibitive.
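To make the limits of basic classification concrete, here is a minimal Python sketch of a metadata-only filter. The function name `classify_by_metadata` and its parameters are hypothetical, chosen for illustration, and do not come from any particular product. Note that it never opens a file, so it cannot say anything about what a file actually contains.

```python
from pathlib import Path


def classify_by_metadata(root, extensions=None, min_size=0, modified_after=None):
    """Return files under root whose file system metadata matches the filters.

    extensions:     optional set of lowercase suffixes, e.g. {".doc"}
    min_size:       minimum file size in bytes
    modified_after: optional POSIX timestamp; older files are skipped
    """
    matches = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        info = path.stat()  # metadata only; file content is never read
        if extensions and path.suffix.lower() not in extensions:
            continue
        if info.st_size < min_size:
            continue
        if modified_after is not None and info.st_mtime < modified_after:
            continue
        matches.append(path)
    return sorted(matches)
```

A call such as `classify_by_metadata("/shares/legal", extensions={".doc"}, min_size=1024)` (the path is a placeholder) returns every matching file, but only if IT already knows which directory, file type or date range to ask for.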
The next level of functionality is solutions based on relational databases and enterprise search engines. These solutions use "intermediate classification," which, based on keywords or phrases, can add some visibility into content. Google Desktop and Microsoft's "Windows Desktop Search" are examples of these solutions. They are much more capable of finding data than basic classification systems, but they are designed for the desktop file environment, so their overhead requirements make them impractical for enterprise use. The average desktop has 100GB of data, but the average enterprise environment manages over 1,000 times that much data, and this data is spread across networked storage environments. Desktop search and web search technologies cannot scale in large, distributed environments.
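The idea behind keyword-based classification can be sketched in a few lines of Python. This is an illustrative assumption, not how any of the named products work: the function `find_keyword_hits` is hypothetical, it only looks at `.txt` files, and it rescans every file on every query, whereas real search engines build an index first.

```python
import re
from pathlib import Path


def find_keyword_hits(root, keywords):
    """Map each .txt file under root to the set of keywords found in its content."""
    if not keywords:
        return {}
    # One case-insensitive pattern with word boundaries, so that the
    # keyword "suit" does not match inside "pursuit".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    hits = {}
    for path in Path(root).rglob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        found = {m.group(1).lower() for m in pattern.finditer(text)}
        if found:
            hits[str(path.relative_to(root))] = found
    return hits
```

Even this toy version shows the scaling problem: every query reads every file, which is tolerable on a single desktop and impractical across networked enterprise storage.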
Key capabilities needed for successful legal discovery
A more sophisticated class of solutions, called information classification and management (ICM), is becoming available to address more complex search and discovery requirements. These solutions offer sophisticated capabilities such as file-path metadata parsing, in-file content visibility, context category classification, file classification "tagging" and policy-based management and tracking. Some solutions even include advanced "pattern recognition and context extraction" used to classify data based on document summaries or "themes."
These solutions allow IT to fully meet the compliance challenges they face today, but IT needs to understand how the performance, scalability and flexibility of these solutions will affect their total storage environment.
For example, in most large enterprises, data is generally spread out geographically across branch offices and even among individual employees' desktops or laptops. Discovery solutions built on monolithic, single-database architectures can't solve the problem of distributed data. They can't scale beyond a few million records, are difficult to customize, suffer from slow indexing and can be very costly. They also don't offer features such as user-selectable parsing, which can be a vital attribute.
Relational databases do a great job managing structured transactions, but they fall short when managing unstructured information such as metadata and file content. Their performance falls precipitously at around five million records, and at a few hundred million they become practically unusable. They also have very limited ability to aggregate metadata from multiple repositories, which is necessary to guarantee consistent discovery across enterprise environments.
Finally, an entirely new range of solutions is emerging that operates much like parallel computers. They deploy technology that allows each "database instance" to stand alone. This "self-contained" technology is capable of quickly managing millions of records. The new technology seamlessly handles performance scalability and the distributed nature of data sets. An ICM solution based on a distributed metadata database technology automatically divides itself into many distributed "slices." This scalability allows the handling of billions of files with no performance degradation.
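The "slice" idea can be illustrated with a toy Python catalog. This is a sketch under stated assumptions, not a description of any vendor's architecture: the class `SlicedCatalog`, the hash-based placement and the in-memory dictionaries are all hypothetical. In a real deployment each slice would presumably live on its own server or site; here threads merely stand in for that parallelism.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


class SlicedCatalog:
    """Toy metadata catalog split into independent slices (hypothetical design)."""

    def __init__(self, num_slices=4):
        # Each slice is a stand-alone dict: file path -> metadata record.
        self.slices = [dict() for _ in range(num_slices)]

    def _slice_for(self, path):
        # Stable hash, so the same path always lands in the same slice.
        digest = hashlib.sha1(path.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.slices)

    def add(self, path, record):
        self.slices[self._slice_for(path)][path] = record

    def query(self, predicate):
        """Run predicate against every slice concurrently and merge the results."""
        def scan(one_slice):
            return [p for p, rec in one_slice.items() if predicate(rec)]

        with ThreadPoolExecutor(max_workers=len(self.slices)) as pool:
            partial = list(pool.map(scan, self.slices))
        return sorted(p for part in partial for p in part)
```

Because no slice depends on another, adding capacity means adding slices, and a query touches each slice only once before the partial results are merged.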
Finding the correct data is just the first phase of the process for IT. Once discovered and properly classified, files containing the classified data must be carefully moved to a specified data storage location. The only practical way of doing this is for IT to set up policy engines that include both data classification and file "tagging."
It's important that the solution is capable of accurately "knowing" the file data values before policies are established. The policy engine must be capable of leaving file shortcuts so the file history can be traced. It's not enough to just move the data to its new location. All of the file directory structures, including access control (security) settings, must be moved as well. If the storage environment is typical of most large enterprises, it will consist of many remote locations, and only the most sophisticated solutions will be able to seamlessly accomplish the move.
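A tag-driven move policy of the kind described here might look like the following Python sketch. Everything in it is an assumption made for illustration: the function `apply_move_policy`, the tag dictionary and the `.moved.txt` stub convention are hypothetical, and the stub is a plain text file rather than a true file system shortcut. Only timestamps and permission bits are preserved; full access control lists would need platform-specific handling.

```python
import shutil
from pathlib import Path


def apply_move_policy(tags, wanted_tag, source_root, archive_root):
    """Move every file carrying wanted_tag from source_root into archive_root.

    tags maps a path relative to source_root to a set of tag strings.
    Returns the list of relative paths that were moved.
    """
    source_root, archive_root = Path(source_root), Path(archive_root)
    moved = []
    for rel_path, file_tags in sorted(tags.items()):
        if wanted_tag not in file_tags:
            continue
        src = source_root / rel_path
        dst = archive_root / rel_path
        # Recreate the original directory structure at the destination.
        dst.parent.mkdir(parents=True, exist_ok=True)
        # copy2 keeps timestamps and permission bits (not full ACLs).
        shutil.copy2(src, dst)
        src.unlink()
        # Leave a text stub behind so the file's history can be traced.
        stub = src.with_name(src.name + ".moved.txt")
        stub.write_text(f"moved to: {dst}\n", encoding="utf-8")
        moved.append(rel_path)
    return moved
```

For example, `apply_move_policy(tags, "litigation-hold", "/shares", "/archive/case-files")` (tag and paths are placeholders) would relocate only the tagged files and leave a trace at each original location.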
Of all the capabilities, tagging is the most important for successful data classification, because it will define the compartmentalized data policies necessary to manage and track the movement of all the "discovered" files. When evaluating solutions, IT should make sure that the vital tagging feature is easy to understand and configure.
The actual classification process should include meetings between department heads and IT to define the critical, confidential or sensitive information that needs to be tagged, so that policies on how to manage and track files can be enforced based on those tags. The most sophisticated solutions allow these meetings to be conducted efficiently by making tagging as easy as creating an MP3 playlist. It should be possible to tag a file with a simple mouse click and then drag it into a list window.
These new ICM solutions are the most useful because they combine the best of scalability, capability, flexibility and affordability. Fully featured and simple to use, they transcend the limitations of enterprise search or relational database technologies.
There is no doubt that the burden of data classification for compliance falls on the IT department. IT must lead the selection of a tool that is affordable yet meets the needs of the data classification task at hand. Any chosen tool must be usable by both the IT group and the business users. Careful consideration must be given to ensuring the solution can scale to handle the amount of data that will need to be classified, whether that data is centrally contained or distributed across an enterprise.