All organisations have large amounts of data stored on enterprise systems. This might be structured data held in "rows and columns" databases such as Oracle or SAP, or what is referred to as unstructured information: information stored in a variety of human-oriented formats such as documents, spreadsheets, SharePoint sites, document management systems, emails, voice files and social media messages.
To complicate matters further, this information is spread across a variety of locations such as enterprise servers, email servers, back-up media, mobile devices and "the cloud" - corporate or third-party shared data centres. The final element to throw into the mix is that there is an awful lot of this data and information. To extend Benjamin Franklin's famous line, it appears that in this world nothing can be said to be certain except death, taxes and ever-growing amounts of electronic information.
Every year, IDC attempts to measure the "digital universe", which they define as all the digital data created, replicated and consumed in a single year, and they also make a projection of the size of that universe to the end of the decade. The latest report says that between 2005 and 2020 the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes (more than 5,200 gigabytes for every man, woman and child in 2020). From now until 2020, the digital universe will roughly double every two years: Big Data is getting bigger and getting much harder to control. As IDC point out, "like our own physical universe, the digital universe is rapidly expanding and incredibly diverse, with vast regions that are unexplored and some that are, frankly, scary".
So we have lots of information and data sitting in lots of systems and in a variety of formats. Why is that such a big problem? There are actually three sets of problems, and the first is all about information production. In parallel with the growth in information, there has been a similar if slightly less spectacular growth in the obligations on all organisations to produce subsets of the data and information they are storing. Whether it is for regulators, litigators, freedom of information requests, subject access requests, internal or external auditors or any other third party able to make a legitimate request for information, organisations have a duty to provide a complete set of data and information, and often they have to do so under severe time pressure. To carry that out successfully, organisations need to know what information they have and where it is, and then be able to find, assess and produce it in a timely manner. This is not an easy task when information is doubling every two years, and an inability to fulfil these obligations can result in severe financial damage, reputational damage or both.
The second set of problems concerns inefficiency caused by the cost and complexity of the information environment. If information workers struggle to find the information they need in the course of their everyday job, or find multiple versions of the same information, there is a significant cost overhead to the organisation, and workers then start keeping their own copies "just to be sure", which further exacerbates the problem. Factor in the cost of storing and backing up all of the duplicate or outdated content, or of migrating redundant information to new systems, and the costs can be significant. The answer, of course, is to have one version of the truth: a protected version of the current information in a designated repository, available to the right audience at the right time. It sounds simple, but getting to this point from where most organisations are now is not easy without the right approach, the right tools and the right help.
The final problem is one of missed opportunities. In the report cited above, IDC state that only a tiny fraction of the digital universe has been explored for analytic value and estimate that by 2020 as much as 33% of the digital universe will contain information that might be valuable if analysed. That is a classic Big Data view of the world, but where the Big Data proponents can sometimes miss a trick is in assuming that the data and information to be assessed and analysed is all sitting in consumer-focussed transactional databases, to be analysed by data scientists in order to spot patterns and predict future behaviour. When it is remembered that structured data typically forms only 10% of an organisation's total information landscape, many organisations now also understand that there is value in understanding the content and patterns in their unstructured data too. The software for this is already out there: there are tools that not only provide the ability to rapidly search, find, identify and review textual information in multiple formats on a variety of media, but also have the ability to do smarter things such as identify faces in security footage, extract phrases from voice files or assess sentiment about a product or service in a social media message stream.
Having identified three big sets of problems, it is now time to look at the solutions and, as always, the solutions are easy to describe but somewhat more difficult to put into practice. The technology exists, but it is the practical application of that technology that can prove tricky, so we have set out below the approach that many organisations are currently taking.
We have set out six practical steps and, whilst it is possible to gain significant benefits from implementing individual steps as point solutions, the maximum benefit will be gained from following the whole six-step process. Some of the steps involve consultancy and business analysis, and some of them are based around software products and implementation services.
The first thing to do is to understand what information the business actually needs to operate and relies on for its on-going existence and to perform its primary function whether that be manufacturing pharmaceuticals, carrying out financial transactions, providing education services or any other line of business. Understanding the logical information environment will involve an engagement with all information users to understand what information is used by each department or function. The result of this exercise is a schedule of logical information types along with examples of the information, access rights, security requirements and a retention and disposition policy. The process should include all users, front-office, back-office and any in between so that a full understanding of the organisation’s information can be obtained.
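The output of this exercise can be captured as a simple record per logical information type. As an illustrative sketch only (the field names and example values below are assumptions, not taken from any standard or product):

```python
from dataclasses import dataclass, field

@dataclass
class InformationType:
    # One row in the schedule of logical information types.
    name: str                          # e.g. "Customer contracts"
    examples: list = field(default_factory=list)
    access_rights: str = "all-staff"   # who may see this information
    security_level: str = "internal"   # e.g. internal / confidential
    retention_years: int = 7           # keep for this long...
    disposition: str = "destroy"       # ...then destroy or archive

# A hypothetical two-row schedule built from departmental interviews.
schedule = [
    InformationType("Customer contracts",
                    examples=["master services agreement"],
                    access_rights="legal-team",
                    security_level="confidential",
                    retention_years=10,
                    disposition="archive"),
    InformationType("Meeting minutes", retention_years=3),
]
```

Even this minimal structure forces each department to state an owner-facing answer to "who can see it, how long do we keep it, and what happens then", which is the substance of the retention and disposition policy.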
It is at this point that it is necessary to go and analyse the data sitting out in the repositories identified in the data map. In larger organisations a risk assessment may need to be carried out in order to prioritise the repositories which are reviewed. Having done that, technology takes centre stage and enables a very fast trawl of all identified repositories in order to build a smart index. Initially a metadata index is built to provide an overall understanding of the whole estate and the information is then grouped by metadata criteria such as file type and age. At this point it is possible to identify redundant, obsolete and trivial information in order to free up disk space and prepare the information set for an in-depth analysis.
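The first-pass metadata trawl can be sketched in a few lines: walk the repository, record metadata only (no content is read at this stage), then group by criteria such as file type or age. This is an illustrative stdlib sketch, not a description of any particular indexing product:

```python
import time
from collections import defaultdict
from pathlib import Path

def build_metadata_index(root):
    """Fast first pass: record metadata only, never file contents."""
    index = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            st = path.stat()
            index.append({
                "path": str(path),
                "type": path.suffix.lower() or "(none)",
                "size": st.st_size,
                "age_days": (time.time() - st.st_mtime) / 86400,
            })
    return index

def group_by(index, key):
    """Group index entries by a metadata field, e.g. 'type' or 'size'."""
    groups = defaultdict(list)
    for entry in index:
        groups[entry[key]].append(entry)
    return groups
```

Because only metadata is touched, this pass is fast enough to run across a whole estate, and the resulting groupings are what make the redundant, obsolete and trivial candidates visible.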
Redundant documents are generally duplicates which need to be removed after assigning a master.
Obsolete information is obviously past its sell by date and can be found using the retention schedules and metadata analysis.
Trivial information has no valuable content and is mainly system files such as crash dumps, backups and temporary files.
In most organisations there is a disk space saving of between 30% and 60% after removing these types of files.
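Identifying redundant documents and assigning a master is, at its simplest, content hashing. A minimal sketch, assuming the convention (an assumption on our part) that the oldest copy becomes the master:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(paths):
    """Group files by content hash; the oldest copy becomes the master."""
    by_hash = defaultdict(list)
    for p in map(Path, paths):
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        by_hash[digest].append(p)
    groups = []
    for copies in by_hash.values():
        if len(copies) > 1:
            copies.sort(key=lambda p: p.stat().st_mtime)  # oldest first
            master, duplicates = copies[0], copies[1:]
            groups.append((master, duplicates))
    return groups
```

In practice the "assign a master" rule is a policy decision (oldest, newest, or the copy in the designated repository); the hashing step is the same whichever rule is chosen.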
Now that the remaining information has been surfaced, classified and categorised, some decisions can be made about what should happen next to this information.
There are three options:
Defensible Destruction - seek approval from identified owners, provide audit reports on all stages of the destruction decision process (identification, sampling, tagging etc.) and maintain audit logs of deletion process;
Migration - migrate to records management, archives or secondary storage system according to the business value of the information, and migrate the metadata and tags if possible and maintain audit logs of the migration process;
In-Place Management - use a policy engine to enforce future application of the policy you have selected now, assuming that is within the capabilities of the system under management and maintain audit trails of any actions applied.
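The common thread in all three options is the audit trail: no action is taken without a record of who approved it, when, and what exactly was acted on. A minimal sketch of the defensible destruction case (the log fields and approver role here are illustrative assumptions):

```python
import csv, hashlib, os
from datetime import datetime, timezone

def destroy_with_audit(paths, approver, audit_log):
    """Delete files only after recording the approver, a UTC timestamp
    and a content hash, so the destruction decision can be evidenced
    later without the content itself."""
    with open(audit_log, "a", newline="") as f:
        writer = csv.writer(f)
        for path in paths:
            digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
            writer.writerow([datetime.now(timezone.utc).isoformat(),
                             path, digest, approver, "deleted"])
            os.remove(path)
```

The same pattern (act, but log first) applies equally to migration and in-place policy enforcement.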
The next step is to look at the existing systems in use and build a "data map" showing the current information repositories and their contents, insofar as this is possible. The repositories are likely to include shared drives, databases, email repositories and line-of-business systems. An assessment of each repository can then be carried out to decide whether it provides the right level of information security and availability as well as preserving the integrity of the information it contains.
Having understood this, it is then possible to decide “what goes where” and build a current and future state data map. At the heart of this repository strategy is likely to be some form of document and records management system which will allow you to lock down information to maintain integrity and ensure security and availability.
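One way to make the "what goes where" decision repeatable is to score each repository against the three tests named above: security, availability and integrity. The record shape and the migration rule below are illustrative assumptions, not a prescribed method:

```python
from dataclasses import dataclass

@dataclass
class RepositoryAssessment:
    name: str
    kind: str                  # "shared drive", "database", "email", ...
    secure: bool               # right level of information security?
    available: bool            # right audience at the right time?
    preserves_integrity: bool  # content cannot be silently altered?

    def target(self):
        # Illustrative rule: anything failing a test is a candidate
        # for migration into the document and records management system.
        keep = self.secure and self.available and self.preserves_integrity
        return "keep in place" if keep else "migrate to records system"

# A hypothetical fragment of the current-state data map.
data_map = [
    RepositoryAssessment("Finance share", "shared drive", True, True, False),
    RepositoryAssessment("HR system", "database", True, True, True),
]
```

The future-state data map is then just the same list with each repository's `target()` applied.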
After the redundant, obsolete and trivial information has been dealt with, a second-pass deep-dive analysis can take place to assess and analyse the content of files: to understand themes, to search for data and to apply agreed metadata and policies to identified files. This is where dark data (in the "vast unexplored regions" of the digital universe) can be located and actioned. For this, a smart toolset is needed that can take the index that has been built and identify a whole variety of data and information patterns and relationships. The techniques used for this include:
Clustering - visualizing common content patterns in dark data and identifying groupings for policy development and application;
Trained categories - leveraging any investment in records management using known documents to train categories and then identifying category matches in data, thus shining a light on previously dark data;
Eduction - going beyond traditional entity extraction and enriching the extracted data based on the knowledge already held within the organisation.
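To make the clustering idea concrete, here is a deliberately simple sketch: documents are turned into word-count vectors and greedily grouped by cosine similarity. Real tools use far richer models; the threshold and greedy strategy here are illustrative assumptions:

```python
import math
from collections import Counter

def vectorise(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.5):
    """Greedy clustering: each document joins the first cluster whose
    seed it resembles closely enough, otherwise it starts a new one."""
    clusters = []  # list of (seed_vector, member_indices)
    for i, doc in enumerate(docs):
        v = vectorise(doc)
        for seed, members in clusters:
            if cosine(v, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [members for _, members in clusters]
```

Even this naive version shows the point of the technique: common content patterns surface as groups, and each group can then be given a single policy rather than deciding file by file.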
In order to ensure that an optimum information environment is maintained on an on-going basis including continuing best use of disk space, it is necessary to dig a little deeper on the policy and process side:
How do business processes use and generate information?
How should information governance policies be monitored and adjusted to keep pace with business and regulatory requirements?
From a technical point of view, all indexes should be kept up to date meaning that these indexes can be leveraged to provide an enterprise search capability, to keep data accessible and to allow on-going categorisation. And it should be possible to automate large parts of your information environment such as implementing performance and compliance archives, applying on-going policies based on indexed categories, auto-capturing records and managing records in-place.
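The "keep indexes up to date" requirement is essentially an inverted index with incremental updates: when a document changes, its old postings are removed before the new ones are added, so search results never reference stale content. A minimal stdlib sketch (the class and method names are our own, not any product's API):

```python
from collections import defaultdict

class SearchIndex:
    """Minimal inverted index: update per document, query by keyword."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of doc ids
        self.docs = {}                    # doc id -> last indexed text

    def update(self, doc_id, text):
        # Re-indexing a changed document first removes its old postings.
        for word in self.docs.get(doc_id, "").lower().split():
            self.postings[word].discard(doc_id)
        self.docs[doc_id] = text
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, word):
        return sorted(self.postings[word.lower()])
```

The same index can then serve enterprise search, on-going categorisation and policy application, which is why keeping it current pays for itself.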
Data is getting bigger, regulation is getting more frequent and disputes simply fail to go away. All of which means that it is necessary to spend some time designing and implementing proper document production capabilities and making them part of business-as-usual. It is in the area of document production that the greatest benefits lie: a faster, more flexible response capability for internal users, regulators and litigators alike. Moreover, a greatly simplified data and information environment brings significant related benefits such as reduced complexity, reduced cost and the opportunity to unlock the value in a data set (including within dark data) and to create rather than miss opportunities for growth or improvement.