Data Archival


“Archiving” — Misunderstood and Poorly Executed

“Archiving involves indexing content such that it can be retrieved easily at a later time using a keyword search. Anything else is just backup, and ineffectual backup at that”.

As networked storage becomes ubiquitous, the need to manage where data is stored, and to ensure that it can be moved around within the storage environment, becomes increasingly important. However, as this process moves up the agenda for IT departments, so does the need to understand the differences between the various techniques and their limitations.

Data migration, backup, disaster recovery and archiving may all seem to merge into the single discipline of data movement, but each has a different role to play, and what differentiates them is the business drivers behind them. For instance, data migration is about the bulk movement of data from one data storage resource to another to achieve a particular outcome. This tends to be a 'one off', project-based requirement rather than a regular feature of day-to-day Data Centre activity. Disaster recovery, on the other hand, is deployed as a form of risk mitigation: essentially insurance against a catastrophic event depriving an organisation of access to its data. Where disaster recovery is concerned, the data is in an almost constant state of change as remote facilities are continually updated.

By contrast, backup is about restoring lost, deleted or corrupted data to a known good state. In the real world, most cases involve restoring individual files rather than complete system volumes. Backup does not try to keep up with a constantly changing set of data but relies on a 'snapshot' of a point in time, which can be hours old (using virtualised disk techniques) or days and weeks old (using more conventional tape backup).
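To make the individual-file restore case concrete, here is a minimal Python sketch. It assumes, purely for illustration, that each backup run writes a complete point-in-time snapshot under a dated directory; the paths and names are hypothetical, not those of any particular backup product.

```python
import shutil
from pathlib import Path

# Hypothetical layout: each backup run writes a full snapshot under
# /backups/<YYYY-MM-DD>/..., mirroring the live file system.
BACKUP_ROOT = Path("/backups")

def restore_file(relative_path: str, snapshot_date: str,
                 live_root: Path = Path("/data")) -> Path:
    """Copy a single file from a chosen point-in-time snapshot back into place."""
    source = BACKUP_ROOT / snapshot_date / relative_path
    target = live_root / relative_path
    if not source.exists():
        raise FileNotFoundError(f"{relative_path} not present in snapshot {snapshot_date}")
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)   # restores content and timestamps
    return target

# Example: recover an accidentally deleted report from an earlier snapshot.
# restore_file("finance/q3-report.xls", "2008-06-11")
```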

Finally, archiving is about the long term retention of data which rarely, if ever, changes; any changes which do occur tend to be deletions of data that is no longer required. The characteristic which defines archiving is therefore the unchanging nature of the data.

Information Lifecycle Management (ILM):

Information Lifecycle Management (ILM) is a concept being widely promoted in the industry today. Many industry leaders consider it little more than a name change for a technique known as Hierarchical Storage Management (HSM), which has been widely used in the mainframe world for many years. Despite employing many similar techniques, ILM is not exactly the same as HSM: it has characteristics that allow it to be applied in an open systems environment. ILM must deliver in an environment where the data is typically unstructured and scalability is a must, while HSM was developed to operate in an environment where highly structured data was managed by a closed system with limited potential for future scalability.

The advent of data storage virtualisation techniques from companies such as Tarmin and HDS, which enable different disk technologies and disk vendors to be consolidated into one logical pool of centralised storage capacity, has been one of the major elements underpinning the ILM approach and offers users new alternatives for their archiving strategies based on a tiered storage model.

This ability to mix and match data storage based on criteria such as performance and price means that archiving can become a multistage process: data not accessed in the recent past is taken from high performance (and high cost) disk systems and migrated to lower performance (lower cost) disk systems, in the process releasing expensive, high performance disk capacity for high value system use. If the data remains undisturbed for a further period, it can be migrated again, for example to tape, where it can remain without consuming expensive disk storage space.
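As a rough illustration of such a multistage, age-based migration, the sketch below moves files between tiers purely on how long they have been left unaccessed. The thresholds, directory names and the use of last-access time are assumptions made for the example, not any vendor's actual policy engine.

```python
import shutil
import time
from pathlib import Path

DAY = 86400  # seconds

# Hypothetical tiers, checked longest-idle first: a year untouched goes to the
# tape staging area, 90 days untouched goes to cheaper mid-tier disk.
TIERS = [
    (365 * DAY, Path("/archive/tape_staging")),
    (90 * DAY, Path("/archive/midtier_disk")),
]

def migrate_by_age(high_perf_root: Path) -> None:
    """Move files off expensive, high performance disk once they have been idle long enough."""
    now = time.time()
    for path in list(high_perf_root.rglob("*")):
        if not path.is_file():
            continue
        idle = now - path.stat().st_atime        # seconds since last access
        for threshold, destination in TIERS:
            if idle >= threshold:
                destination.mkdir(parents=True, exist_ok=True)
                shutil.move(str(path), str(destination / path.name))
                break

# migrate_by_age(Path("/data/high_performance"))
```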

Of course there is no universal panacea, and with ILM one of the challenges is tracking the data as it is passed further down the data storage hierarchy. In an ILM environment, this is achieved by leaving a 'stub file' behind when a file is moved off high performance (high cost) disk. The stub file automatically redirects any access to the migrated data to the lower performing disk system, and the process is transparent to the user. When the file in question is further migrated to tape, another stub file is written, this time on the low performance disk, giving yet another redirect for any data access.
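The stub mechanism can be sketched roughly as follows. Commercial HSM/ILM products implement redirection inside the file system or driver layer; here, purely for illustration, a stub is modelled as a small JSON marker file holding the data's new location, and reads follow the chain of redirects.

```python
import json
from pathlib import Path

STUB_MARKER = "ILM-STUB-V1"   # invented signature identifying a stub file

def write_stub(original: Path, migrated_to: Path) -> None:
    """Replace a migrated file with a tiny stub recording where the data now lives."""
    original.write_text(json.dumps({"marker": STUB_MARKER, "location": str(migrated_to)}))

def read_with_redirect(path: Path) -> bytes:
    """Read a file, transparently following stub redirects (disk tier, then tape tier)."""
    data = path.read_bytes()
    try:
        stub = json.loads(data)
    except ValueError:
        return data                                          # ordinary file, no redirect
    if isinstance(stub, dict) and stub.get("marker") == STUB_MARKER:
        return read_with_redirect(Path(stub["location"]))    # a stub may point at another stub
    return data
```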

ILM is inherently an application, typically embedded in a SAN, and it will undoubtedly become a significant option to consider as part of an organisation's data archiving strategy. However, the technology is only a tool that provides part of the solution. Users will still have to define and create the policies governing which data should be migrated, when and how it should be migrated, and the criteria on which these decisions are based. ILM requires business and technology decision making to go hand in hand if a long term, viable ILM-based solution is to be effective.
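Because those policies are the user's to define, they often end up expressed as declarative rules. The fragment below is one illustrative way such migration criteria might be captured; the data classes, idle periods and tier names are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class MigrationRule:
    """One illustrative ILM policy rule: which data, when it moves, and where it goes."""
    data_class: str    # business classification of the data
    idle_days: int     # how long the data must be unaccessed before the rule fires
    target_tier: str   # where the data is migrated to

# A hypothetical policy set agreed between the business and IT.
POLICY = [
    MigrationRule("project-files", idle_days=90, target_tier="midtier-disk"),
    MigrationRule("project-files", idle_days=365, target_tier="tape-library"),
    MigrationRule("financial-records", idle_days=30, target_tier="worm-archive"),
]
```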

Regulation & Compliance:

One of the drivers behind so many IT departments reviewing their archiving strategies is the deluge of legal, regulatory and compliance requirements affecting digitally stored data. There are two distinct aspects to archiving in a regulatory environment. The first concerns policies for retaining data and, equally important, for deleting it. The second is matching the available technology to the specific requirement. Compliance can be viewed as a three-dimensional model built on the axes of Regulation, Industry Sector and Geography. However, issues such as the retention of data tend to be a common thread, and these relate directly to an archiving strategy.

Where archiving policies are concerned, the decision on how long to retain data in an archive depends upon the applicable regulatory requirements, which set a minimum period, and the organisation's internal policies, which may extend it. However, some legal requirements also mean that data must be deleted when it is no longer required for the purpose for which it was gathered (data protection legislation), so equally important is a data deletion policy that ensures only data which may legally be kept remains in the archive. Additionally, there are issues around whether archived data should be capable of being changed and, if so, tracking who has changed it. These challenges are met by an array of technology approaches, most based around Write Once Read Many (WORM) technology.
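The retention logic described above boils down to a simple rule: keep data at least as long as the regulatory minimum, no longer than it is needed for its original purpose, and otherwise follow internal policy. A hedged sketch of that decision, with hypothetical periods and field names:

```python
from datetime import date
from typing import Optional

def retention_action(created: date,
                     regulatory_minimum_days: int,
                     internal_policy_days: int,
                     still_needed: bool,
                     today: Optional[date] = None) -> str:
    """Decide whether an archived record must be kept, should be deleted, or needs review."""
    today = today or date.today()
    age_days = (today - created).days
    if age_days < regulatory_minimum_days:
        return "retain: regulatory minimum not yet reached"
    if not still_needed:
        return "delete: no longer required for its original purpose"
    if age_days < max(regulatory_minimum_days, internal_policy_days):
        return "retain: internal policy period still running"
    return "review: past all retention periods"

# Example: a seven-year regulatory minimum, a ten-year internal policy,
# and no remaining business need for the record.
# retention_action(date(2001, 3, 1), 7 * 365, 10 * 365, still_needed=False)
```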

Write Once Read Many (WORM):

Today almost all digital data is stored on disk, tape or both. These technologies are designed to allow data to be added, changed or deleted as required by the user. Whether or not these changes are journalled is a local operational decision, but in most cases no records are kept. The new regulatory regimes now require that any changes to affected archived data are either impossible or are recorded. This has given rise to three distinct approaches.
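Where true WORM media is not used, the 'recorded' route implies an append-only journal of every change made to archived data. The sketch below illustrates the idea only; a real compliance journal would itself need to be tamper-proof.

```python
import hashlib
import json
import time

class ChangeJournal:
    """Append-only record of changes to archived data (illustrative, not tamper-proof)."""

    def __init__(self, journal_path: str):
        self.journal_path = journal_path

    def record(self, user: str, item: str, action: str, content: bytes = b"") -> None:
        """Write one journal entry; existing entries are never rewritten."""
        entry = {
            "timestamp": time.time(),
            "user": user,
            "item": item,
            "action": action,   # e.g. "write", "delete"
            "sha256": hashlib.sha256(content).hexdigest(),
        }
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps(entry) + "\n")

# journal = ChangeJournal("/archive/change_journal.log")
# journal.record("alice", "contracts/2001-0042.pdf", "delete")
```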

Optical Disk Archiving:

Most people are familiar with optical technology through entertainment CDs and DVDs. Data Centre professionals may also be familiar with Magneto Optical (MO) technology, where the data is written magnetically but read by a laser. While all of these can be true WORM (the data cannot be altered once it is written), their capacity is small compared to the amounts of digital data now being accumulated, so many pieces of media are required to archive even modest amounts of data by today's standards.

One manufacturer, Plasmon, has pioneered a new optical technology called UDO (Ultra Density Optical), which currently offers three times the capacity of a magneto-optical drive and is road-mapped to double, and then double again, within the next few years. Many users have taken a 'wait and see' approach to this new technology, but now, with the media being second-sourced by Mitsubishi (Verbatim) and HP having adopted the technology as a standard offering, shipments are in the multi-petabytes and rising sharply. UDO may be the way forward for capacity, but all optical devices have one more issue to resolve: how do you delete just one file on an optical cartridge without having to copy and re-write all the data minus the file to be deleted? Plasmon now has a solution for this as well: file shredding. Optical data storage offers an effective, leading edge alternative for users wanting to make their archive systems fully compliant.

Secure Archiving:

Data archives are a prime target for both covert and malevolent attempts at unauthorised data access. Securing the data archive not only makes commercial sense but also contributes towards meeting compliance requirements.

Securing data archives can take two approaches. The first of these is to encrypt the data – thereby making it useless even if it is subject to unauthorised access.
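As a simple sketch of the encryption approach, the snippet below uses the Python cryptography library's Fernet scheme (symmetric, authenticated encryption) to protect archive objects before they are written. Key management, which is the hard part in practice, is deliberately left out of the example.

```python
from cryptography.fernet import Fernet

# In practice the key would be held in a key management system, never in code.
key = Fernet.generate_key()
cipher = Fernet(key)

def encrypt_for_archive(plaintext: bytes) -> bytes:
    """Encrypt data before it is committed to the archive."""
    return cipher.encrypt(plaintext)

def decrypt_from_archive(token: bytes) -> bytes:
    """Decrypt data retrieved from the archive; fails if the data has been tampered with."""
    return cipher.decrypt(token)

# token = encrypt_for_archive(b"scanned contract, page 1")
# original = decrypt_from_archive(token)
```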

The second approach is not to allow unauthorised access in the first place. This implies layers of hardened access controls which cannot be altered by just one person, where all access to the secured data, and all changes to it, must be authorised by two or more nominated individuals, and where all attempts to access the data (whether successful or not) are journalled.
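A minimal sketch of that second approach, assuming a simple two-person rule and a journal of every access attempt; the names and the number of required approvals are illustrative only.

```python
import logging

logging.basicConfig(filename="archive_access.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

NOMINATED_APPROVERS = {"alice", "bob", "carol"}   # hypothetical nominated individuals
REQUIRED_APPROVALS = 2

def access_archive(item: str, requester: str, approvers: set) -> bool:
    """Grant access only with two or more nominated approvals; journal every attempt."""
    valid = (approvers & NOMINATED_APPROVERS) - {requester}   # requesters cannot approve themselves
    granted = len(valid) >= REQUIRED_APPROVALS
    logging.info("item=%s requester=%s approvals=%s granted=%s",
                 item, requester, sorted(valid), granted)
    return granted

# access_archive("2007-payroll-archive", "dave", {"alice", "bob"})   # allowed and journalled
# access_archive("2007-payroll-archive", "dave", {"alice"})          # refused but still journalled
```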