
Information lifecycle management:

Information lifecycle management (ILM) is a process for managing information
through its lifecycle, from conception until disposal, in a manner that optimizes
storage and access at the lowest cost.

ILM is not just hardware or software—it includes processes and policies to manage
the information. It is designed upon the recognition that different types of information
can have different values at different points in their lifecycle. Predicting storage needs
and controlling costs can be especially challenging as the business grows.

The overall objectives of managing information with ILM are to help reduce the total
cost of ownership (TCO) and help implement data retention and compliance policies.
In order to effectively implement ILM, owners of the data need to determine how
information is created, how it ages, how it is modified, and if/when it can safely be
deleted. ILM segments data according to value, which can help create an economical
balance and sustainable strategy to align storage costs with business objectives and
information value.

ILM elements
To manage the data lifecycle and make your business ready for on demand operation,
there are four main elements that address an ILM structured environment. They are:

1) Tiered storage management
2) Long-term data retention
3) Data lifecycle management
4) Policy-based archive management

Tiered storage management


Most organizations today seek a storage solution that can help them manage data
more efficiently. They want to reduce the costs of storing large and growing amounts
of data and files while maintaining business continuity. Through tiered storage, you
can obtain benefits like the following (a small placement-policy sketch follows the
list):

1) Reducing overall disk-storage costs by allocating the most recent and most critical
business data to higher performance disk storage, while moving older and less critical
business data to lower cost disk storage.
2) Speeding business processes by providing high-performance access to most recent
and most frequently accessed data.
3) Reducing administrative tasks and human errors. Older data can be moved to lower
cost disk storage automatically and transparently.
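
As a rough illustration of the idea, here is a minimal Python sketch of such an
age-based placement policy. The tier names and the 90-day and one-year thresholds
are invented for this example and are not part of any particular product.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical tier names, ordered from most expensive/fastest to cheapest.
TIERS = ["high-performance disk", "low-cost disk", "archive"]

def choose_tier(last_accessed: datetime, critical: bool,
                now: Optional[datetime] = None) -> str:
    """Pick a storage tier from the data's age and business criticality.

    Assumed policy: critical or recently used data stays on fast disk,
    data untouched for 90 days moves to low-cost disk, and data
    untouched for a year moves to the archive tier.
    """
    now = now or datetime.now()
    age = now - last_accessed
    if critical or age < timedelta(days=90):
        return TIERS[0]
    if age < timedelta(days=365):
        return TIERS[1]
    return TIERS[2]

# Example: a report last opened two years ago and no longer critical.
print(choose_tier(datetime.now() - timedelta(days=730), critical=False))  # archive
```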

Typical storage environment


Storage environments typically have multiple tiers of data value, such as application
data that is needed daily, and archive data that is accessed infrequently. However,
typical storage configurations offer only a single tier of storage, as shown in the
figure, which limits the ability to optimize cost and performance.

Multi-tiered storage environment

A tiered storage environment that utilizes the SAN infrastructure affords the
flexibility to align storage cost with the changing value of information. The tiers are
related to data value. The most critical data is allocated to higher performance disk
storage, while less critical business data is allocated to lower cost disk storage.

An IBM ILM solution in a tiered storage environment is designed to:

1) Reduce the total cost of ownership (TCO) of managing information. It can help
optimize data costs and management, freeing expensive disk storage for the most
valuable information.

2) Segment data according to value. This can help create an economical balance and
sustainable strategy to align storage costs with business objectives and information
value.
3) Help make decisions about moving, retaining, and deleting data, because ILM
solutions are closely tied to applications.

4) Manage information and determine how it should be managed based on content,
rather than migrating data based on technical specifications. This approach can help
result in more responsive management, and offers you the ability to retain or delete
information in accordance with business rules.

5) Provide the framework for a comprehensive enterprise content management
strategy.

Long-term data retention


There is a rapidly growing class of data that is best described by the way in which it is
managed rather than the arrangement of its bits. The most important attribute of this
kind of data is its retention period, hence it is called retention managed data, and it is
typically kept in an archive or a repository. In the past it has been variously known as
archive data, fixed content data, reference data, unstructured data, and other terms
implying its read-only nature. It is often measured in terabytes and is kept for long
periods of time, sometimes forever.

Businesses must comply with laws and regulations that govern how long data must be
retained. Regulated information can include e-mail, instant messages, business
transactions, accounting records, contracts, or insurance claims, all of which can have
different retention periods, for example, 2 years, 7 years, or forever. Data is an asset when it needs to
be kept; however, data kept past its mandated retention period could also become a
liability. Furthermore, the retention period can change due to factors such as litigation.
All these factors mandate tight coordination and the need for ILM.

Not only are there numerous state and governmental regulations that must be met for
data storage, but there are also industry-specific and company-specific ones. And of
course these regulations are constantly being updated and amended. Organizations
need to develop a strategy to ensure that the correct information is kept for the correct
period of time, and is readily accessible when it needs to be retrieved at the request of
regulators or auditors.
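
As a minimal sketch of how such coordination might look (Python; the record types,
retention periods, and function names are illustrative assumptions, not legal
guidance), a deletion check must honour both the mandated retention period and any
litigation hold:

```python
from datetime import date, timedelta
from typing import Optional

# Assumed retention schedule in years; None means "retain forever".
RETENTION_YEARS = {"email": 2, "accounting_record": 7, "contract": None}

def may_delete(record_type: str, created: date, legal_hold: bool,
               today: Optional[date] = None) -> bool:
    """Return True only when the mandated retention period has expired
    and no litigation hold applies."""
    today = today or date.today()
    if legal_hold:                       # litigation can extend retention
        return False
    years = RETENTION_YEARS.get(record_type)
    if years is None:                    # retained forever
        return False
    return today >= created + timedelta(days=365 * years)  # approximate years

print(may_delete("email", date(2020, 1, 15), legal_hold=False,
                 today=date(2023, 6, 1)))   # True: the 2-year period has expired
```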

Data lifecycle management


At its core, the process of ILM moves data up and down a path of tiered storage
resources, including high-performance, high-capacity disk arrays, lower-cost disk
arrays such as serial ATA (SATA), tape libraries, and permanent archival media where
appropriate. Yet ILM involves more than just data movement; it encompasses
scheduled deletion and regulatory compliance as well. Because decisions about
moving, retaining, and deleting data are closely tied to application use of data, ILM
solutions are usually closely tied to applications.

ILM has the potential to provide the framework for a comprehensive information-
management strategy, and helps ensure that information is stored on the most cost-
effective media. This helps enable administrators to make use of tiered and virtual
storage, as well as process automation. By migrating unused data off of more costly,
high-performance disks, ILM is designed to help:
1) Reduce costs to manage and retain data.
2) Improve application performance.
3) Reduce backup windows and ease system upgrades.
4) Streamline data management.
5) Allow the enterprise to respond to demand—in real-time.
6) Support a sustainable storage management strategy.
7) Scale as the business grows.

Policy-based archive management


As businesses of all sizes migrate to e-business solutions and a new way of doing
business, they already have mountains of data and content that have been captured,
stored, and distributed across the enterprise. This wealth of information provides a
unique opportunity. By incorporating these assets into e-business solutions, and at the
same time delivering newly generated information media to their employees and
clients, a business can reduce costs and information redundancy and leverage the
potential profit-making aspects of their information assets.

Growth of information in corporate databases such as Enterprise Resource
Planning (ERP) systems and e-mail systems makes organizations think about moving
unused data off the high-cost disks. They need to:

1) Identify database data that is no longer being regularly accessed and move it to an
archive where it remains available.
2) Define and manage what to archive, when to archive, and how to archive from the
mail system or database system to the back-end archive management system.
The way to do this is to migrate and store all information assets into an e-business
enabled content manager. ERP databases and e-mail solutions generate large volumes
of information and data objects that can be stored in content management archives. An
archive solution allows you to free system resources, while maintaining access to the
stored objects for later reference. Allowing the content manager to manage and migrate
data objects gives the solution ready access to newly created information that carries a
higher value, while at the same time still being able to retrieve data that has been
archived on less expensive media.
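
The sketch below illustrates the first of these steps (Python with the standard
sqlite3 module; the orders/orders_archive tables and the last_accessed column are
invented for the example): rows untouched for a given number of days are copied to an
archive table and removed from the active table in a single transaction.

```python
import sqlite3

def archive_stale_rows(db_path: str, days: int = 365) -> int:
    """Move rows not accessed for `days` days from `orders` to `orders_archive`.

    Assumes both tables already exist with identical columns, including a
    `last_accessed` timestamp column; returns the number of rows moved."""
    cutoff = f"-{days} days"
    con = sqlite3.connect(db_path)
    try:
        with con:  # a single transaction: copy, then delete
            con.execute(
                "INSERT INTO orders_archive SELECT * FROM orders "
                "WHERE last_accessed < datetime('now', ?)", (cutoff,))
            cur = con.execute(
                "DELETE FROM orders "
                "WHERE last_accessed < datetime('now', ?)", (cutoff,))
        return cur.rowcount
    finally:
        con.close()
```
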
Requirements that today's storage infrastructure must meet:
1) Unlimited and just-in-time scalability. Businesses require the capability to
flexibly adapt to rapidly changing demands for storage resources without
performance degradation.

2) System simplification. Businesses require an easy-to-implement infrastructure
with the minimum of management and maintenance. The more complex the enterprise
environment, the more costs are involved in terms of management. Simplifying the
infrastructure can save costs and provide a greater return on investment (ROI).

3) Flexible and heterogeneous connectivity. The storage resource must be able to
support whatever platforms are within the IT environment. This is essentially an
investment protection requirement that allows you to configure a storage resource for
one set of systems, and subsequently configure part of the capacity to other systems
on an as-needed basis.

4) Security. This requirement guarantees that data from one application or system
does not become overlaid or corrupted by other applications or systems. Authorization
also requires the ability to fence off one system’s data from other systems.

5) Availability. This is a requirement that implies both protection against media
failure as well as ease of data migration between devices, without interrupting
application processing. This certainly implies improvements to backup and recovery
processes: attaching disk and tape devices to the same networked infrastructure allows
for fast data movement between devices, which provides enhanced backup and
recovery capabilities, such as the following (a small sketch contrasting the two copy
modes follows this list):
– Server-less backup. This is the ability to back up your data without using the
computing processor of your servers.
– Synchronous copy. This makes sure your data is at two or more places before
your application goes to the next step.
– Asynchronous copy. This makes sure your data is at two or more places within
a short time. It is the disk subsystem that controls the data flow.
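
The following toy sketch (plain Python, with in-memory lists standing in for the
primary and secondary copies; not how a real disk subsystem is built) illustrates the
difference: a synchronous write returns only after both copies are updated, while an
asynchronous write queues the update for the secondary and acknowledges immediately.

```python
import queue
import threading

primary, secondary = [], []           # in-memory stand-ins for the two sites
pending = queue.Queue()               # updates waiting to reach the secondary

def synchronous_write(block):
    """Acknowledge only after the data is at both places."""
    primary.append(block)
    secondary.append(block)           # must complete before we return
    return "ack"

def asynchronous_write(block):
    """Acknowledge as soon as the primary has the data; the copy to the
    secondary happens shortly afterwards in the background."""
    primary.append(block)
    pending.put(block)
    return "ack"

def replicator():
    """Background task draining the queue to the secondary site."""
    while True:
        secondary.append(pending.get())

threading.Thread(target=replicator, daemon=True).start()
synchronous_write(b"block-1")
asynchronous_write(b"block-2")
```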

Infrastructure simplification
There are four main methods by which infrastructure simplification can be achieved:
consolidation, virtualization, automation and integration:

1) Consolidation
Concentrating systems and resources into locations with fewer, but more powerful,
servers and storage pools can help increase IT efficiency and simplify the
infrastructure. Additionally, centralized storage management tools can help improve
scalability, availability, and disaster tolerance.

2) Virtualization
Storage virtualization helps in making complexity nearly transparent and at the same
time can offer a composite view of storage assets. This may help reduce capital and
administrative costs, while giving users better service and availability. Virtualization
is designed to help make the IT infrastructure more responsive, scalable, and
available.

3) Automation
Choosing storage components with autonomic capabilities can improve availability
and responsiveness—and help protect data as storage needs grow. As soon as day-to-
day tasks are automated, storage administrators may be able to spend more time on
critical, higher-level tasks unique to a company’s business mission.

4) Integration
Integrated storage environments simplify system management tasks and improve
security. When all servers have secure access to all data, your infrastructure may be
better able to respond to your users' information needs.
Five Pillars of Technology:

Data proliferation:
Data proliferation refers to the vast amount of data, structured and
unstructured, that businesses and governments continue to generate at an unprecedented
rate, and to the usability problems that result from attempting to store and manage that
data. While originally pertaining to problems associated with paper documentation,
data proliferation has become a major problem in primary and secondary data storage
on computers.

Problems caused by data proliferation:

The problem of data proliferation is affecting all areas of commerce as the result of
the availability of relatively inexpensive data storage devices. This has made it very
easy to dump data into secondary storage immediately after its window of usability
has passed. This masks problems that could gravely affect the profitability of
businesses and the efficient functioning of health services, police and security forces,
local and national governments, and many other types of organization.

Data proliferation is problematic for several reasons:


1) Difficulty when trying to find and retrieve information. At Xerox, on average
it takes employees more than one hour per week to find hard-copy documents,
costing $2,152 a year to manage and store them.
2) Data loss and legal liability when data is disorganized, not properly replicated,
or cannot be found in a timely manner. In April of 2005, Ameritrade Holding
Corporation told 200,000 current and past customers that a tape containing
confidential information had been lost or destroyed in transit. In May of the
same year, Time Warner Incorporated reported that 40 tapes containing
personal data on 600,000 current and former employees had been lost en route
to a storage facility.
3) Increased manpower requirements to manage increasingly chaotic data storage
resources.
4) Slower networks and application performance due to excess traffic as users
search and search again for the material they need.
5) High cost in terms of the energy resources required to operate storage
hardware. A 100 terabyte system will cost up to $35,040 a year to run.

Proposed solutions:

1) Applications that better utilize modern technology.


2) Reductions in duplicate data (especially as caused by data movement).
3) Improvement of metadata structures.
4) Improvement of file and storage transfer structures.
5) The implementation of Information Lifecycle Management solutions to
eliminate low-value information as early as possible before putting the rest
into actively managed long-term storage in which it can be quickly and
cheaply accessed.

Data Center:
A data center is a facility used to house computer systems and associated components,
such as telecommunications and storage systems. It generally includes redundant or
backup power supplies, redundant data communications connections, environmental
controls (e.g., air conditioning, fire suppression), and special security devices.

Data centers have their roots in the huge computer rooms of the early ages of the
computing industry. Early computer systems were complex to operate and maintain,
and needed a special environment to keep working. A lot of cables were necessary to
connect all the parts. Also, old computers required a lot of power, and had to be
cooled to avoid overheating. Security was important; computers were expensive, and
were often used for military purposes. For these reasons, engineering practices were
developed from the start of the computing industry. Elements such as standard racks
to mount equipment, elevated floors, and cable trays (installed overhead or under the
elevated floor) were introduced in this early age, and have changed relatively little
compared to the computer systems themselves.

During the boom of the microcomputer industry, and especially during the 1980s,
computers started to be deployed everywhere, in many cases with little or no care
about operating requirements. However, as information technology (IT) operations
started to grow in complexity, companies grew aware of the need to control IT
resources. With the advent of client-server computing during the 1990s,
microcomputers (now called "servers") started to find their places in the old computer
rooms. The availability of inexpensive networking equipment, coupled with new
standards for network cabling, made it possible to use a hierarchical design which put
the servers in a specific room inside the company. The use of the term "data center",
as applied to specially designed computer rooms, started to gain popular recognition
about this time.

The boom of data centers came during the dot-com bubble. Companies needed fast
Internet connectivity and non-stop operation to deploy systems and establish a
presence on the Internet. Many companies started building very large facilities, called
"Internet data centers", or IDCs, which provide businesses with a range of solutions
for systems deployment and operation. New technologies and practices were designed
to handle the operational requirements of such large scale operations. These practices
eventually migrated towards the private data centers, and were largely adopted
because of their practical results.

As of 2007, data center design, construction, and operation is a well-known
discipline. Standard documents from accredited professional groups, such as the
Telecommunications Industry Association, specify the requirements for data center
design. Well-known operational metrics for data center availability can be used to
evaluate the business impact of a disruption. There is still a lot of development being
done in operation practice, and also in environmentally-friendly data center design.

A data center can occupy one room of a building, one or more floors, or an entire
building. Most of the equipment is often in the form of servers racked up into 19 inch
rack cabinets, which are usually placed in single rows forming corridors between
them. This allows people access to the front and rear of each cabinet. The physical
environment of the data center is usually under strict control:

1) Air conditioning is used to keep the room cool; it may also be used for humidity
control. Generally, temperature is kept around 20-22 degrees Celsius (about 68-72
degrees Fahrenheit). The primary goal of data center air conditioning systems is to
keep the server components at the board level within the manufacturer's specified
temperature/humidity range. This is crucial since electronic equipment in a confined
space generates much excess heat, and tends to malfunction if not adequately cooled.
Air conditioning systems also help keep humidity within acceptable parameters. The
humidity parameters are kept between 35% and 65% relative humidity. Too much
humidity and water may begin to condense on internal components; too little and
static electricity may damage components.

2) Data centers often have elaborate fire prevention and fire extinguishing systems.
Modern data centers tend to have two kinds of fire alarm systems; a first system
designed to spot the slightest sign of particles being given off by hot components, so a
potential fire can be investigated and extinguished locally before it takes hold
(sometimes, just by turning smoldering equipment off), and a second system designed
to take full-scale action if the fire takes hold. Fire prevention and detection systems
are also typically zoned and high-quality fire-doors and other physical fire-breaks
used, so that even if a fire does break out it can be contained and extinguished within
a small part of the facility.

3) Backup power is catered for via one or more uninterruptible power supplies and/or
diesel generators.

4) To prevent single points of failure, all elements of the electrical systems, including
backup system, are typically fully duplicated, and critical servers are connected to
both the "A-side" and "B-side" power feeds.

5) Old data centers typically have raised flooring made up of 60 cm (2 ft) removable
square tiles. The trend is towards 80-100cm void to cater for better and uniform air
distribution. These provide a plenum for air to circulate below the floor, as part of the
air conditioning system, as well as providing space for power cabling.

6) Using conventional water sprinkler systems on operational electrical equipment can
do just as much damage as a fire. Originally Halon gas, a halogenated organic
compound that chemically stops combustion, was used to extinguish flames.
However, the use of Halon has been banned by the Montreal Protocol because of the
danger Halon poses to the ozone layer. More environmentally-friendly alternatives
include Argonite and HFC-227.

7) Physical security also plays a large role with data centers. Physical access to the
site is usually restricted to selected personnel. Video camera surveillance and
permanent security guards are almost always present if the data center is large or
contains sensitive information on any of the systems within.

The main purpose of a data center is running the applications that handle the core
business and operational data of the organization. Communications in data centers
today are most often based on networks running the IP protocol suite. Data centers
contain a set of routers and switches that transport traffic between the servers and to
the outside world. Network security elements are also usually deployed: firewalls,
VPN gateways, intrusion detection systems, etc.

Data centers are also used for off site backups. Companies may subscribe to backup
services provided by a data center. Encrypted backups can be sent over the Internet to
a data center where they can be stored securely.
Evolution of Storage System:
Storage systems have become an important component of information technology.
Storage systems are built by taking the basic capability of a storage device, such as
the hard disk drive, and adding layers of hardware and software to obtain a highly
reliable, high-performance, and easily managed system.

The first data storage device was introduced by IBM in 1956. Since then there has
been remarkable progress in hard disk drive (HDD) technology, and this has provided
the fertile ground on which the entire industry of storage systems has been built.

It has long been recognized that the disk drive alone cannot provide the range of
storage capabilities required by enterprise systems. The first storage devices were
directly controlled by the CPU. The key advantage of a control unit (or controller)
was that the I/O commands from the CPU (sometimes called the host) were
independently translated into the specific commands necessary to operate the HDD
(sometimes called the direct access storage device, or DASD), and so the HDD device
itself could be managed independently and asynchronously from the CPU.

Storage systems leapt further ahead in the early 1990s when RAID (redundant array
of independent disks) technology was introduced. RAID allowed the coordination of
multiple HDD devices so as to provide higher levels of reliability and performance
than could be provided by a single drive. The classical concept of parity was used to
design reliable storage systems that continued to operate despite drive failures.
Parallelism was used to provide higher levels of performance. RAID technology was
delivered in low cost hardware and by the mid 1990s became standard on servers that
could be purchased for a few thousand dollars. Many variations on RAID technology
have been developed and are used in large external storage systems that provide
significant additional function, including redundancy (no single point of failure in the
storage system) and copy services (copying of data to a second storage system for
availability).
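
The parity idea can be shown in a few lines. With XOR parity (as used in RAID 5), the
parity block is the byte-wise XOR of the data blocks, so any single lost block can be
rebuilt from the surviving blocks and the parity. The sketch below is a minimal Python
illustration of that principle, not an actual RAID implementation.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]      # data blocks spread over three drives
parity = xor_blocks(data)                # parity block stored on a fourth drive

# Drive holding data[1] fails: rebuild its block from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```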

Disaster recovery became a requirement for all IT systems, and impacted the design
of storage systems. Several principles were developed to save data in case of disaster,
e.g., a point-in-time copy (offered by IBM under the name FlashCopy) is the making
of a consistent virtual copy of data as it appeared at a single point in time. This copy
is then kept up to date by following pointers as changes are made. If desired, this
virtual copy can, over time, be made into a real copy through physical copying. A
second technique, mirroring or continuous copy (offered by IBM under the name
Peer-to-Peer Remote Copy) involves two mirror copies of data, one at a primary
(local) site and one at a secondary (recovery) site. We say this process is synchronous
when data must be successfully written at the secondary system before the write
issued by the primary system is acknowledged as complete. Although synchronous
operation is desirable, it is practical only over limited distances (say, of the order of
100 km).
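
A point-in-time copy is commonly built on a copy-on-write idea: the snapshot initially
just references the original blocks, and a block is physically preserved only when it
is about to be overwritten. The sketch below is a simplified Python illustration of
that general technique, not a description of how FlashCopy is implemented internally.

```python
class Volume:
    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshot = {}                 # block index -> preserved old data

    def take_snapshot(self):
        self.snapshot = {}                 # virtual copy: no data moved yet

    def write(self, index, data):
        # Copy-on-write: preserve the old block the first time it changes.
        if index not in self.snapshot:
            self.snapshot[index] = self.blocks[index]
        self.blocks[index] = data

    def read_snapshot(self, index):
        # Snapshot view: the preserved block if it changed, else the current one.
        return self.snapshot.get(index, self.blocks[index])

vol = Volume([b"a", b"b", b"c"])
vol.take_snapshot()
vol.write(1, b"B")
print(vol.read_snapshot(1), vol.blocks[1])   # b'b' b'B'
```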

Further, the requirements for data availability were not completely satisfied by reliable
storage systems, as data could be accidentally erased (through human error or software
corruption), so additional copies were also needed for backup purposes. Backup
systems were then developed that allowed users to make a complete backup of selected
files or entire file systems. The traditional method of backup was to make a backup
copy on tape, or in the case of a personal computer, on a set of floppy disks or a small
tape cartridge. However, as systems became networked together, LAN-based backup
systems replaced media-oriented approaches, and these ran automatically and
unattended, often backing up from HDD to HDD. File-differential backup was
subsequently introduced, in which only the changed bytes within a file are sent and
managed at the backup server.

Note:

Backup systems are not as simple as they sound, because they must deal with many
different types of data (of varying importance), on a variety of client, server, and
storage devices, and with a level of assurance that may exceed that for the systems
they are backing up. For reasons of performance and efficiency, backup systems must
provide for incremental backup (involving only those files that have changed) and
file-differential backup (involving only those bytes in a file that have changed). These
techniques pose an exceptionally stringent requirement on the integrity of the meta-
data that are associated with keeping straight the versions of the backed-up data.
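
As a rough sketch of the incremental case (Python; the catalog format and directory
layout are assumptions for the example), a backup tool can compare each file's size
and modification time against a catalog saved by the previous run and copy only what
changed:

```python
import json
import os
import shutil

def incremental_backup(src_dir, dst_dir, catalog_path):
    """Copy only files whose size or mtime changed since the last run."""
    try:
        with open(catalog_path) as f:
            catalog = json.load(f)        # {file name: [size, mtime]}
    except FileNotFoundError:
        catalog = {}                       # first run: back up everything
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue
        st = os.stat(src)
        sig = [st.st_size, st.st_mtime]
        if catalog.get(name) != sig:       # new or changed file
            shutil.copy2(src, os.path.join(dst_dir, name))
            catalog[name] = sig
    with open(catalog_path, "w") as f:
        json.dump(catalog, f)
```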

Although the raw cost of HDD storage has declined, tertiary media such as tape or
optical disks continue to remain important, and therefore hierarchical storage systems
that manage these levels of storage are needed:

Hierarchical Storage Management (HSM) is a data storage technique which
automatically moves data between high-cost and low-cost storage media. HSM
systems exist because high-speed storage devices, such as hard disk drive arrays, are
more expensive (per byte stored) than slower devices, such as optical discs and
magnetic tape drives. While it would be ideal to have all data available on high-speed
devices all the time, this is prohibitively expensive for many organizations. Instead,
HSM systems store the bulk of the enterprise's data on slower devices, and then copy
data to faster disk drives when needed. In effect, HSM turns the fast disk drives into
caches for the slower mass storage devices. The HSM system monitors the way data
is used and makes best guesses as to which data can safely be moved to slower
devices and which data should stay on the fast devices.

In a typical HSM scenario, data files which are frequently used are stored on disk
drives, but are eventually migrated to tape if they are not used for a certain period of
time, typically a few months. If a user does reuse a file which is on tape, it is
automatically moved back to disk storage. The advantage is that the total amount of
stored data can be much larger than the capacity of the disk storage available, but
since only rarely-used files are on tape, most users will usually not notice any
slowdown.
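
A minimal sketch of the migrate-and-recall idea is shown below (Python; the directory
names, the 90-day threshold, and the reliance on file access times are assumptions,
and a real HSM would leave a stub file and drive a tape library rather than move files
between directories).

```python
import os
import shutil
import time

DISK, TAPE = "/fast/disk", "/slow/tape"      # hypothetical tier directories
AGE_LIMIT = 90 * 24 * 3600                   # migrate after ~3 months idle

def migrate_cold_files():
    """Move files not accessed within AGE_LIMIT from the disk tier to the tape tier."""
    now = time.time()
    for name in os.listdir(DISK):
        path = os.path.join(DISK, name)
        if os.path.isfile(path) and now - os.stat(path).st_atime > AGE_LIMIT:
            shutil.move(path, os.path.join(TAPE, name))

def recall(name):
    """Bring a migrated file back to the disk tier and return its path."""
    disk_path = os.path.join(DISK, name)
    if not os.path.exists(disk_path):
        shutil.move(os.path.join(TAPE, name), disk_path)
    return disk_path
```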

HSM is sometimes referred to as tiered storage. Tiered storage is a data storage
environment consisting of two or more kinds of storage delineated by differences in at
least one of these four attributes: price, performance, capacity, and function. For
example:
• Disk and Tape: Two separate storage tiers identified by differences in all four
defining attributes.
• Old technology disk and new technology disk: Two separate storage tiers
identified by differences in one or more of the attributes.
• High performing disk storage and less expensive, slower disk of the same
capacity and function: Two separate tiers.
• Identical Enterprise class disk configured to utilize different functions such as
RAID level or replication: A separate storage tier for each set of unique
functions.

HSM was first implemented by IBM on their mainframe computers to reduce the cost
of data storage, and to simplify the retrieval of data from slower media. The user
would not need to know where the data was stored and how to get it back; the
computer would retrieve the data automatically. The only difference to the user was
the speed at which data was returned.

The emergence of low-cost LAN technology drove the most significant trend of the
late 1980s and early 1990s in storage systems. PCs became networked and the
client/server computing model emerged. The widespread availability of the NFS
(Network File System) file-sharing protocols caused further specialization and the
emergence of file servers. The next step was the emergence of NAS (network-
attached storage) systems, which bundled the network file serving capability into a
single specialized box, typically serving standard protocols such as NFS, CIFS
(Common Internet File System), HTTP (Hypertext Transfer Protocol), and FTP (File
Transfer Protocol). NAS systems were simple to deploy because they came packaged
as “appliances,” complete with utilities and management functions.

At the same time, IT organizations were attempting to “regain control” over the
dispersed assets characteristic of client/server computing. The data center took on a
renewed importance in most enterprises. To that end, multiple servers in a machine
room sought the capability to access their backend storage without necessarily having
individual storage directly attached and therefore dedicated to an individual server.
This caused the emergence of storage area networks (SANs). SAN technology opened
up new opportunities in simplified connectivity, scalability, and cost and capacity
manageability. Fibre Channel became the predominant networking technology and
large storage systems, such as IBM TotalStorage Enterprise Storage Server (ESS),
support this protocol and use it to attach multiple servers. To avoid confusion, one
should think of NAS systems as working with files and file access protocols; whereas
a SAN enables access to block storage (the blocks of data may be stored on a storage
system or an HDD).

Thus, in short, the evolution of storage follows this hierarchy:

1) Embedded (disks with a server, parallel buses such as SCSI, IDE, EIDE)
2) DAS
3) NAS
4) SAN
Figure: Embedded system storage (PCs on a LAN accessing disks D1 and D2 attached directly to a server).

3-Tier Client/Server Storage Model:

1) Presentation of data: top tier (desktops, PCs, or network computers).
2) Processing of data: middle tier (application servers).
3) Storage of data: bottom tier (storage devices, interconnected by a SAN).

Figure: 3-tier client/server storage model. PCs at the presentation tier connect over a
LAN/WAN to servers at the processing tier, which access storage devices at the storage
tier over a storage area network.

Recognizing the fact that progress in base information technology had outrun our
systems management capabilities, in 2001 IBM published the Autonomic
Computing Manifesto. The manifesto is a call to action for industry, academia, and
government to address the fundamental problem of ease and cost of management. The
basic goal is to significantly improve the cost of ownership, reliability, and ease-of-
use of information technologies. To achieve the promise of autonomic computing,
systems need to become more self-configuring, self-healing and self-protecting, and
during operation, more self-optimizing.

Autonomic computing is performed at three levels:

1) The first level is the component level in which components contain features
which are self-healing, self-configuring, self-optimizing, and self-protecting.
2) At the next level, homogeneous or heterogeneous systems work together to
achieve the goals of autonomic computing, e.g. the Collective Intelligent
Brick System (researched at IBM) allows "bricks" within a structure to take
over from one another, or help each other carry a surge of load. Systems
behaving this way can be designed and packaged for "fail-in-place" operation.
This means that failing components (or bricks) need not necessarily be
replaced; they can be worked around, i.e. through the use of redundancy and
reconfiguration in case of failure. Using this capability we can package storage
in three dimensions rather than two, because failing bricks in the interior of the
structure do not have to be replaced.
3) At the third and highest level of autonomic computing, heterogeneous systems
work together toward a goal specified by the managing authority. For
example, a specified class of data is assigned certain attributes of performance,
availability, or security. The data specification might be based on a file
subtree, a file ending, file ownership, or any other characteristic that can
define a subset of data. The Storage Tank SAN file system has the capability
to perform at this highest level of autonomic computing. It can achieve this
while obeying certain constraints on the data, for example, location, security
policies in force, or how the data may be reorganized. Given the virtual nature
of data and because of the existence of a separate meta-data manager, each file
can be associated with a policy that describes how it is to be treated for
purposes of performance, security, availability, and cost. This composite
capability of autonomic systems working toward goals, while adhering to
constraints, is referred to as policy management.
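
As a simplified illustration of this highest level (Python; the policy names and
matching rules are invented, and this is only a sketch of the general idea, not of the
Storage Tank implementation), each file can be mapped to a policy by characteristics
such as its subtree, extension, or owner, and the chosen policy then carries the
performance, availability, and retention attributes the system must honour:

```python
import fnmatch

# Hypothetical policies: each names the attributes the data must receive.
POLICIES = {
    "financial": {"tier": "high-performance", "copies": 2, "retain_years": 7},
    "scratch":   {"tier": "low-cost",         "copies": 1, "retain_years": 0},
    "default":   {"tier": "low-cost",         "copies": 1, "retain_years": 1},
}

# Rules examined in order: (glob pattern on path, required owner or None, policy).
RULES = [
    ("/finance/*", None,    "financial"),
    ("*.tmp",      None,    "scratch"),
    ("*",          "batch", "scratch"),
]

def policy_for(path: str, owner: str) -> dict:
    """Return the policy attributes that apply to a given file."""
    for pattern, rule_owner, name in RULES:
        if fnmatch.fnmatch(path, pattern) and rule_owner in (None, owner):
            return POLICIES[name]
    return POLICIES["default"]

print(policy_for("/finance/q3-report.xls", "alice"))   # the 'financial' policy
```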

Future challenges:
Certain known requirements for storage systems are accelerating and new paradigms
are emerging that provide additional challenges for the industry. The volume of data
being generated continues to increase exponentially, but a new source of data is
becoming increasingly prevalent: machine-generated data. Until now most data have
been authored by hand, using the keyboard, and widely distributed and replicated
through networking. This includes data in databases, textual data, and data on Web
pages. However, these types of data are now being dwarfed by machine-generated
data, that is, data sourced by digital devices such as sensors, surveillance cameras, and
digital medical imaging. The new data require extensive storage space and
nontraditional methods of analysis and, as such, provide new challenges to storage
and data management technologies.

Another important trend is in the growth of reference data. Reference data are
commonly defined as stored data that are only infrequently retrieved (if at all). As an
example, consider the archived copies of customer account statements—these are
rarely accessed after an initial period.

Another problem area of growing interest concerns the long-term preservation of data
and the backup of data in case of disaster. Paper is often decried as an inferior storage
medium because it is susceptible to water and fire damage. On the other hand
electronic media is not long lasting, as the formats of our digital records are subject to
change, and the devices and programs to read these records are relatively short-lived.
The most viable solution for such a challenge is to create data in a form that is self-
describing; that is, it comes with the data structures and programs needed to interpret
the data, coded in a simple universal language, for which an interpreter can easily be
created at some later time.

A related trend is the growing importance of higher levels of availability for storage
systems. It is generally acknowledged that decreasing the down time (e.g., an increase
of availability from 0.99999 to 0.999999) represents a significant engineering and
operational challenge as well as additional expense. Traditional RAID systems (e.g.,
RAID 5) are reaching a point where they cannot support the higher levels of
reliability, performance, or storage density required. Further, storage space on HDDs
has outrun the rate at which data can be accessed, and rebuild times have increased
correspondingly. Although the probability of undetected write errors on HDDs is
small, the massive increase in storage space will over time increase the likelihood that
problems from these errors are encountered. So new approaches to storage structures
are being researched that rely on alternative coding schemes, methods of adaptively
creating additional copies of data, and also super-dense packaging of disk drives and
associated control circuitry.
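
The significance of each extra "nine" is easy to quantify: annual downtime is roughly
(1 - availability) times the hours in a year, so the step from five to six nines
mentioned above shrinks the downtime allowance from minutes to seconds, as the small
calculation below shows.

```python
HOURS_PER_YEAR = 365.25 * 24

for availability in (0.99999, 0.999999):
    downtime_min = (1 - availability) * HOURS_PER_YEAR * 60
    print(f"{availability}: about {downtime_min:.1f} minutes of downtime per year")

# 0.99999  -> about 5.3 minutes of downtime per year
# 0.999999 -> about 0.5 minutes (roughly 32 seconds) per year
```
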
A combination of trends is making the requirement of business continuance (the
business continues to operate in the face of IT failures or disaster) more challenging.
Pending SEC regulation, insurance company requirements, and concerns about
national and international security are placing increased demands on storage
architectures. The costs of down time for many enterprises have generally been
acknowledged to be large and increasing. Systems can no longer be taken down for
purposes of backing up data. Thus backup time is a problem, and restore time is an
even more severe one.
Components of Optimized Storage Deployment:

Figure: Optimized storage deployment builds on networked storage architectures,
utilization and efficiency, availability, and business continuity practices, resting on
a foundation of storage competence and agility and on thinking both defensively and
offensively.

Basic Storage Building Blocks:

1) Operating system
2) File system
3) Volume management
4) Storage devices (disks, LUNs)


The Shared Storage Model:
Storage Management Challenges:

1) Variety of information: information technology holds the promise of bringing
a variety of new types of information to the people who need it.
2) Volume of data: data is growing exponentially.
3) Velocity of change: IT organizations are under tremendous pressure to deliver
the right IT services. 85% of problems are caused by changes made by the IT
staff, and 80% of problems are not detected by the IT staff until reported by the
end user.
4) Leverage information: capitalize on data sharing for collaboration, on storage
investments, and on informational value. Leveraging information can be
addressed through reporting and data classification. Questions that may be
asked are:
a) How much storage do I have available for my applications?
b) Which applications, users and databases are the primary consumers of
my storage?
c) When do I need to buy more storage?
d) How reliable is my SAN?
e) How is my storage being used?
5) Optimize IT: automate and simplify IT operations, and optimize performance
and functionality. Optimizing IT can be addressed by centralizing management
and by storage virtualization. Questions asked are:
a) How do I simplify and centralize my storage infrastructure?
b) How do I know the storage is not the bottleneck for user response time
issues?
c) Is the storage infrastructure available and performing as needed?
6) Mitigate risks: comply with regulatory and security requirements, and keep
your business running continuously. Mitigating risks can be addressed by
tiered storage and ILM. Questions asked are:
a) How do I monitor and centrally manage my replication services?
b) How do I maintain storage service levels?
c) Which files must be backed up, archived and retained for compliance?
7) Enable business flexibility: provide a flexible, on demand IT infrastructure
and protect your IT investments. Enabling business flexibility can be addressed
by service management. Questions are:
a) How can I automate the provisioning of my storage systems, databases,
file systems and SAN?
b) How do I maintain storage service levels?
c) How do I monitor and centrally manage my replication services?

What needs to be managed?

1) Servers
   • Applications
   • Databases
   • File systems
   • Volume managers
   • Host bus adapters and multi-path drivers
2) Network components
   • Switches, hubs, routers
   • Intelligent switch replication
3) Storage components
   • Volume mapping / virtualization
   • Storage array provisioning
   • NAS filers
   • Tape libraries
4) Discovery
   • Topology views
   • Asset management
5) Configuration management
   • Provisioning
   • Optimization
   • Problem determination
6) Performance management
   • Bottleneck analysis
   • Load balancing
7) Reporting
   • Asset / capacity / utilization
   • Accounting / chargeback
   • Performance / trending
   • Problem reports

Storage Resource Management:


Components of SRM Application:

How does SRM Help?

1)
2)

How can you start SRM?


Level 1 and 2 Reporting:
Level 1 and 2 optimization:
A Few Issues Related to Data:

Data identity: Persistent Unique Identifiers (or an alternative means to achieve this
functionality) will enable global cross-referencing between data objects. Such
identifiers will not only be used for data and software but also for other resources
such as people, equipment, organizations, etc.
On the other hand, any scheme of identification is likely to undergo evolution so
preservation, and in particular integration of archival and current data, is likely to
require active management of identifiers.

Data objects: Data will be made available with all the necessary metadata to enable
reuse. From its creation, and throughout its lifecycle, data will be packaged with its
metadata which will progressively accrue information regarding its history throughout
its evolution.
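
A minimal sketch of such a self-describing data object (Python; the field names are
illustrative assumptions) keeps the payload and its metadata together and appends a
provenance entry each time the object is touched:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List

@dataclass
class DataObject:
    payload: Any
    metadata: Dict[str, Any]                  # schema, creator, units, ...
    provenance: List[Dict[str, Any]] = field(default_factory=list)

    def record(self, action: str, agent: str) -> None:
        """Accrue history as the object moves through its lifecycle."""
        self.provenance.append({
            "action": action,
            "agent": agent,
            "at": datetime.now(timezone.utc).isoformat(),
        })

obj = DataObject(payload=[1.2, 3.4],
                 metadata={"creator": "sensor-17", "units": "mV"})
obj.record("created", "acquisition-tool")
obj.record("calibrated", "analysis-pipeline")
```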

Data agents: Data will be “intelligent” in that it maintains for itself knowledge of
where it has been used as well as what it uses. (This can be achieved by bidirectional
links between data and its uses or by making associations between data themselves
stand alone entities. In either case, active maintenance of the associations is required.)

Software: Software will join simulations, data, multimedia and text as a core research
output. It will therefore require similar treatment in terms of metadata schemas,
metadata creation and propagation (including context of software development and
use, resilience, versioning and rights).

“Data Forge”: Rather like SourceForge for software, we imagine a global (probably
distributed) self-service repository of the data which is made available under a variety
of access agreements.

There is a requirement on the data management technology for greater ease in data
collection, interoperation, aggregation and access. Technology and tooling must be
developed to meet these requirements both in the manifestation of the data itself and
the software that manages it. Critical to this will be the collection, management and
propagation of metadata along with the data itself. Technology and tools must be
developed which facilitate:

1. Metadata schema creation with optimal reuse of existing schema, hence
maximizing the potential for data interoperation,
2. Metadata creation as early as possible and as automatically as possible, and
3. Metadata propagation, so that data and metadata are managed together by
applications, hence enabling additional types of analysis of the data at each stage in
its evolution (provenance, audit, etc).

Along with these technologies, the “library” of the data objects themselves will have
to be developed.
Data created by physical research:
The future e-infrastructure will reduce the cycle time from conducting research,
through analysis, publication and feedback into new research.

Issues
Metadata: Key to reuse of data is the (automated) collection of metadata – as much
as possible and as early as possible. The ideal is that all the information about an
experiment, the environment, the people as well as the data flowing from the
equipment and samples is captured in a digital manner as early on in the process as
possible.

Research plans: For hypothesis driven research, the research plan will provide a
digital context to form the basis of automated capture of data and metadata and the
policy for data distribution. For exploratory research, the analysis tools will provide
some assistance in gathering of metadata.

Real-time analysis: Real-time analysis of data enables new possibilities for efficiency,
such as a “just enough” approach to experimental planning which limits data collection
to just enough to prove or disprove the hypothesis, thereby releasing researchers and
other resources to move on to the next experiment more quickly. This type of “Heads up”
process management (perhaps supported by expert systems technology) should be
used to optimize the efficiency of the research process by dynamically modifying the
experiment as partial results become available. (Note that the modeling involved is
likely to be non-linear and that statistical analysis will be required as well as
application specific data analysis.)

Data Acquisition Tools: Many instruments come with manufacturer specific data
capture tools. Pressure is increasing on the manufacturers to conform to standards, at
least in the way the data collected can be exported from their software.

Data Analysis Software: This is often subject and experiment specific. Software
packages such as IGOR, R, even EXCEL, and plotting packages need to inter-operate in a
more robust manner, and the ability to keep track of the workflow, provenance, etc. is
crucial.

Virtual Access to Physical Facilities: Virtual access to physical facilities will
become increasingly important as larger teams are needed in multi-disciplinary
research which can't all be physically present at an experiment. There is a push to
provide remote access to equipment and the resulting data. There is also a need for
provision of other services to support the data.

Electronic Lab Notebooks: ELN is a rapidly growing area. Industry is concerned
with information capture, re-use, enforcement and legal issues. Academic labs are
more concerned with versatility. The Human Computer Interfaces of ELNs are very
important.

The quantity of data and metadata captured by this approach will be a significant
increase on current data flows, which are already rising exponentially. The quality of
data collected in this manner will mark it out as much more useful in the longer term.
There is a role here for a hierarchy of storage systems corresponding to the different
administrative domains active in the research environment (researcher, research
group, department, university, national collection, journal, private database holders,
etc.).

Data created by e-research:


There will be a much greater use of e-research and much tighter integration of
physical research with e-research.

Issues

New opportunities: e-research, that is, research conducted entirely on computers,
such as in simulations, is playing an increasingly important role in the process of
scientific discovery. As well as opening up new avenues of research it can also
provide a reduction in cost of several orders of magnitude and so open up the
possibility of a vastly increased number of experiments. With these increases in scale
come requirements for improved reliability, data analysis and statistical analysis
software, etc.

Auditability: There will be complex issues surrounding accuracy, reproducibility
and audit, requiring a sensitive analysis of the research outputs. In e-research,
just as in physical research, there will be a need to automatically “understand” and
make use of previous data. An e-research equivalent of the lab notebook culture will
need to be developed.

Integration of e-research and physical research: Computation and simulation in
real time will increasingly become an integral part of the process of undertaking physical
experiments, with e-results being used to drive the collection and analysis of physical
processes. This will enable much more effective and efficient use of physical
resources.

Large scale research: The e-infrastructure will also be required to support the large
scale data processing and analysis driven by large scale e-experiments and large scale,
community based research involving hundreds or thousands of researchers linked by
webs & grids. The possibility of conducting such large scale research will open up
new avenues for research in many disciplines.

There is considerable scope for improvement in research efficiency, and indeed in
research capability, by closer integration of physical and simulation based research.

Data created by digitisation and repurposing:


The future e-Infrastructure will support the use for research purposes of data collected
for other purposes.

Issues:
Strategic Investment: Digitization is not an end in itself; it is a means to an end. Whilst
the last decade has seen a rush to digitize within the UK public sector, supported by
considerable sums of money, as well as providing digital content, the experience gained
on these digitization programmes has helped build capacity with ICT and has amplified
strategic weaknesses such as digital preservation needs.

Repurposing of data for research: Increasingly, scientific research will be done by
harvesting data that already exists for other purposes. Data standards across
government departments will enable large-scale research, particularly in sociological
and medical disciplines. New research methods will also be enabled by the e-
infrastructure and its ability to co-ordinate large numbers of technologically equipped
“amateur” scientists. Community supported research conducted by interest groups
such as environmental or ornithological organizations will become increasingly
important and valid. Data collected by the private sector (for example from loyalty
cards) will also be available for scientific as well as commercial marketing purposes.

Public engagement: The e-infrastructure will also support initiatives promoting the
public engagement of science.

In order to maximize the re-use of a digital object, we have to identify, reduce and,
where possible, eliminate the barriers to re-use. The use cases are:
• The digital surrogate
• The digital reference collection
• The digital research archive
• The digital learning object
• The digital archive

Digital surrogates provide access to objects that cannot be easily reproduced or
distributed.
A digital reference collection has two likely uses: as a finding aid for other resources
and as an authority upon which analysis or discussion can be built.
The digital research archive is the digital residue that derives from research. A
research project may quite legitimately have an analogue output as the principal goal
of its research: classically a book, article or monograph.
Digital learning objects have seen particular expenditure over the last ten years and
the literature on them is extensive.
The digital archive consists of the digital objects associated with an event or process.

5 steps of data gathering electronically:


Nowadays the number of document productions involving electronic data is increasing.
Gathering electronic information such as email messages, letters, memos,
spreadsheets and other critical electronic documents can be a complicated and
expensive process if not planned and carried out correctly.

So one should follow these steps while collecting electronic information.


Step 3: Identify relevant data.

It is important to gather as much specific information as possible about the layout of
the organization's information services. This includes:
Step 4: Prepare data gathering plan

Once the interview process is complete and the information is aggregated, a
customized retrieval plan can be developed. A data gathering plan includes:

• A diagram of the data to be gathered.
• A project plan for all physical locations.
• A summary of the anticipated impact on operations, and a plan for minimizing
the impact on business operations.
