Ian F. Adams, Darrell D. E. Long, Ethan L. Miller
University of California, Santa Cruz

Shankar Pasupathy
NetApp

Mark W. Storer
Pergamum Systems

Supported in part by the Petascale Data Storage Institute under Department of Energy award DE-FC02-06ER25768 and by the industrial sponsors of the Storage Systems Research Center.
Abstract
Traditionally, computing has meant calculating results and then storing those results for later use. Unfortunately, committing large volumes of rarely used data to storage wastes space and energy, making it a very expensive strategy. Cloud computing, with its readily available and flexibly allocatable computing resources, suggests an alternative: storing the provenance data and the means to recompute results as needed.
While computation and storage are equivalent, finding the balance between the two that maximizes efficiency is difficult. One of the fundamental challenges of this issue is rooted in the knowledge gap separating users and cloud administrators: neither has a completely informed view. Users have a semantic understanding of their data, while administrators have an understanding of the cloud's underlying structure. We detail the user knowledge and system knowledge needed to construct a comprehensive cost model for analyzing the trade-off between storing a result and regenerating it, allowing users and administrators to make an informed cost-benefit analysis.
1 Introduction
In traditional computing, storage is used to hold the results of computation. In this simple model, where the final computational state is preserved, results are simply read from storage each time they are needed. Cloud computing, with its promise of readily available, flexibly allocated computational resources, calls into question the relationship between processing and storage. Instead of storing a result, it may be more efficient to store the inputs, processes, and provenance data needed to regenerate that result on demand.
Figure 1: The three components involved with generating computed data: one or more inputs, a process that
transforms those inputs, and the ensuing result.
2 Discussion
The basic relationship governing the computation of results can be expressed with a simple three-entity model. As Figure 1 illustrates, computation is rooted in one or more inputs, which may themselves be the results of previous computations. These inputs are acted upon and transformed by a process. The output of that process is a result.
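This model maps naturally onto a small data structure. The sketch below is a minimal illustration in Python; the class name, fields, and recompute method are our own hypothetical choices, not a mechanism proposed in this paper:

from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass(frozen=True)
class Derivation:
    """Provenance for one computed result: its inputs and its process."""
    inputs: Tuple[object, ...]      # raw data, or other Derivations
    process: Callable[..., object]  # the transformation applied to the inputs

    def recompute(self) -> object:
        # Regenerate any inputs that are themselves derived results,
        # then re-run the process to reproduce this result.
        resolved = tuple(
            i.recompute() if isinstance(i, Derivation) else i
            for i in self.inputs
        )
        return self.process(*resolved)

# Example: store how a value was derived rather than the value itself.
total = Derivation(inputs=(3, 4), process=lambda a, b: a + b)
print(total.recompute())  # 7

Storing a Derivation in place of its output is precisely the trade-off the remainder of this section weighs.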
In a traditional model of computing, this result is stored for future use. In contrast, cloud computing's holistic view of storage and computation is well suited to intelligently choosing which results to store and which to recompute as needed. The efficiency gains of this approach can be likened to file compression, which trades some computation cost for storage efficiency.
2.1 Requirements
As a first step in determining when it is desirable to recompute results rather than store them, it is important to understand the conditions that make recomputation possible. A primary concern is the integrity constraint on the result. If there is a strict integrity constraint, the goal is to regenerate exactly the same result every time. Some results, however, carry only loose integrity constraints, under which any acceptable result will do. For example, a simulation process might generate different output each time it is run, yet each of the different results is admissible.

If inputs are being stored with an eye toward regenerating results, then the corresponding process is also required. This is especially true for results with a strict integrity constraint; for loose-integrity results, an alternate, slightly different process might suffice.
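The distinction matters operationally: strict integrity can be checked bit-for-bit, for instance against a stored hash, while loose integrity needs an application-supplied acceptance test. A hypothetical sketch; the hash scheme and the predicate are our illustrative assumptions, not a prescribed mechanism:

import hashlib
import pickle
from typing import Callable

def verify_strict(result: object, stored_digest: str) -> bool:
    """Strict integrity: the regenerated result must match bit-for-bit.
    Assumes the serialization of the result is canonical."""
    digest = hashlib.sha256(pickle.dumps(result)).hexdigest()
    return digest == stored_digest

def verify_loose(result: object, acceptable: Callable[[object], bool]) -> bool:
    """Loose integrity: any result passing the application's test is admissible."""
    return acceptable(result)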
If the process is to be stored and reused, there are a number of requirements that must be met. First, the process must be known, and it must be describable in some manner that renders it reusable. Second, for strict integrity, the process must be deterministic. This does not, however, preclude the storage of pseudo-random processes, so long as all of the necessary input seed values have also been stored. Third, the process must be re-executable over the desired lifetime of the result.
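The pseudo-random case is worth spelling out: a simulation satisfies a strict integrity constraint only if every source of randomness is captured as an input. A minimal sketch, assuming the process confines all of its randomness to an explicitly seeded generator:

import random

def simulate(seed: int, steps: int) -> list:
    """A pseudo-random process made deterministic by storing its seed."""
    rng = random.Random(seed)  # all randomness flows from the stored seed
    return [rng.gauss(0.0, 1.0) for _ in range(steps)]

# Storing (seed=42, steps=3) alongside the process is enough to
# regenerate a bit-identical result on demand.
assert simulate(42, 3) == simulate(42, 3)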
2.2 Cost Model

[Figure: the cost-model factors fall into three groups: result-specific issues, marginal costs, and trends.]

These factors can be stored as metadata associated with a result and used in conjunction with a cloud provider's cost model.
2.2.1 Result-Specific Issues
There are a number of factors, intrinsic to the result itself, that must be taken into account when choosing whether to store or recompute. These result-specific issues can be divided into low-level and high-level factors. Low-level factors describe a system-level view of how results are created, while high-level factors are based on the meaning of the result and are often highly subjective.
Provenance-aware systems track a number of low-level factors by constructing dependency graphs that capture the relationships between inputs, processes, and results. These graphs can be extended to record factors such as resource requirements and hashes of computed results. By recording resource requirements, it is possible to estimate how much lead time is required to recalculate a result. Hashes can be used to confirm reconstruction. As dependency chains grow longer, such data becomes increasingly important; the longer the chain of data products that must be regenerated, the greater the inclination to store at least some of the key intermediate values.
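In the same hypothetical vein as the earlier sketch, recording a per-node cost in the dependency graph makes lead-time estimation a simple traversal. The cost field, its units, and the sequential (non-parallel) assumption are ours:

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Node:
    """One entry in a provenance dependency graph."""
    compute_seconds: float         # recorded resource requirement for this step
    deps: Tuple["Node", ...] = ()  # inputs that are themselves derived
    stored: bool = False           # True if this intermediate value is kept

def lead_time(node: Node) -> float:
    """Estimate total time to regenerate a result, assuming sequential
    recomputation of every unstored link in the dependency chain."""
    if node.stored:  # a stored value needs no recomputation
        return 0.0
    return node.compute_seconds + sum(lead_time(d) for d in node.deps)

# A three-stage pipeline where the intermediate value was kept:
raw = Node(compute_seconds=120.0)
cleaned = Node(compute_seconds=300.0, deps=(raw,), stored=True)
report = Node(compute_seconds=45.0, deps=(cleaned,))
print(lead_time(report))  # 45.0: only the final stage must be re-run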
There are a number of high-level factors that require either an application-level or user-level understanding of a result. First, the likelihood and frequency with which a result will be used at a later time is highly variable. Files are not accessed uniformly; some files are very hot, while the vast majority of storage is rarely accessed [9]. Second, there is a potential concern over the time needed to recompute results: while stored results can be accessed nearly immediately, recomputing results can be time-intensive, particularly if inputs earlier in the dependency graph must be recomputed as well. Moreover, computation time can be greatly affected by the amount of parallelism available in the cloud and the degree to which the process can be parallelized to take advantage of it. The miss penalty is also a critical factor: does a miss incur a small financial cost, or is the penalty very high? If miss penalties are high, a strategy that provides the lowest possible access time will likely be optimal.
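These factors combine into a first-order cost comparison. The sketch below is our own back-of-the-envelope formulation rather than a model from this paper; every price and parameter is an illustrative assumption:

def store_cost(size_gb: float, months: int, storage_price: float) -> float:
    """Cost of keeping a result in cloud storage for its whole lifetime."""
    return size_gb * storage_price * months

def recompute_cost(accesses: float, compute_hours: float,
                   compute_price: float, miss_penalty: float) -> float:
    """Expected cost of regenerating the result on each access, plus the
    penalty paid for every access that must wait on recomputation."""
    return accesses * (compute_hours * compute_price + miss_penalty)

# Illustrative numbers: a 50 GB result kept 24 months at $0.10/GB-month,
# versus 2 expected accesses, 3 compute-hours each at $0.40/hour,
# with a $5 penalty per delayed access.
keep = store_cost(50, 24, 0.10)                  # 120.0
regen = recompute_cost(2, 3, 0.40, 5.0)          # 12.4
print("store" if keep < regen else "recompute")  # recompute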
Miss penalties are especially important to consider, since overbooking is a standard practice in many service-oriented businesses. One way that miss penalties could be mitigated is with the use of insurance: a service contract with a cloud provider could include a standard SLA (Service Level Agreement) detailing assured levels of service, along with an insurance clause offering monetary compensation to the user if the provider fails to provide service up to the SLA. Cloud providers such as Amazon's EC2 [1] and Rackspace [11] already provide basic SLAs.
Figure 2: Predicting costs can be difficult; some factors, such as the price per megabyte of hard drive storage, are fairly predictable, while others, such as residential, commercial, and industrial energy prices, are more volatile. (The figure plots hard drive cost in dollars per megabyte, and energy prices by sector, over 1973-2008.)
Kryder's law holds that hard drive storage density doubles approximately annually [14], while Moore's law states that the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years. Further, market volatility can also play into a cost-efficiency model, making it difficult to predict the future state of a volatile market. As Figure 2 shows, while the price of energy has trended upwards, it has not followed as predictable a growth rate as hard drive costs [4]. Similarly, while DRAM prices have trended lower, market forces such as supply and demand, and extra-market forces such as collusion and price fixing, conspire against prediction confidence [10]. Further complicating prediction are technological plateaus and disruptive technologies.
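Where a trend is stable, it can be folded directly into the cost model. A toy projection under an assumed exponential decline in storage cost; the halving rate and prices are illustrative, not measured values:

def projected_storage_price(price_today: float, years: float,
                            halving_period_years: float = 1.0) -> float:
    """Extrapolate storage cost under an assumed exponential decline, in the
    spirit of Kryder's law; useless for volatile inputs such as energy."""
    return price_today * 0.5 ** (years / halving_period_years)

# $0.10/GB-month today, projected three years out: $0.0125/GB-month.
print(projected_storage_price(0.10, 3))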
3 Related Work
The equivalence of storage and computation is well established, and several applications utilize storage as a replacement for repeated computation. For example, dynamic programming techniques exploit storage to avoid repeatedly solving the same subproblems.
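Memoization is the simplest form of this trade: a small amount of stored state replaces an exponential amount of recomputation. A standard illustration, not drawn from the paper:

from functools import lru_cache

@lru_cache(maxsize=None)  # storage standing in for recomputation
def fib(n: int) -> int:
    """Naively exponential; linear once subresults are stored and reused."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(90))  # answers instantly because intermediate results are cached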
4 Conclusions
The readily available, easily allocated computational utility promised by cloud computing calls into question the traditional role of storage as a way to preserve past computations. Instead of storing a result, it may be more cost-efficient to store the inputs, processes, and provenance data needed to regenerate the result on demand. In essence, computation can be used in place of storage. If data is unlikely to be reused, as is often the case, this approach may yield significant cost savings.
Deciding when to store a result and when to rely instead on computation comes down to a cost-benefit analysis. We have discussed the constraints and requirements for storing the inputs, the process, and the results. Furthermore, we presented the factors involved in a cost model covering three key areas: first, the semantic meaning of the data, such as miss penalties and the odds and frequency of reuse; second, marginal costs describing the cost of additional units of cloud utility; and third, forecasting to predict where prices will be in the future. These factors span the knowledge of both users and cloud administrators, motivating the need for methods of interaction between the two. By combining information from both users and cloud administrators, users can make informed decisions between storage and recomputation to minimize costs.
References
[1] Amazon. Amazon EC2 service level agreement. http://aws.amazon.com/ec2-sla, Oct. 2008.
[2] Amazon. AWS customer agreement. http://aws.amazon.com/agreement/, Apr. 2009.
[3] M. Armbrust et al. Above the clouds: A Berkeley view of cloud computing. Technical Report UCB/EECS-2009-28, UC Berkeley, Feb. 2009.
[4] Energy Information Administration. Monthly energy review, January 2009. http://www.eia.doe.gov/emeu/mer/contents.html, Jan. 2009.
[5] Fort Worth Star-Telegram. FBI raids Dallas computer firm. http://www.star-telegram.com, Apr. 2009.
[6] H. M. Gladney and R. A. Lorie. Trustworthy 100-year digital objects: Durable encoding for when it's too late to ask. ACM Transactions on Information Systems, July 2005.
[7] J. Gray. Distributed computing economics. ACM Queue, May 2008.
[8] M. Hellman. A cryptanalytic time-memory trade-off. IEEE Transactions on Information Theory, vol. 26, July 1980.
[9] A. W. Leung et al. Measurement and analysis of large-scale network file system workloads. In USENIX 2008 Annual Technical Conference.
[10] S. K. Moore. Price fixing in the memory market. IEEE Spectrum, Dec. 2004.
[11] Mosso. Rackspace SLA. http://www.mosso.com/downloads/sla_cloud_servers.pdf, Mar. 2009.
[12] K.-K. Muniswamy-Reddy et al. Provenance-aware storage systems. In USENIX 2006 Annual Technical Conference.
[13] L. Rao. HD Cloud puts video formatting in the cloud. http://www.techcrunch.com/2009/04/14/hd-cloud-puts-video-formatting-in-the-cloud/, Apr. 2009.
[14] C. Walter. Kryder's law. Scientific American, July 2005.
[15] P. Wayner. Cloud versus cloud: A guided tour of Amazon, Google, AppNexus, and GoGrid. http://www.infoworld.com/print/37122, July 2008.