
Memory Compression for Virtualized Environments

Nitin Gupta (ngupta@vflare.org)

Problem Statement
Virtualization provides independent containers, each capable of running an entire OS stack. This enables a scale-out model of scalability, where multiple Virtual Machines (VMs) run on the same host and can approach, and sometimes even exceed, the performance of a single instance running natively. Virtualization thus presents a highly simplified model of parallelization and is driving the computing industry towards ever larger core counts.

However, memory remains a bottleneck. While servers with 64 or more cores will soon be common, DRAM prices remain prohibitive. To address this, solutions like Kernel Shared Memory (KSM) – which merges duplicate pages – have been developed for the Linux kernel. However, this alone is far from sufficient to allow the high consolidation ratios (number of VMs per host) needed to fully exploit server resources, and support for higher consolidation ratios is essential for the success of a virtualization solution.

Proposed Solution
Memory Compression – compressing relatively unused VM pages and storing them in memory itself – can provide significant memory savings. This should allow hosting more VMs for a given amount of RAM.

Under memory overcommit, performance degradation is severe due to slow I/O to swap disks. Compression and decompression, however, are several orders of magnitude faster than disk I/O, so performance under memory overcommit will degrade far more gracefully than with slow swap I/O.

Existing Solutions
Starting with Linux kernel 2.6.33, the ramzswap (aka compcache) driver has been accepted into the staging tree. The driver creates virtual block devices which act (only) as swap disks; pages swapped to these disks are compressed and stored in memory itself. The implementation is now stable and works well. However, it has several problems:

- It is just a virtual swap device, so it cannot compress page cache (filesystem backed) pages
- It suffers from the unnecessary overhead of the block I/O layer
- It is difficult to dynamically adjust the compressed cache size, since the cache must be presented as a fixed-size block device
- It needs intrusive hooks within the kernel, which suggests that it is better to integrate the entire feature properly with the swap/page cache code instead

Considering the above problems, and given how important this feature is for virtualization scenarios, memory compression should be integrated directly with the swap/page cache code to minimize overhead and gain maximum flexibility. With this approach it should also be much more efficient and cleaner to implement dynamic (compressed) cache resizing, which previous research [1][2] has shown to be critical for gaining any advantage from such a technique.

However, the work on ramzswap resulted in some important contributions which this new project aims
to reuse:

- The xvmalloc memory allocator [3]: an O(1) memory allocator designed specifically for handling variable-sized compressed chunks. For all practical workloads, observed fragmentation was within 10%
- It proved that memory compression gives significant performance gains on desktops and embedded systems. It has long been part of popular distributions like Ubuntu, and is also part of (unofficial) builds of Google Android, where it obviates the need for a flash-based swap device, which suffers from slow writes and wear-leveling issues

Design
The Linux kernel already has a mechanism to determine relatively unused pages in the system. These are reclaimed according to the type of the page:

- Anonymous pages: written to swap disk (if present) and freed
- Clean pagecache pages: simply freed
- Dirty pagecache pages: flushed to disk and freed

The plan is to compress anonymous and clean pagecache pages. For dirty pagecache pages, a write (flush) is issued first; the now-clean page eventually comes up for reclaim again, at which point we try to compress it.

We hook into shrink_page_list(), which is called when LRU pages are to be reclaimed, and try to compress and store these pages. New nodes will be introduced under /proc/sys/vm/ to control the size of this compressed cache and to export statistics.

When a page fault occurs, we check if the page has been compressed (as explained later), decompress it, and proceed with the page fault as usual.

The compression will be done using the LZO compressor, since it offers very fast compression and decompression while providing good compression ratios. To store the compressed chunks, the xvmalloc memory allocator will be used, which is already included in the mainline staging tree.

Dynamic cache resizing

The compressed cache uses part of memory to store pages which would otherwise have been sent to physical disk(s). It is therefore important to ensure that it contains pages likely to be used in the future, and to periodically discard pages that remain unused in this “second chance” cache.

For this, we maintain an LRU list of compressed chunks to determine victim pages when shrinking this cache or making room for new pages (its size is bounded and configurable through /proc nodes). The pages are freed according to their type:
- Clean pagecache pages: simply free the compressed chunks
- Anonymous pages: decompress and send to the physical swap disk. If no physical swap disk is present, these pages cannot be freed until they are used (i.e. a page fault is triggered) again

(dirty pagecache pages are not stored)

A kernel thread will monitor the size of and hit rate for the compressed cache, i.e. the percentage of page faults serviced from this cache. If the hit rate is low or the size exceeds some threshold (configurable through /proc), the cache is shrunk as explained above. New pages are allocated for this cache only on demand, i.e. when new compressed chunks cannot fit into existing pages.

Defragmentation
[NOT within Google SoC 2010 timeline]

As compressed chunks are allocated and freed, the xvmalloc memory pool can become quite fragmented: under heavy use, fragmentation has been found to rise to about 50–60%. An efficient defragmentation scheme is therefore needed. In principle it should simply involve moving compressed chunks to contiguous memory locations, but the details are yet to be worked out; in particular, handling locking issues is expected to be a major challenge.

Implementation
The aim is to integrate well with the existing reclaim code and to minimize intrusive hooks scattered across the kernel. A high-level overview is shown first, followed by details highlighting the important data structures to be used and how they fit into existing code.

[Flowchart: in shrink_page_list(), every page on the reclaim list is passed to try_to_unmap() to unmap it from all PTEs. On success, clean pages – anonymous or filesystem backed – are compressed; dirty pages, and pages for which compression fails, take the usual path to the swap/FS disk via mapping->a_ops->writepage(), after which the page is freed.]

Figure 1: Journey of a page to compressed cache


The above flow chart shows how a page makes its way to the compressed cache. Note that if storing a page in the compressed cache fails, it simply follows the usual swap path. Compression can fail if the compressed size is more than half the original size (the default threshold) or the compressed cache is out of space (both bounds are configurable).

When a page fault occurs, we check whether the page is in the compressed cache (as explained below), in which case we decompress the page and proceed with the fault as usual.

To integrate well with the existing reclaim path, a “fake” struct address_space is created:

struct address_space zmem_space = {
        .page_tree      = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
        .tree_lock      = __SPIN_LOCK_UNLOCKED(zmem_space.tree_lock),
        .a_ops          = &zmem_aops,
        ...
        .assoc_mapping  = f_mapping,
};

static const struct address_space_operations zmem_aops = {
        .writepage      = zmem_writepage,
        ...
};

Thus, shrink_page_list() → pageout() → mapping->a_ops->writepage() will now end up calling zmem_writepage(), which compresses the given page and stores it in the compressed cache (the xvmalloc memory pool). A radix tree (zmem_space.page_tree) is used to maintain the [pfn → address in compressed cache] mapping.

For anonymous pages, we use a swap type (see swp_entry_t) as the compressed swap device: SWP_COMPRESSED (just like SWP_HWPOISON, SWP_MIGRATION_READ and SWP_MIGRATION_WRITE). This makes it easy to handle the compressed-page case in do_swap_page(), which invokes handler functions depending on the swap type.

For filesystem backed pages, zmem_space->f_mapping points to the mapping created for tracking compressed page cache pages. The corresponding radix tree maps [file offset → address in compressed cache]. This tree is searched when the normal page cache does not contain the page (see filemap_fault()).

References
[1] Rodrigo S. de Castro, Alair Pereira do Lago, Dilma Da Silva, "Adaptive Compressed Caching: Design and Implementation", Proceedings of the 15th Symposium on Computer Architecture and High Performance Computing, p. 10, November 10–12, 2003
[2] Irina Chihaia Tuduce, Thomas Gross, "Adaptive Main Memory Compression", ATEC '05: Proceedings of the USENIX Annual Technical Conference, 2005
[3] xvmalloc memory allocator: http://code.google.com/p/compcache/wiki/xvMalloc
