|
@@ -0,0 +1,278 @@
|
|
|
|
+Frontswap provides a "transcendent memory" interface for swap pages.
|
|
|
|
+In some environments, dramatic performance savings may be obtained because
|
|
|
|
+swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.
|
|
|
|
+
|
|
|
|
+(Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends"
|
|
|
|
+and the only necessary changes to the core kernel for transcendent memory;
|
|
|
|
+all other supporting code -- the "backends" -- is implemented as drivers.
|
|
|
|
+See the LWN.net article "Transcendent memory in a nutshell" for a detailed
|
|
|
|
+overview of frontswap and related kernel parts:
|
|
|
|
+https://lwn.net/Articles/454795/ )
|
|
|
|
+
|
|
|
|
+Frontswap is so named because it can be thought of as the opposite of
|
|
|
|
+a "backing" store for a swap device. The storage is assumed to be
|
|
|
|
+a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
|
|
|
|
+to the requirements of transcendent memory (such as Xen's "tmem", or
|
|
|
|
+in-kernel compressed memory, aka "zcache", or future RAM-like devices);
|
|
|
|
+this pseudo-RAM device is not directly accessible or addressable by the
|
|
|
|
+kernel and is of unknown and possibly time-varying size. The driver
|
|
|
|
+links itself to frontswap by calling frontswap_register_ops to set the
|
|
|
|
+frontswap_ops funcs appropriately and the functions it provides must
|
|
|
|
+conform to certain policies as follows:
|
|
|
|
+
|
|
|
|
+An "init" prepares the device to receive frontswap pages associated
|
|
|
|
+with the specified swap device number (aka "type"). A "store" will
|
|
|
|
+copy the page to transcendent memory and associate it with the type and
|
|
|
|
+offset associated with the page. A "load" will copy the page, if found,
|
|
|
|
+from transcendent memory into kernel memory, but will NOT remove the page
|
|
|
|
+from transcendent memory. An "invalidate_page" will remove the page
|
|
|
|
+from transcendent memory and an "invalidate_area" will remove ALL pages
|
|
|
|
+associated with the swap type (e.g., like swapoff) and notify the "device"
|
|
|
|
+to refuse further stores with that swap type.
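The five operations above can be sketched as a small userspace model. This is an illustrative toy, not kernel code: the struct toy_frontswap_ops, the toy_* functions, and the fixed-size arrays are all hypothetical names and simplifications chosen for this example, and the pseudo-RAM device is modelled as a plain static array.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_NTYPES    2   /* swap devices modelled (assumption) */
#define TOY_NPAGES    8   /* page slots per swap device (assumption) */

/* Hypothetical ops table mirroring the five operations described above. */
struct toy_frontswap_ops {
    void (*init)(unsigned type);
    int  (*store)(unsigned type, unsigned long offset, const void *page);
    int  (*load)(unsigned type, unsigned long offset, void *page);
    void (*invalidate_page)(unsigned type, unsigned long offset);
    void (*invalidate_area)(unsigned type);
};

/* The "pseudo-RAM device": only the backend touches these arrays. */
unsigned char toy_data[TOY_NTYPES][TOY_NPAGES][TOY_PAGE_SIZE];
bool toy_present[TOY_NTYPES][TOY_NPAGES];
bool toy_accepting[TOY_NTYPES];

/* init: prepare to receive pages for this swap type. */
void toy_init(unsigned type)
{
    assert(type < TOY_NTYPES);
    toy_accepting[type] = true;
}

/* store: copy the page in and key it by (type, offset); -1 on rejection. */
int toy_store(unsigned type, unsigned long offset, const void *page)
{
    if (type >= TOY_NTYPES || offset >= TOY_NPAGES || !toy_accepting[type])
        return -1;
    memcpy(toy_data[type][offset], page, TOY_PAGE_SIZE);
    toy_present[type][offset] = true;
    return 0;
}

/* load: copy the page out if found, but do NOT remove it. */
int toy_load(unsigned type, unsigned long offset, void *page)
{
    if (type >= TOY_NTYPES || offset >= TOY_NPAGES ||
        !toy_present[type][offset])
        return -1;
    memcpy(page, toy_data[type][offset], TOY_PAGE_SIZE);
    return 0;
}

/* invalidate_page: forget a single page. */
void toy_invalidate_page(unsigned type, unsigned long offset)
{
    if (type < TOY_NTYPES && offset < TOY_NPAGES)
        toy_present[type][offset] = false;
}

/* invalidate_area: forget every page of the type and refuse new stores. */
void toy_invalidate_area(unsigned type)
{
    if (type >= TOY_NTYPES)
        return;
    memset(toy_present[type], 0, sizeof(toy_present[type]));
    toy_accepting[type] = false;
}

const struct toy_frontswap_ops toy_ops = {
    .init            = toy_init,
    .store           = toy_store,
    .load            = toy_load,
    .invalidate_page = toy_invalidate_page,
    .invalidate_area = toy_invalidate_area,
};
```

In this model a caller would invoke toy_ops.init(0) once for swap type 0, after which stores and loads keyed by (0, offset) behave as described; after toy_ops.invalidate_area(0) every store for that type is refused again.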
|
|
|
|
+
|
|
|
|
+Once a page is successfully stored, a matching load on the page will normally
|
|
|
|
+succeed. So when the kernel finds itself in a situation where it needs
|
|
|
|
+to swap out a page, it first attempts to use frontswap. If the store returns
|
|
|
|
+success, the data has been successfully saved to transcendent memory and
|
|
|
|
+a disk write and, if the data is later read back, a disk read are avoided.
|
|
|
|
+If a store returns failure, transcendent memory has rejected the data, and the
|
|
|
|
+page can be written to swap as usual.
|
|
|
|
+
|
|
|
|
+If a backend chooses, frontswap can be configured as a "writethrough
|
|
|
|
+cache" by calling frontswap_writethrough(). In this mode, the reduction
|
|
|
|
+in swap device writes is lost (and also a non-trivial performance advantage)
|
|
|
|
+in order to allow the backend to arbitrarily "reclaim" space used to
|
|
|
|
+store frontswap pages to more completely manage its memory usage.
|
|
|
|
+
|
|
|
|
+Note that if a page is stored and the page already exists in transcendent memory
|
|
|
|
+(a "duplicate" store), either the store succeeds and the data is overwritten,
|
|
|
|
+or the store fails AND the page is invalidated. This ensures stale data may
|
|
|
|
+never be obtained from frontswap.
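One way to honour the duplicate-store rule is to drop the old copy before deciding whether the new one fits. The sketch below is a userspace toy under stated assumptions: struct dup_backend, dup_store() and dup_load() are hypothetical, the backend is modelled as a simple byte budget, and an integer "version" stands in for the page contents.

```c
#include <stddef.h>

#define DUP_NSLOTS 4   /* page offsets modelled (assumption) */

struct dup_backend {
    size_t capacity;             /* total bytes the backend may use */
    size_t used;                 /* bytes currently in use */
    size_t size[DUP_NSLOTS];     /* stored size per offset, 0 = absent */
    int    version[DUP_NSLOTS];  /* stand-in for the page contents */
};

/*
 * Store a page whose (compressed) size is csize.  Any previous copy at
 * the same offset is released first, so a rejected store can never
 * leave stale data behind.
 */
int dup_store(struct dup_backend *b, unsigned long off,
              size_t csize, int version)
{
    if (off >= DUP_NSLOTS)
        return -1;

    /* Invalidate the old copy unconditionally. */
    b->used -= b->size[off];
    b->size[off] = 0;

    if (csize == 0 || csize > b->capacity - b->used)
        return -1;               /* rejected; old data is already gone */

    b->size[off] = csize;
    b->version[off] = version;
    b->used += csize;
    return 0;
}

/* Load succeeds only if a current copy exists. */
int dup_load(const struct dup_backend *b, unsigned long off, int *version)
{
    if (off >= DUP_NSLOTS || b->size[off] == 0)
        return -1;
    *version = b->version[off];
    return 0;
}
```

With a 4096-byte budget, storing a 1024-byte page at offset 0 and a 3072-byte page at offset 1 fills the backend. A later attempt to overwrite offset 0 with a 4096-byte page fails, and a subsequent load of offset 0 fails as well, so the caller must fall back to the copy it writes to the real swap device.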
|
|
|
|
+
|
|
|
|
+If properly configured, monitoring of frontswap is done via debugfs in
|
|
|
|
+the /sys/kernel/debug/frontswap directory. The effectiveness of
|
|
|
|
+frontswap can be measured (across all swap devices) with:
|
|
|
|
+
|
|
|
|
+failed_stores - how many store attempts have failed
|
|
|
|
+loads - how many loads were attempted (all should succeed)
|
|
|
|
+succ_stores - how many store attempts have succeeded
|
|
|
|
+invalidates - how many invalidates were attempted
|
|
|
|
+
|
|
|
|
+A backend implementation may provide additional metrics.
|
|
|
|
+
|
|
|
|
+FAQ
|
|
|
|
+
|
|
|
|
+1) Where's the value?
|
|
|
|
+
|
|
|
|
+When a workload starts swapping, performance falls through the floor.
|
|
|
|
+Frontswap significantly increases performance in many such workloads by
|
|
|
|
+providing a clean, dynamic interface to read and write swap pages to
|
|
|
|
+"transcendent memory" that is otherwise not directly addressable to the kernel.
|
|
|
|
+This interface is ideal when data is transformed to a different form
|
|
|
|
+and size (such as with compression) or secretly moved (as might be
|
|
|
|
+useful for write-balancing for some RAM-like devices). Swap pages (and
|
|
|
|
+evicted page-cache pages) are a great use for this kind of slower-than-RAM-
|
|
|
|
+but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and
|
|
|
|
+cleancache) interface to transcendent memory provides a nice way to read
|
|
|
|
+and write -- and indirectly "name" -- the pages.
|
|
|
|
+
|
|
|
|
+Frontswap -- and cleancache -- with a fairly small impact on the kernel,
|
|
|
|
+provides a great deal of flexibility for more dynamic RAM
|
|
|
|
+utilization in various system configurations:
|
|
|
|
+
|
|
|
|
+In the single kernel case, aka "zcache", pages are compressed and
|
|
|
|
+stored in local memory, thus increasing the total anonymous pages
|
|
|
|
+that can be safely kept in RAM. Zcache essentially trades off CPU
|
|
|
|
+cycles used in compression/decompression for better memory utilization.
|
|
|
|
+Benchmarks have shown little or no impact when memory pressure is
|
|
|
|
+low while providing a significant performance improvement (25%+)
|
|
|
|
+on some workloads under high memory pressure.
|
|
|
|
+
|
|
|
|
+"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
|
|
|
|
+support for clustered systems. Frontswap pages are locally compressed
|
|
|
|
+as in zcache, but then "remotified" to another system's RAM. This
|
|
|
|
+allows RAM to be dynamically load-balanced back-and-forth as needed,
|
|
|
|
+i.e. when system A is overcommitted, it can swap to system B, and
|
|
|
|
+vice versa. RAMster can also be configured as a memory server so
|
|
|
|
+many servers in a cluster can swap, dynamically as needed, to a single
|
|
|
|
+server configured with a large amount of RAM... without pre-configuring
|
|
|
|
+how much of the RAM is available for each of the clients!
|
|
|
|
+
|
|
|
|
+In the virtual case, the whole point of virtualization is to statistically
|
|
|
|
+multiplex physical resources across the varying demands of multiple
|
|
|
|
+virtual machines. This is really hard to do with RAM and efforts to do
|
|
|
|
+it well with no kernel changes have essentially failed (except in some
|
|
|
|
+well-publicized special-case workloads).
|
|
|
|
+Specifically, the Xen Transcendent Memory backend allows otherwise
|
|
|
|
+"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
|
|
|
|
+virtual machines, but the pages can be compressed and deduplicated to
|
|
|
|
+optimize RAM utilization. And when guest OSes are induced to surrender
|
|
|
|
+underutilized RAM (e.g. with "selfballooning"), sudden unexpected
|
|
|
|
+memory pressure may result in swapping; frontswap allows those pages
|
|
|
|
+to be swapped to and from hypervisor RAM (if overall host system memory
|
|
|
|
+conditions allow), thus mitigating the potentially awful performance impact
|
|
|
|
+of unplanned swapping.
|
|
|
|
+
|
|
|
|
+A KVM implementation is underway and has been RFC'ed to lkml. And,
|
|
|
|
+using frontswap, investigation is also underway on the use of NVM as
|
|
|
|
+a memory extension technology.
|
|
|
|
+
|
|
|
|
+2) Sure there may be performance advantages in some situations, but
|
|
|
|
+ what's the space/time overhead of frontswap?
|
|
|
|
+
|
|
|
|
+If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
|
|
|
|
+nothingness and the only overhead is a few extra bytes per swapon'ed
|
|
|
|
+swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
|
|
|
|
+registers, one extra global variable is compared to zero for
|
|
|
|
+every swap page read or written. If CONFIG_FRONTSWAP is enabled
|
|
|
|
+AND a frontswap backend registers AND the backend fails every "store"
|
|
|
|
+request (i.e. provides no memory despite claiming it might),
|
|
|
|
+CPU overhead is still negligible -- and since every frontswap fail
|
|
|
|
+precedes a swap page write-to-disk, the system is highly likely
|
|
|
|
+to be I/O bound and using a small fraction of a percent of a CPU
|
|
|
|
+will be irrelevant anyway.
|
|
|
|
+
|
|
|
|
+As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
|
|
|
|
+registers, one bit is allocated for every swap page for every swap
|
|
|
|
+device that is swapon'd. This is added to the EIGHT bits (which
|
|
|
|
+were sixteen until about 2.6.34) that the kernel already allocates
|
|
|
|
+for every swap page for every swap device that is swapon'd. (Hugh
|
|
|
|
+Dickins has observed that frontswap could probably steal one of
|
|
|
|
+the existing eight bits, but let's worry about that minor optimization
|
|
|
|
+later.) For very large swap disks (which are rare) on a standard
|
|
|
|
+4K pagesize, this is 1MB per 32GB swap.
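The 1MB-per-32GB figure follows directly from one bit per page: 32GB / 4KB is 8M pages, and 8M bits is 1MB. A minimal sketch of that arithmetic is below; toy_map_bytes() is a hypothetical helper written for this example and says nothing about how the kernel actually sizes or rounds its allocation.

```c
#include <stdint.h>

/*
 * Bytes needed for a one-bit-per-page map covering swap_bytes of swap
 * space, rounded up to a whole byte.
 */
uint64_t toy_map_bytes(uint64_t swap_bytes, uint64_t page_size)
{
    uint64_t pages = swap_bytes / page_size;

    return (pages + 7) / 8;
}
```

For example, toy_map_bytes(32ULL << 30, 4096) evaluates to 1048576, i.e. 1MB.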
|
|
|
|
+
|
|
|
|
+When swap pages are stored in transcendent memory instead of written
|
|
|
|
+out to disk, there is a side effect that this may create more memory
|
|
|
|
+pressure that can potentially outweigh the other advantages. A
|
|
|
|
+backend, such as zcache, must implement policies to carefully (but
|
|
|
|
+dynamically) manage memory limits to ensure this doesn't happen.
|
|
|
|
+
|
|
|
|
+3) OK, how about a quick overview of what this frontswap patch does
|
|
|
|
+ in terms that a kernel hacker can grok?
|
|
|
|
+
|
|
|
|
+Let's assume that a frontswap "backend" has registered during
|
|
|
|
+kernel initialization; this registration indicates that this
|
|
|
|
+frontswap backend has access to some "memory" that is not directly
|
|
|
|
+accessible by the kernel. Exactly how much memory it provides is
|
|
|
|
+entirely dynamic and random.
|
|
|
|
+
|
|
|
|
+Whenever a swap device is swapon'd, frontswap_init() is called,
|
|
|
|
+passing the swap device number (aka "type") as a parameter.
|
|
|
|
+This notifies frontswap to expect attempts to "store" swap pages
|
|
|
|
+associated with that number.
|
|
|
|
+
|
|
|
|
+Whenever the swap subsystem is readying a page to write to a swap
|
|
|
|
+device (cf. swap_writepage()), frontswap_store is called. Frontswap
|
|
|
|
+consults with the frontswap backend and if the backend says it does NOT
|
|
|
|
+have room, frontswap_store returns -1 and the kernel swaps the page
|
|
|
|
+to the swap device as normal. Note that the response from the frontswap
|
|
|
|
+backend is unpredictable to the kernel; it may choose to never accept a
|
|
|
|
+page, it could accept every ninth page, or it might accept every
|
|
|
|
+page. But if the backend does accept a page, the data from the page
|
|
|
|
+has already been copied and associated with the type and offset,
|
|
|
|
+and the backend guarantees the persistence of the data. In this case,
|
|
|
|
+frontswap sets a bit in the "frontswap_map" for the swap device
|
|
|
|
+corresponding to the page offset on the swap device to which it would
|
|
|
|
+otherwise have written the data.
|
|
|
|
+
|
|
|
|
+When the swap subsystem needs to swap-in a page (swap_readpage()),
|
|
|
|
+it first calls frontswap_load() which checks the frontswap_map to
|
|
|
|
+see if the page was earlier accepted by the frontswap backend. If
|
|
|
|
+it was, the page of data is filled from the frontswap backend and
|
|
|
|
+the swap-in is complete. If not, the normal swap-in code is
|
|
|
|
+executed to obtain the page of data from the real swap device.
|
|
|
|
+
|
|
|
|
+So every time the frontswap backend accepts a page, a swap device write
|
|
|
|
+and (potentially) a swap device read are replaced by a "frontswap backend
|
|
|
|
+store" and (possibly) a "frontswap backend load", which are presumably much
|
|
|
|
+faster.
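The swap-out and swap-in paths just described can be sketched as a userspace simulation. Everything here is an assumption made for illustration: struct sim_swapdev, the sim_* functions, the tiny 64-byte "page", and the accept-even-offsets policy are hypothetical and are not the kernel's swap_writepage()/swap_readpage() code.

```c
#include <stdbool.h>
#include <string.h>

#define SIM_PAGE   64   /* tiny "page" to keep the demo small */
#define SIM_NPAGES 16

struct sim_swapdev {
    unsigned char disk[SIM_NPAGES][SIM_PAGE];  /* the real swap device */
    unsigned char tmem[SIM_NPAGES][SIM_PAGE];  /* backend-owned copies */
    unsigned char map[(SIM_NPAGES + 7) / 8];   /* one bit per page */
    unsigned long disk_writes;
    unsigned long disk_reads;
};

/* Hypothetical backend policy: accept only even offsets. */
bool sim_backend_accepts(unsigned long off)
{
    return off % 2 == 0;
}

/* Swap-out: try the backend first, fall back to the real device. */
void sim_swap_out(struct sim_swapdev *d, unsigned long off, const void *page)
{
    unsigned char bit = (unsigned char)(1u << (off % 8));

    if (sim_backend_accepts(off)) {
        memcpy(d->tmem[off], page, SIM_PAGE);
        d->map[off / 8] |= bit;                  /* remember: page is in tmem */
        return;
    }
    d->map[off / 8] &= (unsigned char)~bit;      /* rejected: no stale copy */
    memcpy(d->disk[off], page, SIM_PAGE);
    d->disk_writes++;
}

/* Swap-in: consult the map, and only touch the disk if the bit is clear. */
void sim_swap_in(struct sim_swapdev *d, unsigned long off, void *page)
{
    if (d->map[off / 8] & (1u << (off % 8))) {
        memcpy(page, d->tmem[off], SIM_PAGE);
        return;
    }
    memcpy(page, d->disk[off], SIM_PAGE);
    d->disk_reads++;
}
```

Swapping out offsets 2 and 3 and then swapping both back in costs one disk write and one disk read in this model (for offset 3 only); offset 2 never reaches the simulated disk.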
|
|
|
|
+
|
|
|
|
+4) Can't frontswap be configured as a "special" swap device that is
|
|
|
|
+ just higher priority than any real swap device (e.g. zswap,
|
|
|
|
+ or maybe swap-over-nbd/NFS)?
|
|
|
|
+
|
|
|
|
+No. First, the existing swap subsystem doesn't allow for any kind of
|
|
|
|
+swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy,
|
|
|
|
+but this would require fairly drastic changes. Even if it were
|
|
|
|
+rewritten, the existing swap subsystem uses the block I/O layer which
|
|
|
|
+assumes a swap device is fixed size and any page in it is linearly
|
|
|
|
+addressable. Frontswap barely touches the existing swap subsystem,
|
|
|
|
+and works around the constraints of the block I/O subsystem to provide
|
|
|
|
+a great deal of flexibility and dynamicity.
|
|
|
|
+
|
|
|
|
+For example, the acceptance of any swap page by the frontswap backend is
|
|
|
|
+entirely unpredictable. This is critical to the definition of frontswap
|
|
|
|
+backends because it grants completely dynamic discretion to the
|
|
|
|
+backend. In zcache, one cannot know a priori how compressible a page is.
|
|
|
|
+"Poorly" compressible pages can be rejected, and "poorly" can itself be
|
|
|
|
+defined dynamically depending on current memory constraints.
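As a purely illustrative sketch of such a dynamic definition of "poorly", the function below tightens the acceptance threshold as free memory shrinks. The function zpolicy_accept(), its parameters, and the particular thresholds (one quarter to three quarters of a page) are invented for this example and do not describe zcache's actual policy.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical policy: the tighter memory is, the better a page must
 * compress to be accepted.  free_pct is the percentage of backend
 * memory still free (0..100).
 */
bool zpolicy_accept(size_t compressed_len, size_t page_size, unsigned free_pct)
{
    size_t limit;

    if (free_pct > 100)
        free_pct = 100;

    /* Limit grows linearly from page_size/4 (no free memory)
     * to 3*page_size/4 (all memory free). */
    limit = page_size / 4 + (page_size / 2) * free_pct / 100;

    return compressed_len <= limit;
}
```

Under this toy policy a 4K page that compresses to 2K is accepted when the backend is empty but rejected when it is full, which is exactly the kind of unpredictability the kernel side must tolerate.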
|
|
|
|
+
|
|
|
|
+Further, frontswap is entirely synchronous whereas a real swap
|
|
|
|
+device is, by definition, asynchronous and uses block I/O. The
|
|
|
|
+block I/O layer is not only unnecessary, but may perform "optimizations"
|
|
|
|
+that are inappropriate for a RAM-oriented device including delaying
|
|
|
|
+the write of some pages for a significant amount of time. Synchrony is
|
|
|
|
+required to ensure the dynamicity of the backend and to avoid thorny race
|
|
|
|
+conditions that would unnecessarily and greatly complicate frontswap
|
|
|
|
+and/or the block I/O subsystem. That said, only the initial "store"
|
|
|
|
+and "load" operations need be synchronous. A separate asynchronous thread
|
|
|
|
+is free to manipulate the pages stored by frontswap. For example,
|
|
|
|
+the "remotification" thread in RAMster uses standard asynchronous
|
|
|
|
+kernel sockets to move compressed frontswap pages to a remote machine.
|
|
|
|
+Similarly, a KVM guest-side implementation could do in-guest compression
|
|
|
|
+and use "batched" hypercalls.
|
|
|
|
+
|
|
|
|
+In a virtualized environment, the dynamicity allows the hypervisor
|
|
|
|
+(or host OS) to do "intelligent overcommit". For example, it can
|
|
|
|
+choose to accept pages only until host-swapping might be imminent,
|
|
|
|
+then force guests to do their own swapping.
|
|
|
|
+
|
|
|
|
+There is a downside to the transcendent memory specifications for
|
|
|
|
+frontswap: Since any "store" might fail, there must always be a real
|
|
|
|
+slot on a real swap device to swap the page. Thus frontswap must be
|
|
|
|
+implemented as a "shadow" to every swapon'd device with the potential
|
|
|
|
+capability of holding every page that the swap device might have held
|
|
|
|
+and the possibility that it might hold no pages at all. This means
|
|
|
|
+that frontswap cannot contain more pages than the total of swapon'd
|
|
|
|
+swap devices. For example, if NO swap device is configured on some
|
|
|
|
+installation, frontswap is useless. Swapless portable devices
|
|
|
|
+can still use frontswap but a backend for such devices must configure
|
|
|
|
+some kind of "ghost" swap device and ensure that it is never used.
|
|
|
|
+
|
|
|
|
+5) Why this weird definition about "duplicate stores"? If a page
|
|
|
|
+ has been previously successfully stored, can't it always be
|
|
|
|
+ successfully overwritten?
|
|
|
|
+
|
|
|
|
+Nearly always it can, but no, sometimes it cannot. Consider an example
|
|
|
|
+where data is compressed and the original 4K page has been compressed
|
|
|
|
+to 1K. Now an attempt is made to overwrite the page with data that
|
|
|
|
+is non-compressible and so would take the entire 4K. But the backend
|
|
|
|
+has no more space. In this case, the store must be rejected. Whenever
|
|
|
|
+frontswap rejects a store that would overwrite, it also must invalidate
|
|
|
|
+the old data and ensure that it is no longer accessible. Since the
|
|
|
|
+swap subsystem then writes the new data to the real swap device,
|
|
|
|
+this is the correct course of action to ensure coherency.
|
|
|
|
+
|
|
|
|
+6) What is frontswap_shrink for?
|
|
|
|
+
|
|
|
|
+When the (non-frontswap) swap subsystem swaps out a page to a real
|
|
|
|
+swap device, that page is only taking up low-value pre-allocated disk
|
|
|
|
+space. But if frontswap has placed a page in transcendent memory, that
|
|
|
|
+page may be taking up valuable real estate. The frontswap_shrink
|
|
|
|
+routine allows code outside of the swap subsystem to force pages out
|
|
|
|
+of the memory managed by frontswap and back into kernel-addressable memory.
|
|
|
|
+For example, in RAMster, a "suction driver" thread will attempt
|
|
|
|
+to "repatriate" pages sent to a remote machine back to the local machine;
|
|
|
|
+this is driven using the frontswap_shrink mechanism when memory pressure
|
|
|
|
+subsides.
|
|
|
|
+
|
|
|
|
+7) Why does the frontswap patch create the new include file swapfile.h?
|
|
|
|
+
|
|
|
|
+The frontswap code depends on some swap-subsystem-internal data
|
|
|
|
+structures that have, over the years, moved back and forth between
|
|
|
|
+static and global. This seemed a reasonable compromise: Define
|
|
|
|
+them as global but declare them in a new include file that isn't
|
|
|
|
+included by the large number of source files that include swap.h.
|
|
|
|
+
|
|
|
|
+Dan Magenheimer, last updated April 9, 2012
|