- Introduction
- ============
- dm-cache is a device mapper target written by Joe Thornber, Heinz
- Mauelshagen, and Mike Snitzer.
- It aims to improve performance of a block device (e.g. a spindle) by
- dynamically migrating some of its data to a faster, smaller device
- (e.g. an SSD).
- This device-mapper solution allows us to insert this caching at
- different levels of the dm stack, for instance above the data device for
- a thin-provisioning pool. Caching solutions that are integrated more
- closely with the virtual memory system should give better performance.
- The target reuses the metadata library used by the thin-provisioning
- target.
- The decision as to what data to migrate and when is left to a plug-in
- policy module. Several of these have been written as we experiment,
- and we hope other people will contribute others for specific io
- scenarios (e.g. a VM image server).
- Glossary
- ========
- Migration - Movement of the primary copy of a logical block from one
-             device to the other.
- Promotion - Migration from slow device to fast device.
- Demotion  - Migration from fast device to slow device.
- The origin device always contains a copy of the logical block, which
- may be out of date or kept in sync with the copy on the cache device
- (depending on policy).
- Design
- ======
- Sub-devices
- -----------
- The target is constructed by passing three devices to it (along with
- other parameters detailed later):
- 1. An origin device - the big, slow one.
- 2. A cache device - the small, fast one.
- 3. A small metadata device - records which blocks are in the cache,
-    which are dirty, and extra hints for use by the policy object.
-    This information could be put on the cache device, but having it
-    separate allows the volume manager to configure it differently,
-    e.g. as a mirror for extra robustness.
- Fixed block size
- ----------------
- The origin is divided up into blocks of a fixed size. This block size
- is configurable when you first create the cache. Typically we've been
- using block sizes of 256k - 1024k.
- Having a fixed block size simplifies the target a lot. But it is
- something of a compromise. For instance, a small part of a block may be
- getting hit a lot, yet the whole block will be promoted to the cache.
- So large block sizes are bad because they waste cache space. And small
- block sizes are bad because they increase the amount of metadata (both
- in core and on disk).
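- As a rough illustration of this trade-off, the shell sketch below
- counts how many cache blocks fit on a cache device at two block
- sizes. The 20GiB cache size and the helper name cache_blocks are
- assumptions made for the example; the metadata cost per block is not
- modelled.

```shell
#!/bin/sh
# Hypothetical helper: number of whole cache blocks that fit on a device.
# Both arguments are given in 512-byte sectors.
cache_blocks() {
    dev_sectors=$1
    block_sectors=$2
    echo $(( dev_sectors / block_sectors ))
}

CACHE_SECTORS=41943040    # assumed 20GiB cache device

# 256k blocks are 512 sectors, 1024k blocks are 2048 sectors.
echo "256k blocks:  $(cache_blocks "$CACHE_SECTORS" 512)"
echo "1024k blocks: $(cache_blocks "$CACHE_SECTORS" 2048)"
```

- Going from 1024k to 256k blocks quadruples the number of blocks the
- metadata has to describe, in exchange for finer-grained promotion.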
- Writeback/writethrough
- ----------------------
- The cache has two modes, writeback and writethrough.
- If writeback, the default, is selected then a write to a block that is
- cached will go only to the cache and the block will be marked dirty in
- the metadata.
- If writethrough is selected then a write to a cached block will not
- complete until it has hit both the origin and cache devices. Clean
- blocks should remain clean.
- A simple cleaner policy is provided, which will clean (write back) all
- dirty blocks in a cache. This is useful for decommissioning a cache.
- Migration throttling
- --------------------
- Migrating data between the origin and cache device uses bandwidth.
- The user can set a throttle to prevent more than a certain amount of
- migration occurring at any one time. Currently we're not taking any
- account of normal io traffic going to the devices. More work needs
- doing here to avoid migrating during those peak io moments.
- For the time being, a message "migration_threshold <#sectors>"
- can be used to set the maximum number of sectors being migrated,
- the default being 204800 sectors (or 100MB).
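- The sketch below converts a limit expressed in MiB into sectors and
- prints the matching message command. It assumes 512-byte sectors and
- a cache device named my_cache; the helper name mib_to_sectors is
- hypothetical. The command is printed rather than run so that the
- script needs neither root nor a real device.

```shell
#!/bin/sh
# Hypothetical helper: convert MiB to 512-byte sectors.
mib_to_sectors() {
    echo $(( $1 * 1024 * 1024 / 512 ))
}

SECTORS=$(mib_to_sectors 100)
echo "$SECTORS"

# Print the command only; "my_cache" is an assumed device name.
echo "dmsetup message my_cache 0 migration_threshold $SECTORS"
```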
- Updating on-disk metadata
- -------------------------
- On-disk metadata is committed every time a REQ_SYNC or REQ_FUA bio is
- written. If no such requests are made then commits will occur every
- second. This means the cache behaves like a physical disk that has a
- write cache (the same is true of the thin-provisioning target). If
- power is lost you may lose some recent writes. The metadata should
- always be consistent in spite of any crash.
- The 'dirty' state for a cache block changes far too frequently for us
- to keep updating it on the fly. So we treat it as a hint. In normal
- operation it will be written when the dm device is suspended. If the
- system crashes all cache blocks will be assumed dirty when restarted.
- Per-block policy hints
- ----------------------
- Policy plug-ins can store a chunk of data per cache block. It's up to
- the policy how big this chunk is, but it should be kept small. Like the
- dirty flags this data is lost if there's a crash so a safe fallback
- value should always be possible.
- For instance, the 'mq' policy, which is currently the default policy,
- uses this facility to store the hit count of the cache blocks. If
- there's a crash this information will be lost, which means the cache
- may be less efficient until those hit counts are regenerated.
- Policy hints affect performance, not correctness.
- Policy messaging
- ----------------
- Policies will have different tunables, specific to each one, so we
- need a generic way of getting and setting these. Device-mapper
- messages are used. Refer to cache-policies.txt.
- Discard bitset resolution
- -------------------------
- We can avoid copying data during migration if we know the block has
- been discarded. A prime example of this is when mkfs discards the
- whole block device. We store a bitset tracking the discard state of
- blocks. However, we allow this bitset to have a different block size
- from the cache blocks. This is because we need to track the discard
- state for all of the origin device (compare with the dirty bitset
- which is just for the smaller cache device).
- Target interface
- ================
- Constructor
- -----------
- cache <metadata dev> <cache dev> <origin dev> <block size>
-       <#feature args> [<feature arg>]*
-       <policy> <#policy args> [policy args]*
- metadata dev    : fast device holding the persistent metadata
- cache dev       : fast device holding cached data blocks
- origin dev      : slow device holding original data blocks
- block size      : cache unit size in sectors
- #feature args   : number of feature arguments passed
- feature args    : writethrough. (The default is writeback.)
- policy          : the replacement policy to use
- #policy args    : an even number of arguments corresponding to
-                   key/value pairs passed to the policy
- policy args     : key/value pairs passed to the policy
-                   E.g. 'sequential_threshold 1024'
-                   See cache-policies.txt for details.
- Optional feature arguments are:
- writethrough    : write through caching that prohibits cache block
-                   content from being different from origin block content.
-                   Without this argument, the default behaviour is to write
-                   back cache block contents later for performance reasons,
-                   so they may differ from the corresponding origin blocks.
- A policy called 'default' is always registered. This is an alias for
- the policy we currently think is giving best all round performance.
- As the default policy could vary between kernels, if you are relying on
- the characteristics of a specific policy, always request it by name.
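- The constructor layout can be sketched as a small shell helper that
- assembles a table line and fills in the policy argument count. The
- function name build_cache_table is hypothetical, and the helper
- always passes exactly one feature argument for simplicity. It only
- prints the table; it does not call dmsetup.

```shell
#!/bin/sh
# Hypothetical helper: print a dm-cache table line.
# usage: build_cache_table <length> <metadata dev> <cache dev> <origin dev> \
#            <block size> <feature> <policy> [<key> <value>]...
build_cache_table() {
    if [ "$#" -lt 7 ]; then
        echo "build_cache_table: expected at least 7 arguments" >&2
        return 1
    fi
    length=$1; metadata=$2; cache=$3; origin=$4
    block=$5; feature=$6; policy=$7
    shift 7

    # Policy arguments are key/value pairs, so their count must be even.
    if [ $(( $# % 2 )) -ne 0 ]; then
        echo "build_cache_table: policy args must be key/value pairs" >&2
        return 1
    fi

    table="0 $length cache $metadata $cache $origin $block 1 $feature $policy $#"
    if [ "$#" -gt 0 ]; then
        table="$table $*"
    fi
    echo "$table"
}

build_cache_table 41943040 /dev/mapper/metadata /dev/mapper/ssd \
    /dev/mapper/origin 1024 writeback mq \
    sequential_threshold 1024 random_threshold 8
```

- The printed line could then be passed to dmsetup create with --table.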
- Status
- ------
- <#used metadata blocks>/<#total metadata blocks> <#read hits> <#read misses>
- <#write hits> <#write misses> <#demotions> <#promotions> <#blocks in cache>
- <#dirty> <#features> <features>* <#core args> <core args>* <#policy args>
- <policy args>*
- #used metadata blocks    : Number of metadata blocks used
- #total metadata blocks   : Total number of metadata blocks
- #read hits               : Number of times a READ bio has been mapped
-                            to the cache
- #read misses             : Number of times a READ bio has been mapped
-                            to the origin
- #write hits              : Number of times a WRITE bio has been mapped
-                            to the cache
- #write misses            : Number of times a WRITE bio has been
-                            mapped to the origin
- #demotions               : Number of times a block has been removed
-                            from the cache
- #promotions              : Number of times a block has been moved to
-                            the cache
- #blocks in cache         : Number of blocks resident in the cache
- #dirty                   : Number of blocks in the cache that differ
-                            from the origin
- #features                : Number of feature args to follow
- features                 : 'writethrough' (optional)
- #core args               : Number of core arguments (must be even)
- core args                : Key/value pairs for tuning the core
-                            e.g. migration_threshold
- #policy args             : Number of policy arguments to follow (must be even)
- policy args              : Key/value pairs
-                            e.g. 'sequential_threshold 1024'
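- As a sketch of how these fields might be consumed, the script below
- splits a status string into shell variables and derives hit
- percentages. The sample values are invented, the helper names
- parse_cache_status and hit_percent are hypothetical, and the sample
- leaves out the start, length and target name that dmsetup status
- prints before the target's own fields.

```shell
#!/bin/sh
# Invented sample values; the field order follows the status format above.
STATUS='23/4096 1024 512 2048 256 10 300 200 5 1 writethrough 2 migration_threshold 204800 2 sequential_threshold 1024'

# Hypothetical helper: split the fixed leading fields into shell variables.
parse_cache_status() {
    # Intentional word splitting of the status string.
    set -- $1
    used_md=${1%/*}
    total_md=${1#*/}
    read_hits=$2
    read_misses=$3
    write_hits=$4
    write_misses=$5
    demotions=$6
    promotions=$7
    resident=$8
    dirty=$9
}

# Integer percentage of hits; prints 0 when there has been no io.
hit_percent() {
    total=$(( $1 + $2 ))
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $(( $1 * 100 / total ))
    fi
}

parse_cache_status "$STATUS"
echo "metadata blocks: $used_md of $total_md"
echo "read hits:  $(hit_percent "$read_hits" "$read_misses")%"
echo "write hits: $(hit_percent "$write_hits" "$write_misses")%"
echo "dirty blocks: $dirty of $resident resident"
```

- The variable-length feature, core and policy arguments that follow
- the #dirty field are not parsed here.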
- Messages
- --------
- Policies will have different tunables, specific to each one, so we
- need a generic way of getting and setting these. Device-mapper
- messages are used. (A sysfs interface would also be possible.)
- The message format is:
- <key> <value>
- E.g.
- dmsetup message my_cache 0 sequential_threshold 1024
- Examples
- ========
- The test suite can be found here:
- https://github.com/jthornber/thinp-test-suite
- dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \
-     /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0"
- dmsetup create my_cache --table "0 41943040 cache /dev/mapper/metadata \
-     /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
-     mq 4 sequential_threshold 1024 random_threshold 8"