|
@@ -0,0 +1,327 @@
|
|
|
+----------------------------------------------------------------------
|
|
|
+1. INTRODUCTION
|
|
|
+
|
|
|
+Modern filesystems feature checksumming of data and metadata to
|
|
|
+protect against data corruption. However, the detection of the
|
|
|
+corruption is done at read time, which could potentially be months
|
|
|
+after the data was written. At that point the original data that the
|
|
|
+application tried to write is most likely lost.
|
|
|
+
|
|
|
+The solution is to ensure that the disk is actually storing what the
|
|
|
+application meant it to. Recent additions to both the SCSI family
|
|
|
+protocols (SBC Data Integrity Field, SCC protection proposal) as well
|
|
|
+as SATA/T13 (External Path Protection) try to remedy this by adding
|
|
|
+support for appending integrity metadata to an I/O. The integrity
|
|
|
+metadata (or protection information in SCSI terminology) includes a
|
|
|
+checksum for each sector as well as an incrementing counter that
|
|
|
+ensures the individual sectors are written in the right order and,
|
|
|
+for some protection schemes, also that the I/O is written to the right
|
|
|
+place on disk.
|
|
|
+
|
|
|
+Current storage controllers and devices implement various protective
|
|
|
+measures, for instance checksumming and scrubbing. But these
|
|
|
+technologies are working in their own isolated domains or at best
|
|
|
+between adjacent nodes in the I/O path. The interesting thing about
|
|
|
+DIF and the other integrity extensions is that the protection format
|
|
|
+is well defined and every node in the I/O path can verify the
|
|
|
+integrity of the I/O and reject it if corruption is detected. This
|
|
|
+allows not only corruption prevention but also isolation of the point
|
|
|
+of failure.
|
|
|
+
|
|
|
+----------------------------------------------------------------------
|
|
|
+2. THE DATA INTEGRITY EXTENSIONS
|
|
|
+
|
|
|
+As written, the protocol extensions only protect the path between
|
|
|
+controller and storage device. However, many controllers actually
|
|
|
+allow the operating system to interact with the integrity metadata
|
|
|
+(IMD). We have been working with several FC/SAS HBA vendors to enable
|
|
|
+the protection information to be transferred to and from their
|
|
|
+controllers.
|
|
|
+
|
|
|
+The SCSI Data Integrity Field works by appending 8 bytes of protection
|
|
|
+information to each sector. The data + integrity metadata is stored
|
|
|
+in 520 byte sectors on disk. Data + IMD are interleaved when
|
|
|
+transferred between the controller and target. The T13 proposal is
|
|
|
+similar.
|
|
|
+
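As a rough illustration of the sizes mentioned above, the 8-byte
tuple can be modelled in user-space C. The field split (16-bit guard
tag, 16-bit application tag, 32-bit reference tag) follows the T10
DIF layout and is an assumption here, not something spelled out in
this text. All names starting with sketch_ are made up for this
example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 8-byte protection information tuple. */
struct sketch_dif_tuple {
	uint16_t guard_tag;	/* checksum of the sector data */
	uint16_t app_tag;	/* opaque tag, owned by the block device owner */
	uint32_t ref_tag;	/* incrementing counter (big-endian on the wire) */
};

/* Size of a sector once the tuple has been appended to the data. */
unsigned int sketch_protected_sector_size(unsigned int data_bytes)
{
	/* The sketch relies on the compiler not padding the struct. */
	assert(sizeof(struct sketch_dif_tuple) == 8);

	return data_bytes + (unsigned int)sizeof(struct sketch_dif_tuple);
}
```

Calling sketch_protected_sector_size(512) yields the 520 byte sector
size discussed above, and passing 4096 yields 4104.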
|
|
|
+Because it is highly inconvenient for operating systems to deal with
|
|
|
+520 (and 4104) byte sectors, we approached several HBA vendors and
|
|
|
+encouraged them to allow separation of the data and integrity metadata
|
|
|
+scatter-gather lists.
|
|
|
+
|
|
|
+The controller will interleave the buffers on write and split them on
|
|
|
+read. This means that Linux can DMA the data buffers to and from
|
|
|
+host memory without changes to the page cache.
|
|
|
+
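The interleave/split step is performed by the controller, not by
Linux. Purely to make the data movement concrete, it can be sketched
in user-space C as below. The 512-byte sector size, the buffer layout
and all sketch_ names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_DATA_SIZE	512u	/* data bytes per sector */
#define SKETCH_TUPLE_SIZE	8u	/* integrity metadata bytes per sector */
#define SKETCH_WIRE_SIZE	(SKETCH_DATA_SIZE + SKETCH_TUPLE_SIZE)

/* WRITE: merge separate data and IMD buffers into 520-byte sectors. */
void sketch_interleave(const uint8_t *data, const uint8_t *imd,
		       uint8_t *wire, size_t nr_sectors)
{
	assert(data && imd && wire);

	for (size_t i = 0; i < nr_sectors; i++) {
		uint8_t *dst = wire + i * SKETCH_WIRE_SIZE;

		memcpy(dst, data + i * SKETCH_DATA_SIZE, SKETCH_DATA_SIZE);
		memcpy(dst + SKETCH_DATA_SIZE, imd + i * SKETCH_TUPLE_SIZE,
		       SKETCH_TUPLE_SIZE);
	}
}

/* READ: split 520-byte sectors back into separate data and IMD buffers. */
void sketch_split(const uint8_t *wire, uint8_t *data, uint8_t *imd,
		  size_t nr_sectors)
{
	assert(wire && data && imd);

	for (size_t i = 0; i < nr_sectors; i++) {
		const uint8_t *src = wire + i * SKETCH_WIRE_SIZE;

		memcpy(data + i * SKETCH_DATA_SIZE, src, SKETCH_DATA_SIZE);
		memcpy(imd + i * SKETCH_TUPLE_SIZE, src + SKETCH_DATA_SIZE,
		       SKETCH_TUPLE_SIZE);
	}
}
```

The point of the separation is visible in the signatures: the data
buffer keeps its ordinary 512-byte stride, so only the wire format
ever sees 520-byte sectors.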
|
|
|
+Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
|
|
|
+is somewhat heavy to compute in software. Benchmarks found that
|
|
|
+calculating this checksum had a significant impact on system
|
|
|
+performance for a number of workloads. Some controllers allow a
|
|
|
+lighter-weight checksum to be used when interfacing with the operating
|
|
|
+system. Emulex, for instance, supports the TCP/IP checksum instead.
|
|
|
+The IP checksum received from the OS is converted to the 16-bit CRC
|
|
|
+when writing and vice versa. This allows the integrity metadata to be
|
|
|
+generated by Linux or the application at very low cost (comparable to
|
|
|
+software RAID5).
|
|
|
+
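To give a feel for the difference in cost, here is a user-space
sketch of both checksums. The CRC uses polynomial 0x8BB7 with a zero
initial value, which is the T10 DIF guard definition; the IP checksum
follows RFC 1071. Neither function is taken from the kernel or from
any controller firmware, and the sketch_ names are invented for this
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC16 with polynomial 0x8BB7: eight shift/xor steps per byte. */
uint16_t sketch_crc_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	assert(buf || len == 0);

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (int bit = 0; bit < 8; bit++) {
			if (crc & 0x8000)
				crc = (uint16_t)((crc << 1) ^ 0x8BB7);
			else
				crc = (uint16_t)(crc << 1);
		}
	}
	return crc;
}

/* RFC 1071 style checksum: one addition per 16-bit word, then a fold. */
uint16_t sketch_ip_checksum(const uint8_t *buf, size_t len)
{
	uint64_t sum = 0;

	assert(buf || len == 0);

	for (size_t i = 0; i + 1 < len; i += 2)
		sum += ((uint64_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)			/* pad a trailing odd byte with zero */
		sum += (uint64_t)buf[len - 1] << 8;
	while (sum >> 16)		/* fold the carries back in */
		sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Real implementations of the CRC are usually table-driven or use
dedicated instructions, so the gap is smaller in practice than this
bit-at-a-time version suggests.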
|
|
|
+The IP checksum is weaker than the CRC in terms of detecting bit
|
|
|
+errors. However, the strength is really in the separation of the data
|
|
|
+buffers and the integrity metadata. These two distinct buffers must
|
|
|
+match up for an I/O to complete.
|
|
|
+
|
|
|
+The separation of the data and integrity metadata buffers as well as
|
|
|
+the choice in checksums is referred to as the Data Integrity
|
|
|
+Extensions. As these extensions are outside the scope of the protocol
|
|
|
+bodies (T10, T13), Oracle and its partners are trying to standardize
|
|
|
+them within the Storage Networking Industry Association.
|
|
|
+
|
|
|
+----------------------------------------------------------------------
|
|
|
+3. KERNEL CHANGES
|
|
|
+
|
|
|
+The data integrity framework in Linux enables protection information
|
|
|
+to be pinned to I/Os and sent to/received from controllers that
|
|
|
+support it.
|
|
|
+
|
|
|
+The advantage to the integrity extensions in SCSI and SATA is that
|
|
|
+they enable us to protect the entire path from application to storage
|
|
|
+device. However, at the same time this is also the biggest
|
|
|
+disadvantage. It means that the protection information must be in a
|
|
|
+format that can be understood by the disk.
|
|
|
+
|
|
|
+Generally Linux/POSIX applications are agnostic to the intricacies of
|
|
|
+the storage devices they are accessing. The virtual filesystem switch
|
|
|
+and the block layer make things like hardware sector size and
|
|
|
+transport protocols completely transparent to the application.
|
|
|
+
|
|
|
+However, this level of detail is required when preparing the
|
|
|
+protection information to send to a disk. Consequently, the very
|
|
|
+concept of an end-to-end protection scheme is a layering violation.
|
|
|
+It is completely unreasonable for an application to be aware whether
|
|
|
+it is accessing a SCSI or SATA disk.
|
|
|
+
|
|
|
+The data integrity support implemented in Linux attempts to hide this
|
|
|
+from the application. As far as the application (and to some extent
|
|
|
+the kernel) is concerned, the integrity metadata is opaque information
|
|
|
+that's attached to the I/O.
|
|
|
+
|
|
|
+The current implementation allows the block layer to automatically
|
|
|
+generate the protection information for any I/O. Eventually the
|
|
|
+intent is to move the integrity metadata calculation to userspace for
|
|
|
+user data. Metadata and other I/O that originates within the kernel
|
|
|
+will still use the automatic generation interface.
|
|
|
+
|
|
|
+Some storage devices allow each hardware sector to be tagged with a
|
|
|
+16-bit value. The owner of this tag space is the owner of the block
|
|
|
+device. I.e. the filesystem in most cases. The filesystem can use
|
|
|
+this extra space to tag sectors as it sees fit. Because the tag
|
|
|
+space is limited, the block interface allows tagging bigger chunks by
|
|
|
+way of interleaving. This way, 8*16 bits of information can be
|
|
|
+attached to a typical 4KB filesystem block.
|
|
|
+
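The interleaving idea can be sketched in user-space C as follows.
The sketch assumes 512-byte hardware sectors, 8-byte tuples and a
2-byte application tag located right after a 2-byte checksum field
(the T10 DIF layout). The sketch_ functions are hypothetical and are
not the block layer interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TUPLE_SIZE	8u	/* integrity metadata per sector */
#define SKETCH_TAG_OFFSET	2u	/* tag follows the 2-byte checksum */
#define SKETCH_TAG_SIZE		2u	/* tag bytes per hardware sector */

/* Total tag space available for an I/O spanning nr_sectors sectors. */
size_t sketch_tag_capacity(size_t nr_sectors)
{
	return nr_sectors * SKETCH_TAG_SIZE;
}

/* Scatter one opaque tag buffer across the tuples of an I/O. */
int sketch_set_tag(uint8_t *imd, size_t nr_sectors,
		   const uint8_t *tag, size_t len)
{
	assert(imd && tag);

	if (len > sketch_tag_capacity(nr_sectors))
		return -1;

	for (size_t i = 0; i < len; i++) {
		size_t sector = i / SKETCH_TAG_SIZE;
		size_t byte = i % SKETCH_TAG_SIZE;

		imd[sector * SKETCH_TUPLE_SIZE + SKETCH_TAG_OFFSET + byte] =
			tag[i];
	}
	return 0;
}

/* Gather the tag buffer back out of the tuples. */
int sketch_get_tag(const uint8_t *imd, size_t nr_sectors,
		   uint8_t *tag, size_t len)
{
	assert(imd && tag);

	if (len > sketch_tag_capacity(nr_sectors))
		return -1;

	for (size_t i = 0; i < len; i++) {
		size_t sector = i / SKETCH_TAG_SIZE;
		size_t byte = i % SKETCH_TAG_SIZE;

		tag[i] = imd[sector * SKETCH_TUPLE_SIZE + SKETCH_TAG_OFFSET +
			     byte];
	}
	return 0;
}
```

For a 4KB filesystem block made of eight 512-byte sectors,
sketch_tag_capacity(8) is 16 bytes, i.e. the 8*16 bits mentioned
above.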
|
|
|
+This also means that applications such as fsck and mkfs will need
|
|
|
+access to manipulate the tags from user space. A passthrough
|
|
|
+interface for this is being worked on.
|
|
|
+
|
|
|
+
|
|
|
+----------------------------------------------------------------------
|
|
|
+4. BLOCK LAYER IMPLEMENTATION DETAILS
|
|
|
+
|
|
|
+4.1 BIO
|
|
|
+
|
|
|
+The data integrity patches add a new field to struct bio when
|
|
|
+CONFIG_BLK_DEV_INTEGRITY is enabled. bio->bi_integrity is a pointer
|
|
|
+to a struct bip which contains the bio integrity payload. Essentially
|
|
|
+a bip is a trimmed down struct bio which holds a bio_vec containing
|
|
|
+the integrity metadata and the required housekeeping information (bvec
|
|
|
+pool, vector count, etc.).
|
|
|
+
|
|
|
+A kernel subsystem can enable data integrity protection on a bio by
|
|
|
+calling bio_integrity_alloc(bio). This will allocate and attach the
|
|
|
+bip to the bio.
|
|
|
+
|
|
|
+Individual pages containing integrity metadata can subsequently be
|
|
|
+attached using bio_integrity_add_page().
|
|
|
+
|
|
|
+bio_free() will automatically free the bip.
|
|
|
+
|
|
|
+
|
|
|
+4.2 BLOCK DEVICE
|
|
|
+
|
|
|
+Because the format of the protection data is tied to the physical
|
|
|
+disk, each block device has been extended with a block integrity
|
|
|
+profile (struct blk_integrity). This optional profile is registered
|
|
|
+with the block layer using blk_integrity_register().
|
|
|
+
|
|
|
+The profile contains callback functions for generating and verifying
|
|
|
+the protection data, as well as getting and setting application tags.
|
|
|
+The profile also contains a few constants to aid in completing,
|
|
|
+merging and splitting the integrity metadata.
|
|
|
+
|
|
|
+Layered block devices will need to pick a profile that's appropriate
|
|
|
+for all subdevices. blk_integrity_compare() can help with that. DM
|
|
|
+and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
|
|
|
+will require extra work due to the application tag.
|
|
|
+
|
|
|
+
|
|
|
+----------------------------------------------------------------------
|
|
|
+5.0 BLOCK LAYER INTEGRITY API
|
|
|
+
|
|
|
+5.1 NORMAL FILESYSTEM
|
|
|
+
|
|
|
+ The normal filesystem is unaware that the underlying block device
|
|
|
+ is capable of sending/receiving integrity metadata. The IMD will
|
|
|
+ be automatically generated by the block layer at submit_bio() time
|
|
|
+ in case of a WRITE. A READ request will cause the I/O integrity
|
|
|
+ to be verified upon completion.
|
|
|
+
|
|
|
+ IMD generation and verification can be toggled using the
|
|
|
+
|
|
|
+ /sys/block/<bdev>/integrity/write_generate
|
|
|
+
|
|
|
+ and
|
|
|
+
|
|
|
+ /sys/block/<bdev>/integrity/read_verify
|
|
|
+
|
|
|
+ flags.
|
|
|
+
|
|
|
+
|
|
|
+5.2 INTEGRITY-AWARE FILESYSTEM
|
|
|
+
|
|
|
+ A filesystem that is integrity-aware can prepare I/Os with IMD
|
|
|
+ attached. It can also use the application tag space if this is
|
|
|
+ supported by the block device.
|
|
|
+
|
|
|
+
|
|
|
+ int bdev_integrity_enabled(block_device, int rw);
|
|
|
+
|
|
|
+ bdev_integrity_enabled() will return 1 if the block device
|
|
|
+ supports integrity metadata transfer for the data direction
|
|
|
+ specified in 'rw'.
|
|
|
+
|
|
|
+ bdev_integrity_enabled() honors the write_generate and
|
|
|
+ read_verify flags in sysfs and will respond accordingly.
|
|
|
+
|
|
|
+
|
|
|
+ int bio_integrity_prep(bio);
|
|
|
+
|
|
|
+ To generate IMD for WRITE and to set up buffers for READ, the
|
|
|
+ filesystem must call bio_integrity_prep(bio).
|
|
|
+
|
|
|
+ Prior to calling this function, the bio data direction and start
|
|
|
+ sector must be set, and the bio should have all data pages
|
|
|
+ added. It is up to the caller to ensure that the bio does not
|
|
|
+ change while I/O is in progress.
|
|
|
+
|
|
|
+ bio_integrity_prep() should only be called if
|
|
|
+ bio_integrity_enabled() returned 1.
|
|
|
+
|
|
|
+
|
|
|
+ int bio_integrity_tag_size(bio);
|
|
|
+
|
|
|
+ If the filesystem wants to use the application tag space it will
|
|
|
+ first have to find out how much storage space is available.
|
|
|
+ Because tag space is generally limited (usually 2 bytes per
|
|
|
+ sector regardless of sector size), the integrity framework
|
|
|
+ supports interleaving the information between the sectors in an
|
|
|
+ I/O.
|
|
|
+
|
|
|
+ Filesystems can call bio_integrity_tag_size(bio) to find out how
|
|
|
+ many bytes of storage are available for that particular bio.
|
|
|
+
|
|
|
+ Another option is bdev_get_tag_size(block_device) which will
|
|
|
+ return the number of available bytes per hardware sector.
|
|
|
+
|
|
|
+
|
|
|
+ int bio_integrity_set_tag(bio, void *tag_buf, len);
|
|
|
+
|
|
|
+ After a successful return from bio_integrity_prep(),
|
|
|
+ bio_integrity_set_tag() can be used to attach an opaque tag
|
|
|
+ buffer to a bio. Obviously this only makes sense if the I/O is
|
|
|
+ a WRITE.
|
|
|
+
|
|
|
+
|
|
|
+ int bio_integrity_get_tag(bio, void *tag_buf, len);
|
|
|
+
|
|
|
+ Similarly, at READ I/O completion time the filesystem can
|
|
|
+ retrieve the tag buffer using bio_integrity_get_tag().
|
|
|
+
|
|
|
+
|
|
|
+5.3 PASSING EXISTING INTEGRITY METADATA
|
|
|
+
|
|
|
+ Filesystems that either generate their own integrity metadata or
|
|
|
+ are capable of transferring IMD from user space can use the
|
|
|
+ following calls:
|
|
|
+
|
|
|
+
|
|
|
+ struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);
|
|
|
+
|
|
|
+ Allocates the bio integrity payload and hangs it off of the bio.
|
|
|
+ nr_pages indicates how many pages of protection data need to be
|
|
|
+ stored in the integrity bio_vec list (similar to bio_alloc()).
|
|
|
+
|
|
|
+ The integrity payload will be freed at bio_free() time.
|
|
|
+
|
|
|
+
|
|
|
+ int bio_integrity_add_page(bio, page, len, offset);
|
|
|
+
|
|
|
+ Attaches a page containing integrity metadata to an existing
|
|
|
+ bio. The bio must have an existing bip,
|
|
|
+ i.e. bio_integrity_alloc() must have been called. For a WRITE,
|
|
|
+ the integrity metadata in the pages must be in a format
|
|
|
+ understood by the target device with the notable exception that
|
|
|
+ the sector numbers will be remapped as the request traverses the
|
|
|
+ I/O stack. This implies that the pages added using this call
|
|
|
+ will be modified during I/O! The first reference tag in the
|
|
|
+ integrity metadata must have a value of bip->bip_sector.
|
|
|
+
|
|
|
+ Pages can be added using bio_integrity_add_page() as long as
|
|
|
+ there is room in the bip bio_vec array (nr_pages).
|
|
|
+
|
|
|
+ Upon completion of a READ operation, the attached pages will
|
|
|
+ contain the integrity metadata received from the storage device.
|
|
|
+ It is up to the receiver to process them and verify data
|
|
|
+ integrity upon completion.
|
|
|
+
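The reference tag remapping described above can be pictured with the
user-space sketch below. It assumes 8-byte tuples with a 32-bit
big-endian reference tag in the last four bytes (the T10 DIF layout)
and that only the low 32 bits of the sector number are stored. The
sketch_ helpers are hypothetical and do not correspond to kernel
functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TUPLE_SIZE	8u
#define SKETCH_REF_OFFSET	4u	/* reference tag: last 4 tuple bytes */

void sketch_put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

uint32_t sketch_get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Submitter side: the first reference tag equals the start sector. */
void sketch_seed_ref_tags(uint8_t *imd, size_t nr_sectors,
			  uint64_t start_sector)
{
	assert(imd);

	for (size_t i = 0; i < nr_sectors; i++)
		sketch_put_be32(imd + i * SKETCH_TUPLE_SIZE +
				SKETCH_REF_OFFSET,
				(uint32_t)(start_sector + i));
}

/*
 * Lower layer: rewrite the tags in place when the request is remapped
 * from virt_start to phys_start. Tags that do not carry the expected
 * virtual value are left alone. Returns the number of tags rewritten.
 */
size_t sketch_remap_ref_tags(uint8_t *imd, size_t nr_sectors,
			     uint64_t virt_start, uint64_t phys_start)
{
	size_t remapped = 0;

	assert(imd);

	for (size_t i = 0; i < nr_sectors; i++) {
		uint8_t *ref = imd + i * SKETCH_TUPLE_SIZE +
			       SKETCH_REF_OFFSET;

		if (sketch_get_be32(ref) == (uint32_t)(virt_start + i)) {
			sketch_put_be32(ref, (uint32_t)(phys_start + i));
			remapped++;
		}
	}
	return remapped;
}
```

Because the rewrite happens in the caller's buffer, the submitter
cannot assume the pages still hold the values it wrote once the I/O
has been issued.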
|
|
|
+
|
|
|
+5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
|
|
|
+ METADATA
|
|
|
+
|
|
|
+ To enable integrity exchange on a block device the gendisk must be
|
|
|
+ registered as capable:
|
|
|
+
|
|
|
+ int blk_integrity_register(gendisk, blk_integrity);
|
|
|
+
|
|
|
+ The blk_integrity struct is a template and should contain the
|
|
|
+ following:
|
|
|
+
|
|
|
+ static struct blk_integrity my_profile = {
|
|
|
+ .name = "STANDARDSBODY-TYPE-VARIANT-CSUM",
|
|
|
+ .generate_fn = my_generate_fn,
|
|
|
+ .verify_fn = my_verify_fn,
|
|
|
+ .get_tag_fn = my_get_tag_fn,
|
|
|
+ .set_tag_fn = my_set_tag_fn,
|
|
|
+ .tuple_size = sizeof(struct my_tuple_size),
|
|
|
+ .tag_size = <tag bytes per hw sector>,
|
|
|
+ };
|
|
|
+
|
|
|
+ 'name' is a text string which will be visible in sysfs. This is
|
|
|
+ part of the userland API so choose it carefully and never change
|
|
|
+ it. The format is standards body-type-variant-checksum.
|
|
|
+ E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.
|
|
|
+
|
|
|
+ 'generate_fn' generates appropriate integrity metadata (for WRITE).
|
|
|
+
|
|
|
+ 'verify_fn' verifies that the data buffer matches the integrity
|
|
|
+ metadata.
|
|
|
+
|
|
|
+ 'tuple_size' must be set to match the size of the integrity
|
|
|
+ metadata per sector. I.e. 8 for DIF and EPP.
|
|
|
+
|
|
|
+ 'tag_size' must be set to identify how many bytes of tag space
|
|
|
+ are available per hardware sector. For DIF this is either 2 or
|
|
|
+ 0 depending on the value of the Control Mode Page ATO bit.
|
|
|
+
|
|
|
+ See 5.2 for a description of get_tag_fn and set_tag_fn.
|
|
|
+
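What a generate/verify pair has to do can be sketched in user-space
C. The callback signatures used by the block layer are not shown in
this document, so the function signatures below are invented for
illustration only. The sketch assumes 512-byte sectors, a CRC guard
with polynomial 0x8BB7 and the T10 DIF tuple layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SECTOR_SIZE	512u
#define SKETCH_TUPLE_SIZE	8u

/* Bitwise CRC16, polynomial 0x8BB7, zero initial value. */
uint16_t sketch_crc16(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ?
				(uint16_t)((crc << 1) ^ 0x8BB7) :
				(uint16_t)(crc << 1);
	}
	return crc;
}

/* Stand-in for a generate_fn: fill in one tuple per data sector. */
void sketch_generate(const uint8_t *data, uint8_t *imd,
		     size_t nr_sectors, uint32_t first_ref)
{
	assert(data && imd);

	for (size_t i = 0; i < nr_sectors; i++) {
		uint8_t *t = imd + i * SKETCH_TUPLE_SIZE;
		uint16_t guard = sketch_crc16(data + i * SKETCH_SECTOR_SIZE,
					      SKETCH_SECTOR_SIZE);
		uint32_t ref = first_ref + (uint32_t)i;

		t[0] = (uint8_t)(guard >> 8);
		t[1] = (uint8_t)guard;
		t[2] = 0;		/* application tag left clear */
		t[3] = 0;
		t[4] = (uint8_t)(ref >> 24);
		t[5] = (uint8_t)(ref >> 16);
		t[6] = (uint8_t)(ref >> 8);
		t[7] = (uint8_t)ref;
	}
}

/*
 * Stand-in for a verify_fn: returns -1 if every sector matches its
 * tuple, otherwise the index of the first sector that does not.
 */
long sketch_verify(const uint8_t *data, const uint8_t *imd,
		   size_t nr_sectors, uint32_t first_ref)
{
	assert(data && imd);

	for (size_t i = 0; i < nr_sectors; i++) {
		const uint8_t *t = imd + i * SKETCH_TUPLE_SIZE;
		uint16_t guard = sketch_crc16(data + i * SKETCH_SECTOR_SIZE,
					      SKETCH_SECTOR_SIZE);
		uint16_t stored_guard = (uint16_t)((t[0] << 8) | t[1]);
		uint32_t stored_ref = ((uint32_t)t[4] << 24) |
				      ((uint32_t)t[5] << 16) |
				      ((uint32_t)t[6] << 8) | (uint32_t)t[7];

		if (stored_guard != guard ||
		    stored_ref != first_ref + (uint32_t)i)
			return (long)i;
	}
	return -1;
}
```

Reporting the index of the failing sector rather than a plain
pass/fail is a choice made for this sketch; it mirrors the idea that
an integrity error should identify where the mismatch occurred.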
|
|
|
+----------------------------------------------------------------------
|
|
|
+2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>
|