16 years ago · 9eb425c046
--- a/Documentation/filesystems/squashfs.txt
+++ b/Documentation/filesystems/squashfs.txt
@@ -0,0 +1,225 @@
 
				+SQUASHFS 4.0 FILESYSTEM
			
 
				+=======================
			
 
				+
			
 
				+Squashfs is a compressed read-only filesystem for Linux.
			
 
				+It uses zlib compression to compress files, inodes and directories.
			
 
				+Inodes in the system are very small and all blocks are packed to minimise
			
 
				+data overhead. Block sizes greater than 4K are supported up to a maximum
			
 
				+of 1 Mbyte (default block size 128K).
			
 
				+
			
 
				+Squashfs is intended for general read-only filesystem use, for archival
			
 
				+use (i.e. in cases where a .tar.gz file may be used), and in constrained
			
 
				+block device/memory systems (e.g. embedded systems) where low overhead is
			
 
				+needed.
			
 
				+
			
 
				+Mailing list: squashfs-devel@lists.sourceforge.net
			
 
				+Web site: www.squashfs.org
			
 
				+
			
 
				+1. FILESYSTEM FEATURES
			
 
				+----------------------
			
 
				+
			
 
				+Squashfs filesystem features versus Cramfs:
			
 
				+
			
 
				+				Squashfs		Cramfs
			
 
				+
			
 
				+Max filesystem size:		2^64			256 MiB
			
 
				+Max file size:			~ 2 TiB			16 MiB
			
 
				+Max files:			unlimited		unlimited
			
 
				+Max directories:		unlimited		unlimited
			
 
				+Max entries per directory:	unlimited		unlimited
			
 
				+Max block size:			1 MiB			4 KiB
			
 
				+Metadata compression:		yes			no
			
 
				+Directory indexes:		yes			no
			
 
				+Sparse file support:		yes			no
			
 
				+Tail-end packing (fragments):	yes			no
			
 
				+Exportable (NFS etc.):		yes			no
			
 
				+Hard link support:		yes			no
			
 
				+"." and ".." in readdir:	yes			no
			
 
				+Real inode numbers:		yes			no
			
 
				+32-bit uids/gids:		yes			no
			
 
				+File creation time:		yes			no
			
 
				+Xattr and ACL support:		no			no
			
 
				+
			
 
				+Squashfs compresses data, inodes and directories.  In addition, inode and
			
 
				+directory data are highly compacted, and packed on byte boundaries.  Each
			
 
				+compressed inode is on average 8 bytes in length (the exact length varies with
			
 
				+file type, i.e. regular file, directory, symbolic link, and block/char device
			
 
				+inodes have different sizes).
			
 
				+
			
 
				+2. USING SQUASHFS
			
 
				+-----------------
			
 
				+
			
 
				+As squashfs is a read-only filesystem, the mksquashfs program must be used to
			
 
				+create populated squashfs filesystems.  This and other squashfs utilities
			
 
				+can be obtained from http://www.squashfs.org.  Usage instructions can be
			
 
				+obtained from this site also.
			
 
				+
			
 
				+
			
 
				+3. SQUASHFS FILESYSTEM DESIGN
			
 
				+-----------------------------
			
 
				+
			
 
				+A squashfs filesystem consists of seven parts, packed together on a byte
			
 
				+alignment:
			
 
				+
			
 
				+	 ---------------
			
 
				+	|  superblock 	|
			
 
				+	|---------------|
			
 
				+	|  datablocks   |
			
 
				+	|  & fragments  |
			
 
				+	|---------------|
			
 
				+	|  inode table	|
			
 
				+	|---------------|
			
 
				+	|   directory	|
			
 
				+	|     table     |
			
 
				+	|---------------|
			
 
				+	|   fragment	|
			
 
				+	|    table      |
			
 
				+	|---------------|
			
 
				+	|    export     |
			
 
				+	|    table      |
			
 
				+	|---------------|
			
 
				+	|    uid/gid	|
			
 
				+	|  lookup table	|
			
 
				+	 ---------------
			
 
				+
			
 
				+Compressed data blocks are written to the filesystem as files are read from
			
 
				+the source directory, and checked for duplicates.  Once all file data has been
			
 
				+written the completed inode, directory, fragment, export and uid/gid lookup
			
 
				+tables are written.
			
 
				+
			
 
				+3.1 Inodes
			
 
				+----------
			
 
				+
			
 
				+Metadata (inodes and directories) are compressed in 8Kbyte blocks.  Each
			
 
				+compressed block is prefixed by a two-byte length; the top bit is set if the
			
 
				+block is uncompressed.  A block will be uncompressed if the -noI option is set,
			
 
				+or if the compressed block was larger than the uncompressed block.
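The header rule above can be sketched in a few lines of Python. This is an illustration only, not the mksquashfs implementation: the helper names are hypothetical, and little-endian byte order is an assumption not stated in this section.

```python
import struct
import zlib

METADATA_SIZE = 8192        # 8 Kbyte metadata blocks, as stated in the text
UNCOMPRESSED_BIT = 0x8000   # top bit of the two-byte length

def pack_metadata_block(payload: bytes, compress: bool = True) -> bytes:
    """Return two-byte header + body for one metadata block (hypothetical helper)."""
    if len(payload) > METADATA_SIZE:
        raise ValueError("metadata payload exceeds 8 Kbytes")
    body = zlib.compress(payload) if compress else payload
    if not compress or len(body) > len(payload):
        # Compression disabled (as with -noI) or not worthwhile:
        # store the raw bytes and set the top bit of the length.
        body = payload
        header = len(body) | UNCOMPRESSED_BIT
    else:
        header = len(body)
    # Assumption: the length is stored little-endian.
    return struct.pack("<H", header) + body

def unpack_metadata_block(raw: bytes) -> bytes:
    """Read the two-byte length, then return the decompressed body."""
    (header,) = struct.unpack_from("<H", raw, 0)
    length = header & 0x7FFF
    body = raw[2:2 + length]
    if header & UNCOMPRESSED_BIT:
        return body
    return zlib.decompress(body)
```

Because a block never exceeds 8192 bytes, the length always fits in the low 15 bits, which leaves the top bit free to act as the flag.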
			
 
				+
			
 
				+Inodes are packed into the metadata blocks, and are not aligned to block
			
 
				+boundaries, so an inode may straddle compressed blocks.  Inodes are identified
			
 
				+by a 48-bit number which encodes the location of the compressed metadata block
			
 
				+containing the inode, and the byte offset into that block where the inode is
			
 
				+placed (<block, offset>).
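As an illustration of the <block, offset> encoding, the sketch below packs the two values into a single 48-bit number. The text does not state how the bits are split; putting the offset in the low 16 bits and the block location in the upper 32 bits is an assumption here, and the function names are hypothetical.

```python
OFFSET_BITS = 16   # assumption: low 16 bits hold the offset into the block
BLOCK_BITS = 32    # assumption: upper 32 bits hold the metadata block location

def make_inode_ref(block: int, offset: int) -> int:
    """Combine a metadata block location and a byte offset into one number."""
    if not 0 <= block < (1 << BLOCK_BITS):
        raise ValueError("block location out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (block << OFFSET_BITS) | offset

def split_inode_ref(ref: int) -> tuple:
    """Recover (block, offset) from a 48-bit inode reference."""
    return ref >> OFFSET_BITS, ref & ((1 << OFFSET_BITS) - 1)
```

An offset into a decompressed 8 Kbyte block needs only 13 bits, so a 16-bit field is more than sufficient under this assumption.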
			
 
				+
			
 
				+To maximise compression there are different inodes for each file type
			
 
				+(regular file, directory, device, etc.), the inode contents and length
			
 
				+varying with the type.
			
 
				+
			
 
				+To further maximise compression, two types of regular file inode and
			
 
				+directory inode are defined: inodes optimised for frequently occurring
			
 
				+regular files and directories, and extended types where extra
			
 
				+information has to be stored.
			
 
				+
			
 
				+3.2 Directories
			
 
				+---------------
			
 
				+
			
 
				+Like inodes, directories are packed into compressed metadata blocks, stored
			
 
				+in a directory table.  Directories are accessed using the start address of
			
 
				+the metablock containing the directory and the offset into the
			
 
				+decompressed block (<block, offset>).
			
 
				+
			
 
				+Directories are organised in a slightly complex way, and are not simply
			
 
				+a list of file names.  The organisation takes advantage of the
			
 
				+fact that (in most cases) the inodes of the files will be in the same
			
 
				+compressed metadata block, and therefore, can share the start block.
			
 
				+Directories are therefore organised in a two level list, a directory
			
 
				+header containing the shared start block value, and a sequence of directory
			
 
				+entries, each of which shares that start block.  A new directory header
			
 
				+is written once/if the inode start block changes.  The directory
			
 
				+header/directory entry list is repeated as many times as necessary.
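The two-level list can be pictured with a short sketch. It groups sorted directory entries into runs that share an inode start block, emitting a new header whenever the start block changes. The Entry type and build_directory are hypothetical names, and any other limit the real on-disk format places on a run is ignored.

```python
from itertools import groupby
from typing import NamedTuple

class Entry(NamedTuple):
    name: str
    inode_block: int    # start of the metadata block holding the inode
    inode_offset: int   # offset of the inode inside that block

def build_directory(entries):
    """Return [(shared_start_block, [(name, offset), ...]), ...]."""
    runs = []
    # Entries are sorted by name; groupby starts a new run (a new header)
    # each time the inode start block differs from the previous entry's.
    for start_block, group in groupby(sorted(entries), key=lambda e: e.inode_block):
        runs.append((start_block, [(e.name, e.inode_offset) for e in group]))
    return runs
```

For example, entries a and b in block 0, c in block 96 and d back in block 0 produce three headers, because the start block changes twice in name order.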
			
 
				+
			
 
				+Directories are sorted, and can contain a directory index to speed up
			
 
				+file lookup.  Directory indexes store one entry per metablock, each entry
			
 
				+storing the index/filename mapping to the first directory header
			
 
				+in each metadata block.  Directories are sorted in alphabetical order,
			
 
				+and at lookup the index is scanned linearly looking for the first filename
			
 
				+alphabetically larger than the filename being looked up.  At this point the
			
 
				+location of the metadata block the filename is in has been found.
			
 
				+The general idea of the index is to ensure only one metadata block needs to be
			
 
				+decompressed to do a lookup irrespective of the length of the directory.
			
 
				+This scheme has the advantage that it doesn't require extra memory overhead
			
 
				+and doesn't require much extra storage on disk.
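The linear index scan described above might look like the following sketch. The index is modelled as a list of (first filename, metadata block location) pairs, one per metablock; that representation and the name find_block are assumptions made for illustration, and Python string comparison stands in for alphabetical order.

```python
def find_block(index, name, dir_start):
    """Return the metadata block in which `name` would be found.

    index     -- sorted list of (first_filename, block_location) pairs
    dir_start -- block where the directory begins, used when `name`
                 sorts before every index entry
    """
    block = dir_start
    for first_name, block_location in index:
        if first_name > name:
            # First index filename larger than the target: the previous
            # entry's block is the one to decompress.
            break
        block = block_location
    return block
```

Only the block returned here has to be decompressed, however long the directory is; the scan itself touches only the small index.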
			
 
				+
			
 
				+3.3 File data
			
 
				+-------------
			
 
				+
			
 
				+Regular files consist of a sequence of contiguous compressed blocks, and/or a
			
 
				+compressed fragment block (tail-end packed block).   The compressed size
			
 
				+of each datablock is stored in a block list contained within the
			
 
				+file inode.
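Because the blocks are contiguous and only their compressed sizes are stored, the disk location of block n follows from the sizes of the blocks before it. A minimal sketch, assuming the block list is a plain list of byte counts and that the file's first block starts at a known address (locate_block is a hypothetical name):

```python
def locate_block(start, compressed_sizes, n):
    """Return (disk_offset, compressed_size) of datablock n.

    start            -- disk address of the file's first datablock
    compressed_sizes -- block list taken from the file inode
    """
    if not 0 <= n < len(compressed_sizes):
        raise IndexError("block index out of range")
    # Blocks are contiguous, so block n begins after blocks 0..n-1.
    return start + sum(compressed_sizes[:n]), compressed_sizes[n]
```

The cost of this sum grows with n, which is why very large files benefit from caching the mapping from block index to disk location.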
			
 
				+
			
 
				+To speed up access to datablocks when reading 'large' files (256 Mbytes or
			
 
				+larger), the code implements an index cache that caches the mapping from
			
 
				+block index to datablock location on disk.
			
 
				+
			
 
				+The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
			
 
				+retaining a simple and space-efficient block list on disk.  The cache
			
 
				+is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
			
 
				+Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
			
 
				+The index cache is designed to be memory efficient, and by default uses
			
 
				+16 KiB.
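The figures quoted above are consistent with each other, as this short arithmetic check shows. It only restates the numbers in the text (8 slots, 224 GiB per slot, 128 KiB blocks); the variable names are illustrative.

```python
KIB, GIB, TIB = 1 << 10, 1 << 30, 1 << 40

block_size = 128 * KIB      # default block size
slots = 8                   # index cache slots
per_slot = 224 * GIB        # file size covered by one slot

# Number of datablocks one slot must be able to map.
blocks_per_slot = per_slot // block_size

# Largest file covered when all eight slots are used.
total_bytes = slots * per_slot
total_tib = total_bytes / TIB
```

Here blocks_per_slot is 1,835,008 and total_tib is 1.75, matching the 1.75 TiB limit stated for a file that uses all eight slots.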
			
 
				+
			
 
				+3.4 Fragment lookup table
			
 
				+-------------------------
			
 
				+
			
 
				+Regular files can contain a fragment index which is mapped to a fragment
			
 
				+location on disk and compressed size using a fragment lookup table.  This
			
 
				+fragment lookup table is itself stored compressed into metadata blocks.
			
 
				+A second index table is used to locate these.  This second index table for
			
 
				+speed of access (and because it is small) is read at mount time and cached
			
 
				+in memory.
			
 
				+
			
 
				+3.5 Uid/gid lookup table
			
 
				+------------------------
			
 
				+
			
 
				+For space efficiency regular files store uid and gid indexes, which are
			
 
				+converted to 32-bit uids/gids using an id lookup table.  This table is
			
 
				+stored compressed into metadata blocks.  A second index table is used to
			
 
				+locate these.  This second index table for speed of access (and because it
			
 
				+is small) is read at mount time and cached in memory.
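The two-level lookup can be sketched as follows. The sketch assumes ids are stored as little-endian 32-bit values, so one 8 Kbyte metadata block holds 2048 of them, and it models already-decompressed metadata blocks as a dictionary keyed by a made-up disk location. All names are hypothetical. The fragment and export tables follow the same pattern with different entry types.

```python
import struct

METADATA_SIZE = 8192
IDS_PER_BLOCK = METADATA_SIZE // 4   # 2048 32-bit ids per metadata block

def build_id_table(ids):
    """Pack ids into metadata-sized blocks; return (index_table, blocks)."""
    index_table, blocks = [], {}
    for n, i in enumerate(range(0, len(ids), IDS_PER_BLOCK)):
        chunk = ids[i:i + IDS_PER_BLOCK]
        location = 1000 + n   # made-up stand-in for a disk location
        blocks[location] = struct.pack(f"<{len(chunk)}I", *chunk)
        index_table.append(location)
    return index_table, blocks

def lookup_id(index_table, blocks, id_index):
    """Convert an id index stored in an inode into a 32-bit uid/gid."""
    # The index table (cached in memory at mount time) gives the
    # location of the metadata block holding this id.
    location = index_table[id_index // IDS_PER_BLOCK]
    # The id sits at a fixed offset inside that (decompressed) block.
    offset = (id_index % IDS_PER_BLOCK) * 4
    (value,) = struct.unpack_from("<I", blocks[location], offset)
    return value
```

Keeping only the small index table in memory means a lookup costs one division and at most one metadata block read.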
			
 
				+
			
 
				+3.6 Export table
			
 
				+----------------
			
 
				+
			
 
				+To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems
			
 
				+can optionally (disabled with the -no-exports Mksquashfs option) contain
			
 
				+an inode number to inode disk location lookup table.  This is required to
			
 
				+enable Squashfs to map inode numbers passed in filehandles to the inode
			
 
				+location on disk, which is necessary when the export code reinstantiates
			
 
				+expired/flushed inodes.
			
 
				+
			
 
				+This table is stored compressed into metadata blocks.  A second index table is
			
 
				+used to locate these.  This second index table for speed of access (and because
			
 
				+it is small) is read at mount time and cached in memory.
			
 
				+
			
 
				+
			
 
				+4. TODOS AND OUTSTANDING ISSUES
			
 
				+-------------------------------
			
 
				+
			
 
				+4.1 Todo list
			
 
				+-------------
			
 
				+
			
 
				+Implement Xattr and ACL support.  The Squashfs 4.0 filesystem layout has hooks
			
 
				+for these but the code has not been written.  Once the code has been written
			
 
				+the existing layout should not require modification.
			
 
				+
			
 
				+4.2 Squashfs internal cache
			
 
				+---------------------------
			
 
				+
			
 
				+Blocks in Squashfs are compressed.  To avoid repeatedly decompressing
			
 
				+recently accessed data Squashfs uses two small metadata and fragment caches.
			
 
				+
			
 
				+The cache is not used for file datablocks; these are decompressed and cached in
			
 
				+the page-cache in the normal way.  The cache is used to temporarily cache
			
 
				+fragment and metadata blocks which have been read as a result of a metadata
			
 
				+(i.e. inode or directory) or fragment access.  Because metadata and fragments
			
 
				+are packed together into blocks (to gain greater compression) the read of a
			
 
				+particular piece of metadata or fragment will retrieve other metadata/fragments
			
 
				+which have been packed with it; because of locality of reference these may be
			
 
				+read in the near future. Temporarily caching them ensures they are available
			
 
				+for near future access without requiring an additional read and decompress.
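The benefit of such a cache can be illustrated with a small sketch. It keeps a handful of decompressed blocks keyed by disk location and drops the least recently used one when full. The eviction policy, the capacity and the class name are assumptions made for illustration; the text does not describe how the kernel cache chooses which entry to replace.

```python
from collections import OrderedDict

class BlockCache:
    """Tiny cache of decompressed metadata/fragment blocks (illustrative)."""

    def __init__(self, read_and_decompress, capacity=8):
        self._read = read_and_decompress   # callable: location -> bytes
        self._capacity = capacity
        self._entries = OrderedDict()
        self.misses = 0

    def get(self, location):
        if location in self._entries:
            # Hit: no read or decompress is needed.
            self._entries.move_to_end(location)
            return self._entries[location]
        # Miss: read and decompress once, then remember the result.
        self.misses += 1
        data = self._read(location)
        self._entries[location] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)   # drop least recently used
        return data
```

Several inodes packed into the same metadata block then cost a single read and decompress, provided they are accessed while the block is still cached.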
			
 
				+
			
 
				+In the future this internal cache may be replaced with an implementation which
			
 
				+uses the kernel page cache.  Because the page cache operates on page sized
			
 
				+units this may introduce additional complexity in terms of locking and
			
 
				+associated race conditions.