|
@@ -0,0 +1,332 @@
|
|
|
+
|
|
|
+What is Linux Memory Policy?
|
|
|
+
|
|
|
+In the Linux kernel, "memory policy" determines from which node the kernel will
|
|
|
+allocate memory in a NUMA system or in an emulated NUMA system. Linux has
|
|
|
+supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
|
|
|
+The current memory policy support was added to Linux 2.6 around May 2004. This
|
|
|
+document attempts to describe the concepts and APIs of the 2.6 memory policy
|
|
|
+support.
|
|
|
+
|
|
|
+Memory policies should not be confused with cpusets (Documentation/cpusets.txt)
|
|
|
+which are an administrative mechanism for restricting the nodes from which
|
|
|
+memory may be allocated by a set of processes. Memory policies are a
|
|
|
+programming interface that a NUMA-aware application can take advantage of. When
|
|
|
+both cpusets and policies are applied to a task, the restrictions of the cpuset
|
|
|
+take priority. See "MEMORY POLICIES AND CPUSETS" below for more details.
|
|
|
+
|
|
|
+MEMORY POLICY CONCEPTS
|
|
|
+
|
|
|
+Scope of Memory Policies
|
|
|
+
|
|
|
+The Linux kernel supports _scopes_ of memory policy, described here from
|
|
|
+most general to most specific:
|
|
|
+
|
|
|
+ System Default Policy: this policy is "hard coded" into the kernel. It
|
|
|
+ is the policy that governs all page allocations that aren't controlled
|
|
|
+ by one of the more specific policy scopes discussed below. When the
|
|
|
+ system is "up and running", the system default policy will use "local
|
|
|
+ allocation" described below. However, during boot up, the system
|
|
|
+ default policy will be set to interleave allocations across all nodes
|
|
|
+ with "sufficient" memory, so as not to overload the initial boot node
|
|
|
+ with boot-time allocations.
|
|
|
+
|
|
|
+ Task/Process Policy: this is an optional, per-task policy. When defined
|
|
|
+ for a specific task, this policy controls all page allocations made by or
|
|
|
+ on behalf of the task that aren't controlled by a more specific scope.
|
|
|
+ If a task does not define a task policy, then all page allocations that
|
|
|
+ would have been controlled by the task policy "fall back" to the System
|
|
|
+ Default Policy.
|
|
|
+
|
|
|
+ The task policy applies to the entire address space of a task. Thus,
|
|
|
+ it is inheritable, and indeed is inherited, across both fork()
|
|
|
+ [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task
|
|
|
+ to establish the task policy for a child task exec()'d from an
|
|
|
+ executable image that has no awareness of memory policy. See the
|
|
|
+ MEMORY POLICY APIS section, below, for an overview of the system call
|
|
|
+ that a task may use to set/change its task/process policy.
|
|
|
+
|
|
|
+ In a multi-threaded task, task policies apply only to the thread
|
|
|
+ [Linux kernel task] that installs the policy and any threads
|
|
|
+ subsequently created by that thread. Any sibling threads existing
|
|
|
+ at the time a new task policy is installed retain their current
|
|
|
+ policy.
|
|
|
+
|
|
|
+ A task policy applies only to pages allocated after the policy is
|
|
|
+ installed. Any pages already faulted in by the task when the task
|
|
|
+ changes its task policy remain where they were allocated based on
|
|
|
+ the policy at the time they were allocated.
|
|
|
+
|
|
|
+ VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's
|
|
|
+ virtual address space. A task may define a specific policy for a range
|
|
|
+ of its virtual address space. See the MEMORY POLICY APIS section,
|
|
|
+ below, for an overview of the mbind() system call used to set a VMA
|
|
|
+ policy.
|
|
|
+
|
|
|
+ A VMA policy will govern the allocation of pages that back this region of
|
|
|
+ the address space. Any regions of the task's address space that don't
|
|
|
+ have an explicit VMA policy will fall back to the task policy, which may
|
|
|
+ itself fall back to the System Default Policy.
|
|
|
+
|
|
|
+ VMA policies have a few complicating details:
|
|
|
+
|
|
|
+ VMA policy applies ONLY to anonymous pages. These include pages
|
|
|
+ allocated for anonymous segments, such as the task stack and heap, and
|
|
|
+ any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
|
|
|
+ If a VMA policy is applied to a file mapping, it will be ignored if
|
|
|
+ the mapping used the MAP_SHARED flag. If the file mapping used the
|
|
|
+ MAP_PRIVATE flag, the VMA policy will only be applied when an
|
|
|
+ anonymous page is allocated on an attempt to write to the mapping--
|
|
|
+ i.e., at Copy-On-Write.
|
|
|
+
|
|
|
+ VMA policies are shared between all tasks that share a virtual address
|
|
|
+ space--a.k.a. threads--independent of when the policy is installed; and
|
|
|
+ they are inherited across fork(). However, because VMA policies refer
|
|
|
+ to a specific region of a task's address space, and because the address
|
|
|
+ space is discarded and recreated on exec*(), VMA policies are NOT
|
|
|
+ inheritable across exec(). Thus, only NUMA-aware applications may
|
|
|
+ use VMA policies.
|
|
|
+
|
|
|
+ A task may install a new VMA policy on a sub-range of a previously
|
|
|
+ mmap()ed region. When this happens, Linux splits the existing virtual
|
|
|
+ memory area into 2 or 3 VMAs, each with its own policy.
|
|
|
+
|
|
|
+ By default, VMA policy applies only to pages allocated after the policy
|
|
|
+ is installed. Any pages already faulted into the VMA range remain
|
|
|
+ where they were allocated based on the policy at the time they were
|
|
|
+ allocated. However, since 2.6.16, Linux supports page migration via
|
|
|
+ the mbind() system call, so that page contents can be moved to match
|
|
|
+ a newly installed policy.
|
|
|
+
|
|
|
+ Shared Policy: Conceptually, shared policies apply to "memory objects"
|
|
|
+ mapped shared into one or more tasks' distinct address spaces. An
|
|
|
+ application installs shared policies the same way as VMA policies--using
|
|
|
+ the mbind() system call specifying a range of virtual addresses that map
|
|
|
+ the shared object. However, unlike VMA policies, which can be considered
|
|
|
+ to be an attribute of a range of a task's address space, shared policies
|
|
|
+ apply directly to the shared object. Thus, all tasks that attach to the
|
|
|
+ object share the policy, and all pages allocated for the shared object,
|
|
|
+ by any task, will obey the shared policy.
|
|
|
+
|
|
|
+ As of 2.6.22, only shared memory segments, created by shmget() or
|
|
|
+ mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
|
|
|
+ policy support was added to Linux, the associated data structures were
|
|
|
+ added to hugetlbfs shmem segments. At the time, hugetlbfs did not
|
|
|
+ support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
|
|
|
+ shmem segments were never "hooked up" to the shared policy support.
|
|
|
+ Although hugetlbfs segments now support lazy allocation, their support
|
|
|
+ for shared policy has not been completed.
|
|
|
+
|
|
|
+ As mentioned above [re: VMA policies], allocations of page cache
|
|
|
+ pages for regular files mmap()ed with MAP_SHARED ignore any VMA
|
|
|
+ policy installed on the virtual address range backed by the shared
|
|
|
+ file mapping. Rather, shared page cache pages, including pages backing
|
|
|
+ private mappings that have not yet been written by the task, follow
|
|
|
+ task policy, if any, else System Default Policy.
|
|
|
+
|
|
|
+ The shared policy infrastructure supports different policies on subset
|
|
|
+ ranges of the shared object. However, Linux still splits the VMA of
|
|
|
+ the task that installs the policy for each range of distinct policy.
|
|
|
+ Thus, different tasks that attach to a shared memory segment can have
|
|
|
+ different VMA configurations mapping that one shared object. This
|
|
|
+ can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
|
|
|
+ a shared memory region, when one task has installed shared policy on
|
|
|
+ one or more ranges of the region.
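The fallback between scopes described above can be sketched in a few lines
of C. This is an illustration only, not kernel code: struct policy, enum
mode and effective_policy() are hypothetical names, and a NULL pointer
stands for "no policy installed at this scope".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's policy object. */
enum mode { MODE_DEFAULT, MODE_PREFERRED, MODE_BIND, MODE_INTERLEAVE };

struct policy {
	enum mode mode;
	unsigned long nodes;	/* bit n set => node n is in the policy */
};

/*
 * Pick the policy that governs an allocation: the most specific scope
 * with a policy installed wins.  'vma_or_shared' and 'task' may be NULL
 * (nothing installed); 'sys_default' must always be present.
 */
const struct policy *effective_policy(const struct policy *vma_or_shared,
				      const struct policy *task,
				      const struct policy *sys_default)
{
	assert(sys_default != NULL);
	if (vma_or_shared)
		return vma_or_shared;
	if (task)
		return task;
	return sys_default;
}
```

A caller would pass the VMA or shared policy covering the faulting address,
if any, and the faulting task's policy, if any.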
|
|
|
+
|
|
|
+Components of Memory Policies
|
|
|
+
|
|
|
+ A Linux memory policy is a tuple consisting of a "mode" and an optional set
|
|
|
+ of nodes. The mode determines the behavior of the policy, while the
|
|
|
+ optional set of nodes can be viewed as the arguments to the behavior.
|
|
|
+
|
|
|
+ Internally, memory policies are implemented by a reference counted
|
|
|
+ structure, struct mempolicy. Details of this structure will be discussed
|
|
|
+ in context, below, as required to explain the behavior.
|
|
|
+
|
|
|
+ Note: in some functions AND in the struct mempolicy itself, the mode
|
|
|
+ is called "policy". However, to avoid confusion with the policy tuple,
|
|
|
+ this document will continue to use the term "mode".
|
|
|
+
|
|
|
+ Linux memory policy supports the following 4 behavioral modes:
|
|
|
+
|
|
|
+ Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
|
|
|
+ context or scope dependent.
|
|
|
+
|
|
|
+ As mentioned in the Policy Scope section above, during normal
|
|
|
+ system operation, the System Default Policy is hard coded to
|
|
|
+ contain the Default mode.
|
|
|
+
|
|
|
+ In this context, default mode means "local" allocation--that is
|
|
|
+ attempt to allocate the page from the node associated with the cpu
|
|
|
+ where the fault occurs. If the "local" node has no memory, or the
|
|
|
+ node's memory is exhausted [no free pages available], local
|
|
|
+ allocation will "fall back to"--attempt to allocate pages from--
|
|
|
+ "nearby" nodes, in order of increasing "distance".
|
|
|
+
|
|
|
+ Implementation detail -- subject to change: "Fallback" uses
|
|
|
+ a per node list of sibling nodes--called zonelists--built at
|
|
|
+ boot time, or when nodes or memory are added or removed from
|
|
|
+ the system [memory hotplug]. These per node zonelists are
|
|
|
+ constructed with nodes in order of increasing distance based
|
|
|
+ on information provided by the platform firmware.
|
|
|
+
|
|
|
+ When a task/process policy or a shared policy contains the Default
|
|
|
+ mode, this also means "local allocation", as described above.
|
|
|
+
|
|
|
+ In the context of a VMA, Default mode means "fall back to task
|
|
|
+ policy"--which may or may not specify Default mode. Thus, Default
|
|
|
+ mode cannot be counted on to mean local allocation when used
|
|
|
+ on a non-shared region of the address space. However, see
|
|
|
+ MPOL_PREFERRED below.
|
|
|
+
|
|
|
+ The Default mode does not use the optional set of nodes.
|
|
|
+
|
|
|
+ MPOL_BIND: This mode specifies that memory must come from the
|
|
|
+ set of nodes specified by the policy.
|
|
|
+
|
|
|
+ The memory policy APIs do not specify an order in which the nodes
|
|
|
+ will be searched. However, unlike "local allocation", the Bind
|
|
|
+ policy does not consider the distance between the nodes. Rather,
|
|
|
+ allocations will fall back to the nodes specified by the policy in
|
|
|
+ order of numeric node id. Like everything in Linux, this is subject
|
|
|
+ to change.
|
|
|
+
|
|
|
+ MPOL_PREFERRED: This mode specifies that the allocation should be
|
|
|
+ attempted from the single node specified in the policy. If that
|
|
|
+ allocation fails, the kernel will search other nodes, exactly as
|
|
|
+ it would for a local allocation that started at the preferred node
|
|
|
+ in increasing distance from the preferred node. "Local" allocation
|
|
|
+ policy can be viewed as a Preferred policy that starts at the node
|
|
|
+ containing the cpu where the allocation takes place.
|
|
|
+
|
|
|
+ Internally, the Preferred policy uses a single node--the
|
|
|
+ preferred_node member of struct mempolicy. A "distinguished
|
|
|
+ value" of this preferred_node, currently '-1', is interpreted
|
|
|
+ as "the node containing the cpu where the allocation takes
|
|
|
+ place"--local allocation. This is the way to specify
|
|
|
+ local allocation for a specific range of addresses--i.e. for
|
|
|
+ VMA policies.
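The distance-ordered fallback used by local and Preferred allocation can be
illustrated with a small, self-contained sketch. Everything in it is an
assumption made for illustration: the distance table, MAX_NODES and
fallback_order() are hypothetical, and the kernel builds its zonelists from
firmware-provided distances with different code.

```c
#include <assert.h>

#define MAX_NODES 4	/* hypothetical system size for this sketch */

/*
 * Fill order[] with all node ids, nearest first, as seen from 'start'.
 * dist[a][b] is the relative distance from node a to node b; ties are
 * broken in favor of the lower node id.
 */
void fallback_order(int dist[MAX_NODES][MAX_NODES], int start,
		    int order[MAX_NODES])
{
	int used[MAX_NODES] = { 0 };

	assert(start >= 0 && start < MAX_NODES);
	for (int i = 0; i < MAX_NODES; i++) {
		int best = -1;

		/* Pick the closest node that has not been placed yet. */
		for (int n = 0; n < MAX_NODES; n++) {
			if (used[n])
				continue;
			if (best < 0 || dist[start][n] < dist[start][best])
				best = n;
		}
		used[best] = 1;
		order[i] = best;
	}
}
```

With a table in which each node is closest to itself, order[0] is the
starting node, which matches the description of Preferred as "local
allocation that started at the preferred node".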
|
|
|
+
|
|
|
+ MPOL_INTERLEAVE: This mode specifies that page allocations be
|
|
|
+ interleaved, on a page granularity, across the nodes specified in
|
|
|
+ the policy. This mode also behaves slightly differently, based on
|
|
|
+ the context where it is used:
|
|
|
+
|
|
|
+ For allocation of anonymous pages and shared memory pages,
|
|
|
+ Interleave mode indexes the set of nodes specified by the policy
|
|
|
+ using the page offset of the faulting address into the segment
|
|
|
+ [VMA] containing the address modulo the number of nodes specified
|
|
|
+ by the policy. It then attempts to allocate a page, starting at
|
|
|
+ the selected node, as if the node had been specified by a Preferred
|
|
|
+ policy or had been selected by a local allocation. That is,
|
|
|
+ allocation will follow the per node zonelist.
|
|
|
+
|
|
|
+ For allocation of page cache pages, Interleave mode indexes the set
|
|
|
+ of nodes specified by the policy using a node counter maintained
|
|
|
+ per task. This counter wraps around to the lowest specified node
|
|
|
+ after it reaches the highest specified node. This will tend to
|
|
|
+ spread the pages out over the nodes specified by the policy based
|
|
|
+ on the order in which they are allocated, rather than based on any
|
|
|
+ page offset into an address range or file. During system boot up,
|
|
|
+ the temporary interleaved system default policy works in this
|
|
|
+ mode.
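The two interleave selection schemes above can be sketched as follows. This
is a simplified illustration, not the kernel implementation: the node set is
modelled as a single unsigned long bit mask, the per-task counter is modelled
as an index into that set rather than a node id, and nth_node(),
node_count(), interleave_by_offset() and interleave_by_counter() are
hypothetical helper names.

```c
#include <assert.h>

#define BITS_PER_MASK ((int)(8 * sizeof(unsigned long)))

/* Return the id of the n-th set bit (0-based) in 'nodes', or -1. */
int nth_node(unsigned long nodes, unsigned int n)
{
	for (int id = 0; id < BITS_PER_MASK; id++) {
		if (nodes & (1UL << id)) {
			if (n == 0)
				return id;
			n--;
		}
	}
	return -1;
}

/* Count the nodes present in the set. */
unsigned int node_count(unsigned long nodes)
{
	unsigned int count = 0;

	for (; nodes; nodes &= nodes - 1)
		count++;
	return count;
}

/* Anonymous and shared memory: index by page offset into the VMA. */
int interleave_by_offset(unsigned long nodes, unsigned long page_offset)
{
	unsigned int n = node_count(nodes);

	assert(n > 0);
	return nth_node(nodes, (unsigned int)(page_offset % n));
}

/* Page cache: use and advance a per-task counter that wraps around. */
int interleave_by_counter(unsigned long nodes, unsigned int *counter)
{
	unsigned int n = node_count(nodes);
	int node;

	assert(n > 0 && counter != 0);
	node = nth_node(nodes, *counter % n);
	*counter = (*counter + 1) % n;
	return node;
}
```

The node returned by either helper is only the starting point; as described
above, the allocation then follows that node's zonelist if the node has no
free pages.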
|
|
|
+
|
|
|
+MEMORY POLICY APIs
|
|
|
+
|
|
|
+Linux supports 3 system calls for controlling memory policy. These APIs
|
|
|
+always affect only the calling task, the calling task's address space, or
|
|
|
+some shared object mapped into the calling task's address space.
|
|
|
+
|
|
|
+ Note: the headers that define these APIs and the parameter data types
|
|
|
+ for user space applications reside in a package that is not part of
|
|
|
+ the Linux kernel. The kernel system call interfaces, with the 'sys_'
|
|
|
+ prefix, are defined in <linux/syscalls.h>; the mode and flag
|
|
|
+ definitions are in <linux/mempolicy.h>.
|
|
|
+
|
|
|
+Set [Task] Memory Policy:
|
|
|
+
|
|
|
+ long set_mempolicy(int mode, const unsigned long *nmask,
|
|
|
+ unsigned long maxnode);
|
|
|
+
|
|
|
+ Sets the calling task's "task/process memory policy" to the mode
|
|
|
+ specified by the 'mode' argument and the set of nodes defined
|
|
|
+ by 'nmask'. 'nmask' points to a bit mask of node ids containing
|
|
|
+ at least 'maxnode' ids.
|
|
|
+
|
|
|
+ See the set_mempolicy(2) man page for more details.
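As an illustration of the 'nmask' and 'maxnode' arguments, the following
sketch builds a node mask and installs an interleave task policy. It rests
on several assumptions: it invokes the system call directly through
syscall(2) instead of the library wrapper, it defines the mode value locally
as EX_MPOL_INTERLEAVE (assumed to match MPOL_INTERLEAVE in
<linux/mempolicy.h>), it sizes the mask for a hypothetical 128 nodes, and
nodemask_set() and set_interleave_policy() are made-up helper names. Consult
set_mempolicy(2) for the exact meaning of 'maxnode'.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define EX_MPOL_INTERLEAVE 3	/* assumed value of MPOL_INTERLEAVE */
#define EX_MAX_NODES 128	/* hypothetical upper bound on node ids */
#define EX_BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))
#define EX_MASK_LONGS (EX_MAX_NODES / EX_BITS_PER_LONG)

/* Set the bit for 'node' in a mask of EX_MASK_LONGS unsigned longs. */
void nodemask_set(unsigned long *mask, unsigned int node)
{
	assert(mask != NULL && node < EX_MAX_NODES);
	mask[node / EX_BITS_PER_LONG] |= 1UL << (node % EX_BITS_PER_LONG);
}

/*
 * Interleave the calling task's future allocations over nodes a and b.
 * Returns 0 on success, or -1 with errno set on failure.
 */
long set_interleave_policy(unsigned int a, unsigned int b)
{
	unsigned long mask[EX_MASK_LONGS];

	memset(mask, 0, sizeof(mask));
	nodemask_set(mask, a);
	nodemask_set(mask, b);
	return syscall(SYS_set_mempolicy, EX_MPOL_INTERLEAVE, mask,
		       (unsigned long)EX_MAX_NODES);
}
```

Because task policy is inherited across fork() and exec*(), a launcher could
call set_interleave_policy() and then exec an unmodified program.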
|
|
|
+
|
|
|
+
|
|
|
+Get [Task] Memory Policy or Related Information
|
|
|
+
|
|
|
+ long get_mempolicy(int *mode,
|
|
|
+ unsigned long *nmask, unsigned long maxnode,
|
|
|
+ void *addr, int flags);
|
|
|
+
|
|
|
+ Queries the "task/process memory policy" of the calling task, or
|
|
|
+ the policy or location of a specified virtual address, depending
|
|
|
+ on the 'flags' argument.
|
|
|
+
|
|
|
+ See the get_mempolicy(2) man page for more details.
|
|
|
+
|
|
|
+
|
|
|
+Install VMA/Shared Policy for a Range of Task's Address Space
|
|
|
+
|
|
|
+ long mbind(void *start, unsigned long len, int mode,
|
|
|
+ const unsigned long *nmask, unsigned long maxnode,
|
|
|
+ unsigned flags);
|
|
|
+
|
|
|
+ mbind() installs the policy specified by (mode, nmask, maxnode) as
|
|
|
+ a VMA policy for the range of the calling task's address space
|
|
|
+ specified by the 'start' and 'len' arguments. Additional actions
|
|
|
+ may be requested via the 'flags' argument.
|
|
|
+
|
|
|
+ See the mbind(2) man page for more details.
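A minimal sketch of installing a VMA policy with mbind() follows. It
requests local allocation for an anonymous mapping by passing the Preferred
mode with an empty node set, which mbind(2) documents as allocating on the
node of the CPU that triggers the allocation. The sketch assumes a direct
syscall(2) invocation rather than the library wrapper, defines the mode
value locally as EX_MPOL_PREFERRED (assumed to match MPOL_PREFERRED in
<linux/mempolicy.h>), and uses a hypothetical helper name,
map_local_region().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define EX_MPOL_PREFERRED 1UL	/* assumed value of MPOL_PREFERRED */

/*
 * Map 'len' bytes of anonymous memory and ask for local allocation by
 * installing a Preferred VMA policy with an empty node set.
 * Returns the mapping, or NULL if mmap() fails.  *mbind_rc receives the
 * mbind() result: 0 on success, -1 with errno set on failure.
 */
void *map_local_region(size_t len, long *mbind_rc)
{
	void *p;

	assert(len > 0 && mbind_rc != NULL);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	/* start, len, mode, nmask, maxnode, flags */
	*mbind_rc = syscall(SYS_mbind, p, (unsigned long)len,
			    EX_MPOL_PREFERRED, NULL, 0UL, 0UL);
	return p;
}
```

The policy must be installed before the pages are first touched, since by
default it only affects pages allocated after the mbind() call.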
|
|
|
+
|
|
|
+MEMORY POLICY COMMAND LINE INTERFACE
|
|
|
+
|
|
|
+Although not strictly part of the Linux implementation of memory policy,
|
|
|
+a command line tool, numactl(8), exists that allows one to:
|
|
|
+
|
|
|
++ set the task policy for a specified program via set_mempolicy(2), fork(2) and
|
|
|
+ exec(2)
|
|
|
+
|
|
|
++ set the shared policy for a shared memory segment via mbind(2)
|
|
|
+
|
|
|
+The numactl(8) tool is packaged with the run-time version of the library
|
|
|
+containing the memory policy system call wrappers. Some distributions
|
|
|
+package the headers and compile-time libraries in a separate development
|
|
|
+package.
|
|
|
+
|
|
|
+
|
|
|
+MEMORY POLICIES AND CPUSETS
|
|
|
+
|
|
|
+Memory policies work within cpusets as described above. For memory policies
|
|
|
+that require a node or set of nodes, the nodes are restricted to the set of
|
|
|
+nodes whose memories are allowed by the cpuset constraints. If the
|
|
|
+intersection of the set of nodes specified for the policy and the set of nodes
|
|
|
+allowed by the cpuset is the empty set, the policy is considered invalid and
|
|
|
+cannot be installed.
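The intersection rule above can be sketched as below. This is an
illustration only: node sets are modelled as single unsigned long bit masks,
and restrict_to_cpuset() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Restrict a policy's node set to the nodes the cpuset allows.
 * Returns 0 and stores the usable nodes in *effective, or -1 if the
 * intersection is empty, in which case the policy cannot be installed.
 */
int restrict_to_cpuset(unsigned long policy_nodes,
		       unsigned long cpuset_mems,
		       unsigned long *effective)
{
	unsigned long both = policy_nodes & cpuset_mems;

	assert(effective != NULL);
	if (both == 0)
		return -1;
	*effective = both;
	return 0;
}
```

For example, a policy naming nodes 1 and 2 inside a cpuset that allows nodes
0 and 1 would end up using node 1 only.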
|
|
|
+
|
|
|
+The interaction of memory policies and cpusets can be problematic for a
|
|
|
+couple of reasons:
|
|
|
+
|
|
|
+1) the memory policy APIs take physical node ids as arguments. However, the
|
|
|
+ memory policy APIs do not provide a way to determine what nodes are valid
|
|
|
+ in the context where the application is running. An application MAY consult
|
|
|
+ the cpuset file system [directly or via an out of tree, and not generally
|
|
|
+ available, libcpuset API] to obtain this information, but then the
|
|
|
+ application must be aware that it is running in a cpuset and use what are
|
|
|
+ intended primarily as administrative APIs.
|
|
|
+
|
|
|
+ However, as long as the policy specifies at least one node that is valid
|
|
|
+ in the controlling cpuset, the policy can be used.
|
|
|
+
|
|
|
+2) when tasks in two cpusets share access to a memory region, such as shared
|
|
|
+ memory segments created by shmget() or mmap() with the MAP_ANONYMOUS and
|
|
|
+ MAP_SHARED flags, and any of the tasks installs shared policy on the region,
|
|
|
+ only nodes whose memories are allowed in both cpusets may be used in the
|
|
|
+ policies. Again, obtaining this information requires "stepping outside"
|
|
|
+ the memory policy APIs, as well as knowing in what cpusets other tasks might
|
|
|
+ be attaching to the shared region, to use the cpuset information.
|
|
|
+ Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
|
|
|
+ allocation is the only valid policy.
|