What is Linux Memory Policy?

In the Linux kernel, "memory policy" determines from which node the kernel
will allocate memory in a NUMA system or in an emulated NUMA system. Linux
has supported platforms with Non-Uniform Memory Access architectures since
2.4.?. The current memory policy support was added to Linux 2.6 around May
2004. This document attempts to describe the concepts and APIs of the 2.6
memory policy support.

Memory policies should not be confused with cpusets
(Documentation/cpusets.txt), which is an administrative mechanism for
restricting the nodes from which memory may be allocated by a set of
processes. Memory policies are a programming interface that a NUMA-aware
application can take advantage of. When both cpusets and policies are
applied to a task, the restrictions of the cpuset take priority. See
"MEMORY POLICIES AND CPUSETS" below for more details.

MEMORY POLICY CONCEPTS

Scope of Memory Policies

The Linux kernel supports _scopes_ of memory policy, described here from
most general to most specific:

    System Default Policy: this policy is "hard coded" into the kernel.
    It is the policy that governs all page allocations that aren't
    controlled by one of the more specific policy scopes discussed below.
    When the system is "up and running", the system default policy will
    use "local allocation", described below. However, during boot up, the
    system default policy will be set to interleave allocations across
    all nodes with "sufficient" memory, so as not to overload the initial
    boot node with boot-time allocations.

    Task/Process Policy: this is an optional, per-task policy. When
    defined for a specific task, this policy controls all page
    allocations made by or on behalf of the task that aren't controlled
    by a more specific scope. If a task does not define a task policy,
    then all page allocations that would have been controlled by the task
    policy "fall back" to the System Default Policy.

    The task policy applies to the entire address space of a task. Thus,
    it is inheritable, and indeed is inherited, across both fork()
    [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent
    task to establish the task policy for a child task exec()'d from an
    executable image that has no awareness of memory policy. See the
    MEMORY POLICY APIS section, below, for an overview of the system call
    that a task may use to set/change its task/process policy.

    In a multi-threaded task, task policies apply only to the thread
    [Linux kernel task] that installs the policy and any threads
    subsequently created by that thread. Any sibling threads existing at
    the time a new task policy is installed retain their current policy.

    A task policy applies only to pages allocated after the policy is
    installed. Any pages already faulted in by the task when the task
    changes its task policy remain where they were allocated based on the
    policy at the time they were allocated.
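
    For example, the following sketch shows a parent installing a task
    policy that a policy-unaware child inherits across fork() and exec().
    It is a minimal sketch only: it assumes libnuma's <numaif.h> wrapper
    for set_mempolicy() (link with -lnuma), a system with at least two
    nodes, and a hypothetical placeholder path for the child program.

        #include <numaif.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main(void)
        {
            /* Interleave across nodes 0 and 1 (illustrative node ids). */
            unsigned long nodemask = (1UL << 0) | (1UL << 1);

            /* Install the task policy; the fork()/exec() below inherit it. */
            if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                              sizeof(nodemask) * 8) != 0) {
                perror("set_mempolicy");
                exit(EXIT_FAILURE);
            }

            pid_t pid = fork();
            if (pid == 0) {
                /* Placeholder path: any NUMA-unaware program works. */
                execl("/path/to/app", "app", (char *)NULL);
                perror("execl");
                _exit(EXIT_FAILURE);
            }
            waitpid(pid, NULL, 0);
            return 0;
        }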

    VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a
    task's virtual address space. A task may define a specific policy for
    a range of its virtual address space. See the MEMORY POLICY APIS
    section, below, for an overview of the mbind() system call used to
    set a VMA policy.

    A VMA policy will govern the allocation of pages that back this
    region of the address space. Any regions of the task's address space
    that don't have an explicit VMA policy will fall back to the task
    policy, which may itself fall back to the System Default Policy.

    VMA policies have a few complicating details:

        VMA policy applies ONLY to anonymous pages. These include pages
        allocated for anonymous segments, such as the task stack and
        heap, and any regions of the address space mmap()ed with the
        MAP_ANONYMOUS flag. If a VMA policy is applied to a file mapping,
        it will be ignored if the mapping used the MAP_SHARED flag. If
        the file mapping used the MAP_PRIVATE flag, the VMA policy will
        only be applied when an anonymous page is allocated on an attempt
        to write to the mapping--i.e., at Copy-On-Write.

        VMA policies are shared between all tasks that share a virtual
        address space--a.k.a. threads--independent of when the policy is
        installed; and they are inherited across fork(). However, because
        VMA policies refer to a specific region of a task's address
        space, and because the address space is discarded and recreated
        on exec*(), VMA policies are NOT inheritable across exec(). Thus,
        only NUMA-aware applications may use VMA policies.

        A task may install a new VMA policy on a sub-range of a
        previously mmap()ed region. When this happens, Linux splits the
        existing virtual memory area into 2 or 3 VMAs, each with its own
        policy.

        By default, VMA policy applies only to pages allocated after the
        policy is installed. Any pages already faulted into the VMA range
        remain where they were allocated based on the policy at the time
        they were allocated. However, since 2.6.16, Linux supports page
        migration via the mbind() system call, so that page contents can
        be moved to match a newly installed policy.
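
    The following minimal sketch illustrates the migration behavior just
    described, assuming libnuma's <numaif.h> wrappers (link with -lnuma)
    and a node 0 chosen purely for illustration: mbind() with the
    MPOL_MF_MOVE flag asks the kernel to move already-faulted pages in
    the range so that they conform to the new policy.

        #include <numaif.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/mman.h>

        int main(void)
        {
            size_t len = 4 * 1024 * 1024;
            char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED) {
                perror("mmap");
                return EXIT_FAILURE;
            }
            memset(buf, 0, len);   /* fault pages in under current policy */

            /* Bind the range to node 0; MPOL_MF_MOVE migrates the pages
             * that were already allocated above. */
            unsigned long nodemask = 1UL << 0;
            if (mbind(buf, len, MPOL_BIND, &nodemask,
                      sizeof(nodemask) * 8, MPOL_MF_MOVE) != 0)
                perror("mbind");
            return 0;
        }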

    Shared Policy: Conceptually, shared policies apply to "memory
    objects" mapped shared into one or more tasks' distinct address
    spaces. An application installs shared policies the same way as VMA
    policies--using the mbind() system call specifying a range of virtual
    addresses that map the shared object. However, unlike VMA policies,
    which can be considered to be an attribute of a range of a task's
    address space, shared policies apply directly to the shared object.
    Thus, all tasks that attach to the object share the policy, and all
    pages allocated for the shared object, by any task, will obey the
    shared policy.

    As of 2.6.22, only shared memory segments, created by shmget() or
    mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
    policy support was added to Linux, the associated data structures
    were added to hugetlbfs shmem segments. At the time, hugetlbfs did
    not support allocation at fault time--a.k.a. lazy allocation--so
    hugetlbfs shmem segments were never "hooked up" to the shared policy
    support. Although hugetlbfs segments now support lazy allocation,
    their support for shared policy has not been completed.

    As mentioned above [re: VMA policies], allocations of page cache
    pages for regular files mmap()ed with MAP_SHARED ignore any VMA
    policy installed on the virtual address range backed by the shared
    file mapping. Rather, shared page cache pages, including pages
    backing private mappings that have not yet been written by the task,
    follow task policy, if any, else System Default Policy.

    The shared policy infrastructure supports different policies on
    subset ranges of the shared object. However, Linux still splits the
    VMA of the task that installs the policy for each range of distinct
    policy. Thus, different tasks that attach to a shared memory segment
    can have different VMA configurations mapping that one shared object.
    This can be seen by examining the /proc/<pid>/numa_maps of tasks
    sharing a shared memory region, when one task has installed shared
    policy on one or more ranges of the region.
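
    A minimal sketch of installing a shared policy, assuming libnuma's
    <numaif.h> wrappers and two nodes (node ids illustrative). Because
    the mapping is a shared memory segment, the policy attaches to the
    object itself; tasks that attach later allocate pages under it.

        #include <numaif.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ipc.h>
        #include <sys/shm.h>

        int main(void)
        {
            size_t len = 4 * 1024 * 1024;
            int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
            if (id < 0) {
                perror("shmget");
                return EXIT_FAILURE;
            }
            char *seg = shmat(id, NULL, 0);
            if (seg == (void *)-1) {
                perror("shmat");
                return EXIT_FAILURE;
            }

            /* Interleave the whole segment across nodes 0 and 1. */
            unsigned long nodemask = (1UL << 0) | (1UL << 1);
            if (mbind(seg, len, MPOL_INTERLEAVE, &nodemask,
                      sizeof(nodemask) * 8, 0) != 0)
                perror("mbind");
            return 0;
        }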

Components of Memory Policies

A Linux memory policy is a tuple consisting of a "mode" and an optional
set of nodes. The mode determines the behavior of the policy, while the
optional set of nodes can be viewed as the arguments to the behavior.

Internally, memory policies are implemented by a reference counted
structure, struct mempolicy. Details of this structure will be discussed
in context, below, as required to explain the behavior.

Note: in some functions AND in the struct mempolicy itself, the mode is
called "policy". However, to avoid confusion with the policy tuple, this
document will continue to use the term "mode".
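
In the user-space APIs described below, the optional set of nodes is
passed as a bit mask of node ids. The following helper is a small sketch
(not part of any kernel or library header) of how such a mask can be
built:

    #include <limits.h>   /* CHAR_BIT */

    /* Set the bit for 'node' in a nodemask represented as an array of
     * unsigned longs, as the memory policy system calls expect. */
    static inline void nodemask_set(unsigned long *mask, int node)
    {
        const int bits = sizeof(unsigned long) * CHAR_BIT;
        mask[node / bits] |= 1UL << (node % bits);
    }

    /* Usage, for illustrative node ids 0 and 2:
     *
     *     unsigned long mask[2] = { 0 };
     *     nodemask_set(mask, 0);
     *     nodemask_set(mask, 2);
     *     // pass (mask, sizeof(mask) * CHAR_BIT) as (nmask, maxnode)
     */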

Linux memory policy supports the following four behavioral modes:

    Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
    context or scope dependent.

    As mentioned in the Policy Scope section above, during normal system
    operation, the System Default Policy is hard coded to contain the
    Default mode.

    In this context, default mode means "local" allocation--that is,
    attempt to allocate the page from the node associated with the cpu
    where the fault occurs. If the "local" node has no memory, or the
    node's memory is exhausted [no free pages available], local
    allocation will "fall back to"--attempt to allocate pages
    from--"nearby" nodes, in order of increasing "distance".

        Implementation detail -- subject to change: "Fallback" uses a
        per node list of sibling nodes--called zonelists--built at boot
        time, or when nodes or memory are added to or removed from the
        system [memory hotplug]. These per node zonelists are constructed
        with nodes in order of increasing distance based on information
        provided by the platform firmware.

    When a task/process policy or a shared policy contains the Default
    mode, this also means "local allocation", as described above.

    In the context of a VMA, Default mode means "fall back to task
    policy"--which may or may not specify Default mode. Thus, Default
    mode cannot be counted on to mean local allocation when used on a
    non-shared region of the address space. However, see MPOL_PREFERRED
    below.

    The Default mode does not use the optional set of nodes.

    MPOL_BIND: This mode specifies that memory must come from the set of
    nodes specified by the policy.

    The memory policy APIs do not specify an order in which the nodes
    will be searched. However, unlike "local allocation", the Bind policy
    does not consider the distance between the nodes. Rather, allocations
    will fall back to the nodes specified by the policy in order of
    numeric node id. Like everything in Linux, this is subject to change.

    MPOL_PREFERRED: This mode specifies that the allocation should be
    attempted from the single node specified in the policy. If that
    allocation fails, the kernel will search other nodes, exactly as it
    would for a local allocation that started at the preferred node, in
    increasing distance from the preferred node. "Local" allocation
    policy can be viewed as a Preferred policy that starts at the node
    containing the cpu where the allocation takes place.

        Internally, the Preferred policy uses a single node--the
        preferred_node member of struct mempolicy. A "distinguished
        value" of this preferred_node, currently '-1', is interpreted as
        "the node containing the cpu where the allocation takes
        place"--local allocation. This is the way to specify local
        allocation for a specific range of addresses--i.e., for VMA
        policies.
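
    The mbind(2) man page documents that MPOL_PREFERRED with an empty
    node set requests exactly this local allocation behavior. A minimal
    sketch for a specific address range, assuming libnuma's <numaif.h>
    wrappers:

        #include <numaif.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>

        int main(void)
        {
            size_t len = 1024 * 1024;
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED) {
                perror("mmap");
                return EXIT_FAILURE;
            }

            /* An empty node set (NULL nmask, maxnode 0) with
             * MPOL_PREFERRED requests local allocation for this range. */
            if (mbind(buf, len, MPOL_PREFERRED, NULL, 0, 0) != 0)
                perror("mbind");
            return 0;
        }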

    MPOL_INTERLEAVE: This mode specifies that page allocations be
    interleaved, on a page granularity, across the nodes specified in the
    policy. This mode also behaves slightly differently, based on the
    context where it is used:

        For allocation of anonymous pages and shared memory pages,
        Interleave mode indexes the set of nodes specified by the policy
        using the page offset of the faulting address into the segment
        [VMA] containing the address, modulo the number of nodes
        specified by the policy. It then attempts to allocate a page,
        starting at the selected node, as if the node had been specified
        by a Preferred policy or had been selected by a local allocation.
        That is, allocation will follow the per node zonelist. See the
        sketch below for the indexing arithmetic.

        For allocation of page cache pages, Interleave mode indexes the
        set of nodes specified by the policy using a node counter
        maintained per task. This counter wraps around to the lowest
        specified node after it reaches the highest specified node. This
        will tend to spread the pages out over the nodes specified by the
        policy based on the order in which they are allocated, rather
        than based on any page offset into an address range or file.
        During system boot up, the temporary interleaved system default
        policy works in this mode.
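
    The offset-based indexing for anonymous pages can be sketched as
    follows. This is illustrative only; the kernel's actual computation
    also accounts for the mapping's file offset and lives in the
    interleave allocation path.

        /* Pick an interleave node for a faulting address, mirroring the
         * "page offset into the VMA, modulo node count" rule described
         * above. 'nodes' holds the policy's node ids in order. */
        static int interleave_node(unsigned long addr,
                                   unsigned long vma_start,
                                   const int *nodes, int nnodes,
                                   unsigned long page_size)
        {
            unsigned long pgoff = (addr - vma_start) / page_size;
            return nodes[pgoff % nnodes];
        }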

MEMORY POLICY APIs

Linux supports 3 system calls for controlling memory policy. These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

    Note: the headers that define these APIs and the parameter data types
    for user space applications reside in a package that is not part of
    the Linux kernel. The kernel system call interfaces, with the 'sys_'
    prefix, are defined in <linux/syscalls.h>; the mode and flag
    definitions are defined in <linux/mempolicy.h>.
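
    If the library wrappers are not installed, the system calls can be
    invoked directly via syscall(2). A minimal sketch for set_mempolicy,
    taking the mode definitions from <linux/mempolicy.h> as noted above
    (node 0 is illustrative):

        #include <linux/mempolicy.h>   /* MPOL_* mode definitions */
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int main(void)
        {
            unsigned long nodemask = 1UL << 0;   /* node 0 only */

            if (syscall(SYS_set_mempolicy, MPOL_BIND, &nodemask,
                        sizeof(nodemask) * 8) != 0) {
                perror("set_mempolicy");
                return 1;
            }
            return 0;
        }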

Set [Task] Memory Policy:

    long set_mempolicy(int mode, const unsigned long *nmask,
                       unsigned long maxnode);

    Sets the calling task's "task/process memory policy" to the mode
    specified by the 'mode' argument and the set of nodes defined by
    'nmask'. 'nmask' points to a bit mask of node ids containing at least
    'maxnode' ids.

    See the set_mempolicy(2) man page for more details.
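
    A hedged usage sketch, assuming libnuma's <numaif.h> wrapper (link
    with -lnuma) and an illustrative preferred node 1:

        #include <numaif.h>
        #include <stdio.h>

        int main(void)
        {
            /* Prefer node 1 for this task's future allocations; the
             * kernel falls back to other nodes if node 1 is exhausted. */
            unsigned long nodemask = 1UL << 1;

            if (set_mempolicy(MPOL_PREFERRED, &nodemask,
                              sizeof(nodemask) * 8) != 0) {
                perror("set_mempolicy");
                return 1;
            }
            return 0;
        }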

Get [Task] Memory Policy or Related Information:

    long get_mempolicy(int *mode,
                       unsigned long *nmask, unsigned long maxnode,
                       void *addr, int flags);

    Queries the "task/process memory policy" of the calling task, or the
    policy or location of a specified virtual address, depending on the
    'flags' argument. Note that 'nmask' is an output argument.

    See the get_mempolicy(2) man page for more details.
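
    A minimal query sketch, again assuming libnuma's <numaif.h> wrapper:
    with a NULL 'addr' and no flags, the call returns the calling task's
    policy mode and node set.

        #include <numaif.h>
        #include <stdio.h>

        int main(void)
        {
            int mode;
            unsigned long nodemask = 0;

            if (get_mempolicy(&mode, &nodemask, sizeof(nodemask) * 8,
                              NULL, 0) != 0) {
                perror("get_mempolicy");
                return 1;
            }
            printf("mode=%d nodemask=0x%lx\n", mode, nodemask);
            return 0;
        }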

Install VMA/Shared Policy for a Range of Task's Address Space:

    long mbind(void *start, unsigned long len, int mode,
               const unsigned long *nmask, unsigned long maxnode,
               unsigned flags);

    mbind() installs the policy specified by (mode, nmask, maxnode) as a
    VMA policy for the range of the calling task's address space
    specified by the 'start' and 'len' arguments. Additional actions may
    be requested via the 'flags' argument.

    See the mbind(2) man page for more details.

MEMORY POLICY COMMAND LINE INTERFACE

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2),
  fork(2) and exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers. Some distributions
package the headers and compile-time libraries in a separate development
package.
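
For example (node ids illustrative; see numactl(8) for the full option
list):

    numactl --interleave=all ./app   # interleave task policy, then exec
    numactl --membind=0,1 ./app      # allocations must come from nodes 0,1
    numactl --preferred=1 ./app      # prefer node 1, fall back elsewhere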

MEMORY POLICIES AND CPUSETS

Memory policies work within cpusets as described above. For memory
policies that require a node or set of nodes, the nodes are restricted to
the set of nodes whose memories are allowed by the cpuset constraints. If
the nodemask specified for the policy contains nodes that are not allowed
by the cpuset, or the intersection of the set of nodes specified for the
policy and the set of nodes with memory is the empty set, the policy is
considered invalid and cannot be installed.

The interaction of memory policies and cpusets can be problematic for a
couple of reasons:

1) the memory policy APIs take physical node ids as arguments. As
   mentioned above, it is illegal to specify nodes that are not allowed
   in the cpuset. The application must query the allowed nodes using the
   get_mempolicy() API with the MPOL_F_MEMS_ALLOWED flag to determine the
   allowed nodes and restrict itself to those nodes. However, the
   resources available to a cpuset can be changed by the system
   administrator, or a workload manager application, at any time. So, a
   task may still get errors attempting to specify policy nodes, and must
   query the allowed memories again. A sketch of this query follows this
   list.

2) when tasks in two cpusets share access to a memory region, such as
   shared memory segments created by shmget() or mmap() with the
   MAP_ANONYMOUS and MAP_SHARED flags, and any of the tasks install
   shared policy on the region, only nodes whose memories are allowed in
   both cpusets may be used in the policies. Obtaining this information
   requires "stepping outside" the memory policy APIs to use the cpuset
   information, and requires that one know in what cpusets other tasks
   might be attaching to the shared region. Furthermore, if the cpusets'
   allowed memory sets are disjoint, "local" allocation is the only valid
   policy.
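
A minimal sketch of the allowed-nodes query from item 1, assuming
libnuma's <numaif.h> wrapper and a kernel that supports the
MPOL_F_MEMS_ALLOWED flag:

    #include <numaif.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned long allowed = 0;

        /* Ask the kernel which node ids this task's cpuset currently
         * allows; re-query after failures, since cpusets can change. */
        if (get_mempolicy(NULL, &allowed, sizeof(allowed) * 8,
                          NULL, MPOL_F_MEMS_ALLOWED) != 0) {
            perror("get_mempolicy");
            return 1;
        }
        printf("allowed nodes mask: 0x%lx\n", allowed);
        return 0;
    }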
|