|
@@ -135,77 +135,58 @@ most general to most specific:
|
|
|
|
|
|
Components of Memory Policies
|
|
|
|
|
|
- A Linux memory policy is a tuple consisting of a "mode" and an optional set
|
|
|
- of nodes. The mode determine the behavior of the policy, while the
|
|
|
- optional set of nodes can be viewed as the arguments to the behavior.
|
|
|
+ A Linux memory policy consists of a "mode", optional mode flags, and an
|
|
|
+ optional set of nodes. The mode determines the behavior of the policy,
|
|
|
+ the optional mode flags determine the behavior of the mode, and the
|
|
|
+ optional set of nodes can be viewed as the arguments to the policy
|
|
|
+ behavior.
|
|
|
|
|
|
Internally, memory policies are implemented by a reference counted
|
|
|
structure, struct mempolicy. Details of this structure will be discussed
|
|
|
in context, below, as required to explain the behavior.
|
|
|
|
|
|
- Note: in some functions AND in the struct mempolicy itself, the mode
|
|
|
- is called "policy". However, to avoid confusion with the policy tuple,
|
|
|
- this document will continue to use the term "mode".
|
|
|
-
|
|
|
Linux memory policy supports the following 4 behavioral modes:
|
|
|
|
|
|
- Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
|
|
|
- context or scope dependent.
|
|
|
-
|
|
|
- As mentioned in the Policy Scope section above, during normal
|
|
|
- system operation, the System Default Policy is hard coded to
|
|
|
- contain the Default mode.
|
|
|
-
|
|
|
- In this context, default mode means "local" allocation--that is
|
|
|
- attempt to allocate the page from the node associated with the cpu
|
|
|
- where the fault occurs. If the "local" node has no memory, or the
|
|
|
- node's memory can be exhausted [no free pages available], local
|
|
|
- allocation will "fallback to"--attempt to allocate pages from--
|
|
|
- "nearby" nodes, in order of increasing "distance".
|
|
|
+ Default Mode--MPOL_DEFAULT: This mode is only used in the memory
|
|
|
+ policy APIs. Internally, MPOL_DEFAULT is converted to the NULL
|
|
|
+ memory policy in all policy scopes. Any existing non-default policy
|
|
|
+ will simply be removed when MPOL_DEFAULT is specified. As a result,
|
|
|
+ MPOL_DEFAULT means "fall back to the next most specific policy scope."
|
|
|
|
|
|
- Implementation detail -- subject to change: "Fallback" uses
|
|
|
- a per node list of sibling nodes--called zonelists--built at
|
|
|
- boot time, or when nodes or memory are added or removed from
|
|
|
- the system [memory hotplug]. These per node zonelist are
|
|
|
- constructed with nodes in order of increasing distance based
|
|
|
- on information provided by the platform firmware.
|
|
|
+ For example, a NULL or default task policy will fall back to the
|
|
|
+ system default policy. A NULL or default vma policy will fall
|
|
|
+ back to the task policy.
|
|
|
|
|
|
- When a task/process policy or a shared policy contains the Default
|
|
|
- mode, this also means "local allocation", as described above.
|
|
|
+ When specified in one of the memory policy APIs, the Default mode
|
|
|
+ does not use the optional set of nodes.
|
|
|
|
|
|
- In the context of a VMA, Default mode means "fall back to task
|
|
|
- policy"--which may or may not specify Default mode. Thus, Default
|
|
|
- mode can not be counted on to mean local allocation when used
|
|
|
- on a non-shared region of the address space. However, see
|
|
|
- MPOL_PREFERRED below.
|
|
|
-
|
|
|
- The Default mode does not use the optional set of nodes.
|
|
|
+ It is an error for the set of nodes specified for this policy to
|
|
|
+ be non-empty.
|
|
|
|
|
|
MPOL_BIND: This mode specifies that memory must come from the
|
|
|
- set of nodes specified by the policy.
|
|
|
-
|
|
|
- The memory policy APIs do not specify an order in which the nodes
|
|
|
- will be searched. However, unlike "local allocation", the Bind
|
|
|
- policy does not consider the distance between the nodes. Rather,
|
|
|
- allocations will fallback to the nodes specified by the policy in
|
|
|
- order of numeric node id. Like everything in Linux, this is subject
|
|
|
- to change.
|
|
|
+ set of nodes specified by the policy. Memory will be allocated from
|
|
|
+ the node in the set with sufficient free memory that is closest to
|
|
|
+ the node where the allocation takes place.
|
|
|
|
|
|
MPOL_PREFERRED: This mode specifies that the allocation should be
|
|
|
attempted from the single node specified in the policy. If that
|
|
|
- allocation fails, the kernel will search other nodes, exactly as
|
|
|
- it would for a local allocation that started at the preferred node
|
|
|
- in increasing distance from the preferred node. "Local" allocation
|
|
|
- policy can be viewed as a Preferred policy that starts at the node
|
|
|
+ allocation fails, the kernel will search other nodes, in order of
|
|
|
+ increasing distance from the preferred node based on information
|
|
|
+ provided by the platform firmware.
|
|
|
containing the cpu where the allocation takes place.
|
|
|
|
|
|
Internally, the Preferred policy uses a single node--the
|
|
|
- preferred_node member of struct mempolicy. A "distinguished
|
|
|
- value of this preferred_node, currently '-1', is interpreted
|
|
|
- as "the node containing the cpu where the allocation takes
|
|
|
- place"--local allocation. This is the way to specify
|
|
|
- local allocation for a specific range of addresses--i.e. for
|
|
|
- VMA policies.
|
|
|
+ preferred_node member of struct mempolicy. When the internal
|
|
|
+ mode flag MPOL_F_LOCAL is set, the preferred_node is ignored and
|
|
|
+ the policy is interpreted as local allocation. "Local" allocation
|
|
|
+ policy can be viewed as a Preferred policy that starts at the node
|
|
|
+ containing the cpu where the allocation takes place.
|
|
|
+
|
|
|
+ It is possible for the user to specify that local allocation is
|
|
|
+ always preferred by passing an empty nodemask with this mode.
|
|
|
+ If an empty nodemask is passed, the policy cannot use the
|
|
|
+ MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described
|
|
|
+ below.
|
|
|
|
|
|
MPOL_INTERLEAVED: This mode specifies that page allocations be
|
|
|
interleaved, on a page granularity, across the nodes specified in
|
|
@@ -231,6 +212,154 @@ Components of Memory Policies
|
|
|
the temporary interleaved system default policy works in this
|
|
|
mode.
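The scope fallback described for MPOL_DEFAULT above can be sketched in a few lines of C. This is only an illustration: struct sk_policy, the SK_* mode names, sk_system_default and sk_effective_policy() are hypothetical and are not kernel identifiers. A NULL pointer stands for a scope in which MPOL_DEFAULT was specified.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct mempolicy. */
enum sk_mode { SK_LOCAL, SK_PREFERRED, SK_BIND, SK_INTERLEAVE };

struct sk_policy {
	enum sk_mode mode;
};

/* The system default always exists, so the lookup below cannot fail. */
static const struct sk_policy sk_system_default = { SK_LOCAL };

/*
 * Return the policy that governs an allocation: the most specific
 * non-NULL scope wins, and a NULL scope falls back to the next one.
 */
const struct sk_policy *sk_effective_policy(const struct sk_policy *vma_pol,
					    const struct sk_policy *task_pol)
{
	if (vma_pol != NULL)
		return vma_pol;		/* vma policy set explicitly */
	if (task_pol != NULL)
		return task_pol;	/* NULL vma policy: use task policy */
	return &sk_system_default;	/* NULL task policy: use system default */
}
```

Installing MPOL_DEFAULT at a scope corresponds to passing NULL for that scope here, which is why it never needs a nodemask.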
|
|
|
|
|
|
+ Linux memory policy supports the following optional mode flags:
|
|
|
+
|
|
|
+ MPOL_F_STATIC_NODES: This flag specifies that the nodemask passed by
|
|
|
+ the user should not be remapped if the task or VMA's set of allowed
|
|
|
+ nodes changes after the memory policy has been defined.
|
|
|
+
|
|
|
+ Without this flag, anytime a mempolicy is rebound because of a
|
|
|
+ change in the set of allowed nodes, the node (Preferred) or
|
|
|
+ nodemask (Bind, Interleave) is remapped to the new set of
|
|
|
+ allowed nodes. This may result in nodes being used that were
|
|
|
+ previously undesired.
|
|
|
+
|
|
|
+ With this flag, if the user-specified nodes overlap with the
|
|
|
+ nodes allowed by the task's cpuset, then the memory policy is
|
|
|
+ applied to their intersection. If the two sets of nodes do not
|
|
|
+ overlap, the Default policy is used.
|
|
|
+
|
|
|
+ For example, consider a task that is attached to a cpuset with
|
|
|
+ mems 1-3 that sets an Interleave policy over the same set. If
|
|
|
+ the cpuset's mems change to 3-5, the Interleave will now occur
|
|
|
+ over nodes 3, 4, and 5. With this flag, however, since only node
|
|
|
+ 3 is allowed from the user's nodemask, the "interleave" only
|
|
|
+ occurs over that node. If no nodes from the user's nodemask are
|
|
|
+ now allowed, the Default behavior is used.
|
|
|
+
|
|
|
+ MPOL_F_STATIC_NODES cannot be combined with the
|
|
|
+ MPOL_F_RELATIVE_NODES flag. It also cannot be used for
|
|
|
+ MPOL_PREFERRED policies that were created with an empty nodemask
|
|
|
+ (local allocation).
|
|
|
+
|
|
|
+ MPOL_F_RELATIVE_NODES: This flag specifies that the nodemask passed
|
|
|
+ by the user will be mapped relative to the task or VMA's
|
|
|
+ set of allowed nodes. The kernel stores the user-passed nodemask,
|
|
|
+ and if the set of allowed nodes changes, then that original nodemask will
|
|
|
+ be remapped relative to the new set of allowed nodes.
|
|
|
+
|
|
|
+ Without this flag (and without MPOL_F_STATIC_NODES), anytime a
|
|
|
+ mempolicy is rebound because of a change in the set of allowed
|
|
|
+ nodes, the node (Preferred) or nodemask (Bind, Interleave) is
|
|
|
+ remapped to the new set of allowed nodes. That remap may not
|
|
|
+ preserve the relative nature of the user's passed nodemask to its
|
|
|
+ set of allowed nodes upon successive rebinds: a nodemask of
|
|
|
+ 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
|
|
|
+ allowed nodes is restored to its original state.
|
|
|
+
|
|
|
+ With this flag, the remap is done so that the node numbers from
|
|
|
+ the user's passed nodemask are relative to the set of allowed
|
|
|
+ nodes. In other words, if nodes 0, 2, and 4 are set in the user's
|
|
|
+ nodemask, the policy will be effected over the first (and in the
|
|
|
+ Bind or Interleave case, the third and fifth) nodes in the set of
|
|
|
+ allowed nodes. The nodemask passed by the user represents nodes
|
|
|
+ relative to the task or VMA's set of allowed nodes.
|
|
|
+
|
|
|
+ If the user's nodemask includes nodes that are outside the range
|
|
|
+ of the new set of allowed nodes (for example, node 5 is set in
|
|
|
+ the user's nodemask when the set of allowed nodes is only 0-3),
|
|
|
+ then the remap wraps around to the beginning of the nodemask and,
|
|
|
+ if not already set, sets the node in the mempolicy nodemask.
|
|
|
+
|
|
|
+ For example, consider a task that is attached to a cpuset with
|
|
|
+ mems 2-5 that sets an Interleave policy over the same set with
|
|
|
+ MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
|
|
|
+ interleave now occurs over nodes 3,5-7. If the cpuset's mems
|
|
|
+ then change to 0,2-3,5, then the interleave occurs over nodes
|
|
|
+ 0,2-3,5.
|
|
|
+
|
|
|
+ Thanks to the consistent remapping, applications preparing
|
|
|
+ nodemasks to specify memory policies using this flag should
|
|
|
+ disregard their current, actual cpuset imposed memory placement
|
|
|
+ and prepare the nodemask as if they were always located on
|
|
|
+ memory nodes 0 to N-1, where N is the number of memory nodes the
|
|
|
+ policy is intended to manage. Let the kernel then remap to the
|
|
|
+ set of memory nodes allowed by the task's cpuset, as that may
|
|
|
+ change over time.
|
|
|
+
|
|
|
+ MPOL_F_RELATIVE_NODES cannot be combined with the
|
|
|
+ MPOL_F_STATIC_NODES flag. It also cannot be used for
|
|
|
+ MPOL_PREFERRED policies that were created with an empty nodemask
|
|
|
+ (local allocation).
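The two rebind behaviors described for the mode flags can be modeled on plain 64-bit masks. The sketch below is an assumption-laden illustration, not the kernel's code: the nodemask typedef, node_range(), rebind_static() and rebind_relative() are hypothetical names, and the relative case follows the "wrap around, then map onto the allowed set" reading of the text above (fold the user's node numbers modulo the number of allowed nodes, then use each resulting number as an ordinal into the allowed set).

```c
#include <assert.h>
#include <stdint.h>

/* One bit per node id; a hypothetical stand-in for the kernel's nodemask_t. */
typedef uint64_t nodemask;

/* Mask covering node ids lo..hi inclusive (0 <= lo <= hi <= 63). */
nodemask node_range(unsigned lo, unsigned hi)
{
	nodemask upto_hi = (hi == 63) ? UINT64_MAX
				      : (UINT64_C(1) << (hi + 1)) - 1;

	return upto_hi & ~((UINT64_C(1) << lo) - 1);
}

/*
 * MPOL_F_STATIC_NODES reading: never remap, keep only the user's nodes
 * that are still allowed. A result of 0 models "no overlap".
 */
nodemask rebind_static(nodemask user, nodemask allowed)
{
	return user & allowed;
}

/* Count the nodes in a mask. */
static unsigned mask_weight(nodemask m)
{
	unsigned n = 0;

	for (; m != 0; m &= m - 1)
		n++;
	return n;
}

/*
 * MPOL_F_RELATIVE_NODES reading: treat the user's node numbers as
 * positions within the allowed set, wrapping around if they exceed it.
 */
nodemask rebind_relative(nodemask user, nodemask allowed)
{
	unsigned size = mask_weight(allowed);
	nodemask folded = 0, out = 0;
	unsigned bit, ordinal = 0;

	assert(size > 0);	/* an empty allowed set cannot be mapped onto */

	/* Fold: wrap each user node number modulo the allowed-set size. */
	for (bit = 0; bit < 64; bit++)
		if (user & (UINT64_C(1) << bit))
			folded |= UINT64_C(1) << (bit % size);

	/* Onto: position i in 'folded' selects the i-th allowed node. */
	for (bit = 0; bit < 64; bit++) {
		if (!(allowed & (UINT64_C(1) << bit)))
			continue;
		if (folded & (UINT64_C(1) << ordinal))
			out |= UINT64_C(1) << bit;
		ordinal++;
	}
	return out;
}
```

Because rebind_relative() always starts from the nodemask the user originally passed, repeated changes to the allowed set do not accumulate drift, which is the property the flag is meant to provide.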
|
|
|
+
|
|
|
+MEMORY POLICY REFERENCE COUNTING
|
|
|
+
|
|
|
+To resolve use/free races, struct mempolicy contains an atomic reference
|
|
|
+count field. Internal interfaces, mpol_get()/mpol_put(), increment and
|
|
|
+decrement this reference count, respectively. mpol_put() will only free
|
|
|
+the structure back to the mempolicy kmem cache when the reference count
|
|
|
+goes to zero.
|
|
|
+
|
|
|
+When a new memory policy is allocated, its reference count is initialized
|
|
|
+to '1', representing the reference held by the task that is installing the
|
|
|
+new policy. When a pointer to a memory policy structure is stored in another
|
|
|
+structure, another reference is added, as the task's reference will be dropped
|
|
|
+on completion of the policy installation.
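The get/put life cycle described above can be sketched in user-space C11. The names sk_pol, sk_pol_new(), sk_pol_get() and sk_pol_put() are hypothetical stand-ins chosen for this illustration; the real interfaces are mpol_get()/mpol_put() and free to a kmem cache rather than with free(). The sk_pol_freed counter exists only to make the sketch observable.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mempolicy; only the count is modeled. */
struct sk_pol {
	atomic_int refcnt;
};

/* Number of structures released so far (for observation only). */
static int sk_pol_freed;

/* Allocate a policy; the creator holds the initial reference. */
struct sk_pol *sk_pol_new(void)
{
	struct sk_pol *p = malloc(sizeof(*p));

	if (p != NULL)
		atomic_init(&p->refcnt, 1);
	return p;
}

/* Take an additional reference, e.g. when storing the pointer. */
void sk_pol_get(struct sk_pol *p)
{
	atomic_fetch_add(&p->refcnt, 1);
}

/* Drop a reference; free only when the last one goes away. */
void sk_pol_put(struct sk_pol *p)
{
	if (atomic_fetch_sub(&p->refcnt, 1) == 1) {
		free(p);
		sk_pol_freed++;
	}
}
```

In this model, installation is sk_pol_new(), then sk_pol_get() when the pointer is stored, then sk_pol_put() to drop the installing task's reference, leaving exactly one reference owned by the structure that holds the pointer.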
|
|
|
+
|
|
|
+During run-time "usage" of the policy, we attempt to minimize atomic operations
|
|
|
+on the reference count, as this can lead to cache lines bouncing between cpus
|
|
|
+and NUMA nodes. "Usage" here means one of the following:
|
|
|
+
|
|
|
+1) querying of the policy, either by the task itself [using the get_mempolicy()
|
|
|
+ API discussed below] or by another task using the /proc/<pid>/numa_maps
|
|
|
+ interface.
|
|
|
+
|
|
|
+2) examination of the policy to determine the policy mode and associated node
|
|
|
+ or node lists, if any, for page allocation. This is considered a "hot
|
|
|
+ path". Note that for MPOL_BIND, the "usage" extends across the entire
|
|
|
+ allocation process, which may sleep during page reclamation, because the
|
|
|
+ BIND policy nodemask is used, by reference, to filter ineligible nodes.
|
|
|
+
|
|
|
+We can avoid taking an extra reference during the usages listed above as
|
|
|
+follows:
|
|
|
+
|
|
|
+1) we never need to get/free the system default policy as this is never
|
|
|
+ changed nor freed, once the system is up and running.
|
|
|
+
|
|
|
+2) for querying the policy, we do not need to take an extra reference on the
|
|
|
+ target task's task policy nor vma policies because we always acquire the
|
|
|
+ task's mm's mmap_sem for read during the query. The set_mempolicy() and
|
|
|
+ mbind() APIs [see below] always acquire the mmap_sem for write when
|
|
|
+ installing or replacing task or vma policies. Thus, there is no possibility
|
|
|
+ of a task or thread freeing a policy while another task or thread is
|
|
|
+ querying it.
|
|
|
+
|
|
|
+3) Page allocation usage of task or vma policy occurs in the fault path where
|
|
|
+ we hold the mmap_sem for read. Again, because replacing the task or vma
|
|
|
+ policy requires that the mmap_sem be held for write, the policy can't be
|
|
|
+ freed out from under us while we're using it for page allocation.
|
|
|
+
|
|
|
+4) Shared policies require special consideration. One task can replace a
|
|
|
+ shared memory policy while another task, with a distinct mmap_sem, is
|
|
|
+ querying or allocating a page based on the policy. To resolve this
|
|
|
+ potential race, the shared policy infrastructure adds an extra reference
|
|
|
+ to the shared policy during lookup while holding a spin lock on the shared
|
|
|
+ policy management structure. This requires that we drop this extra
|
|
|
+ reference when we're finished "using" the policy. We must drop the
|
|
|
+ extra reference on shared policies in the same query/allocation paths
|
|
|
+ used for non-shared policies. For this reason, shared policies are marked
|
|
|
+ as such, and the extra reference is dropped "conditionally"--i.e., only
|
|
|
+ for shared policies.
|
|
|
+
|
|
|
+ Because of this extra reference counting, and because we must look up
|
|
|
+ shared policies in a tree structure under spinlock, shared policies are
|
|
|
+ more expensive to use in the page allocation path. This is especially
|
|
|
+ true for shared policies on shared memory regions shared by tasks running
|
|
|
+ on different NUMA nodes. This extra overhead can be avoided by always
|
|
|
+ falling back to task or system default policy for shared memory regions,
|
|
|
+ or by prefaulting the entire shared memory region into memory and locking
|
|
|
+ it down. However, this might not be appropriate for all applications.
|
|
|
+
|
|
|
MEMORY POLICY APIs
|
|
|
|
|
|
Linux supports 3 system calls for controlling memory policy. These APIS
|
|
@@ -251,7 +380,9 @@ Set [Task] Memory Policy:
|
|
|
Set's the calling task's "task/process memory policy" to mode
|
|
|
specified by the 'mode' argument and the set of nodes defined
|
|
|
by 'nmask'. 'nmask' points to a bit mask of node ids containing
|
|
|
- at least 'maxnode' ids.
|
|
|
+ at least 'maxnode' ids. Optional mode flags may be passed by
|
|
|
+ combining the 'mode' argument with the flag (for example:
|
|
|
+ MPOL_INTERLEAVE | MPOL_F_STATIC_NODES).
|
|
|
|
|
|
See the set_mempolicy(2) man page for more details
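As a rough illustration of how a mode and an optional mode flag share the single 'mode' argument, the helper below composes the value and rejects the combinations this document describes as invalid. Everything prefixed SK_ or sk_ is hypothetical. The numeric values are assumed to mirror those in linux/mempolicy.h and should be checked against the installed header, whose own definitions real code should use.

```c
#include <stdbool.h>

/* Assumed to mirror linux/mempolicy.h; verify before relying on them. */
enum {
	SK_MPOL_DEFAULT    = 0,
	SK_MPOL_PREFERRED  = 1,
	SK_MPOL_BIND       = 2,
	SK_MPOL_INTERLEAVE = 3,
};
#define SK_F_STATIC_NODES	(1 << 15)
#define SK_F_RELATIVE_NODES	(1 << 14)
#define SK_F_ALL		(SK_F_STATIC_NODES | SK_F_RELATIVE_NODES)

/*
 * Build the value to pass as the 'mode' argument, or return -1 if the
 * combination is one the text above describes as not allowed.
 */
int sk_mode_arg(int mode, int flags, bool nodemask_empty)
{
	if (flags & ~SK_F_ALL)
		return -1;	/* not an optional mode flag */
	if ((flags & SK_F_ALL) == SK_F_ALL)
		return -1;	/* static and relative are exclusive */
	if (mode == SK_MPOL_PREFERRED && nodemask_empty && flags != 0)
		return -1;	/* local allocation takes no node flags */
	return mode | flags;
}
```

For example, sk_mode_arg(SK_MPOL_INTERLEAVE, SK_F_STATIC_NODES, false) produces the combined value corresponding to MPOL_INTERLEAVE | MPOL_F_STATIC_NODES.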
|
|
|
|
|
@@ -303,29 +434,19 @@ MEMORY POLICIES AND CPUSETS
|
|
|
Memory policies work within cpusets as described above. For memory policies
|
|
|
that require a node or set of nodes, the nodes are restricted to the set of
|
|
|
nodes whose memories are allowed by the cpuset constraints. If the nodemask
|
|
|
-specified for the policy contains nodes that are not allowed by the cpuset, or
|
|
|
-the intersection of the set of nodes specified for the policy and the set of
|
|
|
-nodes with memory is the empty set, the policy is considered invalid
|
|
|
-and cannot be installed.
|
|
|
-
|
|
|
-The interaction of memory policies and cpusets can be problematic for a
|
|
|
-couple of reasons:
|
|
|
-
|
|
|
-1) the memory policy APIs take physical node id's as arguments. As mentioned
|
|
|
- above, it is illegal to specify nodes that are not allowed in the cpuset.
|
|
|
- The application must query the allowed nodes using the get_mempolicy()
|
|
|
- API with the MPOL_F_MEMS_ALLOWED flag to determine the allowed nodes and
|
|
|
- restrict itself to those nodes. However, the resources available to a
|
|
|
- cpuset can be changed by the system administrator, or a workload manager
|
|
|
- application, at any time. So, a task may still get errors attempting to
|
|
|
- specify policy nodes, and must query the allowed memories again.
|
|
|
-
|
|
|
-2) when tasks in two cpusets share access to a memory region, such as shared
|
|
|
- memory segments created by shmget() of mmap() with the MAP_ANONYMOUS and
|
|
|
- MAP_SHARED flags, and any of the tasks install shared policy on the region,
|
|
|
- only nodes whose memories are allowed in both cpusets may be used in the
|
|
|
- policies. Obtaining this information requires "stepping outside" the
|
|
|
- memory policy APIs to use the cpuset information and requires that one
|
|
|
- know in what cpusets other task might be attaching to the shared region.
|
|
|
- Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
|
|
|
- allocation is the only valid policy.
|
|
|
+specified for the policy contains nodes that are not allowed by the cpuset and
|
|
|
+MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
|
|
|
+specified for the policy and the set of nodes with memory is used. If the
|
|
|
+result is the empty set, the policy is considered invalid and cannot be
|
|
|
+installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
|
|
|
+onto and folded into the task's set of allowed nodes as previously described.
|
|
|
+
|
|
|
+The interaction of memory policies and cpusets can be problematic when tasks
|
|
|
+in two cpusets share access to a memory region, such as shared memory segments
|
|
|
+created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and
|
|
|
+any of the tasks install shared policy on the region. In that case, only nodes whose
|
|
|
+memories are allowed in both cpusets may be used in the policies. Obtaining
|
|
|
+this information requires "stepping outside" the memory policy APIs to use the
|
|
|
+cpuset information and requires that one know in what cpusets other tasks might
|
|
|
+be attaching to the shared region. Furthermore, if the cpusets' allowed
|
|
|
+memory sets are disjoint, "local" allocation is the only valid policy.
|