|
@@ -7,6 +7,7 @@ Written by Simon.Derr@bull.net
|
|
|
Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
|
|
|
Modified by Paul Jackson <pj@sgi.com>
|
|
|
Modified by Christoph Lameter <clameter@sgi.com>
|
|
|
+Modified by Paul Menage <menage@google.com>
|
|
|
|
|
|
CONTENTS:
|
|
|
=========
|
|
@@ -16,10 +17,9 @@ CONTENTS:
|
|
|
1.2 Why are cpusets needed ?
|
|
|
1.3 How are cpusets implemented ?
|
|
|
1.4 What are exclusive cpusets ?
|
|
|
- 1.5 What does notify_on_release do ?
|
|
|
- 1.6 What is memory_pressure ?
|
|
|
- 1.7 What is memory spread ?
|
|
|
- 1.8 How do I use cpusets ?
|
|
|
+ 1.5 What is memory_pressure ?
|
|
|
+ 1.6 What is memory spread ?
|
|
|
+ 1.7 How do I use cpusets ?
|
|
|
2. Usage Examples and Syntax
|
|
|
2.1 Basic Usage
|
|
|
2.2 Adding/removing cpus
|
|
@@ -44,18 +44,19 @@ hierarchy visible in a virtual file system. These are the essential
|
|
|
hooks, beyond what is already present, required to manage dynamic
|
|
|
job placement on large systems.
|
|
|
|
|
|
-Each task has a pointer to a cpuset. Multiple tasks may reference
|
|
|
-the same cpuset. Requests by a task, using the sched_setaffinity(2)
|
|
|
-system call to include CPUs in its CPU affinity mask, and using the
|
|
|
-mbind(2) and set_mempolicy(2) system calls to include Memory Nodes
|
|
|
-in its memory policy, are both filtered through that tasks cpuset,
|
|
|
-filtering out any CPUs or Memory Nodes not in that cpuset. The
|
|
|
-scheduler will not schedule a task on a CPU that is not allowed in
|
|
|
-its cpus_allowed vector, and the kernel page allocator will not
|
|
|
-allocate a page on a node that is not allowed in the requesting tasks
|
|
|
-mems_allowed vector.
|
|
|
-
|
|
|
-User level code may create and destroy cpusets by name in the cpuset
|
|
|
+Cpusets use the generic cgroup subsystem described in
|
|
|
+Documentation/cgroup.txt.
|
|
|
+
|
|
|
+Requests by a task, using the sched_setaffinity(2) system call to
|
|
|
+include CPUs in its CPU affinity mask, and using the mbind(2) and
|
|
|
+set_mempolicy(2) system calls to include Memory Nodes in its memory
|
|
|
+policy, are both filtered through that task's cpuset, filtering out any
|
|
|
+CPUs or Memory Nodes not in that cpuset. The scheduler will not
|
|
|
+schedule a task on a CPU that is not allowed in its cpus_allowed
|
|
|
+vector, and the kernel page allocator will not allocate a page on a
|
|
|
+node that is not allowed in the requesting task's mems_allowed vector.
|
|
|
+
|
|
|
+User level code may create and destroy cpusets by name in the cgroup
|
|
|
virtual file system, manage the attributes and permissions of these
|
|
|
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
|
|
|
specify and query to which cpuset a task is assigned, and list the
|
|
@@ -115,7 +116,7 @@ Cpusets extends these two mechanisms as follows:
|
|
|
- Cpusets are sets of allowed CPUs and Memory Nodes, known to the
|
|
|
kernel.
|
|
|
- Each task in the system is attached to a cpuset, via a pointer
|
|
|
- in the task structure to a reference counted cpuset structure.
|
|
|
+ in the task structure to a reference counted cgroup structure.
|
|
|
- Calls to sched_setaffinity are filtered to just those CPUs
|
|
|
allowed in that tasks cpuset.
|
|
|
- Calls to mbind and set_mempolicy are filtered to just
|
|
@@ -145,15 +146,10 @@ into the rest of the kernel, none in performance critical paths:
|
|
|
- in page_alloc.c, to restrict memory to allowed nodes.
|
|
|
- in vmscan.c, to restrict page recovery to the current cpuset.
|
|
|
|
|
|
-In addition a new file system, of type "cpuset" may be mounted,
|
|
|
-typically at /dev/cpuset, to enable browsing and modifying the cpusets
|
|
|
-presently known to the kernel. No new system calls are added for
|
|
|
-cpusets - all support for querying and modifying cpusets is via
|
|
|
-this cpuset file system.
|
|
|
-
|
|
|
-Each task under /proc has an added file named 'cpuset', displaying
|
|
|
-the cpuset name, as the path relative to the root of the cpuset file
|
|
|
-system.
|
|
|
+You should mount the "cgroup" filesystem type in order to enable
|
|
|
+browsing and modifying the cpusets presently known to the kernel. No
|
|
|
+new system calls are added for cpusets - all support for querying and
|
|
|
+modifying cpusets is via this cgroup file system.
|
|
|
|
|
|
The /proc/<pid>/status file for each task has two added lines,
|
|
|
displaying the tasks cpus_allowed (on which CPUs it may be scheduled)
|
|
@@ -163,16 +159,15 @@ in the format seen in the following example:
|
|
|
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
|
|
|
Mems_allowed: ffffffff,ffffffff
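As an illustration of the mask format shown above, the following Python
sketch decodes such a value into a list of CPU or Memory Node numbers.
It assumes the mask is a comma-separated series of 32-bit hexadecimal
words with the most significant word first. The function name
mask_to_ids is hypothetical and is not part of any kernel interface.

```python
def mask_to_ids(mask):
    """Decode a mask such as 'ffffffff,ffffffff' into a sorted list of bit numbers.

    Assumption: comma-separated 32-bit hex words, most significant word first.
    """
    value = 0
    for word in mask.strip().split(","):
        # Shift the accumulated value up by one 32-bit word, then add this word.
        value = (value << 32) | int(word, 16)

    ids = []
    bit = 0
    while value:
        if value & 1:
            ids.append(bit)
        value >>= 1
        bit += 1
    return ids
```

Under these assumptions, a task confined to CPUs 2 and 3 would show a
Cpus_allowed value whose lowest word is 0000000c.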
|
|
|
|
|
|
-Each cpuset is represented by a directory in the cpuset file system
|
|
|
-containing the following files describing that cpuset:
|
|
|
+Each cpuset is represented by a directory in the cgroup file system
|
|
|
+containing (on top of the standard cgroup files) the following
|
|
|
+files describing that cpuset:
|
|
|
|
|
|
- cpus: list of CPUs in that cpuset
|
|
|
- mems: list of Memory Nodes in that cpuset
|
|
|
- memory_migrate flag: if set, move pages to cpusets nodes
|
|
|
- cpu_exclusive flag: is cpu placement exclusive?
|
|
|
- mem_exclusive flag: is memory placement exclusive?
|
|
|
- - tasks: list of tasks (by pid) attached to that cpuset
|
|
|
- - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
|
|
|
- memory_pressure: measure of how much paging pressure in cpuset
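Because each attribute in the list above is an ordinary file, a cpuset
can be inspected with plain file reads. The Python sketch below collects
whichever of the listed files exist in a given cpuset directory. The
helper name read_cpuset is hypothetical, and the file names are taken
from the list above; depending on the kernel version and mount options,
the actual names may differ (for example by carrying a subsystem
prefix), so treat them as assumptions.

```python
import os

# File names as given in the list above (assumed, may differ per kernel).
CPUSET_FILES = (
    "cpus",
    "mems",
    "memory_migrate",
    "cpu_exclusive",
    "mem_exclusive",
    "memory_pressure",
)


def read_cpuset(path):
    """Return a dict mapping each cpuset file found under path to its contents."""
    settings = {}
    for name in CPUSET_FILES:
        file_path = os.path.join(path, name)
        if os.path.isfile(file_path):
            with open(file_path) as f:
                settings[name] = f.read().strip()
    return settings
```

For example, read_cpuset("/dev/cpuset/Charlie") would return the
settings of a cpuset named "Charlie", assuming the file system is
mounted at /dev/cpuset.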
|
|
|
|
|
|
In addition, the root cpuset only has the following file:
|
|
@@ -237,21 +232,7 @@ such as requests from interrupt handlers, is allowed to be taken
|
|
|
outside even a mem_exclusive cpuset.
|
|
|
|
|
|
|
|
|
-1.5 What does notify_on_release do ?
|
|
|
-------------------------------------
|
|
|
-
|
|
|
-If the notify_on_release flag is enabled (1) in a cpuset, then whenever
|
|
|
-the last task in the cpuset leaves (exits or attaches to some other
|
|
|
-cpuset) and the last child cpuset of that cpuset is removed, then
|
|
|
-the kernel runs the command /sbin/cpuset_release_agent, supplying the
|
|
|
-pathname (relative to the mount point of the cpuset file system) of the
|
|
|
-abandoned cpuset. This enables automatic removal of abandoned cpusets.
|
|
|
-The default value of notify_on_release in the root cpuset at system
|
|
|
-boot is disabled (0). The default value of other cpusets at creation
|
|
|
-is the current value of their parents notify_on_release setting.
|
|
|
-
|
|
|
-
|
|
|
-1.6 What is memory_pressure ?
|
|
|
+1.5 What is memory_pressure ?
|
|
|
-----------------------------
|
|
|
The memory_pressure of a cpuset provides a simple per-cpuset metric
|
|
|
of the rate that the tasks in a cpuset are attempting to free up in
|
|
@@ -308,7 +289,7 @@ the tasks in the cpuset, in units of reclaims attempted per second,
|
|
|
times 1000.
|
|
|
|
|
|
|
|
|
-1.7 What is memory spread ?
|
|
|
+1.6 What is memory spread ?
|
|
|
---------------------------
|
|
|
There are two boolean flag files per cpuset that control where the
|
|
|
kernel allocates pages for the file system buffers and related in
|
|
@@ -379,7 +360,7 @@ data set, the memory allocation across the nodes in the jobs cpuset
|
|
|
can become very uneven.
|
|
|
|
|
|
|
|
|
-1.8 How do I use cpusets ?
|
|
|
+1.7 How do I use cpusets ?
|
|
|
--------------------------
|
|
|
|
|
|
In order to minimize the impact of cpusets on critical kernel
|
|
@@ -469,7 +450,7 @@ than stress the kernel.
|
|
|
To start a new job that is to be contained within a cpuset, the steps are:
|
|
|
|
|
|
1) mkdir /dev/cpuset
|
|
|
- 2) mount -t cpuset none /dev/cpuset
|
|
|
+ 2) mount -t cgroup -o cpuset cpuset /dev/cpuset
|
|
|
3) Create the new cpuset by doing mkdir's and write's (or echo's) in
|
|
|
the /dev/cpuset virtual file system.
|
|
|
4) Start a task that will be the "founding father" of the new job.
|
|
@@ -481,7 +462,7 @@ For example, the following sequence of commands will setup a cpuset
|
|
|
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
|
|
|
and then start a subshell 'sh' in that cpuset:
|
|
|
|
|
|
- mount -t cpuset none /dev/cpuset
|
|
|
+ mount -t cgroup -o cpuset cpuset /dev/cpuset
|
|
|
cd /dev/cpuset
|
|
|
mkdir Charlie
|
|
|
cd Charlie
|
|
@@ -513,7 +494,7 @@ Creating, modifying, using the cpusets can be done through the cpuset
|
|
|
virtual filesystem.
|
|
|
|
|
|
To mount it, type:
|
|
|
-# mount -t cpuset none /dev/cpuset
|
|
|
+# mount -t cgroup -o cpuset cpuset /dev/cpuset
|
|
|
|
|
|
Then under /dev/cpuset you can find a tree that corresponds to the
|
|
|
tree of the cpusets in the system. For instance, /dev/cpuset
|
|
@@ -556,6 +537,18 @@ To remove a cpuset, just use rmdir:
|
|
|
This will fail if the cpuset is in use (has cpusets inside, or has
|
|
|
processes attached).
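The two conditions above can be checked before calling rmdir. The
following Python sketch reports whether a cpuset directory still has
child cpusets or attached tasks. The helper name cpuset_in_use is
hypothetical, and the sketch assumes the attached pids are listed in a
file named "tasks" inside the cpuset directory. The check is only
advisory: a task may attach between the check and the rmdir, so the
rmdir itself can still fail.

```python
import os


def cpuset_in_use(path):
    """Return True if the cpuset at path has child cpusets or attached tasks."""
    # Any subdirectory is a child cpuset.
    for entry in os.listdir(path):
        if os.path.isdir(os.path.join(path, entry)):
            return True

    # A non-empty tasks file means at least one process is still attached.
    tasks_path = os.path.join(path, "tasks")
    if os.path.isfile(tasks_path):
        with open(tasks_path) as f:
            return f.read().strip() != ""
    return False
```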
|
|
|
|
|
|
+Note that for legacy reasons, the "cpuset" filesystem exists as a
|
|
|
+wrapper around the cgroup filesystem.
|
|
|
+
|
|
|
+The command
|
|
|
+
|
|
|
+mount -t cpuset X /dev/cpuset
|
|
|
+
|
|
|
+is equivalent to
|
|
|
+
|
|
|
+mount -t cgroup -o cpuset X /dev/cpuset
|
|
|
+echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
|
|
|
+
|
|
|
2.2 Adding/removing cpus
|
|
|
------------------------
|
|
|
|