|
@@ -0,0 +1,498 @@
|
|
|
+Cluster-wide Power-up/power-down race avoidance algorithm
|
|
|
+=========================================================
|
|
|
+
|
|
|
+This file documents the algorithm which is used to coordinate CPU and
|
|
|
+cluster setup and teardown operations and to manage hardware coherency
|
|
|
+controls safely.
|
|
|
+
|
|
|
+The section "Rationale" explains what the algorithm is for and why it is
|
|
|
+needed. "Basic model" explains general concepts using a simplified view
|
|
|
+of the system. The other sections explain the actual details of the
|
|
|
+algorithm in use.
|
|
|
+
|
|
|
+
|
|
|
+Rationale
|
|
|
+---------
|
|
|
+
|
|
|
+In a system containing multiple CPUs, it is desirable to have the
|
|
|
+ability to turn off individual CPUs when the system is idle, reducing
|
|
|
+power consumption and thermal dissipation.
|
|
|
+
|
|
|
+In a system containing multiple clusters of CPUs, it is also desirable
|
|
|
+to have the ability to turn off entire clusters.
|
|
|
+
|
|
|
+Turning entire clusters off and on is a risky business, because it
|
|
|
+involves performing potentially destructive operations affecting a group
|
|
|
+of independently running CPUs, while the OS continues to run. This
|
|
|
+means that we need some coordination in order to ensure that critical
|
|
|
+cluster-level operations are only performed when it is truly safe to do
|
|
|
+so.
|
|
|
+
|
|
|
+Simple locking may not be sufficient to solve this problem, because
|
|
|
+mechanisms like Linux spinlocks may rely on coherency mechanisms which
|
|
|
+are not immediately enabled when a cluster powers up. Since enabling or
|
|
|
+disabling those mechanisms may itself be a non-atomic operation (such as
|
|
|
+writing some hardware registers and invalidating large caches), other
|
|
|
+methods of coordination are required in order to guarantee safe
|
|
|
+power-down and power-up at the cluster level.
|
|
|
+
|
|
|
+The mechanism presented in this document is a coherent memory
|
|
|
+based protocol for performing the needed coordination. It aims to be as
|
|
|
+lightweight as possible, while providing the required safety properties.
|
|
|
+
|
|
|
+
|
|
|
+Basic model
|
|
|
+-----------
|
|
|
+
|
|
|
+Each cluster and CPU is assigned a state, as follows:
|
|
|
+
|
|
|
+ DOWN
|
|
|
+ COMING_UP
|
|
|
+ UP
|
|
|
+ GOING_DOWN
|
|
|
+
|
|
|
+        +---------> UP ----------+
+        |                        v
+
+    COMING_UP                GOING_DOWN
+
+        ^                        |
+        +--------- DOWN <--------+
|
|
|
+
|
|
|
+
|
|
|
+DOWN: The CPU or cluster is not coherent, and is either powered off or
|
|
|
+ suspended, or is ready to be powered off or suspended.
|
|
|
+
|
|
|
+COMING_UP: The CPU or cluster has committed to moving to the UP state.
|
|
|
+ It may be part way through the process of initialisation and
|
|
|
+ enabling coherency.
|
|
|
+
|
|
|
+UP: The CPU or cluster is active and coherent at the hardware
|
|
|
+ level. A CPU in this state is not necessarily being used
|
|
|
+ actively by the kernel.
|
|
|
+
|
|
|
+GOING_DOWN: The CPU or cluster has committed to moving to the DOWN
|
|
|
+ state. It may be part way through the process of teardown and
|
|
|
+ coherency exit.
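
As a minimal sketch of the basic model (not part of the kernel implementation), the four states can be written as an enumeration in cycle order, so that the only legal successor of a state is the next one around the loop. The names basic_state, basic_next_state() and basic_transition_ok() are illustrative assumptions.

```c
#include <assert.h>

/* The four states of the basic model, listed in cycle order. */
enum basic_state { DOWN, COMING_UP, UP, GOING_DOWN, NR_BASIC_STATES };

/* Return the only state that may follow 's' in the basic model. */
static enum basic_state basic_next_state(enum basic_state s)
{
    assert((unsigned int)s < NR_BASIC_STATES);
    return (enum basic_state)((s + 1) % NR_BASIC_STATES);
}

/* A transition is legal only if it follows an arrow in the diagram. */
static int basic_transition_ok(enum basic_state from, enum basic_state to)
{
    return basic_next_state(from) == to;
}
```

With these definitions, basic_transition_ok(UP, GOING_DOWN) holds, while a direct jump such as UP to DOWN is rejected because it skips GOING_DOWN.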
|
|
|
+
|
|
|
+
|
|
|
+Each CPU has one of these states assigned to it at any point in time.
|
|
|
+The CPU states are described in the "CPU state" section, below.
|
|
|
+
|
|
|
+Each cluster is also assigned a state, but it is necessary to split the
|
|
|
+state value into two parts (the "cluster" state and "inbound" state) and
|
|
|
+to introduce additional states in order to avoid races between different
|
|
|
+CPUs in the cluster simultaneously modifying the state. The cluster-
|
|
|
+level states are described in the "Cluster state" section.
|
|
|
+
|
|
|
+To help distinguish the CPU states from cluster states in this
|
|
|
+discussion, the state names are given a CPU_ prefix for the CPU states,
|
|
|
+and a CLUSTER_ or INBOUND_ prefix for the cluster states.
|
|
|
+
|
|
|
+
|
|
|
+CPU state
|
|
|
+---------
|
|
|
+
|
|
|
+In this algorithm, each individual core in a multi-core processor is
|
|
|
+referred to as a "CPU". CPUs are assumed to be single-threaded:
|
|
|
+therefore, a CPU can only be doing one thing at a single point in time.
|
|
|
+
|
|
|
+This means that CPUs fit the basic model closely.
|
|
|
+
|
|
|
+The algorithm defines the following states for each CPU in the system:
|
|
|
+
|
|
|
+ CPU_DOWN
|
|
|
+ CPU_COMING_UP
|
|
|
+ CPU_UP
|
|
|
+ CPU_GOING_DOWN
|
|
|
+
|
|
|
+     cluster setup and
+    CPU setup complete          policy decision
+          +-----------> CPU_UP ------------+
+          |                                v
+
+    CPU_COMING_UP                   CPU_GOING_DOWN
+
+          ^                                |
+          +----------- CPU_DOWN <----------+
+     policy decision           CPU teardown complete
+    or hardware event
|
|
|
+
|
|
|
+
|
|
|
+The definitions of the four states correspond closely to the states of
|
|
|
+the basic model.
|
|
|
+
|
|
|
+Transitions between states occur as follows.
|
|
|
+
|
|
|
+A trigger event (spontaneous) means that the CPU can transition to the
|
|
|
+next state as a result of making local progress only, with no
|
|
|
+requirement for any external event to happen.
|
|
|
+
|
|
|
+
|
|
|
+CPU_DOWN:
|
|
|
+
|
|
|
+ A CPU reaches the CPU_DOWN state when it is ready for
|
|
|
+ power-down. On reaching this state, the CPU will typically
|
|
|
+ power itself down or suspend itself, via a WFI instruction or a
|
|
|
+ firmware call.
|
|
|
+
|
|
|
+ Next state: CPU_COMING_UP
|
|
|
+ Conditions: none
|
|
|
+
|
|
|
+ Trigger events:
|
|
|
+
|
|
|
+ a) an explicit hardware power-up operation, resulting
|
|
|
+ from a policy decision on another CPU;
|
|
|
+
|
|
|
+ b) a hardware event, such as an interrupt.
|
|
|
+
|
|
|
+
|
|
|
+CPU_COMING_UP:
|
|
|
+
|
|
|
+ A CPU cannot start participating in hardware coherency until the
|
|
|
+ cluster is set up and coherent. If the cluster is not ready,
|
|
|
+ then the CPU will wait in the CPU_COMING_UP state until the
|
|
|
+ cluster has been set up.
|
|
|
+
|
|
|
+ Next state: CPU_UP
|
|
|
+ Conditions: The CPU's parent cluster must be in CLUSTER_UP.
|
|
|
+ Trigger events: Transition of the parent cluster to CLUSTER_UP.
|
|
|
+
|
|
|
+ Refer to the "Cluster state" section for a description of the
|
|
|
+ CLUSTER_UP state.
|
|
|
+
|
|
|
+
|
|
|
+CPU_UP:
|
|
|
+ When a CPU reaches the CPU_UP state, it is safe for the CPU to
|
|
|
+ start participating in local coherency.
|
|
|
+
|
|
|
+ This is done by jumping to the kernel's CPU resume code.
|
|
|
+
|
|
|
+ Note that the definition of this state is slightly different
|
|
|
+ from the basic model definition: CPU_UP does not mean that the
|
|
|
+ CPU is coherent yet, but it does mean that it is safe to resume
|
|
|
+ the kernel. The kernel handles the rest of the resume
|
|
|
+ procedure, so the remaining steps are not visible as part of the
|
|
|
+ race avoidance algorithm.
|
|
|
+
|
|
|
+ The CPU remains in this state until an explicit policy decision
|
|
|
+ is made to shut down or suspend the CPU.
|
|
|
+
|
|
|
+ Next state: CPU_GOING_DOWN
|
|
|
+ Conditions: none
|
|
|
+ Trigger events: explicit policy decision
|
|
|
+
|
|
|
+
|
|
|
+CPU_GOING_DOWN:
|
|
|
+
|
|
|
+ While in this state, the CPU exits coherency, including any
|
|
|
+ operations required to achieve this (such as cleaning data
|
|
|
+ caches).
|
|
|
+
|
|
|
+ Next state: CPU_DOWN
|
|
|
+ Conditions: local CPU teardown complete
|
|
|
+ Trigger events: (spontaneous)
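
The per-CPU rules above can be condensed into a small step function. This is a sketch for illustration only: struct cpu_inputs and cpu_step() are hypothetical names, and the boolean inputs stand in for the conditions and trigger events listed for each state.

```c
#include <assert.h>
#include <stdbool.h>

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };

/* Inputs that may allow a CPU to advance (all names are illustrative). */
struct cpu_inputs {
    bool wakeup;            /* power-up operation or hardware event */
    bool cluster_up;        /* parent cluster has reached CLUSTER_UP */
    bool power_down_policy; /* explicit decision to shut down or suspend */
    bool teardown_done;     /* local CPU teardown has completed */
};

/* Return the state the CPU is in after considering the current inputs. */
static enum cpu_state cpu_step(enum cpu_state s, const struct cpu_inputs *in)
{
    assert(in);

    switch (s) {
    case CPU_DOWN:
        return in->wakeup ? CPU_COMING_UP : CPU_DOWN;
    case CPU_COMING_UP:
        /* Wait here until the parent cluster is set up. */
        return in->cluster_up ? CPU_UP : CPU_COMING_UP;
    case CPU_UP:
        return in->power_down_policy ? CPU_GOING_DOWN : CPU_UP;
    case CPU_GOING_DOWN:
        /* Spontaneous: depends only on local progress. */
        return in->teardown_done ? CPU_DOWN : CPU_GOING_DOWN;
    }
    return s; /* not reached for valid states */
}
```

Note that the only transition gated on another entity is CPU_COMING_UP to CPU_UP, which depends on the parent cluster.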
|
|
|
+
|
|
|
+
|
|
|
+Cluster state
|
|
|
+-------------
|
|
|
+
|
|
|
+A cluster is a group of connected CPUs with some common resources.
|
|
|
+Because a cluster contains multiple CPUs, it can be doing multiple
|
|
|
+things at the same time. This has some implications. In particular, a
|
|
|
+CPU can start up while another CPU is tearing the cluster down.
|
|
|
+
|
|
|
+In this discussion, the "outbound side" is the view of the cluster state
|
|
|
+as seen by a CPU tearing the cluster down. The "inbound side" is the
|
|
|
+view of the cluster state as seen by a CPU setting the cluster up.
|
|
|
+
|
|
|
+In order to enable safe coordination in such situations, it is important
|
|
|
+that a CPU which is setting up the cluster can advertise its state
|
|
|
+independently of the CPU which is tearing down the cluster. For this
|
|
|
+reason, the cluster state is split into two parts:
|
|
|
+
|
|
|
+ "cluster" state: The global state of the cluster; or the state
|
|
|
+ on the outbound side:
|
|
|
+
|
|
|
+ CLUSTER_DOWN
|
|
|
+ CLUSTER_UP
|
|
|
+ CLUSTER_GOING_DOWN
|
|
|
+
|
|
|
+ "inbound" state: The state of the cluster on the inbound side.
|
|
|
+
|
|
|
+ INBOUND_NOT_COMING_UP
|
|
|
+ INBOUND_COMING_UP
|
|
|
+
|
|
|
+
|
|
|
+ The different pairings of these states result in six possible
|
|
|
+ states for the cluster as a whole:
|
|
|
+
|
|
|
+                                CLUSTER_UP
+              +==========> INBOUND_NOT_COMING_UP -------------+
+              #                                               |
+                                                              |
+         CLUSTER_UP     <----+                                |
+      INBOUND_COMING_UP      |                                v
+
+              ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
+              #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP
+
+        CLUSTER_DOWN         |                                |
+      INBOUND_COMING_UP <----+                                |
+                                                              |
+              ^                                               |
+              +===========     CLUSTER_DOWN      <------------+
+                           INBOUND_NOT_COMING_UP
|
|
|
+
|
|
|
+ Transitions -----> can only be made by the outbound CPU, and
|
|
|
+ only involve changes to the "cluster" state.
|
|
|
+
|
|
|
+ Transitions ===##> can only be made by the inbound CPU, and only
|
|
|
+ involve changes to the "inbound" state, except where there is no
|
|
|
+ further transition possible on the outbound side (i.e., the
|
|
|
+ outbound CPU has put the cluster into the CLUSTER_DOWN state).
|
|
|
+
|
|
|
+ The race avoidance algorithm does not provide a way to determine
|
|
|
+ which exact CPUs within the cluster play these roles. This must
|
|
|
+ be decided in advance by some other means. Refer to the section
|
|
|
+ "Last man and first man selection" for more explanation.
|
|
|
+
|
|
|
+
|
|
|
+ CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
|
|
|
+ cluster can actually be powered down.
|
|
|
+
|
|
|
+ The parallelism of the inbound and outbound CPUs is observed by
|
|
|
+ the existence of two different paths from CLUSTER_GOING_DOWN/
|
|
|
+ INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
|
|
|
+ model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
|
|
|
+ COMING_UP in the basic model). The second path avoids cluster
|
|
|
+ teardown completely.
|
|
|
+
|
|
|
+ CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
|
|
|
+ model. The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
|
|
|
+ is trivial and merely resets the state machine ready for the
|
|
|
+ next cycle.
|
|
|
+
|
|
|
+ Details of the allowable transitions follow.
|
|
|
+
|
|
|
+ The next state in each case is notated
|
|
|
+
|
|
|
+ <cluster state>/<inbound state> (<transitioner>)
|
|
|
+
|
|
|
+ where the <transitioner> is the side on which the transition
|
|
|
+ can occur; either the inbound or the outbound side.
|
|
|
+
|
|
|
+
|
|
|
+CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
|
|
|
+
|
|
|
+ Next state: CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
|
|
|
+ Conditions: none
|
|
|
+ Trigger events:
|
|
|
+
|
|
|
+ a) an explicit hardware power-up operation, resulting
|
|
|
+ from a policy decision on another CPU;
|
|
|
+
|
|
|
+ b) a hardware event, such as an interrupt.
|
|
|
+
|
|
|
+
|
|
|
+CLUSTER_DOWN/INBOUND_COMING_UP:
|
|
|
+
|
|
|
+ In this state, an inbound CPU sets up the cluster, including
|
|
|
+ enabling of hardware coherency at the cluster level and any
|
|
|
+ other operations (such as cache invalidation) which are required
|
|
|
+ in order to achieve this.
|
|
|
+
|
|
|
+ The purpose of this state is to do sufficient cluster-level
|
|
|
+ setup to enable other CPUs in the cluster to enter coherency
|
|
|
+ safely.
|
|
|
+
|
|
|
+ Next state: CLUSTER_UP/INBOUND_COMING_UP (inbound)
|
|
|
+ Conditions: cluster-level setup and hardware coherency complete
|
|
|
+ Trigger events: (spontaneous)
|
|
|
+
|
|
|
+
|
|
|
+CLUSTER_UP/INBOUND_COMING_UP:
|
|
|
+
|
|
|
+ Cluster-level setup is complete and hardware coherency is
|
|
|
+ enabled for the cluster. Other CPUs in the cluster can safely
|
|
|
+ enter coherency.
|
|
|
+
|
|
|
+ This is a transient state, leading immediately to
|
|
|
+ CLUSTER_UP/INBOUND_NOT_COMING_UP. All other CPUs on the cluster
|
|
|
+ should treat these two states as equivalent.
|
|
|
+
|
|
|
+ Next state: CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
|
|
|
+ Conditions: none
|
|
|
+ Trigger events: (spontaneous)
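
The three inbound-side transitions described so far, starting from a fully powered-down cluster, could be sketched as follows. This is an illustration under stated assumptions, not the code in mcpm_head.S: struct cluster, inbound_bring_up() and the setup hook are hypothetical, and the sketch ignores the case where an outbound CPU is still tearing the cluster down.

```c
#include <assert.h>
#include <stddef.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

struct cluster {
    enum cluster_state cluster;
    enum inbound_state inbound;
};

/*
 * Walk a powered-down cluster through the inbound-side transitions.
 * 'setup' is a hypothetical hook standing in for cluster-level setup.
 */
static void inbound_bring_up(struct cluster *c, void (*setup)(void))
{
    assert(c->cluster == CLUSTER_DOWN);
    assert(c->inbound == INBOUND_NOT_COMING_UP);

    c->inbound = INBOUND_COMING_UP;     /* wake-up event observed */
    if (setup != NULL)
        setup();                        /* enable cluster-level coherency */
    c->cluster = CLUSTER_UP;            /* other CPUs may now proceed */
    c->inbound = INBOUND_NOT_COMING_UP; /* reset for the next cycle */
}
```

The inbound CPU is allowed to write the "cluster" state here only because the outbound side has nothing further to do once the cluster is in CLUSTER_DOWN.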
|
|
|
+
|
|
|
+
|
|
|
+CLUSTER_UP/INBOUND_NOT_COMING_UP:
|
|
|
+
|
|
|
+ Cluster-level setup is complete and hardware coherency is
|
|
|
+ enabled for the cluster. Other CPUs in the cluster can safely
|
|
|
+ enter coherency.
|
|
|
+
|
|
|
+ The cluster will remain in this state until a policy decision is
|
|
|
+ made to power the cluster down.
|
|
|
+
|
|
|
+ Next state: CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
|
|
|
+ Conditions: none
|
|
|
+ Trigger events: policy decision to power down the cluster
|
|
|
+
|
|
|
+
|
|
|
+CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:
|
|
|
+
|
|
|
+ An outbound CPU is tearing the cluster down. The selected CPU
|
|
|
+ must wait in this state until all CPUs in the cluster are in the
|
|
|
+ CPU_DOWN state.
|
|
|
+
|
|
|
+ When all CPUs are in the CPU_DOWN state, the cluster can be torn
|
|
|
+ down, for example by cleaning data caches and exiting
|
|
|
+ cluster-level coherency.
|
|
|
+
|
|
|
+ To avoid unnecessary teardown operations, the outbound CPU
|
|
|
+ should check the inbound cluster state for asynchronous
|
|
|
+ transitions to INBOUND_COMING_UP. Alternatively, individual
|
|
|
+ CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.
|
|
|
+
|
|
|
+
|
|
|
+ Next states:
|
|
|
+
|
|
|
+ CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
|
|
|
+ Conditions: cluster torn down and ready to power off
|
|
|
+ Trigger events: (spontaneous)
|
|
|
+
|
|
|
+ CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
|
|
|
+ Conditions: none
|
|
|
+ Trigger events:
|
|
|
+
|
|
|
+ a) an explicit hardware power-up operation,
|
|
|
+ resulting from a policy decision on another
|
|
|
+ CPU;
|
|
|
+
|
|
|
+ b) a hardware event, such as an interrupt.
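
A sketch of the check an outbound CPU might make while waiting in CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP is shown below. It combines both checks mentioned above (the inbound state and the individual CPU states). The names teardown_verdict and outbound_poll() are illustrative, and skipping the outbound CPU's own entry is an assumption of this sketch, made because that CPU is itself still going down while it polls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

enum teardown_verdict {
    TEARDOWN_WAIT,    /* some other CPU has not reached CPU_DOWN yet */
    TEARDOWN_PROCEED, /* every other CPU is in CPU_DOWN */
    TEARDOWN_ABORT,   /* a CPU is coming up; teardown would be wasted */
};

/* One polling pass made by the outbound CPU 'this_cpu'. */
static enum teardown_verdict
outbound_poll(const enum cpu_state *cpus, size_t ncpus, size_t this_cpu,
              enum inbound_state inbound)
{
    bool still_waiting = false;

    assert(this_cpu < ncpus);

    if (inbound == INBOUND_COMING_UP)
        return TEARDOWN_ABORT;

    for (size_t i = 0; i < ncpus; i++) {
        if (i == this_cpu)
            continue; /* assumption: our own entry is not checked */
        if (cpus[i] == CPU_COMING_UP || cpus[i] == CPU_UP)
            return TEARDOWN_ABORT;
        if (cpus[i] != CPU_DOWN)
            still_waiting = true;
    }
    return still_waiting ? TEARDOWN_WAIT : TEARDOWN_PROCEED;
}
```

A caller would repeat the poll while it returns TEARDOWN_WAIT, and start cluster teardown only on TEARDOWN_PROCEED.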
|
|
|
+
|
|
|
+
|
|
|
+CLUSTER_GOING_DOWN/INBOUND_COMING_UP:
|
|
|
+
|
|
|
+ The cluster is (or was) being torn down, but another CPU has
|
|
|
+ come online in the meantime and is trying to set up the cluster
|
|
|
+ again.
|
|
|
+
|
|
|
+ If the outbound CPU observes this state, it has two choices:
|
|
|
+
|
|
|
+ a) back out of teardown, restoring the cluster to the
|
|
|
+ CLUSTER_UP state;
|
|
|
+
|
|
|
+ b) finish tearing the cluster down and put the cluster
|
|
|
+ in the CLUSTER_DOWN state; the inbound CPU will
|
|
|
+ set up the cluster again from there.
|
|
|
+
|
|
|
+ Choice (a) permits the removal of some latency by avoiding
|
|
|
+ unnecessary teardown and setup operations in situations where
|
|
|
+ the cluster is not really going to be powered down.
|
|
|
+
|
|
|
+
|
|
|
+ Next states:
|
|
|
+
|
|
|
+ CLUSTER_UP/INBOUND_COMING_UP (outbound)
|
|
|
+ Conditions: cluster-level setup and hardware
|
|
|
+ coherency complete
|
|
|
+ Trigger events: (spontaneous)
|
|
|
+
|
|
|
+ CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
|
|
|
+ Conditions: cluster torn down and ready to power off
|
|
|
+ Trigger events: (spontaneous)
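
For reference, the per-state rules above can be collected into a single transition table. The sketch below is a reading aid rather than kernel code: struct pair, struct transition, transition_allowed() and can_power_off() are hypothetical names, and the table simply transcribes the eight "Next state" entries listed for the six cluster states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_state { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };
enum side { INBOUND, OUTBOUND };

struct pair { enum cluster_state cluster; enum inbound_state inbound; };
struct transition { struct pair from, to; enum side who; };

/* Every transition listed above, with the side allowed to make it. */
static const struct transition allowed[] = {
    { { CLUSTER_DOWN, INBOUND_NOT_COMING_UP },
      { CLUSTER_DOWN, INBOUND_COMING_UP }, INBOUND },
    { { CLUSTER_DOWN, INBOUND_COMING_UP },
      { CLUSTER_UP, INBOUND_COMING_UP }, INBOUND },
    { { CLUSTER_UP, INBOUND_COMING_UP },
      { CLUSTER_UP, INBOUND_NOT_COMING_UP }, INBOUND },
    { { CLUSTER_UP, INBOUND_NOT_COMING_UP },
      { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP }, OUTBOUND },
    { { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP },
      { CLUSTER_DOWN, INBOUND_NOT_COMING_UP }, OUTBOUND },
    { { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP },
      { CLUSTER_GOING_DOWN, INBOUND_COMING_UP }, INBOUND },
    { { CLUSTER_GOING_DOWN, INBOUND_COMING_UP },
      { CLUSTER_UP, INBOUND_COMING_UP }, OUTBOUND },
    { { CLUSTER_GOING_DOWN, INBOUND_COMING_UP },
      { CLUSTER_DOWN, INBOUND_COMING_UP }, OUTBOUND },
};

static_assert(sizeof(allowed) / sizeof(allowed[0]) == 8,
              "one entry per documented transition");

static bool pair_eq(struct pair a, struct pair b)
{
    return a.cluster == b.cluster && a.inbound == b.inbound;
}

/* True if side 'who' may move the cluster from 'from' to 'to'. */
static bool transition_allowed(struct pair from, struct pair to, enum side who)
{
    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
        if (pair_eq(allowed[i].from, from) && pair_eq(allowed[i].to, to) &&
            allowed[i].who == who)
            return true;
    return false;
}

/* Only CLUSTER_DOWN/INBOUND_NOT_COMING_UP permits actual power-down. */
static bool can_power_off(struct pair p)
{
    return p.cluster == CLUSTER_DOWN && p.inbound == INBOUND_NOT_COMING_UP;
}
```

Reading the table by the "who" column shows the split of responsibilities: the inbound side makes four transitions, of which only the one out of CLUSTER_DOWN/INBOUND_COMING_UP changes the "cluster" half of the pair.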
|
|
|
+
|
|
|
+
|
|
|
+Last man and first man selection
|
|
|
+--------------------------------
|
|
|
+
|
|
|
+The CPU which performs cluster tear-down operations on the outbound side
|
|
|
+is commonly referred to as the "last man".
|
|
|
+
|
|
|
+The CPU which performs cluster setup on the inbound side is commonly
|
|
|
+referred to as the "first man".
|
|
|
+
|
|
|
+The race avoidance algorithm documented above does not provide a
|
|
|
+mechanism to choose which CPUs should play these roles.
|
|
|
+
|
|
|
+
|
|
|
+Last man:
|
|
|
+
|
|
|
+When shutting down the cluster, all the CPUs involved are initially
|
|
|
+executing Linux and hence coherent. Therefore, ordinary spinlocks can
|
|
|
+be used to select a last man safely, before the CPUs become
|
|
|
+non-coherent.
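
One simple way to pick a last man, sketched here as an assumption rather than as the kernel's method, is to keep a count of CPUs still in use in the cluster. The CPU that decrements the count to zero takes the last man role. struct cluster_usage and cpu_leaves_is_last_man() are hypothetical names, and the caller is assumed to hold an ordinary lock around the call, which is safe because all CPUs are still coherent at this point.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of CPUs in the cluster still counted as in use (illustrative). */
struct cluster_usage {
    unsigned int cpus_in_use;
};

/*
 * Called by a CPU that has decided to power down.  The caller is
 * assumed to hold a lock protecting 'u'.  Returns true if the caller
 * is the last CPU to leave and should therefore act as last man.
 */
static bool cpu_leaves_is_last_man(struct cluster_usage *u)
{
    assert(u->cpus_in_use > 0);
    return --u->cpus_in_use == 0;
}
```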
|
|
|
+
|
|
|
+
|
|
|
+First man:
|
|
|
+
|
|
|
+Because CPUs may power up asynchronously in response to external wake-up
|
|
|
+events, a dynamic mechanism is needed to make sure that only one CPU
|
|
|
+attempts to play the first man role and do the cluster-level
|
|
|
+initialisation: any other CPUs must wait for this to complete before
|
|
|
+proceeding.
|
|
|
+
|
|
|
+Cluster-level initialisation may involve actions such as configuring
|
|
|
+coherency controls in the bus fabric.
|
|
|
+
|
|
|
+The current implementation in mcpm_head.S uses a separate mutual exclusion
|
|
|
+mechanism to do this arbitration. This mechanism is documented in
|
|
|
+detail in vlocks.txt.
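
To give a flavour of this kind of arbitration, the following user-space sketch shows a voting-style try-lock: each contender announces that it is voting, writes its own number as the vote if nobody has volunteered yet, waits for concurrent voters to finish, and wins only if its vote was the last one written. This is a simplified illustration, and vlocks.txt remains the authoritative description. The names first_man_trylock(), first_man_unlock(), NR_CPUS and NO_VOTE are assumptions of the sketch, and C11 sequentially consistent atomics stand in for whatever memory accesses the low-level code really uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4
#define NO_VOTE (-1)

static atomic_int currently_voting[NR_CPUS];
static atomic_int last_vote = NO_VOTE;

/* Try to become first man.  Returns true for exactly one contender. */
static bool first_man_trylock(int this_cpu)
{
    assert(this_cpu >= 0 && this_cpu < NR_CPUS);

    /* Announce that this CPU is about to vote. */
    atomic_store(&currently_voting[this_cpu], 1);

    if (atomic_load(&last_vote) != NO_VOTE) {
        /* Someone already volunteered: withdraw and lose. */
        atomic_store(&currently_voting[this_cpu], 0);
        return false;
    }

    /* Vote for ourselves; a concurrent voter may overwrite this. */
    atomic_store(&last_vote, this_cpu);
    atomic_store(&currently_voting[this_cpu], 0);

    /* Wait until every CPU that started voting has finished. */
    for (int i = 0; i < NR_CPUS; i++)
        while (atomic_load(&currently_voting[i]) != 0)
            ; /* spin */

    /* The last vote written wins. */
    return atomic_load(&last_vote) == this_cpu;
}

/* Release the role so that a later power-up cycle can vote again. */
static void first_man_unlock(void)
{
    atomic_store(&last_vote, NO_VOTE);
}
```

CPUs that lose the vote do not perform cluster-level initialisation; they wait for the winner to finish before proceeding.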
|
|
|
+
|
|
|
+
|
|
|
+Features and Limitations
|
|
|
+------------------------
|
|
|
+
|
|
|
+Implementation:
|
|
|
+
|
|
|
+ The current ARM-based implementation is split between
|
|
|
+ arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and
|
|
|
+ arch/arm/common/mcpm_entry.c (everything else):
|
|
|
+
|
|
|
+ __mcpm_cpu_going_down() signals the transition of a CPU to the
|
|
|
+ CPU_GOING_DOWN state.
|
|
|
+
|
|
|
+ __mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN
|
|
|
+ state.
|
|
|
+
|
|
|
+ A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
|
|
|
+ low-level power-up code in mcpm_head.S. This could
|
|
|
+ involve CPU-specific setup code, but in the current
|
|
|
+ implementation it does not.
|
|
|
+
|
|
|
+ __mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical()
|
|
|
+ handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
|
|
|
+ and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
|
|
|
+ the case of an aborted cluster power-down).
|
|
|
+
|
|
|
+ These functions are more complex than the __mcpm_cpu_*()
|
|
|
+ functions due to the extra inter-CPU coordination which
|
|
|
+ is needed for safe transitions at the cluster level.
|
|
|
+
|
|
|
+ A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
|
|
|
+ the low-level power-up code in mcpm_head.S. This
|
|
|
+ typically involves platform-specific setup code,
|
|
|
+ provided by the platform-specific power_up_setup
|
|
|
+ function registered via mcpm_sync_init.
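
The way these pieces might fit together on the power-down path is sketched below as a self-contained toy model. It does not reproduce the real functions or their signatures: the model_*() helpers are hypothetical stand-ins that only record state, and the ordering shown is an assumption made to illustrate the transitions described in this document.

```c
#include <assert.h>
#include <stdbool.h>

enum cluster_state { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };

/* Toy model of one cluster as seen by a CPU that is powering down. */
struct model {
    enum cluster_state cluster;
    bool inbound_coming_up;        /* an inbound CPU has appeared */
    bool cpu_going_down_signalled;
    bool cpu_down_signalled;
};

static void model_cpu_going_down(struct model *m)
{
    m->cpu_going_down_signalled = true;
}

static void model_cpu_down(struct model *m)
{
    assert(m->cpu_going_down_signalled);
    m->cpu_down_signalled = true;
}

/* Returns true if the caller may go ahead and tear the cluster down. */
static bool model_outbound_enter_critical(struct model *m)
{
    m->cluster = CLUSTER_GOING_DOWN;
    if (m->inbound_coming_up) {
        m->cluster = CLUSTER_UP; /* aborted cluster power-down */
        return false;
    }
    return true;
}

static void model_outbound_leave_critical(struct model *m,
                                          enum cluster_state s)
{
    assert(s == CLUSTER_DOWN || s == CLUSTER_UP);
    m->cluster = s;
}

/* Power-down path for one CPU; 'last_man' is decided elsewhere. */
static void model_power_down(struct model *m, bool last_man)
{
    model_cpu_going_down(m);
    if (last_man && model_outbound_enter_critical(m)) {
        /* Cluster-level teardown would happen here. */
        model_outbound_leave_critical(m, CLUSTER_DOWN);
    }
    /* CPU-level teardown would happen here. */
    model_cpu_down(m);
}
```

In this model a CPU that is not the last man never touches the cluster state, and a last man that finds an inbound CPU already coming up leaves the cluster in CLUSTER_UP.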
|
|
|
+
|
|
|
+Deep topologies:
|
|
|
+
|
|
|
+ As currently described and implemented, the algorithm does not
|
|
|
+ support CPU topologies involving more than two levels (i.e.,
|
|
|
+ clusters of clusters are not supported). The algorithm could be
|
|
|
+ extended by replicating the cluster-level states for the
|
|
|
+ additional topological levels, and modifying the transition
|
|
|
+ rules for the intermediate (non-outermost) cluster levels.
|
|
|
+
|
|
|
+
|
|
|
+Colophon
|
|
|
+--------
|
|
|
+
|
|
|
+Originally created and documented by Dave Martin for Linaro Limited, in
|
|
|
+collaboration with Nicolas Pitre and Achin Gupta.
|
|
|
+
|
|
|
+Copyright (C) 2012-2013 Linaro Limited
|
|
|
+Distributed under the terms of Version 2 of the GNU General Public
|
|
|
+License, as defined in linux/COPYING.
|