@@ -0,0 +1,211 @@
+vlocks for Bare-Metal Mutual Exclusion
+======================================
+
+Voting Locks, or "vlocks", provide a simple low-level mutual exclusion
+mechanism, with reasonable but minimal requirements on the memory
+system.
+
+These are intended to be used to coordinate critical activity among CPUs
+which are otherwise non-coherent, in situations where the hardware
+provides no other mechanism to support this and ordinary spinlocks
+cannot be used.
+
+
+vlocks make use of the atomicity provided by the memory system for
+writes to a single memory location. To arbitrate, every CPU "votes for
+itself", by storing a unique number to a common memory location. The
+final value seen in that memory location when all the votes have been
+cast identifies the winner.
+
+In order to make sure that the election produces an unambiguous result
+in finite time, a CPU will only enter the election in the first place if
+no winner has been chosen and the election does not appear to have
+started yet.
+
+
+Algorithm
+---------
+
+The easiest way to explain the vlocks algorithm is with some pseudo-code:
+
+
+	int currently_voting[NR_CPUS] = { 0, };
+	int last_vote = -1; /* no votes yet */
+
+	bool vlock_trylock(int this_cpu)
+	{
+		/* signal our desire to vote */
+		currently_voting[this_cpu] = 1;
+		if (last_vote != -1) {
+			/* another CPU already volunteered itself */
+			currently_voting[this_cpu] = 0;
+			return false; /* not ourself */
+		}
+
+		/* let's suggest ourself */
+		last_vote = this_cpu;
+		currently_voting[this_cpu] = 0;
+
+		/* then wait until everyone else is done voting */
+		for_each_cpu(i) {
+			while (currently_voting[i] != 0)
+				/* wait */;
+		}
+
+		/* result */
+		if (last_vote == this_cpu)
+			return true; /* we won */
+		return false;
+	}
+
+	void vlock_unlock(void)
+	{
+		last_vote = -1;
+	}
+
+
+The currently_voting[] array provides a way for the CPUs to determine
+whether an election is in progress, and plays a role analogous to the
+"entering" array in Lamport's bakery algorithm [1].
+
+However, once the election has started, the underlying memory system
+atomicity is used to pick the winner. This avoids the need for a static
+priority rule to act as a tie-breaker, or any counters which could
+overflow.
+
+As long as the last_vote variable is globally visible to all CPUs, it
+will contain a single value, which no longer changes once every CPU has
+cleared its currently_voting flag.
+
+
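The pseudo-code above can be sketched as compilable C. This is only an
illustration under stated assumptions: it targets a hosted C11 environment
and uses sequentially consistent <stdatomic.h> operations in place of
whatever barriers a real platform would need, it does not model uncached or
non-coherent memory, and NR_CPUS, NO_VOTE, struct vlock and vlock_init() are
names invented for this sketch rather than part of any existing
implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS  4	/* hypothetical CPU count for this sketch */
#define NO_VOTE  (-1)	/* "no votes yet", as in the pseudo-code */

struct vlock {
	atomic_int currently_voting[NR_CPUS];
	atomic_int last_vote;
};

/* Needed before first use, because NO_VOTE is not the all-zeroes pattern. */
void vlock_init(struct vlock *v)
{
	for (int i = 0; i < NR_CPUS; i++)
		atomic_init(&v->currently_voting[i], 0);
	atomic_init(&v->last_vote, NO_VOTE);
}

bool vlock_trylock(struct vlock *v, int this_cpu)
{
	assert(this_cpu >= 0 && this_cpu < NR_CPUS);

	/* signal our desire to vote */
	atomic_store(&v->currently_voting[this_cpu], 1);
	if (atomic_load(&v->last_vote) != NO_VOTE) {
		/* another CPU already volunteered itself */
		atomic_store(&v->currently_voting[this_cpu], 0);
		return false;
	}

	/* let's suggest ourself */
	atomic_store(&v->last_vote, this_cpu);
	atomic_store(&v->currently_voting[this_cpu], 0);

	/* wait until everyone else is done voting */
	for (int i = 0; i < NR_CPUS; i++)
		while (atomic_load(&v->currently_voting[i]) != 0)
			; /* spin */

	/* the last vote to reach memory decides the election */
	return atomic_load(&v->last_vote) == this_cpu;
}

void vlock_unlock(struct vlock *v)
{
	atomic_store(&v->last_vote, NO_VOTE);
}
```

Called from a single thread, the first vlock_trylock() on a freshly
initialised lock succeeds, and any further attempt fails until
vlock_unlock() has reset last_vote.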
+Features and limitations
+------------------------
+
+ * vlocks are not intended to be fair. In the contended case, the
+   _last_ CPU to attempt to take the lock is the one most likely to
+   win.
+
+   vlocks are therefore best suited to situations where it is necessary
+   to pick a unique winner, but it does not matter which CPU actually
+   wins.
+
+ * Like other similar mechanisms, vlocks will not scale well to a large
+   number of CPUs, since every contender has to poll the voting flag of
+   every CPU.
+
+   vlocks can be cascaded in a voting hierarchy to permit better scaling
+   if necessary, as in the following hypothetical example for 4096 CPUs:
+
+	/* first level: local election among the 16 CPUs of a town */
+	my_town = towns[this_cpu >> 4];
+	I_won = vlock_trylock(my_town, this_cpu & 0xf);
+	if (I_won) {
+		/* we won the town election, let's go for the state */
+		my_state = states[this_cpu >> 8];
+		I_won = vlock_trylock(my_state, (this_cpu >> 4) & 0xf);
+		if (I_won) {
+			/* and so on */
+			I_won = vlock_trylock(the_whole_country,
+					      (this_cpu >> 8) & 0xf);
+			if (I_won) {
+				/* ... */
+				vlock_unlock(the_whole_country);
+			}
+			vlock_unlock(my_state);
+		}
+		vlock_unlock(my_town);
+	}
+
+
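In a cascaded election, the ID a CPU votes with has to be unique among the
possible contenders at that level, not just within its own town. One way to
derive the lock indices and voter IDs for a 16 x 16 x 16 hierarchy is
sketched below. The fan-out constants, struct vote_path and vote_path_for()
are hypothetical names chosen for this illustration, and the towns[] and
states[] arrays they index are assumed to hold 256 and 16 vlocks
respectively.

```c
#include <assert.h>

#define CPUS_PER_TOWN	16	/* hypothetical fan-out at every level */
#define TOWNS_PER_STATE	16
#define NR_STATES	16	/* 16 * 16 * 16 = 4096 CPUs */

struct vote_path {
	unsigned int town;	 /* index into towns[], 0..255 */
	unsigned int town_id;	 /* voter ID in the town election */
	unsigned int state;	 /* index into states[], 0..15 */
	unsigned int state_id;	 /* voter ID in the state election */
	unsigned int country_id; /* voter ID in the country election */
};

struct vote_path vote_path_for(unsigned int cpu)
{
	struct vote_path p;

	assert(cpu < CPUS_PER_TOWN * TOWNS_PER_STATE * NR_STATES);

	p.town       = cpu / CPUS_PER_TOWN;
	p.town_id    = cpu % CPUS_PER_TOWN;	/* unique within the town */
	p.state      = p.town / TOWNS_PER_STATE;
	p.state_id   = p.town % TOWNS_PER_STATE; /* one winner per town */
	p.country_id = p.state;			/* one winner per state */
	return p;
}
```

With this mapping, a town winner enters the state election under its town's
number and a state winner enters the country election under its state's
number, so two contenders at the same level can never store the same value.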
+ARM implementation
+------------------
+
+The current ARM implementation [2] contains some optimisations beyond
+the basic algorithm:
+
+ * By packing the members of the currently_voting array close together,
+   we can read the whole array in one transaction (provided the number
+   of CPUs potentially contending the lock is small enough). This
+   reduces the number of round-trips required to external memory.
+
+   In the ARM implementation, this means that we can use a single load
+   and comparison:
+
+	LDR	Rt, [Rn]
+	CMP	Rt, #0
+
+   ...in place of code equivalent to:
+
+	LDRB	Rt, [Rn]
+	CMP	Rt, #0
+	LDRBEQ	Rt, [Rn, #1]
+	CMPEQ	Rt, #0
+	LDRBEQ	Rt, [Rn, #2]
+	CMPEQ	Rt, #0
+	LDRBEQ	Rt, [Rn, #3]
+	CMPEQ	Rt, #0
+
+   This cuts down on the fast-path latency, as well as potentially
+   reducing bus contention in contended cases.
+
+   The optimisation relies on the fact that the ARM memory system
+   guarantees coherency between overlapping memory accesses of
+   different sizes, similarly to many other architectures. Note that
+   we do not care which element of currently_voting appears in which
+   bits of Rt, so there is no need to worry about endianness in this
+   optimisation.
+
+   If there are too many CPUs to read the currently_voting array in
+   one transaction then multiple transactions are still required. The
+   implementation uses a simple loop of word-sized loads for this
+   case. The number of transactions is still fewer than would be
+   required if bytes were loaded individually.
+
+
+   In principle, we could aggregate further by using LDRD or LDM, but
+   to keep the code simple this was not attempted in the initial
+   implementation.
+
+
+ * vlocks are currently only used to coordinate between CPUs which are
+   unable to enable their caches yet. This means that the
+   implementation removes many of the barriers which would be required
+   when executing the algorithm in cached memory.
+
+   Packing of the currently_voting array does not work with cached
+   memory unless all CPUs contending the lock are cache-coherent, due
+   to cache writebacks from one CPU clobbering values written by other
+   CPUs. (Though if all the CPUs are cache-coherent, you should
+   probably be using proper spinlocks instead anyway).
+
+
+ * The "no votes yet" value used for the last_vote variable is 0 (not
+   -1 as in the pseudocode). This allows statically-allocated vlocks
+   to be implicitly initialised to an unlocked state simply by putting
+   them in .bss.
+
+   An offset is added to each CPU's ID for the purpose of setting this
+   variable, so that no CPU uses the value 0 for its ID.
+
+
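Two of the points above, the packed currently_voting array and the
all-zeroes unlocked state, can be illustrated with a short C sketch. It is
not the layout or code of the assembly implementation in [2]: the struct,
the macro names, the helper functions and the offset value of 1 are all
assumptions made for this example. On a hosted compiler the memcpy() below
will typically become one word-sized load, but that is not guaranteed, so
bare-metal code would use an explicit word access instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VLOCK_MAX_CPUS	4	/* hypothetical: one word's worth of flags */
#define VLOCK_NO_OWNER	0	/* all-zeroes means "no votes yet" */
#define VLOCK_ID_OFFSET	1	/* hypothetical; any non-zero offset works */

struct vlock {
	uint32_t owner;			/* plays the role of last_vote */
	uint8_t  voting[VLOCK_MAX_CPUS];	/* packed currently_voting[] */
};

/* Static storage duration: zero-initialised, so it starts out unlocked. */
struct vlock boot_lock;

/* Map a CPU number to the non-zero value it would store in owner. */
uint32_t vlock_owner_id(unsigned int cpu)
{
	assert(cpu < VLOCK_MAX_CPUS);
	return cpu + VLOCK_ID_OFFSET;
}

/* Check every voting flag with one word-sized read, not four byte reads. */
int vlock_anyone_voting(const struct vlock *v)
{
	uint32_t word;

	memcpy(&word, v->voting, sizeof(word));
	/* Which flag lands in which bits does not matter: compare with 0. */
	return word != 0;
}
```

Because the comparison is only ever against zero, the result is the same on
little-endian and big-endian machines, which mirrors the endianness remark
in the first bullet.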
+Colophon
+--------
+
+Originally created and documented by Dave Martin for Linaro Limited, for
+use in ARM-based big.LITTLE platforms, with review and input gratefully
+received from Nicolas Pitre and Achin Gupta. Thanks to Nicolas for
+grabbing most of this text out of the relevant mail thread and writing
+up the pseudocode.
+
+Copyright (C) 2012-2013 Linaro Limited
+Distributed under the terms of Version 2 of the GNU General Public
+License, as defined in linux/COPYING.
+
+
+References
+----------
+
+[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
+    Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
+
+    http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
+
+[2] linux/arch/arm/common/vlock.S, www.kernel.org.