|
@@ -0,0 +1,286 @@
|
|
|
+ Static Keys
|
|
|
+ -----------
|
|
|
+
|
|
|
+By: Jason Baron <jbaron@redhat.com>
|
|
|
+
|
|
|
+0) Abstract
|
|
|
+
|
|
|
+Static keys allows the inclusion of seldom used features in
|
|
|
+performance-sensitive fast-path kernel code, via a GCC feature and a code
|
|
|
+patching technique. A quick example:
|
|
|
+
|
|
|
+ struct static_key key = STATIC_KEY_INIT_FALSE;
|
|
|
+
|
|
|
+ ...
|
|
|
+
|
|
|
+ if (static_key_false(&key))
|
|
|
+ do unlikely code
|
|
|
+ else
|
|
|
+ do likely code
|
|
|
+
|
|
|
+ ...
|
|
|
+ static_key_slow_inc();
|
|
|
+ ...
|
|
|
+ static_key_slow_inc();
|
|
|
+ ...
|
|
|
+
|
|
|
+The static_key_false() branch will be generated into the code with as little
|
|
|
+impact to the likely code path as possible.
|
|
|
+
|
|
|
+
|
|
|
+1) Motivation
|
|
|
+
|
|
|
+
|
|
|
+Currently, tracepoints are implemented using a conditional branch. The
|
|
|
+conditional check requires checking a global variable for each tracepoint.
|
|
|
+Although the overhead of this check is small, it increases when the memory
|
|
|
+cache comes under pressure (memory cache lines for these global variables may
|
|
|
+be shared with other memory accesses). As we increase the number of tracepoints
|
|
|
+in the kernel this overhead may become more of an issue. In addition,
|
|
|
+tracepoints are often dormant (disabled) and provide no direct kernel
|
|
|
+functionality. Thus, it is highly desirable to reduce their impact as much as
|
|
|
+possible. Although tracepoints are the original motivation for this work, other
|
|
|
+kernel code paths should be able to make use of the static keys facility.
|
|
|
+
|
|
|
+
|
|
|
+2) Solution
|
|
|
+
|
|
|
+
|
|
|
+gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label:
|
|
|
+
|
|
|
+http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html
|
|
|
+
|
|
|
+Using the 'asm goto', we can create branches that are either taken or not taken
|
|
|
+by default, without the need to check memory. Then, at run-time, we can patch
|
|
|
+the branch site to change the branch direction.
|
|
|
+
|
|
|
+For example, if we have a simple branch that is disabled by default:
|
|
|
+
|
|
|
+ if (static_key_false(&key))
|
|
|
+ printk("I am the true branch\n");
|
|
|
+
|
|
|
+Thus, by default the 'printk' will not be emitted. And the code generated will
|
|
|
+consist of a single atomic 'no-op' instruction (5 bytes on x86), in the
|
|
|
+straight-line code path. When the branch is 'flipped', we will patch the
|
|
|
+'no-op' in the straight-line codepath with a 'jump' instruction to the
|
|
|
+out-of-line true branch. Thus, changing branch direction is expensive but
|
|
|
+branch selection is basically 'free'. That is the basic tradeoff of this
|
|
|
+optimization.
|
|
|
+
|
|
|
+This lowlevel patching mechanism is called 'jump label patching', and it gives
|
|
|
+the basis for the static keys facility.
|
|
|
+
|
|
|
+3) Static key label API, usage and examples:
|
|
|
+
|
|
|
+
|
|
|
+In order to make use of this optimization you must first define a key:
|
|
|
+
|
|
|
+ struct static_key key;
|
|
|
+
|
|
|
+Which is initialized as:
|
|
|
+
|
|
|
+ struct static_key key = STATIC_KEY_INIT_TRUE;
|
|
|
+
|
|
|
+or:
|
|
|
+
|
|
|
+ struct static_key key = STATIC_KEY_INIT_FALSE;
|
|
|
+
|
|
|
+If the key is not initialized, it is default false. The 'struct static_key',
|
|
|
+must be a 'global'. That is, it can't be allocated on the stack or dynamically
|
|
|
+allocated at run-time.
|
|
|
+
|
|
|
+The key is then used in code as:
|
|
|
+
|
|
|
+ if (static_key_false(&key))
|
|
|
+ do unlikely code
|
|
|
+ else
|
|
|
+ do likely code
|
|
|
+
|
|
|
+Or:
|
|
|
+
|
|
|
+ if (static_key_true(&key))
|
|
|
+ do likely code
|
|
|
+ else
|
|
|
+ do unlikely code
|
|
|
+
|
|
|
+A key that is initialized via 'STATIC_KEY_INIT_FALSE', must be used in a
|
|
|
+'static_key_false()' construct. Likewise, a key initialized via
|
|
|
+'STATIC_KEY_INIT_TRUE' must be used in a 'static_key_true()' construct. A
|
|
|
+single key can be used in many branches, but all the branches must match the
|
|
|
+way that the key has been initialized.
|
|
|
+
|
|
|
+The branch(es) can then be switched via:
|
|
|
+
|
|
|
+ static_key_slow_inc(&key);
|
|
|
+ ...
|
|
|
+ static_key_slow_dec(&key);
|
|
|
+
|
|
|
+Thus, 'static_key_slow_inc()' means 'make the branch true', and
|
|
|
+'static_key_slow_dec()' means 'make the the branch false' with appropriate
|
|
|
+reference counting. For example, if the key is initialized true, a
|
|
|
+static_key_slow_dec(), will switch the branch to false. And a subsequent
|
|
|
+static_key_slow_inc(), will change the branch back to true. Likewise, if the
|
|
|
+key is initialized false, a 'static_key_slow_inc()', will change the branch to
|
|
|
+true. And then a 'static_key_slow_dec()', will again make the branch false.
|
|
|
+
|
|
|
+An example usage in the kernel is the implementation of tracepoints:
|
|
|
+
|
|
|
+ static inline void trace_##name(proto) \
|
|
|
+ { \
|
|
|
+ if (static_key_false(&__tracepoint_##name.key)) \
|
|
|
+ __DO_TRACE(&__tracepoint_##name, \
|
|
|
+ TP_PROTO(data_proto), \
|
|
|
+ TP_ARGS(data_args), \
|
|
|
+ TP_CONDITION(cond)); \
|
|
|
+ }
|
|
|
+
|
|
|
+Tracepoints are disabled by default, and can be placed in performance critical
|
|
|
+pieces of the kernel. Thus, by using a static key, the tracepoints can have
|
|
|
+absolutely minimal impact when not in use.
|
|
|
+
|
|
|
+
|
|
|
+4) Architecture level code patching interface, 'jump labels'
|
|
|
+
|
|
|
+
|
|
|
+There are a few functions and macros that architectures must implement in order
|
|
|
+to take advantage of this optimization. If there is no architecture support, we
|
|
|
+simply fall back to a traditional, load, test, and jump sequence.
|
|
|
+
|
|
|
+* select HAVE_ARCH_JUMP_LABEL, see: arch/x86/Kconfig
|
|
|
+
|
|
|
+* #define JUMP_LABEL_NOP_SIZE, see: arch/x86/include/asm/jump_label.h
|
|
|
+
|
|
|
+* __always_inline bool arch_static_branch(struct static_key *key), see:
|
|
|
+ arch/x86/include/asm/jump_label.h
|
|
|
+
|
|
|
+* void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type),
|
|
|
+ see: arch/x86/kernel/jump_label.c
|
|
|
+
|
|
|
+* __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type),
|
|
|
+ see: arch/x86/kernel/jump_label.c
|
|
|
+
|
|
|
+
|
|
|
+* struct jump_entry, see: arch/x86/include/asm/jump_label.h
|
|
|
+
|
|
|
+
|
|
|
+5) Static keys / jump label analysis, results (x86_64):
|
|
|
+
|
|
|
+
|
|
|
+As an example, let's add the following branch to 'getppid()', such that the
|
|
|
+system call now looks like:
|
|
|
+
|
|
|
+SYSCALL_DEFINE0(getppid)
|
|
|
+{
|
|
|
+ int pid;
|
|
|
+
|
|
|
++ if (static_key_false(&key))
|
|
|
++ printk("I am the true branch\n");
|
|
|
+
|
|
|
+ rcu_read_lock();
|
|
|
+ pid = task_tgid_vnr(rcu_dereference(current->real_parent));
|
|
|
+ rcu_read_unlock();
|
|
|
+
|
|
|
+ return pid;
|
|
|
+}
|
|
|
+
|
|
|
+The resulting instructions with jump labels generated by GCC is:
|
|
|
+
|
|
|
+ffffffff81044290 <sys_getppid>:
|
|
|
+ffffffff81044290: 55 push %rbp
|
|
|
+ffffffff81044291: 48 89 e5 mov %rsp,%rbp
|
|
|
+ffffffff81044294: e9 00 00 00 00 jmpq ffffffff81044299 <sys_getppid+0x9>
|
|
|
+ffffffff81044299: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax
|
|
|
+ffffffff810442a0: 00 00
|
|
|
+ffffffff810442a2: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax
|
|
|
+ffffffff810442a9: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax
|
|
|
+ffffffff810442b0: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi
|
|
|
+ffffffff810442b7: e8 f4 d9 00 00 callq ffffffff81051cb0 <pid_vnr>
|
|
|
+ffffffff810442bc: 5d pop %rbp
|
|
|
+ffffffff810442bd: 48 98 cltq
|
|
|
+ffffffff810442bf: c3 retq
|
|
|
+ffffffff810442c0: 48 c7 c7 e3 54 98 81 mov $0xffffffff819854e3,%rdi
|
|
|
+ffffffff810442c7: 31 c0 xor %eax,%eax
|
|
|
+ffffffff810442c9: e8 71 13 6d 00 callq ffffffff8171563f <printk>
|
|
|
+ffffffff810442ce: eb c9 jmp ffffffff81044299 <sys_getppid+0x9>
|
|
|
+
|
|
|
+Without the jump label optimization it looks like:
|
|
|
+
|
|
|
+ffffffff810441f0 <sys_getppid>:
|
|
|
+ffffffff810441f0: 8b 05 8a 52 d8 00 mov 0xd8528a(%rip),%eax # ffffffff81dc9480 <key>
|
|
|
+ffffffff810441f6: 55 push %rbp
|
|
|
+ffffffff810441f7: 48 89 e5 mov %rsp,%rbp
|
|
|
+ffffffff810441fa: 85 c0 test %eax,%eax
|
|
|
+ffffffff810441fc: 75 27 jne ffffffff81044225 <sys_getppid+0x35>
|
|
|
+ffffffff810441fe: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax
|
|
|
+ffffffff81044205: 00 00
|
|
|
+ffffffff81044207: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax
|
|
|
+ffffffff8104420e: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax
|
|
|
+ffffffff81044215: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi
|
|
|
+ffffffff8104421c: e8 2f da 00 00 callq ffffffff81051c50 <pid_vnr>
|
|
|
+ffffffff81044221: 5d pop %rbp
|
|
|
+ffffffff81044222: 48 98 cltq
|
|
|
+ffffffff81044224: c3 retq
|
|
|
+ffffffff81044225: 48 c7 c7 13 53 98 81 mov $0xffffffff81985313,%rdi
|
|
|
+ffffffff8104422c: 31 c0 xor %eax,%eax
|
|
|
+ffffffff8104422e: e8 60 0f 6d 00 callq ffffffff81715193 <printk>
|
|
|
+ffffffff81044233: eb c9 jmp ffffffff810441fe <sys_getppid+0xe>
|
|
|
+ffffffff81044235: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1)
|
|
|
+ffffffff8104423c: 00 00 00 00
|
|
|
+
|
|
|
+Thus, the disable jump label case adds a 'mov', 'test' and 'jne' instruction
|
|
|
+vs. the jump label case just has a 'no-op' or 'jmp 0'. (The jmp 0, is patched
|
|
|
+to a 5 byte atomic no-op instruction at boot-time.) Thus, the disabled jump
|
|
|
+label case adds:
|
|
|
+
|
|
|
+6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 addition bytes.
|
|
|
+
|
|
|
+If we then include the padding bytes, the jump label code saves, 16 total bytes
|
|
|
+of instruction memory for this small fucntion. In this case the non-jump label
|
|
|
+function is 80 bytes long. Thus, we have have saved 20% of the instruction
|
|
|
+footprint. We can in fact improve this even further, since the 5-byte no-op
|
|
|
+really can be a 2-byte no-op since we can reach the branch with a 2-byte jmp.
|
|
|
+However, we have not yet implemented optimal no-op sizes (they are currently
|
|
|
+hard-coded).
|
|
|
+
|
|
|
+Since there are a number of static key API uses in the scheduler paths,
|
|
|
+'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
|
|
|
+performance improvement. Testing done on 3.3.0-rc2:
|
|
|
+
|
|
|
+jump label disabled:
|
|
|
+
|
|
|
+ Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
|
|
|
+
|
|
|
+ 855.700314 task-clock # 0.534 CPUs utilized ( +- 0.11% )
|
|
|
+ 200,003 context-switches # 0.234 M/sec ( +- 0.00% )
|
|
|
+ 0 CPU-migrations # 0.000 M/sec ( +- 39.58% )
|
|
|
+ 487 page-faults # 0.001 M/sec ( +- 0.02% )
|
|
|
+ 1,474,374,262 cycles # 1.723 GHz ( +- 0.17% )
|
|
|
+ <not supported> stalled-cycles-frontend
|
|
|
+ <not supported> stalled-cycles-backend
|
|
|
+ 1,178,049,567 instructions # 0.80 insns per cycle ( +- 0.06% )
|
|
|
+ 208,368,926 branches # 243.507 M/sec ( +- 0.06% )
|
|
|
+ 5,569,188 branch-misses # 2.67% of all branches ( +- 0.54% )
|
|
|
+
|
|
|
+ 1.601607384 seconds time elapsed ( +- 0.07% )
|
|
|
+
|
|
|
+jump label enabled:
|
|
|
+
|
|
|
+ Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
|
|
|
+
|
|
|
+ 841.043185 task-clock # 0.533 CPUs utilized ( +- 0.12% )
|
|
|
+ 200,004 context-switches # 0.238 M/sec ( +- 0.00% )
|
|
|
+ 0 CPU-migrations # 0.000 M/sec ( +- 40.87% )
|
|
|
+ 487 page-faults # 0.001 M/sec ( +- 0.05% )
|
|
|
+ 1,432,559,428 cycles # 1.703 GHz ( +- 0.18% )
|
|
|
+ <not supported> stalled-cycles-frontend
|
|
|
+ <not supported> stalled-cycles-backend
|
|
|
+ 1,175,363,994 instructions # 0.82 insns per cycle ( +- 0.04% )
|
|
|
+ 206,859,359 branches # 245.956 M/sec ( +- 0.04% )
|
|
|
+ 4,884,119 branch-misses # 2.36% of all branches ( +- 0.85% )
|
|
|
+
|
|
|
+ 1.579384366 seconds time elapsed
|
|
|
+
|
|
|
+The percentage of saved branches is .7%, and we've saved 12% on
|
|
|
+'branch-misses'. This is where we would expect to get the most savings, since
|
|
|
+this optimization is about reducing the number of branches. In addition, we've
|
|
|
+saved .2% on instructions, and 2.8% on cycles and 1.4% on elapsed time.
|