@@ -95,12 +95,15 @@ SECCOMP_RET_KILL:
SECCOMP_RET_TRAP:
Results in the kernel sending a SIGSYS signal to the triggering
- task without executing the system call. The kernel will
- rollback the register state to just before the system call
- entry such that a signal handler in the task will be able to
- inspect the ucontext_t->uc_mcontext registers and emulate
- system call success or failure upon return from the signal
- handler.
+ task without executing the system call. siginfo->si_call_addr
+ will show the address of the system call instruction, and
+ siginfo->si_syscall and siginfo->si_arch will indicate which
+ syscall was attempted. The program counter will be as though
+ the syscall happened (i.e. it will not point to the syscall
+ instruction). The return value register will contain an
+ arch-dependent value -- if resuming execution, set it to something
+ sensible. (The architecture dependency is because replacing
+ it with -ENOSYS could overwrite some useful information.)

The SECCOMP_RET_DATA portion of the return value will be passed
as si_errno.
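A minimal sketch of how a filter return value might be composed and
split, assuming the bit layout defined in <linux/seccomp.h>. The
EX_-prefixed macros and the ex_* helpers are names invented for this
example and are not kernel API; real code would include the header and
use SECCOMP_RET_TRAP, SECCOMP_RET_ACTION and SECCOMP_RET_DATA directly.

```c
#include <stdint.h>

/* Local copies of the values in <linux/seccomp.h> (assumed layout). */
#define EX_SECCOMP_RET_TRAP   0x00030000U /* action: deliver SIGSYS */
#define EX_SECCOMP_RET_ACTION 0x7fff0000U /* mask for the action bits */
#define EX_SECCOMP_RET_DATA   0x0000ffffU /* mask for the 16 data bits */

/* Build a filter return value that traps and carries 16 bits of data. */
static inline uint32_t ex_ret_trap(uint16_t data)
{
    return EX_SECCOMP_RET_TRAP | ((uint32_t)data & EX_SECCOMP_RET_DATA);
}

/* Extract the action portion of a filter return value. */
static inline uint32_t ex_ret_action(uint32_t ret)
{
    return ret & EX_SECCOMP_RET_ACTION;
}

/* Extract the data portion, which a SIGSYS handler would see as si_errno. */
static inline uint32_t ex_ret_data(uint32_t ret)
{
    return ret & EX_SECCOMP_RET_DATA;
}
```

With these helpers, a filter that returns ex_ret_trap(42) would, under
the behavior described above, cause the handler to observe si_errno
equal to 42.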
@@ -123,6 +126,18 @@ SECCOMP_RET_TRACE:
the BPF program return value will be available to the tracer
via PTRACE_GETEVENTMSG.

+ The tracer can skip the system call by changing the syscall number
+ to -1. Alternatively, the tracer can change the system call
+ requested by changing the system call to a valid syscall number. If
+ the tracer asks to skip the system call, then the system call will
+ appear to return the value that the tracer puts in the return value
+ register.
+
+ The seccomp check will not be run again after the tracer is
+ notified. (This means that seccomp-based sandboxes MUST NOT
+ allow use of ptrace, even of other sandboxed processes, without
+ extreme care; ptracers can use this mechanism to escape.)
+
SECCOMP_RET_ALLOW:
Results in the system call being executed.
@@ -161,3 +176,50 @@ architecture supports both ptrace_event and seccomp, it will be able to
support seccomp filter with minor fixup: SIGSYS support and seccomp return
value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER
to its arch-specific Kconfig.
+
+
+
+Caveats
+-------
+
+The vDSO can cause some system calls to run entirely in userspace,
+leading to surprises when you run programs on different machines that
+fall back to real syscalls. To minimize these surprises on x86, make
+sure you test with
+/sys/devices/system/clocksource/clocksource0/current_clocksource set to
+something like acpi_pm.
+
+On x86-64, vsyscall emulation is enabled by default. (vsyscalls are
+legacy variants on vDSO calls.) Currently, emulated vsyscalls will
+honor seccomp, with a few oddities:
+
+- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to
+ the vsyscall entry for the given call and not the address after the
+ 'syscall' instruction. Any code which wants to restart the call
+ should be aware that (a) a ret instruction has been emulated and (b)
+ trying to resume the syscall will again trigger the standard vsyscall
+ emulation security checks, making resuming the syscall mostly
+ pointless.
+
+- A return value of SECCOMP_RET_TRACE will signal the tracer as usual,
+ but the syscall may not be changed to another system call using the
+ orig_rax register. It may only be changed to -1 in order to skip the
+ currently emulated call. Any other change MAY terminate the process.
+ The rip value seen by the tracer will be the syscall entry address;
+ this is different from normal behavior. The tracer MUST NOT modify
+ rip or rsp. (Do not rely on other changes terminating the process.
+ They might work. For example, on some kernels, choosing a syscall
+ that only exists in future kernels will be correctly emulated (by
+ returning -ENOSYS).)
+
+To detect this quirky behavior, check for (addr & ~0x0C00) ==
+0xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For
+SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other
+condition: future kernels may improve vsyscall emulation and current
+kernels in vsyscall=native mode will behave differently, but the
+instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these
+cases.
+
+Note that modern systems are unlikely to use vsyscalls at all -- they
+are a legacy feature and they are considerably slower than standard
+syscalls. New code will use the vDSO, and vDSO-issued system calls
+are indistinguishable from normal system calls.
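The vsyscall address check described in the caveats can be sketched as
a small C helper. This is an illustration only: ex_is_vsyscall_entry
and the EX_-prefixed macros are hypothetical names, and the address
passed in is assumed to be rip (for SECCOMP_RET_TRACE) or
siginfo->si_call_addr (for SECCOMP_RET_TRAP) on x86-64. Note the
explicit parentheses: in C, == binds more tightly than &, so the mask
must be applied before the comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Base address of the x86-64 vsyscall page, as given in the text. */
#define EX_VSYSCALL_BASE      UINT64_C(0xFFFFFFFFFF600000)
/* Bits selecting one of the four entry slots: 0x000, 0x400, 0x800, 0xC00. */
#define EX_VSYSCALL_SLOT_MASK UINT64_C(0x0C00)

/*
 * Return true if addr is one of the four vsyscall entry addresses.
 * Clearing the slot bits must leave exactly the page base; any other
 * offset within the page, or any other address, is rejected.
 */
static inline bool ex_is_vsyscall_entry(uint64_t addr)
{
    return (addr & ~EX_VSYSCALL_SLOT_MASK) == EX_VSYSCALL_BASE;
}
```

A SIGSYS handler might call ex_is_vsyscall_entry((uint64_t)(uintptr_t)
info->si_call_addr) before deciding whether resuming the call makes
sense; as the text advises, no other condition should be checked.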