123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283 |
- Performance Counters for Linux
- ------------------------------
- Performance counters are special hardware registers available on most modern
- CPUs. These registers count the number of certain types of hw events: such
- as instructions executed, cachemisses suffered, or branches mis-predicted -
- without slowing down the kernel or applications. These registers can also
- trigger interrupts when a threshold number of events have passed - and can
- thus be used to profile the code that runs on that CPU.
- The Linux Performance Counter subsystem provides an abstraction of these
- hardware capabilities. It provides per task and per CPU counters, counter
- groups, and it provides event capabilities on top of those. It
- provides "virtual" 64-bit counters, regardless of the width of the
- underlying hardware counters.
- Performance counters are accessed via special file descriptors.
- There's one file descriptor per virtual counter used.
- The special file descriptor is opened via the perf_counter_open()
- system call:
- int sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr,
- pid_t pid, int cpu, int group_fd,
- unsigned long flags);
- The syscall returns the new fd. The fd can be used via the normal
- VFS system calls: read() can be used to read the counter, fcntl()
- can be used to set the blocking mode, etc.
- Multiple counters can be kept open at a time, and the counters
- can be poll()ed.
- When creating a new counter fd, 'perf_counter_hw_event' is:
- /*
- * Event to monitor via a performance monitoring counter:
- */
- struct perf_counter_hw_event {
- __u64 event_config;
- __u64 irq_period;
- __u64 record_type;
- __u64 read_format;
- __u64 disabled : 1, /* off by default */
- nmi : 1, /* NMI sampling */
- inherit : 1, /* children inherit it */
- pinned : 1, /* must always be on PMU */
- exclusive : 1, /* only group on PMU */
- exclude_user : 1, /* don't count user */
- exclude_kernel : 1, /* ditto kernel */
- exclude_hv : 1, /* ditto hypervisor */
- exclude_idle : 1, /* don't count when idle */
- __reserved_1 : 55;
- __u32 extra_config_len;
- __u32 __reserved_4;
- __u64 __reserved_2;
- __u64 __reserved_3;
- };
- The 'event_config' field specifies what the counter should count. It
- is divided into 3 bit-fields:
- raw_type: 1 bit (most significant bit) 0x8000_0000_0000_0000
- type: 7 bits (next most significant) 0x7f00_0000_0000_0000
- event_id: 56 bits (least significant) 0x00ff_0000_0000_0000
- If 'raw_type' is 1, then the counter will count a hardware event
- specified by the remaining 63 bits of event_config. The encoding is
- machine-specific.
- If 'raw_type' is 0, then the 'type' field says what kind of counter
- this is, with the following encoding:
- enum perf_event_types {
- PERF_TYPE_HARDWARE = 0,
- PERF_TYPE_SOFTWARE = 1,
- PERF_TYPE_TRACEPOINT = 2,
- };
- A counter of PERF_TYPE_HARDWARE will count the hardware event
- specified by 'event_id':
- /*
- * Generalized performance counter event types, used by the hw_event.event_id
- * parameter of the sys_perf_counter_open() syscall:
- */
- enum hw_event_ids {
- /*
- * Common hardware events, generalized by the kernel:
- */
- PERF_COUNT_CPU_CYCLES = 0,
- PERF_COUNT_INSTRUCTIONS = 1,
- PERF_COUNT_CACHE_REFERENCES = 2,
- PERF_COUNT_CACHE_MISSES = 3,
- PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
- PERF_COUNT_BRANCH_MISSES = 5,
- PERF_COUNT_BUS_CYCLES = 6,
- };
- These are standardized types of events that work relatively uniformly
- on all CPUs that implement Performance Counters support under Linux,
- although there may be variations (e.g., different CPUs might count
- cache references and misses at different levels of the cache hierarchy).
- If a CPU is not able to count the selected event, then the system call
- will return -EINVAL.
- More hw_event_types are supported as well, but they are CPU-specific
- and accessed as raw events. For example, to count "External bus
- cycles while bus lock signal asserted" events on Intel Core CPUs, pass
- in a 0x4064 event_id value and set hw_event.raw_type to 1.
- A counter of type PERF_TYPE_SOFTWARE will count one of the available
- software events, selected by 'event_id':
- /*
- * Special "software" counters provided by the kernel, even if the hardware
- * does not support performance counters. These counters measure various
- * physical and sw events of the kernel (and allow the profiling of them as
- * well):
- */
- enum sw_event_ids {
- PERF_COUNT_CPU_CLOCK = 0,
- PERF_COUNT_TASK_CLOCK = 1,
- PERF_COUNT_PAGE_FAULTS = 2,
- PERF_COUNT_CONTEXT_SWITCHES = 3,
- PERF_COUNT_CPU_MIGRATIONS = 4,
- PERF_COUNT_PAGE_FAULTS_MIN = 5,
- PERF_COUNT_PAGE_FAULTS_MAJ = 6,
- };
- Counters come in two flavours: counting counters and sampling
- counters. A "counting" counter is one that is used for counting the
- number of events that occur, and is characterised by having
- irq_period = 0 and record_type = PERF_RECORD_SIMPLE. A read() on a
- counting counter simply returns the current value of the counter as
- an 8-byte number.
- A "sampling" counter is one that is set up to generate an interrupt
- every N events, where N is given by 'irq_period'. A sampling counter
- has irq_period > 0 and record_type != PERF_RECORD_SIMPLE. The
- record_type controls what data is recorded on each interrupt, and the
- available values are currently:
- /*
- * IRQ-notification data record type:
- */
- enum perf_counter_record_type {
- PERF_RECORD_SIMPLE = 0,
- PERF_RECORD_IRQ = 1,
- PERF_RECORD_GROUP = 2,
- };
- A record_type value of PERF_RECORD_IRQ will record the instruction
- pointer (IP) at which the interrupt occurred. A record_type value of
- PERF_RECORD_GROUP will record the event_config and counter value of
- all of the other counters in the group, and should only be used on a
- group leader (see below). Currently these two values are mutually
- exclusive, but record_type will become a bit-mask in future and
- support other values.
- A sampling counter has an event queue, into which an event is placed
- on each interrupt. A read() on a sampling counter will read the next
- event from the event queue. If the queue is empty, the read() will
- either block or return an EAGAIN error, depending on whether the fd
- has been set to non-blocking mode or not.
- The 'disabled' bit specifies whether the counter starts out disabled
- or enabled. If it is initially disabled, it can be enabled by ioctl
- or prctl (see below).
- The 'nmi' bit specifies, for hardware events, whether the counter
- should be set up to request non-maskable interrupts (NMIs) or normal
- interrupts. This bit is ignored if the user doesn't have
- CAP_SYS_ADMIN privilege (i.e. is not root) or if the CPU doesn't
- generate NMIs from hardware counters.
- The 'inherit' bit, if set, specifies that this counter should count
- events on descendant tasks as well as the task specified. This only
- applies to new descendents, not to any existing descendents at the
- time the counter is created (nor to any new descendents of existing
- descendents).
- The 'pinned' bit, if set, specifies that the counter should always be
- on the CPU if at all possible. It only applies to hardware counters
- and only to group leaders. If a pinned counter cannot be put onto the
- CPU (e.g. because there are not enough hardware counters or because of
- a conflict with some other event), then the counter goes into an
- 'error' state, where reads return end-of-file (i.e. read() returns 0)
- until the counter is subsequently enabled or disabled.
- The 'exclusive' bit, if set, specifies that when this counter's group
- is on the CPU, it should be the only group using the CPU's counters.
- In future, this will allow sophisticated monitoring programs to supply
- extra configuration information via 'extra_config_len' to exploit
- advanced features of the CPU's Performance Monitor Unit (PMU) that are
- not otherwise accessible and that might disrupt other hardware
- counters.
- The 'exclude_user', 'exclude_kernel' and 'exclude_hv' bits provide a
- way to request that counting of events be restricted to times when the
- CPU is in user, kernel and/or hypervisor mode.
- The 'pid' parameter to the perf_counter_open() system call allows the
- counter to be specific to a task:
- pid == 0: if the pid parameter is zero, the counter is attached to the
- current task.
- pid > 0: the counter is attached to a specific task (if the current task
- has sufficient privilege to do so)
- pid < 0: all tasks are counted (per cpu counters)
- The 'cpu' parameter allows a counter to be made specific to a CPU:
- cpu >= 0: the counter is restricted to a specific CPU
- cpu == -1: the counter counts on all CPUs
- (Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)
- A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
- events of that task and 'follows' that task to whatever CPU the task
- gets schedule to. Per task counters can be created by any user, for
- their own tasks.
- A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
- all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
- The 'flags' parameter is currently unused and must be zero.
- The 'group_fd' parameter allows counter "groups" to be set up. A
- counter group has one counter which is the group "leader". The leader
- is created first, with group_fd = -1 in the perf_counter_open call
- that creates it. The rest of the group members are created
- subsequently, with group_fd giving the fd of the group leader.
- (A single counter on its own is created with group_fd = -1 and is
- considered to be a group with only 1 member.)
- A counter group is scheduled onto the CPU as a unit, that is, it will
- only be put onto the CPU if all of the counters in the group can be
- put onto the CPU. This means that the values of the member counters
- can be meaningfully compared, added, divided (to get ratios), etc.,
- with each other, since they have counted events for the same set of
- executed instructions.
- Counters can be enabled and disabled in two ways: via ioctl and via
- prctl. When a counter is disabled, it doesn't count or generate
- events but does continue to exist and maintain its count value.
- An individual counter or counter group can be enabled with
- ioctl(fd, PERF_COUNTER_IOC_ENABLE);
- or disabled with
- ioctl(fd, PERF_COUNTER_IOC_DISABLE);
- Enabling or disabling the leader of a group enables or disables the
- whole group; that is, while the group leader is disabled, none of the
- counters in the group will count. Enabling or disabling a member of a
- group other than the leader only affects that counter - disabling an
- non-leader stops that counter from counting but doesn't affect any
- other counter.
- A process can enable or disable all the counter groups that are
- attached to it, using prctl:
- prctl(PR_TASK_PERF_COUNTERS_ENABLE);
- prctl(PR_TASK_PERF_COUNTERS_DISABLE);
- This applies to all counters on the current process, whether created
- by this process or by another, and doesn't affect any counters that
- this process has created on other processes. It only enables or
- disables the group leaders, not any other members in the groups.
|