
Performance Counters for Linux
------------------------------

Performance counters are special hardware registers available on most modern
CPUs. These registers count the number of certain types of hw events, such
as instructions executed, cache misses suffered, or branches mispredicted,
without slowing down the kernel or applications. These registers can also
trigger interrupts when a threshold number of events have passed, and can
thus be used to profile the code that runs on that CPU.

The Linux Performance Counter subsystem provides an abstraction of these
hardware capabilities. It provides per task and per CPU counters and counter
groups, and it provides event capabilities on top of those.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.

The special file descriptor is opened via the perf_counter_open()
system call:

   int sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr,
                             pid_t pid, int cpu, int group_fd);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
can be used to set the blocking mode, etc.

Multiple counters can be kept open at a time, and the counters
can be poll()ed.
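
As a rough illustration of the read()/poll() usage described above, the
following sketch waits on several counter fds and reads an 8-byte count from
the first one that becomes readable. The helper wait_and_read_counter() and
the MAX_COUNTERS limit are made up for this example. It assumes the fds were
obtained from sys_perf_counter_open() and that each read() yields a plain
u64 count; only the standard poll() and read() calls are used.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define MAX_COUNTERS 16		/* arbitrary limit chosen for this sketch */

/*
 * Hypothetical helper: wait up to timeout_ms for any of the counter fds
 * to become readable, then read its 8-byte count.
 * Returns the index of the fd that was read, or -1 on timeout or error.
 */
static int wait_and_read_counter(const int *fds, size_t nfds,
				 int timeout_ms, uint64_t *count)
{
	struct pollfd pfds[MAX_COUNTERS];
	size_t i;

	assert(nfds <= MAX_COUNTERS);

	for (i = 0; i < nfds; i++) {
		pfds[i].fd = fds[i];
		pfds[i].events = POLLIN;
		pfds[i].revents = 0;
	}

	if (poll(pfds, (nfds_t)nfds, timeout_ms) <= 0)
		return -1;		/* timeout or poll() failure */

	for (i = 0; i < nfds; i++) {
		if (!(pfds[i].revents & POLLIN))
			continue;
		if (read(pfds[i].fd, count, sizeof(*count)) !=
		    (ssize_t)sizeof(*count))
			return -1;	/* short read or error */
		return (int)i;
	}
	return -1;			/* only error events were reported */
}
```

A caller would typically loop on this helper and treat a return value of -1
as "nothing to report within the timeout".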

When creating a new counter fd, 'perf_counter_hw_event' is:

/*
 * Hardware event to monitor via a performance monitoring counter:
 */
struct perf_counter_hw_event {
        s64                     type;

        u64                     irq_period;
        u32                     record_type;

        u32                     disabled     :  1, /* off by default */
                                nmi          :  1, /* NMI sampling   */
                                raw          :  1, /* raw event type */
                                __reserved_1 : 29;

        u64                     __reserved_2;
};

/*
 * Generalized performance counter event types, used by the hw_event.type
 * parameter of the sys_perf_counter_open() syscall:
 */
enum hw_event_types {
        /*
         * Common hardware events, generalized by the kernel:
         */
        PERF_COUNT_CYCLES               =  0,
        PERF_COUNT_INSTRUCTIONS         =  1,
        PERF_COUNT_CACHE_REFERENCES     =  2,
        PERF_COUNT_CACHE_MISSES         =  3,
        PERF_COUNT_BRANCH_INSTRUCTIONS  =  4,
        PERF_COUNT_BRANCH_MISSES        =  5,

        /*
         * Special "software" counters provided by the kernel, even if
         * the hardware does not support performance counters. These
         * counters measure various physical and sw events of the
         * kernel (and allow the profiling of them as well):
         */
        PERF_COUNT_CPU_CLOCK            = -1,
        PERF_COUNT_TASK_CLOCK           = -2,
        /*
         * Future software events:
         */
        /* PERF_COUNT_PAGE_FAULTS       = -3,
           PERF_COUNT_CONTEXT_SWITCHES  = -4, */
};

These are standardized types of events that work uniformly on all CPUs
that implement Performance Counters support under Linux. If a CPU is
not able to count a requested event, such as branch misses, the system
call will return -EINVAL.

More hw_event_types are supported as well, but they are CPU
specific and are enumerated via /sys on a per CPU basis. Raw hw event
types can be passed in under hw_event.type if hw_event.raw is 1.
For example, to count "External bus cycles while bus lock signal asserted"
events on Intel Core CPUs, pass in a 0x4064 event type value and set
hw_event.raw to 1.
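
The difference between a generalized and a raw event description can be
sketched as follows. This is only an illustration: the structure is a
user-space copy of the layout shown above (a real program would take it from
the kernel headers), only two of the event types are repeated, and the
helpers make_generic_event() and make_raw_event() are hypothetical names that
are not part of the kernel interface.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* User-space copy of the structure from this document. */
struct perf_counter_hw_event {
	int64_t		type;
	uint64_t	irq_period;
	uint32_t	record_type;
	uint32_t	disabled     :  1,	/* off by default */
			nmi          :  1,	/* NMI sampling */
			raw          :  1,	/* raw event type */
			__reserved_1 : 29;
	uint64_t	__reserved_2;
};

/* Subset of the generalized event types listed above. */
enum hw_event_types {
	PERF_COUNT_CYCLES	= 0,
	PERF_COUNT_CACHE_MISSES	= 3,
};

/* Hypothetical helper: describe one of the generalized events. */
static struct perf_counter_hw_event make_generic_event(enum hw_event_types type)
{
	struct perf_counter_hw_event ev;

	memset(&ev, 0, sizeof(ev));	/* reserved fields stay zero */
	ev.type = type;			/* raw == 0: kernel-generalized type */
	return ev;
}

/* Hypothetical helper: describe a CPU-specific raw event. */
static struct perf_counter_hw_event make_raw_event(uint64_t raw_code)
{
	struct perf_counter_hw_event ev;

	/* 'type' is a signed 64-bit field, so the code must fit in it. */
	assert(raw_code <= (uint64_t)INT64_MAX);

	memset(&ev, 0, sizeof(ev));
	ev.type = (int64_t)raw_code;
	ev.raw = 1;			/* 'type' is a raw hardware code */
	return ev;
}
```

With these helpers, make_raw_event(0x4064) would describe the Intel Core bus
lock event mentioned above, and the resulting structure would be passed as
the first argument of sys_perf_counter_open().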

'record_type' is the type of data that a read() will provide for the
counter, and it can be one of:

/*
 * IRQ-notification data record type:
 */
enum perf_counter_record_type {
        PERF_RECORD_SIMPLE              =  0,
        PERF_RECORD_IRQ                 =  1,
        PERF_RECORD_GROUP               =  2,
};

A "simple" counter is one that counts hardware events and allows
them to be read out into a u64 count value. (read() returns 8 on
a successful read of a simple counter.)

An "irq" counter is one that also provides IRQ context information:
the IP of the interrupted context. In this case read() will return
the 8-byte counter value, plus the Instruction Pointer address of the
interrupted context.
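
To make the two record layouts concrete, here is a small decoding sketch for
a buffer that read() has already filled. It rests on an assumption that the
text above does not spell out: the Instruction Pointer of an "irq" record is
taken to be a 64-bit word stored directly after the 8-byte count. The
struct counter_sample type and the decode_record() helper are hypothetical,
and PERF_RECORD_GROUP is rejected because its layout is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum perf_counter_record_type {
	PERF_RECORD_SIMPLE	= 0,
	PERF_RECORD_IRQ		= 1,
	PERF_RECORD_GROUP	= 2,
};

/* Hypothetical container for one decoded record. */
struct counter_sample {
	uint64_t count;		/* counter value */
	uint64_t ip;		/* interrupted IP, "irq" records only */
};

/*
 * Decode 'len' bytes returned by read() on a counter fd.
 * Returns 0 on success, -1 if the length does not match the record type
 * or the record type is not handled by this sketch.
 */
static int decode_record(enum perf_counter_record_type type,
			 const void *buf, size_t len,
			 struct counter_sample *out)
{
	const unsigned char *p = buf;

	assert(buf != NULL && out != NULL);
	out->count = 0;
	out->ip = 0;

	switch (type) {
	case PERF_RECORD_SIMPLE:
		if (len != sizeof(uint64_t))	/* read() returns 8 */
			return -1;
		memcpy(&out->count, p, sizeof(out->count));
		return 0;
	case PERF_RECORD_IRQ:
		/* ASSUMPTION: 8-byte count followed by an 8-byte IP. */
		if (len < 2 * sizeof(uint64_t))
			return -1;
		memcpy(&out->count, p, sizeof(out->count));
		memcpy(&out->ip, p + sizeof(out->count), sizeof(out->ip));
		return 0;
	default:
		return -1;	/* group records are not covered here */
	}
}
```

Using memcpy() rather than casting the buffer pointer keeps the sketch free
of alignment assumptions about the caller's buffer.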

The 'irq_period' field of hw_event is the number of events before waking up
a read() that is blocked on a counter fd. A value of zero means a
non-blocking counter.

The 'pid' parameter allows the counter to be specific to a task:

 pid == 0: if the pid parameter is zero, the counter is attached to the
 current task.

 pid > 0: the counter is attached to a specific task (if the current task
 has sufficient privilege to do so)

 pid < 0: all tasks are counted (per cpu counters)

The 'cpu' parameter allows a counter to be made specific to a full
CPU:

 cpu >= 0: the counter is restricted to a specific CPU
 cpu == -1: the counter counts on all CPUs

(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)

A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
events of that task and 'follows' that task to whatever CPU the task
gets scheduled to. Per task counters can be created by any user, for
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
all events on CPU-x. Per CPU counters require the CAP_SYS_ADMIN privilege.
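
The pid/cpu rules above can be summarized in a few lines of code. The sketch
below is not kernel code: the enum counter_scope names and both helpers are
invented for illustration. Two points are inferred rather than stated in the
text: a 'pid >= 0' counter with 'cpu >= 0' is treated as a task counter
restricted to one CPU, and any cpu value below -1 is treated as invalid.

```c
#include <assert.h>
#include <sys/types.h>

/* Hypothetical classification of a (pid, cpu) pair. */
enum counter_scope {
	SCOPE_INVALID,		/* e.g. pid == -1 together with cpu == -1 */
	SCOPE_TASK_ANY_CPU,	/* per task counter that follows the task */
	SCOPE_TASK_ONE_CPU,	/* task counter restricted to one CPU */
	SCOPE_CPU_WIDE,		/* per CPU counter, all tasks */
};

static enum counter_scope classify_scope(pid_t pid, int cpu)
{
	if (cpu < -1)
		return SCOPE_INVALID;	/* assumption: not a defined value */
	if (pid < 0)
		return cpu == -1 ? SCOPE_INVALID : SCOPE_CPU_WIDE;
	/* pid == 0 is the current task, pid > 0 a specific task */
	return cpu == -1 ? SCOPE_TASK_ANY_CPU : SCOPE_TASK_ONE_CPU;
}

/* Per CPU counters are the ones documented as needing CAP_SYS_ADMIN. */
static int scope_needs_cap_sys_admin(enum counter_scope scope)
{
	assert(scope != SCOPE_INVALID);	/* nothing to open in that case */
	return scope == SCOPE_CPU_WIDE;
}
```

A tool could run such a check before calling sys_perf_counter_open() to
report an invalid combination or a missing privilege early, although the
kernel remains the final authority on both.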

Group counters are created by passing in a group_fd of another counter.
Groups are scheduled at once and can be used with PERF_RECORD_GROUP
to record multi-dimensional timestamps.