@@ -3,9 +3,9 @@
--------------------------------------------------------------------------------

This file documents the mmap() facility available with the PACKET
-socket interface on 2.4 and 2.6 kernels. This type of sockets is used for
-capture network traffic with utilities like tcpdump or any other that needs
-raw access to network interface.
+socket interface on 2.4/2.6/3.x kernels. This type of socket is used to
+i) capture network traffic with utilities like tcpdump, ii) transmit network
+traffic, or iii) do anything else that needs raw access to a network interface.

You can find the latest version of this document at:
http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
@@ -21,19 +21,18 @@ Please send your comments to
+ Why use PACKET_MMAP
--------------------------------------------------------------------------------

-In Linux 2.4/2.6 if PACKET_MMAP is not enabled, the capture process is very
-inefficient. It uses very limited buffers and requires one system call
-to capture each packet, it requires two if you want to get packet's
-timestamp (like libpcap always does).
+In Linux 2.4/2.6/3.x, if PACKET_MMAP is not enabled, the capture process is very
+inefficient. It uses very limited buffers and requires one system call to
+capture each packet; it requires two if you want to get the packet's timestamp
+(like libpcap always does).

In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
configurable circular buffer mapped in user space that can be used to either
send or receive packets. This way reading packets just needs to wait for them,
most of the time there is no need to issue a single system call. Concerning
transmission, multiple packets can be sent through one system call to get the
-highest bandwidth.
-By using a shared buffer between the kernel and the user also has the benefit
-of minimizing packet copies.
+highest bandwidth. Using a shared buffer between the kernel and the user
+also has the benefit of minimizing packet copies.

It's fine to use PACKET_MMAP to improve the performance of the capture and
transmission process, but it isn't everything. At least, if you are capturing
@@ -41,7 +40,8 @@ at high speeds (this is relative to the cpu speed), you should check if the
device driver of your network interface card supports some sort of interrupt
load mitigation or (even better) if it supports NAPI, also make sure it is
enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
-supported by devices of your network.
+supported by devices of your network. CPU IRQ pinning of your network interface
+card can also be an advantage.

--------------------------------------------------------------------------------
+ How to use mmap() to improve capture process
@@ -87,9 +87,7 @@ the following process:
socket creation and destruction is straight forward, and is done
the same way with or without PACKET_MMAP:

-int fd;
-
-fd= socket(PF_PACKET, mode, htons(ETH_P_ALL))
+ int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));

where mode is SOCK_RAW for the raw interface were link level
information can be captured or SOCK_DGRAM for the cooked
@@ -180,7 +178,6 @@ and the PACKET_TX_HAS_OFF option.
+ PACKET_MMAP settings
--------------------------------------------------------------------------------

-
To setup PACKET_MMAP from user level code is done with a call like

- Capture process
@@ -214,7 +211,6 @@ indeed, packet_set_ring checks that the following condition is true

frames_per_block * tp_block_nr == tp_frame_nr

-
Lets see an example, with the following values:

tp_block_size= 4096
@@ -240,7 +236,6 @@ be spawned across two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
buffer (ring)".
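
The ring geometry condition quoted earlier (frames_per_block * tp_block_nr ==
tp_frame_nr) can be checked in user space before the ring is requested. The
following is only a sketch: ring_geometry_ok() and ring_frame_nr() are names
made up for this example, and it assumes that frames_per_block is
tp_block_size / tp_frame_size with integer division.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: mirrors the condition quoted from packet_set_ring. */
static bool ring_geometry_ok(unsigned int block_size, unsigned int block_nr,
                             unsigned int frame_size, unsigned int frame_nr)
{
    unsigned int frames_per_block;

    if (frame_size == 0)
        return false;
    /* Assumption: whole frames only, so integer division is intended. */
    frames_per_block = block_size / frame_size;
    if (frames_per_block == 0)
        return false;   /* a single frame would not fit in a block */
    return frames_per_block * block_nr == frame_nr;
}

/* Hypothetical helper: derive a tp_frame_nr that satisfies the condition. */
static unsigned int ring_frame_nr(unsigned int block_size,
                                  unsigned int block_nr,
                                  unsigned int frame_size)
{
    unsigned int frame_nr;

    assert(frame_size != 0 && frame_size <= block_size);
    frame_nr = (block_size / frame_size) * block_nr;
    assert(ring_geometry_ok(block_size, block_nr, frame_size, frame_nr));
    return frame_nr;
}
```

Deriving tp_frame_nr from the other three values, instead of choosing it
independently, is one way to keep the condition true by construction.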

-
--------------------------------------------------------------------------------
+ PACKET_MMAP setting constraints
--------------------------------------------------------------------------------
@@ -277,7 +272,6 @@ User space programs can include /usr/include/sys/user.h and
The pagesize can also be determined dynamically with the getpagesize (2)
system call.

-
Block number limit
--------------------

@@ -297,7 +291,6 @@ called pg_vec, its size limits the number of blocks that can be allocated.
v block #2
block #1

-
kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator which is at the end the responsible for doing the allocation and
@@ -312,7 +305,6 @@ pointers to blocks is

131072/4 = 32768 blocks

-
PACKET_MMAP buffer size calculator
------------------------------------

@@ -353,7 +345,6 @@ and a value for <frame size> of 2048 bytes. These parameters will yield
and hence the buffer will have a 262144 MiB size. So it can hold
262144 MiB / 2048 bytes = 134217728 frames

-
Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space, in the case of
an i386 kernel's memory size is limited to 1GiB.
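
The figures quoted above (32768 blocks, 134217728 frames, and the 1GiB limit
on i386) can be reproduced with a few lines of C. This is only a sketch: the
function and macro names are made up for illustration, and 64 bit integers are
used because the byte count of the buffer does not fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (UINT64_C(1) << 20)
#define GIB (UINT64_C(1) << 30)

/* Number of block pointers that fit in a pg_vec of pg_vec_bytes bytes. */
static uint64_t max_block_nr(uint64_t pg_vec_bytes, uint64_t pointer_size)
{
    assert(pointer_size != 0);
    return pg_vec_bytes / pointer_size;
}

/* Number of whole frames that fit in a buffer of buffer_bytes bytes. */
static uint64_t frames_in_buffer(uint64_t buffer_bytes, uint64_t frame_size)
{
    assert(frame_size != 0);
    return buffer_bytes / frame_size;
}

/* Nonzero if the buffer fits below the given kernel memory limit. */
static int buffer_fits(uint64_t buffer_bytes, uint64_t kernel_limit_bytes)
{
    return buffer_bytes <= kernel_limit_bytes;
}
```

With these helpers, max_block_nr(131072, 4) gives the block count,
frames_in_buffer(262144 * MIB, 2048) gives the frame count, and
buffer_fits(262144 * MIB, 1 * GIB) reports that the buffer exceeds the i386
limit.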
@@ -385,7 +376,6 @@ the following (from include/linux/if_packet.h):
- Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
- Pad to align to TPACKET_ALIGNMENT=16
*/
-

The following are conditions that are checked in packet_set_ring

@@ -426,7 +416,6 @@ and the following flags apply:
#define TP_STATUS_LOSING 4
#define TP_STATUS_CSUMNOTREADY 8

-
TP_STATUS_COPY : This flag indicates that the frame (and associated
meta information) has been truncated because it's
larger than tp_frame_size. This packet can be
@@ -475,7 +464,6 @@ packets are in the ring:
It doesn't incur in a race condition to first check the status value and
then poll for frames.

-
++ Transmission process
Those defines are also used for transmission:

@@ -506,6 +494,196 @@ The user can also use poll() to check if a buffer is available:
pfd.events = POLLOUT;
retval = poll(&pfd, 1, timeout);
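
Put together, waiting for a free transmission buffer might look like the
sketch below. wait_for_tx_slot() is a hypothetical wrapper around poll(2), not
part of any library; fd is assumed to be a packet socket whose TX ring is
already set up, although poll() itself accepts any descriptor.

```c
#include <assert.h>
#include <poll.h>

/*
 * Hypothetical wrapper: returns 1 if fd became writable, 0 on timeout
 * and -1 on failure (poll() error, or an error condition on fd).
 */
static int wait_for_tx_slot(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int retval;

    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;

    retval = poll(&pfd, 1, timeout_ms);
    if (retval <= 0)
        return retval;      /* 0: timed out, -1: poll() failed */

    assert(retval == 1);    /* only one descriptor was passed in */
    return (pfd.revents & POLLOUT) ? 1 : -1;
}
```

Checking revents instead of only the return value separates "writable" from
error conditions such as POLLERR, which poll() also reports with a positive
return value.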

+-------------------------------------------------------------------------------
++ What TPACKET versions are available and when to use them?
+-------------------------------------------------------------------------------
+
+ int val = tpacket_version;
+ setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+ getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, &(socklen_t){ sizeof(val) });
+
+where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
+
+TPACKET_V1:
+ - Default if not otherwise specified by setsockopt(2)
+ - RX_RING, TX_RING available
+ - VLAN metadata information available for packets
+   (TP_STATUS_VLAN_VALID)
+
+TPACKET_V1 --> TPACKET_V2:
+ - Made 64 bit clean due to unsigned long usage in TPACKET_V1
+   structures, thus this also works on 64 bit kernel with 32 bit
+   userspace and the like
+ - Timestamp resolution in nanoseconds instead of microseconds
+ - RX_RING, TX_RING available
+ - How to switch to TPACKET_V2:
+     1. Replace struct tpacket_hdr by struct tpacket2_hdr
+     2. Query header len and save
+     3. Set protocol version to 2, set up ring as usual
+     4. For getting the sockaddr_ll,
+        use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of
+        (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
+
+TPACKET_V2 --> TPACKET_V3:
+ - Flexible buffer implementation:
+     1. Blocks can be configured with non-static frame-size
+     2. Read/poll is at a block-level (as opposed to packet-level)
+     3. Added poll timeout to avoid indefinite user-space wait
+        on idle links
+     4. Added user-configurable knobs:
+          4.1 block::timeout
+          4.2 tpkt_hdr::sk_rxhash
+ - RX Hash data available in user space
+ - Currently only RX_RING available
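
The steps for switching to TPACKET_V2 listed above could be sketched as
follows. use_tpacket_v2() and sockaddr_ll_of() are names invented for this
example, and error handling is reduced to returning -1. The sketch assumes the
header length is queried with the PACKET_HDRLEN socket option, passing the
version in and reading the length back.

```c
#include <assert.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/*
 * Steps 2 and 3: query the V2 header length, then select version 2.
 * fd is assumed to be an open packet socket with no ring set up yet.
 */
static int use_tpacket_v2(int fd, int *hdrlen)
{
    int val = TPACKET_V2;
    socklen_t len = sizeof(val);

    /* In: version to ask about. Out: header length for that version. */
    if (getsockopt(fd, SOL_PACKET, PACKET_HDRLEN, &val, &len) < 0)
        return -1;
    *hdrlen = val;

    val = TPACKET_V2;
    return setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
}

/* Step 4: locate the sockaddr_ll that follows the aligned frame header. */
static struct sockaddr_ll *sockaddr_ll_of(void *hdr, int hdrlen)
{
    assert(hdrlen > 0);
    /* char * arithmetic is used; void * arithmetic is a GNU extension. */
    return (struct sockaddr_ll *)((char *)hdr + TPACKET_ALIGN(hdrlen));
}
```

Saving the queried length instead of hard-coding sizeof(struct tpacket2_hdr)
keeps the offset calculation tied to what the running kernel reports.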
+
+-------------------------------------------------------------------------------
++ AF_PACKET fanout mode
+-------------------------------------------------------------------------------
+
+In the AF_PACKET fanout mode, packet reception can be load balanced among
+processes. This also works in combination with mmap(2) on packet sockets.
+
+Minimal example code by David S. Miller (try things like "./test eth0 hash",
+"./test eth0 lb", etc.):
+
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+
+#include <unistd.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <arpa/inet.h>	/* htons() */
+#include <net/if.h>
+
+static const char *device_name;
+static int fanout_type;
+static int fanout_id;
+
+#ifndef PACKET_FANOUT
+# define PACKET_FANOUT 18
+# define PACKET_FANOUT_HASH 0
+# define PACKET_FANOUT_LB 1
+#endif
+
+static int setup_socket(void)	/* returns the socket fd, or -1 on error */
+{
+	int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
+	struct sockaddr_ll ll;
+	struct ifreq ifr;
+	int fanout_arg;
+
+	if (fd < 0) {
+		perror("socket");
+		return -1;
+	}
+
+	memset(&ifr, 0, sizeof(ifr));
+	strncpy(ifr.ifr_name, device_name, IFNAMSIZ - 1);
+	err = ioctl(fd, SIOCGIFINDEX, &ifr);
+	if (err < 0) {
+		perror("SIOCGIFINDEX");
+		return -1;
+	}
+
+	memset(&ll, 0, sizeof(ll));
+	ll.sll_family = AF_PACKET;
+	ll.sll_ifindex = ifr.ifr_ifindex;
+	err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
+	if (err < 0) {
+		perror("bind");
+		return -1;
+	}
+
+	fanout_arg = (fanout_id | (fanout_type << 16));	/* id low, type high */
+	err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
+			 &fanout_arg, sizeof(fanout_arg));
+	if (err) {
+		perror("setsockopt");
+		return -1;
+	}
+
+	return fd;
+}
+
+static void fanout_thread(void)
+{
+	int fd = setup_socket();
+	int limit = 10000;
+
+	if (fd < 0)
+		exit(EXIT_FAILURE);
+
+	while (limit-- > 0) {
+		char buf[1600];
+		int err;
+
+		err = read(fd, buf, sizeof(buf));
+		if (err < 0) {
+			perror("read");
+			exit(EXIT_FAILURE);
+		}
+		if ((limit % 10) == 0)
+			fprintf(stdout, "(%d) \n", getpid());
+	}
+
+	fprintf(stdout, "%d: Received 10000 packets\n", getpid());
+
+	close(fd);
+	exit(0);
+}
+
+int main(int argc, char **argp)
+{
+	int fd, err;
+	int i;
+
+	if (argc != 3) {
+		fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
+		return EXIT_FAILURE;
+	}
+
+	if (!strcmp(argp[2], "hash"))
+		fanout_type = PACKET_FANOUT_HASH;
+	else if (!strcmp(argp[2], "lb"))
+		fanout_type = PACKET_FANOUT_LB;
+	else {
+		fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
+		exit(EXIT_FAILURE);
+	}
+
+	device_name = argp[1];
+	fanout_id = getpid() & 0xffff;
+
+	for (i = 0; i < 4; i++) {
+		pid_t pid = fork();
+
+		switch (pid) {
+		case 0:
+			fanout_thread();	/* child: never returns */
+
+		case -1:
+			perror("fork");
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	for (i = 0; i < 4; i++) {
+		int status;
+
+		wait(&status);
+	}
+
+	return 0;
+}
+
-------------------------------------------------------------------------------
+ PACKET_TIMESTAMP
-------------------------------------------------------------------------------
@@ -532,6 +710,13 @@ the networking stack is used (the behavior before this setting was added).
See include/linux/net_tstamp.h and Documentation/networking/timestamping
for more information on hardware timestamps.

+-------------------------------------------------------------------------------
++ Miscellaneous bits
+-------------------------------------------------------------------------------
+
+- Packet sockets work well together with Linux socket filters, thus you also
+  might want to have a look at Documentation/networking/filter.txt
+
--------------------------------------------------------------------------------
+ THANKS
--------------------------------------------------------------------------------