HISTORY:
February 16/2002 -- revision 0.2.1:
COR typo corrected
February 10/2002 -- revision 0.2:
some spell checking ;->
January 12/2002 -- revision 0.1
This is still work in progress so may change.
To keep up to date please watch this space.

Introduction to NAPI
====================

NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
to improve network performance on Linux. For more details please
read that paper.
NAPI provides an "inherent mitigation" which is bound by system capacity
as can be seen from the following data collected by Robert on Gigabit
ethernet (e1000):

Psize     Ipps      Tput    Rxint   Txint     Done   Ndone
----------------------------------------------------------
   60   890000    409362       17   27622        7    6823
  128   758150    464364       21    9301       10    7738
  256   445632    774646       42   15507       21   12906
  512   232666    994445   241292   19147   241192    1062
 1024   119061   1000003   872519   19258   872511       0
 1440    85193   1000003   946576   19505   946569       0

Legend:
"Ipps" stands for input packets per second.
"Tput" == packets out of total 1M that made it out.
"Txint" == transmit completion interrupts seen
"Done" == The number of times that the poll() managed to pull all
packets out of the rx ring. Note from this that the lower the
load the more we could clean up the rxring
"Ndone" == is the converse of "Done". Note again, that the higher
the load the more times we couldn't clean up the rxring.

Observe that:
when the NIC receives 890K packets/sec only 17 rx interrupts are generated.
The system can't handle the processing at 1 interrupt/packet at that load level.
At lower rates on the other hand, rx interrupts go up and therefore the
interrupt/packet ratio goes up (as observable from that table). So there is
a possibility that under low enough input, you get one poll call for each
input packet, each caused by a single interrupt. And if the system
can't handle an interrupt per packet ratio of 1, then it will just have to
chug along ....
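
The interrupt/packet ratio mentioned above can be worked out from the table
with a few lines of C. This is only a sketch: the struct and function names
are made up for illustration, and it assumes that Rxint is a total over a
whole run of 1M offered packets (the legend states the 1M total, the rest is
an assumption).

```c
#include <assert.h>
#include <stdio.h>

/* One row of the table above; fields mirror the column headers. */
struct napi_sample {
    unsigned int  psize;  /* packet size in bytes */
    unsigned long ipps;   /* input packets per second */
    unsigned long rxint;  /* rx interrupts seen */
};

/* From the legend: each run pushed a total of 1M packets at the box. */
#define PACKETS_OFFERED 1000000UL

static const struct napi_sample samples[] = {
    {   60, 890000,     17 },
    {  128, 758150,     21 },
    {  256, 445632,     42 },
    {  512, 232666, 241292 },
    { 1024, 119061, 872519 },
    { 1440,  85193, 946576 },
};

#define NUM_SAMPLES (sizeof(samples) / sizeof(samples[0]))

/* Assumption: Rxint is a total over the run, so this is irqs per packet. */
static double rx_ints_per_packet(const struct napi_sample *s)
{
    return (double)s->rxint / (double)PACKETS_OFFERED;
}

/* Print one line per row: packet size, input rate and the ratio. */
static void print_ratios(void)
{
    size_t i;

    for (i = 0; i < NUM_SAMPLES; i++)
        printf("%5u bytes %7lu pps  %.6f rx irqs/packet\n",
               samples[i].psize, samples[i].ipps,
               rx_ints_per_packet(&samples[i]));
}
```

Calling print_ratios() shows the trend described above: at the highest input
rate the ratio is a tiny fraction of an interrupt per packet, and at the
lowest rate it gets close to one interrupt per packet.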

0) Prerequisites:
==================
A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) DMA ring or enough RAM to store packets in software devices.

B) Ability to turn off interrupts or maybe events that send packets up
the stack.
NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll().
The rest of the events MAY be processed by the regular interrupt handler
to reduce processing latency (justified also because there are not that
many of them).
Note, however, NAPI does not enforce that dev->poll() only processes
receive events.
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handler is moved to dev->poll(). Also MII handling
gets a little trickier.
The example used in this document is to move the receive processing only
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handler to
dev->poll() look at the ported e1000 code.

There are caveats that might force you to go with moving everything to
dev->poll(). Different NICs work differently depending on their status/event
acknowledgement setup.
There are two types of event register ACK mechanisms.
        I) what is known as Clear-on-read (COR).
        When you read the status/event register, it clears everything!
        The natsemi and sunbmac NICs are known to do this.
        In this case your only choice is to move all to dev->poll()

        II) Clear-on-write (COW)
         i) you clear the status by writing a 1 in the bit-location you want.
                These are the majority of the NICs and work the best with NAPI.
                Put only receive events in dev->poll(); leave the rest in
                the old interrupt handler.
         ii) whatever you write in the status register clears everything ;->
                Can't seem to find any supported by Linux which do this. If
                someone knows such a chip email us please.
                Move all to dev->poll()

C) Ability to detect new work correctly.
NAPI works by shutting down event interrupts when there's work and
turning them on when there's none.
New packets might show up in the small window while interrupts were being
re-enabled (refer to appendix 2). A packet might sneak in during the period
we are enabling interrupts. We only get to know about such a packet when the
next new packet arrives and generates an interrupt.
Essentially, there is a small window of opportunity for a race condition
which for clarity we'll refer to as the "rotting packet".

This is a very important topic and appendix 2 is dedicated to more
discussion.
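
To make the difference between the two ACK styles listed under B) concrete,
here is a small user-space C model. It is not driver code: the struct, the
event bit values and the function names are all hypothetical, and a real NIC
defines its own register layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event bits; real NICs define their own layout. */
#define EV_RX   0x1u
#define EV_TX   0x2u
#define EV_LINK 0x4u

struct model_nic {
    uint32_t status;   /* pending events */
};

/* Clear-on-read: reading returns the pending events and wipes them all. */
static uint32_t cor_read_status(struct model_nic *nic)
{
    uint32_t events = nic->status;

    nic->status = 0;
    return events;
}

/* Clear-on-write, variant i): reading has no side effect ... */
static uint32_t cow_read_status(const struct model_nic *nic)
{
    return nic->status;
}

/* ... and writing a 1 clears only the bits that were written. */
static void cow_ack(struct model_nic *nic, uint32_t bits)
{
    nic->status &= ~bits;
}
```

With the COW model the interrupt handler can ack the tx and link bits and
leave the rx bit pending for dev->poll(). With the COR model a single read in
the interrupt handler consumes the rx event too, which is why everything has
to move into dev->poll() for such hardware.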

Locking rules and environmental guarantees
==========================================

- Guarantee: Only one CPU at any time can call dev->poll(); this is because
only one CPU can pick the initial interrupt and hence the initial
netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round robin format.
This implies receive is totally lockless because of the guarantee that only
one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
ring. This happens only in close() and suspend() (when these methods
try to clean the rx ring);
****guarantee: driver authors need not worry about this; synchronization
is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move all to dev->poll()). For
example link/MII and txcomplete continue functioning just the same old way.
This improves the latency of processing these events. It is also assumed that
the receive interrupt is the largest cause of noise. Note this might not
always be true.
[according to Manfred Spraul, the winbond insists on sending one
txmitcomplete interrupt for each packet (although this can be mitigated)].
For these broken drivers, move all to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.

New methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
        Called by an IRQ handler to schedule a poll for device

b) netif_rx_schedule_prep(dev)
        puts the device in a state which allows for it to be added to the
        CPU polling list if it is up and running. You can look at this as
        the first half of netif_rx_schedule(dev) above; the second half
        being c) below.

c) __netif_rx_schedule(dev)
        Add device to the poll list for this CPU; assuming that _prep above
        has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
        Called to reschedule polling for device specifically for some
        deficient hardware. Read Appendix 2 for more details.

e) netif_rx_complete(dev)
        Remove interface from the CPU poll list: it must be in the poll list
        on current cpu. This primitive is called by dev->poll(), when
        it completes its work. The device cannot be out of poll list at this
        call, if it is then clearly it is a BUG(). You'll know ;->

All of the above methods are used below, so keep reading for clarity.
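
As a rough mental model of how a), b), c) and e) fit together, the following
user-space sketch mimics the state transitions. It is not the kernel
implementation: struct sched_dev and the sched_* names are hypothetical, and
the real primitives also deal with per-CPU lists and softirqs, which are left
out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few bits of device state involved. */
struct sched_dev {
    bool running;       /* interface is up */
    bool rx_sched;      /* a poll has been claimed for this device */
    bool on_poll_list;  /* device is queued on the CPU poll list */
};

/* Models b): claim the device; succeeds only once until completion. */
static int sched_prep(struct sched_dev *dev)
{
    if (!dev->running || dev->rx_sched)
        return 0;
    dev->rx_sched = true;
    return 1;
}

/* Models c): queue the device; valid only after a successful prep. */
static void sched_add_to_list(struct sched_dev *dev)
{
    assert(dev->rx_sched);
    dev->on_poll_list = true;
}

/* Models a): prep followed by the list insertion. */
static void sched_schedule(struct sched_dev *dev)
{
    if (sched_prep(dev))
        sched_add_to_list(dev);
}

/* Models e): leave the poll list; being off the list here is a bug. */
static void sched_complete(struct sched_dev *dev)
{
    assert(dev->on_poll_list);
    dev->on_poll_list = false;
    dev->rx_sched = false;
}
```

The point of splitting a) into b) and c) is that the driver can do its own
work (typically masking rx interrupts) between the two halves, and only when
the prep step actually succeeded.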

Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) introduction of dev->poll() method
=====================================
This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets by the current CPU before yielding to the network
subsystem (so other devices can also get an opportunity to send to the stack).

dev->poll() prototype looks as follows:
int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
*Each driver is responsible for decrementing budget by the total number of
packets sent.
        Total number of packets cannot exceed dev->quota.

dev->poll() method is invoked by the top layer; the driver just sends, if it
can, the requested quantity of packets to the stack.
More on dev->poll() below after the interrupt changes are explained.
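
The budget/quota bookkeeping described above can be sketched in plain
user-space C. This is a model under stated assumptions, not driver code:
struct quota_dev, its pending field and quota_poll() are made up for
illustration, and capping the work at the smaller of quota and budget is one
possible reading of the rules above.

```c
#include <assert.h>

/* Hypothetical device: only the fields the accounting needs. */
struct quota_dev {
    int quota;    /* packets this device may still send in this round */
    int pending;  /* packets waiting on the (imaginary) rx ring */
};

/*
 * Model of a poll method's accounting. Returns 0 when the ring was
 * emptied ("done") and 1 when packets remain ("not done").
 */
static int quota_poll(struct quota_dev *dev, int *budget)
{
    /* Never send more than the device quota or the CPU budget allows. */
    int limit = dev->quota < *budget ? dev->quota : *budget;
    int received = 0;

    while (dev->pending > 0 && received < limit) {
        dev->pending--;   /* stands in for handing one skb to the stack */
        received++;
    }

    /* The driver, not the core, charges the packets it sent. */
    dev->quota -= received;
    *budget -= received;

    return dev->pending > 0 ? 1 : 0;
}
```

For example, a device with quota 16 and 40 packets pending sends 16 packets,
lowers both its quota and the caller's budget by 16 and reports "not done",
so the core can give other devices a turn before coming back.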

2) registering dev->poll() method
===================================
dev->poll should be set in the dev->probe() method.
e.g:
dev->open = my_open;
.
.
/* two new additions */
/* first register my poll method */
dev->poll = my_poll;
/* next register my weight/quanta; can be overridden in /proc */
dev->weight = 16;
.
.
dev->stop = my_close;

3) scheduling dev->poll()
=============================
This involves modifying the interrupt handler and the code
path which takes the packets off the NIC and sends them to the
stack.

It's important at this point to introduce the classical D Becker
interrupt processor:

------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        struct net_device *dev = (struct net_device *)dev_id;
        struct my_private *tp = (struct my_private *)dev->priv;
        int work_count = my_work_count;

        status = read_interrupt_status_reg();
        if (status == 0)
                return IRQ_NONE;        /* Shared IRQ: not us */
        if (status == 0xffff)
                return IRQ_HANDLED;     /* Hot unplug */
        if (status & error)
                do_some_error_handling();

        do {
                acknowledge_ints_ASAP();

                if (status & link_interrupt) {
                        spin_lock(&tp->link_lock);
                        do_some_link_stat_stuff();
                        spin_unlock(&tp->link_lock);
                }

                if (status & rx_interrupt) {
                        receive_packets(dev);
                }

                if (status & rx_nobufs) {
                        make_rx_buffs_avail();
                }

                if (status & tx_related) {
                        spin_lock(&tp->lock);
                        tx_ring_free(dev);
                        if (tx_died)
                                restart_tx();
                        spin_unlock(&tp->lock);
                }

                status = read_interrupt_status_reg();

        } while (!(status & error) || more_work_to_be_done);
        return IRQ_HANDLED;
}
----------------------------------------------------------------------

We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        struct net_device *dev = (struct net_device *)dev_id;
        struct my_private *tp = (struct my_private *)dev->priv;

        status = read_interrupt_status_reg();
        if (status == 0)
                return IRQ_NONE;        /* Shared IRQ: not us */
        if (status == 0xffff)
                return IRQ_HANDLED;     /* Hot unplug */
        if (status & error)
                do_some_error_handling();

        do {
/************************ start note *********************************/
                acknowledge_ints_ASAP();  // don't ack rx and rxnobuff here
/************************ end note *********************************/

                if (status & link_interrupt) {
                        spin_lock(&tp->link_lock);
                        do_some_link_stat_stuff();
                        spin_unlock(&tp->link_lock);
                }
/************************ start note *********************************/
                if (status & rx_interrupt || (status & rx_nobufs)) {
                        if (netif_rx_schedule_prep(dev)) {

                                /* disable interrupts caused
                                 * by arriving packets */
                                disable_rx_and_rxnobuff_ints();
                                /* tell system we have work to be done. */
                                __netif_rx_schedule(dev);
                        } else {
                                printk("driver bug! interrupt while in poll\n");
                                /* FIX by disabling interrupts */
                                disable_rx_and_rxnobuff_ints();
                        }
                }
/************************ end note *********************************/

                if (status & tx_related) {
                        spin_lock(&tp->lock);
                        tx_ring_free(dev);
                        if (tx_died)
                                restart_tx();
                        spin_unlock(&tp->lock);
                }

                status = read_interrupt_status_reg();

/************************ start note *********************************/
        } while (!(status & error) || more_work_to_be_done(status));
/************************ end note *********************************/
        return IRQ_HANDLED;
}
---------------------------------------------------------------------

We note several things from above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are the
interrupt sources we wish to avoid. The two common ones are a) a packet
arriving (rxint) b) a packet arriving and finding no DMA buffers available
(rxnobuff).

This means also acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
proper work is done within NAPI; at the poll() and refill_rx_ring()
discussed further below.
netif_rx_schedule_prep() returns 1 if the device is in running state and
can be added to the core poll list. If we get a zero value
we can _almost_ assume we are already added to the list (instead of not
running; the logic is based on the fact that you shouldn't get an interrupt
if not running). We rectify this by disabling rx and rxnobuf interrupts.

II) that receive_packets(dev) and make_rx_buffs_avail() may have disappeared.
These functionalities are still around actually......

In fact, receive_packets(dev) is very close to my_poll() and
make_rx_buffs_avail() is invoked from my_poll()

4) converting receive_packets() to dev->poll()
===============================================

We need to convert the classical D Becker receive_packets(dev) to my_poll()

First the typical receive_packets() below:
-------------------------------------------------------------------

/* this is called by interrupt handler */
static void receive_packets (struct net_device *dev)
{
        struct my_private *tp = (struct my_private *)dev->priv;
        rx_ring = tp->rx_ring;
        cur_rx = tp->cur_rx;
        int entry = cur_rx % RX_RING_SIZE;
        int received = 0;
        int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

        while (rx_ring_not_empty) {
                u32 rx_status;
                unsigned int rx_size;
                unsigned int pkt_size;
                struct sk_buff *skb;
                /* read size+status of next frame from DMA ring buffer */
                /* the number 16 and 4 are just examples */
                rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
                rx_size = rx_status >> 16;
                pkt_size = rx_size - 4;

                /* process errors */
                if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
                    (!(rx_status & RxStatusOK))) {
                        netdrv_rx_err (rx_status, dev, tp, ioaddr);
                        return;
                }

                if (--rx_work_limit < 0)
                        break;

                /* grab a skb */
                skb = dev_alloc_skb (pkt_size + 2);
                if (skb) {
                        .
                        .
                        netif_rx (skb);
                        .
                        .
                } else {  /* OOM */
                        /* seems very driver specific ... some just pass
                           whatever is on the ring already. */
                }

                /* move to the next skb on the ring */
                entry = (++cur_rx) % RX_RING_SIZE;
                received++;
        }

        /* store current ring pointer state */
        tp->cur_rx = cur_rx;

        /* Refill the Rx ring buffers if they are needed */
        refill_rx_ring();
        .
        .
}
-------------------------------------------------------------------
We change it to a new one below; note the additional parameter in
the call.

-------------------------------------------------------------------

/* this is called by the network core */
static int my_poll (struct net_device *dev, int *budget)
{
        struct my_private *tp = (struct my_private *)dev->priv;
        rx_ring = tp->rx_ring;
        cur_rx = tp->cur_rx;
        int entry = cur_rx % RX_RING_SIZE;
        int received = 0;
        /* maximum packets to send to the stack */
/************************ note note *********************************/
        int rx_work_limit = dev->quota;
/************************ end note note *********************************/

        do {  // outer beginning loop starts here

                clear_rx_status_register_bit();

                while (rx_ring_not_empty) {
                        u32 rx_status;
                        unsigned int rx_size;
                        unsigned int pkt_size;
                        struct sk_buff *skb;
                        /* read size+status of next frame from DMA ring buffer */
                        /* the number 16 and 4 are just examples */
                        rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
                        rx_size = rx_status >> 16;
                        pkt_size = rx_size - 4;

                        /* process errors */
                        if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
                            (!(rx_status & RxStatusOK))) {
                                netdrv_rx_err (rx_status, dev, tp, ioaddr);
                                return 1;
                        }

/************************ note note *********************************/
                        if (--rx_work_limit < 0) { /* we got packets, but no quota */
                                /* store current ring pointer state */
                                tp->cur_rx = cur_rx;

                                /* Refill the Rx ring buffers if they are needed */
                                refill_rx_ring(dev);
                                goto not_done;
                        }
/********************** end note **********************************/

                        /* grab a skb */
                        skb = dev_alloc_skb (pkt_size + 2);
                        if (skb) {
                                .
                                .
/************************ note note *********************************/
                                netif_receive_skb (skb);
/********************** end note **********************************/
                                .
                                .
                        } else {  /* OOM */
                                /* seems very driver specific ... common is just pass
                                   whatever is on the ring already. */
                        }

                        /* move to the next skb on the ring */
                        entry = (++cur_rx) % RX_RING_SIZE;
                        received++;
                }

                /* store current ring pointer state */
                tp->cur_rx = cur_rx;

                /* Refill the Rx ring buffers if they are needed */
                refill_rx_ring(dev);

                /* no packets on ring; but new ones can arrive since we last
                   checked */
                status = read_interrupt_status_reg();
                if (!rx_status_is_set) {
                        /* If something arrives in this narrow window,
                           an interrupt will be generated */
                        goto done;
                }
                /* done! at least that's what it looks like ;->
                   if new packets came in after our last check on status bits
                   they'll be caught by the while check and we go back and clear them
                   since we haven't exceeded our quota */
        } while (rx_status_is_set);

done:

/************************ note note *********************************/
        dev->quota -= received;
        *budget -= received;

        /* If RX ring is not full we are out of memory. */
        if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                goto oom;

        /* we are happy/done, no more packets on ring; put us back
           to where we can start processing interrupts again */
        netif_rx_complete(dev);
        enable_rx_and_rxnobuf_ints();

        /* The last op happens after poll completion. Which means the following:
         * 1. it can race with disabling irqs in irq handler (which are done to
         * schedule polls)
         * 2. it can race with dis/enabling irqs in other poll threads
         * 3. if an irq raised after the beginning of the outer beginning
         * loop (marked in the code above), it will be immediately
         * triggered here.
         *
         * Summarizing: the logic may result in some redundant irqs both
         * due to races in masking and due to too late acking of already
         * processed irqs. The good news: no events are ever lost.
         */

        return 0;   /* done */

not_done:
        if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
            tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                refill_rx_ring(dev);

        if (!received) {
                printk("received==0\n");
                received = 1;
        }
        dev->quota -= received;
        *budget -= received;
        return 1;  /* not_done */

oom:
        /* Start timer, stop polling, but do not enable rx interrupts. */
        start_poll_timer(dev);
        return 0;  /* we'll take it from here so tell core "done" */

/************************ End note note *********************************/
}
-------------------------------------------------------------------

From above we note that:
0) rx_work_limit = dev->quota
1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when
it does the work.
2) We have a done and not_done state.
3) instead of netif_rx() we call netif_receive_skb() to pass the skb.
4) we have a new way of handling oom condition
5) A new outer do/while loop has been added. This serves the purpose of
ensuring that if a new packet has come in, after we are all set and done,
and we have not exceeded our quota that we continue sending packets up.

-----------------------------------------------------------
Poll timer code will need to do the following:

a)

        if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
            tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                refill_rx_ring(dev);

        /* If RX ring is not full we are still out of memory.
           Restart the timer again. Else we re-add ourselves
           to the master poll list.
         */

        if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
                restart_timer();

        else netif_rx_schedule(dev);  /* we are back on the poll list */

5) dev->close() and dev->suspend() issues
==========================================
The driver writer needn't worry about this; the top net layer takes
care of it.

6) Adding new Stats to /proc
=============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this later.

APPENDIX 1: discussion on using ethernet HW FC
==============================================
Most chips with FC (flow control) only send a pause packet when they run out
of Rx buffers.
Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then theoretically
there will only be one rx interrupt for all packets during a given packetstorm.
Under low load, we might have a single interrupt per packet.
FC should be programmed to apply in the case when the system can't pull out
packets fast enough, i.e. send a pause only when you run out of rx buffers.
Note FC in itself is a good solution but we have found it to not be
much of a commodity feature (both in NICs and switches) and hence falls
under the same category as using NIC based mitigation. Also, experiments
indicate that it's much harder to resolve the resource allocation
issue (aka lazy receiving that NAPI offers) and hence quantifying its
usefulness proved harder. In any case, FC works even better with NAPI but is
not necessary.

APPENDIX 2: the "rotting packet" race-window avoidance scheme
=============================================================

There are two types of associations seen here

1) status/int which honors level triggered IRQ

If a status bit for receive or rxnobuff is set and the corresponding
interrupt-enable bit is not on, then no interrupts will be generated. However,
as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is
generated. [assuming the status bit was not turned off].
Generally the concept of level triggered IRQs in association with a status and
interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip:
"pending work" is indicated by the status bit (CSR5 in tulip).
The corresponding interrupt bit (CSR7 in tulip) might be turned off (but
the CSR5 will continue to be turned on with new packet arrivals even if
we clear it the first time).
Very important is the fact that if we turn the interrupt bit on when
status is set, an immediate irq is triggered.

If we cleared the rx ring and proclaimed there was "no more work
to be done" and then went on to do a few other things; then when we enable
interrupts, there is a possibility that a new packet might sneak in during
this phase. It helps to look at the pseudo code for the tulip poll
routine:

--------------------------
        do {
                ACK;
                while (ring_is_not_empty()) {
                        work-work-work
                        if quota is exceeded: exit, no touching irq status/mask
                }
                /* No packets, but new can arrive while we are doing this */
                CSR5 := read
                if (CSR5 is not set) {
                        /* If something arrives in this narrow window here,
                         * where the comments are ;-> irq will be generated */
                        unmask irqs;
                        exit poll;
                }
        } while (rx_status_is_set);
------------------------

CSR5 bit of interest is only the rx status.
If you look at the last if statement:
you just finished grabbing all the packets from the rx ring .. you check if
status bit says there are more packets just in ... it says none; you then
enable rx interrupts again; if a new packet just came in during this check,
we are counting that CSR5 will be set in that small window of opportunity
and that by re-enabling interrupts, we would actually trigger an interrupt
to register the new packet for processing.

[The above description may be very verbose, if you have better wording
that will make this more understandable, please suggest it.]
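
The level-triggered behaviour that the scheme above relies on can be modelled
in a few lines of user-space C. This is a sketch only: struct level_nic, the
RX_EVENT bit and the level_* helpers are hypothetical, with "status" playing
the role of CSR5 and "enable" the role of CSR7.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RX_EVENT 0x1u   /* hypothetical rx bit, same position in both regs */

struct level_nic {
    uint32_t status;  /* plays the role of CSR5: pending events */
    uint32_t enable;  /* plays the role of CSR7: interrupt-enable mask */
};

/* Level-triggered: the line is asserted while an enabled bit is pending. */
static bool level_irq_asserted(const struct level_nic *nic)
{
    return (nic->status & nic->enable) != 0;
}

/* A packet lands on the ring: the status bit is set, masked or not. */
static void level_packet_arrives(struct level_nic *nic)
{
    nic->status |= RX_EVENT;
}

/* The "ACK" step at the top of the poll loop. */
static void level_ack_rx(struct level_nic *nic)
{
    nic->status &= ~RX_EVENT;
}

static void level_mask_rx(struct level_nic *nic)
{
    nic->enable &= ~RX_EVENT;
}

static void level_unmask_rx(struct level_nic *nic)
{
    nic->enable |= RX_EVENT;
}
```

In this model a packet that arrives while rx is masked raises no interrupt,
but the moment level_unmask_rx() runs the line is asserted because the status
bit is still set. That is why a packet sneaking into the narrow window is
not lost on this kind of hardware.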

2) non-capable hardware

These do not generally respect level triggered IRQs. Normally,
irqs may be lost while being masked and the only way to leave poll is to do
a double check for new input after netif_rx_complete() is invoked
and re-enable polling (after seeing this new input).

Sample code:

---------
        .
        .
restart_poll:
        while (ring_is_not_empty()) {
                work-work-work
                if quota is exceeded: exit, not touching irq status/mask
        }
        .
        .
        .
        enable_rx_interrupts()
        netif_rx_complete(dev);
        if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
                disable_rx_and_rxnobufs()
                goto restart_poll
        }
---------

Basically netif_rx_complete() removes us from the poll list, but because a
new packet which will never be caught due to the possibility of a race
might come in, we attempt to re-add ourselves to the poll list.
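
The double check in the sample code can be tried out with a small user-space
model. Everything here is hypothetical (struct edge_nic and the edge_*
functions); edge_complete() and edge_reschedule() only stand in for
netif_rx_complete() and netif_rx_reschedule(), and the model ignores the
undo argument of the latter.

```c
#include <assert.h>
#include <stdbool.h>

struct edge_nic {
    int  ring_count;  /* packets waiting on the rx ring */
    bool rx_irq_on;   /* rx interrupt currently enabled */
    bool rx_sched;    /* device currently owns a poll slot */
};

/* Stand-in for netif_rx_complete(): give up the poll slot. */
static void edge_complete(struct edge_nic *nic)
{
    assert(nic->rx_sched);
    nic->rx_sched = false;
}

/* Stand-in for netif_rx_reschedule(): reclaim the slot if it is free. */
static bool edge_reschedule(struct edge_nic *nic)
{
    if (nic->rx_sched)
        return false;
    nic->rx_sched = true;
    return true;
}

/*
 * Leave poll on hardware that can lose masked interrupts. Returns true
 * when the caller must go back to restart_poll and drain the ring again.
 */
static bool edge_try_leave_poll(struct edge_nic *nic)
{
    nic->rx_irq_on = true;
    edge_complete(nic);

    /* Double check: did a packet sneak in while irqs were masked? */
    if (nic->ring_count > 0 && edge_reschedule(nic)) {
        nic->rx_irq_on = false;
        return true;
    }
    return false;
}
```

When the ring is empty the function leaves the device with interrupts on and
off the poll list. When a packet slipped in, the device ends up back on the
poll list with rx interrupts masked, exactly as if a fresh interrupt had
scheduled it.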

APPENDIX 3: Scheduling issues.
==============================
As seen NAPI moves processing to softirq level. Linux uses ksoftirqd as the
general solution to schedule softirqs to run before the next interrupt,
putting them under scheduler control. This also prevents consecutive softirqs
from monopolizing the CPU. It also has the effect that the priority of
ksoftirqd needs to be considered when running very CPU-intensive applications
and networking to get the proper softirq/user balance. Increasing ksoftirqd
priority to 0 (eventually more) is reported to cure problems with low network
performance at high CPU load.

Most used processes in a GIGE router:
USER       PID %CPU %MEM  SIZE   RSS TTY STAT START   TIME COMMAND
root         3  0.2  0.0     0     0  ?  RWN Aug 15 602:00 (ksoftirqd_CPU0)
root       232  0.0  7.9 41400 40884  ?  S   Aug 15  74:12 gated

--------------------------------------------------------------------

relevant sites:
==================
ftp://robur.slu.se/pub/Linux/net-development/NAPI/

--------------------------------------------------------------------
TODO: Write net-skeleton.c driver.
-------------------------------------------------------------

Authors:
========
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Jamal Hadi Salim <hadi@cyberus.ca>
Robert Olsson <Robert.Olsson@data.slu.se>

Acknowledgements:
================
People who made this document better:

Lennert Buytenhek <buytenh@gnu.org>
Andrew Morton <akpm@zip.com.au>
Manfred Spraul <manfred@colorfullife.com>
Donald Becker <becker@scyld.com>
Jeff Garzik <jgarzik@pobox.com>