Message-ID: <COL116-W30238D0821823E764BBB53A3570@phx.gbl>
Date:	Mon, 27 Jun 2011 18:54:11 -0400
From:	John Lumby <johnlumby@...mail.com>
To:	<netdev@...r.kernel.org>
Subject: r8169: always copying the rx buffer to new skb


Summary of some results since previous posts in April:

Previously I suggested re-introducing the rx_copybreak parameter to provide the option of un-hooking the receive buffer rather than copying it, in order to save the overhead of the memcpy, which shows as the highest tick count in oprofile. All buffer memcpy'ing is done on CPU0 on my system.
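
To make the option concrete, here is a rough sketch of the rx_copybreak decision in the style of other drivers that have it; rx_ring_skb and alloc_new_rx_buffer are illustrative names (as is the fragment's surrounding context), not the actual patch:

    /* Sketch only, within the rx loop.  Small frames are copied into
     * a fresh exact-size skb so the DMA buffer stays on the ring;
     * larger frames are un-hooked (passed up whole) and the ring slot
     * is refilled with a newly allocated buffer. */
    static int rx_copybreak = 256;          /* tunable threshold, bytes */

    if (pkt_size < rx_copybreak) {
            struct sk_buff *copy = netdev_alloc_skb_ip_align(dev, pkt_size);

            if (copy) {
                    skb_copy_to_linear_data(copy, rx_buf, pkt_size);
                    skb_put(copy, pkt_size);
                    skb = copy;             /* original buffer stays mapped */
            }
    } else {
            skb = rx_ring_skb[entry];       /* un-hook the DMA buffer */
            rx_ring_skb[entry] = alloc_new_rx_buffer(tp, entry);
    }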

I then found that, without the memcpy, the driver and net stack incur other overhead elsewhere, particularly in too-frequent polling/interrupting.

Eric D pointed out that:
            Doing the copy of data and building an exact size skb has benefit of
            providing 'right' skb->truesize (might reduce RCVBUF contention and
            avoid backlog drops) and already cached data (hot in cpu caches).
            Next 'copy' is almost free (L1 cache access)
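
In code terms (an illustration of Eric's point, not taken from the driver; rx_ring_skb is again an illustrative name), the difference shows up in skb->truesize, which is what gets charged against the socket's SO_RCVBUF:

    /* Copy path: truesize reflects the exact-size allocation. */
    copy = netdev_alloc_skb_ip_align(dev, pkt_size);  /* truesize ~ pkt_size */
    skb_copy_to_linear_data(copy, rx_buf, pkt_size);
    skb_put(copy, pkt_size);

    /* Un-hook path: truesize reflects the full rx ring buffer
     * (MTU-sized or larger), so fewer packets fit in the same RCVBUF
     * budget and backlog drops become more likely. */
    skb = rx_ring_skb[entry];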

There was also some discussion offline about using a larger MTU size.

Since then, I have explored some ideas for dealing with the too-frequent polling/interrupting and with the cache aspect, with some success on the first and none on the second. In summary of results:
   .  With an MTU of 1500 and a "normal" workload, I see an improvement of between 4% and 6% in throughput, depending on kernel release and kernel .config. Specifically, with the heaviest workload and the most tuned kernel .config:
       no changes  -  ~  1440 Megabits/sec bi-directional
     with changes  -  ~  1530 Megabits/sec bi-directional
   (same .config for each of course)
      All 4 of my Atom 330 logical CPUs (2 physical cores x 2 SMT threads each) were at 100% both without and with the changes for this workload, but with very different profiles.
      These throughput numbers are higher than I reported before, and the % improvement lower, because of the tuning of the base system and workload.

   .  With an MTU of 6144, I see a more dramatic effect: the same workload runs at 1725 Megabits/sec on both kernels (which may be a practical hardware limit on one of the adapters, since it hits exactly this rate almost every time no matter what else I change), but overall CPU utilization drops from ~80% without the changes to ~60% with them. I feel this is significant, though of course its use is limited to networks that can support this segment size everywhere.

Notes on the changes:

 Too-frequent polling/interrupting:
 These two are tightly interrelated through NAPI.
     Too-frequent polling:
         The NAPI weight is a double-duty parameter, controlling both the dynamic choice between continuing a NAPI polling session versus leaving it and resuming interrupts, and also the maximum number of receive buffers to be passed up per poll. It is also not configurable (hard-coded to 64). I split it into two numbers, one for each purpose, made them configurable, and tried tuning them. A good value for the poll/interrupt choice was 16, while the max-size number was best left at 64. This helps a bit, but polling is still too frequent.
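
         Roughly, in the poll routine the split looks like this (napi_rx_max, napi_exit_thresh and the two rtl_* helpers are names I use here for illustration, not the actual patch):

    /* Sketch of splitting the NAPI weight into two tunables:
     *   napi_rx_max      - max rx buffers handled per poll (was weight, 64)
     *   napi_exit_thresh - below this much work per poll, leave polling
     *                      mode and re-enable interrupts (16 worked well)
     */
    static int napi_rx_max = 64;
    static int napi_exit_thresh = 16;

    static int rtl_poll(struct napi_struct *napi, int budget)
    {
            struct rtl8169_private *tp =
                    container_of(napi, struct rtl8169_private, napi);
            int work = rtl_rx_up_to(tp, min(budget, napi_rx_max));

            if (work < napi_exit_thresh) {
                    napi_complete(napi);            /* leave polling mode */
                    rtl_enable_interrupts(tp);
                    return work;
            }
            return budget;      /* claim full budget: stay in polling mode */
    }
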
         I then made an interface up into the softirqd to let the driver tell the softirqd:
              "keep my napi session alive but sched_yield to other runnable processes before running another poll"
         I added a check to __do_softirq: if the *only* pending softirq is NET_RX_SOFTIRQ and the rx_action routine requested this, then it exits and tells the daemon to yield.
         I borrowed a bit in local_softirq_pending for this. This helped a lot for certain workloads: I saw a considerable drop in system CPU% on CPU0 and higher user CPU% there.
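
         The check, roughly (NET_RX_YIELD_BIT is an illustrative name for the borrowed bit, not the actual patch):

    /* Sketch of the __do_softirq exit: if net_rx_action is the only
     * pending work and it asked to yield, stop iterating and let
     * ksoftirqd sched_yield() before running the next poll. */
    pending = local_softirq_pending();
    if (pending == ((1 << NET_RX_SOFTIRQ) | (1 << NET_RX_YIELD_BIT))) {
            set_softirq_pending(1 << NET_RX_SOFTIRQ);   /* drop yield bit */
            wakeup_softirqd();          /* daemon yields, then re-polls */
            return;
    }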

     Too-frequent interrupting:
         I made use of the r8169's Interrupt Mitigation feature, setting it to the maximum multiplied by a factor between 0 and 1 based inversely on tx queue size (large queue size, short delay, and vice versa). This also helped a lot. The current driver sets these registers, but only once per "up" session, during rtl8169_open of the NIC; Hayes explained that the registers must be set on each enabling of interrupts. This is the one case where (I think) I corrected a bug present in the current driver: harmless, but not doing what was intended.
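
         The scaling, in sketch form (MAX_MITIGATE and the helper are illustrative; IntrMitigate, cur_tx, dirty_tx and NUM_TX_DESC are the driver's own names):

    /* Sketch: rewrite IntrMitigate on every interrupt re-enable,
     * scaled inversely with tx queue occupancy: a full queue gets a
     * delay near zero, an empty queue gets the maximum delay. */
    static void rtl_set_intr_mitigation(struct rtl8169_private *tp,
                                        void __iomem *ioaddr)
    {
            unsigned int qlen = tp->cur_tx - tp->dirty_tx;  /* in-flight tx */
            unsigned int inv  = NUM_TX_DESC -
                                min(qlen, (unsigned int)NUM_TX_DESC);

            RTL_W16(IntrMitigate, (u16)(MAX_MITIGATE * inv / NUM_TX_DESC));
    }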

       The effect of these two changes was to reduce the rate of hardware interrupts to less than 1/20 of what it was, and also to hold the polling rate down (around 4-5 packets per poll on average on a typical run, sometimes much higher).


  memory and caching:
  Here I failed to achieve anything. Based on Eric's point about the memcpy giving a "free" next copy, I thought memory prefetching might provide something equivalent: specifically, prefetch the skb and its data buffer immediately after un-dma'ing.
  For example, with my changes and no memcpy, I see eth_type_trans() high in the oprofile tick score on CPU0. This small function does very little work, but it is the first (I think) to access a field in the skb->data buffer - the ethernet header. Prefetching ought to do better than memcpy'ing, since only one copy of the data enters L1, not two. But my attempts at this achieved nothing, or were negative.
  Note - the current driver does issue a prefetch of the original buffer prior to the memcpy, but on my system (Atom CPUs) gdb on the object file r8169.o shows that no prefetch instructions are generated, only an lea of the address to be prefetched. I tried changing the prefetch call to an asm-generated prefetcht0/prefetchnta instruction, with disappointing results. I noticed some discussion of memory prefetch on this list earlier, and maybe it is simply not useful.
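
  The explicit variant I tried looked roughly like this (rx_prefetch_nta is my own wrapper name; prefetchnta was chosen to avoid displacing other cache lines):

    /* Sketch of the explicit-prefetch attempt: the stock prefetch()
     * compiled to a bare lea on this Atom build, so force the
     * instruction via inline asm. */
    static inline void rx_prefetch_nta(const void *p)
    {
            asm volatile("prefetchnta (%0)" : : "r" (p));
    }

    /* ... immediately after un-dma'ing the rx buffer: */
    rx_prefetch_nta(skb->data);         /* ethernet header, for eth_type_trans */
    rx_prefetch_nta(skb->data + 64);    /* next cache line */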

  I tried to explore Eric's other point about skb->truesize but ran out of time researching it. I guess my current results are negatively impacted by the memory and skb issues that Eric mentions, but I could not find an answer.


  There was a question of how this changed driver handles memory pressure:
  Along with the rx_copybreak change, I made the number of rx and tx ring buffers configurable and dynamically replenishable. The changed driver can tolerate occasional or even bursty alloc failures without exposing any effects outside itself, whereas the current driver drops packets. However, under extreme consecutive failures the changed driver will eventually run too low and stop completely, whereas the current driver will (I assume) stay up. I was unable to cause either of these in my tests. Measurements with concurrent memory hogs confirmed this, but did show a heavy drop in throughput for the changed driver.
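
  The refill logic is roughly this (the rx_filled / rx_fill_target / rx_next_empty fields and the helper are illustrative names, not the actual patch):

    /* Sketch: opportunistic rx ring refill.  A failed allocation just
     * leaves the slot empty and is retried on the next poll; packets
     * are lost only once the whole ring drains, whereas the current
     * driver drops a frame as soon as its copy allocation fails. */
    static void rtl_rx_refill(struct rtl8169_private *tp)
    {
            while (tp->rx_filled < tp->rx_fill_target) {
                    if (!rtl_alloc_rx_buffer(tp, tp->rx_next_empty))
                            break;                  /* retry on next poll */
                    tp->rx_next_empty =
                            (tp->rx_next_empty + 1) % tp->rx_ring_size;
                    tp->rx_filled++;
            }
    }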


  I've tried these changes on all kernel release levels from 2.6.36 to 3.0-rc3 and see roughly comparable deltas on all of them, but with slightly different tuning required to hit the optimum, and some variability on everything after 2.6.37. 2.6.37 seemed to be slightly the "best"; I'm not sure why, although I see some relevant changes to the scheduler between 2.6.37 and 2.6.38. There is also a strange effect with the old RTC driver in 3.0: I had to remove it from the kernel to get good results, whereas it was a module in the 2.6 levels (which I did not load for the tests). I don't need it on my system except for one ancient utility. I also found a major impact from iptables and cleared all tables for the tests; that is the one item that would normally be needed in a production setup that I turned off. The overhead of iptables is presumably highly dependent on how many rules are in the filter chains (I have rather a lot in INPUT).


I don't plan to do any more on this, but can provide my patch (currently one monolithic one based on DaveM 3.0.0-rc1netnext-110615) and detailed results if anyone wants.

Cheers,   John Lumby
