Date:	Sun, 15 Jun 2008 19:59:18 -0700
From:	Stephen Hemminger <shemminger@...tta.com>
To:	Ben Hutchings <bhutchings@...arflare.com>
Cc:	Denys Fedoryshchenko <denys@...p.net.lb>, netdev@...r.kernel.org
Subject: Re: NAPI, rx_no_buffer_count, e1000, r8169 and other actors

On Mon, 16 Jun 2008 00:46:22 +0100
Ben Hutchings <bhutchings@...arflare.com> wrote:

> Denys Fedoryshchenko wrote:
> > Hi
> > 
> > Since I am using PC routers for my network and I have reached numbers that
> > are significant (at least for me), I have started noticing minor problems,
> > hence all this talk about networking performance in my case.
> > 
> > For example.
> > Sun server, AMD based (two CPUs - AMD Opteron(tm) Processor 248).
> > e1000 connected over PCI-X ([    4.919249] e1000: 0000:01:01.0: e1000_probe:
> > (PCI-X:100MHz:64-bit) 00:14:4f:20:89:f4)
> > 
> > All traffic is processed over eth0 with 5 VLANs; the 1-second average is around 110-200 Mbps
> 
> Currently TX checksum offload does not work for VLAN devices, which may
> be a serious performance hit if there is a lot of traffic routed between
> VLANs.  This should change in 2.6.27 for some drivers, which I think will
> include e1000.
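
As a quick check here (a sketch only; eth0.100 below is just a placeholder for
one of your VLAN devices), you can compare the offload state of the physical
and VLAN interfaces:

    # show checksum/TSO offload state on the physical NIC and one VLAN device
    ethtool -k eth0
    ethtool -k eth0.100

If tx-checksumming shows as off on the VLAN device while it is on for eth0,
that is the limitation Ben describes above.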
> 
> > of traffic. The host also runs conntrack (max 1,000,000 entries; when the
> > packet loss happens there are around 256k entries). Around 1300 routes
> > (FIB_TRIE) are loaded. What worries me: OK, I buy some headroom by increasing
> > RX descriptors from 256 to 4096, but how much do I buy? If it "cracks" at
> > 100 Mbps RX, does extrapolating from the descriptor increase mean I cannot
> > process more than about 400 Mbps RX?

You are CPU limited because of the overhead of firewalling. When this happens
packets get backlogged.
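
A rough way to see where the packets are actually being lost (assuming eth0;
untested on your box) is to compare the NIC counters with the softirq backlog
counters:

    # NIC-level drops (e1000 exposes rx_no_buffer_count and rx_missed_errors)
    ethtool -S eth0 | grep -iE 'no_buffer|missed|drop'

    # per-CPU softnet stats; the second hex column counts packets dropped
    # because the input queue was full
    cat /proc/net/softnet_stat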

> Increasing the RX descriptor ring size should give the driver and stack
> more time to catch up after handling some packets that take unusually
> long.  It may also allow you to increase interrupt moderation, which
> will reduce the per-packet cost.

No. If the receive side is CPU limited, you just end up eating more memory.
A bigger queue may actually make performance worse (fewer cache hits).
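
For reference, those ring sizes are the ones visible through ethtool (eth0 is
a placeholder):

    # show current and maximum RX/TX descriptor ring sizes
    ethtool -g eth0

    # grow the RX ring - only useful if the CPU can actually drain it in time
    ethtool -G eth0 rx 4096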

> > The CPU is not so busy after all... maybe there is a way to change some
> > parameter to force NAPI to poll the interface more often?
> 
> NAPI polling is not time-based, except indirectly through interrupt
> moderation.
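
The closest thing to a generic knob on the polling side is the per-round
softirq budget, not a timer; roughly like this (the new value is only
illustrative, and I would not expect it to help a CPU-bound box):

    # packets a NAPI softirq round may process before yielding (default 300)
    sysctl net.core.netdev_budget
    sysctl -w net.core.netdev_budget=600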

How are you measuring CPU? You need to do something like measuring the cycles
actually left available for applications. Don't believe top or other measures
that may not reflect I/O overhead and bus usage.
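
One crude way to do that (just a sketch; the iteration count is arbitrary):
time a fixed chunk of low-priority busy work while the router is under load
and compare it against the same run on an idle box. The difference is the CPU
that the stack and interrupts are really taking.

    # pin a nice-19 busy loop to CPU 0 and time it
    time nice -n 19 taskset -c 0 sh -c 'i=0; while [ $i -lt 5000000 ]; do i=$((i+1)); done'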

> > I tried nice, changing the real-time priority to FIFO, and switching the
> > kernel to preemptible... no luck, except for increasing the descriptors.
> > 
> > Router-Dora ~ # mpstat -P ALL 1
> > Linux 2.6.26-rc6-git2-build-0029 (Router-Dora)  06/15/08
> > 
> > 22:51:02     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> > 22:51:03     all    1.00    0.00    0.00    0.00    2.50   29.00    0.00   67.50  12927.00
> > 22:51:03       0    2.00    0.00    0.00    0.00    4.00   59.00    0.00   35.00  11935.00
> > 22:51:03       1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    993.00
> > 22:51:03       2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
>  
> You might do better with a NIC that supports MSI-X.  This allows the use of
> two RX queues with their own IRQs, each handled by a different processor.
> As it is, one CPU is completely idle.  However, I don't know how well the
> other work of routing scales to multiple processors.


Routing and firewalling should scale well. The bottleneck is probably going
to be some hot lock such as the transmit lock.
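
Even with a single RX queue it is worth checking where the NIC interrupt lands
and pinning it explicitly (the IRQ number and CPU mask below are just examples
for this sketch):

    # see which CPU is servicing the eth0 interrupt
    grep eth0 /proc/interrupts

    # pin that IRQ to CPU1 (bitmask 0x2), leaving CPU0 free for other work
    echo 2 > /proc/irq/27/smp_affinity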

> [...]
> > I have another host running: a Core 2 Duo, e1000e + 3 x e100, also with
> > conntrack, the same kernel configuration and a similar amount of traffic but
> > higher load (ifb + plenty of shapers running) - almost no errors with the
> > default settings.
> > Linux 2.6.26-rc6-git2-build-0029 (Kup)  06/16/08
> > 
> > 07:00:27     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> > 07:00:28     all    0.00    0.00    0.50    0.00    4.00   31.50    0.00   64.00  32835.00
> > 07:00:29     all    0.00    0.00    0.50    0.00    2.50   29.00    0.00   68.00  33164.36
> > 
> > The third host has an r8169 (plain PCI! This is important; it seems I am
> > running out of PCI capacity),
> 
> Gigabit Ethernet on plain old PCI is not ideal.  If each card has a
> separate route to the south bridge then you might be able to get a fair
> fraction of a gigabit between them though.
> 
> > 400 Mbit/s of combined RX+TX load; there is also an e1000e interface with
> > around 200 Mbps of load. What worries me is the interrupt rate, which seems
> > to be generated by the Realtek card... is there any way to bring it down?
> [...]
> 
> ethtool -C lets you change interrupt moderation (rough example below).  I
> don't know anything about this driver or the NIC's capabilities, but this
> chip does seem to be in the cheapest GbE cards, so I wouldn't expect
> outstanding performance.
> 
> Ben.
> 
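
On the interrupt moderation point above, the invocation would look roughly
like this (assuming the r8169 port is eth1, and assuming the driver exposes
these coalescing parameters at all, which it may not):

    # show the current coalescing settings
    ethtool -c eth1

    # delay RX interrupts by up to ~100 microseconds to batch packets
    ethtool -C eth1 rx-usecs 100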

The bigger issue is available memory bandwidth. Different processors
and buses have different overheads. PCI is much worse than PCI Express,
and CPUs with integrated memory controllers do much better than CPUs
with a separate memory controller (like the Core 2).

