Date:   Sat, 2 May 2020 09:24:19 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Julian Wiedmann <jwi@...ux.ibm.com>,
        Eric Dumazet <edumazet@...gle.com>
Cc:     "David S . Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>, Luigi Rizzo <lrizzo@...gle.com>,
        Eric Dumazet <eric.dumazet@...il.com>
Subject: Re: [PATCH net-next 1/3] net: napi: add hard irqs deferral feature



On 5/2/20 9:10 AM, Julian Wiedmann wrote:
> On 02.05.20 17:40, Eric Dumazet wrote:
>> On Sat, May 2, 2020 at 7:56 AM Julian Wiedmann <jwi@...ux.ibm.com> wrote:
>>>
>>> On 22.04.20 18:13, Eric Dumazet wrote:
>>>> Back in commit 3b47d30396ba ("net: gro: add a per device gro flush timer")
>>>> we added the ability to arm one high resolution timer, that we used
>>>> to keep not-complete packets in GRO engine a bit longer, hoping that further
>>>> frames might be added to them.
>>>>
>>>> Since then, we added the napi_complete_done() interface, and commit
>>>> 364b6055738b ("net: busy-poll: return busypolling status to drivers")
>>>> allowed drivers to avoid re-arming NIC interrupts if we made a promise
>>>> that their NAPI poll() handler would be called in the near future.
>>>>
>>>> This infrastructure can be leveraged, thanks to a new device parameter,
>>>> which allows arming the napi hrtimer instead of re-arming the device
>>>> hard IRQ.
>>>>
>>>> We have noticed that on some servers with 32 RX queues or more, the chit-chat
>>>> between the NIC and the host caused by IRQ delivery and re-arming could hurt
>>>> throughput by ~20% on 100Gbit NIC.
>>>>
>>>> In contrast, hrtimers are using local (percpu) resources and might have lower
>>>> cost.
>>>>
>>>> The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy
>>>> as gro_flush_timeout (/sys/class/net/ethX/)
>>>>
>>>
>>> Hi Eric,
>>> could you please add some Documentation for this new sysfs tunable? Thanks!
>>> Looks like gro_flush_timeout is missing the same :).
>>
>>
>> Yes. I was planning to add this to
>> Documentation/networking/scaling.rst, once our fires are extinguished.
>>
>>>
>>>
>>>> By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.
>>>>
>>>> This patch does not change the prior behavior of gro_flush_timeout
>>>> if used alone : NIC hard irqs should be rearmed as before.
>>>>
>>>> One concrete usage can be :
>>>>
>>>> echo 20000 >/sys/class/net/eth1/gro_flush_timeout
>>>> echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs
>>>>
>>>> If at least one packet is retired, then we will reset napi counter
>>>> to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
>>>> of the queue.
>>>>
>>>> On busy queues, this should avoid NIC hard IRQs, while before this patch IRQ
>>>> avoidance was only possible if napi->poll() exhausted its budget
>>>> and did not call napi_complete_done().
>>>>
>>>
>>> I was confused here for a second, so let me just clarify how this is intended
>>> to look for pure TX completion IRQs:
>>>
>>> napi->poll() calls napi_complete_done() with an accurate work_done value, but
>>> then still returns 0 because TX completion work doesn't consume NAPI budget.
>>
>>
>> If the napi budget was consumed, the driver does _not_ call
>> napi_complete() or napi_complete_done() anyway.
>>
> 
> I was thinking of "TX completions are cheap and don't consume _any_ NAPI budget, ever"
> as the current consensus, but looking at the mlx4 code that evidently isn't true
> for all drivers.

TX completions are not cheap in many cases.

DMA unmapping can be costly when an IOMMU is in use, and freeing skbs
can also be expensive.
On top of this, the TCP stack might be called back (via skb->destructor())
to add more packets to the qdisc/device.

So effectively using the budget as a limit might help in some stress situations,
by not re-enabling NIC interrupts, even before the napi_defer_hard_irqs addition.
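
To illustrate the point about using the budget as a limit, here is a small
userspace sketch, not driver code. The names model_queue and model_poll are
hypothetical, and the IRQ flag merely stands in for the driver calling
napi_complete_done() and re-arming the NIC interrupt. It assumes a single poll
handler that charges TX completions against the same budget as RX work, which
not all drivers do.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-queue driver state; not a kernel type. */
struct model_queue {
	int tx_pending;   /* TX descriptors waiting to be reclaimed */
	int rx_pending;   /* RX packets waiting to be processed */
	bool irq_enabled; /* models the NIC interrupt for this queue */
};

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * Model of a poll handler that counts TX completions against the budget.
 * Returns the work done. When that equals the budget, the handler does not
 * complete, so the interrupt stays disabled and polling continues.
 */
int model_poll(struct model_queue *q, int budget)
{
	int tx_done, rx_done;

	assert(budget > 0);

	/* A real driver would unmap DMA buffers and free skbs here. */
	tx_done = min_int(q->tx_pending, budget);
	q->tx_pending -= tx_done;

	rx_done = min_int(q->rx_pending, budget - tx_done);
	q->rx_pending -= rx_done;

	if (tx_done + rx_done == budget)
		return budget;  /* budget exhausted: no completion, no re-arm */

	/* Stands in for napi_complete_done() followed by IRQ re-arm. */
	q->irq_enabled = true;
	return tx_done + rx_done;
}
```

With 100 pending TX completions and a budget of 64, the first call returns 64
and leaves the interrupt disabled; the second call returns 36 and re-enables it.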

> 
>> If the budget is not consumed, then a call to napi_complete_done(napi, X>0)
>> is allowed to return 0 if napi_defer_hard_irqs is not 0
>>
>> This means that the NIC hard irq will stay disabled for at least one more round.
>>
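
The counter behaviour discussed in this thread can be sketched in userspace as
follows. This is a simplified model, not the kernel implementation: the types
model_dev and model_napi and the function model_complete_done are hypothetical,
the GRO-only flush timer path is left out, and busy-poll state is ignored. Only
the two tunables, gro_flush_timeout and napi_defer_hard_irqs, come from the
patch description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-device tunables named in the patch. */
struct model_dev {
	unsigned long gro_flush_timeout; /* in ns, 0 means no timer */
	int napi_defer_hard_irqs;        /* 0 means feature disabled */
};

/* Hypothetical model of the per-NAPI state. */
struct model_napi {
	const struct model_dev *dev;
	int defer_count;        /* remaining rounds without hard IRQ */
	unsigned long timer_ns; /* timer armed by the last call, 0 if none */
};

/*
 * Returns true when the driver should re-arm the NIC hard IRQ, and false
 * when a timer was armed instead so the IRQ stays disabled for another round.
 */
bool model_complete_done(struct model_napi *n, int work_done)
{
	bool rearm_irq = true;

	assert(work_done >= 0);
	n->timer_ns = 0;

	/* Any retired packet resets the counter to the tunable's value. */
	if (work_done > 0)
		n->defer_count = n->dev->napi_defer_hard_irqs;

	if (n->defer_count > 0) {
		n->defer_count--;
		n->timer_ns = n->dev->gro_flush_timeout;
		if (n->timer_ns)
			rearm_irq = false; /* poll again from the timer */
	}
	return rearm_irq;
}
```

With gro_flush_timeout set to 20000 and napi_defer_hard_irqs set to 10, one
round that retires packets followed by nine empty rounds all return false in
this model; the next empty round returns true and the hard IRQ is re-armed.
With both tunables at their default of zero, every call returns true.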
