netdev - Re: [PATCH net] net: stmmac: Use hrtimer for TX coalescing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <44ebe3fd-5898-4e48-a642-ee7457c0c032@nbd.name>
Date: Fri, 1 Sep 2023 13:53:23 +0200
From: Felix Fietkau <nbd@....name>
To: Vincent Whitchurch <Vincent.Whitchurch@...s.com>,
 "joabreu@...opsys.com" <joabreu@...opsys.com>,
 "kuba@...nel.org" <kuba@...nel.org>,
 "davem@...emloft.net" <davem@...emloft.net>,
 "alexandre.torgue@...com" <alexandre.torgue@...com>,
 "peppe.cavallaro@...com" <peppe.cavallaro@...com>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
 kernel <kernel@...s.com>
Subject: Re: [PATCH net] net: stmmac: Use hrtimer for TX coalescing

On 01.09.23 13:31, Vincent Whitchurch wrote:
> On Wed, 2023-08-30 at 23:06 +0200, Felix Fietkau wrote:
>> On 30.08.23 16:55, Vincent Whitchurch wrote:
>> > I looked at it some more and the continuous postponing behaviour strikes
>> > me as quite odd.  For example, if you set tx-frames coalescing to 0 then
>> > cleanups could happen much later than the specified tx-usecs period, in
>> > the absence of RX traffic.  Also, if we'd have to have a shared
>> > timestamp between the callers of stmmac_tx_timer_arm() and the hrtimer
>> > to preserve this continuous postponing behaviour, then we'd need to
>> > introduce some locking between the timer expiry and those functions, to
>> > avoid race conditions.
>> 
>> I just spent some time digging through the history of the timer code, 
>> figuring out the intention behind the continuous postponing behavior.
>> 
>> It's an interrupt mitigation scheme where DMA descriptors are configured 
>> to only generate a completion event every 25 packets, and the only 
>> purpose of the timer is to avoid keeping packets in the queue for too 
>> long after tx activity has stopped.
>> Based on that design, I believe that the continuous postponing actually 
>> makes sense and the patches that eliminate it are misguided. When there 
>> is constant activity, there will be tx completion interrupts that 
>> trigger cleanup.
> 
> The tx-frames value (25) can be controlled from user space.  If it is
> set to zero then the driver should still coalesce interrupts based on
> the interval specified in the tx-usecs setting, but the driver fails to
> do so because it keeps postponing the cleanup and increasing the latency
> of the cleanups far beyond the programmed period.
> 
>> That said, I did even more digging and I found out that the timer code 
>> was added at a time when the driver didn't even disable tx and rx 
>> interrupts individually, which means that it could not take advantage of 
>> interrupt mitigation via NAPI scheduling + IRQ disable/enable.
>> 
>> I have a hunch that given the changes made to the driver over time, the 
>> timer based interrupt mitigation strategy might just be completely 
>> useless and actively harmful now. It certainly messes with things like 
>> TSQ and BQL in a nasty way.
>> 
>> I suspect that the best and easiest way to deal with this issue is to 
>> simply rip out all that timer nonsense, rely on tx IRQs entirely and 
>> just let NAPI do its thing.
> 
> If you want an interrupt for every packet, you can turn off coalescing
> by setting tx-frames to 1 and tx-usecs 0.  Currently, the driver does
> not turn off the timer even if tx-usecs is set to zero, but that is
> trivially fixed.  With such a fix and that setting, the result is a 30x
> increase in the number of interrupts in a tx-only test.
> 
> And tx-frames 25 tx-usecs 0 is of course also bad for throughput since
> cleanups will not happen when fewer than 25 frames are transmitted.
If the CPU is not saturated, the increase in interrupts is expected, 
given that NAPI doesn't do any timer based mitigation.

I guess for devices operating on battery, the timer based coalescing 
might conserve some power, unless the power drain caused by the CPU 
overhead of the timer rescheduling is bigger than the power drain from 
extra wakeups.

For a devices where performance matters more (e.g. the ones used with 
OpenWrt), I believe that the increase in the number of IRQs won't 
matter, because mitigation should kick in under load.

Could you please post your fix? I think in order to avoid accidental 
breakage, we should make tx-usecs=0 imply tx-frames=1.

Thanks,

- Felix