[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <69451eb5-a36e-4443-8e34-7a06627b087d@kernel.org>
Date: Wed, 12 Nov 2025 22:58:23 +0100
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, Toke Høiland-Jørgensen
<toke@...e.dk>, Eric Dumazet <eric.dumazet@...il.com>,
"David S. Miller" <davem@...emloft.net>, Paolo Abeni <pabeni@...hat.com>,
ihor.solodrai@...ux.dev, "Michael S. Tsirkin" <mst@...hat.com>,
makita.toshiaki@....ntt.co.jp, toshiaki.makita1@...il.com,
bpf@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, kernel-team@...udflare.com
Subject: Re: [PATCH net V3 1/2] veth: enable dev_watchdog for detecting
stalled TXQs
On 08/11/2025 02.54, Jakub Kicinski wrote:
> On Fri, 7 Nov 2025 14:42:58 +0100 Jesper Dangaard Brouer wrote:
>>> I think this belongs in net-next.. Fail safe is not really a bug fix.
>>> I'm slightly worried we're missing a corner case and will cause
>>> timeouts to get printed for someone's config.
>>
>> This is a recovery fix. If the race condition fix isn't 100% then this
>> patch will allow veth to recover. Thus, to me it makes sense to group
>> these two patches together.
>>
>> I'm more worried that we we're missing a corner case that we cannot
>> recover from. Than triggering timeouts to get printed, for a config
>> where NAPI consumer veth_poll() takes more that 5 seconds to run (budget
>> max 64 packets this needs to consume packets at a rate less than 12.8
>> pps). It might be good to get some warnings if the system is operating
>> this slow.
>>
>> Also remember this is not the default config that most people use.
>> The code is only activated if attaching a qdisc to veth, which isn't
>> default. Plus, NAPI mode need to be activated, where in normal NAPI mode
>> the producer and consumer usually runs on the same CPU, which makes it
>> impossible to overflow the ptr_ring. The veth backpressure is primarily
>> needed when running with threaded-NAPI, where it is natural that
>> producer and consumer runs on different CPUs. In our production setup
>> the consumer is always slower than the producer (as the product inside
>> the namespace have installed too many nftables rules).
>
> I understand all of this, but IMO the fix is in patch 2.
> This is a resiliency improvement, not a fix.
As maintainer you have the final say, so I send a [V4]. Notice that
doing it this way will cause a merge conflict once net and net-next gets
merged.
[V4]
https://lore.kernel.org/all/176295319819.307447.6162285688886096284.stgit@firesoul/
--Jesper
Powered by blists - more mailing lists