[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ef8057dc-28f9-483a-9885-35879ad84b56@intel.com>
Date: Wed, 12 Jun 2024 11:34:43 +0200
From: Alexander Lobakin <aleksander.lobakin@...el.com>
To: Josh Hay <joshua.a.hay@...el.com>
CC: David Decotigny <ddecotig@...il.com>, <intel-wired-lan@...ts.osuosl.org>,
<netdev@...r.kernel.org>, Sridhar Samudrala <sridhar.samudrala@...el.com>
Subject: Re: [Intel-wired-lan] [PATCH iwl-net] idpf: extend tx watchdog
timeout
From: Josh Hay <joshua.a.hay@...el.com>
Date: Tue, 11 Jun 2024 11:13:53 -0700
>
>
> On 6/11/2024 3:44 AM, Alexander Lobakin wrote:
>> From: David Decotigny <ddecotig@...il.com>
>> Date: Tue, 4 Jun 2024 16:34:48 -0700
[...]
>> Note that there are several patches fixing Tx (incl. timeouts) in my
>> tree, including yours (Joshua's) which you somehow didn't send yet ._.
>> Maybe start from them first?
>
> I believe it was you who specifically asked our team to defer pushing
> any upstream patches while you were working on the XDP series "to avoid
> having to rebase", which was a reasonable request at the time. We also
It was only related to the virtchnl refactoring and later I cancelled
that when I realized it will go earlier than our series.
> had no reason to believe the existing upstream idpf implementation was
> experiencing timeouts (it is being tested by numerous validation teams).
> So there was no urgency to get those patches upstream. Which patches in
> your tree do you believe fix specific timeout situations? It appears you
[0][1][2]
> pulled in some of the changes from the out-of-tree driver, but those
> were all enhancements. It wasn't until the workload that David mentioned
No, there are all fixes.
[0] is your from the OOT, extended.
[1] is mine and never was in the OOT.
[2] is your from the OOT, extended by MichaĆ.
They really do help.
Note that there's one more Tx timeout patch from you in the OOT, but it
actually broke Tx xD
> was run on the current driver that we had any indication there were
> timeout issues.
>
>>
>> I don't buy 30 seconds, at least for now. Maybe I'm missing something.
>>
>> Nacked-by: Alexander Lobakin <aleksander.lobakin@...el.com>
>
>
> In the process of debugging the newly discovered timeout, our
> architecture team made it clear that 5 seconds is too low for this type
> of device, with a non deterministic pipeline where packets can take a
> number of exception/slow paths. Admittedly, we don't know the exact
Slowpath which takes 30 seconds to complete, seriously?
> number, so the solution for the time being was to bump it up with a
> comfortable buffer. As we tune things and debug with various workloads,
> we can bring it back down. As David mentioned, there is precedent for an
> extended timeout for smartnics. Why is it suddenly unacceptable for
> Intel's device?
I don't know where this "suddenly" comes from.
Because even 5 seconds is too much.
HW usually send packets in microseconds if not faster. Extending the
timeout will hide real issues and make debugging more difficult.
I suspect this all is for OOO packets with an explicit sending timestamp
passed from the userspace, but as I said, you need to teach the kernel
watchdog to account them.
Otherwise, I can ask the driver to send a packet in 31 seconds, what
then, there will be a timeout and you will send a patch to extend it to
60 seconds?
>
>>
>> Thanks,
>> Olek
>
> Thanks,
> Josh
[0] https://github.com/alobakin/linux/commit/aad547037598
[1] https://github.com/alobakin/linux/commit/50f4c27ba13e
[2] https://github.com/alobakin/linux/commit/4a9b6c5d0ee8
Thanks,
Olek
Powered by blists - more mailing lists