Message-ID: <88a51699-e913-4dba-992d-e923509ec754@kernel.org>
Date: Mon, 4 Aug 2025 11:58:54 +0200
From: Matthieu Baerts <matttbe@...nel.org>
To: Jakub Kicinski <kuba@...nel.org>, willemb@...gle.com
Cc: netdev@...r.kernel.org, edumazet@...gle.com, pabeni@...hat.com,
 andrew+netdev@...n.ch, horms@...nel.org, shuah@...nel.org,
 linux-kselftest@...r.kernel.org, davem@...emloft.net
Subject: Re: [PATCH net] selftests: net: packetdrill: xfail all problems on
 slow machines

Hi Jakub, Willem,

On 01/08/2025 20:16, Jakub Kicinski wrote:
> We keep seeing flakes on packetdrill on debug kernels, while
> non-debug kernels are stable, not a single flake in 200 runs.
> Time to give up, debug kernels appear to suffer from 10msec
> latency spikes and any timing-sensitive test is bound to flake.

Thank you for the patch!

Another solution might be to increase the tolerance, though I don't
think that would fix all the issues. I had a quick look at the last 100
runs, and most of the failures would likely be fixed by a higher
tolerance, e.g.

> # tcp_ooo-before-and-after-accept.pkt:19: timing error: expected inbound packet at 0.101619 sec but happened at 0.115894 sec; tolerance 0.014000 sec

(only 0.275 ms above the limit!)

On the MPTCP side, we used to have a very high tolerance with debug
kernels (>0.5s), back when the public CIs were very limited in terms of
CPU resources. I guess a tolerance of 0.1s would be enough, but for
these MPTCP packetdrill tests I set it to 0.2s with a debug kernel,
just to be on the safe side.
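
To illustrate, this is roughly what it looks like in a script (a
minimal sketch, assuming packetdrill's option lines at the top of a
.pkt file, which mirror the --tolerance_usecs command-line flag; the
test body below is a made-up skeleton, not one of the real tests):

  // Allow up to 0.2s of timing slack, enough to absorb the ~10ms
  // latency spikes seen on debug kernels.
  --tolerance_usecs=200000

   0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
  +0 bind(3, ..., ...) = 0
  +0 listen(3, 1) = 0

  +0 < S 0:0(0) win 32792 <mss 1000,nop,nop,sackOK,nop,wscale 7>
  +0 > S. 0:0(0) ack 1 <...>
  +.1 < . 1:1(0) ack 1 win 257
  +0 accept(3, ..., ...) = 4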

Still, I think increasing the tolerance would not fix all the issues.
On the MPTCP side, the latency introduced by the debug kernel caused
unexpected retransmissions because the RTO ended up too low. I took the
time to make sure injected packets were always sent with enough delay,
but looking at some recent errors from the TCP packetdrill tests here,
doing the same is possibly not enough, e.g.

> tcp_zerocopy_batch.pkt:26: error handling packet: live packet payload: expected 4000 bytes vs actual 5000 bytes
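
For the MPTCP scripts, "enough delay" simply means leaving explicit
slack around injected packets, e.g. something like this hypothetical
fragment (sequence numbers and timings invented for illustration):

  +0   write(4, ..., 1000) = 1000
  +0   > P. 1:1001(1000) ack 1
  // Inject the ACK with a comfortable gap after the kernel's transmit,
  // while staying far below the 200ms minimum RTO: even a ~10ms
  // debug-kernel latency spike then cannot trigger a spurious
  // retransmission.
  +.02 < . 1:1(0) ack 1001 win 257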

In the end, and as previously mentioned, these adaptations for debug
kernels are perhaps not worth it: in this environment, it is probably
enough to ignore the packetdrill results and focus on kernel warnings.

Acked-by: Matthieu Baerts (NGI0) <matttbe@...nel.org>

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.

