linux-kernel - Re: Kernel panic in netif_rx_internal after v6 pings between netns

Message-ID: <f96b33ab-56d5-4a43-a1ff-2e68e2c55ac2@kernel.org>
Date: Mon, 22 Jan 2024 19:22:42 +0100
From: Matthieu Baerts <matttbe@...nel.org>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Eric Dumazet <edumazet@...gle.com>, Netdev <netdev@...r.kernel.org>,
 LKML <linux-kernel@...r.kernel.org>
Subject: Re: Kernel panic in netif_rx_internal after v6 pings between netns

Hi Jakub,

On 22/01/2024 18:28, Jakub Kicinski wrote:

(...)

> Somewhat related. What do you do currently to ignore crashes?

I was wondering why you wanted to ignore crashes :) ... but then I saw
the new "Test ignored" and "Crashes ignored" sections on the status
page. Just to be sure: you don't want to report issues that have not
been introduced by the new patches, right?

We don't need to do that on the MPTCP side:
- either it is a new crash caused by patches that are in review, and
that doesn't impact others → we test each series individually, not a
batch of series.
- or there are issues with recent patches that are not in netdev yet →
we fix, or revert.
- or there is an issue elsewhere, like the kernel panic we reported
here: usually I try to quickly apply a workaround, e.g. applying a fix,
or a revert. I don't think we ever had an issue really impacting us
where we couldn't find a quick solution in one or two days. With the
panic we reported here, ~15% of the tests had an issue; it is "OK" to
have that for a few days/weeks.

With fewer tests and a smaller community, it is easier for us to just
say on the ML and in the weekly meetings: "this is a known issue, please
ignore it for the moment". But if possible, I try to add a workaround/fix
in our repo used by the CI and devs (not upstreamed).

For NIPA CI, do you want to do the same as with the build, and compare
with a reference? Or with multiple references, to take unstable tests
into account? Or maintain a list of known issues (I think you started to
do that, probably safer/easier for the moment)?

> I was seeing a lot of:
> https://netdev-2.bots.linux.dev/vmksft-net-mp/results/431181/vm-crash-thr0-2
> 
> So I hacked up this function to filter the crash from NIPA CI:
> https://github.com/kuba-moo/nipa/blob/master/contest/remote/lib/vm.py#L50
> It tries to get first 5 function names from the stack, to form 
> a "fingerprint". But I seem to recall a discussion at LPC's testing
> track that there are existing solutions for generating fingerprints.
> Are you aware of any?

No, sorry. But I guess they are using that with syzkaller, no?

I have to admit that crashes (or warnings) are quite rare, so there was
no need for automation there. But if it is easy to get a fingerprint, I
would be interested as well: it can help with tracking, to find
occurrences of crashes/warnings that are very hard to reproduce.
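
As a rough idea of what such a fingerprint could look like, here is a
minimal sketch that takes the first 5 function names after a
"Call Trace:" line, as described above. It is not the code from NIPA's
vm.py: the function name, the regular expressions, the choice to skip
"? " frames, and the ":" separator are assumptions, and the log format
assumed is the usual "name+0xoff/0xsize" frame line with an optional
timestamp.

```python
import re

# Frame lines look like " netif_rx_internal+0x42/0x130"; unreliable
# frames are prefixed with "? ". Group 1 captures that prefix, group 2
# captures the function name.
_FRAME_RE = re.compile(r"^\s*(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")
# Optional "[   12.345678] " timestamp at the start of a console line.
_TS_RE = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def crash_fingerprint(log_lines, depth=5):
    """Join the first `depth` reliable function names found after a
    "Call Trace:" marker. Returns "" if no trace is found."""
    names = []
    in_trace = False
    for line in log_lines:
        line = _TS_RE.sub("", line)
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        m = _FRAME_RE.match(line)
        if m and not m.group(1):  # ignore "? " (unreliable) frames
            names.append(m.group(2))
            if len(names) == depth:
                break
    return ":".join(names)


if __name__ == "__main__":
    # Made-up trace, for illustration only (not a real crash log).
    sample = [
        "[   12.000001] Call Trace:",
        "[   12.000002]  <IRQ>",
        "[   12.000003]  ? __die+0x1f/0x70",
        "[   12.000004]  netif_rx_internal+0x42/0x130",
        "[   12.000005]  __netif_rx+0x1a/0x60",
    ]
    print(crash_fingerprint(sample))
```

Two crashes with the same fingerprint can then be counted as
occurrences of the same issue, which is what would help with the
hard-to-reproduce ones.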

> (FWIW the crash from above seems to be gone on latest linux.git,
> this night's CIs run are crash-free.)

Good that it was quickly fixed!

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.