Message-ID: <20230703142908.RcxjjF_E@linutronix.de>
Date:   Mon, 3 Jul 2023 16:29:08 +0200
From:   Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To:     Wander Lairson Costa <wander@...hat.com>
Cc:     linux-kernel@...r.kernel.org, linux-rt-users@...r.kernel.org,
        juri.lelli@...hat.com
Subject: Re: Splat in kernel RT while processing incoming network packets

On 2023-07-03 09:47:26 [-0300], Wander Lairson Costa wrote:
> Dear all,
Hi,

> I am writing to report a splat issue we encountered while running the
> Real-Time (RT) kernel in conjunction with Network RPS (Receive Packet
> Steering).
> 
> During some testing of the RT kernel version 6.4.0 with Network RPS enabled,
> we observed a splat occurring in the SoftIRQ subsystem. The splat message is as
> follows:
> 
> [   37.168920] ------------[ cut here ]------------
> [   37.168925] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:291 do_softirq_post_smp_call_flush+0x2d/0x60
> …
> [   37.169060] ---[ end trace 0000000000000000 ]---
> 
> It comes from [1].
> 
> The issue lies in the mechanism RPS uses to defer network packet processing
> to other CPUs: it sends an IPI to the target CPU. The registered callback
> is rps_trigger_softirq, which raises a softirq, leading to the following
> scenario:
> 
> CPU0                                    CPU1
> | netif_rx()                            |
> | | enqueue_to_backlog(cpu=1)           |
> | | | net_rps_send_ipi()                |
> |                                       | flush_smp_call_function_queue()
> |                                       | | was_pending = local_softirq_pending()
> |                                       | | __flush_smp_call_function_queue()
> |                                       | | rps_trigger_softirq()
> |                                       | | | __raise_softirq_irqoff()
> |                                       | | do_softirq_post_smp_call_flush()
> 
> That has the undesired side effect of raising a softirq from within an SMP
> function call, leading to the aforementioned splat.

correct.
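
For reference, the check that fires is the pending-set comparison in
do_softirq_post_smp_call_flush() (kernel/softirq.c, your [1]), and the
IPI callback that trips it is rps_trigger_softirq() in net/core/dev.c.
Quoting both from memory, roughly:

	/* kernel/softirq.c, PREEMPT_RT: warn if the SMP function call
	 * flush left softirqs pending that were not pending before. */
	void do_softirq_post_smp_call_flush(unsigned int was_pending)
	{
		if (WARN_ON_ONCE(was_pending != local_softirq_pending()))
			invoke_softirq();
	}

	/* net/core/dev.c: runs from the IPI, i.e. inside the function
	 * call flush, and raises NET_RX_SOFTIRQ for the backlog NAPI. */
	static void rps_trigger_softirq(void *data)
	{
		struct softnet_data *sd = data;

		____napi_schedule(sd, &sd->backlog);
		sd->received_rps++;
	}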

> The kernel version is kernel-ark [1], os-build-rt branch. It is essentially the
> upstream kernel with the PREEMPT_RT patches, and with RHEL configs. I can
> provide the .config.

It is fine, I see it.

> The only solution I have imagined so far is to modify RPS to process packets
> in a kernel thread on RT. But I wonder how that would be different from
> processing them in ksoftirqd.
> 
> Any inputs on the issue?

Not sure how to proceed. One thing you could do is a hack similar to
net-Avoid-the-IPI-to-free-the.patch, which does this for the defer_csd.
On the other hand we could drop net-Avoid-the-IPI-to-free-the.patch and
remove the warning, because we now have commit
   d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")

Prior to that commit, raising a softirq from hardirq context would wake
ksoftirqd, which in turn would collect all pending softirqs. As a
consequence, all following softirqs (networking, …) would run as
SCHED_OTHER and compete with other SCHED_OTHER tasks for resources. Not
good, because the networking work is no longer processed within the
networking interrupt thread. It is also not a DDoS kind of situation
where one might want to delay processing.
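
That behaviour came from the now-reverted ksoftirqd_running() check;
from memory it was roughly:

	/* Removed by d15121be74856: once ksoftirqd was woken, everyone
	 * deferred softirq processing to it instead of running it. */
	static bool ksoftirqd_running(unsigned long pending)
	{
		struct task_struct *tsk = __this_cpu_read(ksoftirqd);

		if (pending & SOFTIRQ_NOW_MASK)
			return false;
		return tsk && task_is_running(tsk);
	}

	static inline void invoke_softirq(void)
	{
		if (ksoftirqd_running(local_softirq_pending()))
			return;
		/* ... otherwise process softirqs right here ... */
	}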

With that change, this is no longer the case. Only an "unrelated" IRQ
thread could pick up the networking work, which is less than ideal: the
softirq is added to the global pending set and ksoftirqd is marked for a
wakeup, but that wakeup can be delayed because other tasks are busy. Then
the disk interrupt (for instance) could pick up the networking work as
part of its threaded interrupt handler.
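
That pick-up happens because a forced-threaded handler runs in a
local_bh_disable() section, and the closing local_bh_enable() processes
whatever softirqs are pending on that CPU - including a NET_RX_SOFTIRQ
left behind by the RPS IPI. From kernel/irq/manage.c, roughly:

	static irqreturn_t irq_forced_thread_fn(struct irq_desc *desc,
						struct irqaction *action)
	{
		irqreturn_t ret;

		local_bh_disable();
		if (!IS_ENABLED(CONFIG_PREEMPT_RT))
			local_irq_disable();
		ret = action->thread_fn(action->irq, action->dev_id);
		if (ret == IRQ_HANDLED)
			atomic_inc(&desc->threads_handled);

		if (!IS_ENABLED(CONFIG_PREEMPT_RT))
			local_irq_enable();
		/* runs pending softirqs in this IRQ thread's context */
		local_bh_enable();
		return ret;
	}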

Now that I think about it, we could make the backlog pseudo device a
thread. NAPI threading enables one thread per NAPI instance, but here we
would need one thread per CPU, so it would remain kind of special. But we
would avoid clobbering the global softirq state and deferring everything
to ksoftirqd. Processing it in ksoftirqd might not be ideal from a
performance point of view.
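
A minimal sketch of that idea, using the smpboot per-CPU thread
infrastructure (everything below is hypothetical and simplified - the
names are invented, and budget handling and the NAPI state transitions
are omitted):

	#include <linux/smpboot.h>
	#include <linux/netdevice.h>

	static DEFINE_PER_CPU(struct task_struct *, backlog_kthread);

	/* Run whenever this CPU's backlog NAPI has been scheduled. */
	static int backlog_thread_should_run(unsigned int cpu)
	{
		struct softnet_data *sd = this_cpu_ptr(&softnet_data);

		return test_bit(NAPI_STATE_SCHED, &sd->backlog.state);
	}

	static void backlog_thread_fn(unsigned int cpu)
	{
		struct softnet_data *sd = this_cpu_ptr(&softnet_data);

		/* Poll the backlog directly (process_backlog) instead of
		 * raising NET_RX_SOFTIRQ into the global pending set. */
		sd->backlog.poll(&sd->backlog, 64);
	}

	static struct smp_hotplug_thread backlog_threads = {
		.store			= &backlog_kthread,
		.thread_should_run	= backlog_thread_should_run,
		.thread_fn		= backlog_thread_fn,
		.thread_comm		= "backlog/%u",
	};

	/* smpboot_register_percpu_thread(&backlog_threads) from an
	 * initcall would then give one backlog/N thread per CPU, the
	 * same way the ksoftirqd threads are created. */

The RPS IPI callback would then wake the target CPU's backlog thread
instead of raising NET_RX_SOFTIRQ.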

> [1] https://elixir.bootlin.com/linux/latest/source/kernel/softirq.c#L306
> 
> Cheers,
> Wander

Sebastian
