Message-ID: <3cee5141-c525-4e83-830e-bf21828aed51@uliege.be>
Date: Tue, 15 Apr 2025 11:10:01 +0200
From: Justin Iurman <justin.iurman@...ege.be>
To: Andrea Mayer <andrea.mayer@...roma2.it>
Cc: Alexei Starovoitov <alexei.starovoitov@...il.com>,
 Sebastian Sewior <bigeasy@...utronix.de>,
 Stanislav Fomichev <stfomichev@...il.com>,
 Network Development <netdev@...r.kernel.org>,
 "David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Simon Horman <horms@...nel.org>, Kuniyuki Iwashima <kuniyu@...zon.com>,
 bpf <bpf@...r.kernel.org>, Stefano Salsano <stefano.salsano@...roma2.it>,
 Paolo Lungaroni <paolo.lungaroni@...roma2.it>
Subject: Re: [PATCH net] net: lwtunnel: disable preemption when required

On 4/15/25 02:54, Andrea Mayer wrote:
> I have been looking into the behavior of the lwtunnel_xmit() function in both
> task and softirq contexts. To facilitate this investigation, I have written a
> simple eBPF program that only prints messages to the trace pipe. This program
> is attached to the LWT BPF XMIT hook by configuring a route (on my test node)
> with a destination address (DA) pointing to an external node, referred to as
> x.x.x.x, within my testbed network.
> 
> To trigger that LWT BPF XMIT instance from a softirq context, it is sufficient
> to receive (on the test node) a packet with a DA matching x.x.x.x. This packet
> is then processed through the forwarding path, eventually leading to the
> ip_output() function. Processing ends with a call to ip_finish_output2(), which
> then calls lwtunnel_xmit().
> 
> Below is the stack trace from my testing machine, highlighting the key
> functions involved in this processing path:
> 
>   ============================================
>    <IRQ>
>    ...
>    lwtunnel_xmit+0x18/0x3f0
>    ip_finish_output2+0x45a/0xcc0
>    ip_output+0xe2/0x380
>    NF_HOOK.constprop.0+0x7e/0x2f0
>    ip_rcv+0x4bf/0x4d0
>    __netif_receive_skb_one_core+0x11c/0x130
>    process_backlog+0x277/0x980
>    __napi_poll.constprop.0+0x58/0x260
>    net_rx_action+0x396/0x6e0
>    handle_softirqs+0x116/0x640
>    do_softirq+0xa9/0xe0
>    </IRQ>
>   ============================================
> 
> Conversely, to trigger lwtunnel_xmit() from the task context, simply ping
> x.x.x.x on the same testing node. Below is the corresponding stack trace:
> 
>   ============================================
>    <TASK>
>    ...
>    lwtunnel_xmit+0x18/0x3f0
>    ip_finish_output2+0x45a/0xcc0
>    ip_output+0xe2/0x380
>    ip_push_pending_frames+0x17a/0x200
>    raw_sendmsg+0x9fa/0x1060
>    __sys_sendto+0x294/0x2e0
>    __x64_sys_sendto+0x6d/0x80
>    do_syscall_64+0x64/0x140
>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>    </TASK>
>   ============================================
> 
> So, for lwtunnel_xmit() as well, we need to make sure that the
> dev_xmit_recursion{_inc/dec}() functions and the logic that avoids lwt
> recursion are protected, i.e. run inside a local_bh_{disable/enable} block.

That's correct, and I reached the same conclusion as you on the possible 
paths for lwtunnel_xmit() depending on the context (task vs softirq). 
Based on your description, we're using a similar eBPF approach :-) Note 
that the paths are similar for lwtunnel_output() (see below).
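
For anyone following along, here is a minimal sketch of the kind of 
trivial LWT BPF program we're both describing; the program and section 
names below are only illustrative, not the exact ones either of us used:

  /* lwt_trace.bpf.c - minimal sketch, built with: clang -O2 -target bpf -c
   * Logs a message from the LWT XMIT hook and lets the packet through.
   */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("lwt_xmit")
  int trace_lwt_xmit(struct __sk_buff *skb)
  {
          const char msg[] = "lwtunnel_xmit() BPF hook hit, len=%u\n";

          bpf_trace_printk(msg, sizeof(msg), skb->len);
          return BPF_OK; /* continue with normal transmission */
  }

  char _license[] SEC("license") = "GPL";

Attached to the route toward x.x.x.x with something along the lines of 
"ip route add x.x.x.x/32 encap bpf xmit obj lwt_trace.o section lwt_xmit 
dev <iface>" (object/section/interface names are mine), the messages 
then show up on the trace pipe.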

> In a kernel without real-time preemption (i.e., when !PREEMPT_RT),
> in_softirq() expands to softirq_count(), which in turn uses the
> preempt_count() function. On my x86 architecture, preempt_count() accesses
> the per-CPU __preempt_count variable.
> 
> If in_softirq() returns 0, it indicates that no softirqs are currently being
> processed on the local CPU and BHs are not disabled. Therefore, following the
> logic above, we disable bottom halves (BHs) on that particular CPU.
> 
> However, there is, in my opinion, an issue that can occur: between the check
> on in_softirq() and the call to local_bh_disable(), the task may be scheduled
> on another CPU. As a result, the check on in_softirq() becomes ineffective,
> because we may end up disabling BHs on a CPU that is not the one we just
> checked (with if (in_softirq()) { ... }).

Hmm, I think you're right... good catch. I went for this solution to (i) 
avoid useless nested BH-disable calls; and (ii) avoid ending up with a 
spaghetti graph of possible paths with or without BHs disabled, by 
keeping single entry points (namely lwtunnel_xmit() and 
lwtunnel_output()); otherwise the code becomes hard to maintain IMO.
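
To spell out the pattern Andrea is pointing at, here is a simplified 
illustration (not a verbatim copy of the patch, and the function name is 
made up):

  /* sketch only: conditional BH-disable around the recursion accounting */
  static int lwt_xmit_sketch(struct sk_buff *skb)
  {
          bool need_bh_off = !in_softirq(); /* evaluated on the current CPU... */

          if (need_bh_off)
                  local_bh_disable();  /* ...but the task may have migrated
                                        * in the meantime, so the decision
                                        * was taken from another CPU's
                                        * preempt count.
                                        */

          dev_xmit_recursion_inc();
          /* recursion-limited transmit would go here */
          dev_xmit_recursion_dec();

          if (need_bh_off)
                  local_bh_enable();
          return 0;
  }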

So, if we want to follow what Alexei suggests (see his last response), 
we'd need to disable BHs in both ip_local_out() and ip6_local_out(). 
These are the deepest functions common to both the lwtunnel_xmit() and 
lwtunnel_output() paths. But... that comes at the "cost" of disabling 
BHs even when it is not required: ip_local_out() and ip6_local_out() 
both call dst_output(), which usually does not resolve to 
lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call 
either).
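
For the sake of discussion, my reading of that suggestion for the IPv4 
side would look roughly like the sketch below (same idea for 
ip6_local_out()); this is not a proposed patch:

  int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
  {
          int err;

          err = __ip_local_out(net, sk, skb);
          if (likely(err == 1)) {
                  /* BHs disabled unconditionally, even when dst_output()
                   * does not resolve to lwtunnel_output() and there is no
                   * lwtunnel_xmit() further down the path.
                   */
                  local_bh_disable();
                  err = dst_output(net, sk, skb);
                  local_bh_enable();
          }

          return err;
  }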

The other solution is to always call local_bh_disable() in both 
lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when 
they already are. That was basically v1, which received a NACK from Alexei.
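
From memory, that variant boiled down to something like this (sketch 
only; __lwtunnel_output() is a hypothetical helper standing in for the 
current body of the function):

  int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
  {
          int ret;

          /* Unconditionally serialize against softirqs so that
           * dev_xmit_recursion{,_inc,_dec}() and the lwt recursion limit
           * always operate on the per-CPU state of the CPU we run on.
           */
          local_bh_disable();
          ret = __lwtunnel_output(net, sk, skb);  /* hypothetical helper */
          local_bh_enable();

          return ret;
  }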

At the moment, I'm not sure what's best.
