Message-ID: <20250602092754eucms1p1b99e467d1483531491c5b43b23495e14@eucms1p1>
Date: Mon, 02 Jun 2025 11:27:54 +0200
From: Eryk Kubanski <e.kubanski@...tner.samsung.com>
To: Stanislav Fomichev <stfomichev@...il.com>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"bjorn@...nel.org" <bjorn@...nel.org>, "magnus.karlsson@...el.com"
<magnus.karlsson@...el.com>, "maciej.fijalkowski@...el.com"
<maciej.fijalkowski@...el.com>, "jonathan.lemon@...il.com"
<jonathan.lemon@...il.com>
Subject: RE: Re: [PATCH bpf v2] xsk: Fix out of order segment free in
__xsk_generic_xmit()
> I'm not sure I understand what's the issue here. If you're using the
> same XSK from different CPUs, you should take care of the ordering
> yourself on the userspace side?
It's not a problem with the user-space Completion Queue READER side.
I'm talking exclusively about the kernel-space Completion Queue WRITER side.
This problem can occur when multiple sockets are bound to the same
umem, device and queue id. In this situation the Completion Queue is
shared, which means it can be accessed by multiple threads on the
kernel side. Every access is indeed protected by a spinlock, but the
write sequence as a whole (acquire a write slot as writer, write to the
slot, submit the write slot to the reader) is not atomic, so it is
possible to submit descriptors of not-yet-sent packets back to
user-space as TX completed.
Until now, every write-back operation had two phases, and each phase
takes the spinlock and releases it separately:
1) Acquire slot + write descriptor (increase cached_writer by N + write values)
2) Submit slot to the reader (increase writer by N)
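Roughly, the current flow looks like this (paraphrasing the
xsk_cq_reserve_addr_locked()/xsk_cq_submit_locked() helpers in
net/xdp/xsk.c from memory, simplified; cached_prod/producer are what I
called cached_writer/writer above):

	/* Phase 1: __xsk_generic_xmit(), per descriptor */
	spin_lock_irqsave(&pool->cq_lock, flags);
	ret = xskq_prod_reserve_addr(pool->cq, addr); /* cached_prod++, write addr */
	spin_unlock_irqrestore(&pool->cq_lock, flags);

	/* Phase 2: skb destructor, possibly much later, on another CPU */
	spin_lock_irqsave(&pool->cq_lock, flags);
	xskq_prod_submit_n(pool->cq, n);              /* producer += n */
	spin_unlock_irqrestore(&pool->cq_lock, flags);

Each phase is serialized on its own, but nothing ties the two phases of
one skb together.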
Slot submission relies purely on timing. Consider a situation where two
different threads issue a syscall for two different AF_XDP sockets that
are bound to the same umem, dev and queue id.
AF_XDP setup:

                  kernel-space
             Write            Read
             +--+             +--+
             |  |             |  |
             |  |             |  |
             |  |             |  |
  Completion |  |             |  | Fill
  Queue      |  |             |  | Queue
             |  |             |  |
             |  |             |  |
             |  |             |  |
             |  |             |  |
             +--+             +--+
             Read             Write
                  user-space

       +--------+        +--------+
       | AF_XDP |        | AF_XDP |
       +--------+        +--------+
Possible out-of-order scenario:
                        writer        cached_writer1                      cached_writer2
                           |                 |                                   |
                           |                 |                                   |
                           |                 |                                   |
                           |                 |                                   |
                 +---------|--------|--------|--------|--------|--------|--------|---------+
                 |         |        |        |        |        |        |        |         |
Completion Queue |         |        |        |        |        |        |        |         |
                 |         |        |        |        |        |        |        |         |
                 +---------|--------|--------|--------|--------|--------|--------|---------+
                           |                 |                                   |
                           |                 |                                   |
                           |<--------------->|                                   |
                            A) T1 syscall    |                                   |
                               writes 2      |                                   |
                               descriptors   |<--------------------------------->|
                                              B) T2 syscall writes 4 descriptors
Notes:
1) T1 and T2 are two different AF_XDP sockets, so
   __xsk_generic_xmit() will take two different mutexes.
2) T1 and T2 can execute simultaneously; there is no
   critical section whatsoever between them.
3) T1 and T2 take the Completion Queue lock for slot acquire + write;
   only the slot acquire + write are under the lock.
4) T1 and T2 completions (skb destructors) don't have to run in the
   same order as A) and B).
5) What if T1 fails after T2 has acquired its slots?
   cached_writer will be decreased by 2, and T2 will submit the failed
   descriptors of T1 (they would be retransmitted in the next TX).
   T2's submission moves writer by 4 slots, but 2 of those slots hold
   the failed T1 values, and the last two slots of T2 are never
   submitted: a descriptor leak. (See the cancel-path sketch below.)
6) What if T2 completes before T1? writer is moved by 4 slots, 2 of
   which are slots filled by T1. T2 thereby completes 2 of its own
   slots and 2 slots of T1, which is bad. T1 then completes the last
   2 slots of T2, also bad.
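For reference, the error path that note 5 describes simply walks the
cached producer back (again paraphrasing net/xdp/xsk.c from memory):

	/* __xsk_generic_xmit() error path: return the claimed slots */
	spin_lock_irqsave(&pool->cq_lock, flags);
	xskq_prod_cancel_n(pool->cq, n);              /* cached_prod -= n */
	spin_unlock_irqrestore(&pool->cq_lock, flags);

The decrement reclaims the n most recently acquired slots, which at
this point may belong to T2, while T1's already-written values stay in
the ring and are exposed later by T2's submit.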
This out-of-order completion can effectively cause a user-space <->
kernel-space data race. The patch solves that by only acquiring
cached_writer up front and doing the whole completion (submission:
write + increase writer) afterwards, in one step. This is the only way
to make it bulletproof against multithreaded access, failures and
out-of-order skb completions.
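As a sketch, the intended ordering is the following (not the literal
patch; xsk_skb_store_addrs() is a made-up name for wherever the
addresses get stashed, and xskq_prod_write_addr() stands in for a write
at the current producer position):

	/* xmit path: only claim space, remember the addrs for later */
	spin_lock_irqsave(&pool->cq_lock, flags);
	ret = xskq_prod_reserve(pool->cq);            /* cached_prod++, no write */
	spin_unlock_irqrestore(&pool->cq_lock, flags);
	if (!ret)
		xsk_skb_store_addrs(skb, addr);       /* hypothetical helper */

	/* skb destructor: write + submit atomically, in completion order */
	spin_lock_irqsave(&pool->cq_lock, flags);
	for (i = 0; i < num_descs; i++)
		xskq_prod_write_addr(pool->cq, i, addrs[i]); /* slot producer + i */
	xskq_prod_submit_n(pool->cq, num_descs);      /* producer += num_descs */
	spin_unlock_irqrestore(&pool->cq_lock, flags);

Because the slot contents are written under the same lock acquisition
that publishes them, the reader can never observe a submitted slot that
still holds stale or failed descriptors, regardless of destructor order.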
> This is definitely a no-go (sk_buff and skb_shared_info space is
> precious).
Okay, so where should I store it? Can you give me some advice?
I left it there because that is where all the information related to
skb destruction lives. Additionally, it is the only place in
skb-related code that defines anything related to xsk: metadata, number
of descriptors; sk_buff itself doesn't. I need to hold this information
somewhere, and sk_buff or skb_shared_info are the only places I can
store it. It needs to be invariant across all skb fragments and be
released after the skb completes.
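For context, this is how xsk currently keeps its per-skb state (quoting
net/xdp/xsk.c from memory, so the details may be slightly off):

	static u32 xsk_get_num_desc(struct sk_buff *skb)
	{
		/* number of CQ descriptors behind this skb */
		return skb ? (long)skb_shinfo(skb)->destructor_arg : 0;
	}

	static void xsk_set_destructor_arg(struct sk_buff *skb)
	{
		long num = xsk_get_num_desc(xdp_sk(skb->sk)->skb) + 1;

		skb_shinfo(skb)->destructor_arg = (void *)num;
	}

The descriptor count is already packed into
skb_shinfo(skb)->destructor_arg, which is why skb_shared_info looked
like the natural place for the rest of what the destructor needs.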