Message-ID: <20250602092754eucms1p1b99e467d1483531491c5b43b23495e14@eucms1p1>
Date: Mon, 02 Jun 2025 11:27:54 +0200
From: Eryk Kubanski <e.kubanski@...tner.samsung.com>
To: Stanislav Fomichev <stfomichev@...il.com>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"bjorn@...nel.org" <bjorn@...nel.org>, "magnus.karlsson@...el.com"
	<magnus.karlsson@...el.com>, "maciej.fijalkowski@...el.com"
	<maciej.fijalkowski@...el.com>, "jonathan.lemon@...il.com"
	<jonathan.lemon@...il.com>
Subject: RE: Re: [PATCH bpf v2] xsk: Fix out of order segment free in
 __xsk_generic_xmit()

> I'm not sure I understand what's the issue here. If you're using the
> same XSK from different CPUs, you should take care of the ordering
> yourself on the userspace side?

It's not a problem with the user-space Completion Queue READER side.
I'm talking exclusively about the kernel-space Completion Queue WRITER side.

This problem can occur when multiple sockets are bound to the same
umem, device, and queue id. In this situation the Completion Queue is
shared, which means it can be accessed by multiple threads on the
kernel side. Each individual access is indeed protected by a spinlock,
but the full write sequence
(acquire a write slot as writer, write to the slot, submit the slot to the reader)
is not atomic as a whole, so it is possible to submit not-yet-sent packet
descriptors back to user-space as TX completed.

Until now, every write-back operation had two phases, each of which
takes and releases the spinlock on its own (see the sketch below):
1) Acquire slot + write descriptor (increase cached-writer by N + write values)
2) Submit slot to the reader (increase writer by N)
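
A simplified sketch of those two phases (paraphrased from net/xdp/xsk.c;
the xskq_* helpers live in the private net/xdp/xsk_queue.h, and exact
names may differ slightly between kernel versions):

#include <linux/spinlock.h>
#include <net/xsk_buff_pool.h>

/* Phase 1: called from __xsk_generic_xmit() before the skb is handed
 * to the driver. Only reserves + writes the slot, then drops the lock. */
static int xsk_cq_reserve_addr_locked(struct xsk_buff_pool *pool, u64 addr)
{
	unsigned long flags;
	int ret;

	spin_lock_irqsave(&pool->cq_lock, flags);
	ret = xskq_prod_reserve_addr(pool->cq, addr); /* cached_prod++, write addr */
	spin_unlock_irqrestore(&pool->cq_lock, flags);
	return ret;
}

/* Phase 2: called from the skb destructor once the skb is consumed.
 * Publishes the next n slots past the global producer, regardless of
 * which socket actually reserved and wrote those slots. */
static void xsk_cq_submit_locked(struct xsk_buff_pool *pool, u32 n)
{
	unsigned long flags;

	spin_lock_irqsave(&pool->cq_lock, flags);
	xskq_prod_submit_n(pool->cq, n); /* producer += n */
	spin_unlock_irqrestore(&pool->cq_lock, flags);
}

Nothing ties the n slots submitted in phase 2 to the slots this
particular skb reserved in phase 1; that gap is the race described
below.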

Slot submission was based solely on timing. Consider a situation where
two different threads issue a syscall for two different AF_XDP sockets
that are bound to the same umem, dev, and queue id.

AF_XDP setup:
                                                            
                             kernel-space                   
                                                            
           Write   Read                                     
            +--+   +--+                                     
            |  |   |  |                                     
            |  |   |  |                                     
            |  |   |  |                                     
 Completion |  |   |  | Fill                                
 Queue      |  |   |  | Queue                               
            |  |   |  |                                     
            |  |   |  |                                     
            |  |   |  |                                     
            |  |   |  |                                     
            +--+   +--+                                     
            Read   Write                                    
                             user-space                     
                                                            
                                                            
   +--------+         +--------+                            
   | AF_XDP |         | AF_XDP |                            
   +--------+         +--------+                            
                                                            
                                                            
                                                            
                                                            

Possible out-of-order scenario:
                                                                                                                                       
                                                                                                                                       
                              writer         cached_writer1                      cached_writer2                                        
                                 |                 |                                   |                                               
                                 |                 |                                   |                                               
                                 |                 |                                   |                                               
                                 |                 |                                   |                                               
                  +--------------|--------|--------|--------|--------|--------|--------|----------------------------------------------+
                  |              |        |        |        |        |        |        |                                              |
 Completion Queue |              |        |        |        |        |        |        |                                              |
                  |              |        |        |        |        |        |        |                                              |
                  +--------------|--------|--------|--------|--------|--------|--------|----------------------------------------------+
                                 |                 |                                   |                                               
                                 |                 |                                   |                                               
                                 |-----------------|                                   |                                               
                                  A) T1 syscall    |                                   |                                               
                                  writes 2         |                                   |                                               
                                  descriptors      |-----------------------------------|                                               
                                                    B) T2 syscall writes 4 descriptors                                                 
                                                                                                                                       
                                                                                                                                       
                                                                                                                                       
                                                                                                                                       
                 Notes:
                 1) T1 and T2 are two different AF_XDP sockets, so
                    __xsk_generic_xmit() will take two different mutexes.
                 2) T1 and T2 can execute simultaneously; there is no
                    critical section whatsoever shared between them.
                 3) T1 and T2 take the Completion Queue lock only for
                    slot acquire + write; nothing else is under the lock.
                 4) T1 and T2 completions (skb destructors) do not have
                    to run in the same order as A) and B).
                 5) What if T1 fails after T2 has acquired its slots?
                    cached_writer will be decreased by 2, so T2 will end
                    up submitting the failed descriptors of T1 (which
                    should instead be retransmitted in the next TX).
                    Submission will move the writer by 4 slots, but 2 of
                    those slots hold the failed T1 values, and the last
                    2 slots of T2 will never be submitted: a descriptor
                    leak. (See the cancel sketch after these notes.)
                 6) What if T2 completes before T1? The writer will be
                    moved by 4 slots, 2 of which were filled by T1. T2
                    will complete its own 2 slots plus 2 slots of T1,
                    which is wrong; T1 will then complete the last 2
                    slots of T2, which is also wrong.
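
The failure path in note 5 corresponds, roughly, to the reservation
rollback in __xsk_generic_xmit(). A paraphrased sketch (the real helper
sits in net/xdp/xsk_queue.h; name and shape are approximate):

/* Roll back the last cnt reservations by moving the cached producer
 * backwards. The rollback is purely positional: it cannot tell which
 * slots belonged to the failed send, so if another socket reserved
 * slots in the meantime, the wrong descriptors get discarded. */
static void xskq_prod_cancel_n(struct xsk_queue *q, u32 cnt)
{
	q->cached_prod -= cnt;
}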

This out-of-order completion can effectively cause a user-space <->
kernel-space data race. This patch solves that by only acquiring
cached_writer first, and performing the completion (submission, i.e.
write + increase writer) afterwards, in a single step (see the sketch
below). This is the only way to make it bulletproof against
multithreaded access, failures, and out-of-order skb completions.
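
A minimal sketch of that ordering, assuming the umem addresses for the
skb are kept around until destructor time. xskq_prod_write_addr() here
is a hypothetical helper standing in for "write a descriptor at offset
i past the global writer"; it is not an existing xsk_queue.h function:

/* Called from the skb destructor: write the descriptors and publish
 * them to user-space in one critical section, so no other socket can
 * observe (or submit) half-finished slots. */
static void xsk_cq_write_and_submit_locked(struct xsk_buff_pool *pool,
					   u64 *addrs, u32 n)
{
	unsigned long flags;
	u32 i;

	spin_lock_irqsave(&pool->cq_lock, flags);
	for (i = 0; i < n; i++)
		xskq_prod_write_addr(pool->cq, i, addrs[i]); /* hypothetical */
	xskq_prod_submit_n(pool->cq, n); /* producer += n */
	spin_unlock_irqrestore(&pool->cq_lock, flags);
}

With this shape, a failed send simply never calls the function, and
out-of-order destructors each publish only their own descriptors.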

> This is definitely a no-go (sk_buff and skb_shared_info space is
> precious).

Okay, so where should I store it? Can you give me some advice?

I left it there because all the information related to skb destruction
lives there. Additionally, this is the only place in skb-related code
that defines anything related to xsk: metadata and the number of
descriptors. sk_buff itself doesn't. I need to hold this information
somewhere, and sk_buff or skb_shared_info are the only places I can
store it. It needs to be invariant across all skb fragments and be
released after the skb completes.
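
One possible placement, sketched below: stash the per-skb xsk state
behind skb_shinfo(skb)->destructor_arg (an existing void * field in
struct skb_shared_info) instead of growing sk_buff or skb_shared_info
itself. The struct and helpers here are hypothetical illustrations,
not what the patch proposes:

#include <linux/skbuff.h>

/* Hypothetical per-skb xsk completion state, shared by all fragments
 * and freed by the skb destructor. */
struct xsk_skb_meta {
	u32 num_descs;	/* how many CQ descriptors this skb covers */
	u64 addrs[];	/* umem addresses to complete at destructor time */
};

static void xsk_skb_set_meta(struct sk_buff *skb, struct xsk_skb_meta *meta)
{
	skb_shinfo(skb)->destructor_arg = meta;
}

static struct xsk_skb_meta *xsk_skb_get_meta(struct sk_buff *skb)
{
	return skb_shinfo(skb)->destructor_arg;
}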
