Message-ID: <CAJ8uoz2DtSGWy3EcQoAUz7BmJW4znAcb+xhfP7rE__Ddy71W9g@mail.gmail.com>
Date: Thu, 10 Apr 2025 09:52:10 +0200
From: Magnus Karlsson <magnus.karlsson@...il.com>
To: "e.kubanski" <e.kubanski@...tner.samsung.com>
Cc: magnus.karlsson@...el.com, linux-kernel@...r.kernel.org, 
	netdev@...r.kernel.org, bjorn@...nel.org, maciej.fijalkowski@...el.com, 
	jonathan.lemon@...il.com
Subject: Re: [PATCH] xsk: Fix race condition in AF_XDP generic RX path

On Wed, 9 Apr 2025 at 16:21, e.kubanski <e.kubanski@...tner.samsung.com> wrote:
>
> > I do not fully understand what you are doing in user space. Could you
> > please provide a user-space code example that will trigger this
> > problem?
>
> We want to scale a single-hardware-queue AF_XDP setup to
> receive packets on multiple threads through the RPS mechanism.
> The problem arises when RPS is enabled in the kernel: traffic
> from a single hardware queue can then be spread across multiple
> CPU cores. We then perform XDP/eBPF load-balancing across
> multiple sockets, keyed on the CPU ID of the issuing XDP call.
>
> Every socket is bound to queue number 0; the device has a single queue.
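>
> For illustration, the XDP side of this CPU-keyed load balancing can be
> sketched as below (restricted C, as XDP programs are; the map and
> program names are placeholders, not our exact production program):
> ```
> #include <linux/bpf.h>
> #include <bpf/bpf_helpers.h>
>
> /* One AF_XDP socket per CPU, keyed by CPU id. */
> struct {
>     __uint(type, BPF_MAP_TYPE_XSKMAP);
>     __uint(max_entries, 64);    /* >= number of CPUs */
>     __type(key, __u32);
>     __type(value, __u32);
> } xsks_map SEC(".maps");
>
> SEC("xdp")
> int xdp_cpu_balance(struct xdp_md *ctx)
> {
>     __u32 cpu = bpf_get_smp_processor_id();
>
>     /* Redirect to the socket registered for this CPU;
>      * fall back to XDP_PASS if no socket is attached. */
>     return bpf_redirect_map(&xsks_map, cpu, XDP_PASS);
> }
>
> char _license[] SEC("license") = "GPL";
> ```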
>
> User-space socket setup looks more or less like this (with libxdp):
> ```
> xsk_ring_prod fq{};
> xsk_ring_cons cq{};
>
> xsk_umem_config umem_cfg{ ... };
> xsk_umem* umem;
> auto result = xsk_umem__create(&umem, umem_memory, pool_size_bytes, &fq, &cq, &umem_cfg);
>
> ...
>
> xsk_socket_config xsk_cfg{
>     ...
>     .xdp_flags = XDP_FLAGS_SKB_MODE,
>     ...
> };
>
> xsk_socket* sock1{nullptr};
> xsk_ring_cons rq1{};
> xsk_ring_prod tq1{};
> auto result = xsk_socket__create_shared(
>     &sock1,
>     device_name,
>     0,
>     &rq1,
>     &tq1,
>     &fq,
>     &cq,
>     &xsk_cfg
> );
>
> xsk_socket* sock2{nullptr};
> xsk_ring_cons rq2{};
> xsk_ring_prod tq2{};
> auto result = xsk_socket__create_shared(
>     &sock2,
>     device_name,
>     0,
>     &rq2,
>     &tq2,
>     &fq,
>     &cq,
>     &xsk_cfg
> );
>
> ...
> ```
>
> We're working on cloud-native deployments, where
> it's not possible to scale RX through the RSS mechanism alone.
>
> That's why we wanted to use RPS to scale not only
> user-space processing but also XDP processing.
>
> This patch effectively allows us to use RPS to scale XDP
> in Generic mode.
>
> The same applies with RPS disabled, where we use a MACVLAN
> child device attached to a parent device with multiple queues.
> In this situation MACVLAN allows for multi-core kernel-side
> processing, but the xsk_buff_pool isn't protected.
>
> We can't do any passthrough in this situation; we must rely
> on MACVLAN with a single RX/TX queue pair.
>
> Of course this is not a problem in setups where every packet
> from the device is processed on a single core.

Thanks, this really helped. You are correct that there is a race and
that the previous fix (from 6 years ago) did not take this shared umem
case into account. I am fine with your fix; just a couple of things you
need to address before resubmitting as a v2.

* Write [PATCH bpf] in your subject and base your code on the bpf
tree, if you did not do that already.

* You need to add a Fixes tag.
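It has the usual form; the hash and subject below are just placeholders
for the commit that introduced the bug:

  Fixes: 123456789abc ("xsk: subject of the offending commit")

git log -1 --pretty=format:'Fixes: %h ("%s")' --abbrev=12 <commit> will
generate it for you.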

And yes, let us worry about performance at some later stage, if it
needs addressing at all. The important thing for this patch is to make
the code solid.

Thank you for spotting this. Highly appreciated.

> > Please note that if you share an Rx ring or the fill ring between
> > processes/threads, then you have to take care of mutual exclusion
> > in user space.
>
> Of course. RX/TX/FILL/COMP are SPSC queues; we added mutual exclusion
> for FILL/COMP, while RX/TX are each accessed by a single thread and
> need none. I'm running a single-process deployment with multiple
> threads, where every thread has its own AF_XDP socket and the pool is
> shared across threads.
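>
> For illustration, the shared FILL ring refill path is serialized
> roughly like this (a simplified sketch using the standard libxdp ring
> helpers; names are placeholders, not our exact code):
> ```
> #include <cstdint>
> #include <mutex>
>
> #include <xdp/xsk.h>
>
> // Guards the FILL ring shared by all RX threads; each thread still
> // owns its private RX ring, so no lock is needed there.
> std::mutex fq_mutex;
>
> void refill(xsk_ring_prod* fq, const uint64_t* addrs, uint32_t n)
> {
>     std::lock_guard<std::mutex> lock(fq_mutex);
>
>     uint32_t idx = 0;
>     if (xsk_ring_prod__reserve(fq, n, &idx) != n)
>         return;  // not enough free slots, caller retries later
>
>     for (uint32_t i = 0; i < n; ++i)
>         *xsk_ring_prod__fill_addr(fq, idx + i) = addrs[i];
>
>     xsk_ring_prod__submit(fq, n);
> }
> ```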

Good. Just wanted to make sure.

> > If you really want to do this, it is usually a better
> > idea to use the other shared umem mode in which each process gets its
> > own rx and fill ring, removing the need for mutual exclusion.
>
> If I understand the AF_XDP architecture correctly, that's not possible
> for a single-queue deployment, or maybe I'm missing something? We need
> to maintain a single FILL/COMP pair per device queue.

This is correct. You cannot use that mode in your setup.

>
