Date:   Thu, 15 Dec 2022 18:14:44 -0600
From:   Shawn Bohrer <sbohrer@...udflare.com>
To:     Magnus Karlsson <magnus.karlsson@...il.com>
Cc:     netdev@...r.kernel.org, bpf@...r.kernel.org, bjorn@...nel.org,
        magnus.karlsson@...el.com, kernel-team@...udflare.com
Subject: Re: Possible race with xsk_flush

On Thu, Dec 15, 2022 at 11:22:05AM +0100, Magnus Karlsson wrote:
> Thanks Shawn for your detailed bug report. The rings between user
> space and kernel space are single-producer/single-consumer, so in your
> case when both CPU0 and CPU2 are working on the fill ring and the Rx
> ring at the same time, this will indeed produce races. The intended
> design principle to protect against this is that only one NAPI context
> can access any given ring at any point in time and that the NAPI logic
> should prevent two instances of the same NAPI instance from running at
> the same time. So if that is not true for some reason, we would get a
> race like this. Another option is that one of the CPUs should really
> process another fill ring instead of the same.
> 
> Do you see the second socket being worked on when this happens?
> 
> Could you please share how you set up the two AF_XDP sockets?

Alex Forster sent more details on the configuration, but just to
reiterate, there are actually 8 AF_XDP sockets in this test setup.
There are two veth interfaces and each interface has four receive
queues.  We create one socket per interface/queue pair.  Our XDP
program redirects each packet to the correct AF_XDP socket based on
the queue number.
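
For reference, the redirect logic is just the usual XSKMAP-by-queue
pattern.  A minimal sketch (not our exact program; the map and
function names here are made up):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* One AF_XDP socket registered per (interface, queue) pair. */
  struct {
          __uint(type, BPF_MAP_TYPE_XSKMAP);
          __uint(max_entries, 4);          /* four rx queues per veth */
          __type(key, __u32);
          __type(value, __u32);
  } xsks_map SEC(".maps");

  SEC("xdp")
  int redirect_by_queue(struct xdp_md *ctx)
  {
          /* Redirect to the socket bound to this rx queue; XDP_PASS
           * if no socket is attached to the map slot. */
          return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
  }

  char _license[] SEC("license") = "GPL";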

Yes, there is often activity on other sockets near the time when the
bug occurs.  This is why I'm printing xs/fq (the socket address and
fill queue address) and printing the ingress/egress device name and
queue number in my prints.  This allows me to match up the user space
and kernel space prints.  Additionally, we are using a shared UMEM, so
descriptors could move around between sockets.  I've tried to minimize
this, and in every case I've seen so far the mystery descriptor was
last used on the same socket and had also been in the fill queue, just
not next in line.
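
For what it's worth, the user-space side of that correlation is
nothing fancy.  Roughly (illustrative sketch, assuming the xsk.h
helpers from libbpf/libxdp; the function name is made up), every
fill-ring refill is logged with the device/queue it belongs to:

  #include <stdio.h>
  #include <bpf/xsk.h>

  /* Log every address posted to a socket's fill ring together with the
   * device name and queue number, which is what gets lined up against
   * the device/queue printed by the kernel-side probes. */
  static void refill_and_log(struct xsk_ring_prod *fq, const char *ifname,
                             int qid, const __u64 *addrs, unsigned int n)
  {
          __u32 idx;
          unsigned int i;

          if (xsk_ring_prod__reserve(fq, n, &idx) != n)
                  return;                  /* fill ring full, retry later */

          for (i = 0; i < n; i++) {
                  fprintf(stderr, "fill %s q%d addr=0x%llx\n", ifname, qid,
                          (unsigned long long)addrs[i]);
                  *xsk_ring_prod__fill_addr(fq, idx + i) = addrs[i];
          }
          xsk_ring_prod__submit(fq, n);
  }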

> Are you using XDP_DRV mode in your tests?
> 
> > A couple more notes:
> > * The ftrace print order and timestamps seem to indicate that the CPU
> >   2 napi_poll is running before the CPU 0 xsk_flush().  I don't know
> >   if these timestamps can be trusted but it does imply that maybe this
> >   can race as I described.  I've triggered this twice with xsk_flush
> >   probes and both show the order above.
> > * In the 3 times I've triggered this it has occurred right when the
> >   softirq processing switches CPUs
> 
> This is interesting. Could you check, in some way, if you only have
> one core working on the fill ring before the softirq switching and
> then after that you have two? And if you have two, is that period
> transient?

I think what you are asking is why the softirq processing switches
CPUs.  There is still a lot I don't fully understand here, but I've
tried to dig into this, if only to make the bug happen more frequently
and thus easier to reproduce.

In this test setup there is no hardware IRQ.  iperf2 sends the packet
and the CPU where iperf is running runs the veth softirq.  I'm not
sure how it picks which veth receive queue receives the packets, but
they end up distributed across the veth queues.  Additionally,
__veth_xdp_flush() calls __napi_schedule(), and it is called from
veth_xdp_xmit(), which I think means that transmitting packets from
AF_XDP also schedules the softirq on the current CPU for that veth
queue.  What I definitely see is that if I pin both iperf and my
application to a single CPU, all softirqs for all queues run on that
single CPU.  If I pin iperf2 to one core and my application to another
core, I get softirqs for all veth queues on both cores.
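
For anyone following along, the path I mean looks roughly like this
(paraphrased from drivers/net/veth.c, so details will differ between
kernel versions; not a verbatim quote):

  /* veth_xdp_xmit() pushes frames onto the peer rq's ptr_ring and then
   * calls __veth_xdp_flush(), which schedules that rq's NAPI on
   * whatever CPU the sender happens to be running on. */
  static void __veth_xdp_flush(struct veth_rq *rq)
  {
          if (!rq->rx_notify_masked &&
              napi_schedule_prep(&rq->xdp_napi)) {
                  rq->rx_notify_masked = true;
                  /* raises NET_RX_SOFTIRQ on the current CPU */
                  __napi_schedule(&rq->xdp_napi);
          }
  }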

In our test setup we aren't applying any CPU affinity.  iperf2 is
multi-threaded and can run on all 4 cores, and our application is
multi-threaded and can run on all 4 cores.  The NAPI scheduling seems
to be per veth queue, and yes, I see those softirqs move and switch
between CPUs.  I don't, however, have anything that clearly shows it
running concurrently on two CPUs (the stretches of __xsk_rcv_zc are
all on one core before it switches).  The closest I have is the
several microseconds where xsk_flush() appears to overlap at the end
of my traces.  I would think that if the NAPI locking didn't work at
all you'd see clear overlap.

Based on my CPU affinity experiments, I've updated my test setup on
one of my test boxes to frequently change the CPU affinity of iperf
and our application, in hopes that this helps reproduce the bug, but I
have no results so far.
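
Concretely, the churn is just something along these lines (whether
done in-process like this or externally with taskset; the helper name
and interval are made up):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* Periodically move the calling thread onto a random one of the
   * 4 CPUs, to shake the softirq/NAPI placement around. */
  static void churn_affinity(void)
  {
          for (;;) {
                  cpu_set_t set;

                  CPU_ZERO(&set);
                  CPU_SET(rand() % 4, &set);
                  sched_setaffinity(0, sizeof(set), &set); /* 0 == self */
                  usleep(250 * 1000);                      /* every 250 ms */
          }
  }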

> > * I've also triggered this before I added the xsk_flush() probe and
> >   in that case saw the kernel side additionally fill in the next
> >   expected descriptor, which in the example above would be 0xfe4100.
> >   This seems to indicate that my tracking is all still sane.
> > * This is fairly reproducible, but I've got 10 test boxes running and
> >   I only get maybe a bug a day.
> >
> > Any thoughts on if the bug I described is actually possible,
> > alternative theories, or other things to test/try would be welcome.
> 
> I thought this would be impossible, but apparently not :-). We are
> apparently doing something wrong in the AF_XDP code or have the wrong
> assumptions in some situation, but I just do not know what at this
> point in time. Maybe it is veth that breaks some of our assumptions,
> who knows. But let us dig into it. I need your help here, because I
> think it will be hard for me to reproduce the issue.

Yeah, if you have ideas on what to test, I'll do my best to try them.

I've additionally updated my application to put a bad "cookie"
descriptor address back into the consumed RX ring slots before
updating the consumer pointer.  My hope is that if we ever receive
that cookie back, it proves the kernel raced and failed to write the
correct address.
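
To spell out the cookie trick, it's roughly the following (sketch
only, again assuming the xsk.h helpers; drain_rx(), handle_packet()
and the sentinel value are made-up names, and the cast away from const
is deliberate):

  #include <stdio.h>
  #include <bpf/xsk.h>

  #define RX_COOKIE_ADDR 0xdeadbeef000ULL     /* never a valid UMEM addr */

  static void handle_packet(__u64 addr, __u32 len)
  {
          (void)addr; (void)len;              /* app-specific */
  }

  /* After consuming each RX descriptor, scribble the cookie back into
   * the slot before releasing it.  If the cookie ever comes back out
   * of the RX ring, the kernel advanced the producer pointer over a
   * slot it never (re)wrote -- i.e. the suspected race. */
  static void drain_rx(struct xsk_ring_cons *rx)
  {
          __u32 idx, i;
          __u32 n = xsk_ring_cons__peek(rx, 64, &idx);

          for (i = 0; i < n; i++) {
                  struct xdp_desc *d = (struct xdp_desc *)
                          xsk_ring_cons__rx_desc(rx, idx + i);

                  if (d->addr == RX_COOKIE_ADDR)
                          fprintf(stderr, "stale RX slot, kernel raced\n");
                  else
                          handle_packet(d->addr, d->len);

                  d->addr = RX_COOKIE_ADDR;    /* plant the cookie */
          }
          if (n)
                  xsk_ring_cons__release(rx, n);
  }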

Thanks,
Shawn Bohrer
