Date:   Tue, 20 Dec 2022 10:06:48 +0100
From:   Magnus Karlsson <magnus.karlsson@...il.com>
To:     Shawn Bohrer <sbohrer@...udflare.com>
Cc:     netdev@...r.kernel.org, bpf@...r.kernel.org, bjorn@...nel.org,
        magnus.karlsson@...el.com, kernel-team@...udflare.com
Subject: Re: Possible race with xsk_flush

On Tue, Dec 20, 2022 at 2:32 AM Shawn Bohrer <sbohrer@...udflare.com> wrote:
>
> On Fri, Dec 16, 2022 at 11:05:19AM +0100, Magnus Karlsson wrote:
> > To summarize, we are expecting this ordering:
> >
> > CPU 0 __xsk_rcv_zc()
> > CPU 0 __xsk_map_flush()
> > CPU 2 __xsk_rcv_zc()
> > CPU 2 __xsk_map_flush()
> >
> > But we are seeing this order:
> >
> > CPU 0 __xsk_rcv_zc()
> > CPU 2 __xsk_rcv_zc()
> > CPU 0 __xsk_map_flush()
> > CPU 2 __xsk_map_flush()
> >
> > Here is the veth NAPI poll loop:
> >
> > static int veth_poll(struct napi_struct *napi, int budget)
> > {
> >     struct veth_rq *rq =
> >         container_of(napi, struct veth_rq, xdp_napi);
> >     struct veth_stats stats = {};
> >     struct veth_xdp_tx_bq bq;
> >     int done;
> >
> >     bq.count = 0;
> >
> >     xdp_set_return_frame_no_direct();
> >     done = veth_xdp_rcv(rq, budget, &bq, &stats);
> >
> >     if (done < budget && napi_complete_done(napi, done)) {
> >         /* Write rx_notify_masked before reading ptr_ring */
> >         smp_store_mb(rq->rx_notify_masked, false);
> >         if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
> >             if (napi_schedule_prep(&rq->xdp_napi)) {
> >                 WRITE_ONCE(rq->rx_notify_masked, true);
> >                 __napi_schedule(&rq->xdp_napi);
> >             }
> >         }
> >     }
> >
> >     if (stats.xdp_tx > 0)
> >         veth_xdp_flush(rq, &bq);
> >     if (stats.xdp_redirect > 0)
> >         xdp_do_flush();
> >     xdp_clear_return_frame_no_direct();
> >
> >     return done;
> > }
> >
> > Something I have never seen before is that there is a
> > napi_complete_done() and a __napi_schedule() before xdp_do_flush().
> > Let us check if this has something to do with it. So two
> > suggestions, to be executed separately:
> >
> > * Put a probe at the __napi_schedule() above and check if it gets
> > triggered before this problem
> > * Move the "if (stats.xdp_redirect > 0) xdp_do_flush();" to just
> > before "if (done < budget && napi_complete_done(napi, done)) {"
> >
> > This might provide us some hints on what is going on.
>
> After staring at this code for way too long I finally made a
> breakthrough!  I could not understand how this race could occur when
> napi_poll() calls netpoll_poll_lock().  Here is netpoll_poll_lock():
>
> ```
>   static inline void *netpoll_poll_lock(struct napi_struct *napi)
>   {
>     struct net_device *dev = napi->dev;
>
>     if (dev && dev->npinfo) {
>       int owner = smp_processor_id();
>
>       while (cmpxchg(&napi->poll_owner, -1, owner) != -1)
>         cpu_relax();
>
>       return napi;
>     }
>     return NULL;
>   }
> ```
> If dev or dev->npinfo is NULL, then it doesn't acquire a lock at all!
> Adding some more trace points, I see:
>
> ```
>   iperf2-1325    [002] ..s1. 264246.626880: __napi_poll: (__napi_poll+0x0/0x150) n=0xffff91c885bff000 poll_owner=-1 dev=0xffff91c881d4e000 npinfo=0x0
>   iperf2-1325    [002] d.Z1. 264246.626882: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x3b/0xc0) addr=0x1503100 len=0x42 xs=0xffff91c8bfe77000 fq=0xffff91c8c1a43f80 dev=0xffff91c881d4e000
>   iperf2-1325    [002] d.Z1. 264246.626883: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x42/0xc0) addr=0x1503100 len=0x42 xs=0xffff91c8bfe77000 fq=0xffff91c8c1a43f80 dev=0xffff91c881d4e000
>   iperf2-1325    [002] d.Z1. 264246.626884: xsk_flush: (__xsk_map_flush+0x32/0xb0) xs=0xffff91c8bfe77000
> ```
>
> Here you can see that poll_owner=-1, meaning the lock was never
> acquired because npinfo is NULL.  This means that the same veth rx
> queue can be napi_polled from multiple CPUs and nothing stops those
> polls from running concurrently.  They all look like this; most of
> the time there just aren't concurrent napi_polls running for the same
> queue.  They do, however, move around CPUs as I explained earlier.
>
> I'll note that I've run with your suggested change of moving
> xdp_do_flush() before napi_complete_done() all weekend and I have not
> reproduced the issue.  I don't know if that truly means the issue is
> fixed by that change or not.  I suspect it does fix the issue because
> it prevents the napi_struct from being scheduled again before the
> first poll has completed, and napi_schedule_prep() ensures that only
> one instance is ever running.

Thanks, Shawn! Good news that the patch seems to fix the problem. As I
understand it, napi_schedule_prep() makes sure that only one instance
of the same NAPI is running. Netpoll is an optional feature and I do
not even have it compiled into my kernel: at least it is not enabled
in my .config and I cannot find any netpoll symbols with readelf. If
netpoll is not used, I would expect npinfo == NULL, so to me it is
still a mystery why this is happening.
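
For reference, the property I am relying on here is roughly the
following. This is a simplified sketch with a renamed helper, not the
exact upstream implementation (the real napi_schedule_prep() also
handles NAPI_STATE_DISABLE and NAPI_STATE_MISSED):

```
#include <linux/bitops.h>
#include <linux/netdevice.h>

/* Simplified sketch: atomically set the SCHED bit and report success
 * only if it was previously clear, so only one caller at a time can
 * "own" the NAPI instance until napi_complete_done() clears the bit
 * again.
 */
static inline bool napi_schedule_prep_sketch(struct napi_struct *n)
{
    return !test_and_set_bit(NAPI_STATE_SCHED, &n->state);
}
```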

To validate or disprove that there are two instances of the same napi
running, could you please put a probe in veth_poll() right after
veth_xdp_rcv() and one at xdp_do_flush() and dump the napi id in both
cases? And run the original code with the bug :-). It would be good to
know what exactly happens in the code between these points when this
bug occurs. Maybe we could do this by enabling the trace buffer and
dumping it when the bug occurs.
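
As for the probes, something like the following is what I have in
mind. It is only an illustrative sketch; the trace_printk() calls and
their exact placement inside veth_poll() are an example, not a
required implementation:

```
    /* Inside veth_poll(): dump the napi id and CPU right after
     * veth_xdp_rcv() and right before xdp_do_flush(), so two
     * concurrent polls of the same NAPI would show up in
     * /sys/kernel/debug/tracing/trace.
     */
    done = veth_xdp_rcv(rq, budget, &bq, &stats);
    trace_printk("veth_poll rcv:   napi_id=%u cpu=%d done=%d\n",
                 napi->napi_id, smp_processor_id(), done);

    /* ... rest of veth_poll() unchanged ... */

    if (stats.xdp_redirect > 0) {
        trace_printk("veth_poll flush: napi_id=%u cpu=%d\n",
                     napi->napi_id, smp_processor_id());
        xdp_do_flush();
    }
```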

> If we think this is the correct fix I'll let it run for another day or
> two and prepare a patch.

There is one more advantage to this bug fix, and that is performance.
By calling xdp_do_flush() immediately after the receive function, user
space can start to work on the packets sooner, so performance will
improve.
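
Concretely, the reordering I have in mind is the quoted veth_poll()
with only the xdp_do_flush() call moved to right after veth_xdp_rcv().
This is a sketch of the idea, not a final patch:

```
/* Sketch: flush XDP redirects before napi_complete_done(), so a NAPI
 * instance rescheduled on another CPU cannot start before this CPU
 * has flushed, and so user space sees the descriptors sooner.
 */
static int veth_poll(struct napi_struct *napi, int budget)
{
    struct veth_rq *rq =
        container_of(napi, struct veth_rq, xdp_napi);
    struct veth_stats stats = {};
    struct veth_xdp_tx_bq bq;
    int done;

    bq.count = 0;

    xdp_set_return_frame_no_direct();
    done = veth_xdp_rcv(rq, budget, &bq, &stats);

    /* Moved up: flush redirects while we still own the NAPI */
    if (stats.xdp_redirect > 0)
        xdp_do_flush();

    if (done < budget && napi_complete_done(napi, done)) {
        /* Write rx_notify_masked before reading ptr_ring */
        smp_store_mb(rq->rx_notify_masked, false);
        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
            if (napi_schedule_prep(&rq->xdp_napi)) {
                WRITE_ONCE(rq->rx_notify_masked, true);
                __napi_schedule(&rq->xdp_napi);
            }
        }
    }

    if (stats.xdp_tx > 0)
        veth_xdp_flush(rq, &bq);
    xdp_clear_return_frame_no_direct();

    return done;
}
```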

> Thanks,
> Shawn Bohrer
