netdev - Re: [PATCH bpf] xsk: fix race in SKB mode transmit with shared cq

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAJ8uoz3RORXKp-Dvv0OtvSNoLVeu5ZqH0UeNU=PhJ=UwhgrXfA@mail.gmail.com>
Date:   Mon, 14 Dec 2020 10:00:42 +0100
From:   Magnus Karlsson <magnus.karlsson@...il.com>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     "Karlsson, Magnus" <magnus.karlsson@...el.com>,
        Björn Töpel <bjorn.topel@...el.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Network Development <netdev@...r.kernel.org>,
        bpf <bpf@...r.kernel.org>,
        Jonathan Lemon <jonathan.lemon@...il.com>,
        "Fijalkowski, Maciej" <maciej.fijalkowski@...el.com>,
        Maciej Fijalkowski <maciejromanfijalkowski@...il.com>,
        Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
Subject: Re: [PATCH bpf] xsk: fix race in SKB mode transmit with shared cq

On Fri, Dec 11, 2020 at 8:47 AM Magnus Karlsson
<magnus.karlsson@...il.com> wrote:
>
> On Thu, Dec 10, 2020 at 10:53 PM Alexei Starovoitov
> <alexei.starovoitov@...il.com> wrote:
> >
> > On Thu, Dec 10, 2020 at 7:36 AM Magnus Karlsson
> > <magnus.karlsson@...il.com> wrote:
> > >
> > > From: Magnus Karlsson <magnus.karlsson@...el.com>
> > >
> > > Fix a race when multiple sockets are simultaneously calling sendto()
> > > when the completion ring is shared in the SKB case. This is the case
> > > when you share the same netdev and queue id through the
> > > XDP_SHARED_UMEM bind flag. The problem is that multiple processes can
> > > be in xsk_generic_xmit() and call the backpressure mechanism in
> > > xskq_prod_reserve(xs->pool->cq). As this is a shared resource in this
> > > specific scenario, a race might occur since the rings are
> > > single-producer single-consumer.
> > >
> > > Fix this by moving the tx_completion_lock from the socket to the pool
> > > as the pool is shared between the sockets that share the completion
> > > ring. (The pool is not shared when this is not the case.) And then
> > > protect the accesses to xskq_prod_reserve() with this lock. The
> > > tx_completion_lock is renamed cq_lock to better reflect that it
> > > protects accesses to the potentially shared completion ring.
> > >
> > > Fixes: 35fcde7f8deb ("xsk: support for Tx")
> > > Fixes: a9744f7ca200 ("xsk: fix potential race in SKB TX completion code")
> > > Signed-off-by: Magnus Karlsson <magnus.karlsson@...el.com>
> > > Reported-by: Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
> > > ---
> > >  include/net/xdp_sock.h      | 4 ----
> > >  include/net/xsk_buff_pool.h | 5 +++++
> > >  net/xdp/xsk.c               | 9 ++++++---
> > >  net/xdp/xsk_buff_pool.c     | 1 +
> > >  4 files changed, 12 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > index 4f4e93bf814c..cc17bc957548 100644
> > > --- a/include/net/xdp_sock.h
> > > +++ b/include/net/xdp_sock.h
> > > @@ -58,10 +58,6 @@ struct xdp_sock {
> > >
> > >         struct xsk_queue *tx ____cacheline_aligned_in_smp;
> > >         struct list_head tx_list;
> > > -       /* Mutual exclusion of NAPI TX thread and sendmsg error paths
> > > -        * in the SKB destructor callback.
> > > -        */
> > > -       spinlock_t tx_completion_lock;
> > >         /* Protects generic receive. */
> > >         spinlock_t rx_lock;
> > >
> > > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > > index 01755b838c74..eaa8386dbc63 100644
> > > --- a/include/net/xsk_buff_pool.h
> > > +++ b/include/net/xsk_buff_pool.h
> > > @@ -73,6 +73,11 @@ struct xsk_buff_pool {
> > >         bool dma_need_sync;
> > >         bool unaligned;
> > >         void *addrs;
> > > +       /* Mutual exclusion of the completion ring in the SKB mode. Two cases to protect:
> > > +        * NAPI TX thread and sendmsg error paths in the SKB destructor callback and when
> > > +        * sockets share a single cq when the same netdev and queue id is shared.
> > > +        */
> > > +       spinlock_t cq_lock;
> > >         struct xdp_buff_xsk *free_heads[];
> > >  };
> > >
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index 62504471fd20..42cb5f94d49e 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -364,9 +364,9 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > >         struct xdp_sock *xs = xdp_sk(skb->sk);
> > >         unsigned long flags;
> > >
> > > -       spin_lock_irqsave(&xs->tx_completion_lock, flags);
> > > +       spin_lock_irqsave(&xs->pool->cq_lock, flags);
> > >         xskq_prod_submit_addr(xs->pool->cq, addr);
> > > -       spin_unlock_irqrestore(&xs->tx_completion_lock, flags);
> > > +       spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
> > >
> > >         sock_wfree(skb);
> > >  }
> > > @@ -378,6 +378,7 @@ static int xsk_generic_xmit(struct sock *sk)
> > >         bool sent_frame = false;
> > >         struct xdp_desc desc;
> > >         struct sk_buff *skb;
> > > +       unsigned long flags;
> > >         int err = 0;
> > >
> > >         mutex_lock(&xs->mutex);
> > > @@ -409,10 +410,13 @@ static int xsk_generic_xmit(struct sock *sk)
> > >                  * if there is space in it. This avoids having to implement
> > >                  * any buffering in the Tx path.
> > >                  */
> > > +               spin_lock_irqsave(&xs->pool->cq_lock, flags);
> > >                 if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
> > > +                       spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
> > >                         kfree_skb(skb);
> > >                         goto out;
> > >                 }
> > > +               spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
> >
> > Lock/unlock for every packet?
> > Do you have any performance concerns?
>
> I have measured and the performance impact of this is in the noise.
> There is unfortunately already a lot of code being executed per packet
> in this path. On the positive side, once this bug fix trickles down to
> bpf-next, I will submit a rather simple patch to this function that
> improves throughput by 15% for the txpush microbenchmark. This will
> more than compensate for any locking introduced.

Please ignore/drop this patch. I will include this one as patch 1 in a
patch set of 2 that I will submit. The second patch will be a fix to
the reservation problem spotted by Xuan and it needs the locking of
the first patch to be able to work.