netdev - Re: [PATCH bpf] xsk: fix race in SKB mode transmit with shared cq

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJ8uoz15TKR7Tpx_brUut7XwHcEMCq90xOh7xFHhp-4+7BOMTQ@mail.gmail.com>
Date:   Fri, 11 Dec 2020 08:47:31 +0100
From:   Magnus Karlsson <magnus.karlsson@...il.com>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     "Karlsson, Magnus" <magnus.karlsson@...el.com>,
        Björn Töpel <bjorn.topel@...el.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Network Development <netdev@...r.kernel.org>,
        bpf <bpf@...r.kernel.org>,
        Jonathan Lemon <jonathan.lemon@...il.com>,
        "Fijalkowski, Maciej" <maciej.fijalkowski@...el.com>,
        Maciej Fijalkowski <maciejromanfijalkowski@...il.com>,
        Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
Subject: Re: [PATCH bpf] xsk: fix race in SKB mode transmit with shared cq

On Thu, Dec 10, 2020 at 10:53 PM Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
>
> On Thu, Dec 10, 2020 at 7:36 AM Magnus Karlsson
> <magnus.karlsson@...il.com> wrote:
> >
> > From: Magnus Karlsson <magnus.karlsson@...el.com>
> >
> > Fix a race when multiple sockets are simultaneously calling sendto()
> > when the completion ring is shared in the SKB case. This is the case
> > when you share the same netdev and queue id through the
> > XDP_SHARED_UMEM bind flag. The problem is that multiple processes can
> > be in xsk_generic_xmit() and call the backpressure mechanism in
> > xskq_prod_reserve(xs->pool->cq). As this is a shared resource in this
> > specific scenario, a race might occur since the rings are
> > single-producer single-consumer.
> >
> > Fix this by moving the tx_completion_lock from the socket to the pool
> > as the pool is shared between the sockets that share the completion
> > ring. (The pool is not shared when this is not the case.) And then
> > protect the accesses to xskq_prod_reserve() with this lock. The
> > tx_completion_lock is renamed cq_lock to better reflect that it
> > protects accesses to the potentially shared completion ring.
> >
> > Fixes: 35fcde7f8deb ("xsk: support for Tx")
> > Fixes: a9744f7ca200 ("xsk: fix potential race in SKB TX completion code")
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@...el.com>
> > Reported-by: Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
> > ---
> >  include/net/xdp_sock.h      | 4 ----
> >  include/net/xsk_buff_pool.h | 5 +++++
> >  net/xdp/xsk.c               | 9 ++++++---
> >  net/xdp/xsk_buff_pool.c     | 1 +
> >  4 files changed, 12 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > index 4f4e93bf814c..cc17bc957548 100644
> > --- a/include/net/xdp_sock.h
> > +++ b/include/net/xdp_sock.h
> > @@ -58,10 +58,6 @@ struct xdp_sock {
> >
> >         struct xsk_queue *tx ____cacheline_aligned_in_smp;
> >         struct list_head tx_list;
> > -       /* Mutual exclusion of NAPI TX thread and sendmsg error paths
> > -        * in the SKB destructor callback.
> > -        */
> > -       spinlock_t tx_completion_lock;
> >         /* Protects generic receive. */
> >         spinlock_t rx_lock;
> >
> > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > index 01755b838c74..eaa8386dbc63 100644
> > --- a/include/net/xsk_buff_pool.h
> > +++ b/include/net/xsk_buff_pool.h
> > @@ -73,6 +73,11 @@ struct xsk_buff_pool {
> >         bool dma_need_sync;
> >         bool unaligned;
> >         void *addrs;
> > +       /* Mutual exclusion of the completion ring in the SKB mode. Two cases to protect:
> > +        * NAPI TX thread and sendmsg error paths in the SKB destructor callback and when
> > +        * sockets share a single cq when the same netdev and queue id is shared.
> > +        */
> > +       spinlock_t cq_lock;
> >         struct xdp_buff_xsk *free_heads[];
> >  };
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 62504471fd20..42cb5f94d49e 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -364,9 +364,9 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> >         struct xdp_sock *xs = xdp_sk(skb->sk);
> >         unsigned long flags;
> >
> > -       spin_lock_irqsave(&xs->tx_completion_lock, flags);
> > +       spin_lock_irqsave(&xs->pool->cq_lock, flags);
> >         xskq_prod_submit_addr(xs->pool->cq, addr);
> > -       spin_unlock_irqrestore(&xs->tx_completion_lock, flags);
> > +       spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
> >
> >         sock_wfree(skb);
> >  }
> > @@ -378,6 +378,7 @@ static int xsk_generic_xmit(struct sock *sk)
> >         bool sent_frame = false;
> >         struct xdp_desc desc;
> >         struct sk_buff *skb;
> > +       unsigned long flags;
> >         int err = 0;
> >
> >         mutex_lock(&xs->mutex);
> > @@ -409,10 +410,13 @@ static int xsk_generic_xmit(struct sock *sk)
> >                  * if there is space in it. This avoids having to implement
> >                  * any buffering in the Tx path.
> >                  */
> > +               spin_lock_irqsave(&xs->pool->cq_lock, flags);
> >                 if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
> > +                       spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
> >                         kfree_skb(skb);
> >                         goto out;
> >                 }
> > +               spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
>
> Lock/unlock for every packet?
> Do you have any performance concerns?

I have measured and the performance impact of this is in the noise.
There is unfortunately already a lot of code being executed per packet
in this path. On the positive side, once this bug fix trickles down to
bpf-next, I will submit a rather simple patch to this function that
improves throughput by 15% for the txpush microbenchmark. This will
more than compensate for any locking introduced.