Message-ID: <CACGkMEtG40J1zZG4nSvviw4MqX+RPOVuHG9PHR-PiYcZLj38CQ@mail.gmail.com>
Date: Tue, 12 Mar 2024 14:07:30 +0800
From: Jason Wang <jasowang@...hat.com>
To: wangyunjian <wangyunjian@...wei.com>
Cc: "Michael S. Tsirkin" <mst@...hat.com>, Paolo Abeni <pabeni@...hat.com>, 
	"willemdebruijn.kernel@...il.com" <willemdebruijn.kernel@...il.com>, "kuba@...nel.org" <kuba@...nel.org>, 
	"bjorn@...nel.org" <bjorn@...nel.org>, "magnus.karlsson@...el.com" <magnus.karlsson@...el.com>, 
	"maciej.fijalkowski@...el.com" <maciej.fijalkowski@...el.com>, 
	"jonathan.lemon@...il.com" <jonathan.lemon@...il.com>, "davem@...emloft.net" <davem@...emloft.net>, 
	"bpf@...r.kernel.org" <bpf@...r.kernel.org>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>, 
	"virtualization@...ts.linux.dev" <virtualization@...ts.linux.dev>, xudingke <xudingke@...wei.com>, 
	"liwei (DT)" <liwei395@...wei.com>
Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

On Mon, Mar 11, 2024 at 9:28 PM wangyunjian <wangyunjian@...wei.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jason Wang [mailto:jasowang@...hat.com]
> > Sent: Monday, March 11, 2024 12:01 PM
> > To: wangyunjian <wangyunjian@...wei.com>
> > Cc: Michael S. Tsirkin <mst@...hat.com>; Paolo Abeni <pabeni@...hat.com>;
> > willemdebruijn.kernel@...il.com; kuba@...nel.org; bjorn@...nel.org;
> > magnus.karlsson@...el.com; maciej.fijalkowski@...el.com;
> > jonathan.lemon@...il.com; davem@...emloft.net; bpf@...r.kernel.org;
> > netdev@...r.kernel.org; linux-kernel@...r.kernel.org; kvm@...r.kernel.org;
> > virtualization@...ts.linux.dev; xudingke <xudingke@...wei.com>; liwei (DT)
> > <liwei395@...wei.com>
> > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> >
> > On Mon, Mar 4, 2024 at 9:45 PM wangyunjian <wangyunjian@...wei.com>
> > wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Michael S. Tsirkin [mailto:mst@...hat.com]
> > > > Sent: Friday, March 1, 2024 7:53 PM
> > > > To: wangyunjian <wangyunjian@...wei.com>
> > > > Cc: Paolo Abeni <pabeni@...hat.com>;
> > > > willemdebruijn.kernel@...il.com; jasowang@...hat.com;
> > > > kuba@...nel.org; bjorn@...nel.org; magnus.karlsson@...el.com;
> > > > maciej.fijalkowski@...el.com; jonathan.lemon@...il.com;
> > > > davem@...emloft.net; bpf@...r.kernel.org; netdev@...r.kernel.org;
> > > > linux-kernel@...r.kernel.org; kvm@...r.kernel.org;
> > > > virtualization@...ts.linux.dev; xudingke <xudingke@...wei.com>;
> > > > liwei (DT) <liwei395@...wei.com>
> > > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy
> > > > support
> > > >
> > > > On Fri, Mar 01, 2024 at 11:45:52AM +0000, wangyunjian wrote:
> > > > > > -----Original Message-----
> > > > > > From: Paolo Abeni [mailto:pabeni@...hat.com]
> > > > > > Sent: Thursday, February 29, 2024 7:13 PM
> > > > > > To: wangyunjian <wangyunjian@...wei.com>; mst@...hat.com;
> > > > > > willemdebruijn.kernel@...il.com; jasowang@...hat.com;
> > > > > > kuba@...nel.org; bjorn@...nel.org; magnus.karlsson@...el.com;
> > > > > > maciej.fijalkowski@...el.com; jonathan.lemon@...il.com;
> > > > > > davem@...emloft.net
> > > > > > Cc: bpf@...r.kernel.org; netdev@...r.kernel.org;
> > > > > > linux-kernel@...r.kernel.org; kvm@...r.kernel.org;
> > > > > > virtualization@...ts.linux.dev; xudingke <xudingke@...wei.com>;
> > > > > > liwei (DT) <liwei395@...wei.com>
> > > > > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy
> > > > > > support
> > > > > >
> > > > > > On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > > > > > > @@ -2661,6 +2776,54 @@ static int tun_ptr_peek_len(void *ptr)
> > > > > > >         }
> > > > > > >  }
> > > > > > >
> > > > > > > +static void tun_peek_xsk(struct tun_file *tfile)
> > > > > > > +{
> > > > > > > +       struct xsk_buff_pool *pool;
> > > > > > > +       u32 i, batch, budget;
> > > > > > > +       void *frame;
> > > > > > > +
> > > > > > > +       if (!ptr_ring_empty(&tfile->tx_ring))
> > > > > > > +               return;
> > > > > > > +
> > > > > > > +       spin_lock(&tfile->pool_lock);
> > > > > > > +       pool = tfile->xsk_pool;
> > > > > > > +       if (!pool) {
> > > > > > > +               spin_unlock(&tfile->pool_lock);
> > > > > > > +               return;
> > > > > > > +       }
> > > > > > > +
> > > > > > > +       if (tfile->nb_descs) {
> > > > > > > +               xsk_tx_completed(pool, tfile->nb_descs);
> > > > > > > +               if (xsk_uses_need_wakeup(pool))
> > > > > > > +                       xsk_set_tx_need_wakeup(pool);
> > > > > > > +       }
> > > > > > > +
> > > > > > > +       spin_lock(&tfile->tx_ring.producer_lock);
> > > > > > > +       budget = min_t(u32, tfile->tx_ring.size, TUN_XDP_BATCH);
> > > > > > > +
> > > > > > > +       batch = xsk_tx_peek_release_desc_batch(pool, budget);
> > > > > > > +       if (!batch) {
> > > > > >
> > > > > > This branch looks like an unneeded "optimization". The generic
> > > > > > loop below should have the same effect with no measurable perf
> > > > > > delta - and smaller code.
> > > > > > Just remove this.
> > > > > >
> > > > > > > +               tfile->nb_descs = 0;
> > > > > > > +               spin_unlock(&tfile->tx_ring.producer_lock);
> > > > > > > +               spin_unlock(&tfile->pool_lock);
> > > > > > > +               return;
> > > > > > > +       }
> > > > > > > +
> > > > > > > +       tfile->nb_descs = batch;
> > > > > > > +       for (i = 0; i < batch; i++) {
> > > > > > > +               /* Encode the XDP DESC flag into lowest bit for consumer to differ
> > > > > > > +                * XDP desc from XDP buffer and sk_buff.
> > > > > > > +                */
> > > > > > > +               frame = tun_xdp_desc_to_ptr(&pool->tx_descs[i]);
> > > > > > > +               /* The budget must be less than or equal to tx_ring.size,
> > > > > > > +                * so enqueuing will not fail.
> > > > > > > +                */
> > > > > > > +               __ptr_ring_produce(&tfile->tx_ring, frame);
> > > > > > > +       }
> > > > > > > +       spin_unlock(&tfile->tx_ring.producer_lock);
> > > > > > > +       spin_unlock(&tfile->pool_lock);
> > > > > >
> > > > > > More related to the general design: it looks wrong. What if
> > > > > > get_rx_bufs() fails (ENOBUFS) after a successful peek? With
> > > > > > no more incoming packets, a later peek will return 0, and it
> > > > > > looks like the half-processed packets will stay in the ring forever.
> > > > > >
> > > > > > I think the 'ring produce' part should be moved into tun_do_read().
> > > > >
> > > > > Currently, vhost-net obtains a batch of descriptors/sk_buffs from
> > > > > the ptr_ring, enqueues the batch to the virtqueue's queue, and
> > > > > then consumes the descriptors/sk_buffs from the virtqueue's queue
> > > > > in sequence. As a result, TUN does not know whether the batched
> > > > > descriptors have been used up, and thus does not know when to
> > > > > return them.
> > > > >
> > > > > So I think it's reasonable that when vhost-net finds the ptr_ring
> > > > > empty, it calls peek_len to fetch new xsk descs and return the completed descriptors.
> > > > >
> > > > > Thanks
> > > >
> > > > What you need to think about is that if you peek, another call in
> > > > parallel can get the same value at the same time.
> > >
> > > Thank you. I have identified a problem. The tx_descs array is created
> > > within the xsk pool. When the xsk is freed, the pool and tx_descs are
> > > also freed. However, some descs may remain in the virtqueue's queue,
> > > which could lead to a use-after-free scenario.
> >
> > This can probably be solved by signaling vhost_net to drop those
> > descriptors when the xsk pool is disabled.
>
> I think TUN can notify vhost_net to drop these descriptors through netdev events.

Great. Actually, the "issue" described above exists in this patch as
well. For example, you did:

                        spin_lock(&tfile->pool_lock);
                        if (tfile->pool) {
                                ret = tun_put_user_desc(tun, tfile,
                                                        &tfile->desc, to);

You did copy_to_user() under a spinlock, which is actually a bug:
copy_to_user() may fault and sleep, and sleeping is not allowed while a
spinlock is held.

> However, there is a potential concurrency problem. When handling netdev events
> and packets, vhost_net contends for the vq mutex, leading to unstable performance.

I think we don't need to care about perf in this case.

And we gain a lot:

1) no trick in peek
2) batching support
...

Thanks

>
> Thanks
> >
> > Thanks
> >
> > > Currently, I do not have an idea for solving this concurrency problem,
> > > and I believe this scenario may not be appropriate for reusing the
> > > ptr_ring.
> > >
> > > Thanks
> > >
> > > >
> > > >
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Paolo
> > > > >
> > >
>

