Message-ID: <20180412105431.41bd8d0a@redhat.com>
Date: Thu, 12 Apr 2018 10:54:31 +0200
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: "Karlsson, Magnus" <magnus.karlsson@...el.com>
Cc: "Michael S. Tsirkin" <mst@...hat.com>,
Björn Töpel <bjorn.topel@...il.com>,
"Duyck, Alexander H" <alexander.h.duyck@...el.com>,
"alexander.duyck@...il.com" <alexander.duyck@...il.com>,
"john.fastabend@...il.com" <john.fastabend@...il.com>,
"ast@...com" <ast@...com>,
"willemdebruijn.kernel@...il.com" <willemdebruijn.kernel@...il.com>,
"daniel@...earbox.net" <daniel@...earbox.net>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"michael.lundkvist@...csson.com" <michael.lundkvist@...csson.com>,
"Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
"Singhai, Anjali" <anjali.singhai@...el.com>,
"Zhang, Qi Z" <qi.z.zhang@...el.com>,
"ravineet.singh@...csson.com" <ravineet.singh@...csson.com>,
brouer@...hat.com
Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
On Thu, 12 Apr 2018 07:38:25 +0000
"Karlsson, Magnus" <magnus.karlsson@...el.com> wrote:
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@...hat.com]
> > Sent: Thursday, April 12, 2018 4:16 AM
> > To: Björn Töpel <bjorn.topel@...il.com>
> > Cc: Karlsson, Magnus <magnus.karlsson@...el.com>; Duyck, Alexander H
> > <alexander.h.duyck@...el.com>; alexander.duyck@...il.com;
> > john.fastabend@...il.com; ast@...com; brouer@...hat.com;
> > willemdebruijn.kernel@...il.com; daniel@...earbox.net;
> > netdev@...r.kernel.org; michael.lundkvist@...csson.com; Brandeburg,
> > Jesse <jesse.brandeburg@...el.com>; Singhai, Anjali
> > <anjali.singhai@...el.com>; Zhang, Qi Z <qi.z.zhang@...el.com>;
> > ravineet.singh@...csson.com
> > Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> > mmap
> >
> > On Tue, Mar 27, 2018 at 06:59:08PM +0200, Björn Töpel wrote:
> > > @@ -30,4 +31,18 @@ struct xdp_umem_reg {
> > >  	__u32 frame_headroom; /* Frame head room */
> > >  };
> > >  
> > > +/* Pgoff for mmaping the rings */
> > > +#define XDP_UMEM_PGOFF_FILL_QUEUE 0x100000000
> > > +
> > > +struct xdp_queue {
> > > +	__u32 head_idx __attribute__((aligned(64)));
> > > +	__u32 tail_idx __attribute__((aligned(64)));
> > > +};
> > > +
> > > +/* Used for the fill and completion queues for buffers */
> > > +struct xdp_umem_queue {
> > > +	struct xdp_queue ptrs;
> > > +	__u32 desc[0] __attribute__((aligned(64)));
> > > +};
> > > +
> > >  #endif /* _LINUX_IF_XDP_H */
> >
> > So IIUC it's a head/tail ring of 32-bit descriptors.
> >
> > In my experience (from implementing ptr_ring) this implies that the
> > head/tail cache lines bounce a lot between CPUs. Caching will help some.
> > You are also forced to use barriers to check validity, which is slow on
> > some architectures.
> >
> > If instead you can use a special descriptor value (e.g. 0) as the valid
> > signal, things work much better:
> >
> > - you read the descriptor atomically; if it's not 0, it's valid
> > - same with write - write 0 to pass the slot back to the other side
> > - there is a data dependency, so no need for barriers (except on DEC Alpha)
> > - no power-of-2 size limitation; you can make the ring any size you like
> > - easy to resize too
> >
> > The architecture (if not the implementation) would be shared with
> > ptr_ring, so some of the optimization ideas, like batched updates, could
> > be lifted from there.
> >
> > When I was building ptr_ring, any head/tail design underperformed storing
> > the valid flag with the data itself. YMMV.
I fully agree with MST here. This is also my experience. I even
dropped my own Array-based Lock-Free (ALF) queue implementation[1] in
favor of ptr_ring. (There I try to amortize this cost by bulking, but
this causes the queue to become non-wait-free.)
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h
> I think you are definitely right in that there are ways in which
> we can improve performance here. That said, the current queue
> performs slightly better than the previous one we had that was
> more or less a copy of one of your first virtio 1.1 proposals
> from little over a year ago. It had bidirectional queues and a
> valid flag in the descriptor itself. The reason we abandoned this
> was not poor performance (it was good), but a need to go to
> unidirectional queues. Maybe I should have only changed that
> aspect and kept the valid flag.
>
> Anyway, I will take a look at ptr_ring and run some experiments
> along the lines of what you propose to get some
> numbers. Considering your experience with these kinds of
> structures, you are likely right. I just need to convince
> myself :-).
When benchmarking, be careful that you don't measure the "wrong"
queue situation. When doing this kind of "overload" benchmarking, you
will likely create a situation where the queue is always full (which
hopefully isn't a production use-case). In the almost/always-full
queue situation, using the element values to sync on (like MST
proposes) will still cause the cache-line bouncing (that we want to
avoid).

MST explains and has addressed this situation for ptr_ring in:
 commit fb9de9704775 ("ptr_ring: batch ring zeroing")
 https://git.kernel.org/torvalds/c/fb9de9704775
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer