Date:   Wed, 8 May 2019 14:10:19 +0200
From:   Magnus Karlsson <magnus.karlsson@...il.com>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     Magnus Karlsson <magnus.karlsson@...el.com>,
        Björn Töpel <bjorn.topel@...el.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Network Development <netdev@...r.kernel.org>,
        bpf@...r.kernel.org, Jakub Kicinski <jakub.kicinski@...ronome.com>,
        Jonathan Lemon <bsd@...com>
Subject: Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets

On Tue, May 7, 2019 at 8:24 PM Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
>
> On Tue, May 07, 2019 at 01:51:45PM +0200, Magnus Karlsson wrote:
> > On Mon, May 6, 2019 at 6:33 PM Alexei Starovoitov
> > <alexei.starovoitov@...il.com> wrote:
> > >
> > > On Thu, May 02, 2019 at 10:39:16AM +0200, Magnus Karlsson wrote:
> > > > This RFC proposes to add busy-poll support to AF_XDP sockets. With
> > > > busy-poll, the driver is executed in process context by calling the
> > > > poll() syscall. The main advantage of this is that all processing
> > > > occurs on a single core. This eliminates the core-to-core cache
> > > > transfers between the application and the softirq processing on
> > > > another core that occur without busy-poll. From a systems point of
> > > > view, it also has the advantage that we do not have to provision
> > > > extra cores in the system to handle ksoftirqd/softirq processing, as
> > > > all processing is done on the single core that executes the
> > > > application. The drawback of busy-poll is that the max throughput
> > > > seen from a single application will be lower (due to the syscall),
> > > > but on a per-core basis it will often be higher, as the normal mode
> > > > runs on two cores and busy-poll on a single one.
> > > >
> > > > The semantics of busy-poll from the application point of view are the
> > > > following:
> > > >
> > > > * The application is required to call poll() to drive rx and tx
> > > >   processing. There is no guarantee that softirq and interrupts will
> > > >   do this for you. This is in contrast with the current
> > > >   implementations of busy-poll that are opportunistic in the sense
> > > >   that packets might be received/transmitted by busy-poll or
> > > >   softirqd. (In this patch set, softirq/ksoftirqd will kick in at high
> > > >   loads just as with the current opportunistic implementations, but I would
> > > >   like to get to a point where this is not the case for busy-poll
> > > >   enabled XDP sockets, as this slows down performance considerably and
> > > >   starts to use one more core for the softirq processing. The end goal
> > > >   is for only poll() to drive the napi loop when busy-poll is enabled
> > > >   on an AF_XDP socket. More about this later.)
> > > >
> > > > * It should be enabled on a per-socket basis. No global enablement, i.e.
> > > >   the XDP socket busy-poll will not care about the current
> > > >   /proc/sys/net/core/busy_poll and busy_read global enablement
> > > >   mechanisms.
> > > >
> > > > * The batch size (how many packets are processed every time the napi
> > > >   function in the driver is called, i.e. the weight parameter) should
> > > >   be configurable. Currently, the busy-poll batch size of AF_INET
> > > >   sockets is set to 8, but for AF_XDP sockets this is too small as the
> > > >   amount of processing per packet is much smaller with AF_XDP. This
> > > >   should be configurable on a per-socket basis.
> > > >
> > > > * If you put multiple AF_XDP busy-poll enabled sockets into a poll()
> > > >   call, the napi contexts of all of them should be executed. This is in
> > > >   contrast to AF_INET busy-poll, which quits after the first socket that
> > > >   finds any packets. We need all napi contexts to be executed due to
> > > >   the first requirement in this list. The behaviour we want is much more
> > > >   like regular sockets, in that all of them are checked in the poll
> > > >   call.
> > > >
> > > > * It should be possible to mix AF_XDP busy-poll sockets with any other
> > > >   sockets including busy-poll AF_INET ones in a single poll() call
> > > >   without any change to semantics or the behaviour of any of those
> > > >   socket types.
> > > >
> > > > * As suggested by Maxim Mikityanskiy, poll() will in the busy-poll
> > > >   mode return POLLERR if the fill ring is empty or the completion
> > > >   queue is full.
> > > >
> > > > Busy-poll support is enabled by calling a new setsockopt called
> > > > XDP_BUSY_POLL_BATCH_SIZE that takes the batch size as an argument. A value
> > > > between 1 and NAPI_WEIGHT (64) will turn it on, 0 will turn it off and
> > > > any other value will return an error.
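> > > >
> > > > For illustration, enabling it on a socket could look like this (the
> > > > exact call shape is just a sketch of the proposed option, with
> > > > SOL_XDP as the assumed setsockopt level):
> > > >
> > > > int batch = 16; /* 1..NAPI_WEIGHT turns busy-poll on, 0 turns it off */
> > > >
> > > > if (setsockopt(xsk_fd, SOL_XDP, XDP_BUSY_POLL_BATCH_SIZE,
> > > >                &batch, sizeof(batch)))
> > > >     perror("setsockopt(XDP_BUSY_POLL_BATCH_SIZE)");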
> > > >
> > > > A typical packet processing rxdrop loop with busy-poll will look something
> > > > like this:
> > > >
> > > > for (i = 0; i < num_socks; i++) {
> > > >     fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
> > > >     fds[i].events = POLLIN;
> > > > }
> > > >
> > > > for (;;) {
> > > >     ret = poll(fds, num_socks, 0);
> > > >     if (ret <= 0)
> > > >             continue;
> > > >
> > > >     for (i = 0; i < num_socks; i++)
> > > >         rx_drop(xsks[i], fds); /* The actual application */
> > > > }
> > > >
> > > > Need some advice around this issue please:
> > > >
> > > > In this patch set, softirq/ksoftirqd will kick in at high loads and
> > > > render the busy poll support useless as the execution is now happening
> > > > in the same way as without busy-poll support. Everything works from an
> > > > application perspective but this defeats the purpose of the support
> > > > and also consumes an extra core. What I would like to accomplish when
> > >
> > > Not sure what you mean by 'extra core'.
> > > The above poll+rx_drop is executed for every af_xdp socket
> > > and there are N cpus processing exactly N af_xdp sockets.
> > > Where is 'extra core'?
> > > Are you suggesting a model where a single core will be busy-polling
> > > all af_xdp sockets and then waking processing threads?
> > > Or a single core will process all sockets?
> > > I think the af_xdp model should be flexible and allow an easy out-of-the-box
> > > experience, but it should be optimized for the 'ideal' user that
> > > does the right thing from a max packets-per-second point of view.
> > > I thought we've already converged on the model where af_xdp hw rx queues
> > > bind one-to-one to af_xdp sockets and user space pins processing
> > > threads one-to-one to af_xdp sockets on corresponding cpus...
> > > If so that's the model to optimize for on the kernel side
> > > while keeping all other user configurations functional.
> > >
> > > > XDP socket busy-poll is enabled is that softirq/ksoftirqd is never
> > > > invoked for the traffic that goes to this socket. This way, we would
> > > > get better performance on a per core basis and also get the same
> > > > behaviour independent of load.
> > >
> > > I suspect separate rx kthreads for af_xdp socket processing are necessary
> > > with and without busy-poll exactly because of the 'high load' case
> > > you've described.
> > > If we do this additional rx-kthread model, why differentiate
> > > between busy-polling and polling?
> > >
> > > af_xdp rx queue is completely different from stack rx queue because
> > > of the target dma address setup.
> > > Using the stack's napi ksoftirqd threads for processing af_xdp queues creates
> > > fairness issues. Isn't it better to have separate kthreads for them
> > > and let the scheduler deal with fairness among af_xdp processing and the stack?
> >
> > When using ordinary poll() on an AF_XDP socket, the application will
> > run on one core and the driver processing will run on another in
> > softirq/ksoftirqd context. (Either due to explicit core and irq
> > pinning or due to the scheduler or irqbalance moving the two threads
> > apart.) In the AF_XDP busy-poll mode of this RFC, I would like the
> > application and the driver processing to occur on a single core, so
> > there is no "extra" driver core involved that needs to be taken into
> > account when sizing and/or provisioning the system. In this mode, the
> > napi context is invoked from syscall context when the application
> > executes the poll() syscall.
> >
> > Executing the app and the driver on the same core could of course be
> > accomplished already today by pinning the application and the driver
> > interrupt to the same core, but that would not be that efficient due
> > to context switching between the two.
>
> Have you benchmarked it?
> I don't think the context switch will be that noticeable when kpti is off.
> napi processes 64 packet descriptors and switches back to user to
> do payload processing of these packets.
> I would think that the same job on two different cores would be
> a bit more performant, with user code consuming close to 100%
> and softirq a single-digit %. Say it's 10%.
> I believe combining the two on a single core is not 100 + 10, since
> there is no cache bouncing. So Mpps from the two-core setup will
> reduce by 2-3% instead of 10%.
> There is a cost of going to sleep and being woken up from poll(),
> but 64 packets is probably a large enough number to amortize it.
> If not, have you tried to bump the napi budget to, say, 256 for af_xdp
> rx queues?
> Busy-poll avoids the sleep/wakeup overhead and probably can make
> this scheme work with lower batching (like 64), but fundamentally
> they're the same thing.
> I'm not saying that we shouldn't do busy-poll. I'm saying it's
> complementary, but in all cases a single core per af_xdp rx queue
> with user thread pinning is preferred.
>
> > A more efficient way would be to
> > call the napi loop from within the poll() syscall when you are inside
> > the kernel anyway. This is what the classical busy-poll mechanism
> > operating on AF_INET sockets does. Inside the poll() call, it executes
> > the napi context of the driver until it finds a packet (if it is rx)
> > and then returns to the application, which then processes the packets. I
> > would like to adopt something quite similar for AF_XDP sockets. (Some
> > of the differences can be found at the top of the original post.)
> >
> > From an API point of view with busy-poll of AF_XDP sockets, the user
> > would bind to a queue number and taskset its application to a specific
> > core and both the app and the driver execution would only occur on
> > that core. This is in my mind simpler than with regular poll or AF_XDP
> > using no syscalls on rx (i.e. current state), in which you bind to a
> > queue, taskset your application to a core and then you also have to
> > take care to route the interrupt of the queue you bound to to another
> > core that will execute the driver part in the kernel. So the model is
> > in both cases still one core - one socket - one napi. (Users can of
> > course create multiple sockets in an app if they desire.)
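> >
> > To make this concrete, here is a minimal sketch of that user-space
> > setup, using libbpf's existing xsk helpers (the function and interface
> > names are just for illustration, the umem is assumed to have been
> > created already, and error handling is omitted):
> >
> > #define _GNU_SOURCE
> > #include <sched.h>
> > #include <bpf/xsk.h>
> >
> > static struct xsk_socket *setup_on_core(int core, int queue_id,
> >                                         struct xsk_umem *umem)
> > {
> >     /* Static so the rings outlive this helper in the sketch; a real
> >      * application would keep them in its per-socket state. */
> >     static struct xsk_ring_cons rx;
> >     static struct xsk_ring_prod tx;
> >     struct xsk_socket_config cfg = {
> >         .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
> >         .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
> >     };
> >     struct xsk_socket *xsk = NULL;
> >     cpu_set_t set;
> >
> >     /* Pin this thread to the chosen core (the "taskset" step). */
> >     CPU_ZERO(&set);
> >     CPU_SET(core, &set);
> >     sched_setaffinity(0, sizeof(set), &set);
> >
> >     /* Bind the AF_XDP socket to one hw queue on the interface. */
> >     xsk_socket__create(&xsk, "eth0", queue_id, umem, &rx, &tx, &cfg);
> >     return xsk;
> > }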
> >
> > The main reasons I would like to introduce busy-poll for AF_XDP
> > sockets are:
> >
> > * It is simpler to provision, see the arguments above. Both the
> >   application and the driver run efficiently on the same core.
> >
> > * It is faster (on a per-core basis) since we do not have any
> >   core-to-core communication. All header and descriptor transfers
> >   between kernel and application are core local, which is much
> >   faster. Scalability will also be better. E.g., 64 bytes desc + 64
> >   bytes packet header = 128 bytes per packet less on the interconnect
> >   between cores. At 20 Mpps/core, this is ~20 Gbit/s, and with 20 cores
> >   this will be ~400 Gbit/s less interconnect traffic with busy-poll.
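> >   (Spelling out the arithmetic: 128 bytes/packet x 20 Mpps = 2.56 GB/s,
> >   i.e. ~20.5 Gbit/s per core, and ~410 Gbit/s across 20 cores.)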
>
> exactly. don't make cpu do this core-to-core stuff.
> pin one rx to one core.
>
> > * It provides a way to seamlessly replace user-space drivers in DPDK
> >   with Linux drivers in kernel space. (Do not think I have to argue
> >   why this is a good idea on this list ;-).) The DPDK model is that
> >   application and driver run on the same core since they are both in
> >   user space. If we can provide the same model (both running
> >   efficiently on the same core, NOT drivers in user-space) with
> >   AF_XDP, it is easy for DPDK users to make the switch. Compare this
> >   to the current way where there are both application cores and
> >   driver/ksoftirqd cores. If a systems builder had 12 cores in his
> >   appliance box and they had 12 instances of a DPDK app, one on each
> >   core, how would he/she reason when repartitioning between
> >   application and driver cores? 8 application cores and 4 driver
> >   cores, or 6 of each? Maybe it is also packet dependent? Etc. Much
> >   simpler to migrate if we had an efficient way to run both of them on
> >   the same core.
> >
> > Why no interrupt? That should have been: no interrupts enabled to
> > start with. We would like to avoid interrupts as much as possible
> > since when they trigger, we will revert to the non-busy-poll model,
> > i.e. processing on two separate cores, and the advantages from above
> > will disappear. How to accomplish this?
> >
> > * One way would be to create a napi context with the queue we have
> >   bound to, but with no interrupt associated with it, or with it
> >   disabled. The socket would in that case only be able to receive and
> >   send packets when calling the poll() syscall. If you do not call
> >   poll(), you do not get any packets, nor are any packets sent. It
> >   would only be possible to support this with a poll() timeout value
> >   of zero. This would have the best performance.
> >
> > * Maybe we could support timeout values >0 by re-enabling the interrupt
> >   at some point. When calling poll(), the core would invoke the napi
> >   context repeatedly with the interrupt of that napi disabled until it
> >   found a packet, but at most for a period of time up to the busy-poll
> >   timeout (like regular busy-poll does today). If that times out, we
> >   go up to the regular timeout of the poll() call and enable
> >   interrupts of the queue associated with the napi and put the process
> >   to sleep. Once woken up by an interrupt, the interrupt of the napi
> >   would be disabled again and control returned to the application. With
> >   this scheme, we would process the vast majority of packets locally
> >   on a core with interrupts disabled and with good performance; only
> >   when we have low load and are sleeping/waiting in poll would we
> >   process some packets using interrupts, on the core that the
> >   interrupt has been bound to.
>
> I think both 'no interrupt' solutions are challenging for users.
> Stack rx queues and af_xdp rx queues should look almost the same from
> the napi point of view. Stack -> normal napi in softirq. af_xdp -> new kthread
> to work with both poll and busy-poll. The only difference between
> poll and busy-poll will be the running context: new kthread vs user task.
> If busy-poll drained the queue, then the new kthread napi has no work to do.
> The no-irq approach could be marginally faster, but more error prone.
> With the new kthread, user space will still work in all configurations,
> even when a single user task is processing many af_xdp sockets.
>
> I'm proposing the new kthread only partially for performance reasons, but
> mainly to avoid sharing stack rx and af_xdp queues within the same softirq.
> Currently we share softirqd for stack napis for all NICs in the system,
> but af_xdp depends on isolated processing.
> Ideally we have rss into N queues for the stack and rss into M af_xdp sockets.
> The same host will receive traffic on both.
> Even if we rss stack queues to one set of cpus and af_xdp to another,
> softirqds are doing work on all cpus.
> A burst of 64 packets on stack queues or some other work in softirqd
> will spike the latency for af_xdp queues if softirq is shared.
> Hence the proposal for new napi_kthreads:
> - user creates af_xdp socket and binds to _CPU_ X, then
> - driver allocates a single af_xdp rx queue (queue ID doesn't need to be exposed)
> - spawns a kthread pinned to cpu X
> - configures the irq for that af_xdp queue to fire on cpu X
> - user space, with the help of libbpf, pins its processing thread to that cpu X
> - repeat the above for as many af_xdp sockets as there are cpus
>   (it's also ok to pick the same cpu X for a different af_xdp socket;
>   then the new kthread is shared)
> - user space configures hw to RSS to this set of af_xdp sockets;
>   since the ethtool api is a mess I propose to use an af_xdp api to do this
>   rss config (rough user-space sketch below)
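>
> A purely hypothetical user-space sketch of that flow (none of these helpers
> exist today; the names are made up only to illustrate the proposal):
>
>     /* hypothetical libbpf/libxdp-style calls, not a real API */
>     for (i = 0; i < num_socks; i++) {
>         /* create a socket bound to cpu i; the driver picks the rx queue,
>          * spawns the kthread and routes the irq to cpu i */
>         xsk_socket__create_on_cpu(&xsks[i], "eth0", i, umem,
>                                   &rx[i], &tx[i], &cfg);
>         /* pin this processing thread to the same cpu */
>         xsk_pin_thread_to_cpu(i);
>     }
>     /* steer hw RSS across this set of af_xdp sockets */
>     xsk_rss_config(xsks, num_socks);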
>
> imo that would be the simplest and most performant way of using af_xdp.
> All configuration apis are under libbpf (or libxdp if we choose to fork it)
> End result is one af_xdp rx queue - one napi - one kthread - one user thread.
> All pinned to the same cpu with irq on that cpu.
> Both poll and busy-poll approaches will not bounce data between cpus.
> No 'shadow' queues to speak of and should solve the issues that
> folks were bringing up in different threads.
> How crazy does it sound?

Actually, it sounds remarkably sane :-). It will create something
quite similar to what I have been wanting, but you take it at least
two steps further. I had not thought about introducing a separate
kthread as a potential solution, and Björn and I have only been
loosely talking about user-space configuration of RSS (and maybe
other flow steering mechanisms) from AF_XDP. Anyway, I am producing
performance numbers for the options we have talked about. I will get
back to you with them as soon as I have them, and we can continue the
discussion based on those.

Thanks: Magnus
