Message-ID: <6ce758d1-e646-c7c2-bc02-6911c9b7d6ce@intel.com>
Date: Thu, 16 May 2019 16:50:11 -0700
From: "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
To: Magnus Karlsson <magnus.karlsson@...il.com>,
Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Magnus Karlsson <magnus.karlsson@...el.com>,
Björn Töpel <bjorn.topel@...el.com>,
Daniel Borkmann <daniel@...earbox.net>,
Network Development <netdev@...r.kernel.org>,
bpf@...r.kernel.org, Jakub Kicinski <jakub.kicinski@...ronome.com>,
Jonathan Lemon <bsd@...com>,
Maciej Fijalkowski <maciejromanfijalkowski@...il.com>
Subject: Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
On 5/16/2019 5:37 AM, Magnus Karlsson wrote:
>
> After a number of surprises and issues in the driver, here is now the
> first set of results. 64 byte packets at 40Gbit/s line rate. All
> results in Mpps. Note that I just used my local system and kernel build
> for these numbers so they are not performance tuned. Jesper would
> likely get better results on his setup :-). Explanation follows after
> the table.
>
>                                  Applications
> method     cores   irqs     txpush   rxdrop   l2fwd
> ----------------------------------------------------
> r-t-c        2      y        35.9     11.2     8.6
> poll         2      y        34.2      9.4     8.3
> r-t-c        1      y        18.1      N/A     6.2
> poll         1      y        14.6      8.4     5.9
> busypoll     2      y        31.9     10.5     7.9
> busypoll     1      y        21.5      8.7     6.2
> busypoll     1      n        22.0     10.3     7.3
>
> r-t-c    = Run-to-completion, the mode where Rx uses no syscalls and
>            we only spin on the pointers in the ring.
> poll     = Use the regular syscall poll().
> busypoll = Use the regular syscall poll() in busy-poll mode. The RFC
>            I sent out.
>
> cores == 2 means that softirq/ksoftirqd runs on a different core from
>            the application. 2 cores are consumed in total.
> cores == 1 means that both softirq/ksoftirqd and the application run
>            on the same core. Only 1 core is used in total.
>
> irqs == 'y' is the normal case. irqs == 'n' means that I have created a
> new napi context with the AF_XDP queues inside that does not
> have any interrupts associated with it. No other traffic goes
> to this napi context.
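>
> As a rough, hypothetical driver-side sketch (not the actual i40e
> changes), such a context is just an ordinary napi instance that is
> never hooked up to an interrupt and is only ever driven from
> napi_busy_loop() via the socket's napi_id:
>
> #include <linux/netdevice.h>
>
> /* napi poll routine dedicated to the AF_XDP queue pair */
> static int xsk_qp_poll(struct napi_struct *napi, int budget)
> {
>         int work_done = 0;
>
>         /* clean the AF_XDP Tx/Rx rings here, counting Rx work_done */
>
>         if (work_done < budget)
>                 napi_complete_done(napi, work_done);
>         return work_done;
> }
>
> static void xsk_qp_napi_setup(struct net_device *netdev,
>                               struct napi_struct *napi)
> {
>         netif_napi_add(netdev, napi, xsk_qp_poll, NAPI_POLL_WEIGHT);
>         napi_enable(napi);
>         /* deliberately no request_irq()/MSI-X vector for this one */
> }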
>
> N/A = This combination does not make sense since the application will
> not yield due to run-to-completion without any syscalls
> whatsoever. It works, but it crawls in the 30 Kpps
> range. Creating huge rings would help, but I did not do that.
>
> The applications are the ones from the xdpsock sample application in
> samples/bpf/.
>
> Some things I had to do to get these results:
>
> * The current buffer allocation scheme in i40e, where we continuously
> try to access the fill queue until we find some entries, is not
> effective if we are on a single core. Instead, we try once and call
> a function that sets a flag. This flag is then checked in the xsk
> poll code, and if it is set we schedule napi so that it can try to
> allocate some buffers from the fill ring again. Note that this flag
> has to propagate all the way to user space so that the application
> knows that it has to call poll(). I currently set a flag in the Rx
> ring to indicate that the application should call poll() to resume
> the driver. This is similar to what the io_uring in the storage
> subsystem does. It is not enough to return POLLERR from poll() as
> that will only work for the case when we are using poll(). But I do
> that as well.
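>
> From the application's side, the check could look roughly like the
> hypothetical helper below (the rx_flags pointer and the
> XSK_RING_NEED_POLL bit are made-up names for the Rx ring flag
> described above, not an existing ABI):
>
> #include <poll.h>
> #include <linux/types.h>
>
> #define XSK_RING_NEED_POLL (1U << 0)    /* hypothetical flag bit */
>
> /* rx_flags points at the flag word in the Rx ring that the kernel
>  * sets when it ran out of fill ring entries and parked the driver;
>  * a poll() call gives it a chance to reschedule napi. */
> static void kick_driver_if_needed(int xsk_fd, const __u32 *rx_flags)
> {
>         struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN };
>
>         if (*rx_flags & XSK_RING_NEED_POLL)
>                 poll(&pfd, 1, 0);
> }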
>
> * Implemented Sridhar's suggestion on adding busy_loop_end callbacks
> that terminate the busy poll loop if the Rx queue is empty or the Tx
> queue is full.
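>
> The shape of such a callback is roughly the following (a sketch of
> the idea, not the actual patch; xsk_rx_ring_empty() and
> xsk_tx_ring_full() are placeholders for whatever ring checks the
> real code does):
>
> /* loop_end callback handed to napi_busy_loop(); returning true
>  * terminates the busy poll loop. */
> static bool xsk_busy_loop_end(void *p, unsigned long start_time)
> {
>         struct xdp_sock *xs = p;
>
>         return xsk_rx_ring_empty(xs) || xsk_tx_ring_full(xs) ||
>                sk_busy_loop_timeout(&xs->sk, start_time);
> }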
>
> * There is a race in the setup code in i40e when it is used with
> busy-poll. The fact that busy-poll calls the napi_busy_loop code
> before interrupts have been registered and enabled seems to trigger
> some bug where nothing gets transmitted. This only happens for
> busy-poll. Poll and run-to-completion only enters the napi loop of
> i40e by interrupts and only then after interrupts have been enabled,
> which is the last thing that is done after setup. I have just worked
> around it by introducing a sleep(1) in the application for these
> experiments. Ugly, but should not impact the numbers, I believe.
>
> * The 1 core case is sensitive to the amount of work done reported
> from the driver. This was not correct in the XDP code of i40e and
> led to bad performance. Now it reports the correct values for
> Rx. Note that i40e does not honor the napi budget on Tx and sets
> that to 256, and these are not reported back to the napi
> library.
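>
> For reference, the contract the napi core expects looks roughly
> like this (generic napi pattern, not the i40e code; clean_rx_ring()
> is a placeholder):
>
> static int example_napi_poll(struct napi_struct *napi, int budget)
> {
>         /* number of Rx descriptors actually processed, <= budget */
>         int work_done = clean_rx_ring(napi, budget);
>
>         if (work_done < budget) {
>                 /* ring drained: stop polling and re-arm interrupts */
>                 napi_complete_done(napi, work_done);
>                 return work_done;
>         }
>         /* budget exhausted: napi stays scheduled, so softirq (or
>          * ksoftirqd under load) will call us again */
>         return budget;
> }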
>
> Some observations:
>
> * Cannot really explain the drop in performance for txpush when going
> from 2 cores to 1. As stated before, the reporting of Tx work is not
> really propagated to the napi infrastructure. Tried reporting this
> in a correct manner (completely ignoring Rx for this experiment) but
> the results were the same. Will dig deeper into this to screen out
> any stupid mistakes.
>
> * With the fixes above, all my driver processing is in softirq for 1
> core. It never goes over to ksoftirqd. Previously, when work was
> reported incorrectly, ksoftirqd did take over. I would have liked
> ksoftirqd to take over as that would have been more like a separate
> thread. How to accomplish this? There might still be some reporting
> problem in the driver that hinders this, but actually think it is
> more correct now.
>
> * Looking at the current results for a single core, busy poll provides
> a 40% boost for Tx but only 5% for Rx. But if I instead create a
> napi context without any interrupt associated with it and drive that
> from busy-poll, I get a 15% - 20% performance improvement for Rx. Tx
> increases only marginally beyond the 40% improvement as there are few
> interrupts on Tx due to the completion interrupt bit being set quite
> infrequently. One question I have is: what am I breaking by creating
> a napi context not used by anyone else, only AF_XDP, that does not
> have an interrupt associated with it?
>
> Todo:
>
> * Explain the drop in Tx push when going from 2 cores to 1.
>
> * Really run a separate thread for kernel processing instead of softirq.
>
> * What other experiments would you like to see?
Thanks for sharing the results.
For the busypoll tests, I guess you may have increased the busypoll
budget to 64.
What is the busypoll timeout you are using?
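For reference, the usual knobs for socket busy polling are the
net.core.busy_poll/net.core.busy_read sysctls and the per-socket
SO_BUSY_POLL option, e.g. something along these lines (the 20 usec
value is only an example):

#include <sys/socket.h>

/* request roughly 20 usec of per-socket busy polling; 0 disables it */
static int set_busy_poll_usecs(int fd, int usecs)
{
        return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs,
                          sizeof(usecs));
}
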
Can you try a test that skips calling the bpf program for queues that
are associated with an af-xdp socket? I remember seeing a significant
bump in rxdrop performance with this change.
The other overhead I saw was with the dma_sync_single calls in the driver.
Thanks
Sridhar