Message-ID: <D40B5C89-53F8-4EC1-AB35-FB7C395864DE@fb.com>
Date: Mon, 13 May 2019 20:42:21 +0000
From: Jonathan Lemon <bsd@...com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
CC: Magnus Karlsson <magnus.karlsson@...el.com>,
Björn Töpel <bjorn.topel@...el.com>,
Daniel Borkmann <daniel@...earbox.net>,
Network Development <netdev@...r.kernel.org>,
"bpf@...r.kernel.org" <bpf@...r.kernel.org>,
Jakub Kicinski <jakub.kicinski@...ronome.com>
Subject: Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
Tossing in my .02 cents:
I anticipate that most users of AF_XDP will want packet processing
for a given RX queue occurring on a single core - otherwise we end
up with cache delays. The usual model is one thread, one socket,
one core, but this isn't enforced anywhere in the AF_XDP code; it's
up to the user to set this up.
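For illustration, this is roughly the setup a user has to do by hand
today (a sketch using libbpf's xsk.h helpers; the interface name,
queue id and core number are made up, and nothing below ties the
queue's IRQ to the same core):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <bpf/xsk.h>

    /* Pin the processing thread to core 3 and open an AF_XDP socket
     * on RX queue 3 of "eth0": one thread, one socket, one core.
     * Nothing here (or in the kernel) makes the IRQ for queue 3 fire
     * on core 3 - that's also left to the user. */
    static int pin_and_open(struct xsk_umem *umem, struct xsk_socket **xsk,
                            struct xsk_ring_cons *rx, struct xsk_ring_prod *tx)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(3, &set);
            if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set))
                    return -1;

            return xsk_socket__create(xsk, "eth0", 3, umem, rx, tx, NULL);
    }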
On 7 May 2019, at 11:24, Alexei Starovoitov wrote:
> I'm not saying that we shouldn't do busy-poll. I'm saying it's
> complimentary, but in all cases single core per af_xdp rq queue
> with user thread pinning is preferred.
So I think we're on the same page here.
> Stack rx queues and af_xdp rx queues should look almost the same from
> napi point of view. Stack -> normal napi in softirq. af_xdp -> new
> kthread
> to work with both poll and busy-poll. The only difference between
> poll and busy-poll will be the running context: new kthread vs user
> task.
...
> A burst of 64 packets on stack queues or some other work in softirqd
> will spike the latency for af_xdp queues if softirq is shared.
True, but would it be shared? This goes back to the current model,
which as used by Intel is:
(channel == RX, TX, softirq)
MLX, on the other hand, wants:
(channel == RX.stack, RX.AF_XDP, TX.stack, TX.AF_XDP, softirq)
Which would indeed lead to sharing. The more I look at the above, the
stronger I start to dislike it. Perhaps this should be disallowed?
I believe there was some mention at LSF/MM that the 'channel' concept
was something specific to HW and really shouldn't be part of the SW API.
> Hence the proposal for new napi_kthreads:
> - user creates af_xdp socket and binds to _CPU_ X then
> - driver allocates single af_xdp rq queue (queue ID doesn't need to be
> exposed)
> - spawns kthread pinned to cpu X
> - configures irq for that af_xdp queue to fire on cpu X
> - user space with the help of libbpf pins its processing thread to
> that cpu X
> - repeat above for as many af_xdp sockets as there are cpus
> (its also ok to pick the same cpu X for different af_xdp socket
> then new kthread is shared)
> - user space configures hw to RSS to these set of af_xdp sockets.
> since ethtool api is a mess I propose to use af_xdp api to do this
> rss config
From a high-level point of view, this sounds quite sensible, but it
does need some details ironed out. The above essentially enforces a
model of:
(af_xdp = RX.af_xdp + bound_cpu)
(bound_cpu = hw.cpu + af_xdp.kthread + hw.irq)
(temporarily ignoring TX for right now)
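Or, spelled out as a (made-up) data structure, the binding is roughly
this - purely illustrative, no such struct exists anywhere today:

    struct af_xdp_binding {
            int                 cpu;      /* hw.cpu: core everything is pinned to */
            int                 irq;      /* hw.irq: queue interrupt, affined to cpu */
            struct task_struct *kthread;  /* af_xdp.kthread: napi kthread running on cpu */
            u32                 rx_queue; /* RX.af_xdp: driver queue backing the socket */
            /* TX side intentionally omitted, as above */
    };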
I foresee two issues with the above approach:
1. hardware limitations in the number of queues/rings
2. RSS/steering rules
> - user creates af_xdp socket and binds to _CPU_ X then
> - driver allocates single af_xdp rq queue (queue ID doesn't need to be
> exposed)
Here, the driver may not be able to create an arbitrary RQ, but may
need to tear down/reuse an existing one used by the stack. This may
not be an issue for modern hardware.
> - user space configures hw to RSS to these set of af_xdp sockets.
> since ethtool api is a mess I propose to use af_xdp api to do this
> rss config
Currently, RSS only steers default traffic. On a system with shared
stack/af_xdp queues, there should be a way to split the traffic types,
unless we're talking about a model where all traffic goes to AF_XDP.
This classification has to be done by the NIC, since it comes before
RSS steering - which currently means sending flow match rules to the
NIC, and that is less than ideal. I agree that the ethtool interface
is non-optimal, but it does make it clear to the user what's going on.
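For concreteness, "sending flow match rules to the NIC" today means
something like the below via the ethtool ioctl (the same path that
`ethtool -N` uses). A sketch only: port/queue values are made up,
error handling trimmed, and whether RX_CLS_LOC_ANY is honoured
depends on the driver:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <net/if.h>
    #include <netinet/in.h>
    #include <linux/sockios.h>
    #include <linux/ethtool.h>

    /* Steer UDP traffic with destination port 4242 on ifname to RX
     * queue 3.  fd is any AF_INET socket. */
    static int steer_udp_dport_to_queue(int fd, const char *ifname)
    {
            struct ethtool_rxnfc nfc = { .cmd = ETHTOOL_SRXCLSRLINS };
            struct ifreq ifr = {};

            nfc.fs.flow_type = UDP_V4_FLOW;
            nfc.fs.h_u.udp_ip4_spec.pdst = htons(4242);
            nfc.fs.m_u.udp_ip4_spec.pdst = 0xffff;  /* match the full dst port */
            nfc.fs.ring_cookie = 3;                 /* deliver to RX queue 3 */
            nfc.fs.location = RX_CLS_LOC_ANY;       /* let the driver pick a slot */

            strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
            ifr.ifr_data = (char *)&nfc;
            return ioctl(fd, SIOCETHTOOL, &ifr);
    }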
Perhaps an af_xdp library that does some bookkeeping (rough sketch
below):
 - open af_xdp socket
 - define af_xdp_set as (classification, steering rules, other?)
 - bind socket to (cpu, af_xdp_set)
 - kernel:
   - pins calling thread to cpu
   - creates kthread if one doesn't exist, binds to irq and cpu
   - has driver create RQ.af_xdp, possibly replacing RQ.stack
   - applies (af_xdp_set) to NIC.
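Something like the following - purely hypothetical names, none of
this exists in libbpf/libxdp, it's just the shape of the bookkeeping:

    /* Hypothetical - illustrating the bookkeeping only. */
    struct xsk_set {
            /* classification + steering rules selecting the traffic
             * that belongs to this set of sockets */
            struct xsk_steering_rule *rules;
            unsigned int              nr_rules;
    };

    /* Open the socket as usual... */
    int xsk_lib__open(struct xsk_lib_socket **xsk, const char *ifname);

    /* ...then bind it to (cpu, set).  Under the hood this would:
     *   - pin the calling thread to cpu
     *   - create (or reuse) a kthread pinned to cpu, affine the irq
     *   - have the driver create RQ.af_xdp, possibly replacing RQ.stack
     *   - apply the set's classification/steering rules to the NIC
     */
    int xsk_lib__bind(struct xsk_lib_socket *xsk, int cpu,
                      const struct xsk_set *set);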
Seems workable, but a little complicated? The complexity could be moved
into a separate library.
> imo that would be the simplest and performant way of using af_xdp.
> All configuration apis are under libbpf (or libxdp if we choose to
> fork it)
> End result is one af_xdp rx queue - one napi - one kthread - one user
> thread.
> All pinned to the same cpu with irq on that cpu.
> Both poll and busy-poll approaches will not bounce data between cpus.
> No 'shadow' queues to speak of and should solve the issues that
> folks were bringing up in different threads.
Sounds like a sensible model from my POV.
--
Jonathan