Message-ID: <d55253dd-42b4-9cb1-ddc9-4f74c06ec845@intel.com>
Date: Mon, 13 May 2019 16:30:58 -0700
From: "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
To: Jonathan Lemon <bsd@...com>,
Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Magnus Karlsson <magnus.karlsson@...el.com>,
Björn Töpel <bjorn.topel@...el.com>,
Daniel Borkmann <daniel@...earbox.net>,
Network Development <netdev@...r.kernel.org>,
"bpf@...r.kernel.org" <bpf@...r.kernel.org>,
Jakub Kicinski <jakub.kicinski@...ronome.com>
Subject: Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
On 5/13/2019 1:42 PM, Jonathan Lemon wrote:
> Tossing in my $0.02:
>
>
> I anticipate that most users of AF_XDP will want packet processing
> for a given RX queue to occur on a single core - otherwise we end
> up with cache delays. The usual model is one thread, one socket,
> one core, but this isn't enforced anywhere in the AF_XDP code; it's
> up to the user to set it up.
AF_XDP with busy poll should allow a single thread to poll a given RX
queue from a single core.
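As a rough illustration of that model (the SO_BUSY_POLL socket option
itself already exists; having poll() on an AF_XDP socket honor it is
what this series is about, so treat the exact knobs as a sketch):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <poll.h>
#include <sys/socket.h>

/* Pin the calling thread to one core; the queue's irq (or the
 * proposed napi kthread) would be steered to the same core. */
static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void rx_loop(int xsk_fd, int cpu)
{
	struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN };
	int usecs = 50;		/* busy poll budget before sleeping */

	pin_to_cpu(cpu);
	setsockopt(xsk_fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));

	for (;;) {
		poll(&pfd, 1, -1);
		/* drain the AF_XDP RX ring, refill the fill ring */
	}
}

With the queue's interrupt on the same core, nothing in this loop
needs to bounce data between cpus.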
>
> On 7 May 2019, at 11:24, Alexei Starovoitov wrote:
>> I'm not saying that we shouldn't do busy-poll. I'm saying it's
>> complementary, but in all cases a single core per af_xdp rx queue
>> with user thread pinning is preferred.
>
> So I think we're on the same page here.
>
>> Stack rx queues and af_xdp rx queues should look almost the same
>> from the napi point of view. Stack -> normal napi in softirq.
>> af_xdp -> new kthread that works with both poll and busy-poll. The
>> only difference between poll and busy-poll will be the running
>> context: new kthread vs user task.
> ...
>> A burst of 64 packets on stack queues or some other work in softirqd
>> will spike the latency for af_xdp queues if softirq is shared.
>
> True, but would it be shared? This goes back to the current model,
> which as used by Intel is:
>
> (channel == RX, TX, softirq)
>
> MLX, on the other hand, wants:
>
> (channel == RX.stack, RX.AF_XDP, TX.stack, TX.AF_XDP, softirq)
>
> Which would indeed lead to sharing. The more I look at the above, the
> more I dislike it. Perhaps this should be disallowed?
>
> I believe there was some mention at LSF/MM that the 'channel' concept
> was something specific to HW and really shouldn't be part of the SW API.
>
>> Hence the proposal for new napi_kthreads:
>> - user creates af_xdp socket and binds it to _CPU_ X, then
>> - driver allocates a single af_xdp rx queue (queue ID doesn't need
>>   to be exposed)
>> - spawns kthread pinned to cpu X
>> - configures irq for that af_xdp queue to fire on cpu X
>> - user space, with the help of libbpf, pins its processing thread to
>>   that cpu X
>> - repeat the above for as many af_xdp sockets as there are cpus
>>   (it's also ok to pick the same cpu X for different af_xdp sockets;
>>   then the new kthread is shared)
>> - user space configures hw to RSS to this set of af_xdp sockets.
>>   Since the ethtool api is a mess, I propose to use the af_xdp api
>>   to do this rss config
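
For contrast with "queue ID doesn't need to be exposed": with today's
API the bind step names (ifindex, queue id) rather than a CPU. A rough
sketch - UMEM registration and RX/fill ring setup via setsockopt() are
omitted, and bind() will fail without them:

#include <string.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/if_xdp.h>

/* Open an AF_XDP socket and bind it to one RX queue of an interface. */
int xsk_bind_to_queue(const char *ifname, __u32 queue_id)
{
	struct sockaddr_xdp sxdp;
	int fd;

	fd = socket(AF_XDP, SOCK_RAW, 0);
	if (fd < 0)
		return -1;

	memset(&sxdp, 0, sizeof(sxdp));
	sxdp.sxdp_family = AF_XDP;
	sxdp.sxdp_ifindex = if_nametoindex(ifname);
	sxdp.sxdp_queue_id = queue_id;

	if (bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp)) < 0)
		return -1;

	return fd;
}

Under the proposal above, queue_id would go away and the CPU would be
the only thing user space names.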
>
>
> From a high level point of view, this sounds quite sensible, but some
> details need to be ironed out. The above essentially enforces a model
> of:
>
> (af_xdp = RX.af_xdp + bound_cpu)
> (bound_cpu = hw.cpu + af_xdp.kthread + hw.irq)
>
> (ignoring TX for now)
>
>
> I foresee two issues with the above approach:
> 1. hardware limitations in the number of queues/rings
> 2. RSS/steering rules
>
>> - user creates af_xdp socket and binds it to _CPU_ X, then
>> - driver allocates a single af_xdp rx queue (queue ID doesn't need
>>   to be exposed)
>
> Here, the driver may not be able to create an arbitrary RQ, but may
> need to tear down/reuse an existing one used by the stack. This may
> not be an issue for modern hardware.
>
>> - user space configures hw to RSS to this set of af_xdp sockets.
>>   Since the ethtool api is a mess, I propose to use the af_xdp api
>>   to do this rss config
>
> Currently, RSS only steers default traffic. On a system with shared
> stack/af_xdp queues, there should be a way to split the traffic types,
> unless we're talking about a model where all traffic goes to AF_XDP.
>
> This classification has to be done by the NIC, since it happens before
> RSS steering - which currently means sending flow match rules to the
> NIC, which is less than ideal. I agree that the ethtool interface is
> suboptimal, but it does make it clear to the user what's going on.
'tc' provides another interface to split NIC queues into groups of
queues, each with its own RSS. For example:

   tc qdisc add dev <i/f> root mqprio num_tc 3 map 0 1 2 \
      queues 2@0 32@2 8@34 hw 1 mode channel

will split the NIC queues into 3 groups of 2, 32 and 8 queues
(queues 0-1, 2-33 and 34-41). By default all packets go only to the
first queue group. Filters can then be added to redirect packets to
the other queue groups:

   tc filter add dev <i/f> protocol ip ingress prio 1 flower \
      dst_ip 192.168.0.2 ip_proto tcp dst_port 1234 skip_sw hw_tc 1
   tc filter add dev <i/f> protocol ip ingress prio 1 flower \
      dst_ip 192.168.0.3 ip_proto tcp dst_port 1234 skip_sw hw_tc 2

Here hw_tc indicates the queue group. It should be possible to run
AF_XDP on the third queue group (hw_tc 2) by creating 8 af_xdp sockets
and binding them to queues 34-41.
Does this look like a reasonable model to use a subset of NIC queues
for af_xdp applications?
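
To make that last step concrete, a quick sketch reusing the
xsk_bind_to_queue() helper from earlier in this mail:

#include <linux/types.h>

#define GROUP_START	34	/* first queue of the 8@34 group */
#define GROUP_SIZE	8

int xsk_bind_to_queue(const char *ifname, __u32 queue_id);

/* One AF_XDP socket per queue in the third queue group. */
static int open_group_sockets(const char *ifname, int fds[GROUP_SIZE])
{
	for (int i = 0; i < GROUP_SIZE; i++) {
		fds[i] = xsk_bind_to_queue(ifname, GROUP_START + i);
		if (fds[i] < 0)
			return -1;
	}
	return 0;
}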
>
> Perhaps an af_xdp library that does some bookkeeping:
> - open af_xdp socket
> - define af_xdp_set as (classification, steering rules, other?)
> - bind socket to (cpu, af_xdp_set)
> - kernel:
> - pins calling thread to cpu
> - creates kthread if one doesn't exist, binds to irq and cpu
> - has driver create RQ.af_xdp, possibly replacing RQ.stack
> - applies (af_xdp_set) to NIC.
>
> Seems workable, but a little complicated? The complexity could be moved
> into a separate library.
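
If it helps, here is one way that bookkeeping could look as an API.
The names are purely hypothetical - nothing like this exists in libbpf
today:

struct af_xdp_rule;		/* opaque classification/steering rule */

struct af_xdp_set {
	unsigned int n_rules;
	struct af_xdp_rule **rules;	/* pushed down to the NIC */
};

/* Pins the calling thread to @cpu, creates the kthread and binds the
 * irq to @cpu if needed, has the driver create RQ.af_xdp (possibly
 * replacing RQ.stack), and applies @set to the NIC.  Returns the
 * AF_XDP socket fd, or -1 on error. */
int af_xdp_open_bound(const char *ifname, int cpu,
		      const struct af_xdp_set *set);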
>
>
>> imo that would be the simplest and most performant way of using
>> af_xdp.
>> All configuration apis are under libbpf (or libxdp if we choose to
>> fork it).
>> End result is one af_xdp rx queue - one napi - one kthread - one
>> user thread, all pinned to the same cpu with the irq on that cpu.
>> Both poll and busy-poll approaches will not bounce data between cpus.
>> No 'shadow' queues to speak of, and it should solve the issues that
>> folks were bringing up in different threads.
>
> Sounds like a sensible model from my POV.
>