Date:   Tue, 30 Jun 2020 17:23:27 -0700
From:   "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
To:     Maxim Mikityanskiy <maximmi@...lanox.com>
Cc:     Amritha Nambiar <amritha.nambiar@...el.com>,
        Kiran Patil <kiran.patil@...el.com>,
        Alexander Duyck <alexander.h.duyck@...el.com>,
        Eric Dumazet <edumazet@...gle.com>,
        Tom Herbert <tom@...bertland.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: ADQ - comparison to aRFS, clarifications on NAPI ID, binding with
 busy-polling



On 6/26/2020 5:48 AM, Maxim Mikityanskiy wrote:
> Thanks a lot for your reply! It was really helpful. I have a few
> comments, please see below.
> 
> On 2020-06-24 23:21, Samudrala, Sridhar wrote:
>>
>>
>> On 6/17/2020 6:15 AM, Maxim Mikityanskiy wrote:
>>> Hi,
>>>
>>> I discovered the Intel ADQ feature [1] that allows boosting performance
>>> by picking dedicated queues for application traffic. We did some
>>> research, and I got some level of understanding of how it works, but I
>>> have some questions, and I hope you could answer them.
>>>
>>> 1. SO_INCOMING_NAPI_ID usage. In my understanding, every connection
>>> has a key (sk_napi_id) that is unique to the NAPI where this
>>> connection is handled, and the application uses that key to choose a
>>> handler thread from the thread pool. If we have a one-to-one
>>> relationship between application threads and NAPI IDs of connections,
>>> each application thread will handle only traffic from a single NAPI.
>>> Is my understanding correct?
>>
>> Yes. It is correct and recommended with the current implementation.
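Roughly, that pattern looks like this from userspace (a minimal sketch only;
dispatch_connection(), pick_worker() and nr_workers are placeholder names,
not from any real application):

/* Read SO_INCOMING_NAPI_ID on an accepted socket and use it to pick a
 * worker thread, so each worker ends up handling sockets from one NAPI. */
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56	/* from asm-generic/socket.h */
#endif

static int pick_worker(unsigned int napi_id, int nr_workers)
{
	/* Simplest possible mapping; a real app would keep a
	 * napi_id -> worker table built up as connections arrive. */
	return (int)(napi_id % (unsigned int)nr_workers);
}

int dispatch_connection(int connfd, int nr_workers)
{
	unsigned int napi_id = 0;
	socklen_t len = sizeof(napi_id);

	if (getsockopt(connfd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
		       &napi_id, &len) < 0) {
		perror("getsockopt(SO_INCOMING_NAPI_ID)");
		return -1;
	}
	/* napi_id stays 0 until the socket has received at least one packet
	 * on a NAPI-enabled device (and CONFIG_NET_RX_BUSY_POLL is set). */
	return pick_worker(napi_id, nr_workers);
}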
>>
>>>
>>> 1.1. I wonder how the application thread gets scheduled on the same
>>> core that NAPI runs on. It currently only works with busy_poll, so
>>> when the application initiates busy polling (calls epoll), does the
>>> Linux scheduler move the thread to the right CPU? Do we have to have a
>>> strict one-to-one relationship between threads and NAPIs, or can one
>>> thread handle multiple NAPIs? When the data arrives, does the
>>> scheduler run the application thread on the same CPU that NAPI ran on?
>>
>> The app thread can do busy polling from any core and there is no requirement
>> that the scheduler needs to move the thread to a specific CPU.
>>
>> If the NAPI processing happens via interrupts, the scheduler could move
>> the app thread to the same CPU that NAPI ran on.
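For completeness, the usual way an application opts in to busy polling on
kernels of this vintage is, to my understanding, a per-socket SO_BUSY_POLL
budget for blocking reads plus the net.core.busy_poll sysctl for the epoll
path (the values below are arbitrary):

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46		/* from asm-generic/socket.h */
#endif

int enable_busy_poll(int sockfd)
{
	int usec = 50;		/* spin up to 50us on blocking reads */

	if (setsockopt(sockfd, SOL_SOCKET, SO_BUSY_POLL,
		       &usec, sizeof(usec)) < 0) {
		perror("setsockopt(SO_BUSY_POLL)");
		return -1;
	}
	/* For epoll_wait() to busy-poll the NAPI context, additionally set
	 * e.g. sysctl net.core.busy_poll=50 and keep sockets from a single
	 * NAPI ID in each epoll instance. */
	return 0;
}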
>>
>>>
>>> 1.2. I see that SO_INCOMING_NAPI_ID is tightly coupled with busy_poll.
>>> It is enabled only if CONFIG_NET_RX_BUSY_POLL is set. Is there a real
>>> reason why it can't be used without busy_poll? In other words, if we
>>> modify the kernel to drop this requirement, will the kernel still
>>> schedule the application thread on the same CPU as NAPI when busy_poll
>>> is not used?
>>
>> It should be OK to remove this restriction, but it requires enabling this
>> in skb_mark_napi_id() and sk_mark_napi_id() too.
>>
>>>
>>> 2. Can you compare ADQ to aRFS+XPS? aRFS provides a way to steer
>>> traffic to the application's CPU in an automatic fashion, and xps_rxqs
>>> can be used to transmit from the corresponding queues. This setup
>>> doesn't need manual configuration of TCs and is not limited to 4
>>> applications. The difference of ADQ is that (in my understanding) it
>>> moves the application to the RX CPU, while aRFS steers the traffic to
>>> the RX queue handled by the application's CPU. Is there any advantage
>>> of ADQ over aRFS, that I failed to find?
>>
>> aRFS+XPS ties app thread to a cpu,
> 
> Well, not exactly. To pin the app thread to a CPU, one uses
> taskset/sched_setaffinity, while aRFS+XPS pick a queue that corresponds
> to that CPU.

Yes. I should have said that XPS-cpus ties the app thread to a CPU and aRFS
maps that CPU to an rx queue.
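So on the aRFS+XPS side, the application-visible piece is just the CPU
pinning; something like the following (pin_worker_to_cpu() is a made-up
helper, and the cpu argument is whatever core the deployment dedicates to
that worker):

/* Pin one worker thread to a CPU, which is what taskset/sched_setaffinity
 * accomplish; aRFS then steers that worker's flows to the RX queue serviced
 * from that CPU, and XPS picks the matching TX queue. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int pin_worker_to_cpu(pthread_t worker, int cpu)
{
	cpu_set_t set;
	int err;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	err = pthread_setaffinity_np(worker, sizeof(set), &set);
	if (err) {
		fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
		return -1;
	}
	return 0;
}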

> 
>> whereas ADQ ties the app thread to a NAPI
>> ID, which in turn ties to one or more queues.
> 
> So, basically, both technologies result in making NAPI and the app run
> on the same CPU. The difference that I see is that ADQ forces NAPI
> processing (in busy polling) on the app's CPU, while aRFS steers the
> traffic to a queue, whose NAPI runs on the app's CPU. The effect is the
> same, but ADQ requires busy polling. Is my understanding correct?

'busy polling' is not a requirement. It is possible to use ADQ receive 
filters with XPS based on rx queues to make NAPI and the app run on the 
same CPU without busy polling.
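The rx-queue-based XPS knob referred to here is the per-TX-queue xps_rxqs
bitmap in sysfs; roughly like this (device name, queue index and mask are
purely illustrative):

/* Write an RX-queue bitmap to /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
 * so that transmit for a flow follows the RX queue that received it. */
#include <stdio.h>

int set_xps_rxqs(const char *dev, int txq, const char *rxq_mask_hex)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/class/net/%s/queues/tx-%d/xps_rxqs", dev, txq);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", rxq_mask_hex);	/* e.g. "100" -> RX queue 8 */
	fclose(f);
	return 0;
}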

> 
>> ADQ also provides 2 levels of filtering compared to aRFS+XPS. The first
>> level of filtering selects a queue-set associated with the application,
>> and the second-level filter or RSS selects a queue within that queue
>> set associated with an app thread.
> 
> This difference looks important. So, ADQ reserves a dedicated set of
> queues solely for the application use.
> 
>> The current interface to configure ADQ limits us to supporting up to 16
>> application-specific queue sets (TC_MAX_QUEUE).
> 
> From the commit message:
> 
> https://patchwork.ozlabs.org/project/netdev/patch/20180214174539.11392-5-jeffrey.t.kirsher@intel.com/
> 
> I got that i40e supports up to 4 groups. Has this limitation been
> lifted, or are you saying that 16 is the limitation of mqprio, while the
> driver may support fewer? Or is it different for different Intel drivers?

Yes. That patch enables support for ADQ in a VF, and it is currently
limited to 4 queue groups. But 16 is the tc mqprio interface limitation.
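For context, the 16 comes from the mqprio option struct in
include/uapi/linux/pkt_sched.h, which (paraphrased from memory rather than
quoted exactly) looks like:

#include <linux/types.h>

#define TC_QOPT_BITMASK		15
#define TC_QOPT_MAX_QUEUE	16

struct tc_mqprio_qopt {
	__u8	num_tc;				/* number of traffic classes */
	__u8	prio_tc_map[TC_QOPT_BITMASK + 1];	/* priority -> TC map */
	__u8	hw;				/* hardware offload mode */
	__u16	count[TC_QOPT_MAX_QUEUE];	/* queues per TC */
	__u16	offset[TC_QOPT_MAX_QUEUE];	/* first queue of each TC */
};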

> 
>>
>>
>>>
>>> 3. At [1], you mention that ADQ can be used to create separate RSS
>>> sets. Could you elaborate on the API used? Does the tc mqprio
>>> configuration also affect RSS? Can it be turned on/off?
>>
>> Yes. tc mqprio allows creating queue-sets per application, and the
>> driver configures RSS per queue-set.
>>
>>>
>>> 4. How is tc flower used in context of ADQ? Does the user need to
>>> reflect the configuration in both mqprio qdisc (for TX) and tc flower
>>> (for RX)? It looks like tc flower maps incoming traffic to TCs, but
>>> what is the mechanism of mapping TCs to RX queues?
>>
>> tc mqprio is used to map TCs to RX queues
> 
> OK, I got how the configuration works now, thanks! Though I'm not sure
> mqprio is the best API to configure the RX side. I thought it's supposed
> to configure the TX queues. Looks more like a hack to me.
> 
> Ethtool RSS context API (look for "context" in man ethtool) seems more
> appropriate for the RX side for this purpose.

Thanks, we will explore whether ethtool will work for us. We went with mqprio
so that we can configure both TX and RX queue-sets together rather than
splitting the configuration into 2 steps.

> 
> Thanks,
> Max
> 
>> tc flower is used to configure the first-level filter that redirects
>> packets to a queue set associated with an application.
>>
>>>
>>> I really hope you will be able to shed more light on this feature to
>>> increase my awareness of how to use it and to compare it with aRFS.
>>
>> Hope this helps; we will go over it in more detail in our netdev session.
>>
>>>
>>> Thanks,
>>> Max
>>>
>>> [1]:
>>> https://netdevconf.info/0x14/session.html?talk-ADQ-for-system-level-network-io-performance-improvements
>>>
> 
