netdev - Re: ADQ - comparison to aRFS, clarifications on NAPI ID, binding with busy-polling

Message-ID: <AM6PR05MB5974D512D3205C247B07D0C7D1930@AM6PR05MB5974.eurprd05.prod.outlook.com>
Date:   Fri, 26 Jun 2020 12:48:06 +0000
From:   Maxim Mikityanskiy <maximmi@...lanox.com>
To:     "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
CC:     Amritha Nambiar <amritha.nambiar@...el.com>,
        Kiran Patil <kiran.patil@...el.com>,
        Alexander Duyck <alexander.h.duyck@...el.com>,
        Eric Dumazet <edumazet@...gle.com>,
        Tom Herbert <tom@...bertland.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: ADQ - comparison to aRFS, clarifications on NAPI ID, binding with
 busy-polling

Thanks a lot for your reply! It was really helpful. I have a few 
comments, please see below.

On 2020-06-24 23:21, Samudrala, Sridhar wrote:
> 
> 
> On 6/17/2020 6:15 AM, Maxim Mikityanskiy wrote:
>> Hi,
>>
>> I discovered the Intel ADQ feature [1], which boosts performance by 
>> dedicating queues to application traffic. We did some research, and I 
>> now have a rough understanding of how it works, but I still have some 
>> questions that I hope you can answer.
>>
>> 1. SO_INCOMING_NAPI_ID usage. In my understanding, every connection 
>> has a key (sk_napi_id) that is unique to the NAPI where this 
>> connection is handled, and the application uses that key to choose a 
>> handler thread from the thread pool. If we have a one-to-one 
>> relationship between application threads and NAPI IDs of connections, 
>> each application thread will handle only traffic from a single NAPI. 
>> Is my understanding correct?
> 
> Yes. It is correct and recommended with the current implementation.
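
To make the one-to-one mapping concrete, here is a minimal Python sketch 
of how I picture the dispatch. It is only an illustration under 
assumptions: the fallback value 56 for SO_INCOMING_NAPI_ID is the 
asm-generic one (some architectures use a different number), and 
NapiDispatcher / worker_for are hypothetical names of mine, not an 
existing API.

```python
import socket

# Assumption: 56 is the asm-generic value of SO_INCOMING_NAPI_ID; it is used
# only when Python's socket module does not export the constant.
SO_INCOMING_NAPI_ID = getattr(socket, "SO_INCOMING_NAPI_ID", 56)


def incoming_napi_id(sock):
    """Return the NAPI ID recorded for the socket, or None if unavailable."""
    try:
        napi_id = sock.getsockopt(socket.SOL_SOCKET, SO_INCOMING_NAPI_ID)
    except OSError:
        # E.g. a kernel built without CONFIG_NET_RX_BUSY_POLL.
        return None
    # 0 means no NAPI ID has been recorded for this socket yet.
    return napi_id or None


class NapiDispatcher:
    """Hypothetical helper: maps each NAPI ID to one worker thread index."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self._worker_of = {}

    def worker_for(self, napi_id):
        if napi_id is None:
            return 0  # fallback worker for sockets without a NAPI ID
        if napi_id not in self._worker_of:
            # First time this NAPI ID is seen: assign the next worker.
            self._worker_of[napi_id] = len(self._worker_of) % self.num_workers
        return self._worker_of[napi_id]
```

The acceptor thread would call incoming_napi_id() on each accepted 
socket and hand the connection to the worker returned by worker_for(). 
With more NAPI IDs than workers, the mapping wraps around and is no 
longer one-to-one.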
> 
>>
>> 1.1. I wonder how the application thread gets scheduled on the same 
>> core that NAPI runs at. It currently only works with busy_poll, so 
>> when the application initiates busy polling (calls epoll), does the 
>> Linux scheduler move the thread to the right CPU? Do we have to have a 
>> strict one-to-one relationship between threads and NAPIs, or can one 
>> thread handle multiple NAPIs? When the data arrives, does the 
>> scheduler run the application thread on the same CPU that NAPI ran on?
> 
> The app thread can do busypoll from any core and there is no requirement
> that the scheduler needs to move the thread to a specific CPU.
> 
> If the NAPI processing happens via interrupts, the scheduler could move
> the app thread to the same CPU that NAPI ran on.
> 
>>
>> 1.2. I see that SO_INCOMING_NAPI_ID is tightly coupled with busy_poll. 
>> It is enabled only if CONFIG_NET_RX_BUSY_POLL is set. Is there a real 
>> reason why it can't be used without busy_poll? In other words, if we 
>> modify the kernel to drop this requirement, will the kernel still 
>> schedule the application thread on the same CPU as NAPI when busy_poll 
>> is not used?
> 
> It should be OK to remove this restriction, but requires enabling this 
> in skb_mark_napi_id() and sk_mark_napi_id() too.
> 
>>
>> 2. Can you compare ADQ to aRFS+XPS? aRFS provides a way to steer 
>> traffic to the application's CPU in an automatic fashion, and xps_rxqs 
>> can be used to transmit from the corresponding queues. This setup 
>> doesn't need manual configuration of TCs and is not limited to 4 
>> applications. The difference of ADQ is that (in my understanding) it 
>> moves the application to the RX CPU, while aRFS steers the traffic to 
>> the RX queue handled by the application's CPU. Is there any advantage 
>> of ADQ over aRFS, that I failed to find?
> 
> aRFS+XPS ties app thread to a cpu,

Well, not exactly. To pin the app thread to a CPU, one uses 
taskset/sched_setaffinity, while aRFS+XPS pick a queue that corresponds 
to that CPU.
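
A minimal sketch of the pinning step I mean, assuming Linux and Python's 
os.sched_setaffinity wrapper (pin_to_cpu is a hypothetical helper name). 
It only restricts where the thread may run; picking the queue is still 
left to aRFS+XPS.

```python
import os


def pin_to_cpu(cpu):
    """Pin the caller to a single CPU and return the resulting affinity set."""
    # pid 0 targets the caller; this wraps sched_setaffinity(2) on Linux.
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)
```

From the shell, taskset achieves the same for an already running 
process.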

> whereas ADQ ties app thread to a napi 
> id which in turn ties to a queue(s)

So, basically, both technologies result in making NAPI and the app run 
on the same CPU. The difference that I see is that ADQ forces NAPI 
processing (in busy polling) on the app's CPU, while aRFS steers the 
traffic to a queue, whose NAPI runs on the app's CPU. The effect is the 
same, but ADQ requires busy polling. Is my understanding correct?

> ADQ also provides 2 levels of filtering compared to aRFS+XPS. The first
> level of filtering selects a queue-set associated with the application
> and the second level filter or RSS will select a queue within that queue
> set associated with an app thread.

This difference looks important. So, ADQ reserves a dedicated set of 
queues solely for the application's use.

> The current interface to configure ADQ limits us to support up to 16
> application specific queue sets (TC_MAX_QUEUE)

In the commit message at

https://patchwork.ozlabs.org/project/netdev/patch/20180214174539.11392-5-jeffrey.t.kirsher@intel.com/

I read that i40e supports up to 4 groups. Has this limitation been 
lifted, or are you saying that 16 is the limit of mqprio, while the 
driver may support fewer? Or does it differ between Intel drivers?

> 
> 
>>
>> 3. At [1], you mention that ADQ can be used to create separate RSS 
>> sets. Could you elaborate on the API used? Does the tc mqprio 
>> configuration also affect RSS? Can it be turned on/off?
> 
> Yes. tc mqprio allows to create queue-sets per application and the
> driver configures RSS per queue-set.
> 
>>
>> 4. How is tc flower used in context of ADQ? Does the user need to 
>> reflect the configuration in both mqprio qdisc (for TX) and tc flower 
>> (for RX)? It looks like tc flower maps incoming traffic to TCs, but 
>> what is the mechanism of mapping TCs to RX queues?
> 
> tc mqprio is used to map TCs to RX queues

OK, now I understand how the configuration works, thanks! Though I'm not 
sure mqprio is the best API to configure the RX side. I thought it was 
meant to configure the TX queues, so using it for RX looks more like a 
hack to me.
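
For the record, this is the configuration as I understand it from this 
thread, written as a small Python helper that only builds the command 
lines (it does not run them). The device name, queue counts, address and 
port are made-up examples, and mqprio_cmd / flower_cmd are hypothetical 
helper names. The tc options follow the syntax I know from the Intel 
driver documentation, so treat this as a sketch rather than a reference.

```python
TC_MAX_QUEUE = 16  # upper bound on queue sets mentioned in this thread


def mqprio_cmd(dev, queues_per_tc):
    """Build a tc command creating one queue set (TC) per list entry."""
    if not 0 < len(queues_per_tc) <= TC_MAX_QUEUE:
        raise ValueError("number of TCs must be between 1 and 16")
    specs, offset = [], 0
    for count in queues_per_tc:
        specs.append(f"{count}@{offset}")  # count@first_queue
        offset += count
    num_tc = len(queues_per_tc)
    prio_map = [str(tc) for tc in range(num_tc)]  # priority i -> TC i
    return ["tc", "qdisc", "add", "dev", dev, "root", "mqprio",
            "num_tc", str(num_tc), "map", *prio_map,
            "queues", *specs, "hw", "1", "mode", "channel"]


def flower_cmd(dev, dst_ip, dst_port, hw_tc):
    """Build a tc flower filter steering one TCP flow spec to a TC.

    Assumes an ingress qdisc already exists on the device
    (tc qdisc add dev <dev> ingress).
    """
    return ["tc", "filter", "add", "dev", dev, "protocol", "ip",
            "parent", "ffff:", "prio", "1", "flower",
            "dst_ip", f"{dst_ip}/32", "ip_proto", "tcp",
            "dst_port", str(dst_port), "skip_sw", "hw_tc", str(hw_tc)]
```

For example, mqprio_cmd("eth0", [2, 4, 4]) describes three queue sets 
starting at queues 0, 2 and 6, and flower_cmd("eth0", "192.168.1.1", 
5001, 1) sends that application's traffic to the second set.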

The ethtool RSS context API (look for "context" in man ethtool) seems 
more appropriate for configuring the RX side.
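
Roughly what I have in mind, again as a Python helper that only builds 
the command lines. This is a sketch under assumptions: the device, queue 
count and port are examples, rss_context_cmds is a hypothetical name, 
and whether these options are accepted depends on the driver and the 
ethtool version. The context ID is reported by ethtool when the context 
is created, so here it is simply passed in.

```python
def rss_context_cmds(dev, num_queues, dst_port, context_id):
    """Build ethtool commands for a dedicated RSS context.

    context_id is assumed to be the ID that ethtool printed when the
    first command created the context.
    """
    ctx = str(context_id)
    return [
        # 1. Create an additional RSS context.
        ["ethtool", "-X", dev, "hfunc", "toeplitz", "context", "new"],
        # 2. Spread that context evenly over the first num_queues queues.
        ["ethtool", "-X", dev, "equal", str(num_queues), "context", ctx],
        # 3. Steer the application's TCP flows into the context.
        ["ethtool", "-N", dev, "flow-type", "tcp4",
         "dst-port", str(dst_port), "context", ctx],
    ]
```

Compared to mqprio, this touches only the RX side and leaves the TX 
queue layout alone.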

Thanks,
Max

> tc flower is used to configure the first level of filter to redirect
> packets to a queue set associated with an application.
> 
>>
>> I really hope you will be able to shed more light on this feature to 
>> increase my awareness on how to use it and to compare it with aRFS.
> 
> Hope this helps and we will go over in more detail in our netdev session.
> 
>>
>> Thanks,
>> Max
>>
>> [1]: 
>> https://netdevconf.info/0x14/session.html?talk-ADQ-for-system-level-network-io-performance-improvements 
>>