lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 2 Feb 2024 11:23:28 -0600
From: "Samudrala, Sridhar" <sridhar.samudrala@...el.com>
To: Joe Damato <jdamato@...tly.com>
CC: Eric Dumazet <edumazet@...gle.com>, <netdev@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <chuck.lever@...cle.com>,
	<jlayton@...nel.org>, <linux-api@...r.kernel.org>, <brauner@...nel.org>,
	<davem@...emloft.net>, <alexander.duyck@...il.com>, <kuba@...nel.org>, "Wei
 Wang" <weiwan@...gle.com>, Amritha Nambiar <amritha.nambiar@...el.com>
Subject: Re: [net-next 0/3] Per epoll context busy poll support



On 2/1/2024 9:28 PM, Joe Damato wrote:
> On Tue, Jan 30, 2024 at 12:54:50PM -0600, Samudrala, Sridhar wrote:
>>
>>
>> On 1/24/2024 2:20 AM, Eric Dumazet wrote:
>>> On Wed, Jan 24, 2024 at 3:54 AM Joe Damato <jdamato@...tly.com> wrote:
>>>>
>>>> Greetings:
>>>>
>>>> TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
>>>> epoll with socket fds.") by allowing user applications to enable
>>>> epoll-based busy polling and set a busy poll packet budget on a per epoll
>>>> context basis.
>>>>
>>>> To allow for this, two ioctls have been added for epoll contexts for
>>>> getting and setting a new struct, struct epoll_params.
>>>>
>>>> This makes epoll-based busy polling much more usable for user
>>>> applications than the current system-wide sysctl and hardcoded budget.
>>
>> Agree. looking forward to see this patch series accepted soon.
>>
>>>>
>>>> Longer explanation:
>>>>
>>>> Presently epoll has support for a very useful form of busy poll based on
>>>> the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).
>>>>
>>>> This form of busy poll allows epoll_wait to drive NAPI packet processing
>>>> which allows for a few interesting user application designs which can
>>>> reduce latency and also potentially improve L2/L3 cache hit rates by
>>>> deferring NAPI until userland has finished its work.
>>>>
>>>> The documentation available on this is, IMHO, a bit confusing so please
>>>> allow me to explain how one might use this:
>>>>
>>>> 1. Ensure each application thread has its own epoll instance mapping
>>>> 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
>>>> direct connections with specific dest ports to these queues.
>>>>
>>>> 2. Optionally: Setup IRQ coalescing for the NIC RX queues where busy
>>>> polling will occur. This can help avoid the userland app from being
>>>> pre-empted by a hard IRQ while userland is running. Note this means that
>>>> userland must take care to call epoll_wait and not take too long in
>>>> userland since it now drives NAPI via epoll_wait.
>>>>
>>>> 3. Ensure that all incoming connections added to an epoll instance
>>>> have the same NAPI ID. This can be done with a BPF filter when
>>>> SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single
>>>> accept thread is used which dispatches incoming connections to threads.
>>>>
>>>> 4. Lastly, busy poll must be enabled via a sysctl
>>>> (/proc/sys/net/core/busy_poll).
>>>>
>>>> The unfortunate part about step 4 above is that this enables busy poll
>>>> system-wide which affects all user applications on the system,
>>>> including epoll-based network applications which were not intended to
>>>> be used this way or applications where increased CPU usage for lower
>>>> latency network processing is unnecessary or not desirable.
>>>>
>>>> If the user wants to run one low latency epoll-based server application
>>>> with epoll-based busy poll, but would like to run the rest of the
>>>> applications on the system (which may also use epoll) without busy poll,
>>>> this system-wide sysctl presents a significant problem.
>>>>
>>>> This change preserves the system-wide sysctl, but adds a mechanism (via
>>>> ioctl) to enable or disable busy poll for epoll contexts as needed by
>>>> individual applications, making epoll-based busy poll more usable.
>>>>
>>>
>>> I think this description missed the napi_defer_hard_irqs and
>>> gro_flush_timeout settings ?
>>>
>>> I would think that if an application really wants to make sure its
>>> thread is the only one
>>> eventually calling napi->poll(), we must make sure NIC interrupts stay masked.
>>>
>>> Current implementations of busy poll always release NAPI_STATE_SCHED bit when
>>> returning to user space.
>>>
>>> It seems you want to make sure the application and only the
>>> application calls the napi->poll()
>>> at chosen times.
>>>
>>> Some kind of contract is needed, and the presence of the hrtimer
>>> (currently only driven from dev->@gro_flush_timeout)
>>> would allow to do that correctly.
>>>
>>> Whenever we 'trust' user space to perform the napi->poll shortly, we
>>> also want to arm the hrtimer to eventually detect
>>> the application took too long, to restart the other mechanisms (NIC irq based)
>>>
>>> Note that we added the kthread based napi polling, and we are working
>>> to add a busy polling feature to these kthreads.
>>> allowing to completely mask NIC interrupts and further reduce latencies.
>>
>>
>> Good to know that you are looking into enabling busy polling for napi
>> kthreads.
>> We have something similar in our ice OOT driver that is implemented and we
>> call it 'independent pollers' as in this mode, busy polling will not be app
>> dependent or triggered by an application.
>> Here is a link to the slides we presented at netdev 0x16 driver workshop.
>> https://netdevconf.info/0x16/slides/48/netdev0x16_driver_workshop_ADQ.pdf
>>
>> We haven't yet submitted the patches upstream as there is no kernel
>> interface to configure napi specific timeouts.
>> With the recent per-queue and per-napi netlink APIs that are accepted
>> upstream
>> https://lore.kernel.org/netdev/170147307026.5260.9300080745237900261.stgit@anambiarhost.jf.intel.com/
>>
>> we are thinking of making timeout as a per-napi parameter and can be used as
>> an interface to enable napi kthread based busy polling.
> 
> I know I am replying to a stale thread on the patches I've submit (there is
> a v5 now [1]), but I just looked at your message - sorry I didn't reply
> sooner.
> 
> The per-queue and per-napi netlink APIs look extremely useful, thanks for
> pointing this out.
> 
> In my development tree, I had added SIOCGIFNAME_BY_NAPI_ID which works
> similar to SIOCGIFNAME: it takes a NAPI ID and returns the IF name. This is
> useful on machines with multiple NICs where each NIC could be located in
> one of many different NUMA zones.
> 
> The idea was that apps would use SO_INCOMING_NAPI_ID, distribute the NAPI
> ID to a worker thread which could then use SIOCGIFNAME_BY_NAPI_ID to
> compute which NIC the connection came in on. The app would then (via
> configuration) know where to pin that worker thread; ideally somewhere NUMA
> local to the NIC.
> 
> I had assumed that such a change would be rejected, but I figured I'd send
> an RFC for it after the per epoll context stuff was done and see if anyone
> thought SIOCGIFNAME_BY_NAPI_ID would be useful for them, as well.

I think you should be able to get this functionality via the netdev-genl 
API to get napi parameters. It returns ifindex as one of the parameters 
and you should able to get the name from ifindex.

$ ./cli.py --spec netdev.yaml --do napi-get --json='{"id": 593}'
{'id': 593, 'ifindex': 12, 'irq': 291, 'pid': 3727}

> 
>> I think even the per-device napi_defer_hard_irqs and gro_flush_timeout
>> should become per-napi parameters.
> 
> I agree.
> 
> I had been contemplating implementing this until I tried a different method
> similar to an academic paper I was reading [2][3]. I think per-device
> defer_hard_irqs and gro_flush_timeout would be extremely useful and a
> better approach than the one I'm currently using.
> 
> Is this something you are currently working? I may try implementing this,
> but didn't want to duplicate effort if you are already working on this.

Not yet. Please go ahead and work on it it if you have time.
napi-get and napi-set can be extended to show and set these parameters.

> 
> Thanks,
> Joe
> 
> [1]: https://lore.kernel.org/all/20240131180811.23566-1-jdamato@fastly.com/
> [2]: https://dl.acm.org/doi/pdf/10.1145/3626780
> [3]: https://gitlab.uwaterloo.ca/p5cai/netstack-exp/-/blob/master/kernel-polling-5.15.79-base.patch?ref_type=heads

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ