Message-ID: <b35fe4bf-25d7-41cd-90c9-f68e1819cded@uwaterloo.ca>
Date: Tue, 25 Mar 2025 13:47:38 -0400
From: Martin Karsten <mkarsten@...terloo.ca>
To: Samiullah Khawaja <skhawaja@...gle.com>
Cc: Jakub Kicinski <kuba@...nel.org>, "David S . Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Paolo Abeni <pabeni@...hat.com>, almasrymina@...gle.com, willemb@...gle.com,
 jdamato@...tly.com, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v4 0/4] Add support to do threaded napi busy poll

On 2025-03-25 12:40, Samiullah Khawaja wrote:
> On Sun, Mar 23, 2025 at 7:38 PM Martin Karsten <mkarsten@...terloo.ca> wrote:
>>
>> On 2025-03-20 22:15, Samiullah Khawaja wrote:
>>> Extend the existing threaded napi poll support to do continuous
>>> busy polling.
>>>
>>> This is used to continuously poll a napi and fetch descriptors from the
>>> backing RX/TX queues for low-latency applications. Allow enabling
>>> threaded busypoll via netlink so it can be turned on for a set of
>>> dedicated napis serving low-latency applications.
>>>
>>> Once enabled, the user can fetch the PID of the kthread doing the NAPI
>>> polling and set its affinity, priority and scheduling policy according
>>> to the low-latency requirements.
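
For illustration only, a minimal sketch of that tuning step, assuming the
kthread's PID has already been looked up (for example from its
napi/<dev>-<id> thread name), might look like this:

        /* Minimal sketch (not from the series): pin a NAPI kthread to one CPU
         * and give it a real-time priority. The PID lookup is assumed to have
         * happened elsewhere. */
        #define _GNU_SOURCE
        #include <sched.h>
        #include <sys/types.h>

        static int pin_and_prioritize(pid_t napi_pid, int cpu, int rt_prio)
        {
                cpu_set_t set;
                struct sched_param sp = { .sched_priority = rt_prio };

                CPU_ZERO(&set);
                CPU_SET(cpu, &set);
                if (sched_setaffinity(napi_pid, sizeof(set), &set))
                        return -1;
                /* SCHED_FIFO keeps the poller from being preempted by normal tasks. */
                return sched_setscheduler(napi_pid, SCHED_FIFO, &sp);
        }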
>>>
>>> Currently threaded napi can only be enabled at the device level using
>>> sysfs. Add support to enable/disable threaded mode for each napi
>>> individually via the netlink interface, by extending the `napi-set` op
>>> in the netlink spec to allow setting the `threaded` attribute of a napi.
>>>
>>> Extend the threaded attribute in the napi struct with an option to
>>> enable continuous busy polling. Extend the netlink and sysfs interfaces
>>> to allow enabling/disabling threaded busypolling at the device or
>>> individual napi level.
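
As a point of reference, device-level threaded NAPI can already be toggled
through the existing sysfs attribute; a minimal sketch is below (the
additional value for busy-poll mode is what this series proposes and is
deliberately not hard-coded here):

        /* Minimal sketch: write the per-device "threaded" sysfs attribute.
         * Today "0" disables and "1" enables threaded NAPI; the busy-poll
         * mode added by this series would be another accepted value. */
        #include <stdio.h>

        static int set_threaded(const char *ifname, const char *mode)
        {
                char path[256];
                FILE *f;

                snprintf(path, sizeof(path), "/sys/class/net/%s/threaded", ifname);
                f = fopen(path, "w");
                if (!f)
                        return -1;
                fputs(mode, f);
                return fclose(f);
        }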
>>>
>>> We use this for our AF_XDP based hard low-latency use case with
>>> microsecond-level latency requirements. For this use case we want low
>>> jitter and stable latency at P99.
>>>
>>> The following is an analysis and comparison of the available (and
>>> compatible) busy-poll interfaces for a low-latency use case with a
>>> stable P99. Please note that throughput and CPU efficiency are
>>> non-goals.
>>>
>>> For the analysis we use an AF_XDP based benchmarking tool, `xdp_rr`.
>>> The tool tries to simulate the real workload as follows:
>>>
>>> - It sends UDP packets between 2 machines.
>>> - The client machine sends packets at a fixed frequency. To maintain
>>>     that frequency, we use open-loop sampling: the packets are sent
>>>     from a separate thread (see the pacing sketch below).
>>> - The server replies to each packet inline, reading it from the RX
>>>     ring and sending the reply via the TX ring.
>>> - To simulate the application processing time, we use a configurable
>>>     delay in usecs on the client side after a reply is received from
>>>     the server.
>>>
>>> The xdp_rr tool is posted separately as an RFC for tools/testing/selftests.
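
For illustration, this is not the actual xdp_rr implementation, but a
minimal sketch of the open-loop pacing idea: the sender fires at fixed
absolute deadlines, so a slow reply never delays the next transmission.

        /* Sketch of open-loop sampling (illustrative, not from xdp_rr):
         * send at fixed absolute deadlines regardless of reply timing. */
        #include <time.h>

        static void paced_send_loop(long period_ns, int count,
                                    void (*send_one)(void))
        {
                struct timespec next;

                clock_gettime(CLOCK_MONOTONIC, &next);
                for (int i = 0; i < count; i++) {
                        next.tv_nsec += period_ns;
                        while (next.tv_nsec >= 1000000000L) {
                                next.tv_nsec -= 1000000000L;
                                next.tv_sec++;
                        }
                        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
                        send_one();     /* transmit one request packet */
                }
        }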
>>
>> Thanks very much for sending the benchmark program and these specific
>> experiments. I am able to build the tool and run the experiments in
>> principle. While I don't have a complete picture yet, one observation
>> seems already clear, so I want to report back on it.
> Thanks for reproducing this, Martin. I really appreciate your review
> and your interest in this work.
>>
>>> We use this tool with the following napi polling configurations:
>>>
>>> - Interrupts only
>>> - SO_BUSYPOLL (inline in the same thread where the client receives the
>>>     packet).
>>> - SO_BUSYPOLL (separate thread and separate core)
>>> - Threaded NAPI busypoll
>>
>> The configurations that you describe as SO_BUSYPOLL here are not the
>> best busy-polling setup. The best busy-polling strictly alternates
>> between application processing and network polling; no asynchronous
>> processing due to hardware irq delivery or softirq handling should
>> happen.
>>
>> A high-level check is making sure that no softirq processing is reported
>> for the relevant cores (see, e.g., "%soft" in sar -P <cores> -u ALL 1).
>> In addition, interrupts can be counted in /proc/stat or /proc/interrupts.
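
For example, a small helper that prints the NET_RX row of /proc/softirqs
before and after a run makes it easy to verify that the per-CPU counters
stay flat on the polling cores (a sketch, not part of the benchmark):

        /* Minimal sketch: print the NET_RX softirq counters so per-CPU
         * deltas can be compared across a benchmark run. */
        #include <stdio.h>
        #include <string.h>

        static void dump_net_rx_softirqs(void)
        {
                char line[8192];
                FILE *f = fopen("/proc/softirqs", "r");

                if (!f)
                        return;
                while (fgets(line, sizeof(line), f))
                        if (strstr(line, "NET_RX:"))
                                fputs(line, stdout);
                fclose(f);
        }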
>>
>> Unfortunately it is not always straightforward to enter this pattern. In
>> this particular case, it seems that two pieces are missing:
>>
>> 1) Because the XDP socket is created with XDP_COPY, it is never marked
>> with its corresponding napi_id. Without the socket being marked with a
>> valid napi_id, sk_busy_loop (called from __xsk_recvmsg) never invokes
>> napi_busy_loop. Instead the gro_flush_timeout/napi_defer_hard_irqs
>> softirq loop controls packet delivery.
> Nice catch. It seems a recent change broke busy polling for AF_XDP;
> there was a fix for XDP_ZEROCOPY, but XDP_COPY remained broken, and it
> seems my experiments didn't pick that up. During my experimentation I
> confirmed through perf traces that all experiment modes invoke
> busypoll and do not go through softirqs. I sent out a fix for XDP_COPY
> busy polling in the link below. I will resend it for net since the
> original commit has already landed in 6.13.
> https://lore.kernel.org/netdev/CAAywjhSEjaSgt7fCoiqJiMufGOi=oxa164_vTfk+3P43H60qwQ@mail.gmail.com/T/#t
>>
>> I found code at the end of xsk_bind in xsk.c that is conditional on xs->zc:
>>
>>          if (xs->zc && qid < dev->real_num_rx_queues) {
>>                  struct netdev_rx_queue *rxq;
>>
>>                  rxq = __netif_get_rx_queue(dev, qid);
>>                  if (rxq->napi)
>>                          __sk_mark_napi_id_once(sk, rxq->napi->napi_id);
>>          }
>>
>> I am not an expert on XDP sockets, so I don't know why that is or what
>> would be an acceptable workaround/fix, but when I simply remove the
>> check for xs->zc, the socket is being marked and napi_busy_loop is being
>> called. But maybe there's a better way to accomplish this.
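
Concretely, that experiment amounts to something like the following sketch
(shown only to make the workaround explicit; the xs->zc condition may well
exist for a good reason, so this is not a proposed fix):

        /* Experimental sketch: mark the napi_id in copy mode as well by
         * dropping the xs->zc condition from the snippet above. Untested
         * as a proper fix. */
        if (qid < dev->real_num_rx_queues) {
                struct netdev_rx_queue *rxq;

                rxq = __netif_get_rx_queue(dev, qid);
                if (rxq->napi)
                        __sk_mark_napi_id_once(sk, rxq->napi->napi_id);
        }
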
> +1
>>
>> 2) SO_PREFER_BUSY_POLL needs to be set on the XDP socket to make sure
>> that busy polling stays in control after napi_busy_loop, regardless of
>> how many packets were found. Without this setting, the gro_flush_timeout
>> timer is not extended in busy_poll_stop.
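
For completeness, the socket setup meant here is roughly the following
sketch (SO_PREFER_BUSY_POLL and SO_BUSY_POLL_BUDGET require reasonably
recent uapi/libc headers, and the time/budget values are purely
illustrative):

        /* Minimal sketch: opt the socket into busy polling and ask the
         * kernel to keep irq deferral armed after busy_poll_stop().
         * Values are illustrative, not tuned. */
        #include <sys/socket.h>

        static int enable_prefer_busy_poll(int fd)
        {
                int on = 1;
                int usecs = 200;     /* SO_BUSY_POLL time, illustrative */
                int budget = 64;     /* SO_BUSY_POLL_BUDGET, illustrative */

                if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, &on, sizeof(on)))
                        return -1;
                if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)))
                        return -1;
                return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
                                  &budget, sizeof(budget));
        }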
>>
>> With these two changes, both SO_BUSYPOLL alternatives perform noticeably
>> better in my experiments and come closer to Threaded NAPI busypoll, so I
>> was wondering if you could try that in your environment? While this
>> might not change the big picture, I think it's important to fully
>> understand and document the trade-offs.
> I agree. In my experiments SO_BUSYPOLL works properly; please see the
> fix I mentioned above. But I will experiment with SO_PREFER_BUSY_POLL
> to see whether it makes any significant difference.

I'd like to clarify: Your original experiments cannot have used 
busypoll, because it was broken for XDP_COPY. Did you rerun the 
experiments with the XDP_COPY fix but without SO_PREFER_BUSY_POLL and 
see the same latency numbers as before? Also, can you provide more 
details about the perf tracing that you used to see that busypoll is 
invoked, but softirq is not?

Thanks,
Martin

