lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <66f7bc8a-e4a9-4912-9ea7-c88dbb6fb999@gmail.com>
Date: Wed, 31 Jul 2024 02:00:11 +0100
From: Pavel Begunkov <asml.silence@...il.com>
To: Olivier Langlois <olivier@...llion01.com>, io-uring@...r.kernel.org
Cc: netdev@...r.kernel.org
Subject: Re: io_uring NAPI busy poll RCU is causing 50 context switches/second
 to my sqpoll thread

On 7/31/24 01:33, Pavel Begunkov wrote:
> On 7/31/24 00:14, Olivier Langlois wrote:
>> On Tue, 2024-07-30 at 21:25 +0100, Pavel Begunkov wrote:
>>>
>>> Removing an entry or two once every minute is definitely not
>>> going to take 50% CPU, RCU machinery is running in background
>>> regardless of whether io_uring uses it or not, and it's pretty
>>> cheap considering ammortisation.
>>>
>>> If anything it more sounds from your explanation like the
>>> scheduler makes a wrong decision and schedules out the sqpoll
>>> thread even though it could continue to run, but that's need
>>> a confirmation. Does the CPU your SQPOLL is pinned to stays
>>> 100% utilised?
>>
>> Here are the facts as they are documented in the github issue.
>>
>> 1. despite thinking that I was doing NAPI busy polling, I was not
>> because my ring was not receiving any sqe after its initial setup.
>>
>> This is what the patch developped with your input
>> https://lore.kernel.org/io-uring/382791dc97d208d88ee31e5ebb5b661a0453fb79.1722374371.git.olivier@trillion01.com/T/#u
>>
>> is addressing
>>
>> (BTW, I should check if there is such a thing, but I would love to know
>> if the net code is exposing a tracepoint when napi_busy_poll is called
>> because it is very tricky to know if it is done for real or not)
>>
>> 2. the moment a second ring has been attached to the sqpoll thread that
>> was receving a lot of sqe, the NAPI busy loop started to be made for
>> real and the sqpoll cpu usage unexplicably dropped from 99% to 55%
>>
>> 3. here is my kernel cmdline:
>> hugepages=72 isolcpus=0,1,2 nohz_full=0,1,2 rcu_nocbs=0,1,2
>> rcu_nocb_poll irqaffinity=3 idle=nomwait processor.max_cstate=1
>> intel_idle.max_cstate=1 nmi_watchdog=0
>>
>> there is absolutely nothing else on CPU0 where the sqpoll thread
>> affinity is set to run.
>>
>> 4. I got the idea of doing this:
>> echo common_pid == sqpoll_pid > /sys/kernel/tracing/events/sched/filter
>> echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
>>
>> and I have recorded over 1,000 context switches in 23 seconds with RCU
>> related kernel threads.
>>
>> 5. just for the fun of checking out, I have disabled NAPI polling on my
>> io_uring rings and the sqpoll thread magically returned to 99% CPU
>> usage from 55%...
>>
>> I am open to other explanations for what I have observed but my current
>> conclusion is based on what I am able to see... the evidence appears
>> very convincing to me...
> 
> You're seeing something that doesn't make much sense to me, and we need
> to understand what that is. There might be a bug _somewhere_, that's
> always a possibility, but before saying that let's get a bit more data.
> 
> While the app is working, can you grab a profile and run mpstat for the
> CPU on which you have the SQPOLL task?
> 
> perf record -g -C <CPU number> --all-kernel &
> mpstat -u -P <CPU number> 5 10 &
> 
> And then as usual, time it so that you have some activity going on,
> mpstat interval may need adjustments, and perf report it as before

I forgot to add, ~50 switches/second for relatively brief RCU handling
is not much, not enough to take 50% of a CPU. I wonder if sqpoll was
still running but napi busy polling time got accounted to softirq
because of disabled bh and you didn't include it, hence asking CPU
stats. Do you see any latency problems for that configuration?

-- 
Pavel Begunkov

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ