Date: Tue, 4 Jun 2024 14:59:21 +0100
From: Hongyan Xia <hongyan.xia2@....com>
To: Peter Zijlstra <peterz@...radead.org>, Luis Machado <luis.machado@....com>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
 dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, bristot@...hat.com, vschneid@...hat.com,
 linux-kernel@...r.kernel.org, kprateek.nayak@....com,
 wuyun.abel@...edance.com, tglx@...utronix.de, efault@....de, nd
 <nd@....com>, John Stultz <jstultz@...gle.com>
Subject: Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

Hi Peter,

On 04/06/2024 11:11, Peter Zijlstra wrote:
> On Mon, Jun 03, 2024 at 08:30:43PM +0100, Luis Machado wrote:
> 
>> Exchanging some information with Hongyan today, he was a bit suspicious of the uclamp
>> behavior with the eevdf complete series applied.
>>
>> Checking the uclamp code, I see we have some refcounting tied to enqueuing/dequeuing
>> of tasks, and the uclamp values are organized in buckets.
>>
>> Just for fun I added a few trace_printk's in uclamp_eff_value, uclamp_rq_inc_id and
>> uclamp_rq_dec_id.
>>
>> Booting up the system with delayed_dequeue disabled and running the benchmark, I
>> see the uclamp bucket management pretty stable. Tasks get added to the uclamp
>> buckets but then get removed. At the end of the benchmark, the uclamp buckets
>> are (almost always) clean of tasks.
>>
>> Enabling delayed dequeue, I can see the uclamp buckets slowly filling up with
>> tasks. At the end of the benchmark, I see uclamp buckets with 30, 40 or 50
>> tasks still. If I do another run, I can see 80, 100 tasks still.
>>
>> I suspect refcounting might be going wrong somewhere due to delayed dequeue
>> tasks, but that's more of a guess right now. Hopefully that is useful
>> information. I'll resume investigation tomorrow.
> 
> Thank you both!!
> 
> Does the below help?
> 
> Note how dequeue_task() does uclamp_rq_dec() unconditionally, which is
> then not balanced in the case below.
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3664,6 +3664,7 @@ static int ttwu_runnable(struct task_str
>   			/* mustn't run a delayed task */
>   			SCHED_WARN_ON(task_on_cpu(rq, p));
>   			enqueue_task(rq, p, ENQUEUE_DELAYED);
> +			uclamp_rq_inc(rq, p);
>   		}
>   		if (!task_on_cpu(rq, p)) {
>   			/*

Further communication with Luis prompted me to dig more into the uclamp 
bucket leaks.

I added a separate uclamp_on_rq variable to task_struct and a WARN for 
when the state doesn't match. The thing is, without your fix above, the 
first WARN fires in uclamp_rq_inc(), not dec(), which means a task that 
is already accounted in rq uclamp is somehow added again by mistake.
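
For reference, the instrumentation is roughly the below (a sketch, not 
the exact diff I am running; uclamp_on_rq is just a debug field I added 
to task_struct, and the placement of the checks is simply where I put 
them locally):

static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
{
	/* ... existing per-clamp-id bucket accounting ... */

	/* warn if p is already accounted in the rq uclamp buckets */
	SCHED_WARN_ON(p->uclamp_on_rq);
	p->uclamp_on_rq = 1;
}

static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
{
	/* ... existing per-clamp-id bucket accounting ... */

	/* warn if p is not currently accounted */
	SCHED_WARN_ON(!p->uclamp_on_rq);
	p->uclamp_on_rq = 0;
}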

Because this is a double inc(), your fix won't address this problem. I 
have included the trace from when the double inc() happened below, in 
case you can spot something.

Juno-r2-Debian login: [  176.010780] ------------[ cut here ]------------
[  176.015414] WARNING: CPU: 5 PID: 0 at kernel/sched/core.c:1594 uclamp_rq_inc+0x188/0x1a0
[  176.023539] Modules linked in:
[  176.026602] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 6.10.0-rc1-ga946a2b48119-dirty #1034
[  176.034882] Hardware name: ARM Juno development board (r2) (DT)
[  176.040815] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  176.047796] pc : uclamp_rq_inc+0x188/0x1a0
[  176.051903] lr : enqueue_task+0x48/0x70
[  176.055752] sp : ffffffc081fb3de0
[  176.059073] x29: ffffffc081fb3de0 x28: 0000000000000000 x27: ffffff8800342300
[  176.066233] x26: ffffff88084311b8 x25: ffffffc081ed2000 x24: ffffffc081ca7008
[  176.073388] x23: 0000000000000000 x22: 00000000000000c0 x21: ffffff897efbbf40
[  176.080543] x20: ffffff8808431180 x19: 0000000000000049 x18: 0000000000000408
[  176.087700] x17: ffffffc8fd301000 x16: ffffffc081fb0000 x15: 000000000000afc8
[  176.094857] x14: 0000000000000001 x13: 0000000000000004 x12: 000000020c8392c2
[  176.102017] x11: 0000000000000800 x10: 0000000000000800 x9 : 0000000000000000
[  176.109177] x8 : 0000000000000001 x7 : 0000000000000021 x6 : ffffffffffffff4c
[  176.116337] x5 : 0000000000007e90 x4 : 00000000000fc33c x3 : 00000000000139b4
[  176.123487] x2 : 0000000000000049 x1 : ffffff8808431180 x0 : ffffff897efbbf40
[  176.130639] Call trace:
[  176.133089]  uclamp_rq_inc+0x188/0x1a0
[  176.136849]  activate_task+0x48/0x60
[  176.140433]  ttwu_do_activate+0x80/0x22c
[  176.144364]  sched_ttwu_pending+0x9c/0x120
[  176.148470]  __flush_smp_call_function_queue+0x3d0/0x568
[  176.153792]  generic_smp_call_function_single_interrupt+0x10/0x18
[  176.159902]  do_handle_IPI+0xa0/0x230
[  176.163577]  ipi_handler+0x18/0x24
[  176.166987]  handle_percpu_devid_irq+0x80/0x204
[  176.171532]  generic_handle_domain_irq+0x28/0x40
[  176.176161]  gic_handle_irq+0x58/0xd0
[  176.179833]  call_on_irq_stack+0x24/0x34
[  176.183764]  el1_interrupt+0x70/0xbc
[  176.187348]  el1h_64_irq_handler+0x14/0x1c
[  176.191449]  el1h_64_irq+0x64/0x68
[  176.194857]  arch_local_irq_enable+0x4/0x8
[  176.198960]  cpuidle_enter+0x34/0x48
[  176.202547]  do_idle+0x1c4/0x234
[  176.205786]  cpu_startup_entry+0x30/0x34
[  176.209723]  secondary_start_kernel+0x124/0x144
[  176.214267]  __secondary_switched+0x6c/0x70
[  176.218458] ---[ end trace 0000000000000000 ]---

Or, do you mean to add a uclamp_rq_dec() instead? But even if I skip the 
uclamp_rq_inc() at that line (which is equivalent to doing 
uclamp_rq_dec() after enqueue_task(); see the sketch after the trace 
below), I still get a WARN like this:

Juno-r2-Debian login: [   77.183511] ------------[ cut here ]------------
[   77.188157] WARNING: CPU: 2 PID: 114 at kernel/sched/core.c:1594 uclamp_rq_inc+0x188/0x1a0
[   77.196455] Modules linked in:
[   77.199519] CPU: 2 PID: 114 Comm: kworker/u25:3 Not tainted 6.10.0-rc1-ga946a2b48119-dirty #1035
[   77.208325] Hardware name: ARM Juno development board (r2) (DT)
[   77.214255] Workqueue: rpciod rpc_async_schedule
[   77.218893] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   77.225873] pc : uclamp_rq_inc+0x188/0x1a0
[   77.229979] lr : enqueue_task+0x44/0x7c
[   77.233826] sp : ffffffc082af3c80
[   77.237143] x29: ffffffc082af3c80 x28: ffffffc081cd6000 x27: 0000000000000402
[   77.244305] x26: ffffffc8fd285000 x25: ffffffc081cbaf40 x24: ffffff8800f92300
[   77.251466] x23: ffffff897ef3ff40 x22: 00000000000000c0 x21: ffffff897ef3ff40
[   77.258625] x20: ffffff88028f3480 x19: 0000000000000009 x18: 0000000000000000
[   77.265784] x17: 0000000000000000 x16: 0000000000000001 x15: 00000000000003ae
[   77.272943] x14: 000000000003eca5 x13: 000000000003eca5 x12: 00000000000000eb
[   77.280102] x11: 0000000000000000 x10: 0000000000001800 x9 : 0000000000000000
[   77.287261] x8 : 0000000000000001 x7 : 0000000000000000 x6 : 000000000000003f
[   77.294420] x5 : 0000000000000311 x4 : 0000000000000113 x3 : 000000000000017f
[   77.301578] x2 : 0000000000000009 x1 : ffffff88028f3480 x0 : ffffff897ef3ff40
[   77.308738] Call trace:
[   77.311185]  uclamp_rq_inc+0x188/0x1a0
[   77.314944]  activate_task+0x48/0x60
[   77.318530]  ttwu_do_activate+0x80/0x22c
[   77.322464]  try_to_wake_up+0x1e8/0x50c
[   77.326311]  wake_up_process+0x14/0x1c
[   77.330070]  kick_pool+0xb8/0x144
[   77.333395]  process_one_work+0x104/0x3a4
[   77.337418]  worker_thread+0x2b8/0x424
[   77.341176]  kthread+0xf8/0x194
[   77.344327]  ret_from_fork+0x10/0x20
[   77.347913] ---[ end trace 0000000000000000 ]---

I pasted the trace here in case it is useful.
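
To be concrete about the variant mentioned above, what I tried is 
roughly the below on top of your patch (balancing the enqueue with a 
dec instead of adding another inc, which should have the same net 
effect on the bucket counts as skipping the inc):

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ ttwu_runnable():
 			/* mustn't run a delayed task */
 			SCHED_WARN_ON(task_on_cpu(rq, p));
 			enqueue_task(rq, p, ENQUEUE_DELAYED);
-			uclamp_rq_inc(rq, p);
+			uclamp_rq_dec(rq, p);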

Hongyan
