linux-kernel - Re: [PATCH v5 7/7] sched/fair: Fair server interface

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3f002616-0975-49b8-a2bf-04abd0446b95@redhat.com>
Date: Mon, 22 Jan 2024 15:14:23 +0100
From: Daniel Bristot de Oliveira <bristot@...hat.com>
To: Joel Fernandes <joel@...lfernandes.org>,
 Daniel Bristot de Oliveira <bristot@...nel.org>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
 Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
 linux-kernel@...r.kernel.org, Luca Abeni <luca.abeni@...tannapisa.it>,
 Tommaso Cucinotta <tommaso.cucinotta@...tannapisa.it>,
 Thomas Gleixner <tglx@...utronix.de>,
 Vineeth Pillai <vineeth@...byteword.org>,
 Shuah Khan <skhan@...uxfoundation.org>, Phil Auld <pauld@...hat.com>,
 suleiman@...gle.com
Subject: Re: [PATCH v5 7/7] sched/fair: Fair server interface

On 1/19/24 02:55, Joel Fernandes wrote:
> On Sat, Nov 04, 2023 at 11:59:24AM +0100, Daniel Bristot de Oliveira wrote:
>> Add an interface for fair server setup on debugfs.
>>
>> Each rq have three files under /sys/kernel/debug/sched/rq/CPU{ID}:
>>
>>  - fair_server_runtime: set runtime in ns
>>  - fair_server_period: set period in ns
>>  - fair_server_defer: on/off for the defer mechanism
>>
>> Signed-off-by: Daniel Bristot de Oliveira <bristot@...nel.org>
> 
> Hi Daniel, Peter,
> I am writing on behalf of the ChromeOS scheduler team.
> 
> We had to revert the last 3 patches in this series because of a syzkaller
> reported bug, this happens on the sched/more branch in Peter's tree:
> 
>  WARNING: CPU: 0 PID: 2404 at kernel/sched/fair.c:5220
>  place_entity+0x240/0x290 kernel/sched/fair.c:5147
>  Call Trace:
>  <TASK>
>   enqueue_entity+0xdf/0x1130 kernel/sched/fair.c:5283
>   enqueue_task_fair+0x241/0xbd0 kernel/sched/fair.c:6717
>   enqueue_task+0x199/0x2f0 kernel/sched/core.c:2117
>   activate_task+0x60/0xc0 kernel/sched/core.c:2147
>   ttwu_do_activate+0x18d/0x6b0 kernel/sched/core.c:3794
>   ttwu_queue kernel/sched/core.c:4047 [inline]
>   try_to_wake_up+0x805/0x12f0 kernel/sched/core.c:4368
>   kick_pool+0x2e7/0x3b0 kernel/workqueue.c:1142
>   __queue_work+0xcf8/0xfe0 kernel/workqueue.c:1800
>   queue_delayed_work_on+0x15a/0x260 kernel/workqueue.c:1986
>   queue_delayed_work include/linux/workqueue.h:577 [inline]
>   srcu_funnel_gp_start kernel/rcu/srcutree.c:1068 [inline]
> 
> which is basically this warning in place_entity:
> 		if (WARN_ON_ONCE(!load))
> 			load = 1;
> 
> Full log (scroll to the bottom as there is console/lockdep side effects which
> are likely not relevant to this issue): https://paste.debian.net/1304579/
> 
> Side note, we are also looking into a KASAN nullptr deref but this happens
> only on our backport of the patches to a 5.15 kernel, as far as we know.
> 
> KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
> CPU: 0 PID: 1592 Comm: syz-executor.0 Not tainted [...]
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
>  RIP: 0010:____rb_erase_color lib/rbtree.c:354 [inline] 
>  RIP: 0010:rb_erase+0x664/0xe1e lib/rbtree.c:445
>  [...]
> Call Trace:
>  <TASK>
>   set_next_entity+0x6e/0x576 kernel/sched/fair.c:4728
>   set_next_task_fair+0x1bb/0x355 kernel/sched/fair.c:11943
>   set_next_task kernel/sched/sched.h:2241 [inline] 
>   pick_next_task kernel/sched/core.c:6014 [inline] 
>   __schedule+0x36fb/0x402d kernel/sched/core.c:6378
>   preempt_schedule_common+0x74/0xc0 kernel/sched/core.c:6590
>   preempt_schedule+0xd6/0xdd kernel/sched/core.c:6615
> 
> Full splat: https://paste.debian.net/1304573/

Interesting, does it keep any task hung? I am having a case where I see
a hung task, but I do not get the splat because the system freezes (printk
with rq_lock I guess)...

It might be the same problem.

> Investigation is on going but could you also please take a look at these? It
> is hard to reproduce and only syzbot has luck reproducing these.
> 
> Also I had a comment below:
> 
>> +int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
>> +{
>> +	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
>> +	u64 new_bw = to_ratio(period, runtime);
>> +	struct rq *rq = dl_se->rq;
>> +	int cpu = cpu_of(rq);
>> +	struct dl_bw *dl_b;
>> +	unsigned long cap;
>> +	int retval = 0;
>> +	int cpus;
>> +
>> +	dl_b = dl_bw_of(cpu);
>> +	raw_spin_lock(&dl_b->lock);
>> +	cpus = dl_bw_cpus(cpu);
>> +	cap = dl_bw_capacity(cpu);
>> +
>> +	if (__dl_overflow(dl_b, cap, old_bw, new_bw)) {
> 
> The dl_overflow() call here seems introducing an issue with our conceptual
> understanding of how the dl server is supposed to work.
> 
> Suppose we have a 4 CPU system. Also suppose RT throttling is disabled.
> Suppose the DL server params are 50ms runtime in 100ms period (basically we
> want to dedicate 50% of the bandwidth of each CPU to CFS).
> 
> In such a situation, __dl_overflow() will return an error right? Because
> total bandwidth will exceed 100% (4 times 50% is 200%).

I might be missing something in your case, but, it accepts:

root@...ora:/sys/kernel/debug/sched/fair_server# find . -type f -exec cat {} \;
1
1000000000
500000000
1
1000000000
500000000
1
1000000000
500000000
1
1000000000
500000000

your system accepts 400%... the percentage is "global".

is it failing in your system?

-- Daniel