linux-kernel - Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f83d9e8d-af07-4ce4-81aa-e29fae5762b3@arm.com>
Date: Thu, 12 Sep 2024 22:44:48 +0200
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Luis Machado <luis.machado@....com>, Mike Galbraith <efault@....de>,
 Peter Zijlstra <peterz@...radead.org>
Cc: Vincent Guittot <vincent.guittot@...aro.org>,
 Hongyan Xia <hongyan.xia2@....com>, mingo@...hat.com, juri.lelli@...hat.com,
 rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
 vschneid@...hat.com, linux-kernel@...r.kernel.org, kprateek.nayak@....com,
 wuyun.abel@...edance.com, youssefesmat@...omium.org, tglx@...utronix.de
Subject: Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue

On 12/09/2024 14:58, Luis Machado wrote:
> On 9/11/24 10:10, Mike Galbraith wrote:
>> On Wed, 2024-09-11 at 10:45 +0200, Peter Zijlstra wrote:
>>> On Wed, Sep 11, 2024 at 09:35:16AM +0100, Luis Machado wrote:

[...]

> Ok, I seem to have narrowed this down to scheduler class switching. In particular
> switched_from_fair.
> 
> Valentin's patch (75b6499024a6c1a4ef0288f280534a5c54269076
> sched/fair: Properly deactivate sched_delayed task upon class change) introduced
> finish_delayed_dequeue_entity, which takes care of cleaning up the state of delayed-dequeue
> tasks during class change. Things work fine (minus delayed task accounting) up to this point.
> 
> When Peter introduced his patch to do h_nr_delayed accounting, we modified
> finish_delayed_dequeue_entity to also call clear_delayed, instead of simply
> zeroing se->sched_delayed.
> 
> The call to clear_delayed decrements the rq's h_nr_delayed, and it gets used elsewhere
> to cleanup the state of delayed-dequeue tasks, in order to share some common code.
> 
> With that said, my testing on Android shows that when we hit switched_from_fair during
> switching sched classes (due to some RT task), we're in a state where...
> 
> 1 - We already called into dequeue_entities for this delayed task.
> 2 - We tested true for the !task_sleep && !task_delayed condition.
> 3 - se->sched_delayed is true, so h_nr_delayed == 1.
> 4 - We carry on executing the rest of dequeue_entities and decrement the rq's h_nr_running by 1.
> 
> In switched_from_fair, after the above events, we call into finish_delayed_dequeue_entity -> clear_delayed
> and do yet another decrement to the rq's h_nr_delayed, now potentially making it negative. As
> a consequence, we probably misuse the negative value and adjust the frequencies incorrectly. I
> think this is the issue I'm seeing.
> 
> It is worth pointing out that even with the Android setup, things only go bad when there is enough
> competition and switching of classes (lots of screen events etc).
> 
> My suggestion of a fix (below), still under testing, is to inline the delayed-dequeue and the lag zeroing
> cleanup within switched_from_fair instead of calling finish_delayed_dequeue_entity. Or maybe
> drop finish_delayed_dequeue_entity and inline its contents into its callers.
> 
> The rest of Peter's patch introducing h_nr_delayed seems OK as far as I could test.
> 
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f993ac282a83..f8df2f8d2e11 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13168,7 +13168,9 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
>                  * related to sched_delayed being true and that wasn't done
>                  * due to the generic dequeue not using DEQUEUE_DELAYED.
>                  */
> -               finish_delayed_dequeue_entity(&p->se);
> +               p->se.sched_delayed = 0;
> +               if (sched_feat(DELAY_ZERO) && p->se.vlag > 0)
> +                       p->se.vlag = 0;
>                 p->se.rel_deadline = 0;
>                 __block_task(rq, p);
>         }

I could recreate this on QEMU with:

@@ -5473,6 +5473,7 @@ static void clear_delayed(struct sched_entity *se)
        for_each_sched_entity(se) {
                struct cfs_rq *cfs_rq = cfs_rq_of(se);
                cfs_rq->h_nr_delayed--;
+               BUG_ON((int)cfs_rq->h_nr_delayed < 0);
                if (cfs_rq_throttled(cfs_rq))
                        break;
        }
running:

  # while(true); do chrt -rr -p 50 $$; chrt -o -p 0 $$; done

in one shell and:

  # hackbench

in another.

[  318.490522] ------------[ cut here ]------------
[  318.490969] kernel BUG at kernel/sched/fair.c:5476!
[  318.491411] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[  318.491964] CPU: 3 UID: 0 PID: 68053 Comm: chrt Not tainted 6.11.0-rc1-00066-g2e05f6c71d36-dirty #23
[  318.492604] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.4
[  318.494192] RIP: 0010:switched_from_fair+0x67/0xe0
[  318.494899] Code: ff ff c6 85 d1 00 00 00 00 48 85 db 75 0e eb 1c 48 8b 9b 98 00 00 00 48 85 db 74 10 48 8b 0
[  318.496491] RSP: 0018:ffffb1154bc63e20 EFLAGS: 00010097
[  318.496991] RAX: 0000000000000001 RBX: ffff92668844e200 RCX: ffff9266fddadea8
[  318.497681] RDX: ffff926684213e00 RSI: ffff9266fddadea8 RDI: ffff92668844e608
[  318.498339] RBP: ffff92668844e180 R08: ffff92668106c44c R09: ffff92668119c0b8
[  318.498940] R10: ffff926681f42000 R11: ffffffffb1e155d0 R12: ffff9266fddad640
[  318.499525] R13: ffffb1154bc63ed8 R14: 0000000000000078 R15: ffff9266fddad640
[  318.500234] FS:  00007f68f52bf740(0000) GS:ffff9266fdd80000(0000) knlGS:0000000000000000
[  318.500837] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  318.501261] CR2: 00007f68f53afdd0 CR3: 0000000006a58002 CR4: 0000000000370ef0
[  318.501798] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  318.502385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  318.502919] Call Trace:
[  318.503118]  <TASK>
[  318.503284]  ? die+0x32/0x90
[  318.503508]  ? do_trap+0xd8/0x100
[  318.503770]  ? switched_from_fair+0x67/0xe0
[  318.504085]  ? do_error_trap+0x60/0x80
[  318.504374]  ? switched_from_fair+0x67/0xe0
[  318.504652]  ? exc_invalid_op+0x53/0x70
[  318.504995]  ? switched_from_fair+0x67/0xe0
[  318.505270]  ? asm_exc_invalid_op+0x1a/0x20
[  318.505588]  ? switched_from_fair+0x67/0xe0
[  318.505929]  check_class_changed+0x2a/0x80
[  318.506236]  __sched_setscheduler+0x1f3/0x920
[  318.506526]  do_sched_setscheduler+0xfd/0x1c0
[  318.506867]  ? do_sys_openat2+0x7c/0xc0
[  318.507141]  __x64_sys_sched_setscheduler+0x1a/0x30
[  318.507462]  do_syscall_64+0x9e/0x1a0
[  318.507722]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[  318.508127] RIP: 0033:0x7f68f53c3719
[  318.508369] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 8
[  318.509941] RSP: 002b:00007fff8d920578 EFLAGS: 00000246 ORIG_RAX: 0000000000000090
[  318.511178] RAX: ffffffffffffffda RBX: 00007fff8d920610 RCX: 00007f68f53c3719
[  318.512105] RDX: 00007fff8d92058c RSI: 0000000000000002 RDI: 0000000000000103
[  318.512676] RBP: 0000000000000103 R08: 0000000000000000 R09: 0000000000000000
[  318.513296] R10: 1999999999999999 R11: 0000000000000246 R12: 0000000000000002
[  318.513917] R13: 0000000000000002 R14: 0000000000000032 R15: 0000000000000103
[  318.514617]  </TASK>
[  318.514861] Modules linked in:
[  318.515132] ---[ end trace 0000000000000000 ]---
[  318.515466] RIP: 0010:switched_from_fair+0x67/0xe0
[  318.515942] Code: ff ff c6 85 d1 00 00 00 00 48 85 db 75 0e eb 1c 48 8b 9b 98 00 00 00 48 85 db 74 10 48 8b 0
[  318.517411] RSP: 0018:ffffb1154bc63e20 EFLAGS: 00010097
[  318.517819] RAX: 0000000000000001 RBX: ffff92668844e200 RCX: ffff9266fddadea8
[  318.518354] RDX: ffff926684213e00 RSI: ffff9266fddadea8 RDI: ffff92668844e608
[  318.518946] RBP: ffff92668844e180 R08: ffff92668106c44c R09: ffff92668119c0b8
[  318.519527] R10: ffff926681f42000 R11: ffffffffb1e155d0 R12: ffff9266fddad640
[  318.520138] R13: ffffb1154bc63ed8 R14: 0000000000000078 R15: ffff9266fddad640
[  318.520684] FS:  00007f68f52bf740(0000) GS:ffff9266fdd80000(0000) knlGS:0000000000000000
[  318.521350] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  318.521745] CR2: 00007f68f53afdd0 CR3: 0000000006a58002 CR4: 0000000000370ef0
[  318.522306] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  318.522896] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  318.523388] note: chrt[68053] exited with irqs disabled

With your proposed fix the issue goes away.

Tested-by: Dietmar Eggemann <dietmar.eggemann@....com>