lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZtW0W9EQrMOV4eeo@jlelli-thinkpadt14gen4.remote.csb>
Date: Mon, 2 Sep 2024 14:49:31 +0200
From: Juri Lelli <juri.lelli@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Huang Shijie <shijie@...amperecomputing.com>, mingo@...hat.com,
	patches@...erecomputing.com, vincent.guittot@...aro.org,
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
	mgorman@...e.de, vschneid@...hat.com, vineeth@...byteword.org,
	bristot@...nel.org, linux-kernel@...r.kernel.org,
	hongyan.xia2@....com
Subject: Re: [PATCH v3] sched/deadline: fix the hang in dl_task_of

On 02/09/24 13:14, Peter Zijlstra wrote:
> On Thu, Aug 29, 2024 at 10:14:02AM +0200, Juri Lelli wrote:
> > Hi,
> > 
> > On 29/08/24 11:11, Huang Shijie wrote:
> > > When we enable the schedstats, we will meet an OS hang like this:
> > >   --------------------------------------------------------
> > > 	[  134.104253] kernel BUG at kernel/sched/deadline.c:63!
> > > 	[  134.132013] ------------[ cut here ]------------
> > > 	[  134.133441]  x27: 0000000000000001
> > > 	[  134.138048] kernel BUG at kernel/sched/deadline.c:63!
> > > 	[  134.146478] x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000001
> > > 	[  134.153607] x23: 0000000000000001 x22: 0000000000000000 x21: 0000000000000001
> > > 	[  134.160734] x20: ffff007dbf1b6d00 x19: ffff007dbf1b7610 x18: 0000000000000014
> > > 	[  134.162027] ------------[ cut here ]------------
> > > 	[  134.167861] x17: 000000009deab6cd x16: 00000000527c9a1c x15: 00000000000000dc
> > > 	[  134.172473] kernel BUG at kernel/sched/deadline.c:63!
> > > 	[  134.179595] x14: 0000000001200011 x13: 0000000040001000 x12: 0000ffffb6df05bc
> > > 	[  134.191760] x11: ffff007dbf1b6d00 x10: ffff0001062dd2e8 x9 : ffff8000801215ac
> > > 	[  134.192036] ------------[ cut here ]------------
> > > 	[  134.198888] x8 : 0000000000000000 x7 : 0000000000000021 x6 : ffff0001764ed280
> > > 	[  134.203498] kernel BUG at kernel/sched/deadline.c:63!
> > > 	[  134.210622] x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffff807d3dd24000
> > > 	[  134.222787] x2 : 000000028b77a140 x1 : 0000003400000000 x0 : ffff007dbf1b6c80
> > > 	[  134.229915] Call trace:
> > > 	[  134.232353]  dl_task_of.part.0+0x0/0x10
> > > 	[  134.236182]  dl_server_start+0x54/0x158
> > > 	[  134.240013]  enqueue_task_fair+0x138/0x420
> > > 	[  134.244100]  enqueue_task+0x44/0xb0
> > > 	[  134.247584]  wake_up_new_task+0x1c0/0x3a0
> > > 	[  134.251584]  kernel_clone+0xe8/0x3e8
> > > 	[  134.252022] ------------[ cut here ]------------
> > > 	[  134.255156]  __do_sys_clone+0x70/0xa8
> > > 	[  134.259764] kernel BUG at kernel/sched/deadline.c:63!
> > > 	[  134.263412]  __arm64_sys_clone+0x28/0x40
> > > 	[  134.272360]  invoke_syscall+0x50/0x120
> > > 	[  134.276101]  el0_svc_common+0x44/0xf8
> > > 	[  134.279753]  do_el0_svc+0x28/0x40
> > > 	[  134.283058]  el0_svc+0x40/0x150
> > > 	[  134.286195]  el0t_64_sync_handler+0x100/0x130
> > > 	[  134.290546]  el0t_64_sync+0x1a4/0x1a8
> > > 	[  134.294200] Code: 35ffffa2 17ffffe3 d4210000 17ffffb4 (d4210000)
> > > 	[  134.300283] ---[ end trace 0000000000000000 ]---
> > > 	[  134.304890] Kernel panic - not syncing: Oops - BUG: Fatal exception
> > > 	[  134.311147] SMP: stopping secondary CPUs
> > > 	[  135.365096] SMP: failed to stop secondary CPUs 8-9,16,30,43,86,88,121,149
> > > 	[  135.371884] Kernel Offset: disabled
> > > 	[  135.375361] CPU features: 0x00,00100003,80153d29,d75ffea7
> > > 	[  135.380749] Memory Limit: none
> > > 	[  135.383793] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]
> > >   --------------------------------------------------------
> > > 
> > > In dl_server_start(), we set the dl_se->dl_server with 1. When schedstats
> > > is enabled, in the following:
> > >    dl_server_start() --> enqueue_dl_entity() --> update_stats_enqueue_dl()
> > > 	__schedstats_from_dl_se() -->dl_task_of()
> > > 
> > > we will meet the BUG_ON.
> > > 
> > > Since the fair task has already had its own schedstats, there is no need
> > > to track anything for the associated dl_server.
> > > 
> > > So add check in:
> > >             update_stats_wait_start_dl()
> > > 	    update_stats_wait_end_dl()
> > > 	    update_stats_enqueue_dl()
> > > 	    update_stats_dequeue_dl()
> > > 
> > > return early for a dl_server dl_se.
> > > 
> > > Tested this patch with memcached in Altra.
> > > 
> > > Fixes: 5f6bd380c7bd ("sched/rt: Remove default bandwidth control")
> > > Signed-off-by: Huang Shijie <shijie@...amperecomputing.com>
> > > ---
> > > v2 --> v3:
> > >    Return early in:
> > >             update_stats_wait_start_dl()
> > > 	    update_stats_wait_end_dl()
> > > 	    update_stats_enqueue_dl()
> > > 	    update_stats_dequeue_dl()
> > 
> > This looks better, thanks.
> > 
> > Peter, what do you think?
> 
> Peter thinks that the Changelog is overly verbose, the Fixes tag is
> wrong and he doesn't much like the repeated conditions. But he is very
> glad this issue is found and fixed.

:)

> As such, he's rewritten things like so, does this work for people?

Works for me!

> ---
> Subject: sched/deadline: Fix schedstats vs deadline servers
> From: Huang Shijie <shijie@...amperecomputing.com>
> Date: Thu, 29 Aug 2024 11:11:11 +0800
> 
> From: Huang Shijie <shijie@...amperecomputing.com>
> 
> In dl_server_start(), when schedstats is enabled, the following
> happens:
> 
>   dl_server_start()
>     dl_se->dl_server = 1;
>     enqueue_dl_entity()
>       update_stats_enqueue_dl()
>         __schedstats_from_dl_se()
>           dl_task_of()
>             BUG_ON(dl_server(dl_se));
> 
> Since only tasks have schedstats and internal entries do not, avoid
> trying to update stats in this case.
> 
> Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
> Signed-off-by: Huang Shijie <shijie@...amperecomputing.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> Link: https://lkml.kernel.org/r/20240829031111.12142-1-shijie@os.amperecomputing.com

Acked-by: Juri Lelli <juri.lelli@...hat.com>

Thanks,
Juri


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ