Message-ID: <717e6294-5c62-415c-bc8b-5da1d8ac3642@arm.com>
Date: Wed, 21 Aug 2024 10:46:07 +0100
From: Hongyan Xia <hongyan.xia2@....com>
To: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
 juri.lelli@...hat.com, vincent.guittot@...aro.org, dietmar.eggemann@....com,
 rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
 vschneid@...hat.com, linux-kernel@...r.kernel.org
Cc: kprateek.nayak@....com, wuyun.abel@...edance.com,
 youssefesmat@...omium.org, tglx@...utronix.de, efault@....de
Subject: Re: [PATCH 00/24] Complete EEVDF

On 20/08/2024 17:43, Hongyan Xia wrote:
> Hi Peter,
> 
> On 27/07/2024 11:27, Peter Zijlstra wrote:
>> Hi all,
>>
>> So after much delay this is hopefully the final version of the EEVDF
>> patches. They've been sitting in my git tree for ever it seems, and
>> people have been testing it and sending fixes.
>>
>> I've spent the last two days testing and fixing cfs-bandwidth, and as far
>> as I know that was the very last issue holding it back.
>>
>> These patches apply on top of queue.git sched/dl-server, which I plan on
>> merging in tip/sched/core once -rc1 drops.
>>
>> I'm hoping to then merge all this (+- the DVFS clock patch) right before
>> -rc2.
>>
>>
>> Aside from a ton of bug fixes -- thanks all! -- new in this version is:
>>
>>   - split up the huge delay-dequeue patch
>>   - tested/fixed cfs-bandwidth
>>   - PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
>>   - SCHED_BATCH is equivalent to RESPECT_SLICE
>>   - propagate min_slice up cgroups
>>   - CLOCK_THREAD_DVFS_ID
>>
> 
> The latest tip/sched/core at commit
> 
> aef6987d89544d63a47753cf3741cabff0b5574c
> 
> crashes very early during boot on my Juno r2 board (arm64). The trace is here:
> 
> [    0.049599] ------------[ cut here ]------------
> [    0.054279] kernel BUG at kernel/sched/deadline.c:63!
> [    0.059401] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> [    0.066285] Modules linked in:
> [    0.069382] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.11.0-rc1-g55404cef33db #1070
> [    0.077855] Hardware name: ARM Juno development board (r2) (DT)
> [    0.083856] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    0.090919] pc : enqueue_dl_entity+0x53c/0x540
> [    0.095434] lr : dl_server_start+0xb8/0x10c
> [    0.099679] sp : ffffffc081ca3c30
> [    0.103034] x29: ffffffc081ca3c40 x28: 0000000000000001 x27: 0000000000000002
> [    0.110281] x26: 00000000000b71b0 x25: 0000000000000000 x24: 0000000000000001
> [    0.117525] x23: ffffff897ef21140 x22: 0000000000000000 x21: 0000000000000000
> [    0.124770] x20: ffffff897ef21040 x19: ffffff897ef219a8 x18: ffffffc080d0ad00
> [    0.132015] x17: 000000000000002f x16: 0000000000000000 x15: ffffffc081ca8000
> [    0.139260] x14: 00000000016ef200 x13: 00000000000e6667 x12: 0000000000000001
> [    0.146505] x11: 000000003b9aca00 x10: 0000000002faf080 x9 : 0000000000000030
> [    0.153749] x8 : 0000000000000071 x7 : 000000002cf93d25 x6 : 000000002cf93d25
> [    0.160994] x5 : ffffffc081e04938 x4 : ffffffc081ca3d40 x3 : 0000000000000001
> [    0.168238] x2 : 000000003b9aca00 x1 : 0000000000000001 x0 : ffffff897ef21040
> [    0.175483] Call trace:
> [    0.177958]  enqueue_dl_entity+0x53c/0x540
> [    0.182117]  dl_server_start+0xb8/0x10c
> [    0.186010]  enqueue_task_fair+0x5c8/0x6ac
> [    0.190165]  enqueue_task+0x54/0x1e8
> [    0.193793]  wake_up_new_task+0x250/0x39c
> [    0.197862]  kernel_clone+0x140/0x2f0
> [    0.201578]  user_mode_thread+0x4c/0x58
> [    0.205468]  rest_init+0x24/0xd8
> [    0.208743]  start_kernel+0x2bc/0x2fc
> [    0.212460]  __primary_switched+0x80/0x88
> [    0.216535] Code: b85fc3a8 7100051f 54fff8e9 17ffffce (d4210000)
> [    0.222711] ---[ end trace 0000000000000000 ]---
> [    0.227391] Kernel panic - not syncing: Attempted to kill the idle task!
> [    0.234187] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
> 
> I'm not an expert on the DL server, so I have no idea where the problem
> could be. If you know where to look off the top of your head, even better.
> If not, I'll do some bisection later.
> 
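
The oops itself says this is a deliberate BUG() rather than a bad memory access. On arm64, the ESR value 0xf2000800 printed after "Oops - BUG:" has exception class 0x3C (a BRK instruction executed in AArch64 state) and syndrome 0x800, and the faulting word d4210000 in the "Code:" line encodes BRK #0x800, the immediate arm64 uses for BUG(). The pc enqueue_dl_entity+0x53c/0x540 also places the trap in the last 4-byte instruction of that 0x540-byte function. A minimal decoder sketch, with field layouts from the Armv8-A encodings; the helper and macro names here are illustrative, not kernel APIs:

```c
#include <assert.h>
#include <stdint.h>

/* ESR_ELx: exception class in bits [31:26], syndrome (ISS) in bits [24:0]. */
#define ESR_EC_BRK64      0x3cu   /* BRK instruction, AArch64 state */
#define ARM64_BUG_BRK_IMM 0x800u  /* BRK immediate used by arm64 BUG() */

static uint32_t esr_ec(uint32_t esr)  { return (esr >> 26) & 0x3fu; }
static uint32_t esr_iss(uint32_t esr) { return esr & 0x1ffffffu; }

/* A64 "BRK #imm16" is encoded as 0xd4200000 | (imm16 << 5). */
static int is_brk(uint32_t insn)       { return (insn & 0xffe0001fu) == 0xd4200000u; }
static uint32_t brk_imm(uint32_t insn) { return (insn >> 5) & 0xffffu; }

/* Check the two values copied from the Juno r2 oops above. */
static void check_juno_oops(void)
{
	uint32_t esr  = 0xf2000800u;  /* "Oops - BUG: 00000000f2000800" */
	uint32_t insn = 0xd4210000u;  /* faulting word in the "Code:" line */

	assert(esr_ec(esr) == ESR_EC_BRK64);
	assert(esr_iss(esr) == ARM64_BUG_BRK_IMM);
	assert(is_brk(insn));
	assert(brk_imm(insn) == ARM64_BUG_BRK_IMM);
}
```

The same two checks apply to any arm64 oops reported as "Oops - BUG": a BRK64 class with ISS 0x800 points straight at a BUG()/BUG_ON() site.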

Okay, in case the trace I provided isn't clear enough, I traced the 
crash to a call chain like this:

dl_server_start()
	enqueue_dl_entity()
		update_stats_enqueue_dl()
			update_stats_enqueue_sleeper_dl()
				__schedstats_from_dl_se()
					dl_task_of() <---------- crash
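
Assuming line 63 of kernel/sched/deadline.c is the BUG_ON(dl_server(dl_se)) check inside dl_task_of(), the chain makes sense: dl_task_of() recovers the owning task_struct with container_of(), which is only valid for an entity embedded in a task. The entity dl_server_start() enqueues is a deadline server (here the fair server), which as far as I know lives in the runqueue rather than in any task. A simplified userspace model of that relationship; the structures are trimmed stand-ins, the pid and nr_running fields are illustrative, and where the kernel would BUG() the model returns NULL:

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed stand-ins for the kernel structures (illustrative only). */
struct sched_dl_entity {
	int dl_server;                       /* nonzero for a deadline server */
};

struct task_struct {
	int pid;
	struct sched_dl_entity dl;           /* per-task deadline entity */
};

struct rq {
	unsigned int nr_running;
	struct sched_dl_entity fair_server;  /* server entity, owned by the rq */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * container_of() is only meaningful when dl_se really is &task->dl.
 * For the rq's fair_server it would yield a bogus task pointer, which is
 * what the kernel's BUG_ON guards against; this model returns NULL instead.
 */
static struct task_struct *dl_task_of(struct sched_dl_entity *dl_se)
{
	if (dl_se->dl_server)
		return NULL;
	return container_of(dl_se, struct task_struct, dl);
}

static void dl_task_of_selfcheck(void)
{
	struct task_struct p = { .pid = 1, .dl = { .dl_server = 0 } };
	struct rq rq = { .nr_running = 1, .fair_server = { .dl_server = 1 } };

	assert(dl_task_of(&p.dl) == &p);             /* task entity: valid */
	assert(dl_task_of(&rq.fair_server) == NULL); /* server: kernel would BUG() */
}
```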

If I disable CONFIG_SCHEDSTATS, it boots fine, and I wonder if this is
why other people are not seeing this. This is probably related to the
DL server refactoring rather than to EEVDF itself.
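
The CONFIG_SCHEDSTATS observation fits too: as far as I can tell, the DL enqueue stats path returns early when schedstat_enabled() is false, which is always the case with CONFIG_SCHEDSTATS=n, so dl_task_of() is never reached. One possible direction for a fix, sketched here as an assumption and not as the actual patch, is to have the schedstats accessor return NULL for server entities and let callers skip them. In this model schedstats_on is a hypothetical stand-in for schedstat_enabled(), the early-exit check is folded into the sleeper helper for brevity, and wait_count is the only statistic kept:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Trimmed stand-ins (illustrative only). */
struct sched_statistics { unsigned long wait_count; };
struct sched_dl_entity  { int dl_server; };
struct task_struct {
	struct sched_statistics stats;
	struct sched_dl_entity dl;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static bool schedstats_on = true;  /* hypothetical stand-in for schedstat_enabled() */

/* Hypothetical guard: a server entity has no task, hence no per-task stats. */
static struct sched_statistics *schedstats_from_dl_se(struct sched_dl_entity *dl_se)
{
	if (dl_se->dl_server)
		return NULL;
	return &container_of(dl_se, struct task_struct, dl)->stats;
}

static void update_stats_enqueue_sleeper_dl(struct sched_dl_entity *dl_se)
{
	struct sched_statistics *stats;

	if (!schedstats_on)  /* stats disabled: bail out before touching dl_se */
		return;
	stats = schedstats_from_dl_se(dl_se);
	if (!stats)          /* server entity: nothing to account */
		return;
	stats->wait_count++;
}
```

With the guard in place the server entity is simply skipped, which matches the boot succeeding once the stats path is out of the picture.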
