Message-ID: <557be85d-e1c1-0835-eebd-f76e32456179@amd.com>
Date: Fri, 12 Apr 2024 16:12:47 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
bristot@...hat.com, vschneid@...hat.com, linux-kernel@...r.kernel.org
Cc: wuyun.abel@...edance.com, tglx@...utronix.de, efault@....de
Subject: Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue
Hello Peter,
On 4/5/2024 3:58 PM, Peter Zijlstra wrote:
> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> noting that lag is fundamentally a temporal measure. It should not be
> carried around indefinitely.
>
> OTOH it should also not be instantly discarded, doing so will allow a
> task to game the system by purposefully (micro) sleeping at the end of
> its time quantum.
>
> Since lag is intimately tied to the virtual time base, a wall-time
> based decay is also insufficient, notably competition is required for
> any of this to make sense.
>
> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> competing until they are eligible.
>
> Strictly speaking, we only care about keeping them until the 0-lag
> point, but that is a difficult proposition, instead carry them around
> until they get picked again, and dequeue them at that point.
>
> Since we should have dequeued them at the 0-lag point, truncate lag
> (eg. don't let them earn positive lag).
>
> XXX test the cfs-throttle stuff
I ran into a few issues when testing this series on top of tip:sched/core
at commit 4475cd8bfd9b ("sched/balancing: Simplify the sg_status bitmask
and use separate ->overloaded and ->overutilized flags"). All of the
splats below surfaced when running Unixbench spawn with DELAY_DEQUEUE
enabled; disabling the feature via debugfs seems to make the system
stable again.
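Specifically, the machine survived the same workload after:

    echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features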
Unixbench (https://github.com/kdlucas/byte-unixbench.git) command:
./Run spawn -c 512
The splats appear soon after the run starts. Following are the splats
and the code they decode to, captured on my 3rd Generation EPYC system
(2 x 64C/128T):
1. NULL pointer dereference in can_migrate_task():
BUG: kernel NULL pointer dereference, address: 0000000000000040
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 154 PID: 1507736 Comm: spawn Not tainted 6.9.0-rc1-test+ #958
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
RIP: 0010:can_migrate_task+0x2b/0x6c0
Code: ...
RSP: 0018:ffffb6bb9e6a3bc0 EFLAGS: 00010086
RAX: 0000000000000000 RBX: ffffb6bb9e6a3c80 RCX: ffff90ad0d209400
RDX: 0000000000000008 RSI: ffffb6bb9e6a3c80 RDI: ffff90eb3b236438
RBP: ffff90eb3b236438 R08: 0000005c743512ab R09: ffffffffffff0000
R10: 0000000000000001 R11: 0000000000000100 R12: ffff90eb3b236438
R13: ffff90eb3b2364f0 R14: ffff90eb3b6359c0 R15: ffff90eb3b6359c0
FS: 0000000000000000(0000) GS:ffff90eb3df00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000040 CR3: 000000807da3c006 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x24/0x70
? page_fault_oops+0x14a/0x510
? exc_page_fault+0x77/0x170
? asm_exc_page_fault+0x26/0x30
? can_migrate_task+0x2b/0x6c0
sched_balance_rq+0x7a8/0x1190
sched_balance_newidle+0x1e2/0x490
pick_next_task_fair+0x36/0x4a0
__schedule+0x1c0/0x1710
? srso_alias_return_thunk+0x5/0xfbef5
? refill_stock+0x1a/0x30
? srso_alias_return_thunk+0x5/0xfbef5
? obj_cgroup_uncharge_pages+0x4d/0xd0
do_task_dead+0x42/0x50
do_exit+0x777/0xad0
do_group_exit+0x30/0x80
__x64_sys_exit_group+0x18/0x20
do_syscall_64+0x79/0x120
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit_to_user_mode+0x5b/0x170
entry_SYSCALL_64_after_hwframe+0x6c/0x74
RIP: 0033:0x7f963a6eac31
Code: Unable to access opcode bytes at 0x7f963a6eac07.
RSP: 002b:00007ffc6b7158c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f963a816a00 RCX: 00007f963a6eac31
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffffffffffff80 R09: 0000000000000020
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f963a816a00
R13: 0000000000000000 R14: 00007f963a81bee8 R15: 00007f963a81bf00
</TASK>
Modules linked in: ...
CR2: 0000000000000040
---[ end trace 0000000000000000 ]---
$ scripts/faddr2line vmlinux can_migrate_task+0x2b/0x6c0
can_migrate_task+0x2b/0x6c0:
throttled_lb_pair at kernel/sched/fair.c:5738
(inlined by) can_migrate_task at kernel/sched/fair.c:9090
Corresponds to:
static inline int throttled_lb_pair(struct task_group *tg,
int src_cpu, int dest_cpu)
{
struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
src_cfs_rq = tg->cfs_rq[src_cpu]; /* <----- Here -----< */
dest_cfs_rq = tg->cfs_rq[dest_cpu];
return throttled_hierarchy(src_cfs_rq) ||
throttled_hierarchy(dest_cfs_rq);
}
(inlined by)
int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
/* Called here */
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
...
}
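I have not root-caused this yet, but going by the faulting address,
either task_group(p) or the tg->cfs_rq array it carries looks bogus by
the time we get here. Purely as an untested, illustrative
instrumentation sketch (not a fix), something like the following could
confirm which of the two pointers is NULL:

static inline int throttled_lb_pair(struct task_group *tg,
				    int src_cpu, int dest_cpu)
{
	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;

	/* Illustrative only: catch a NULL tg or a NULL per-CPU array */
	if (WARN_ON_ONCE(!tg || !tg->cfs_rq))
		return 0;

	src_cfs_rq = tg->cfs_rq[src_cpu];
	dest_cfs_rq = tg->cfs_rq[dest_cpu];

	return throttled_hierarchy(src_cfs_rq) ||
	       throttled_hierarchy(dest_cfs_rq);
}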
2. NULL pointer dereference in pick_next_task_fair():
BUG: kernel NULL pointer dereference, address: 0000000000000098
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 107 PID: 1206665 Comm: spawn Tainted: G W 6.9.0-rc1-test+ #958
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
RIP: 0010:pick_next_task_fair+0x327/0x4a0
Code: ...
RSP: 0018:ffffb613c212fd28 EFLAGS: 00010002
RAX: 0000004ed2799383 RBX: 0000000000000000 RCX: ffff8f65baf3f800
RDX: ffff8f65baf3ca00 RSI: 0000000000000000 RDI: 000000825ae302ab
RBP: ffff8f64b13b59c0 R08: 0000000000000015 R09: 0000000000000314
R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f261ac199c0
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f768b79e740(0000) GS:ffff8f64b1380000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000098 CR3: 00000040d1010006 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x24/0x70
? page_fault_oops+0x14a/0x510
? srso_alias_return_thunk+0x5/0xfbef5
? report_bug+0x18e/0x1a0
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x77/0x170
? asm_exc_page_fault+0x26/0x30
? pick_next_task_fair+0x327/0x4a0
? pick_next_task_fair+0x320/0x4a0
__schedule+0x1c0/0x1710
? release_task+0x2fc/0x4c0
? srso_alias_return_thunk+0x5/0xfbef5
schedule+0x30/0x120
syscall_exit_to_user_mode+0x98/0x1b0
do_syscall_64+0x85/0x120
? srso_alias_return_thunk+0x5/0xfbef5
? __count_memcg_events+0x69/0x100
? srso_alias_return_thunk+0x5/0xfbef5
? count_memcg_events.constprop.0+0x1a/0x30
? srso_alias_return_thunk+0x5/0xfbef5
? handle_mm_fault+0x17d/0x2e0
? srso_alias_return_thunk+0x5/0xfbef5
? do_user_addr_fault+0x33d/0x6f0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit_to_user_mode+0x5b/0x170
entry_SYSCALL_64_after_hwframe+0x6c/0x74
RIP: 0033:0x7f768b4eab57
Code: ...
RSP: 002b:00007fff5f6e2018 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
RAX: 000000000026d13b RBX: 00007f768b7ee040 RCX: 00007f768b4eab57
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f768b79ea10 R11: 0000000000000246 R12: 0000000000000001
R13: 000055b811d1a140 R14: 000055b811d1cd88 R15: 00007f768b7ee040
</TASK>
Modules linked in: ...
CR2: 0000000000000098
---[ end trace 0000000000000000 ]---
$ scripts/faddr2line vmlinux pick_next_task_fair+0x327/0x4a0
pick_next_task_fair+0x327/0x4a0:
is_same_group at kernel/sched/fair.c:418
(inlined by) pick_next_task_fair at kernel/sched/fair.c:8625
static inline struct cfs_rq *
is_same_group(struct sched_entity *se, struct sched_entity *pse)
{
if (se->cfs_rq == pse->cfs_rq) /* <----- HERE -----< */
return se->cfs_rq;
return NULL;
}
(inlined by)
struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
...
if (prev != p) {
...
while (!(cfs_rq = is_same_group(se, pse) /* <---- HERE ----< */)) {
...
}
...
}
...
}
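Again, only as an untested, illustrative guard (not a fix), the
following would confirm whether se or pse went NULL while walking up
the hierarchy:

static inline struct cfs_rq *
is_same_group(struct sched_entity *se, struct sched_entity *pse)
{
	/* Illustrative only: confirm whether se or pse is NULL here */
	if (WARN_ON_ONCE(!se || !pse))
		return NULL;

	if (se->cfs_rq == pse->cfs_rq)
		return se->cfs_rq;

	return NULL;
}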
3. NULL pointer dereference in __dequeue_entity():
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 95 PID: 60896 Comm: spawn Not tainted 6.9.0-rc1-test+ #958
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
RIP: 0010:__rb_erase_color+0x88/0x260
Code: ...
RSP: 0018:ffffab158755fc08 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffffff841314b0 RCX: 0000000017fc8dd0
RDX: 0000000000000000 RSI: ffff8decfb1fe450 RDI: ffff8decf80bcdd0
RBP: ffff8decf80bcdd0 R08: ffff8decf80bcdd0 R09: ffffffffffffbb60
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: ffff8decfb1fe450 R14: ffff8ded0ec03400 R15: ffff8decfb1fe400
FS: 00007f1ded0a2740(0000) GS:ffff8e2bb0d80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000040e66f0005 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x24/0x70
? page_fault_oops+0x14a/0x510
? exc_page_fault+0x77/0x170
? asm_exc_page_fault+0x26/0x30
? __pfx_min_vruntime_cb_rotate+0x10/0x10
? __rb_erase_color+0x88/0x260
__dequeue_entity+0x1b7/0x310
set_next_entity+0xc0/0x1e0
pick_next_task_fair+0x355/0x4a0
__schedule+0x1c0/0x1710
? native_queued_spin_lock_slowpath+0x2a4/0x2f0
schedule+0x30/0x120
do_wait+0xad/0x100
kernel_wait4+0xa9/0x150
? __pfx_child_wait_callback+0x10/0x10
do_syscall_64+0x79/0x120
? srso_alias_return_thunk+0x5/0xfbef5
? __count_memcg_events+0x69/0x100
? srso_alias_return_thunk+0x5/0xfbef5
? count_memcg_events.constprop.0+0x1a/0x30
? srso_alias_return_thunk+0x5/0xfbef5
? handle_mm_fault+0x17d/0x2e0
? srso_alias_return_thunk+0x5/0xfbef5
? do_user_addr_fault+0x33d/0x6f0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit_to_user_mode+0x5b/0x170
entry_SYSCALL_64_after_hwframe+0x6c/0x74
RIP: 0033:0x7f1deceea3ea
Code: ...
RSP: 002b:00007ffd7fd37ca8 EFLAGS: 00000246 ORIG_RAX: 000000000000003d
RAX: ffffffffffffffda RBX: 00007ffd7fd37cb4 RCX: 00007f1deceea3ea
RDX: 0000000000000000 RSI: 00007ffd7fd37cb4 RDI: 00000000ffffffff
RBP: 0000000000000002 R08: 00000000000136f5 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd7fd37dd8
R13: 000055fd8debf140 R14: 000055fd8dec1d88 R15: 00007f1ded0f2040
</TASK>
Modules linked in: ...
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
Note: I only ran into this issue with unixbench spawn. A bunch of other
benchmarks (hackbench, stream, tbench, netperf, schbench, other variants
of unixbench) ran fine without bringing down the machine.
Attaching my config below in case this is config specific.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> ---
> include/linux/sched.h | 1
> kernel/sched/core.c | 22 +++++--
> kernel/sched/fair.c | 148 +++++++++++++++++++++++++++++++++++++++++++-----
> kernel/sched/features.h | 12 +++
> kernel/sched/sched.h | 2
> 5 files changed, 167 insertions(+), 18 deletions(-)
>
> [..snip..]
>
--
Thanks and Regards,
Prateek