Message-ID: <557be85d-e1c1-0835-eebd-f76e32456179@amd.com>
Date: Fri, 12 Apr 2024 16:12:47 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
bristot@...hat.com, vschneid@...hat.com, linux-kernel@...r.kernel.org
Cc: wuyun.abel@...edance.com, tglx@...utronix.de, efault@....de
Subject: Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue
Hello Peter,
On 4/5/2024 3:58 PM, Peter Zijlstra wrote:
> Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by
> noting that lag is fundamentally a temporal measure. It should not be
> carried around indefinitely.
>
> OTOH it should also not be instantly discarded, doing so will allow a
> task to game the system by purposefully (micro) sleeping at the end of
> its time quantum.
>
> Since lag is intimately tied to the virtual time base, a wall-time
> based decay is also insufficient, notably competition is required for
> any of this to make sense.
>
> Instead, delay the dequeue and keep the 'tasks' on the runqueue,
> competing until they are eligible.
>
> Strictly speaking, we only care about keeping them until the 0-lag
> point, but that is a difficult proposition, instead carry them around
> until they get picked again, and dequeue them at that point.
>
> Since we should have dequeued them at the 0-lag point, truncate lag
> (eg. don't let them earn positive lag).
>
> XXX test the cfs-throttle stuff
I ran into a few issues when testing this series on top of tip:sched/core
at commit 4475cd8bfd9b ("sched/balancing: Simplify the sg_status bitmask
and use separate ->overloaded and ->overutilized flags"). All of the
splats below surfaced when running Unixbench spawn with DELAY_DEQUEUE
enabled; disabling the feature via debugfs seems to make the system
stable again.
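Specifically, the machine survived the same workload after:

    echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features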
Unixbench (https://github.com/kdlucas/byte-unixbench.git) command:
./Run spawn -c 512
The splats appear soon after the run starts. Following are the splats
and the code they decode to, captured on my 3rd Generation EPYC system
(2 x 64C/128T):
1. NULL pointer dereference in can_migrate_task():
BUG: kernel NULL pointer dereference, address: 0000000000000040
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 154 PID: 1507736 Comm: spawn Not tainted 6.9.0-rc1-test+ #958
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
RIP: 0010:can_migrate_task+0x2b/0x6c0
Code: ...
RSP: 0018:ffffb6bb9e6a3bc0 EFLAGS: 00010086
RAX: 0000000000000000 RBX: ffffb6bb9e6a3c80 RCX: ffff90ad0d209400
RDX: 0000000000000008 RSI: ffffb6bb9e6a3c80 RDI: ffff90eb3b236438
RBP: ffff90eb3b236438 R08: 0000005c743512ab R09: ffffffffffff0000
R10: 0000000000000001 R11: 0000000000000100 R12: ffff90eb3b236438
R13: ffff90eb3b2364f0 R14: ffff90eb3b6359c0 R15: ffff90eb3b6359c0
FS: 0000000000000000(0000) GS:ffff90eb3df00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000040 CR3: 000000807da3c006 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x24/0x70
? page_fault_oops+0x14a/0x510
? exc_page_fault+0x77/0x170
? asm_exc_page_fault+0x26/0x30
? can_migrate_task+0x2b/0x6c0
sched_balance_rq+0x7a8/0x1190
sched_balance_newidle+0x1e2/0x490
pick_next_task_fair+0x36/0x4a0
__schedule+0x1c0/0x1710
? srso_alias_return_thunk+0x5/0xfbef5
? refill_stock+0x1a/0x30
? srso_alias_return_thunk+0x5/0xfbef5
? obj_cgroup_uncharge_pages+0x4d/0xd0
do_task_dead+0x42/0x50
do_exit+0x777/0xad0
do_group_exit+0x30/0x80
__x64_sys_exit_group+0x18/0x20
do_syscall_64+0x79/0x120
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit_to_user_mode+0x5b/0x170
entry_SYSCALL_64_after_hwframe+0x6c/0x74
RIP: 0033:0x7f963a6eac31
Code: Unable to access opcode bytes at 0x7f963a6eac07.
RSP: 002b:00007ffc6b7158c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f963a816a00 RCX: 00007f963a6eac31
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffffffffffff80 R09: 0000000000000020
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f963a816a00
R13: 0000000000000000 R14: 00007f963a81bee8 R15: 00007f963a81bf00
</TASK>
Modules linked in: ...
CR2: 0000000000000040
---[ end trace 0000000000000000 ]---
$ scripts/faddr2line vmlinux can_migrate_task+0x2b/0x6c0
can_migrate_task+0x2b/0x6c0:
throttled_lb_pair at kernel/sched/fair.c:5738
(inlined by) can_migrate_task at kernel/sched/fair.c:9090
Corresponds to:
static inline int throttled_lb_pair(struct task_group *tg,
int src_cpu, int dest_cpu)
{
struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
src_cfs_rq = tg->cfs_rq[src_cpu]; /* <----- Here -----< */
dest_cfs_rq = tg->cfs_rq[dest_cpu];
return throttled_hierarchy(src_cfs_rq) ||
throttled_hierarchy(dest_cfs_rq);
}
(inlined by)
int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
/* Called here */
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
...
}
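I have not root-caused this yet, but going by the faulting address,
either task_group(p) or the tg->cfs_rq array it carries looks bogus by
the time we get here. Purely as an untested, illustrative
instrumentation sketch (not a fix), something like the following could
confirm which of the two pointers is NULL:

static inline int throttled_lb_pair(struct task_group *tg,
				    int src_cpu, int dest_cpu)
{
	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;

	/* Illustrative only: catch a NULL tg or a NULL per-CPU array */
	if (WARN_ON_ONCE(!tg || !tg->cfs_rq))
		return 0;

	src_cfs_rq = tg->cfs_rq[src_cpu];
	dest_cfs_rq = tg->cfs_rq[dest_cpu];

	return throttled_hierarchy(src_cfs_rq) ||
	       throttled_hierarchy(dest_cfs_rq);
}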
2. NULL pointer dereference in pick_next_task_fair():
BUG: kernel NULL pointer dereference, address: 0000000000000098
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 107 PID: 1206665 Comm: spawn Tainted: G W 6.9.0-rc1-test+ #958
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
RIP: 0010:pick_next_task_fair+0x327/0x4a0
Code: ...
RSP: 0018:ffffb613c212fd28 EFLAGS: 00010002
RAX: 0000004ed2799383 RBX: 0000000000000000 RCX: ffff8f65baf3f800
RDX: ffff8f65baf3ca00 RSI: 0000000000000000 RDI: 000000825ae302ab
RBP: ffff8f64b13b59c0 R08: 0000000000000015 R09: 0000000000000314
R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f261ac199c0
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f768b79e740(0000) GS:ffff8f64b1380000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000098 CR3: 00000040d1010006 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x24/0x70
? page_fault_oops+0x14a/0x510
? srso_alias_return_thunk+0x5/0xfbef5
? report_bug+0x18e/0x1a0
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x77/0x170
? asm_exc_page_fault+0x26/0x30
? pick_next_task_fair+0x327/0x4a0
? pick_next_task_fair+0x320/0x4a0
__schedule+0x1c0/0x1710
? release_task+0x2fc/0x4c0
? srso_alias_return_thunk+0x5/0xfbef5
schedule+0x30/0x120
syscall_exit_to_user_mode+0x98/0x1b0
do_syscall_64+0x85/0x120
? srso_alias_return_thunk+0x5/0xfbef5
? __count_memcg_events+0x69/0x100
? srso_alias_return_thunk+0x5/0xfbef5
? count_memcg_events.constprop.0+0x1a/0x30
? srso_alias_return_thunk+0x5/0xfbef5
? handle_mm_fault+0x17d/0x2e0
? srso_alias_return_thunk+0x5/0xfbef5
? do_user_addr_fault+0x33d/0x6f0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit_to_user_mode+0x5b/0x170
entry_SYSCALL_64_after_hwframe+0x6c/0x74
RIP: 0033:0x7f768b4eab57
Code: ...
RSP: 002b:00007fff5f6e2018 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
RAX: 000000000026d13b RBX: 00007f768b7ee040 RCX: 00007f768b4eab57
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f768b79ea10 R11: 0000000000000246 R12: 0000000000000001
R13: 000055b811d1a140 R14: 000055b811d1cd88 R15: 00007f768b7ee040
</TASK>
Modules linked in: ...
CR2: 0000000000000098
---[ end trace 0000000000000000 ]---
$ scripts/faddr2line vmlinux pick_next_task_fair+0x327/0x4a0
pick_next_task_fair+0x327/0x4a0:
is_same_group at kernel/sched/fair.c:418
(inlined by) pick_next_task_fair at kernel/sched/fair.c:8625
static inline struct cfs_rq *
is_same_group(struct sched_entity *se, struct sched_entity *pse)
{
if (se->cfs_rq == pse->cfs_rq) /* <----- HERE -----< */
return se->cfs_rq;
return NULL;
}
(inlined by)
struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
...
if (prev != p) {
...
while (!(cfs_rq = is_same_group(se, pse) /* <---- HERE ----< */)) {
...
}
...
}
...
}
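Again, only as an untested, illustrative guard (not a fix), the
following would confirm whether se or pse went NULL while walking up
the hierarchy:

static inline struct cfs_rq *
is_same_group(struct sched_entity *se, struct sched_entity *pse)
{
	/* Illustrative only: confirm whether se or pse is NULL here */
	if (WARN_ON_ONCE(!se || !pse))
		return NULL;

	if (se->cfs_rq == pse->cfs_rq)
		return se->cfs_rq;

	return NULL;
}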
3. NULL pointer dereference in __dequeue_entity():
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 95 PID: 60896 Comm: spawn Not tainted 6.9.0-rc1-test+ #958
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
RIP: 0010:__rb_erase_color+0x88/0x260
Code: ...
RSP: 0018:ffffab158755fc08 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffffff841314b0 RCX: 0000000017fc8dd0
RDX: 0000000000000000 RSI: ffff8decfb1fe450 RDI: ffff8decf80bcdd0
RBP: ffff8decf80bcdd0 R08: ffff8decf80bcdd0 R09: ffffffffffffbb60
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: ffff8decfb1fe450 R14: ffff8ded0ec03400 R15: ffff8decfb1fe400
FS: 00007f1ded0a2740(0000) GS:ffff8e2bb0d80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000040e66f0005 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x24/0x70
? page_fault_oops+0x14a/0x510
? exc_page_fault+0x77/0x170
? asm_exc_page_fault+0x26/0x30
? __pfx_min_vruntime_cb_rotate+0x10/0x10
? __rb_erase_color+0x88/0x260
__dequeue_entity+0x1b7/0x310
set_next_entity+0xc0/0x1e0
pick_next_task_fair+0x355/0x4a0
__schedule+0x1c0/0x1710
? native_queued_spin_lock_slowpath+0x2a4/0x2f0
schedule+0x30/0x120
do_wait+0xad/0x100
kernel_wait4+0xa9/0x150
? __pfx_child_wait_callback+0x10/0x10
do_syscall_64+0x79/0x120
? srso_alias_return_thunk+0x5/0xfbef5
? __count_memcg_events+0x69/0x100
? srso_alias_return_thunk+0x5/0xfbef5
? count_memcg_events.constprop.0+0x1a/0x30
? srso_alias_return_thunk+0x5/0xfbef5
? handle_mm_fault+0x17d/0x2e0
? srso_alias_return_thunk+0x5/0xfbef5
? do_user_addr_fault+0x33d/0x6f0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? irqentry_exit_to_user_mode+0x5b/0x170
entry_SYSCALL_64_after_hwframe+0x6c/0x74
RIP: 0033:0x7f1deceea3ea
Code: ...
RSP: 002b:00007ffd7fd37ca8 EFLAGS: 00000246 ORIG_RAX: 000000000000003d
RAX: ffffffffffffffda RBX: 00007ffd7fd37cb4 RCX: 00007f1deceea3ea
RDX: 0000000000000000 RSI: 00007ffd7fd37cb4 RDI: 00000000ffffffff
RBP: 0000000000000002 R08: 00000000000136f5 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd7fd37dd8
R13: 000055fd8debf140 R14: 000055fd8dec1d88 R15: 00007f1ded0f2040
</TASK>
Modules linked in: ...
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
Note: I only ran into this issue with unixbench spawn. A bunch of other
benchmarks (hackbench, stream, tbench, netperf, schbench, other variants
of unixbench) ran fine without bringing down the machine.
Attaching my config below in case this is config specific.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> ---
> include/linux/sched.h | 1
> kernel/sched/core.c | 22 +++++--
> kernel/sched/fair.c | 148 +++++++++++++++++++++++++++++++++++++++++++-----
> kernel/sched/features.h | 12 +++
> kernel/sched/sched.h | 2
> 5 files changed, 167 insertions(+), 18 deletions(-)
>
> [..snip..]
>
--
Thanks and Regards,
Prateek