Message-ID: <ZvVWq3WM6zVza_mD@linux.ibm.com>
Date: Thu, 26 Sep 2024 18:12:19 +0530
From: Vishal Chourasia <vishalc@...ux.ibm.com>
To: linux-kernel@...r.kernel.org
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
        Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>
Subject: sched/fair: Kernel panics in pick_next_entity

I've noticed a kernel panic consistently occurring on the mainline v6.11
kernel (see the dmesg log below).

The panic occurs almost every time I build the Linux kernel from source.

Steps to Reproduce:

make clean
./scripts/config -e LOCALVERSION_AUTO
./scripts/config --set-str LOCALVERSION -master-with-print
make localmodconfig
make -j8 -s vmlinux modules

From my investigation, it seems that the function pick_eevdf() can return NULL.
Commit f12e1488 ("sched/fair: Prepare pick_next_task() for delayed dequeue")
introduces an access on the return value of pick_eevdf(). If 'se' is NULL,
this access is a NULL pointer dereference.
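
To make the failure pattern concrete, here is a minimal user-space sketch of
a caller that checks the picker's return value before reading a field from
it. The names struct entity, pick_model() and pick_next_guarded() are
hypothetical stand-ins, not kernel code, and the guard is only an
illustration of the missing check; it would hide the symptom rather than
explain why no entity was returned.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the kernel's struct sched_entity. */
struct entity {
	bool sched_delayed;
};

/* Stand-in for pick_eevdf(): in this model it may return NULL. */
static struct entity *pick_model(struct entity *candidate)
{
	return candidate;
}

/*
 * Guarded variant of the pattern seen in the disassembly, where
 * se->sched_delayed is read without checking 'se' first.
 */
static struct entity *pick_next_guarded(struct entity *candidate)
{
	struct entity *se = pick_model(candidate);

	if (!se)
		return NULL;	/* nothing picked; the caller must cope */
	if (se->sched_delayed)
		return NULL;	/* model only: a delayed entity is skipped */
	return se;
}
```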

# objdump -S vmlinux | grep -C 5 c0000000001cfebc
c0000000001cfeb0:       00 00 00 60     nop
        struct sched_entity *se = pick_eevdf(cfs_rq);
c0000000001cfeb4:       78 f3 c3 7f     mr      r3,r30
c0000000001cfeb8:       01 46 ff 4b     bl      c0000000001c44b8 <pick_eevdf+0x8>
        if (se->sched_delayed) {
c0000000001cfebc:       51 00 23 89     lbz     r9,81(r3)  <<<<<<
        struct sched_entity *se = pick_eevdf(cfs_rq);
c0000000001cfec0:       78 1b 7f 7c     mr      r31,r3
        if (se->sched_delayed) {
c0000000001cfec4:       00 00 09 2c     cmpwi   r9,0
c0000000001cfec8:       98 00 82 40     bne     c0000000001cff60 <pick_next_entity+0xe0>

r3 is NULL, which can be verified from the register context in the dmesg
log below: GPR03 is zero and DAR is 0x51, i.e. offset 81 from a NULL base.
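
The arithmetic behind the faulting address can be checked with a small
user-space sketch. The struct entity_layout below is hypothetical; it only
assumes that the byte read by `lbz r9,81(r3)` sits at offset 81, as the
disassembly shows, and says nothing about the real layout of struct
sched_entity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: only places the flag at byte offset 81. */
struct entity_layout {
	unsigned char pad[81];
	unsigned char sched_delayed;
};

/* Effective address of `lbz r9,81(r3)` for a given value of r3. */
static uintptr_t faulting_address(uintptr_t r3)
{
	return r3 + offsetof(struct entity_layout, sched_delayed);
}
```

With r3 == 0, faulting_address() yields 0x51, which matches the address
reported in the "NULL pointer dereference on read at 0x00000051" line.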

Here is the state of my git repository:
# git log --oneline
684a64bf32b6 (HEAD -> master, origin/master, origin/HEAD) Merge tag 'nfs-for-6.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
f7fccaa77271 Merge tag 'fuse-update-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
4165cee7ecb1 Merge tag 'exfat-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
79952bdcbcea Merge tag 'f2fs-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
fa8380a06bd0 Merge tag 'bpf-next-6.12-struct-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
68e5c7d4cefb Merge tag 'kbuild-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
7f8de2bf0725 Merge tag 'linux-cpupower-6.12-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux
cd3d64772981 Merge tag 'i3c/for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux


To determine why pick_eevdf() would return NULL, I added a few printk statements.
Based on one of the printk lines in the dmesg log below, it appears that if
pick_eevdf() is called for a 'cfs_rq' whose 'cfs_rq->curr' is NULL and there
are no eligible entities on that 'cfs_rq', it will return NULL.
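
The condition described above can be modelled in user space. This is a
simplified sketch, not the kernel's pick_eevdf(): it scans a flat array
instead of the augmented rbtree, and the eligibility rule (vruntime not
greater than a tracked average passed in as 'avg') is an assumption made
for illustration. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct entity {
	int64_t vruntime;
	int64_t deadline;
};

/* Model assumption: eligible when vruntime does not exceed 'avg'. */
static int eligible(const struct entity *se, int64_t avg)
{
	return se->vruntime <= avg;
}

/*
 * Return the eligible entity with the earliest deadline, considering
 * both the queue and 'curr'. Returns NULL when 'curr' is NULL (or
 * ineligible) and no queued entity is eligible.
 */
static const struct entity *pick_model(const struct entity *queue, size_t n,
				       const struct entity *curr, int64_t avg)
{
	const struct entity *best = NULL;
	size_t i;

	if (curr && !eligible(curr, avg))
		curr = NULL;

	for (i = 0; i < n; i++) {
		if (!eligible(&queue[i], avg))
			continue;
		if (!best || queue[i].deadline < best->deadline)
			best = &queue[i];
	}

	if (!best || (curr && curr->deadline < best->deadline))
		best = curr;

	return best;
}
```

In this model, calling pick_model() with a NULL 'curr' and an 'avg' below
every queued vruntime returns NULL, which mirrors the best=NULL, curr=NULL
line printed by the debug patch.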

# git diff
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 225b31aaee55..8c5b96f1cd49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -48,6 +48,7 @@
 #include <linux/ratelimit.h>
 #include <linux/task_work.h>
 #include <linux/rbtree_augmented.h>
+#include <linux/delay.h>

 #include <asm/switch_to.h>

@@ -907,16 +908,25 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
 static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 {
        struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+       struct rb_node *tmpnode = node;
        struct sched_entity *se = __pick_first_entity(cfs_rq);
+       struct sched_entity *tmpse = se;
        struct sched_entity *curr = cfs_rq->curr;
+       struct sched_entity *tmpcurr = curr;
        struct sched_entity *best = NULL;
-
+       struct sched_entity *tmp = NULL;
        /*
         * We can safely skip eligibility check if there is only one entity
         * in this cfs_rq, saving some cycles.
         */
-       if (cfs_rq->nr_running == 1)
-               return curr && curr->on_rq ? curr : se;
+       if (cfs_rq->nr_running == 1) {
+               tmp = curr && curr->on_rq ? curr : se;
+               if (!tmp) {
+                       printk(KERN_INFO "pick_eevdf curr: %p, se: %p\n", curr, se);
+                       mdelay(10);
+               }
+               return tmp;
+       }

        if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
                curr = NULL;
@@ -966,6 +976,11 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
        if (!best || (curr && entity_before(curr, best)))
                best = curr;

+       if (!best) {
+               printk(KERN_INFO "best=%p, curr=%p, se=%p, node=%p ocrr=%p, ose=%p, onode=%p\n", best, curr, se, node, tmpcurr, tmpse, tmpnode);
+               mdelay(10);
+       }
+
        return best;
 }

From the logs below:
[ 1355.763494] best=0000000000000000, curr=0000000000000000, se=00000000be02c573, node=0000000000000000 ocrr=0000000000000000, ose=00000000b1d4c4d5, onode=0000000023eb8c00

I have not been able to come up with a quick reproducer that triggers the
panic for this case. I would appreciate any guidance on this.

Note: The following dmesg log also contains a warning, reported before the
panic. The panic happens later.

------------[ cut here ]------------
!se->on_rq
WARNING: CPU: 1 PID: 92333 at kernel/sched/fair.c:705 update_entity_lag+0xcc/0xf0
Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 1 UID: 0 PID: 92333 Comm: genksyms Tainted: G        W          6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
Tainted: [W]=WARN
Hardware name: IBM,9080-HEX POWER10 (architected) hv:phyp pSeries
NIP:  c0000000001cdfcc LR: c0000000001cdfc8 CTR: 0000000000000000
REGS: c00000005c62ee50 TRAP: 0700   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 00000005
CFAR: c000000000156a10 IRQMASK: 1
GPR00: c0000000001cdfc8 c00000005c62f0f0 c000000001b57400 000000000000000a
GPR04: 00000000ffff7fff c00000005c62eee0 c00000005c62eed8 00000007fb050000
GPR08: 0000000000000027 0000000000000000 0000000000000000 c000000002758de0
GPR12: c000000002a18d88 c0000007fffef480 0000000000000000 0000000000000000
GPR16: c000000002c56d40 0000000000000000 c00000005c62f5b4 0000000000000000
GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c000000003cd7300
GPR24: 0000000000000000 0000000000000008 c0000007fd1d3f80 0000000000000000
GPR28: 0000000000000001 0000000000000009 c0000007fd1d4080 c0000000656a0000
NIP [c0000000001cdfcc] update_entity_lag+0xcc/0xf0
LR [c0000000001cdfc8] update_entity_lag+0xc8/0xf0
Call Trace:
[c00000005c62f0f0] [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 (unreliable)
[c00000005c62f160] [c0000000001cea80] dequeue_entity+0xb0/0x6d0
[c00000005c62f1f0] [c0000000001cf8b0] dequeue_entities+0x150/0x600
[c00000005c62f2c0] [c0000000001d02a8] dequeue_task_fair+0x158/0x2e0
[c00000005c62f300] [c0000000001b5ea4] dequeue_task+0x64/0x200
[c00000005c62f380] [c0000000001cc950] detach_tasks+0x140/0x420
[c00000005c62f3f0] [c0000000001d6044] sched_balance_rq+0x214/0x7c0
[c00000005c62f550] [c0000000001d6830] sched_balance_newidle+0x240/0x630
[c00000005c62f640] [c0000000001d6d0c] pick_next_task_fair+0x7c/0x4a0
[c00000005c62f6d0] [c0000000001afc50] __pick_next_task+0x60/0x2d0
[c00000005c62f730] [c0000000010e8ce8] __schedule+0x198/0x840
[c00000005c62f810] [c0000000010e93d0] schedule+0x40/0x110
[c00000005c62f880] [c00000000064c574] pipe_read+0x424/0x6a0
[c00000005c62f960] [c00000000063a0fc] vfs_read+0x30c/0x3d0
[c00000005c62fa10] [c00000000063adf4] ksys_read+0x104/0x160
[c00000005c62fa60] [c000000000031678] system_call_exception+0x138/0x2d0
[c00000005c62fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
--- interrupt: 3000 at 0x7fffb8f4a0c4
NIP:  00007fffb8f4a0c4 LR: 00007fffb8f4a0c4 CTR: 0000000000000000
REGS: c00000005c62fe80 TRAP: 3000   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 48004222  XER: 00000000
IRQMASK: 0
GPR00: 0000000000000003 00007fffe27ca2d0 00007fffb9067100 0000000000000000
GPR04: 000000003076051b 0000000000002000 00007fffb9060b50 0000000000000000
GPR08: 000000000000006f 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fffb917a560 000000000000000b 0000000000000000
GPR16: 00007fffb9173570 0000000000002000 000000000000000c 000000000000000b
GPR20: 00000000307604c0 0000000030762509 00000000100484e8 0000000030762515
GPR24: 0000000010022b80 0000000000000000 0000000000000005 0000000000002000
GPR28: 000000003076051b 0000000000002000 00007fffb905e508 00007fffb9060b50
NIP [00007fffb8f4a0c4] 0x7fffb8f4a0c4
LR [00007fffb8f4a0c4] 0x7fffb8f4a0c4
--- interrupt: 3000
Code: 4e800020 3d220104 89297c19 2c090000 4082ff8c 3c62ff99 39200001 3d420104 38632d90 992a7c19 4bf88965 60000000 <0fe00000> 4bffff68 60000000 60000000
---[ end trace 0000000000000000 ]---
best=0000000000000000, curr=0000000000000000, se=00000000be02c573, node=0000000000000000 ocrr=0000000000000000, ose=00000000b1d4c4d5, onode=0000000023eb8c00
Kernel attempted to read user page (51) - exploit attempt? (uid: 0)
BUG: Kernel NULL pointer dereference on read at 0x00000051
Faulting instruction address: 0xc0000000001cfebc
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G        W          6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
Tainted: [W]=WARN
Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
NIP:  c0000000001cfebc LR: c0000000001cfebc CTR: 0000000000000000
REGS: c000000002c13950 TRAP: 0300   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44022282  XER: 0000000d
CFAR: c0000000001c4758 DAR: 0000000000000051 DSISR: 40000000 IRQMASK: 1
GPR00: c0000000001cfebc c000000002c13bf0 c000000001b57400 0000000000000000
GPR04: c0000007fd147108 c0000007fd1cd600 c000000002c13968 00000007fafb0000
GPR08: 0000000000000027 00000000004e2002 0000896cdac1f9a0 0000000000002000
GPR12: c000000002a18d88 c000000002f10000 0000000000000000 00000007fffe0000
GPR16: 00000007fffd0000 0000000000000000 00000007fffe0114 0000000000000000
GPR20: 0000000000000000 0000000000000000 c0000000010e95ec c000000002bd6380
GPR24: c000000002bd6da8 c000000002c524e8 0000000000000000 c000000002bd6380
GPR28: c000000002bd6380 c0000007fd1d3f80 c0000007fd1d4080 c0000007fd1d3f80
NIP [c0000000001cfebc] pick_next_entity+0x3c/0x180
LR [c0000000001cfebc] pick_next_entity+0x3c/0x180
Call Trace:
[c000000002c13bf0] [c0000000001cfebc] pick_next_entity+0x3c/0x180 (unreliable)
[c000000002c13c70] [c0000000001d0064] pick_task_fair+0x64/0x130
[c000000002c13cb0] [c0000000001d6cd8] pick_next_task_fair+0x48/0x4a0
[c000000002c13d40] [c0000000001afc50] __pick_next_task+0x60/0x2d0
[c000000002c13da0] [c0000000010e8ce8] __schedule+0x198/0x840
[c000000002c13e80] [c0000000010e95ec] schedule_idle+0x3c/0x70
[c000000002c13eb0] [c0000000001eb1d0] do_idle+0x160/0x1b0
[c000000002c13f00] [c0000000001eb4d0] cpu_startup_entry+0x50/0x60
[c000000002c13f30] [c0000000000110e8] rest_init+0xf0/0xf4
[c000000002c13f60] [c0000000020053a4] do_initcalls+0x0/0x190
[c000000002c13fe0] [c00000000000e788] start_here_common+0x1c/0x20
Code: 60000000 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7c7d1b78 7c9e2378 f8010010 f821ff81 60000000 7fc3f378 4bff4601 <89230051> 7c7f1b78 2c090000 40820098
---[ end trace 0000000000000000 ]---
pstore: backend (nvram) writing error (-1)

Kernel panic - not syncing: Fatal exception
Rebooting in 10 seconds..