linux-kernel - Re: sched/fair: Kernel panics in pick_next

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <941ddfc5-da9e-4f09-8025-b1e4c25de406@arm.com>
Date: Thu, 26 Sep 2024 15:31:49 +0100
From: Luis Machado <luis.machado@....com>
To: Vishal Chourasia <vishalc@...ux.ibm.com>, linux-kernel@...r.kernel.org
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Juri Lelli <juri.lelli@...hat.com>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
 Mike Galbraith <efault@....de>
Subject: Re: sched/fair: Kernel panics in pick_next_entity

On 9/26/24 13:42, Vishal Chourasia wrote:
> I've noticed a kernel panic consistently occurring on the mainline v6.11
> kernel (see attached dmesg log below). 
> 
> The panic occurs almost every time I build the Linux kernel from source.
> 
> Steps to Reproduce:
> 
> make clean
> ./scripts/config -e LOCALVERSION_AUTO
> ./scripts/config --set-str LOCALVERSION -master-with-print
> make localmodconfig
> make -j8 -s vmlinux modules
> 
>>>From my investigation, it seems that the function pick_eevdf() can return NULL.
> Commit f12e1488 ("sched/fair: Prepare pick_next_task() for delayed dequeue") 
> introduces an access on the return value of pick_eevdf(). If 'se' was NULL, 
> it can lead to a null pointer dereference. 
> 
> # objdump -S vmlinux | grep -C 5 c0000000001cfebc
> c0000000001cfeb0:       00 00 00 60     nop
>         struct sched_entity *se = pick_eevdf(cfs_rq);
> c0000000001cfeb4:       78 f3 c3 7f     mr      r3,r30
> c0000000001cfeb8:       01 46 ff 4b     bl      c0000000001c44b8 <pick_eevdf+0x8>
>         if (se->sched_delayed) {
> c0000000001cfebc:       51 00 23 89     lbz     r9,81(r3)  <<<<<<
>         struct sched_entity *se = pick_eevdf(cfs_rq);
> c0000000001cfec0:       78 1b 7f 7c     mr      r31,r3
>         if (se->sched_delayed) {
> c0000000001cfec4:       00 00 09 2c     cmpwi   r9,0
> c0000000001cfec8:       98 00 82 40     bne     c0000000001cff60 <pick_next_entity+0xe0>
> 
> r3 is NULL which can be verified from the register context shared in the
> dmesg logs
> 
> Here is the state of my git repository:
> # git log --oneline
> 684a64bf32b6 (HEAD -> master, origin/master, origin/HEAD) Merge tag 'nfs-for-6.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
> f7fccaa77271 Merge tag 'fuse-update-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
> 4165cee7ecb1 Merge tag 'exfat-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
> 79952bdcbcea Merge tag 'f2fs-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
> fa8380a06bd0 Merge tag 'bpf-next-6.12-struct-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
> 68e5c7d4cefb Merge tag 'kbuild-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
> 7f8de2bf0725 Merge tag 'linux-cpupower-6.12-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux
> cd3d64772981 Merge tag 'i3c/for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux
> 
> 
> To determine why pick_eevdf() would return NULL, I added a few printk statements
> Based on one of the printk logs in the shared dmesg log, it appears that if
> pick_eevdf() is called for a 'cfs_rq' whose 'cfs_rq->curr' is NULL and there
> are no eligible entities on that 'cfs_rq', it will return NULL. 
> 
> # git diff
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 225b31aaee55..8c5b96f1cd49 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -48,6 +48,7 @@
>  #include <linux/ratelimit.h>
>  #include <linux/task_work.h>
>  #include <linux/rbtree_augmented.h>
> +#include <linux/delay.h>
> 
>  #include <asm/switch_to.h>
> 
> @@ -907,16 +908,25 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
>  static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>  {
>         struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> +       struct rb_node *tmpnode = node;
>         struct sched_entity *se = __pick_first_entity(cfs_rq);
> +       struct sched_entity *tmpse = se;
>         struct sched_entity *curr = cfs_rq->curr;
> +       struct sched_entity *tmpcurr = curr;
>         struct sched_entity *best = NULL;
> -
> +       struct sched_entity *tmp = NULL;
>         /*
>          * We can safely skip eligibility check if there is only one entity
>          * in this cfs_rq, saving some cycles.
>          */
> -       if (cfs_rq->nr_running == 1)
> -               return curr && curr->on_rq ? curr : se;
> +       if (cfs_rq->nr_running == 1) {
> +               tmp = curr && curr->on_rq ? curr : se;
> +               if (!tmp) {
> +                       printk(KERN_INFO "pick_eevdf curr: %p, se: %p\n", curr, se);
> +                       mdelay(10);
> +               }
> +               return tmp;
> +       }
> 
>         if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>                 curr = NULL;
> @@ -966,6 +976,11 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>         if (!best || (curr && entity_before(curr, best)))
>                 best = curr;
> 
> +       if (!best) {
> +               printk(KERN_INFO "best=%p, curr=%p, se=%p, node=%p ocrr=%p, ose=%p, onode=%p\n", best, curr, se, node, tmpcurr, tmpse, tmpnode);
> +               mdelay(10);
> +       }
> +
>         return best;
>  }
> 
>>>From the logs below:
> [ 1355.763494] best=0000000000000000, curr=0000000000000000, se=00000000be02c573, node=0000000000000000 ocrr=0000000000000000, ose=00000000b1d4c4d5, onode=0000000023eb8c00
> 
> I have not been able to think of a quick reproducer to trigger a panic
> https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/for this case. Hoping if someone can guide me on this.
> 
> Note: The following dmesg log also contains a warning reported too. Panic
> happens later.
> 
> ------------[ cut here ]------------
> !se->on_rq
> WARNING: CPU: 1 PID: 92333 at kernel/sched/fair.c:705 update_entity_lag+0xcc/0xf0
> Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
> CPU: 1 UID: 0 PID: 92333 Comm: genksyms Tainted: G        W          6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
> Tainted: [W]=WARN
> Hardware name: IBM,9080-HEX POWER10 (architected) hv:phyp pSeries
> NIP:  c0000000001cdfcc LR: c0000000001cdfc8 CTR: 0000000000000000
> REGS: c00000005c62ee50 TRAP: 0700   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
> MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 00000005
> CFAR: c000000000156a10 IRQMASK: 1
> GPR00: c0000000001cdfc8 c00000005c62f0f0 c000000001b57400 000000000000000a
> GPR04: 00000000ffff7fff c00000005c62eee0 c00000005c62eed8 00000007fb050000
> GPR08: 0000000000000027 0000000000000000 0000000000000000 c000000002758de0
> GPR12: c000000002a18d88 c0000007fffef480 0000000000000000 0000000000000000
> GPR16: c000000002c56d40 0000000000000000 c00000005c62f5b4 0000000000000000
> GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c000000003cd7300
> GPR24: 0000000000000000 0000000000000008 c0000007fd1d3f80 0000000000000000
> GPR28: 0000000000000001 0000000000000009 c0000007fd1d4080 c0000000656a0000
> NIP [c0000000001cdfcc] update_entity_lag+0xcc/0xf0
> LR [c0000000001cdfc8] update_entity_lag+0xc8/0xf0
> Call Trace:
> [c00000005c62f0f0] [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 (unreliable)
> [c00000005c62f160] [c0000000001cea80] dequeue_entity+0xb0/0x6d0
> [c00000005c62f1f0] [c0000000001cf8b0] dequeue_entities+0x150/0x600
> [c00000005c62f2c0] [c0000000001d02a8] dequeue_task_fair+0x158/0x2e0
> [c00000005c62f300] [c0000000001b5ea4] dequeue_task+0x64/0x200
> [c00000005c62f380] [c0000000001cc950] detach_tasks+0x140/0x420
> [c00000005c62f3f0] [c0000000001d6044] sched_balance_rq+0x214/0x7c0
> [c00000005c62f550] [c0000000001d6830] sched_balance_newidle+0x240/0x630
> [c00000005c62f640] [c0000000001d6d0c] pick_next_task_fair+0x7c/0x4a0
> [c00000005c62f6d0] [c0000000001afc50] __pick_next_task+0x60/0x2d0
> [c00000005c62f730] [c0000000010e8ce8] __schedule+0x198/0x840
> [c00000005c62f810] [c0000000010e93d0] schedule+0x40/0x110
> [c00000005c62f880] [c00000000064c574] pipe_read+0x424/0x6a0
> [c00000005c62f960] [c00000000063a0fc] vfs_read+0x30c/0x3d0
> [c00000005c62fa10] [c00000000063adf4] ksys_read+0x104/0x160
> [c00000005c62fa60] [c000000000031678] system_call_exception+0x138/0x2d0
> [c00000005c62fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
> --- interrupt: 3000 at 0x7fffb8f4a0c4
> NIP:  00007fffb8f4a0c4 LR: 00007fffb8f4a0c4 CTR: 0000000000000000
> REGS: c00000005c62fe80 TRAP: 3000   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
> MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 48004222  XER: 00000000
> IRQMASK: 0
> GPR00: 0000000000000003 00007fffe27ca2d0 00007fffb9067100 0000000000000000
> GPR04: 000000003076051b 0000000000002000 00007fffb9060b50 0000000000000000
> GPR08: 000000000000006f 0000000000000000 0000000000000000 0000000000000000
> GPR12: 0000000000000000 00007fffb917a560 000000000000000b 0000000000000000
> GPR16: 00007fffb9173570 0000000000002000 000000000000000c 000000000000000b
> GPR20: 00000000307604c0 0000000030762509 00000000100484e8 0000000030762515
> GPR24: 0000000010022b80 0000000000000000 0000000000000005 0000000000002000
> GPR28: 000000003076051b 0000000000002000 00007fffb905e508 00007fffb9060b50
> NIP [00007fffb8f4a0c4] 0x7fffb8f4a0c4
> LR [00007fffb8f4a0c4] 0x7fffb8f4a0c4
> --- interrupt: 3000
> Code: 4e800020 3d220104 89297c19 2c090000 4082ff8c 3c62ff99 39200001 3d420104 38632d90 992a7c19 4bf88965 60000000 <0fe00000> 4bffff68 60000000 60000000
> ---[ end trace 0000000000000000 ]---
> best=0000000000000000, curr=0000000000000000, se=00000000be02c573, node=0000000000000000 ocrr=0000000000000000, ose=00000000b1d4c4d5, onode=0000000023eb8c00
> Kernel attempted to read user page (51) - exploit attempt? (uid: 0)
> BUG: Kernel NULL pointer dereference on read at 0x00000051
> Faulting instruction address: 0xc0000000001cfebc
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G        W          6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
> Tainted: [W]=WARN
> Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
> NIP:  c0000000001cfebc LR: c0000000001cfebc CTR: 0000000000000000
> REGS: c000000002c13950 TRAP: 0300   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
> MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44022282  XER: 0000000d
> CFAR: c0000000001c4758 DAR: 0000000000000051 DSISR: 40000000 IRQMASK: 1
> GPR00: c0000000001cfebc c000000002c13bf0 c000000001b57400 0000000000000000
> GPR04: c0000007fd147108 c0000007fd1cd600 c000000002c13968 00000007fafb0000
> GPR08: 0000000000000027 00000000004e2002 0000896cdac1f9a0 0000000000002000
> GPR12: c000000002a18d88 c000000002f10000 0000000000000000 00000007fffe0000
> GPR16: 00000007fffd0000 0000000000000000 00000007fffe0114 0000000000000000
> GPR20: 0000000000000000 0000000000000000 c0000000010e95ec c000000002bd6380
> GPR24: c000000002bd6da8 c000000002c524e8 0000000000000000 c000000002bd6380
> GPR28: c000000002bd6380 c0000007fd1d3f80 c0000007fd1d4080 c0000007fd1d3f80
> NIP [c0000000001cfebc] pick_next_entity+0x3c/0x180
> LR [c0000000001cfebc] pick_next_entity+0x3c/0x180
> Call Trace:
> [c000000002c13bf0] [c0000000001cfebc] pick_next_entity+0x3c/0x180 (unreliable)
> [c000000002c13c70] [c0000000001d0064] pick_task_fair+0x64/0x130
> [c000000002c13cb0] [c0000000001d6cd8] pick_next_task_fair+0x48/0x4a0
> [c000000002c13d40] [c0000000001afc50] __pick_next_task+0x60/0x2d0
> [c000000002c13da0] [c0000000010e8ce8] __schedule+0x198/0x840
> [c000000002c13e80] [c0000000010e95ec] schedule_idle+0x3c/0x70
> [c000000002c13eb0] [c0000000001eb1d0] do_idle+0x160/0x1b0
> [c000000002c13f00] [c0000000001eb4d0] cpu_startup_entry+0x50/0x60
> [c000000002c13f30] [c0000000000110e8] rest_init+0xf0/0xf4
> [c000000002c13f60] [c0000000020053a4] do_initcalls+0x0/0x190
> [c000000002c13fe0] [c00000000000e788] start_here_common+0x1c/0x20
> Code: 60000000 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7c7d1b78 7c9e2378 f8010010 f821ff81 60000000 7fc3f378 4bff4601 <89230051> 7c7f1b78 2c090000 40820098
> ---[ end trace 0000000000000000 ]---
> pstore: backend (nvram) writing error (-1)
> 
> Kernel panic - not syncing: Fatal exception
> Rebooting in 10 seconds..
> 
> 

I have a feeling this might be what is described in this thread: https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/

Likely a result of an overflow for vruntime. And we come out of pick_eevdf without a valid entity.