lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <172769aab6901361f992a4ca67dd5dd8864f43ba.camel@linux.ibm.com>
Date: Thu, 06 Mar 2025 10:18:04 +0530
From: Aboorva Devarajan <aboorvad@...ux.ibm.com>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Vincent Guittot <vincent.guittot@...aro.org>, mingo@...hat.com,
        peterz@...radead.org, juri.lelli@...hat.com, riel@...riel.com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        vschneid@...hat.com, odin@...d.al, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/1] sched/fair: Fix invalid pointer dereference in
 child_cfs_rq_on_list()

On Wed, 2025-03-05 at 10:23 +0100, Dietmar Eggemann wrote:
> On 05/03/2025 09:21, Vincent Guittot wrote:
> > On Tue, 4 Mar 2025 at 18:00, Aboorva Devarajan <aboorvad@...ux.ibm.com> wrote:
> > > 
> > > In child_cfs_rq_on_list(), leaf_cfs_rq_list.prev is expected to point to
> > > a valid cfs_rq->leaf_cfs_rq_list in the hierarchy. However, when accessed
> > > from the first node in a list, leaf_cfs_rq_list.prev can incorrectly point
> > > back to the list head (rq->leaf_cfs_rq_list) instead of another
> > > cfs_rq->leaf_cfs_rq_list.
> > > 
> > > The function does not handle this case, leading to incorrect pointer
> > > calculations and unintended memory accesses, which can result in a kernel
> > > crash.
> > > 
> > > A recent attempt to reorder fields in struct rq exposed this issue by
> > > modifying memory offsets and affecting how pointer computations are
> > > resolved. While the problem existed before, it was previously masked by
> > > specific field arrangement. The reordering caused erroneous pointer
> > > accesses, leading to a NULL dereference and a crash, as seen in the
> 
> I'm running tip/sched/core on arm64 and I still only see the wrong
> pointer for 'prev_cfs_rq->tg->parent' in the 'prev ==
> &rq->leaf_cfs_rq_list' case?
> 
> ...
> cpu=5 prev_cfs_rq->tg=ffff00097efb63a0 parent=0000000000000010
> cfs_rq->tg=ffff000802084000
> ...
> 

Hi Dietmar,

Yes, you are right, I meant that we will still have invalid pointers and use it 
silently in the vanilla kernel, but it won't always lead to a crash.

The crash in this specific case happens if `prev_cfs_rq->tg` points to a memory
location that cannot be de-referenced. Otherwise, the function de-references and
uses memory locations that are not valid but did not cause a visible failure so far.

Here are more details on what I meant by reordering the runqueue:

With the system and kernel configuration, I encountered the crash while trying
to reorder the runqueue structure, here is the minimal change that caused the
crash on top of v6.14-rc5 kernel:

---
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c8512a9fb022..597c1e6a9b5d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1143,8 +1143,8 @@ struct rq {
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
        /* list of leaf cfs_rq on this CPU: */
+       struct list_head        *tmp_alone_branch;
        struct list_head        leaf_cfs_rq_list;
-       struct list_head        *tmp_alone_branch;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
---

Here is the crash signature:

[    1.114431][  T552] Kernel attempted to read user page (10000012f) - exploit attempt? (uid: 0)
[    1.114440][  T552] BUG: Unable to handle kernel data access on read at 0x10000012f
[    1.114446][  T552] Faulting instruction address: 0xc0000000001c1044
[    1.116344][  T241] pstore: backend (nvram) writing error (-1)
[    1.116351][  T241] 
[    1.116354][  T552] Oops: Kernel access of bad area, sig: 11 [#2]
[    1.116354][  T241] note: kworker/44:0[241] exited with irqs disabled
[    1.116356][  T552] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[    1.116368][  T552] Modules linked in: autofs4
[    1.116374][  T552] CPU: 73 UID: 0 PID: 552 Comm: kworker/73:1 Tainted: G      D            6.14.0-rc5-dirty #193
[    1.116381][  T552] Tainted: [D]=DIE
[    1.116384][  T552] Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_012) hv:phyp pSeries
[    1.116391][  T552] NIP:  c0000000001c1044 LR: c0000000001c0f98 CTR: c000000000027ad4
[    1.116396][  T552] REGS: c000000076627870 TRAP: 0300   Tainted: G      D             (6.14.0-rc5-dirty)
[    1.116401][  T552] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48008202  XER: 20040154
...
[    1.116467][  T552] NIP [c0000000001c1044] sched_balance_update_blocked_averages+0x35c/0x88c
[    1.116474][  T552] LR [c0000000001c0f98] sched_balance_update_blocked_averages+0x2b0/0x88c

~~~~

Before reordering, struct rq and cfs_rq had the following memory layout
(snippet from pahole):

struct rq {
	...                                                 (offset  bytes)
	struct list_head           leaf_cfs_rq_list;     /*  4048    16 */
	struct list_head *         tmp_alone_branch;     /*  4064     8 */
	unsigned int               nr_uninterruptible;   /*  4072     4 */
	...
}
struct cfs_rq {
	...
	struct list_head           leaf_cfs_rq_list;     /*   456    16 */
	struct task_group *        tg;                   /*   472     8 */
	...
}

In child_cfs_rq_on_list(), `prev_cfs_rq` is computed using the `container_of`
macro:

prev_cfs_rq = container_of(prev, struct cfs_rq, leaf_cfs_rq_list);

Since `prev == &rq->leaf_cfs_rq_list`, this results in:

prev_cfs_rq = rq->leaf_cfs_rq_list -  456 (offset of leaf_cfs_rq_list in cfs_rq)

Then, `prev_cfs_rq->tg` is accessed at an offset of 472 bytes from base cfs_rq:

prev_cfs_rq->tg = (rq->leaf_cfs_rq_list - 456) + 472
                 = rq->leaf_cfs_rq_list + 16
                 = rq->tmp_alone_branch

Since `tmp_alone_branch` is always at this point a valid pointer, dereferencing
`prev_cfs_rq->tg->parent` doesn't cause a crash, even though it is not
a valid task_group pointer.

~~~~

After reordering, the layout of `struct rq` changed as follows:

struct rq {
	...                                                (offset  bytes)
	struct list_head *         tmp_alone_branch;     /*  4048     8 */ -> this is shuffled up
	struct list_head           leaf_cfs_rq_list;     /*  4056    16 */
	unsigned int               nr_uninterruptible;   /*  4072     4 */
	...
}

The layout of `struct cfs_rq` is unchanged.

Now, when the same pointer arithmetic is performed:

prev_cfs_rq = rq->leaf_cfs_rq_list - 456

prev_cfs_rq->tg = (rq->leaf_cfs_rq_list - 456) + 472
                 = rq->leaf_cfs_rq_list + 16
                 = rq->nr_uninterruptible        # now this mem location corresponds to nr_uninterruptible.

Since nr_uninterruptible is an integer rather than a de-referenceable pointer, I presume because of
this, de-referencing parent from prev_cfs_rq->tg results in a crash. Otherwise,
incorrect pointers are silently used without a visible failure.


But looks like a patch similar to this is merged yesterday [1], so this can
be ignored :)


[1] https://lore.kernel.org/all/174119292742.14745.16827644501260146974.tip-bot2@tip-bot2/

Thanks,
Aboorva


> > > ...
> > > 
> 


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ