linux-kernel - Re: [PATCH v3 2/2] sched/numa: Add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning

Message-ID: <104BC9F8-AECA-470D-9A9D-C4AFA3D4184C@oracle.com>
Date: Wed, 23 Apr 2025 16:05:44 +0000
From: Libo Chen <libo.chen@...cle.com>
To: Steven Rostedt <rostedt@...dmis.org>
CC: "peterz@...radead.org" <peterz@...radead.org>,
        "mgorman@...e.de" <mgorman@...e.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "juri.lelli@...hat.com" <juri.lelli@...hat.com>,
        "vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
        "tj@...nel.org" <tj@...nel.org>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "llong@...hat.com" <llong@...hat.com>,
        "kprateek.nayak@....com" <kprateek.nayak@....com>,
        "raghavendra.kt@....com" <raghavendra.kt@....com>,
        "yu.c.chen@...el.com" <yu.c.chen@...el.com>,
        "tim.c.chen@...el.com" <tim.c.chen@...el.com>,
        "vineethr@...ux.ibm.com" <vineethr@...ux.ibm.com>,
        Chris Hyser <chris.hyser@...cle.com>,
        Daniel Jordan <daniel.m.jordan@...cle.com>,
        Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
        "mkoutny@...e.com" <mkoutny@...e.com>,
        Dhaval Giani <Dhaval.Giani@....com>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v3 2/2] sched/numa: Add tracepoint that tracks the
 skipping of numa balancing due to cpuset memory pinning



> On Apr 23, 2025, at 8:34 AM, Steven Rostedt <rostedt@...dmis.org> wrote:
> 
> On Thu, 17 Apr 2025 12:15:43 -0700
> Libo Chen <libo.chen@...cle.com> wrote:
> 
>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>> index 8994e97d86c13..25ee542fa0063 100644
>> --- a/include/trace/events/sched.h
>> +++ b/include/trace/events/sched.h
>> @@ -745,6 +745,36 @@ TRACE_EVENT(sched_skip_vma_numa,
>>  __entry->vm_end,
>>  __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
>> );
>> +
>> +TRACE_EVENT(sched_skip_cpuset_numa,
>> +
>> + TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
>> +
>> + TP_ARGS(tsk, mem_allowed_ptr),
>> +
>> + TP_STRUCT__entry(
>> + __array( char, comm, TASK_COMM_LEN )
>> + __field( pid_t, pid )
>> + __field( pid_t, tgid )
>> + __field( pid_t, ngid )
>> + __field( nodemask_t *, mem_allowed_ptr )
>> + ),
>> +
>> + TP_fast_assign(
>> + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
>> + __entry->pid = task_pid_nr(tsk);
>> + __entry->tgid = task_tgid_nr(tsk);
>> + __entry->ngid = task_numa_group_id(tsk);
>> + __entry->mem_allowed_ptr = mem_allowed_ptr;
> 
> This is a bug. You can't save random pointers in the TP_fast_assign() and
> reference it later in the TP_printk().
> 

Admittedly I was a bit nervous about dereferencing this pointer at TP_printk()
time. Will fix it!

I'm also wondering whether we could fail the build in this scenario, so that
this kind of bug is caught at build time.

Thanks
Libo
 
> The TP_fast_assign() is executed during the normal kernel workflow when the
> tracepoint is triggered. The pointer is saved into the ring buffer.
> 
>> + ),
>> +
>> + TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
>> +  __entry->comm,
>> +  __entry->pid,
>> +  __entry->tgid,
>> +  __entry->ngid,
>> +  nodemask_pr_args(__entry->mem_allowed_ptr))
> 
> The TP_printk() is executed when a user reads the /sys/kernel/tracing/trace
> file. Which could be literally months later.
> 
> The nodemask_pr_args() will dereference the __entry->mem_allowed_ptr that
> was saved in the ring buffer, but the memory it points to could have been
> freed days earlier.
> 
> If that happens, then BOOM! Kernel goes bye-bye!
> 
> The trace event verifier is made to find bugs like this. And with the recent
> update to handle "%*p" it found this bug. ;-)
> 
> -- Steve
> 
> 
>> +);
>> #endif /* CONFIG_NUMA_BALANCING */
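
The lifetime problem described in the review can be modelled in user space.
The sketch below is only an illustration under stated assumptions, not the
fix posted in any later revision of the patch: model_nodemask, model_event,
model_record() and model_format() are hypothetical stand-ins for nodemask_t,
the ring-buffer entry, TP_fast_assign() and TP_printk() respectively. The
point it tries to show is that the record step copies the mask bits by value,
so the later format step never touches the original object.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's nodemask_t: a fixed-size bitmap. */
#define MODEL_MAX_NODES  64
#define MODEL_MASK_WORDS ((MODEL_MAX_NODES + 63) / 64)

typedef struct {
    unsigned long long bits[MODEL_MASK_WORDS];
} model_nodemask;

/* Hypothetical ring-buffer record: it owns a copy of the bits, not a pointer. */
typedef struct {
    int pid;
    unsigned long long mem_allowed[MODEL_MASK_WORDS];
} model_event;

/* Build-time check that the copy in the record covers the whole mask. */
_Static_assert(sizeof(((model_event *)0)->mem_allowed) == sizeof(model_nodemask),
               "record must hold the full mask");

/* Record time (models TP_fast_assign): copy while the source is still alive. */
static void model_record(model_event *ev, int pid, const model_nodemask *mask)
{
    assert(ev != NULL && mask != NULL);
    ev->pid = pid;
    memcpy(ev->mem_allowed, mask->bits, sizeof(ev->mem_allowed));
}

static int model_test_bit(const unsigned long long *bits, int n)
{
    return (int)((bits[n / 64] >> (n % 64)) & 1ULL);
}

/* Append formatted text without ever writing past the end of the buffer. */
static size_t model_append(char *out, size_t len, size_t used, const char *fmt, ...)
{
    va_list ap;
    int n;

    if (used >= len)
        return used; /* buffer already full: stop writing */
    va_start(ap, fmt);
    n = vsnprintf(out + used, len - used, fmt, ap);
    va_end(ap);
    return n < 0 ? used : used + (size_t)n;
}

/* Read time (models TP_printk): format a node list from the record only. */
static size_t model_format(const model_event *ev, char *out, size_t len)
{
    size_t used = model_append(out, len, 0, "pid=%d mem_nodes_allowed=", ev->pid);
    const char *sep = "";

    for (int n = 0; n < MODEL_MAX_NODES; n++) {
        int start;

        if (!model_test_bit(ev->mem_allowed, n))
            continue;
        start = n;
        /* Extend the run while the next node is also set. */
        while (n + 1 < MODEL_MAX_NODES && model_test_bit(ev->mem_allowed, n + 1))
            n++;
        if (start == n)
            used = model_append(out, len, used, "%s%d", sep, start);
        else
            used = model_append(out, len, used, "%s%d-%d", sep, start, n);
        sep = ",";
    }
    return used;
}
```

With a mask holding nodes 0, 1 and 3, recording the event, then overwriting
and freeing the original mask, and only then calling model_format() still
yields "pid=42 mem_nodes_allowed=0-1,3", because the record carries its own
copy. Storing the pointer instead would make the format step read freed
memory, which is the failure the review points out.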