Message-ID: <104BC9F8-AECA-470D-9A9D-C4AFA3D4184C@oracle.com>
Date: Wed, 23 Apr 2025 16:05:44 +0000
From: Libo Chen <libo.chen@...cle.com>
To: Steven Rostedt <rostedt@...dmis.org>
CC: "peterz@...radead.org" <peterz@...radead.org>,
 "mgorman@...e.de" <mgorman@...e.de>,
 "mingo@...hat.com" <mingo@...hat.com>,
 "juri.lelli@...hat.com" <juri.lelli@...hat.com>,
 "vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
 "tj@...nel.org" <tj@...nel.org>,
 "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
 "llong@...hat.com" <llong@...hat.com>,
 "kprateek.nayak@....com" <kprateek.nayak@....com>,
 "raghavendra.kt@....com" <raghavendra.kt@....com>,
 "yu.c.chen@...el.com" <yu.c.chen@...el.com>,
 "tim.c.chen@...el.com" <tim.c.chen@...el.com>,
 "vineethr@...ux.ibm.com" <vineethr@...ux.ibm.com>,
 Chris Hyser <chris.hyser@...cle.com>,
 Daniel Jordan <daniel.m.jordan@...cle.com>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 "mkoutny@...e.com" <mkoutny@...e.com>,
 Dhaval Giani <Dhaval.Giani@....com>,
 "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v3 2/2] sched/numa: Add tracepoint that tracks the
 skipping of numa balancing due to cpuset memory pinning

> On Apr 23, 2025, at 8:34 AM, Steven Rostedt <rostedt@...dmis.org> wrote:
>
> On Thu, 17 Apr 2025 12:15:43 -0700
> Libo Chen <libo.chen@...cle.com> wrote:
>
>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>> index 8994e97d86c13..25ee542fa0063 100644
>> --- a/include/trace/events/sched.h
>> +++ b/include/trace/events/sched.h
>> @@ -745,6 +745,36 @@ TRACE_EVENT(sched_skip_vma_numa,
>> __entry->vm_end,
>> __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
>> );
>> +
>> +TRACE_EVENT(sched_skip_cpuset_numa,
>> +
>> + TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
>> +
>> + TP_ARGS(tsk, mem_allowed_ptr),
>> +
>> + TP_STRUCT__entry(
>> + __array( char, comm, TASK_COMM_LEN )
>> + __field( pid_t, pid )
>> + __field( pid_t, tgid )
>> + __field( pid_t, ngid )
>> + __field( nodemask_t *, mem_allowed_ptr )
>> + ),
>> +
>> + TP_fast_assign(
>> + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
>> + __entry->pid = task_pid_nr(tsk);
>> + __entry->tgid = task_tgid_nr(tsk);
>> + __entry->ngid = task_numa_group_id(tsk);
>> + __entry->mem_allowed_ptr = mem_allowed_ptr;
>
> This is a bug. You can't save random pointers in TP_fast_assign() and
> dereference them later in TP_printk().
>

Admittedly, I was a bit nervous about dereferencing this pointer at
TP_printk() time. Will fix it!

Also, I wonder whether we could fail the build in this scenario, so that
this kind of bug is caught at build time.

Thanks,
Libo

> The TP_fast_assign() is executed during the normal kernel workflow when the
> tracepoint is triggered. The pointer is saved into the ring buffer.
>
>> + ),
>> +
>> + TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
>> + __entry->comm,
>> + __entry->pid,
>> + __entry->tgid,
>> + __entry->ngid,
>> + nodemask_pr_args(__entry->mem_allowed_ptr))
>
> The TP_printk() is executed when a user reads the /sys/kernel/tracing/trace
> file, which could be literally months later.
>
> nodemask_pr_args() will dereference the __entry->mem_allowed_ptr that was
> saved in the ring buffer, and the memory it points to could have been
> freed days ago.
>
> If that happens, then BOOM! Kernel goes bye-bye!
>
> The trace event verifier is made to find bugs like this. And with the recent
> update to handle "%*p" it found this bug. ;-)
>
> -- Steve
>
>
>> +);
>> #endif /* CONFIG_NUMA_BALANCING */
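
The lifetime problem described in the review can be illustrated outside the
kernel. The following is a minimal user-space sketch, not the kernel fix
itself: struct trace_record, record_event(), format_event() and MAX_NODES are
hypothetical names made up for this illustration. It assumes the general
remedy is to copy the mask bits into the record when the event fires, so that
formatting at read time touches only the record. Unlike the kernel's "%*pbl",
which prints ranges, this sketch prints a plain comma-separated list.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

#define MAX_NODES      64  /* hypothetical stand-in for the node count */
#define BITS_PER_WORD  (CHAR_BIT * sizeof(unsigned long))
#define MASK_LONGS     ((MAX_NODES + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical stand-in for a node mask owned by some other object. */
struct node_mask {
	unsigned long bits[MASK_LONGS];
};

/* A record as it would sit in a trace buffer: it holds a copy, no pointer. */
struct trace_record {
	int pid;
	unsigned long mem_allowed[MASK_LONGS];
};

/* Runs when the event fires: copy the bits while the mask is still valid. */
void record_event(struct trace_record *rec, int pid,
		  const struct node_mask *mask)
{
	rec->pid = pid;
	memcpy(rec->mem_allowed, mask->bits, sizeof(rec->mem_allowed));
}

/*
 * Runs at read time, possibly long after the mask is gone. It reads only
 * the record. Returns the string length, or -1 if buf is too small.
 */
int format_event(const struct trace_record *rec, char *buf, size_t len)
{
	int n = snprintf(buf, len, "pid=%d mem_nodes_allowed=", rec->pid);
	const char *sep = "";

	for (unsigned int node = 0; node < MAX_NODES; node++) {
		unsigned long word = rec->mem_allowed[node / BITS_PER_WORD];

		if (!(word & (1UL << (node % BITS_PER_WORD))))
			continue;
		if (n < 0 || (size_t)n >= len)
			return -1;	/* output failed or was truncated */
		n += snprintf(buf + n, len - (size_t)n, "%s%u", sep, node);
		sep = ",";
	}
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

With this layout the original mask can be overwritten or freed after
record_event() returns and format_event() still produces the value captured
when the event fired. The cost is a larger record, since every event carries
the full bitmap instead of one pointer.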