Message-ID: <20250423113459.0e53be50@gandalf.local.home>
Date: Wed, 23 Apr 2025 11:34:59 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Libo Chen <libo.chen@...cle.com>
Cc: peterz@...radead.org, mgorman@...e.de, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org, tj@...nel.org,
akpm@...ux-foundation.org, llong@...hat.com, kprateek.nayak@....com,
raghavendra.kt@....com, yu.c.chen@...el.com, tim.c.chen@...el.com,
vineethr@...ux.ibm.com, chris.hyser@...cle.com, daniel.m.jordan@...cle.com,
lorenzo.stoakes@...cle.com, mkoutny@...e.com, Dhaval.Giani@....com,
cgroups@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 2/2] sched/numa: Add tracepoint that tracks the
skipping of numa balancing due to cpuset memory pinning
On Thu, 17 Apr 2025 12:15:43 -0700
Libo Chen <libo.chen@...cle.com> wrote:
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 8994e97d86c13..25ee542fa0063 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -745,6 +745,36 @@ TRACE_EVENT(sched_skip_vma_numa,
> __entry->vm_end,
> __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
> );
> +
> +TRACE_EVENT(sched_skip_cpuset_numa,
> +
> + TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
> +
> + TP_ARGS(tsk, mem_allowed_ptr),
> +
> + TP_STRUCT__entry(
> + __array( char, comm, TASK_COMM_LEN )
> + __field( pid_t, pid )
> + __field( pid_t, tgid )
> + __field( pid_t, ngid )
> + __field( nodemask_t *, mem_allowed_ptr )
> + ),
> +
> + TP_fast_assign(
> + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
> + __entry->pid = task_pid_nr(tsk);
> + __entry->tgid = task_tgid_nr(tsk);
> + __entry->ngid = task_numa_group_id(tsk);
> + __entry->mem_allowed_ptr = mem_allowed_ptr;
This is a bug. You can't save random pointers in the TP_fast_assign() and
dereference them later in the TP_printk().
The TP_fast_assign() is executed during the normal kernel workflow when the
tracepoint is triggered. The pointer is saved into the ring buffer.
> + ),
> +
> + TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
> + __entry->comm,
> + __entry->pid,
> + __entry->tgid,
> + __entry->ngid,
> + nodemask_pr_args(__entry->mem_allowed_ptr))
The TP_printk() is executed when a user reads the /sys/kernel/tracing/trace
file, which could be literally months later.
The nodemask_pr_args() will dereference the __entry->mem_allowed_ptr that
was saved in the ring buffer, and the memory it points to could have been
freed days ago.
If that happens, then BOOM! Kernel goes bye-bye!
The trace event verifier is made to find bugs like this. And with the recent
update to handle "%*p" it found this bug. ;-)
-- Steve
> +);
> #endif /* CONFIG_NUMA_BALANCING */
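To illustrate the point of the review, here is a minimal userspace sketch,
not kernel code and not the fix that was actually applied. It assumes a
hypothetical one-word "struct mask" standing in for nodemask_t, and the
hypothetical helpers record_copy() and format_copy() standing in for
TP_fast_assign() and TP_printk(). The idea is that the record stores the
mask contents by value at record time, so formatting it later does not
depend on the original object still being alive.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for nodemask_t: a single word of node bits. */
struct mask {
	unsigned long bits;
};

/*
 * Hypothetical ring-buffer record. The mask is stored by value, so the
 * record stays valid after the source object is modified or freed.
 */
struct entry_copy {
	int pid;
	struct mask mem_allowed;
};

/* Plays the role of TP_fast_assign(): runs when the event fires. */
static void record_copy(struct entry_copy *e, int pid, const struct mask *m)
{
	e->pid = pid;
	e->mem_allowed = *m;	/* copy the contents, not the pointer */
}

/* Plays the role of TP_printk(): runs later, when the trace is read. */
static int format_copy(char *buf, size_t len, const struct entry_copy *e)
{
	return snprintf(buf, len, "pid=%d mem_nodes_allowed=%#lx",
			e->pid, e->mem_allowed.bits);
}
```

Had the record kept a "struct mask *" instead, format_copy() would have to
dereference memory that may no longer hold the mask by the time it runs,
which is the failure mode described above.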