Message-ID: <20250423113459.0e53be50@gandalf.local.home>
Date: Wed, 23 Apr 2025 11:34:59 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Libo Chen <libo.chen@...cle.com>
Cc: peterz@...radead.org, mgorman@...e.de, mingo@...hat.com,
 juri.lelli@...hat.com, vincent.guittot@...aro.org, tj@...nel.org,
 akpm@...ux-foundation.org, llong@...hat.com, kprateek.nayak@....com,
 raghavendra.kt@....com, yu.c.chen@...el.com, tim.c.chen@...el.com,
 vineethr@...ux.ibm.com, chris.hyser@...cle.com, daniel.m.jordan@...cle.com,
 lorenzo.stoakes@...cle.com, mkoutny@...e.com, Dhaval.Giani@....com,
 cgroups@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 2/2] sched/numa: Add tracepoint that tracks the
 skipping of numa balancing due to cpuset memory pinning

On Thu, 17 Apr 2025 12:15:43 -0700
Libo Chen <libo.chen@...cle.com> wrote:

> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 8994e97d86c13..25ee542fa0063 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -745,6 +745,36 @@ TRACE_EVENT(sched_skip_vma_numa,
>  		  __entry->vm_end,
>  		  __print_symbolic(__entry->reason, NUMAB_SKIP_REASON))
>  );
> +
> +TRACE_EVENT(sched_skip_cpuset_numa,
> +
> +	TP_PROTO(struct task_struct *tsk, nodemask_t *mem_allowed_ptr),
> +
> +	TP_ARGS(tsk, mem_allowed_ptr),
> +
> +	TP_STRUCT__entry(
> +		__array( char,		comm,		TASK_COMM_LEN	)
> +		__field( pid_t,		pid				)
> +		__field( pid_t,		tgid				)
> +		__field( pid_t,		ngid				)
> +		__field( nodemask_t *,	mem_allowed_ptr			)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
> +		__entry->pid		 = task_pid_nr(tsk);
> +		__entry->tgid		 = task_tgid_nr(tsk);
> +		__entry->ngid		 = task_numa_group_id(tsk);
> +		__entry->mem_allowed_ptr = mem_allowed_ptr;

This is a bug. You can't save random pointers in the TP_fast_assign() and
dereference them later in the TP_printk().

The TP_fast_assign() is executed during the normal kernel workflow when the
tracepoint is triggered. The pointer is saved into the ring buffer.

> +	),
> +
> +	TP_printk("comm=%s pid=%d tgid=%d ngid=%d mem_nodes_allowed=%*pbl",
> +		  __entry->comm,
> +		  __entry->pid,
> +		  __entry->tgid,
> +		  __entry->ngid,
> +		  nodemask_pr_args(__entry->mem_allowed_ptr))

The TP_printk() is executed when a user reads the /sys/kernel/tracing/trace
file, which could be literally months later.

The nodemask_pr_args() will dereference the __entry->mem_allowed_ptr that
was saved in the ring buffer, and the memory it points to could have been
freed days ago.

If that happens, then BOOM! Kernel goes bye-bye!

The trace event verifier is made to find bugs like this. And with the recent
update to handle "%*p" it found this bug. ;-)
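
To illustrate the difference between saving a pointer and saving the data
itself, here is a minimal userspace sketch. It is not kernel code and not
the fix for this patch: model_nodemask_t, the entry structs and the
record_*/format_* helpers are all hypothetical names made up for the
example. "Record time" plays the role of TP_fast_assign() and "read time"
plays the role of TP_printk().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for nodemask_t: one bit per allowed node. */
typedef struct {
	unsigned long long bits;
} model_nodemask_t;

/* Buggy layout: the record holds only the address of the mask. */
struct entry_by_pointer {
	int pid;
	const model_nodemask_t *mem_allowed_ptr;
};

/* Safer layout: the record holds its own copy of the mask contents. */
struct entry_by_value {
	int pid;
	model_nodemask_t mem_allowed;
};

/* Record time: saves the pointer, not what it points to. */
static void record_by_pointer(struct entry_by_pointer *e, int pid,
			      const model_nodemask_t *mask)
{
	assert(mask != NULL);
	e->pid = pid;
	e->mem_allowed_ptr = mask;
}

/* Record time: copies the mask contents into the record. */
static void record_by_value(struct entry_by_value *e, int pid,
			    const model_nodemask_t *mask)
{
	assert(mask != NULL);
	e->pid = pid;
	memcpy(&e->mem_allowed, mask, sizeof(*mask));
}

/* Read time: dereferences whatever the saved pointer refers to now. */
static int format_by_pointer(const struct entry_by_pointer *e,
			     char *buf, size_t len)
{
	return snprintf(buf, len, "pid=%d mem_nodes_allowed=%#llx",
			e->pid, e->mem_allowed_ptr->bits);
}

/* Read time: uses only data stored inside the record. */
static int format_by_value(const struct entry_by_value *e,
			   char *buf, size_t len)
{
	return snprintf(buf, len, "pid=%d mem_nodes_allowed=%#llx",
			e->pid, e->mem_allowed.bits);
}
```

If the original mask is modified between record time and read time,
format_by_pointer() reports the new contents while format_by_value()
still reports what was true when the event fired. If the original mask
were freed instead, format_by_pointer() would read freed memory, which is
the failure described above. In a trace event, the analogous approach
would be to copy the mask bits into a field of the entry in
TP_fast_assign() and print from that copy.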

-- Steve


> +);
>  #endif /* CONFIG_NUMA_BALANCING */
