linux-kernel - Re: [PATCH] trace/hwlat: prevent false sharing in get

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20260130101949.619907bf@gandalf.local.home>
Date: Fri, 30 Jan 2026 10:19:56 -0500
From: Steven Rostedt <rostedt@...dmis.org>
To: Colin Lord <clord@...olab.com>
Cc: linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org, Masami
 Hiramatsu <mhiramat@...nel.org>, Mathieu Desnoyers
 <mathieu.desnoyers@...icios.com>
Subject: Re: [PATCH] trace/hwlat: prevent false sharing in get_sample()

On Fri, 30 Jan 2026 01:09:06 -0800
Colin Lord <clord@...olab.com> wrote:

> The get_sample() function in the hwlat tracer assumes the caller holds
> hwlat_data.lock, but this is not actually happening. The result is
> unprotected data access to hwlat_data, and in per-cpu mode can result in
> false sharing. The false sharing can cause false positive latency
> events, since the sample_width member is involved and gets read as part
> of the main latency detection loop.
> 
> Lock before accessing hwlat_data members, and prevent false sharing by
> pulling sample_width into a local variable.
> 
> One system this was tested on was a dual socket server with 32 CPUs on
> each numa node. With settings of 1us threshold, 1000us width, and
> 2000us window, this change reduced the number of latency events from
> 500 per second down to approximately 1 event per minute. Some machines
> tested did not exhibit measurable latency from the false sharing.

Thanks for the report!

> 
> Signed-off-by: Colin Lord <clord@...olab.com>
> ---
> Hello, while debugging some poor hwlat results on a server I found this
> false sharing. I've tested the patch on multiple servers with many of
> the configs suggested by the patch submission checklist. A notable
> exception is I wasn't able to test with an SMP disabled build as
> multiple tags, including unmodified v6.18, were unable to finish booting
> with my config/hardware, however it did compile successfully. My
> understanding is that SMP is on its way to being required so I didn't
> spend more time on it, but I can do so if it's important. Thanks for
> your time and any feedback!

Don't worry about testing every flavor, that's my job ;-)

All I ask is basic tests, and see if it fixes the issue for you on your
machines.

> 
>  kernel/trace/trace_hwlat.c | 22 +++++++++++++++++-----
>  1 file changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
> index 2f7b94e98317..1a4b1409226b 100644
> --- a/kernel/trace/trace_hwlat.c
> +++ b/kernel/trace/trace_hwlat.c
> @@ -193,8 +193,7 @@ void trace_hwlat_callback(bool enter)
>   * get_sample - sample the CPU TSC and look for likely hardware latencies
>   *
>   * Used to repeatedly capture the CPU TSC (or similar), looking for potential
> - * hardware-induced latency. Called with interrupts disabled and with
> - * hwlat_data.lock held.
> + * hardware-induced latency. Disables interrupts during measurement.

I would remove the comment about the hwlat_data.lock held but still call
this with interrupts disabled.

>   */
>  static int get_sample(void)
>  {
> @@ -204,6 +203,7 @@ static int get_sample(void)
>  	time_type start, t1, t2, last_t2;
>  	s64 diff, outer_diff, total, last_total = 0;
>  	u64 sample = 0;
> +	u64 sample_width;
>  	u64 thresh = tracing_thresh;
>  	u64 outer_sample = 0;
>  	int ret = -1;
> @@ -211,6 +211,12 @@ static int get_sample(void)
>  
>  	do_div(thresh, NSEC_PER_USEC); /* modifies interval value */
>  
> +	mutex_lock(&hwlat_data.lock);
> +	sample_width = hwlat_data.sample_width;
> +	mutex_unlock(&hwlat_data.lock);

We don't need to take the mutex here, a simple READ_ONCE() would do.
sample_width isn't a critical number, and as long as we get a consistent
number per sample (one that existed just before a user updated it or the
new number the user entered).

	sample_width = READ_ONCE(hwlat_data.sample_width);

Especially since it doesn't even look like the hwlat_data.lock is used to
protect the updates of sample_width :-p

> +
> +	local_irq_disable();

Then we don't need to move the disabling of interrupts here.

> +
>  	kdata->nmi_total_ts = 0;
>  	kdata->nmi_count = 0;
>  	/* Make sure NMIs see this first */
> @@ -267,12 +273,14 @@ static int get_sample(void)
>  		if (diff > sample)
>  			sample = diff; /* only want highest value */
>  
> -	} while (total <= hwlat_data.sample_width);
> +	} while (total <= sample_width);
>  
>  	barrier(); /* finish the above in the view for NMIs */
>  	trace_hwlat_callback_enabled = false;
>  	barrier(); /* Make sure nmi_total_ts is no longer updated */
>  
> +	local_irq_enable();
> +
>  	ret = 0;
>  
>  	/* If we exceed the threshold value, we have found a hardware latency */
> @@ -285,8 +293,11 @@ static int get_sample(void)
>  		if (kdata->nmi_total_ts)
>  			do_div(kdata->nmi_total_ts, NSEC_PER_USEC);
>  
> +		mutex_lock(&hwlat_data.lock);
>  		hwlat_data.count++;
>  		s.seqnum = hwlat_data.count;
> +		mutex_unlock(&hwlat_data.lock);

Let's make the counter an atomic_t and use:

		s.seqnum = atomic_inc_return(&hwlat_data.count);


-- Steve

> +
>  		s.duration = sample;
>  		s.outer_duration = outer_sample;
>  		s.nmi_total_ts = kdata->nmi_total_ts;
> @@ -303,7 +314,10 @@ static int get_sample(void)
>  		}
>  	}
>  
> +	return ret;
> +
>  out:
> +	local_irq_enable();
>  	return ret;
>  }
>  
> @@ -361,9 +375,7 @@ static int kthread_fn(void *data)
>  		if (hwlat_data.thread_mode == MODE_ROUND_ROBIN)
>  			move_to_next_cpu();
>  
> -		local_irq_disable();
>  		get_sample();
> -		local_irq_enable();
>  
>  		mutex_lock(&hwlat_data.lock);
>  		interval = hwlat_data.sample_window - hwlat_data.sample_width;
> 
> base-commit: 24d479d26b25bce5faea3ddd9fa8f3a6c3129ea7