linux-kernel - Re: [PATCH v2] trace/hwlat: prevent false sharing in get

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <DG7Q5K72T7AZ.2TGDSMRADSUD9@mykolab.com>
Date: Fri, 06 Feb 2026 00:09:49 -0800
From: "Colin Lord" <clord@...olab.com>
To: "Steven Rostedt" <rostedt@...dmis.org>, "Colin Lord" <clord@...olab.com>
Cc: <linux-kernel@...r.kernel.org>, <linux-trace-kernel@...r.kernel.org>,
 "Masami Hiramatsu" <mhiramat@...nel.org>, "Mathieu Desnoyers"
 <mathieu.desnoyers@...icios.com>
Subject: Re: [PATCH v2] trace/hwlat: prevent false sharing in get_sample()

On Thu Feb 5, 2026 at 6:47 AM PST, Steven Rostedt wrote:
> On Sun,  1 Feb 2026 18:58:38 -0800
> Colin Lord <clord@...olab.com> wrote:
>
> I was about to send this as part of my fixes, but taking a closer look at
> it, I have more questions.
>
>> The get_sample() function in the hwlat tracer assumes the caller holds
>> hwlat_data.lock, but this is not actually happening. The result is
>> unprotected data access to hwlat_data, and in per-cpu mode can result in
>> false sharing. The false sharing can cause false positive latency
>
> BTW, what exactly  do you mean by "false sharing"?
>

>> events, since the sample_width member is involved and gets read as part
>> of the main latency detection loop.
>
> I'm trying to figure out why the change in sample_width would cause any
> issue.
>

You described it below, but yes, by false sharing I am referring to
cache latencies that occur when reading sample_width. In this case, it's
primarily due to hwlat_data.count getting modified. Since count and
sample_width are just 8B apart from each other, they'll often share a
cache line, so one thread incrementing count due to a latency event
causes all other running threads to fetch the updated cache line when
reading sample_width.

On most systems I tested, this doesn't cause high enough latency to show
up, but I guess the system mentioned in the commit message has a slow
interconnect between the two sockets. Combined with large numbers of
threads, a single latency event would cause a cascade of cache related
latencies.

Another test I did to verify this is what was happening on that system
was setting tracing_cpumask to only contain CPUs on the same socket. If
I ran 32 threads on the same socket, there were far fewer latency events
compared to running 16 on one socket + 16 on the other (same number of
total threads just different distribution).

>> 
>> Convert hwlat_data.count to atomic64_t so it can be safely accessed
>> without locking, and prevent false sharing by pulling sample_width into
>> a local variable.
>
> The above still makes sense, but this only effects the seqnum, which may
> show either duplicate or skipped numbers. But shouldn't affect any of the
> other data.
>

Yes, the modifications made in this commit to hwlat_data.count do not
affect the latency. The change is related to the rest of the commit in
the sense that it is addressing the incorrect assumption that
hwlat_data.lock is held during get_sample(), but it would also make
sense to put the hwlat_data.count changes in a separate commit. I could
split it up if you'd prefer that.

For what it's worth, I did also see many cases of duplicate/skipped
sequence numbers on the machine that was affected by this latency issue,
but I wasn't sure how detailed I should make my commit message.

>> 
>> One system this was tested on was a dual socket server with 32 CPUs on
>> each numa node. With settings of 1us threshold, 1000us width, and
>> 2000us window, this change reduced the number of latency events from
>> 500 per second down to approximately 1 event per minute. Some machines
>> tested did not exhibit measurable latency from the false sharing.
>
> Is this because the read of hwlat_data.sample_width is a global variable
> and could possibly be causing a cache hit latency that shows up on large
> machines?
>
> Is that what you mean by "false sharing"?
>
> Oh, and subjects for the tracing subsystem should start with a capital, and
> should be something like:
>
>   tracing: Fix get_sample() in hwlat from ...

Got it, I can make that adjustment and resend. Would you also like me to
modify the commit message with any of the above clarifications?

thanks,
Colin

> -- Steve
>
>
>> 
>> Signed-off-by: Colin Lord <clord@...olab.com>
>> ---
>> Changes in v2:
>> - convert hwlat_data.count to atomic64_t
>> - leave irqs_disabled block where it originally was, outside of
>>   get_sample()
>> 
>> Thanks for the v1 review Steve, have updated and retested.
>> 
>> cheers,
>> Colin
>> 
>>  kernel/trace/trace_hwlat.c | 15 +++++++--------
>>  1 file changed, 7 insertions(+), 8 deletions(-)
>> 
>> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
>> index 2f7b94e98317..3fe274b84f1c 100644
>> --- a/kernel/trace/trace_hwlat.c
>> +++ b/kernel/trace/trace_hwlat.c
>> @@ -102,9 +102,9 @@ struct hwlat_sample {
>>  /* keep the global state somewhere. */
>>  static struct hwlat_data {
>>  
>> -	struct mutex lock;		/* protect changes */
>> +	struct mutex	lock;		/* protect changes */
>>  
>> -	u64	count;			/* total since reset */
>> +	atomic64_t	count;		/* total since reset */
>>  
>>  	u64	sample_window;		/* total sampling window (on+off) */
>>  	u64	sample_width;		/* active sampling portion of window */
>> @@ -193,8 +193,7 @@ void trace_hwlat_callback(bool enter)
>>   * get_sample - sample the CPU TSC and look for likely hardware latencies
>>   *
>>   * Used to repeatedly capture the CPU TSC (or similar), looking for potential
>> - * hardware-induced latency. Called with interrupts disabled and with
>> - * hwlat_data.lock held.
>> + * hardware-induced latency. Called with interrupts disabled.
>>   */
>>  static int get_sample(void)
>>  {
>> @@ -204,6 +203,7 @@ static int get_sample(void)
>>  	time_type start, t1, t2, last_t2;
>>  	s64 diff, outer_diff, total, last_total = 0;
>>  	u64 sample = 0;
>> +	u64 sample_width = READ_ONCE(hwlat_data.sample_width);
>>  	u64 thresh = tracing_thresh;
>>  	u64 outer_sample = 0;
>>  	int ret = -1;
>> @@ -267,7 +267,7 @@ static int get_sample(void)
>>  		if (diff > sample)
>>  			sample = diff; /* only want highest value */
>>  
>> -	} while (total <= hwlat_data.sample_width);
>> +	} while (total <= sample_width);
>>  
>>  	barrier(); /* finish the above in the view for NMIs */
>>  	trace_hwlat_callback_enabled = false;
>> @@ -285,8 +285,7 @@ static int get_sample(void)
>>  		if (kdata->nmi_total_ts)
>>  			do_div(kdata->nmi_total_ts, NSEC_PER_USEC);
>>  
>> -		hwlat_data.count++;
>> -		s.seqnum = hwlat_data.count;
>> +		s.seqnum = atomic64_inc_return(&hwlat_data.count);
>>  		s.duration = sample;
>>  		s.outer_duration = outer_sample;
>>  		s.nmi_total_ts = kdata->nmi_total_ts;
>> @@ -832,7 +831,7 @@ static int hwlat_tracer_init(struct trace_array *tr)
>>  
>>  	hwlat_trace = tr;
>>  
>> -	hwlat_data.count = 0;
>> +	atomic64_set(&hwlat_data.count, 0);
>>  	tr->max_latency = 0;
>>  	save_tracing_thresh = tracing_thresh;
>>  
>> 
>> base-commit: 24d479d26b25bce5faea3ddd9fa8f3a6c3129ea7