Message-ID: <23a1e55f-bf03-4c05-a954-28326b59b06b@intel.com>
Date: Fri, 19 Dec 2025 21:56:18 +0800
From: "Guo, Wangyang" <wangyang.guo@...el.com>
To: Shubhang Kaushik Prasanna Kumar <shkaushik@...erecomputing.com>
Cc: benjamin.lei@...el.com, bsegall@...gle.com, dietmar.eggemann@....com,
 juri.lelli@...hat.com, linux-kernel@...r.kernel.org, mgorman@...e.de,
 mingo@...hat.com, peterz@...radead.org, rostedt@...dmis.org,
 tianyou.li@...el.com, tim.c.chen@...ux.intel.com,
 vincent.guittot@...aro.org, vschneid@...hat.com
Subject: Re: [PATCH] sched/fair: Avoid false sharing in nohz struct

On 12/19/2025 9:38 AM, Shubhang Kaushik Prasanna Kumar wrote:

> While the intuition behind isolating the `nr_cpus` counter seems correct, could you please justify the added padding? As this is a high-contention path in the scheduler, we shouldn't be inflating global structures with padding on logic alone. I'd like to see some benchmarking, such as `perf c2c` results from a multi-core system, proving the `false sharing` scenario is a measurable bottleneck.

Here is the cache line view in c2c report:
     Num RmtHitm LclHitm  Offset records             Symbol
   6.25%   0.00%   0.00%   0x0       4   [k] _nohz_idle_balance.isra.0
  18.75% 100.00%   0.00%   0x8      14   [k] nohz_balance_exit_idle
   6.25%   0.00%   0.00%   0x8       8   [k] nohz_balance_enter_idle
   6.25%   0.00%   0.00%   0xc       8   [k] sched_balance_newidle
   6.25%   0.00%   0.00%  0x10      31   [k] nohz_balancer_kick
   6.25%   0.00%   0.00%  0x20      16   [k] sched_balance_newidle
  37.50%   0.00%   0.00%  0x38      50   [k] irqtime_account_irq
   6.25%   0.00%   0.00%  0x38      47   [k] account_process_tick
   6.25%   0.00%   0.00%  0x38      12   [k] account_idle_ticks

Offsets:
*  0x0 -- nohz.idle_cpu_mask (r)
*  0x8 -- nohz.nr_cpus (w)
* 0x38 -- sched_clock_irqtime (r), not in nohz, but shares the same cache line

The layout in /proc/kallsyms also confirms this:
ffffffff88600d40 b nohz
ffffffff88600d68 B arch_needs_tick_broadcast
ffffffff88600d6c b __key.264
ffffffff88600d6c b __key.265
ffffffff88600d70 b dl_generation
ffffffff88600d78 b sched_clock_irqtime

In perf cycle hotspots, we noticed that irqtime_account_irq has 3% 
overhead caused by false sharing. With the patch applied, that hotspot 
disappears.

> I am also concerned about the internal layout. By sandwiching the timer fields between two `__cacheline_aligned` boundaries, we might just be shifting the contention rather than fixing it.  See to it that fields like `next_balance` aren't being squeezed into a new conflict zone. Would like to review the benchmark data and the struct layout before we move forward.

Before the patch, all the nohz members stay in the same cache line. After 
the layout fix, fields like `next_balance` still stay in the same cache 
line, so no new conflict zone is created.

Since nohz is a global, the size inflation is minimal - less than 64 bytes 
in total.

BR
Wangyang

