Message-ID: <172e48ee-7d90-4704-ad29-476fa7ee0aab@linux.ibm.com>
Date: Wed, 30 Jul 2025 12:05:50 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: Blake Jones <blakejones@...gle.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Josh Don <joshdon@...gle.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
linux-kernel@...r.kernel.org,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [PATCH] Reorder some fields in struct rq.

Hi Blake,

On 23/07/25 01:43, Blake Jones wrote:
> This colocates some hot fields in "struct rq" to be on the same cache line
> as others that are often accessed at the same time or in similar ways.
>
> Using data from a Google-internal fleet-scale profiler, I found three
> distinct groups of hot fields in struct rq:
>
> - (1) The runqueue lock: __lock.
>
> - (2) Those accessed from hot code in pick_next_task_fair():
> nr_running, nr_numa_running, nr_preferred_running,
> ttwu_pending, cpu_capacity, curr, idle.
>
> - (3) Those accessed from some other hot codepaths, e.g.
> update_curr(), update_rq_clock(), and scheduler_tick():
> clock_task, clock_pelt, clock, lost_idle_time,
> clock_update_flags, clock_pelt_idle, clock_idle.
>
> The cycles spent on accessing these different groups of fields broke down
> roughly as follows:
>
> - 50% on group (1) (the runqueue lock, always read-write)
> - 39% on group (2) (load:store ratio around 38:1)
> - 8% on group (3) (load:store ratio around 5:1)
> - 3% on all the other fields
>
> Most of the fields in group (3) are already in a cache line grouping; this
> patch just adds "clock" and "clock_update_flags" to that group. The fields
> in group (2) are scattered across several cache lines; the main effect of
> this patch is to group them together, on a single line at the beginning of
> the structure. A few other less performance-critical fields (nr_switches,
> numa_migrate_on, has_blocked_load, nohz_csd, last_blocked_load_update_tick)
> were also reordered to reduce holes in the data structure.
>
> Since the runqueue lock is acquired from so many different contexts, and is
> basically always accessed using an atomic operation, putting it on either
> of the cache lines for groups (2) or (3) would slow down accesses to those
> fields dramatically, since accesses to those groups are mostly reads.
>
> This patch does not change the size of "struct rq" on machines with 64-byte
> cache lines. The additional "____cacheline_aligned" to put the runqueue
> lock on the next cache line will add an additional 64 bytes of padding on
> machines with 128-byte cache lines; although this is unfortunate, it seemed
> more likely to lead to stably good performance than, e.g., just putting
> the runqueue lock somewhere in the middle of the structure and hoping it
> wasn't on an otherwise busy cache line.
>
> I ran "hackbench" to test this change, but it didn't show very conclusive
> results. Looking at a profile of the hackbench run, it was spending 95% of
> its cycles inside __alloc_skb(), __kfree_skb(), or kmem_cache_free() -
> almost all of which was spent updating memcg counters or contending on the
> list_lock in kmem_cache_node. In contrast, it spent less than 0.5% of its
> cycles inside either schedule() or try_to_wake_up(). So it's not surprising
> that it didn't show useful results here.
>
> Instead, to test this, I wrote a focused load test that would put load on
> the pick_next_task_fair() path. A parent process would fork many child
> processes, and each child would nanosleep() for 1 msec many times in a
> loop. The load test was monitored with "perf", and I looked at the amount
> of cycles that were spent with sched_balance_rq() on the stack. The test
> reliably showed 15-25% of all cycles were spent there. I ran it 80 times on
> a pair of 2-socket Intel GNR machines (480 vCPUs per machine) - one running
> 6.16-rc7, the other running this change - using 4800 child processes and
> 8000 1-msec sleeps per child. The mean cycle count dropped from 405.4B to
> 387.6B, or a 4.4% decrease.

Thanks for your patch.

Could you please share the test you mentioned? I would like to try it on a
Power11 system.
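
In the meantime, here is my attempt at reconstructing the test from the
description above. The structure is only my own guess, though the 4800
children and 8000 1-msec sleeps per child are taken from your commit message:

/*
 * Rough reconstruction of the load test described in the commit message.
 * This is a sketch of my understanding, not the original test.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define NCHILDREN	4800	/* child processes, from the commit message */
#define NSLEEPS		8000	/* 1-msec sleeps per child, from the commit message */

int main(void)
{
	for (int i = 0; i < NCHILDREN; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			exit(1);
		}
		if (pid == 0) {
			/* Child: nanosleep() for 1 msec, NSLEEPS times. */
			struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };

			for (int j = 0; j < NSLEEPS; j++)
				nanosleep(&ts, NULL);
			_exit(0);
		}
	}

	/* Parent: wait for all children to exit. */
	while (wait(NULL) > 0)
		;
	return 0;
}

I would run this under "perf record -g" and look at the cycles with
sched_balance_rq() on the stack, as you describe. But if your harness does
anything different (pinning, cgroups, etc.), the original would be better.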

Also, I noticed the patch doesn't apply cleanly on the latest tip/sched/core,
since the CONFIG_SMP ifdefs have been removed. Could you rebase it on the
current tip/sched/core tree?

Thanks,
Madadi Vineeth Reddy

>
> More importantly, given that this change reduces cache misses in a very hot
> kernel codepath, there's likely to be additional application performance
> improvement due to reduced cache conflicts from kernel data structures.
>
> Signed-off-by: Blake Jones <blakejones@...gle.com>
> ---
> kernel/sched/sched.h | 59 +++++++++++++++++++++++++++-----------------
> 1 file changed, 37 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 83e3aa9171429..b21be28823609 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1097,30 +1097,45 @@ DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
> * acquire operations must be ordered by ascending &runqueue.
> */
> struct rq {
> - /* runqueue lock: */
> - raw_spinlock_t __lock;
> -
> + /*
> + * The following members are loaded together from pick_next_task(),
> + * and should be on an isolated cache line to avoid cache pollution.
> + */
> unsigned int nr_running;
> #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> unsigned int nr_preferred_running;
> - unsigned int numa_migrate_on;
> #endif
> +#ifdef CONFIG_SMP
> + unsigned int ttwu_pending;
> + unsigned long cpu_capacity;
> +#endif
> + union {
> + struct task_struct __rcu *donor; /* Scheduler context */
> + struct task_struct __rcu *curr; /* Execution context */
> + };
> + struct task_struct *idle;
> + /* padding left here deliberately */
> +
> + /*
> + * The next cacheline holds the runqueue lock, as well as some
> + * other less performance-critical fields.
> + */
> + u64 nr_switches ____cacheline_aligned;
> +
> + /* runqueue lock: */
> + raw_spinlock_t __lock;
> +
> #ifdef CONFIG_NO_HZ_COMMON
> + unsigned int nohz_tick_stopped;
> + atomic_t nohz_flags;
> #ifdef CONFIG_SMP
> - unsigned long last_blocked_load_update_tick;
> unsigned int has_blocked_load;
> + unsigned long last_blocked_load_update_tick;
> call_single_data_t nohz_csd;
> #endif /* CONFIG_SMP */
> - unsigned int nohz_tick_stopped;
> - atomic_t nohz_flags;
> #endif /* CONFIG_NO_HZ_COMMON */
>
> -#ifdef CONFIG_SMP
> - unsigned int ttwu_pending;
> -#endif
> - u64 nr_switches;
> -
> #ifdef CONFIG_UCLAMP_TASK
> /* Utilization clamp values based on CPU's RUNNABLE tasks */
> struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
> @@ -1143,6 +1158,9 @@ struct rq {
> struct list_head *tmp_alone_branch;
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> +#ifdef CONFIG_NUMA_BALANCING
> + unsigned int numa_migrate_on;
> +#endif
> /*
> * This is part of a global counter where only the total sum
> * over all CPUs matters. A task can increase this counter on
> @@ -1151,24 +1169,23 @@ struct rq {
> */
> unsigned long nr_uninterruptible;
>
> - union {
> - struct task_struct __rcu *donor; /* Scheduler context */
> - struct task_struct __rcu *curr; /* Execution context */
> - };
> struct sched_dl_entity *dl_server;
> - struct task_struct *idle;
> struct task_struct *stop;
> unsigned long next_balance;
> struct mm_struct *prev_mm;
>
> - unsigned int clock_update_flags;
> - u64 clock;
> - /* Ensure that all clocks are in the same cache line */
> + /*
> + * The following fields of clock data are frequently referenced
> + * and updated together, and should go on their own cache line.
> + */
> u64 clock_task ____cacheline_aligned;
> u64 clock_pelt;
> + u64 clock;
> unsigned long lost_idle_time;
> + unsigned int clock_update_flags;
> u64 clock_pelt_idle;
> u64 clock_idle;
> +
> #ifndef CONFIG_64BIT
> u64 clock_pelt_idle_copy;
> u64 clock_idle_copy;
> @@ -1187,8 +1204,6 @@ struct rq {
> struct root_domain *rd;
> struct sched_domain __rcu *sd;
>
> - unsigned long cpu_capacity;
> -
> struct balance_callback *balance_callback;
>
> unsigned char nohz_idle_balance;