Message-ID: <be48760d-b705-424e-b6b7-b75205e83567@linux.ibm.com>
Date: Thu, 30 Oct 2025 11:59:09 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: Blake Jones <blakejones@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Josh Don <joshdon@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
        Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
        linux-kernel@...r.kernel.org,
        Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [PATCH v4] sched: reorder some fields in struct rq
Hi Blake,

On 30/10/25 00:24, Blake Jones wrote:
> This colocates some hot fields in "struct rq" to be on the same cache line
> as others that are often accessed at the same time or in similar ways.
> 
> Using data from a Google-internal fleet-scale profiler, I found three
> distinct groups of hot fields in struct rq:
> 
> - (1) The runqueue lock: __lock.
> 
> - (2) Those accessed from hot code in pick_next_task_fair():
>       nr_running, nr_numa_running, nr_preferred_running,
>       ttwu_pending, cpu_capacity, curr, idle.
> 
> - (3) Those accessed from some other hot codepaths, e.g.
>       update_curr(), update_rq_clock(), and scheduler_tick():
>       clock_task, clock_pelt, clock, lost_idle_time,
>       clock_update_flags, clock_pelt_idle, clock_idle.
> 
> The cycles spent on accessing these different groups of fields broke down
> roughly as follows:
> 
> - 50% on group (1) (the runqueue lock, always read-write)
> - 39% on group (2) (load:store ratio around 38:1)
> -  8% on group (3) (load:store ratio around 5:1)
> -  3% on all the other fields
> 
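Out of curiosity about the grouping above, I wanted a quick way to confirm
that the group (2) fields really do land on a single cache line after this
change. Something like the sketch below could sit next to the structure
definition as a compile-time check; SAME_CACHELINE is a made-up helper for
illustration, it is not part of the patch:

    #include <linux/build_bug.h>    /* static_assert()  */
    #include <linux/cache.h>        /* L1_CACHE_BYTES   */
    #include <linux/stddef.h>       /* offsetof()       */

    /* made-up helper: true if members a and b share a cache line */
    #define SAME_CACHELINE(type, a, b)                      \
        (offsetof(type, a) / L1_CACHE_BYTES ==              \
         offsetof(type, b) / L1_CACHE_BYTES)

    static_assert(SAME_CACHELINE(struct rq, nr_running, cpu_capacity));
    static_assert(SAME_CACHELINE(struct rq, nr_running, idle));

pahole on a built vmlinux would of course show the same thing.
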
Thanks for the patch.

Quick question about the lock placement: Did you test Peter's suggestion of
co-locating __lock with the clock fields? If so, how did it compare to
keeping them separate?
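For reference, the variant I have in mind would put the lock at the tail of
the clock group, roughly like the sketch below (layout only, completely
untested; note that with lockdep/debug options raw_spinlock_t grows well
beyond 4 bytes, so the group may no longer fit in one 64-byte line there):

    /* hypothetical: __lock co-located with the clock fields */
    u64                     clock_task ____cacheline_aligned;
    u64                     clock_pelt;
    u64                     clock;
    unsigned long           lost_idle_time;
    unsigned int            clock_update_flags;
    raw_spinlock_t          __lock;
    u64                     clock_pelt_idle;
    u64                     clock_idle;
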
Thanks,
Madadi Vineeth Reddy
> Most of the fields in group (3) are already in a cache line grouping; this
> patch just adds "clock" and "clock_update_flags" to that group. The fields
> in group (2) are scattered across several cache lines; the main effect of
> this patch is to group them together, on a single line at the beginning of
> the structure. A few other less performance-critical fields (nr_switches,
> numa_migrate_on, has_blocked_load, nohz_csd, last_blocked_load_update_tick)
> were also reordered to reduce holes in the data structure.
> 
> Since the runqueue lock is acquired from so many different contexts, and is
> basically always accessed using an atomic operation, putting it on either
> of the cache lines for groups (2) or (3) would slow down accesses to those
> fields dramatically, since accesses to those groups are mostly reads.
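
If I read this correctly, the concern is classic false sharing: one CPU doing
atomic read-modify-writes on the lock keeps invalidating the line that other
CPUs only want to read. A userspace toy version of the two layouts (made-up
names, not kernel code) would be:

    #include <stdatomic.h>

    struct bad_layout {
        atomic_int   lock_word;     /* heavily written, like __lock     */
        unsigned int nr_running;    /* read-mostly, like group (2)      */
    };                              /* same line: readers keep missing  */

    struct good_layout {
        atomic_int   lock_word  __attribute__((aligned(64)));
        unsigned int nr_running __attribute__((aligned(64)));
    };                              /* separate lines: readers stay warm */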
> 
> This patch does not change the size of "struct rq" on machines with 64-byte
> cache lines. The additional "____cacheline_aligned" to put the runqueue
> lock on the next cache line will add an additional 64 bytes of padding on
> machines with 128-byte cache lines; although this is unfortunate, it seemed
> more likely to lead to consistently good performance than, e.g., just putting
> the runqueue lock somewhere in the middle of the structure and hoping it
> wasn't on an otherwise busy cache line.
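
A quick back-of-the-envelope check of the 128-byte case (illustration only,
not kernel code): group (2) is well under one line, so aligning nr_switches
up to the next line costs more padding there than on a 64-byte-line build:

    #include <stddef.h>
    #include <stdio.h>

    #define CACHE_LINE 128              /* pretend 128-byte lines */

    struct toy {
        unsigned int group2[12];        /* stand-in: ~48 bytes of hot fields */
        unsigned long long nr_switches  /* opens the lock's cache line       */
                __attribute__((aligned(CACHE_LINE)));
    };

    int main(void)
    {
        /* prints 128: 48 bytes of fields, 80 bytes of padding;
         * with CACHE_LINE 64 it would print 64 (16 bytes of padding). */
        printf("%zu\n", offsetof(struct toy, nr_switches));
        return 0;
    }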
> 
> The "__no_randomize_layout" was added to reflect the fact that performance
> of this data structure is unusually sensitive to placement of its members.
> 
> Signed-off-by: Blake Jones <blakejones@...gle.com>
> Reviewed-by: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
> Tested-by: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
> ---
>  kernel/sched/sched.h | 77 +++++++++++++++++++++++++++-----------------
>  1 file changed, 47 insertions(+), 30 deletions(-)
> 
> v3 -> v4 changes:
> - Updated comment, moved "nr_iowait" to the end of the structure, and added
>   "__no_randomize_layout" tag to the structure.
>   Link to v3: https://lore.kernel.org/lkml/20251028080348.2177469-1-blakejones@google.com/T/#u
>   (v3 includes details of previous perf testing)
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e7718f12bc55..e1c3fefb14b7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1114,28 +1114,50 @@ DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
>   * acquire operations must be ordered by ascending &runqueue.
>   */
>  struct rq {
> -	/* runqueue lock: */
> -	raw_spinlock_t		__lock;
> -
> -	/* Per class runqueue modification mask; bits in class order. */
> -	unsigned int		queue_mask;
> +	/*
> +	 * The following members are loaded together, without holding the
> +	 * rq->lock, in an extremely hot loop in update_sg_lb_stats()
> +	 * (called from pick_next_task()). To reduce cache pollution from
> +	 * this operation, they are placed together on this dedicated cache
> +	 * line. Even though some of them are frequently modified, they are
> +	 * loaded much more frequently than they are stored.
> +	 */
>  	unsigned int		nr_running;
>  #ifdef CONFIG_NUMA_BALANCING
>  	unsigned int		nr_numa_running;
>  	unsigned int		nr_preferred_running;
> -	unsigned int		numa_migrate_on;
>  #endif
> +	unsigned int		ttwu_pending;
> +	unsigned long		cpu_capacity;
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +	struct task_struct __rcu	*donor;  /* Scheduling context */
> +	struct task_struct __rcu	*curr;   /* Execution context */
> +#else
> +	union {
> +		struct task_struct __rcu *donor; /* Scheduler context */
> +		struct task_struct __rcu *curr;  /* Execution context */
> +	};
> +#endif
> +	struct task_struct	*idle;
> +	/* padding left here deliberately */
> +
> +	/*
> +	 * The next cacheline holds the (hot) runqueue lock, as well as
> +	 * some other less performance-critical fields.
> +	 */
> +	u64			nr_switches	____cacheline_aligned;
> +
> +	/* runqueue lock: */
> +	raw_spinlock_t		__lock;
> +
>  #ifdef CONFIG_NO_HZ_COMMON
> -	unsigned long		last_blocked_load_update_tick;
> -	unsigned int		has_blocked_load;
> -	call_single_data_t	nohz_csd;
>  	unsigned int		nohz_tick_stopped;
>  	atomic_t		nohz_flags;
> +	unsigned int		has_blocked_load;
> +	unsigned long		last_blocked_load_update_tick;
> +	call_single_data_t	nohz_csd;
>  #endif /* CONFIG_NO_HZ_COMMON */
>  
> -	unsigned int		ttwu_pending;
> -	u64			nr_switches;
> -
>  #ifdef CONFIG_UCLAMP_TASK
>  	/* Utilization clamp values based on CPU's RUNNABLE tasks */
>  	struct uclamp_rq	uclamp[UCLAMP_CNT] ____cacheline_aligned;
> @@ -1158,6 +1180,9 @@ struct rq {
>  	struct list_head	*tmp_alone_branch;
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  
> +#ifdef CONFIG_NUMA_BALANCING
> +	unsigned int		numa_migrate_on;
> +#endif
>  	/*
>  	 * This is part of a global counter where only the total sum
>  	 * over all CPUs matters. A task can increase this counter on
> @@ -1166,36 +1191,31 @@ struct rq {
>  	 */
>  	unsigned long 		nr_uninterruptible;
>  
> -#ifdef CONFIG_SCHED_PROXY_EXEC
> -	struct task_struct __rcu	*donor;  /* Scheduling context */
> -	struct task_struct __rcu	*curr;   /* Execution context */
> -#else
> -	union {
> -		struct task_struct __rcu *donor; /* Scheduler context */
> -		struct task_struct __rcu *curr;  /* Execution context */
> -	};
> -#endif
>  	struct sched_dl_entity	*dl_server;
> -	struct task_struct	*idle;
>  	struct task_struct	*stop;
>  	unsigned long		next_balance;
>  	struct mm_struct	*prev_mm;
>  
> -	unsigned int		clock_update_flags;
> -	u64			clock;
> -	/* Ensure that all clocks are in the same cache line */
> +	/* Per class runqueue modification mask; bits in class order. */
> +	unsigned int		queue_mask;
> +
> +	/*
> +	 * The following fields of clock data are frequently referenced
> +	 * and updated together, and should go on their own cache line.
> +	 */
>  	u64			clock_task ____cacheline_aligned;
>  	u64			clock_pelt;
> +	u64			clock;
>  	unsigned long		lost_idle_time;
> +	unsigned int		clock_update_flags;
>  	u64			clock_pelt_idle;
>  	u64			clock_idle;
> +
>  #ifndef CONFIG_64BIT
>  	u64			clock_pelt_idle_copy;
>  	u64			clock_idle_copy;
>  #endif
>  
> -	atomic_t		nr_iowait;
> -
>  	u64 last_seen_need_resched_ns;
>  	int ticks_without_resched;
>  
> @@ -1206,8 +1226,6 @@ struct rq {
>  	struct root_domain		*rd;
>  	struct sched_domain __rcu	*sd;
>  
> -	unsigned long		cpu_capacity;
> -
>  	struct balance_callback *balance_callback;
>  
>  	unsigned char		nohz_idle_balance;
> @@ -1317,7 +1335,9 @@ struct rq {
>  	call_single_data_t	cfsb_csd;
>  	struct list_head	cfsb_csd_list;
>  #endif
> -};
> +
> +	atomic_t		nr_iowait;
> +} __no_randomize_layout;
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  