Message-ID: <be48760d-b705-424e-b6b7-b75205e83567@linux.ibm.com>
Date: Thu, 30 Oct 2025 11:59:09 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: Blake Jones <blakejones@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Josh Don <joshdon@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
linux-kernel@...r.kernel.org,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [PATCH v4] sched: reorder some fields in struct rq

Hi Blake,

On 30/10/25 00:24, Blake Jones wrote:
> This colocates some hot fields in "struct rq" to be on the same cache line
> as others that are often accessed at the same time or in similar ways.
>
> Using data from a Google-internal fleet-scale profiler, I found three
> distinct groups of hot fields in struct rq:
>
> - (1) The runqueue lock: __lock.
>
> - (2) Those accessed from hot code in pick_next_task_fair():
> nr_running, nr_numa_running, nr_preferred_running,
> ttwu_pending, cpu_capacity, curr, idle.
>
> - (3) Those accessed from some other hot codepaths, e.g.
> update_curr(), update_rq_clock(), and scheduler_tick():
> clock_task, clock_pelt, clock, lost_idle_time,
> clock_update_flags, clock_pelt_idle, clock_idle.
>
> The cycles spent on accessing these different groups of fields broke down
> roughly as follows:
>
> - 50% on group (1) (the runqueue lock, always read-write)
> - 39% on group (2) (load:store ratio around 38:1)
> - 8% on group (3) (load:store ratio around 5:1)
> - 3% on all the other fields
>
Thanks for the patch.

Quick question about the lock placement: Did you test Peter's suggestion of
co-locating __lock with the clock fields? If so, how did it compare to keeping
them separate?
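
For concreteness, I read that suggestion as something roughly like the sketch
below (member set taken from your group (3) list above; completely untested,
just to make sure we are talking about the same layout):

	/* Hypothetical alternative layout, sketch only: lock placed on the
	 * clock cache line instead of on its own line. */
	raw_spinlock_t	__lock ____cacheline_aligned;
	u64		clock_task;
	u64		clock_pelt;
	u64		clock;
	unsigned long	lost_idle_time;
	unsigned int	clock_update_flags;
	u64		clock_pelt_idle;
	u64		clock_idle;
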
Thanks,
Madadi Vineeth Reddy
> Most of the fields in group (3) are already in a cache line grouping; this
> patch just adds "clock" and "clock_update_flags" to that group. The fields
> in group (2) are scattered across several cache lines; the main effect of
> this patch is to group them together, on a single line at the beginning of
> the structure. A few other less performance-critical fields (nr_switches,
> numa_migrate_on, has_blocked_load, nohz_csd, last_blocked_load_update_tick)
> were also reordered to reduce holes in the data structure.
>
> Since the runqueue lock is acquired from so many different contexts, and is
> basically always accessed using an atomic operation, putting it on either
> of the cache lines for groups (2) or (3) would slow down accesses to those
> fields dramatically, since those groups are read-mostly accesses.
>
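
(Aside for anyone reading along: the effect described above is ordinary false
sharing, and it is easy to reproduce with a toy user-space program - a rough
sketch below, not kernel code, all names made up. Timing a run with
-DSEPARATE=0 versus -DSEPARATE=1 shows the reader slowing down when the
write-heavy word shares a cache line with the read-mostly field.)

/* build: cc -O2 -pthread -DSEPARATE=0 toy_false_sharing.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#ifndef SEPARATE
#define SEPARATE 1
#endif

struct toy_rq {
	unsigned int nr_running;	/* read-mostly, like group (2) */
#if SEPARATE
	_Alignas(64)			/* push the "lock" onto its own line */
#endif
	atomic_uint lock_word;		/* stands in for the contended rq lock */
};

static struct toy_rq rq;

static void *writer(void *arg)		/* constant atomic traffic, like lock/unlock */
{
	(void)arg;
	for (long i = 0; i < 100000000L; i++)
		atomic_fetch_add_explicit(&rq.lock_word, 1, memory_order_relaxed);
	return NULL;
}

static void *reader(void *arg)		/* read-mostly loads, like the lb stats loop */
{
	unsigned long sum = 0;

	(void)arg;
	for (long i = 0; i < 100000000L; i++)
		sum += *(volatile unsigned int *)&rq.nr_running;
	return (void *)sum;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	printf("done\n");
	return 0;
}
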
> This patch does not change the size of "struct rq" on machines with 64-byte
> cache lines. The additional "____cacheline_aligned" to put the runqueue
> lock on the next cache line will add an additional 64 bytes of padding on
> machines with 128-byte cache lines; although this is unfortunate, it seemed
> more likely to lead to stably good performance than, e.g., just putting
> the runqueue lock somewhere in the middle of the structure and hoping it
> wasn't on an otherwise busy cache line.
>
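
(Another small aside: the "lock must not share a line with the read-mostly
fields, whatever the cache line size is" invariant could in principle be
pinned down with a build-time check along these lines - sketch only, not
something the patch needs:)

	/* Sketch of a build-time check (not in the patch): the lock must not
	 * land on the same cache line as the read-mostly group (2) fields,
	 * regardless of L1_CACHE_BYTES on the target.
	 */
	static_assert(offsetof(struct rq, __lock) / L1_CACHE_BYTES !=
		      offsetof(struct rq, nr_running) / L1_CACHE_BYTES,
		      "rq lock shares a cache line with read-mostly fields");
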
> The "__no_randomize_layout" was added to reflect the fact that performance
> of this data structure is unusually sensitive to placement of its members.
>
> Signed-off-by: Blake Jones <blakejones@...gle.com>
> Reviewed-by: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
> Tested-by: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
> ---
> kernel/sched/sched.h | 77 +++++++++++++++++++++++++++-----------------
> 1 file changed, 47 insertions(+), 30 deletions(-)
>
> v3 -> v4 changes:
> - Updated comment, moved "nr_iowait" to the end of the structure, and added
> "__no_randomize_layout" tag to the structure.
> Link to v3: https://lore.kernel.org/lkml/20251028080348.2177469-1-blakejones@google.com/T/#u
> (v3 includes details of previous perf testing)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e7718f12bc55..e1c3fefb14b7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1114,28 +1114,50 @@ DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
> * acquire operations must be ordered by ascending &runqueue.
> */
> struct rq {
> - /* runqueue lock: */
> - raw_spinlock_t __lock;
> -
> - /* Per class runqueue modification mask; bits in class order. */
> - unsigned int queue_mask;
> + /*
> + * The following members are loaded together, without holding the
> + * rq->lock, in an extremely hot loop in update_sg_lb_stats()
> + * (called from pick_next_task()). To reduce cache pollution from
> + * this operation, they are placed together on this dedicated cache
> + * line. Even though some of them are frequently modified, they are
> + * loaded much more frequently than they are stored.
> + */
> unsigned int nr_running;
> #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> unsigned int nr_preferred_running;
> - unsigned int numa_migrate_on;
> #endif
> + unsigned int ttwu_pending;
> + unsigned long cpu_capacity;
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> + struct task_struct __rcu *donor; /* Scheduling context */
> + struct task_struct __rcu *curr; /* Execution context */
> +#else
> + union {
> + struct task_struct __rcu *donor; /* Scheduler context */
> + struct task_struct __rcu *curr; /* Execution context */
> + };
> +#endif
> + struct task_struct *idle;
> + /* padding left here deliberately */
> +
> + /*
> + * The next cacheline holds the (hot) runqueue lock, as well as
> + * some other less performance-critical fields.
> + */
> + u64 nr_switches ____cacheline_aligned;
> +
> + /* runqueue lock: */
> + raw_spinlock_t __lock;
> +
> #ifdef CONFIG_NO_HZ_COMMON
> - unsigned long last_blocked_load_update_tick;
> - unsigned int has_blocked_load;
> - call_single_data_t nohz_csd;
> unsigned int nohz_tick_stopped;
> atomic_t nohz_flags;
> + unsigned int has_blocked_load;
> + unsigned long last_blocked_load_update_tick;
> + call_single_data_t nohz_csd;
> #endif /* CONFIG_NO_HZ_COMMON */
>
> - unsigned int ttwu_pending;
> - u64 nr_switches;
> -
> #ifdef CONFIG_UCLAMP_TASK
> /* Utilization clamp values based on CPU's RUNNABLE tasks */
> struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
> @@ -1158,6 +1180,9 @@ struct rq {
> struct list_head *tmp_alone_branch;
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> +#ifdef CONFIG_NUMA_BALANCING
> + unsigned int numa_migrate_on;
> +#endif
> /*
> * This is part of a global counter where only the total sum
> * over all CPUs matters. A task can increase this counter on
> @@ -1166,36 +1191,31 @@ struct rq {
> */
> unsigned long nr_uninterruptible;
>
> -#ifdef CONFIG_SCHED_PROXY_EXEC
> - struct task_struct __rcu *donor; /* Scheduling context */
> - struct task_struct __rcu *curr; /* Execution context */
> -#else
> - union {
> - struct task_struct __rcu *donor; /* Scheduler context */
> - struct task_struct __rcu *curr; /* Execution context */
> - };
> -#endif
> struct sched_dl_entity *dl_server;
> - struct task_struct *idle;
> struct task_struct *stop;
> unsigned long next_balance;
> struct mm_struct *prev_mm;
>
> - unsigned int clock_update_flags;
> - u64 clock;
> - /* Ensure that all clocks are in the same cache line */
> + /* Per class runqueue modification mask; bits in class order. */
> + unsigned int queue_mask;
> +
> + /*
> + * The following fields of clock data are frequently referenced
> + * and updated together, and should go on their own cache line.
> + */
> u64 clock_task ____cacheline_aligned;
> u64 clock_pelt;
> + u64 clock;
> unsigned long lost_idle_time;
> + unsigned int clock_update_flags;
> u64 clock_pelt_idle;
> u64 clock_idle;
> +
> #ifndef CONFIG_64BIT
> u64 clock_pelt_idle_copy;
> u64 clock_idle_copy;
> #endif
>
> - atomic_t nr_iowait;
> -
> u64 last_seen_need_resched_ns;
> int ticks_without_resched;
>
> @@ -1206,8 +1226,6 @@ struct rq {
> struct root_domain *rd;
> struct sched_domain __rcu *sd;
>
> - unsigned long cpu_capacity;
> -
> struct balance_callback *balance_callback;
>
> unsigned char nohz_idle_balance;
> @@ -1317,7 +1335,9 @@ struct rq {
> call_single_data_t cfsb_csd;
> struct list_head cfsb_csd_list;
> #endif
> -};
> +
> + atomic_t nr_iowait;
> +} __no_randomize_layout;
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
>