Message-ID: <CAP_z_CjGGU5O3JTnEdAadFAxfzXG08SxunB8XyOagW-hXwLqLg@mail.gmail.com>
Date: Mon, 3 Nov 2025 18:53:53 -0800
From: Blake Jones <blakejones@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, 
	Vincent Guittot <vincent.guittot@...aro.org>, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, 
	Josh Don <joshdon@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v4] sched: reorder some fields in struct rq

Hi Peter,

Do you have any more questions or thoughts about this patch, after our
discussion of v3 last week?

Blake

On Wed, Oct 29, 2025 at 11:55 AM Blake Jones <blakejones@...gle.com> wrote:
> This colocates some hot fields in "struct rq" so that they share a cache
> line with others that are often accessed at the same time or in similar
> ways.
>
> Using data from a Google-internal fleet-scale profiler, I found three
> distinct groups of hot fields in struct rq:
>
> - (1) The runqueue lock: __lock.
>
> - (2) Those accessed from hot code in pick_next_task_fair():
>       nr_running, nr_numa_running, nr_preferred_running,
>       ttwu_pending, cpu_capacity, curr, idle.
>
> - (3) Those accessed from some other hot codepaths, e.g.
>       update_curr(), update_rq_clock(), and scheduler_tick():
>       clock_task, clock_pelt, clock, lost_idle_time,
>       clock_update_flags, clock_pelt_idle, clock_idle.
>
> The cycles spent accessing these different groups of fields broke down
> roughly as follows:
>
> - 50% on group (1) (the runqueue lock, always read-write)
> - 39% on group (2) (load:store ratio around 38:1)
> -  8% on group (3) (load:store ratio around 5:1)
> -  3% on all the other fields
>
> Most of the fields in group (3) are already grouped on a cache line; this
> patch just adds "clock" and "clock_update_flags" to that group. The fields
> in group (2) are scattered across several cache lines; the main effect of
> this patch is to group them together on a single line at the beginning of
> the structure. A few other less performance-critical fields (nr_switches,
> numa_migrate_on, has_blocked_load, nohz_csd, last_blocked_load_update_tick)
> were also reordered to reduce holes in the data structure.
>
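> As a rough, compile-checkable userspace sketch of that layout idea (a toy
> struct that borrows a few of the field names above, with a plain C11
> _Alignas standing in for the kernel's ____cacheline_aligned; 64-byte
> cache lines assumed):
>
>   /* build with: gcc -std=c11 -c layout_sketch.c */
>   #include <assert.h>
>   #include <stddef.h>
>
>   #define CACHELINE 64
>   /* userspace stand-in for ____cacheline_aligned */
>   #define cacheline_aligned _Alignas(CACHELINE)
>
>   struct demo_rq {
>           /* read-mostly group, loaded together without the lock held */
>           unsigned int   nr_running;
>           unsigned int   ttwu_pending;
>           unsigned long  cpu_capacity;
>           void          *curr;
>           void          *idle;
>
>           /* the contended lock starts on the next cache line */
>           cacheline_aligned unsigned long nr_switches;
>           int            lock;
>
>           /* clock fields that are read and updated together */
>           cacheline_aligned unsigned long long clock_task;
>           unsigned long long clock_pelt;
>           unsigned long long clock;
>   };
>
>   /* the read-mostly group fits on the first line, away from the lock */
>   static_assert(offsetof(struct demo_rq, idle) < CACHELINE, "");
>   static_assert(offsetof(struct demo_rq, nr_switches) == CACHELINE, "");
>   static_assert(offsetof(struct demo_rq, clock_task) % CACHELINE == 0, "");
>
> In the kernel itself the alignment comes from ____cacheline_aligned, and
> the real member list is the one in the diff below.
>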
> Since the runqueue lock is acquired from so many different contexts, and
> is basically always accessed using an atomic operation, putting it on
> either of the cache lines for groups (2) or (3) would dramatically slow
> down accesses to those fields, because accesses to those groups are
> read-mostly.
>
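> As a toy userspace demonstration of that effect (a made-up program, not
> the scheduler: one thread hammers an atomic "lock" word while another
> reads a field that either shares or does not share its cache line):
>
>   /* build with: gcc -O2 -pthread false_sharing.c */
>   #include <pthread.h>
>   #include <stdatomic.h>
>   #include <stdio.h>
>   #include <time.h>
>
>   #define CACHELINE 64
>   #define READS (1UL << 28)
>
>   struct shared {
>           atomic_uint lock_word;            /* hammered by the writer */
>           unsigned int near_field;          /* shares the lock's line */
>           _Alignas(CACHELINE)
>           unsigned int far_field;           /* on its own cache line  */
>   };
>
>   static struct shared s;
>   static atomic_int stop;
>
>   static void *writer(void *arg)
>   {
>           while (!atomic_load_explicit(&stop, memory_order_relaxed))
>                   atomic_fetch_add_explicit(&s.lock_word, 1,
>                                             memory_order_relaxed);
>           return arg;
>   }
>
>   static double time_reads(volatile unsigned int *p)
>   {
>           struct timespec t0, t1;
>           unsigned long sum = 0;
>
>           clock_gettime(CLOCK_MONOTONIC, &t0);
>           for (unsigned long i = 0; i < READS; i++)
>                   sum += *p;
>           clock_gettime(CLOCK_MONOTONIC, &t1);
>           (void)sum;
>           return (t1.tv_sec - t0.tv_sec) +
>                  (t1.tv_nsec - t0.tv_nsec) / 1e9;
>   }
>
>   int main(void)
>   {
>           pthread_t tid;
>
>           pthread_create(&tid, NULL, writer, NULL);
>           /*
>            * Reads that share the writer's cache line keep taking
>            * misses; reads one line away are unaffected by its stores.
>            */
>           printf("near: %.2fs  far: %.2fs\n",
>                  time_reads(&s.near_field),
>                  time_reads(&s.far_field));
>           atomic_store(&stop, 1);
>           pthread_join(tid, NULL);
>           return 0;
>   }
>
> The "near" loop should come out noticeably slower than the "far" loop,
> which is the penalty the lock placement here is meant to avoid.
>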
> This patch does not change the size of "struct rq" on machines with
> 64-byte cache lines. The extra "____cacheline_aligned" that pushes the
> runqueue lock onto the next cache line adds 64 bytes of padding on
> machines with 128-byte cache lines; although this is unfortunate, it
> seemed more likely to lead to stably good performance than, e.g., just
> putting the runqueue lock somewhere in the middle of the structure and
> hoping it isn't on an otherwise busy cache line.
>
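> The padding cost is easy to see with a toy model (a hypothetical struct
> with roughly 40 bytes of hot fields followed by a member forced onto the
> next cache line):
>
>   #include <stddef.h>
>   #include <stdio.h>
>
>   /* the same layout, modelled for 64- and 128-byte cache lines */
>   struct demo64  { char hot[40]; _Alignas(64)  int lock; };
>   struct demo128 { char hot[40]; _Alignas(128) int lock; };
>
>   int main(void)
>   {
>           /*
>            * Prints "lock offset: 64 vs 128": the extra 64 bytes in
>            * front of the lock are pure padding on 128-byte machines.
>            */
>           printf("lock offset: %zu vs %zu\n",
>                  offsetof(struct demo64, lock),
>                  offsetof(struct demo128, lock));
>           return 0;
>   }
>
> On a real build, a tool like pahole can be used to confirm the offsets,
> holes, and total size of "struct rq" itself.
>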
> The "__no_randomize_layout" was added to reflect the fact that performance
> of this data structure is unusually sensitive to placement of its members.
>
> Signed-off-by: Blake Jones <blakejones@...gle.com>
> Reviewed-by: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
> Tested-by: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
> ---
>  kernel/sched/sched.h | 77 +++++++++++++++++++++++++++-----------------
>  1 file changed, 47 insertions(+), 30 deletions(-)
>
> v3 -> v4 changes:
> - Updated comment, moved "nr_iowait" to the end of the structure, and added
>   "__no_randomize_layout" tag to the structure.
>   Link to v3: https://lore.kernel.org/lkml/20251028080348.2177469-1-blakejones@google.com/T/#u
>   (v3 includes details of previous perf testing)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e7718f12bc55..e1c3fefb14b7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1114,28 +1114,50 @@ DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
>   * acquire operations must be ordered by ascending &runqueue.
>   */
>  struct rq {
> -       /* runqueue lock: */
> -       raw_spinlock_t          __lock;
> -
> -       /* Per class runqueue modification mask; bits in class order. */
> -       unsigned int            queue_mask;
> +       /*
> +        * The following members are loaded together, without holding the
> +        * rq->lock, in an extremely hot loop in update_sg_lb_stats()
> +        * (called from pick_next_task()). To reduce cache pollution from
> +        * this operation, they are placed together on this dedicated cache
> +        * line. Even though some of them are frequently modified, they are
> +        * loaded much more frequently than they are stored.
> +        */
>         unsigned int            nr_running;
>  #ifdef CONFIG_NUMA_BALANCING
>         unsigned int            nr_numa_running;
>         unsigned int            nr_preferred_running;
> -       unsigned int            numa_migrate_on;
>  #endif
> +       unsigned int            ttwu_pending;
> +       unsigned long           cpu_capacity;
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +       struct task_struct __rcu        *donor;  /* Scheduling context */
> +       struct task_struct __rcu        *curr;   /* Execution context */
> +#else
> +       union {
> +               struct task_struct __rcu *donor; /* Scheduler context */
> +               struct task_struct __rcu *curr;  /* Execution context */
> +       };
> +#endif
> +       struct task_struct      *idle;
> +       /* padding left here deliberately */
> +
> +       /*
> +        * The next cacheline holds the (hot) runqueue lock, as well as
> +        * some other less performance-critical fields.
> +        */
> +       u64                     nr_switches     ____cacheline_aligned;
> +
> +       /* runqueue lock: */
> +       raw_spinlock_t          __lock;
> +
>  #ifdef CONFIG_NO_HZ_COMMON
> -       unsigned long           last_blocked_load_update_tick;
> -       unsigned int            has_blocked_load;
> -       call_single_data_t      nohz_csd;
>         unsigned int            nohz_tick_stopped;
>         atomic_t                nohz_flags;
> +       unsigned int            has_blocked_load;
> +       unsigned long           last_blocked_load_update_tick;
> +       call_single_data_t      nohz_csd;
>  #endif /* CONFIG_NO_HZ_COMMON */
>
> -       unsigned int            ttwu_pending;
> -       u64                     nr_switches;
> -
>  #ifdef CONFIG_UCLAMP_TASK
>         /* Utilization clamp values based on CPU's RUNNABLE tasks */
>         struct uclamp_rq        uclamp[UCLAMP_CNT] ____cacheline_aligned;
> @@ -1158,6 +1180,9 @@ struct rq {
>         struct list_head        *tmp_alone_branch;
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> +#ifdef CONFIG_NUMA_BALANCING
> +       unsigned int            numa_migrate_on;
> +#endif
>         /*
>          * This is part of a global counter where only the total sum
>          * over all CPUs matters. A task can increase this counter on
> @@ -1166,36 +1191,31 @@ struct rq {
>          */
>         unsigned long           nr_uninterruptible;
>
> -#ifdef CONFIG_SCHED_PROXY_EXEC
> -       struct task_struct __rcu        *donor;  /* Scheduling context */
> -       struct task_struct __rcu        *curr;   /* Execution context */
> -#else
> -       union {
> -               struct task_struct __rcu *donor; /* Scheduler context */
> -               struct task_struct __rcu *curr;  /* Execution context */
> -       };
> -#endif
>         struct sched_dl_entity  *dl_server;
> -       struct task_struct      *idle;
>         struct task_struct      *stop;
>         unsigned long           next_balance;
>         struct mm_struct        *prev_mm;
>
> -       unsigned int            clock_update_flags;
> -       u64                     clock;
> -       /* Ensure that all clocks are in the same cache line */
> +       /* Per class runqueue modification mask; bits in class order. */
> +       unsigned int            queue_mask;
> +
> +       /*
> +        * The following fields of clock data are frequently referenced
> +        * and updated together, and should go on their own cache line.
> +        */
>         u64                     clock_task ____cacheline_aligned;
>         u64                     clock_pelt;
> +       u64                     clock;
>         unsigned long           lost_idle_time;
> +       unsigned int            clock_update_flags;
>         u64                     clock_pelt_idle;
>         u64                     clock_idle;
> +
>  #ifndef CONFIG_64BIT
>         u64                     clock_pelt_idle_copy;
>         u64                     clock_idle_copy;
>  #endif
>
> -       atomic_t                nr_iowait;
> -
>         u64 last_seen_need_resched_ns;
>         int ticks_without_resched;
>
> @@ -1206,8 +1226,6 @@ struct rq {
>         struct root_domain              *rd;
>         struct sched_domain __rcu       *sd;
>
> -       unsigned long           cpu_capacity;
> -
>         struct balance_callback *balance_callback;
>
>         unsigned char           nohz_idle_balance;
> @@ -1317,7 +1335,9 @@ struct rq {
>         call_single_data_t      cfsb_csd;
>         struct list_head        cfsb_csd_list;
>  #endif
> -};
> +
> +       atomic_t                nr_iowait;
> +} __no_randomize_layout;
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>
> --
> 2.51.2.997.g839fc31de9-goog
>
