Message-ID: <172e48ee-7d90-4704-ad29-476fa7ee0aab@linux.ibm.com>
Date: Wed, 30 Jul 2025 12:05:50 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: Blake Jones <blakejones@...gle.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Josh Don <joshdon@...gle.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
linux-kernel@...r.kernel.org,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [PATCH] Reorder some fields in struct rq.

Hi Blake,

On 23/07/25 01:43, Blake Jones wrote:
> This colocates some hot fields in "struct rq" to be on the same cache line
> as others that are often accessed at the same time or in similar ways.
>
> Using data from a Google-internal fleet-scale profiler, I found three
> distinct groups of hot fields in struct rq:
>
> - (1) The runqueue lock: __lock.
>
> - (2) Those accessed from hot code in pick_next_task_fair():
> nr_running, nr_numa_running, nr_preferred_running,
> ttwu_pending, cpu_capacity, curr, idle.
>
> - (3) Those accessed from some other hot codepaths, e.g.
> update_curr(), update_rq_clock(), and scheduler_tick():
> clock_task, clock_pelt, clock, lost_idle_time,
> clock_update_flags, clock_pelt_idle, clock_idle.
>
> The cycles spent on accessing these different groups of fields broke down
> roughly as follows:
>
> - 50% on group (1) (the runqueue lock, always read-write)
> - 39% on group (2) (load:store ratio around 38:1)
> - 8% on group (3) (load:store ratio around 5:1)
> - 3% on all the other fields
>
> Most of the fields in group (3) are already in a cache line grouping; this
> patch just adds "clock" and "clock_update_flags" to that group. The fields
> in group (2) are scattered across several cache lines; the main effect of
> this patch is to group them together, on a single line at the beginning of
> the structure. A few other less performance-critical fields (nr_switches,
> numa_migrate_on, has_blocked_load, nohz_csd, last_blocked_load_update_tick)
> were also reordered to reduce holes in the data structure.
>
> Since the runqueue lock is acquired from so many different contexts, and is
> basically always accessed using an atomic operation, putting it on either
> of the cache lines for groups (2) or (3) would slow down accesses to those
> fields dramatically, since accesses to those groups are mostly reads.
>
> This patch does not change the size of "struct rq" on machines with 64-byte
> cache lines. The additional "____cacheline_aligned" to put the runqueue
> lock on the next cache line will add an additional 64 bytes of padding on
> machines with 128-byte cache lines; although this is unfortunate, it seemed
> more likely to lead to stably good performance than, e.g., just putting
> the runqueue lock somewhere in the middle of the structure and hoping it
> wasn't on an otherwise busy cache line.
>
> I ran "hackbench" to test this change, but it didn't show very conclusive
> results. Looking at a profile of the hackbench run, it was spending 95% of
> its cycles inside __alloc_skb(), __kfree_skb(), or kmem_cache_free() -
> almost all of which was spent updating memcg counters or contending on the
> list_lock in kmem_cache_node. In contrast, it spent less than 0.5% of its
> cycles inside either schedule() or try_to_wake_up(). So it's not surprising
> that it didn't show useful results here.
>
> Instead, to test this, I wrote a focused load test that would put load on
> the pick_next_task_fair() path. A parent process would fork many child
> processes, and each child would nanosleep() for 1 msec many times in a
> loop. The load test was monitored with "perf", and I looked at the amount
> of cycles that were spent with sched_balance_rq() on the stack. The test
> reliably showed 15-25% of all cycles were spent there. I ran it 80 times on
> a pair of 2-socket Intel GNR machines (480 vCPUs per machine) - one running
> 6.16-rc7, the other running this change - using 4800 child processes and
> 8000 1-msec sleeps per child. The mean cycle count dropped from 405.4B to
> 387.6B, or a 4.4% decrease.

Thanks for your patch.

Could you please share the test you mentioned? I would like to try it on a
Power11 system.
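
In the meantime, here is my attempt at reconstructing the test from the
description above. The structure is only my own guess, though the 4800
children and 8000 1-msec sleeps per child are taken from your commit message:

/*
 * Rough reconstruction of the load test described in the commit message.
 * This is a sketch of my understanding, not the original test.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define NCHILDREN	4800	/* child processes, from the commit message */
#define NSLEEPS		8000	/* 1-msec sleeps per child, from the commit message */

int main(void)
{
	for (int i = 0; i < NCHILDREN; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			exit(1);
		}
		if (pid == 0) {
			/* Child: nanosleep() for 1 msec, NSLEEPS times. */
			struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };

			for (int j = 0; j < NSLEEPS; j++)
				nanosleep(&ts, NULL);
			_exit(0);
		}
	}

	/* Parent: wait for all children to exit. */
	while (wait(NULL) > 0)
		;
	return 0;
}

I would run this under "perf record -g" and look at the cycles with
sched_balance_rq() on the stack, as you describe. But if your harness does
anything different (pinning, cgroups, etc.), the original would be better.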

Also, I noticed the patch doesn't apply cleanly on the latest tip/sched/core,
since the CONFIG_SMP ifdefs have been removed. Could you rebase it on the
current tip/sched/core tree?

Thanks,
Madadi Vineeth Reddy

>
> More importantly, given that this change reduces cache misses in a very hot
> kernel codepath, there's likely to be additional application performance
> improvement due to reduced cache conflicts from kernel data structures.
>
> Signed-off-by: Blake Jones <blakejones@...gle.com>
> ---
> kernel/sched/sched.h | 59 +++++++++++++++++++++++++++-----------------
> 1 file changed, 37 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 83e3aa9171429..b21be28823609 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1097,30 +1097,45 @@ DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
> * acquire operations must be ordered by ascending &runqueue.
> */
> struct rq {
> - /* runqueue lock: */
> - raw_spinlock_t __lock;
> -
> + /*
> + * The following members are loaded together from pick_next_task(),
> + * and should be on an isolated cache line to avoid cache pollution.
> + */
> unsigned int nr_running;
> #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> unsigned int nr_preferred_running;
> - unsigned int numa_migrate_on;
> #endif
> +#ifdef CONFIG_SMP
> + unsigned int ttwu_pending;
> + unsigned long cpu_capacity;
> +#endif
> + union {
> + struct task_struct __rcu *donor; /* Scheduler context */
> + struct task_struct __rcu *curr; /* Execution context */
> + };
> + struct task_struct *idle;
> + /* padding left here deliberately */
> +
> + /*
> + * The next cacheline holds the runqueue lock, as well as some
> + * other less performance-critical fields.
> + */
> + u64 nr_switches ____cacheline_aligned;
> +
> + /* runqueue lock: */
> + raw_spinlock_t __lock;
> +
> #ifdef CONFIG_NO_HZ_COMMON
> + unsigned int nohz_tick_stopped;
> + atomic_t nohz_flags;
> #ifdef CONFIG_SMP
> - unsigned long last_blocked_load_update_tick;
> unsigned int has_blocked_load;
> + unsigned long last_blocked_load_update_tick;
> call_single_data_t nohz_csd;
> #endif /* CONFIG_SMP */
> - unsigned int nohz_tick_stopped;
> - atomic_t nohz_flags;
> #endif /* CONFIG_NO_HZ_COMMON */
>
> -#ifdef CONFIG_SMP
> - unsigned int ttwu_pending;
> -#endif
> - u64 nr_switches;
> -
> #ifdef CONFIG_UCLAMP_TASK
> /* Utilization clamp values based on CPU's RUNNABLE tasks */
> struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
> @@ -1143,6 +1158,9 @@ struct rq {
> struct list_head *tmp_alone_branch;
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> +#ifdef CONFIG_NUMA_BALANCING
> + unsigned int numa_migrate_on;
> +#endif
> /*
> * This is part of a global counter where only the total sum
> * over all CPUs matters. A task can increase this counter on
> @@ -1151,24 +1169,23 @@ struct rq {
> */
> unsigned long nr_uninterruptible;
>
> - union {
> - struct task_struct __rcu *donor; /* Scheduler context */
> - struct task_struct __rcu *curr; /* Execution context */
> - };
> struct sched_dl_entity *dl_server;
> - struct task_struct *idle;
> struct task_struct *stop;
> unsigned long next_balance;
> struct mm_struct *prev_mm;
>
> - unsigned int clock_update_flags;
> - u64 clock;
> - /* Ensure that all clocks are in the same cache line */
> + /*
> + * The following fields of clock data are frequently referenced
> + * and updated together, and should go on their own cache line.
> + */
> u64 clock_task ____cacheline_aligned;
> u64 clock_pelt;
> + u64 clock;
> unsigned long lost_idle_time;
> + unsigned int clock_update_flags;
> u64 clock_pelt_idle;
> u64 clock_idle;
> +
> #ifndef CONFIG_64BIT
> u64 clock_pelt_idle_copy;
> u64 clock_idle_copy;
> @@ -1187,8 +1204,6 @@ struct rq {
> struct root_domain *rd;
> struct sched_domain __rcu *sd;
>
> - unsigned long cpu_capacity;
> -
> struct balance_callback *balance_callback;
>
> unsigned char nohz_idle_balance;