Message-ID: <affdc6b1-9980-44d1-89db-d90730c1e384@linux.ibm.com>
Date: Wed, 13 Aug 2025 13:00:30 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: Blake Jones <blakejones@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>
Cc: Josh Don <joshdon@...gle.com>,
Dietmar Eggemann
<dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
linux-kernel@...r.kernel.org,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [PATCH v2] Reorder some fields in struct rq.
Hi Blake,
On 31/07/25 02:26, Blake Jones wrote:
> This colocates some hot fields in "struct rq" to be on the same cache line
> as others that are often accessed at the same time or in similar ways.
>
[..snip..]
>
> This patch does not change the size of "struct rq" on machines with 64-byte
> cache lines. The additional "____cacheline_aligned" to put the runqueue
> lock on the next cache line will add an additional 64 bytes of padding on
> machines with 128-byte cache lines; although this is unfortunate, it seemed
> more likely to lead to stably good performance than e.g. by just putting
> the runqueue lock somewhere in the middle of the structure and hoping it
> wasn't on an otherwise busy cache line.
On Power11, which has 128-byte cache lines, this change introduces an
88-byte hole because __lock is moved onto a separate cache line. As a
result, struct rq occupies one more cache line than before.
Tested with your custom test case (thanks for sharing) and observed a
~5% decrease in the number of cycles, along with a slight increase in user
time; both are positive indicators.
Also ran ebizzy, which doesn’t seem to be impacted. I think it would be good
to run a set of standard benchmarks like schbench, ebizzy, hackbench, and
stress-ng, along with a real-life workload, to ensure there’s no negative
impact. I saw that hackbench was tried, but including those numbers would
be helpful.
Reviewed-by: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Tested-by: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Thanks,
Madadi Vineeth Reddy
>
> I ran "hackbench" to test this change, but it didn't show very conclusive
> results. Looking at a profile of the hackbench run, it was spending 95% of
> its cycles inside __alloc_skb(), __kfree_skb(), or kmem_cache_free() -
> almost all of which was spent updating memcg counters or contending on the
> list_lock in kmem_cache_node. In contrast, it spent less than 0.5% of its
> cycles inside either schedule() or try_to_wake_up(). So it's not surprising
> that it didn't show useful results here.
>
[..snip..]
> @@ -1182,8 +1199,6 @@ struct rq {
> struct root_domain *rd;
> struct sched_domain __rcu *sd;
>
> - unsigned long cpu_capacity;
> -
> struct balance_callback *balance_callback;
>
> unsigned char nohz_idle_balance;