Message-ID: <20230420205201.36fphk5g3aolryjh@parnassus.localdomain>
Date: Thu, 20 Apr 2023 16:52:01 -0400
From: Daniel Jordan <daniel.m.jordan@...cle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Aaron Lu <aaron.lu@...el.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>,
Nitin Tekchandani <nitin.tekchandani@...el.com>,
Waiman Long <longman@...hat.com>,
Yu Chen <yu.c.chen@...el.com>, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node
On Wed, Apr 12, 2023 at 02:07:36PM +0200, Peter Zijlstra wrote:
> On Thu, Mar 30, 2023 at 01:45:57PM -0400, Daniel Jordan wrote:
>
> > The topology of my machine is different from yours, but it's the biggest
> > I have, and I'm assuming cpu count is more important than topology when
> > reproducing the remote accesses. I also tried on
>
> Core count definitely matters some, but the thing that really hurts is
> the cross-node (and cross-cache, which for intel happens to be the same
> set) atomics.
>
> I suppose the thing to measure is where this cost rises most sharply on
> the AMD platforms -- is that cross LLC or cross Node?
>
> I mean, setting up the split at boot time is fairly straight forward and
> we could equally well split at LLC.
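
For anyone following along, the shape of the split is the same whether it's
per node or per LLC: updates only touch the caller's local shard, and the
read side has to walk and sum all of them. Here's a toy userspace model of
the idea -- the names and sizes are made up for illustration, not taken from
the RFC:

/*
 * Sharded counter sketch: one shard of the tg load_avg per domain
 * (node or LLC), picked at init time.  Writers only touch their
 * local shard; readers sum every shard.
 */
#include <stdatomic.h>
#include <stdio.h>

#define NR_DOMAINS	2	/* e.g. nodes, or LLCs */

struct tg_shard {
	atomic_long load_avg;
	char pad[64 - sizeof(atomic_long)];	/* one cache line per shard */
};

static struct tg_shard shards[NR_DOMAINS];

/* write side: the atomic stays local to the caller's domain */
static void tg_update_load_avg(int domain, long delta)
{
	atomic_fetch_add_explicit(&shards[domain].load_avg, delta,
				  memory_order_relaxed);
}

/* read side: the cost of the split -- every shard gets summed */
static long tg_load_avg(void)
{
	long sum = 0;

	for (int d = 0; d < NR_DOMAINS; d++)
		sum += atomic_load_explicit(&shards[d].load_avg,
					    memory_order_relaxed);
	return sum;
}

int main(void)
{
	tg_update_load_avg(0, 100);
	tg_update_load_avg(1, -25);
	printf("tg load_avg = %ld\n", tg_load_avg());	/* prints 75 */
	return 0;
}

The summing loop in the read path is presumably where the extra
update_cfs_group cost in the numbers below comes from.
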
To check the cross-LLC case, I bound all postgres and sysbench tasks to a
single node. Neither function is free then on AMD or Intel, whether or not
the node has multiple LLCs, but the pain is a bit greater in the cross-node
(unbound) case.
The read side (update_cfs_group) gets more expensive with per-node tg
load_avg on AMD, especially in the cross-node case, which is where the
biggest diffs show up.
These are containerized sysbench runs again, set up the same way as before.
Base is 6.2, test is 6.2 plus this RFC. Each number under base or test is
the average over ten runs of the function's profile percentage, sampled for
5 seconds starting 60 seconds into the run. I ran the whole experiment a
second time, and the numbers were fairly similar to what's below.
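
The node0/node1 rows below are the runs with everything confined to one node
mentioned above. As one illustrative way to do that kind of binding (a sketch
with libnuma only -- numactl or cpusets work just as well, and this isn't
necessarily my exact setup), a small wrapper around the benchmark command
would be:

#include <numa.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
		return 1;
	}

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}

	/* pin this task, and anything it forks/execs, to node 0 */
	numa_run_on_node(0);				/* CPUs of node 0 */
	numa_set_membind(numa_parse_nodestring("0"));	/* memory of node 0 */

	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}

(Build with -lnuma; the affinity and membind are inherited across the exec,
so postgres and sysbench started under it stay on the chosen node.)
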
AMD EPYC 7J13 64-Core Processor (NPS1)
2 sockets * 64 cores * 2 threads = 256 CPUs

                    update_load_avg profile%   update_cfs_group profile%
affinity  nr_threads   base   test   diff        base   test   diff
unbound           96    0.7    0.6   -0.1         0.3    0.6    0.4
unbound          128    0.8    0.7    0.0         0.3    0.7    0.4
unbound          160    2.4    1.7   -0.7         1.2    2.3    1.1
unbound          192    2.3    1.7   -0.6         0.9    2.4    1.5
unbound          224    0.9    0.9    0.0         0.3    0.6    0.3
unbound          256    0.4    0.4    0.0         0.1    0.2    0.1
node0             48    0.7    0.6   -0.1         0.3    0.6    0.3
node0             64    0.7    0.7   -0.1         0.3    0.6    0.3
node0             80    1.4    1.3   -0.1         0.3    0.6    0.3
node0             96    1.5    1.4   -0.1         0.3    0.6    0.3
node0            112    0.8    0.8    0.0         0.2    0.4    0.2
node0            128    0.4    0.4    0.0         0.1    0.2    0.1
node1             48    0.7    0.6   -0.1         0.3    0.6    0.3
node1             64    0.7    0.6   -0.1         0.3    0.6    0.3
node1             80    1.4    1.2   -0.1         0.3    0.6    0.3
node1             96    1.4    1.3   -0.2         0.3    0.6    0.3
node1            112    0.8    0.7   -0.1         0.2    0.3    0.2
node1            128    0.4    0.4    0.0         0.1    0.2    0.1

Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
2 sockets * 32 cores * 2 threads = 128 CPUs

                    update_load_avg profile%   update_cfs_group profile%
affinity  nr_threads   base   test   diff        base   test   diff
unbound           48    0.4    0.4    0.0         0.4    0.5    0.1
unbound           64    0.5    0.5    0.0         0.5    0.6    0.1
unbound           80    2.0    1.8   -0.2         2.7    2.4   -0.3
unbound           96    3.3    2.8   -0.5         3.6    3.3   -0.3
unbound          112    2.8    2.6   -0.2         4.1    3.3   -0.8
unbound          128    0.4    0.4    0.0         0.4    0.4    0.1
node0             24    0.4    0.4    0.0         0.3    0.5    0.2
node0             32    0.5    0.5    0.0         0.3    0.4    0.2
node0             40    1.0    1.1    0.1         0.7    0.8    0.1
node0             48    1.5    1.6    0.1         0.8    0.9    0.1
node0             56    1.8    1.9    0.1         0.8    0.9    0.1
node0             64    0.4    0.4    0.0         0.2    0.4    0.1
node1             24    0.4    0.5    0.0         0.3    0.5    0.2
node1             32    0.4    0.5    0.0         0.3    0.5    0.2
node1             40    1.0    1.1    0.0         0.7    0.8    0.1
node1             48    1.6    1.6    0.1         0.8    0.9    0.1
node1             56    1.8    1.9    0.1         0.8    0.9    0.1
node1             64    0.4    0.4    0.0         0.2    0.4    0.1