Message-ID: <20230620173626.GA3027191@maniforge>
Date: Tue, 20 Jun 2023 12:36:26 -0500
From: David Vernet <void@...ifault.com>
To: Aaron Lu <aaron.lu@...el.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
linux-kernel@...r.kernel.org, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
rostedt@...dmis.org, dietmar.eggemann@....com, bsegall@...gle.com,
mgorman@...e.de, bristot@...hat.com, vschneid@...hat.com,
joshdon@...gle.com, roman.gushchin@...ux.dev, tj@...nel.org,
kernel-team@...a.com
Subject: Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS
On Fri, Jun 16, 2023 at 08:53:38AM +0800, Aaron Lu wrote:
> On Thu, Jun 15, 2023 at 06:26:05PM -0500, David Vernet wrote:
>
> > Ok, it seems that the issue is that I wasn't creating enough netperf
> > clients. I assumed that -n $(nproc) was sufficient. I was able to repro
>
> Yes, that switch is confusing.
>
> > the contention on my 26 core / 52 thread skylake client as well:
> >
> >
>
> > Thanks for the help in getting the repro on my end.
>
> You are welcome.
>
> > So yes, there is certainly a scalability concern to bear in mind for
> > swqueue on LLCs with a lot of cores. If you have a lot of tasks rapidly
> > blocking and waking (e.g. on futexes in a tight loop), I expect a
> > similar issue would be observed.
> >
> > On the other hand, the issue did not occur on my 7950X. I also wasn't
>
> Using netperf/UDP_RR?
Correct.
> > able to repro the contention on the Skylake if I ran with the default
> > netperf workload rather than UDP_RR (even with the additional clients).
>
> I also tried that on the Skylake with 18 cores / 36 threads per LLC, and
> the contention is indeed much smaller than with UDP_RR:
>
> 7.30% 7.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>
> But I wouldn't say it's entirely gone. Also consider that Skylake has far
> fewer cores per LLC than later Intel servers like Icelake and Sapphire
> Rapids, so I expect things would be worse on those two machines.
I cannot reproduce this contention locally, even on a slightly larger
Skylake, and I'm not really sure what to make of the difference here.
Perhaps it's because you're running with CONFIG_SCHED_CORE=y? What is
the change in throughput when you run the default workload on your SKL?
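
For anyone else following along: the lock being hammered here is the
single per-LLC swqueue spinlock. Roughly speaking (this is a simplified
sketch with made-up field names, not the literal code from the patch),
the structure is:

struct swqueue {
	struct list_head	list;
	spinlock_t		lock;
} ____cacheline_aligned;

/*
 * Called on every wakeup of a migratable task. Every CPU in the LLC
 * takes the same lock, which is why a wakeup-heavy workload like
 * UDP_RR on a wide LLC ends up in native_queued_spin_lock_slowpath.
 */
static void swqueue_enqueue(struct swqueue *swq, struct task_struct *p)
{
	spin_lock(&swq->lock);
	list_add_tail(&p->swqueue_node, &swq->list);
	spin_unlock(&swq->lock);
}

/* Called from the newidle path to pull a queued wakee; same lock again. */
static struct task_struct *swqueue_pull(struct swqueue *swq)
{
	struct task_struct *p;

	spin_lock(&swq->lock);
	p = list_first_entry_or_null(&swq->list, struct task_struct,
				     swqueue_node);
	if (p)
		list_del_init(&p->swqueue_node);
	spin_unlock(&swq->lock);

	return p;
}

So the contention scales with the number of CPUs per LLC multiplied by
the wakeup rate, which is consistent with it showing up on UDP_RR but
not on the default netperf workload.
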
> > I didn't bother to take the mean of all of the throughput results
> > between NO_SWQUEUE and SWQUEUE, but they looked roughly equal.
> >
> > So swqueue isn't ideal for every configuration, but I'll echo my
> > sentiment from [0] that this shouldn't on its own necessarily preclude
> > it from being merged given that it does help a large class of
> > configurations and workloads, and it's disabled by default.
> >
> > [0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/
>
> I was wondering: does it make sense to do some dividing on machines with
> big LLCs? Like converting the per-LLC swqueue into per-group swqueues,
> where each group is made up of ~8 CPUs of the same LLC. This would have
> a similar effect of reducing the number of CPUs sharing a single queue,
> so the scalability issue can hopefully be fixed while, at the same time,
> it might still help some workloads. I realize this isn't ideal in that
> wakeups happen at LLC scale, so the grouping may not fit very well here.
>
> Just a thought, feel free to ignore it if you don't think this is
> feasible :-)
That's certainly an idea we could explore, but my inclination would be
to keep everything at per-LLC granularity. It makes it easier to reason
about performance, both in terms of work conservation per LLC (again,
not every workload suffers from having large LLCs even if others do,
and halving the size of a swqueue in an LLC could harm other workloads
which benefit from the increased work conservation), and in terms of
contention. To the latter point, I think it would be difficult to
choose a group size that wasn't somewhat artificial and
workload-specific. If someone has that requirement, I think sched_ext
would be a better alternative.
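
For the record, here's roughly what I picture the sharding you're
describing looking like. This is a completely untested sketch with
made-up names, and it assumes CPU numbering within an LLC is
contiguous, which real code couldn't rely on. SWQUEUE_SHARD_CPUS is
exactly the kind of artificial, workload-specific knob I mean:

#define SWQUEUE_SHARD_CPUS	8	/* arbitrary group size */

struct swqueue_shard {
	struct list_head	list;
	spinlock_t		lock;
} ____cacheline_aligned;

struct swqueue {
	unsigned int		nr_shards;
	struct swqueue_shard	shards[];
};

/* Map a CPU to the shard covering its ~8-CPU group within the LLC. */
static struct swqueue_shard *swqueue_cpu_shard(struct swqueue *swq, int cpu)
{
	return &swq->shards[(cpu / SWQUEUE_SHARD_CPUS) % swq->nr_shards];
}

Enqueue and pull would then take shard->lock rather than a single
per-LLC lock, so the contention is spread over nr_shards locks. But an
idle CPU would only see wakeups from its own group unless it also
walked the other shards, which is where the work conservation hit I
mentioned above would come from.
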
Thanks,
David