Message-ID: <CAJuCfpFHyzCwGov7YzrE1UDc+0jxKjfm_Kcn3NHR=NXPma3PnQ@mail.gmail.com>
Date: Mon, 12 Aug 2024 08:27:27 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: Mateusz Guzik <mjguzik@...il.com>, Mel Gorman <mgorman@...e.de>
Cc: Vlastimil Babka <vbabka@...e.cz>, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Liam.Howlett@...cle.com, pedro.falcato@...il.com,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Mel Gorman <mgorman@...hsingularity.net>
Subject: Re: [RFC PATCH] vm: align vma allocation and move the lock back into
the struct
On Sun, Aug 11, 2024 at 9:29 PM Mateusz Guzik <mjguzik@...il.com> wrote:
>
> On Mon, Aug 12, 2024 at 12:50 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> > Ok, disabling adjacent cacheline prefetching seems to do the trick (or
> > at least cuts down the regression drastically):
> >
> > Hmean faults/cpu-1   470577.6434 ( 0.00%)  470745.2649 *  0.04%*
> > Hmean faults/cpu-4   445862.9701 ( 0.00%)  445572.2252 * -0.07%*
> > Hmean faults/cpu-7   422516.4002 ( 0.00%)  422677.5591 *  0.04%*
> > Hmean faults/cpu-12  344483.7047 ( 0.00%)  330476.7911 * -4.07%*
> > Hmean faults/cpu-21  192836.0188 ( 0.00%)  195266.8071 *  1.26%*
> > Hmean faults/cpu-30  140745.9472 ( 0.00%)  140655.0459 * -0.06%*
> > Hmean faults/cpu-48  110507.4310 ( 0.00%)  103802.1839 * -6.07%*
> > Hmean faults/cpu-56   93507.7919 ( 0.00%)   95105.1875 *  1.71%*
> > Hmean faults/sec-1   470232.3887 ( 0.00%)  470404.6525 *  0.04%*
> > Hmean faults/sec-4  1757368.9266 ( 0.00%) 1752852.8697 * -0.26%*
> > Hmean faults/sec-7  2909554.8150 ( 0.00%) 2915885.8739 *  0.22%*
> > Hmean faults/sec-12 4033840.8719 ( 0.00%) 3845165.3277 * -4.68%*
> > Hmean faults/sec-21 3845857.7079 ( 0.00%) 3890316.8799 *  1.16%*
> > Hmean faults/sec-30 3838607.4530 ( 0.00%) 3838861.8142 *  0.01%*
> > Hmean faults/sec-48 4882118.9701 ( 0.00%) 4608985.0530 * -5.59%*
> > Hmean faults/sec-56 4933535.7567 ( 0.00%) 5004208.3329 *  1.43%*
> >
> > Now, how do we disable prefetching extra cachelines for vm_area_structs only?
>
> I'm unaware of any mechanism of the sort.
>
> The good news is that Broadwell is an old yeller and if memory serves
> right the impact is not anywhere near this bad on newer
> microarchitectures, making "merely" 64-byte alignment (used all over in the
> kernel for amd64) a practical choice (not just for vma).
That's indeed good news if other archs are not that sensitive to this.
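
To make the alignment point concrete, here is a minimal C sketch. It
assumes 64-byte cache lines and that the adjacent cache line prefetcher
pairs lines within a 128-byte aligned block. The constants and helper
names are illustrative only and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed sizes: 64-byte cache line, 128-byte aligned pair fetched by the
 * adjacent cache line prefetcher. Both are assumptions for illustration. */
#define LINE_SIZE 64u
#define PAIR_SIZE 128u

/* True if both addresses fall within the same 64-byte cache line. */
bool same_cache_line(uintptr_t a, uintptr_t b)
{
	return (a / LINE_SIZE) == (b / LINE_SIZE);
}

/* True if both addresses fall within the same 128-byte aligned pair. */
bool same_prefetch_pair(uintptr_t a, uintptr_t b)
{
	return (a / PAIR_SIZE) == (b / PAIR_SIZE);
}
```

Under these assumptions, objects at offsets 0 and 64 sit in different
cache lines but in the same 128-byte pair, so 64-byte alignment alone
does not keep them apart from the prefetcher's point of view. Objects at
offsets 64 and 128 fall in different pairs.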
>
> Also note that in your setup you are losing out on performance in
> other multithreaded cases, unrelated to anything vma.
>
> That aside, as I mentioned earlier, the dedicated vma lock cache results
> in false sharing between separate vmas, except this particular
> benchmark does not test for it (which in your setup should be visible
> even if the cache grows the SLAB_HWCACHE_ALIGN flag).
When implementing VMA locks I did experiment with SLAB_HWCACHE_ALIGN
for the vm_lock cache using different benchmarks and didn't see
improvements above noise level. Do you know of a specific benchmark
that would show an improvement?
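
For reference, the false sharing concern can be illustrated with a small
C sketch. It compares objects packed back to back with objects aligned
to an assumed 64-byte cache line. The 40-byte lock size and the helper
names are hypothetical, and this does not model the slab allocator's
actual layout.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_SIZE 64u	/* assumed cache line size */

/* Distance between consecutive objects when each one is aligned to
 * 'align' bytes. 'align' must be a power of two. */
size_t obj_stride(size_t size, size_t align)
{
	return (size + align - 1) & ~(align - 1);
}

/* Index of the cache line holding the first byte of the idx-th object. */
size_t obj_line(size_t idx, size_t stride)
{
	return (idx * stride) / LINE_SIZE;
}
```

With a hypothetical 40-byte lock and 8-byte alignment the stride is 40,
so the first two locks start in the same cache line and two unrelated
vmas can bounce that line between CPUs. With 64-byte alignment the
stride becomes 64 and each lock starts in its own line.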
>
> I think the thing to do here is to bench on other cpus and ignore the
> Broadwell + adjacent cache line prefetcher result if they come back
> fine -- the code should not be held hostage by an old yeller.
That sounds like a good idea. Mel Gorman first reported this
regression when I was developing VMA locks and I believe he has a farm
of different machines to run mmtests on. CC'ing Mel.
Mel, would you be able to run PFT tests with the patch at
https://lore.kernel.org/all/20240808185949.1094891-1-mjguzik@gmail.com/
vs baseline on your farm? The goal is to see if any architecture other
than Broadwell shows a performance regression.
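
When comparing such runs, the percentages in the table quoted above are
consistent with a plain relative change from baseline. A minimal C
sketch of that check follows, assuming a higher-is-better metric such as
faults/sec. The helper names and the noise threshold are hypothetical
and not part of mmtests.

```c
#include <assert.h>
#include <stdbool.h>

/* Relative change of 'patched' versus 'baseline', in percent. */
double pct_change(double baseline, double patched)
{
	return (patched - baseline) / baseline * 100.0;
}

/* Flag a regression only when the drop exceeds 'noise_pct' percent.
 * The threshold is an assumption chosen by the person comparing runs. */
bool is_regression(double baseline, double patched, double noise_pct)
{
	return pct_change(baseline, patched) < -noise_pct;
}
```

For example, the faults/cpu-12 row (344483.7047 vs 330476.7911) comes
out at roughly -4.07%, which a 2% threshold would flag, while the
faults/cpu-1 row would not be flagged.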
>
> To that end I think it would be best to ask the LKP folks at Intel.
> They are very approachable so there should be no problem arranging it
> provided they have some spare capacity. I believe grabbing the From
> person and the cc list from this thread will do it:
> https://lore.kernel.org/oe-lkp/ZriCbCPF6I0JnbKi@xsang-OptiPlex-9020/ .
> By default they would run their own suite, which presumably has some
> overlap with this particular benchmark in terms of generated workload
> (but I don't think they run *this* particular benchmark itself,
> perhaps it would make sense to ask them to add it?). It's your call
> here.
Thanks for the suggestion. Let's see if Mel can use his farm first and
then we'll ask the Intel folks.
>
> If there are still problems and the lock needs to remain separate, the
> bare minimum damage-controlling measure would be to hwalign the vma
> lock cache -- it won't affect the pts benchmark, but it should help
> others.
Sure, but I'll need to measure the improvement and for that I need a
benchmark or a workload. Any suggestions?
>
> Should the decision be to bring the lock back into the struct, I'll
> note my patch is merely slapped together to a state where it can be
> benchmarked and I have no interest in beating it into a committable
> shape. You stated you already had an equivalent (modulo keeping
> something in a space previously occupied by the pointer to the vma
> lock), so as far as I'm concerned you can submit that with your
> authorship.
Thanks! If we end up doing that I'll keep you as Suggested-by and will
add a link to this thread.
Thanks,
Suren.
> --
> Mateusz Guzik <mjguzik gmail.com>