Message-ID: <CAGudoHF=oPXU1RaCn3G0Scqw8+yr_0-Mj4ENZSYMyyGwc5Cgcg@mail.gmail.com>
Date: Mon, 12 Aug 2024 06:29:38 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Suren Baghdasaryan <surenb@...gle.com>
Cc: Vlastimil Babka <vbabka@...e.cz>, linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	Liam.Howlett@...cle.com, pedro.falcato@...il.com, 
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Subject: Re: [RFC PATCH] vm: align vma allocation and move the lock back into
 the struct

On Mon, Aug 12, 2024 at 12:50 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> Ok, disabling adjacent cacheline prefetching seems to do the trick (or
> at least cuts down the regression drastically):
>
> Hmean     faults/cpu-1    470577.6434 (   0.00%)   470745.2649 *   0.04%*
> Hmean     faults/cpu-4    445862.9701 (   0.00%)   445572.2252 *  -0.07%*
> Hmean     faults/cpu-7    422516.4002 (   0.00%)   422677.5591 *   0.04%*
> Hmean     faults/cpu-12   344483.7047 (   0.00%)   330476.7911 *  -4.07%*
> Hmean     faults/cpu-21   192836.0188 (   0.00%)   195266.8071 *   1.26%*
> Hmean     faults/cpu-30   140745.9472 (   0.00%)   140655.0459 *  -0.06%*
> Hmean     faults/cpu-48   110507.4310 (   0.00%)   103802.1839 *  -6.07%*
> Hmean     faults/cpu-56    93507.7919 (   0.00%)    95105.1875 *   1.71%*
> Hmean     faults/sec-1    470232.3887 (   0.00%)   470404.6525 *   0.04%*
> Hmean     faults/sec-4   1757368.9266 (   0.00%)  1752852.8697 *  -0.26%*
> Hmean     faults/sec-7   2909554.8150 (   0.00%)  2915885.8739 *   0.22%*
> Hmean     faults/sec-12  4033840.8719 (   0.00%)  3845165.3277 *  -4.68%*
> Hmean     faults/sec-21  3845857.7079 (   0.00%)  3890316.8799 *   1.16%*
> Hmean     faults/sec-30  3838607.4530 (   0.00%)  3838861.8142 *   0.01%*
> Hmean     faults/sec-48  4882118.9701 (   0.00%)  4608985.0530 *  -5.59%*
> Hmean     faults/sec-56  4933535.7567 (   0.00%)  5004208.3329 *   1.43%*
>
> Now, how do we disable prefetching extra cachelines for vm_area_structs only?

I'm unaware of any mechanism of the sort.
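
(For reference, the only knob I know of is the global prefetcher
control MSR on Intel. A userspace sketch, assuming MSR 0x1a4 with bit
1 covering the L2 adjacent cache line prefetcher as per Intel's
prefetcher control disclosure -- double-check against your part
before using:)

/*
 * Sketch: turn off only the adjacent cache line prefetcher on cpu0
 * by setting bit 1 of MSR 0x1a4 through the msr driver. Real use
 * would loop over all CPUs and check for errors; needs root and
 * msr.ko loaded.
 */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
	uint64_t val;
	int fd = open("/dev/cpu/0/msr", O_RDWR);

	pread(fd, &val, sizeof(val), 0x1a4);	/* offset selects the MSR */
	val |= 1ULL << 1;		/* bit 1: adjacent line prefetch off */
	pwrite(fd, &val, sizeof(val), 0x1a4);
	close(fd);
	return 0;
}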

The good news is that Broadwell is an old yeller and if memory serves
the impact is nowhere near this bad on newer microarchitectures,
making "mere" 64-byte alignment (used all over the kernel on amd64) a
practical choice (and not just for the vma).
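
To illustrate, a sketch of what I mean -- assuming the cache keeps
being created in kernel/fork.c with its current flags, just with an
explicit align instead of going through KMEM_CACHE():

/*
 * Sketch only, not the actual patch: allocate vm_area_structs on
 * 64-byte boundaries via the slab allocator's align argument.
 */
vm_area_cachep = kmem_cache_create("vm_area_struct",
				   sizeof(struct vm_area_struct),
				   64,	/* explicit 64-byte alignment */
				   SLAB_PANIC|SLAB_ACCOUNT, NULL);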

Also note that with the prefetcher disabled you are losing out on
performance in other multithreaded cases, unrelated to anything vma.

That aside, as I mentioned earlier, the dedicated vma lock cache
results in false sharing between separate vmas; it's just that this
particular benchmark does not test for it (and in your setup it
should be visible even if the cache grows the SLAB_HWCACHE_ALIGN
flag).

I think the thing to do here is to bench on other CPUs and ignore the
Broadwell + adjacent cache line prefetcher result if the others come
back fine -- the code should not be held hostage by an old yeller.

To that end I think it would be best to ask the LKP folks at Intel.
They are very approachable so there should be no problem arranging it
provided they have some spare capacity. I believe grabbing the From
person and the cc list from this thread will do it:
https://lore.kernel.org/oe-lkp/ZriCbCPF6I0JnbKi@xsang-OptiPlex-9020/ .
By default they would run their own suite, which presumably has some
overlap with this particular benchmark in terms of generated workload
(though I don't think they run *this* benchmark itself; perhaps it
would make sense to ask them to add it?). It's your call here.

If there are still problems and the lock needs to remain separate,
the bare minimum damage-controlling measure would be to hwalign the
vma lock cache -- it won't affect the pts benchmark, but it should
help others.
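
Concretely something like the below (a sketch; I'm assuming the cache
is still created via KMEM_CACHE() in kernel/fork.c with the flags it
has today):

/*
 * Sketch: hwalign the vma lock cache so locks of neighbouring vmas
 * stop sharing a cacheline.
 */
vma_lock_cachep = KMEM_CACHE(vma_lock,
			     SLAB_PANIC|SLAB_ACCOUNT|SLAB_HWCACHE_ALIGN);

Note that on adjacent-line-prefetching parts this still leaves the
paired line, hence only "damage-controlling".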

Should the decision be to bring the lock back into the struct, I'll
note my patch is merely slapped together to a state where it can be
benchmarked and I have no interest in beating it into committable
shape. You stated you already had an equivalent (modulo keeping
something in the space previously occupied by the pointer to the vma
lock), so as far as I'm concerned you can submit that under your own
authorship.
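
To sketch the direction being discussed (illustrative only, not the
actual diff -- the field list is elided and 64 is the alignment from
above):

/*
 * Illustrative only: embed the lock and align the object so a vma
 * owns whole cachelines instead of chasing a pointer into a shared
 * lock cache.
 */
struct vma_lock {
	struct rw_semaphore lock;
};

struct vm_area_struct {
	unsigned long vm_start;
	unsigned long vm_end;
	/* ... the rest of the real fields ... */
	struct vma_lock vm_lock;	/* was: struct vma_lock *vm_lock */
} __aligned(64);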
-- 
Mateusz Guzik <mjguzik gmail.com>
