Message-ID: <5aecdfec-5939-4627-a27b-f2057a95fb65@lucifer.local>
Date: Mon, 13 Jan 2025 17:11:32 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Suren Baghdasaryan <surenb@...gle.com>
Cc: akpm@...ux-foundation.org, peterz@...radead.org, willy@...radead.org,
liam.howlett@...cle.com, david.laight.linux@...il.com, mhocko@...e.com,
vbabka@...e.cz, hannes@...xchg.org, mjguzik@...il.com,
oliver.sang@...el.com, mgorman@...hsingularity.net, david@...hat.com,
peterx@...hat.com, oleg@...hat.com, dave@...olabs.net,
paulmck@...nel.org, brauner@...nel.org, dhowells@...hat.com,
hdanton@...a.com, hughd@...gle.com, lokeshgidra@...gle.com,
minchan@...gle.com, jannh@...gle.com, shakeel.butt@...ux.dev,
souravpanda@...gle.com, pasha.tatashin@...een.com,
klarasmodin@...il.com, richard.weiyang@...il.com, corbet@....net,
linux-doc@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, kernel-team@...roid.com
Subject: Re: [PATCH v9 00/17] reimplement per-vma lock as a refcount
On Mon, Jan 13, 2025 at 08:58:37AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 13, 2025 at 4:14 AM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> >
> > A nit on the subject: I mean, this is part of what this series does, and
> > hey - we have only so much text to put in here - but isn't this both
> > reimplementing the per-VMA lock as a refcount _and_, importantly,
> > allocating VMAs using the RCU typesafe mechanism?
> >
> > Do we have to do both in one series? Can we split this out? I mean maybe
> > that's just churny and unnecessary, but to me this series is 'allocate VMAs
> > RCU safe and refcount VMA lock' or something like this. Maybe this is
> > nitty... but still :)
>
> There is "motivational dependency" because one of the main reasons I'm
> converting the vm_lock into vm_refcnt is to make it easier to add
> SLAB_TYPESAFE_BY_RCU (see my last reply to Hillf). But technically we
> can leave the SLAB_TYPESAFE_BY_RCU out of this series if that makes
> things easier. That would be the 2 patches at the end:
Right yeah... maybe it's better to do it in one hit.
>
> mm: prepare lock_vma_under_rcu() for vma reuse possibility
> mm: make vma cache SLAB_TYPESAFE_BY_RCU
>
> I made sure that each patch is bisectable, so there should not be a
> problem with tracking down issues.
>
> >
> > One general comment here - this is a really major change in how this stuff
> > works, and yet I don't see any tests anywhere in the series.
>
> Hmm. I was diligently updating the tests to reflect the replacement of
> vm_lock with vm_refcnt and adding assertions for detach/attach cases.
> This actually reminds me that I missed updating vm_area_struct in
> vma_internal.h for the member regrouping patch; will add that. I think
> the only part that did not affect tests is SLAB_TYPESAFE_BY_RCU but I
> was not sure what kind of testing I can add for that. Any suggestions
> would be welcomed.
And to be clear, I'm super grateful you did that :) thanks - it'd be good
to change the member regrouping thing also.
But that doesn't change the fact that this series adds exactly zero new
tests. For a change this broad that feels like a big issue - we really
want to be careful here.
You've also noticed that I've cleverly failed to _actually_ suggest
SLAB_TYPESAFE_BY_RCU tests, and mea culpa - it's super hard to think of how
to test that.
Liam has experience doing this kind of RCU testing for the maple tree
stuff, but it wasn't pretty or easy, and it would probably require massive
rework to expose this stuff to some viable testing environment - in other
words, it's unworkable.
HOWEVER, I feel like maybe we could try to create scenarios where we might
trigger reuse bugs?
Perhaps some userland code, perhaps even constrained by cgroup, that maps a
ton of stuff and unmaps in a loop in parallel?
Perhaps create scenarios with shared memory where we up refcounts a lot too?
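As a dumb first cut, the first of these could be as simple as the below
(entirely illustrative userland sketch - thread/iteration/map counts
plucked from the air; build with gcc -pthread):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define NR_THREADS	8
#define NR_ITERATIONS	100000
#define NR_MAPS		64

/* Hammer mmap()/munmap() in parallel to maximise VMA alloc/free churn
 * and give reuse bugs a chance to bite. */
static void *thrash(void *arg)
{
	void *maps[NR_MAPS];
	int i, j;

	(void)arg;
	for (i = 0; i < NR_ITERATIONS; i++) {
		for (j = 0; j < NR_MAPS; j++) {
			maps[j] = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (maps[j] == MAP_FAILED) {
				perror("mmap");
				exit(EXIT_FAILURE);
			}
			/* Fault the page in so the lockless fault path runs. */
			*(volatile char *)maps[j] = 1;
		}
		for (j = 0; j < NR_MAPS; j++)
			munmap(maps[j], 4096);
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&threads[i], NULL, thrash, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}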
Anyway, this is necessarily nebulous without further investigation; what I
was thinking more concretely is:
Using the VMA userland testing:
1. Assert reference count correctness across locking scenarios and various
VMA operations.
2. Assert correct detached/not detached state across different scenarios.
This won't quite be complete, as not everything is separated out enough to
allow things like process teardown/forking etc. to be explicitly tested,
but you can unit test the VMA bits at least - a rough sketch for (2)
follows.
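Something roughly like this, say (a sketch only - it assumes the
attach/detach helpers from this series, e.g. vma_mark_attached() and
vma_assert_detached(), are visible to the harness, along with an
alloc_vma() helper like the one vma.c already carries; names may need
adjusting):

/*
 * Sketch: assert detached/attached state transitions. Helper and
 * macro names are assumed from the series/harness, adjust to taste.
 */
static bool test_vma_detach_attach(void)
{
	vm_flags_t flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
	struct mm_struct mm = {};
	struct vm_area_struct *vma = alloc_vma(&mm, 0x1000, 0x2000, 1, flags);

	ASSERT_NE(vma, NULL);
	/* A freshly allocated VMA must start out detached. */
	vma_assert_detached(vma);

	/* Attaching marks it attached; a double attach should assert. */
	vma_mark_attached(vma);
	vma_assert_attached(vma);

	/* Detaching returns it to the detached state. */
	vma_mark_detached(vma);
	vma_assert_detached(vma);

	vm_area_free(vma);
	return true;
}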
One note on this, I intend to split the vma.c file into multiple files in
tools/testing/vma/ so if you add tests here it'd be worth probably putting
them into a new file.
I'm happy to help with this if you need any assistance, feel free to ping!
Sorry to put this on you so late in the series - I realise it's annoying -
but things have changed a lot, two series have in effect been aggregated
into one, and these are genuine concerns that at this stage I feel we need
to at least make some headway on.
>
> >
> > I know it's tricky to write tests for this, but the new VMA testing
> > environment should make it possible to test a _lot_ more than we previously
> > could.
> >
> > However due to some (*ahem*) interesting distribution of where functions
> > are, most notably stuff in kernel/fork.c, I guess we can't test
> > _everything_ there effectively.
> >
> > But I do feel like we should be able to do better than having absolutely no
> > testing added for this?
>
> Again, I'm open to suggestions for SLAB_TYPESAFE_BY_RCU testing but
> for the rest I thought the tests were modified accordingly.
See above ^
>
> >
> > I think there's definitely quite a bit you could test now, at least in
> > asserting fundamentals in tools/testing/vma/vma.c.
> >
> > This can cover at least detached state asserts in various scenarios.
>
> Ok, you mean to check that VMA re-attachment/re-detachment would
> trigger assertions? I'll look into adding tests for that.
Yeah this is one, see above :)
>
> >
> > But that won't cover off the really gnarly stuff here around RCU slab
> > allocation, and determining precisely how to test that in a sensible way is
> > maybe less clear.
> >
> > But I'd like to see _something_ here please, this is more or less
> > fundamentally changing how all VMAs are allocated and to just have nothing
> > feels unfortunate.
>
> Again, I'm open to suggestions on what kind of testing I can add for
> SLAB_TYPESAFE_BY_RCU change.
See above
>
> >
> > I'm already nervous because we've hit issues coming up to v9 and we're not
> > 100% sure whether a recent syzkaller report is related to these changes or
> > not. I'm not sure how much assurance we can get from tests, but I'd like
> > something.
>
> If you are referring to the issue at [1], I think David ran syzkaller
> against mm-stable, which does not contain this patchset, and the issue
> still triggered (see [2]). This of course does not guarantee that this
> patchset has no other issues :) I'll try adding tests for re-attaching
> and re-detaching, and welcome ideas on how to test the
> SLAB_TYPESAFE_BY_RCU transition.
> Thanks,
> Suren.
OK that's reassuring!
>
> [1] https://lore.kernel.org/all/6758f0cc.050a0220.17f54a.0001.GAE@google.com/
> [2] https://lore.kernel.org/all/67823fba.050a0220.216c54.001c.GAE@google.com/
>
> >
> > Thanks!
> >
> > On Fri, Jan 10, 2025 at 08:25:47PM -0800, Suren Baghdasaryan wrote:
> > > Back when per-vma locks were introduced, vm_lock was moved out of
> > > vm_area_struct in [1] because of the performance regression caused by
> > > false cacheline sharing. Recent investigation [2] revealed that the
> > > regression is limited to a rather old Broadwell microarchitecture and
> > > even there it can be mitigated by disabling adjacent cacheline
> > > prefetching, see [3].
> > > Splitting a single logical structure into multiple ones leads to more
> > > complicated management, extra pointer dereferences and overall less
> > > maintainable code. When that split-away part is a lock, it complicates
> > > things even further. With no performance benefits, there are no reasons
> > > for this split. Merging the vm_lock back into vm_area_struct also allows
> > > vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
> > > This patchset:
> > > 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
> > > boundary and changing the cache to be cacheline-aligned to minimize
> > > cacheline sharing;
> > > 2. changes vm_area_struct initialization to mark new vma as detached until
> > > it is inserted into vma tree;
> > > 3. replaces vm_lock and vma->detached flag with a reference counter;
> > > 4. regroups vm_area_struct members to fit them into 3 cachelines;
> > > 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
> > > reuse and to minimize call_rcu() calls.
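An aside for anyone following along: point 5 is the subtle one. With
SLAB_TYPESAFE_BY_RCU a VMA can be freed and reused as another VMA while a
reader is still inside its RCU read-side section, so the lockless lookup
has to pin the object and then revalidate it. Roughly this pattern - a
sketch only, names approximate rather than the actual implementation:

/*
 * Returns a read-locked (pinned) VMA covering addr, or NULL. This is
 * approximately what lock_vma_under_rcu() must do once the cache is
 * SLAB_TYPESAFE_BY_RCU; names are illustrative.
 */
struct vm_area_struct *lookup_vma(struct mm_struct *mm, unsigned long addr)
{
	MA_STATE(mas, &mm->mm_mt, addr, addr);
	struct vm_area_struct *vma;

	rcu_read_lock();
retry:
	vma = mas_walk(&mas);
	if (!vma)
		goto unlock;

	/*
	 * The memory cannot return to the page allocator under us, but
	 * the object may have been freed and reused for another VMA.
	 * Taking a reference can fail if it is being detached or
	 * write-locked.
	 */
	if (!vma_start_read(vma)) {
		vma = NULL;
		goto unlock;
	}

	/*
	 * The refcount now pins the VMA, so recheck it is still the one
	 * the tree pointed at and still covers addr - otherwise we
	 * raced with reuse and must retry the walk.
	 */
	if (vma->vm_mm != mm || addr < vma->vm_start || addr >= vma->vm_end) {
		vma_end_read(vma);
		mas_set(&mas, addr);
		goto retry;
	}
unlock:
	rcu_read_unlock();
	return vma;
}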
> > >
> > > Pagefault microbenchmarks show performance improvement:
> > > Hmean faults/cpu-1 507926.5547 ( 0.00%) 506519.3692 * -0.28%*
> > > Hmean faults/cpu-4 479119.7051 ( 0.00%) 481333.6802 * 0.46%*
> > > Hmean faults/cpu-7 452880.2961 ( 0.00%) 455845.6211 * 0.65%*
> > > Hmean faults/cpu-12 347639.1021 ( 0.00%) 352004.2254 * 1.26%*
> > > Hmean faults/cpu-21 200061.2238 ( 0.00%) 229597.0317 * 14.76%*
> > > Hmean faults/cpu-30 145251.2001 ( 0.00%) 164202.5067 * 13.05%*
> > > Hmean faults/cpu-48 106848.4434 ( 0.00%) 120641.5504 * 12.91%*
> > > Hmean faults/cpu-56 92472.3835 ( 0.00%) 103464.7916 * 11.89%*
> > > Hmean faults/sec-1 507566.1468 ( 0.00%) 506139.0811 * -0.28%*
> > > Hmean faults/sec-4 1880478.2402 ( 0.00%) 1886795.6329 * 0.34%*
> > > Hmean faults/sec-7 3106394.3438 ( 0.00%) 3140550.7485 * 1.10%*
> > > Hmean faults/sec-12 4061358.4795 ( 0.00%) 4112477.0206 * 1.26%*
> > > Hmean faults/sec-21 3988619.1169 ( 0.00%) 4577747.1436 * 14.77%*
> > > Hmean faults/sec-30 3909839.5449 ( 0.00%) 4311052.2787 * 10.26%*
> > > Hmean faults/sec-48 4761108.4691 ( 0.00%) 5283790.5026 * 10.98%*
> > > Hmean faults/sec-56 4885561.4590 ( 0.00%) 5415839.4045 * 10.85%*
> > >
> > > Changes since v8 [4]:
> > > - Change subject for the cover letter, per Vlastimil Babka
> > > - Added Reviewed-by and Acked-by, per Vlastimil Babka
> > > - Added static check for no-limit case in __refcount_add_not_zero_limited,
> > > per David Laight
> > > - Fixed vma_refcount_put() to call rwsem_release() unconditionally,
> > > per Hillf Danton and Vlastimil Babka
> > > - Use a copy of vma->vm_mm in vma_refcount_put() in case vma is freed from
> > > under us, per Vlastimil Babka
> > > - Removed extra rcu_read_lock()/rcu_read_unlock() in vma_end_read(),
> > > per Vlastimil Babka
> > > - Changed __vma_enter_locked() parameter to centralize refcount logic,
> > > per Vlastimil Babka
> > > - Amended description in vm_lock replacement patch explaining the effects
> > > of the patch on vm_area_struct size, per Vlastimil Babka
> > > - Added vm_area_struct member regrouping patch [5] into the series
> > > - Renamed vma_copy() into vm_area_init_from(), per Liam R. Howlett
> > > - Added a comment for vm_area_struct to update vm_area_init_from() when
> > > adding new members, per Vlastimil Babka
> > > - Updated a comment about unstable src->shared.rb when copying a vma in
> > > vm_area_init_from(), per Vlastimil Babka
> > >
> > > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
> > > [4] https://lore.kernel.org/all/20250109023025.2242447-1-surenb@google.com/
> > > [5] https://lore.kernel.org/all/20241111205506.3404479-5-surenb@google.com/
> > >
> > > Patchset applies over mm-unstable after reverting v8
> > > (current SHA range: 235b5129cb7b - 9e6b24c58985)
> > >
> > > Suren Baghdasaryan (17):
> > > mm: introduce vma_start_read_locked{_nested} helpers
> > > mm: move per-vma lock into vm_area_struct
> > > mm: mark vma as detached until it's added into vma tree
> > > mm: introduce vma_iter_store_attached() to use with attached vmas
> > > mm: mark vmas detached upon exit
> > > types: move struct rcuwait into types.h
> > > mm: allow vma_start_read_locked/vma_start_read_locked_nested to fail
> > > mm: move mmap_init_lock() out of the header file
> > > mm: uninline the main body of vma_start_write()
> > > refcount: introduce __refcount_{add|inc}_not_zero_limited
> > > mm: replace vm_lock and detached flag with a reference count
> > > mm: move lesser used vma_area_struct members into the last cacheline
> > > mm/debug: print vm_refcnt state when dumping the vma
> > > mm: remove extra vma_numab_state_init() call
> > > mm: prepare lock_vma_under_rcu() for vma reuse possibility
> > > mm: make vma cache SLAB_TYPESAFE_BY_RCU
> > > docs/mm: document latest changes to vm_lock
> > >
> > > Documentation/mm/process_addrs.rst | 44 ++++----
> > > include/linux/mm.h | 156 ++++++++++++++++++++++-------
> > > include/linux/mm_types.h | 75 +++++++-------
> > > include/linux/mmap_lock.h | 6 --
> > > include/linux/rcuwait.h | 13 +--
> > > include/linux/refcount.h | 24 ++++-
> > > include/linux/slab.h | 6 --
> > > include/linux/types.h | 12 +++
> > > kernel/fork.c | 129 +++++++++++-------------
> > > mm/debug.c | 12 +++
> > > mm/init-mm.c | 1 +
> > > mm/memory.c | 97 ++++++++++++++++--
> > > mm/mmap.c | 3 +-
> > > mm/userfaultfd.c | 32 +++---
> > > mm/vma.c | 23 ++---
> > > mm/vma.h | 15 ++-
> > > tools/testing/vma/linux/atomic.h | 5 +
> > > tools/testing/vma/vma_internal.h | 93 ++++++++---------
> > > 18 files changed, 465 insertions(+), 281 deletions(-)
> > >
> > > --
> > > 2.47.1.613.gc27f4b7a9f-goog
> > >