Message-ID: <38ce85e9-9727-485e-b232-bb285ea24f31@lucifer.local>
Date: Thu, 23 Jan 2025 17:07:11 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Barry Song <21cnbao@...il.com>
Cc: lokeshgidra@...gle.com, Liam.Howlett@...cle.com, aarcange@...hat.com,
akpm@...ux-foundation.org, axelrasmussen@...gle.com,
bgeffon@...gle.com, david@...hat.com, jannh@...gle.com,
kaleshsingh@...gle.com, kernel-team@...roid.com,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, ngeoffray@...gle.com, peterx@...hat.com,
rppt@...nel.org, ryan.roberts@....com, selinux@...r.kernel.org,
surenb@...gle.com, timmurray@...gle.com, willy@...radead.org,
Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [PATCH v7 4/4] userfaultfd: use per-vma locks in userfaultfd
operations
On Thu, Jan 23, 2025 at 04:48:10PM +0000, Lorenzo Stoakes wrote:
> +cc vbabka
Sorry, my mail client is messing up - forgot to actually cc... fixed :)
>
> I realise this is resurrecting an old thread, but it's helpful to cc us mmap
> maintainers, as my mail at least is a nightmare :P Thanks!
>
> On Thu, Jan 23, 2025 at 05:14:27PM +1300, Barry Song wrote:
> > > All userfaultfd operations, except write-protect, opportunistically use
> > > per-vma locks to lock vmas. On failure, attempt again inside mmap_lock
> > > critical section.
> > >
> > > Write-protect operation requires mmap_lock as it iterates over multiple
> > > vmas.
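(For anyone skimming the thread: the pattern being described is roughly the
below. This is only a minimal sketch using existing mm helpers, not the actual
code from the patch.)

	struct vm_area_struct *vma;

	/* Opportunistically take only the per-VMA lock. */
	vma = lock_vma_under_rcu(mm, address);
	if (vma) {
		/* ... perform the uffd operation under the VMA lock ... */
		vma_end_read(vma);
	} else {
		/* On failure, fall back to an mmap_lock critical section. */
		mmap_read_lock(mm);
		vma = vma_lookup(mm, address);
		if (vma) {
			/* ... perform the uffd operation under mmap_lock ... */
		}
		mmap_read_unlock(mm);
	}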
> >
> > Hi Lokesh,
> >
> > Apologies for reviving this old thread. We truly appreciate the excellent work
> > you’ve done in transitioning many userfaultfd operations to per-VMA locks.
>
> Let me also say - thanks Lokesh!
>
> >
> > However, we’ve noticed that userfaultfd still remains one of the largest users
> > of mmap_lock for write operations, with the other—binder—having been recently
> > addressed by Carlos Llamas's "binder: faster page installations" series:
> >
> > https://lore.kernel.org/lkml/20241203215452.2820071-1-cmllamas@google.com/
> >
> > The HeapTaskDaemon (Java GC) might frequently perform userfaultfd_register()
> > and userfaultfd_unregister() operations, both of which require the mmap_lock
> > in write mode to either split or merge VMAs. Since HeapTaskDaemon is a
> > lower-priority background task, there are cases where, after acquiring the
> > mmap_lock, it gets preempted by other tasks. As a result, even high-priority
> > threads waiting for the mmap_lock, whether in writer or reader mode, can
> > end up experiencing significant delays (the delay can reach several hundred
> > milliseconds in the worst case).
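(For readers unfamiliar with the interface: the churn being described boils
down to repeated UFFDIO_REGISTER/UFFDIO_UNREGISTER ioctls over the heap range.
The below is an illustrative userspace sketch, not the actual
ART/HeapTaskDaemon code; uffd, heap and heap_len are placeholders.)

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)heap, .len = heap_len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	struct uffdio_range unreg = {
		.start = (unsigned long)heap,
		.len = heap_len,
	};

	/* Each of these takes mmap_lock for write to split/merge VMAs. */
	ioctl(uffd, UFFDIO_REGISTER, &reg);
	/* ... GC/compaction pass over the range ... */
	ioctl(uffd, UFFDIO_UNREGISTER, &unreg);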
>
> Thanks for reporting - this strikes me as an important report, but I'm not
> sure about your proposed solution :)
>
> However I wonder if we can't look at more efficiently handling the locking
> around uffd register/unregister?
>
> I think uffd suffers from a complexity problem, in that there are multiple
> moving parts and complicated interactions especially for each of the
> different kinds of events uffd can deal with.
>
> Also, ordering matters a great deal within uffd. Ideally uffd wouldn't
> effectively have open-coded hooks all over the mm code, but that ship
> sailed long ago, so we need to assess really carefully how locking
> changes impact uffd behaviour.
>
> >
> > We haven’t yet identified an ideal solution for this. However, the Java heap
> > appears to behave like a "volatile" vma in its usage. A somewhat simplistic
> > idea would be to designate a specific region of the user address space as
> > "volatile" and restrict all "volatile" VMAs to this isolated region.
> >
> > We may have a MAP_VOLATILE flag to mmap. VMA regions with this flag will be
> > mapped to the volatile space, while those without it will be mapped to the
> > non-volatile space.
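(If the proposal is understood correctly, allocation would then look something
like the below. MAP_VOLATILE is hypothetical and its value is made up; nothing
here exists in any kernel.)

	#define MAP_VOLATILE	0x800000	/* hypothetical, value made up */

	void *heap = mmap(NULL, heap_len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_VOLATILE,
			  -1, 0);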
>
> This feels like a major thing to do just to suit specific uffd usages, a
> feature that not everybody makes use of.
>
> The virtual address space is a somewhat sensitive subject (see the
> relatively recent discussion on a proposed MAP_BELOW flag), and has a lot
> of landmines you can step on.
>
> How would this range be determined? How would this interact with people
> doing 'interesting' things with MAP_FIXED mappings? What if somebody
> MAP_FIXED into a volatile region and gets this behaviour by mistake?
>
> How do the two locks interact with one another and with vma locks? We
> already have a very complicated set of interactions between the mmap sem
> and vma locks, where the mmap sem is taken to lock the mmap (of course) -
> but now we'd have two?
>
> Also, you have to take the write lock when you modify the VMA tree - would
> we have two maple trees now? And what about the overhead?
>
> I feel like this is not the right route for this.
>
> >
> > ┌────────────┐TASK_SIZE
> > │ │
> > │ │
> > │ │mmap VOLATILE
> > ├────────────┤
> > │ │
> > │ │
> > │ │
> > │ │
> > │ │default mmap
> > │ │
> > │ │
> > └────────────┘
> >
> > VMAs in the volatile region are assigned their own volatile_mmap_lock,
> > which is independent of the mmap_lock for the non-volatile region.
> > Additionally, we ensure that no single VMA spans the boundary between
> > the volatile and non-volatile regions. This separation prevents the
> > frequent modifications of a small number of volatile VMAs from blocking
> > other operations on a large number of non-volatile VMAs.
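(As described, the split would presumably boil down to picking a lock by
address, something like the hypothetical sketch below. None of these fields or
helpers exist today.)

	static inline struct rw_semaphore *mm_region_lock(struct mm_struct *mm,
							  unsigned long addr)
	{
		/* hypothetical boundary between default and volatile space */
		if (addr >= mm->volatile_start)
			return &mm->volatile_mmap_lock;	/* hypothetical */
		return &mm->mmap_lock;
	}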
>
> I think really overall this will be solving one can of worms by introducing
> another can of very large worms in space :P but perhaps I am missing
> details here.
>
> >
> > The implementation itself wouldn’t be overly complex, but the design
> > might come across as somewhat hacky.
> >
> > Lastly, I have two questions:
> >
> > 1. Have you observed similar issues where userfaultfd continues to
> > cause lock contention and priority inversion?
> >
> > 2. If so, do you have any ideas or suggestions on how to address this
> > problem?
>
> Addressing and investigating this, however, is VERY IMPORTANT. Let's perhaps
> investigate how we might find a uffd-specific means of addressing these
> problems.
>
> >
> > >
> > > Signed-off-by: Lokesh Gidra <lokeshgidra@...gle.com>
> > > Reviewed-by: Liam R. Howlett <Liam.Howlett@...cle.com>
> > > ---
> > > fs/userfaultfd.c | 13 +-
> > > include/linux/userfaultfd_k.h | 5 +-
> > > mm/huge_memory.c | 5 +-
> > > mm/userfaultfd.c | 380 ++++++++++++++++++++++++++--------
> > > 4 files changed, 299 insertions(+), 104 deletions(-)
> > >
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index c00a021bcce4..60dcfafdc11a 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> >
> > Thanks
> > Barry
> >
>
> Cheers, Lorenzo