lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJuCfpHkTScPqME=++tzv=qu0aev-ewprLa=GGw8Qp4PQtTL9w@mail.gmail.com>
Date: Wed, 18 Dec 2024 07:42:30 -0800
From: Suren Baghdasaryan <surenb@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: akpm@...ux-foundation.org, willy@...radead.org, liam.howlett@...cle.com, 
	lorenzo.stoakes@...cle.com, mhocko@...e.com, vbabka@...e.cz, 
	hannes@...xchg.org, mjguzik@...il.com, oliver.sang@...el.com, 
	mgorman@...hsingularity.net, david@...hat.com, peterx@...hat.com, 
	oleg@...hat.com, dave@...olabs.net, paulmck@...nel.org, brauner@...nel.org, 
	dhowells@...hat.com, hdanton@...a.com, hughd@...gle.com, 
	lokeshgidra@...gle.com, minchan@...gle.com, jannh@...gle.com, 
	shakeel.butt@...ux.dev, souravpanda@...gle.com, pasha.tatashin@...een.com, 
	klarasmodin@...il.com, corbet@....net, linux-doc@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, kernel-team@...roid.com
Subject: Re: [PATCH v6 10/16] mm: replace vm_lock and detached flag with a
 reference count

On Wed, Dec 18, 2024 at 1:41 AM Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Tue, Dec 17, 2024 at 08:27:46AM -0800, Suren Baghdasaryan wrote:
>
> > > So I just replied there, and no, I don't think it makes sense. Just put
> > > the kmem_cache_free() in vma_refcount_put(), to be done on 0.
> >
> > That's very appealing indeed and makes things much simpler. The
> > problem I see with that is the case when we detach a vma from the tree
> > to isolate it, then do some cleanup and only then free it. That's done
> > in vms_gather_munmap_vmas() here:
> > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1240 and we
> > even might reattach detached vmas back:
> > https://elixir.bootlin.com/linux/v6.12.5/source/mm/vma.c#L1312. IOW,
> > detached state is not final and we can't destroy the object that
> > reached this state.
>
> Urgh, so that's the munmap() path, but arguably when that fails, the
> map stays in place.
>
> I think this means you're marking detached too soon; you should only
> mark detached once you reach the point of no return.
>
> That said, once you've reached the point of no return; and are about to
> go remove the page-tables, you very much want to ensure a lack of
> concurrency.
>
> So perhaps waiting for out-standing readers at this point isn't crazy.
>
> Also, I'm having a very hard time reading this maple tree stuff :/
> Afaict vms_gather_munmap_vmas() only adds the VMAs to be removed to a
> second tree, it does not in fact unlink them from the mm yet.

Yes, I think you are correct.

>
> AFAICT it's vma_iter_clear_gfp() that actually wipes the vmas from the
> mm -- and that being able to fail is mind boggling and I suppose is what
> gives rise to much of this insanity :/
>
> Anyway, I would expect remove_vma() to be the one that marks it detached
> (it's already unreachable through vma_lookup() at this point) and there
> you should wait for concurrent readers to bugger off.

There is an issue with that. Note that vms_complete_munmap_vmas()
that's calling remove_vma() might drop the mmap write lock, so
detaching without a write lock would break current rules.

>
> > We could change states to: 0=unused (we can free
> > the object), 1=detached, 2=attached, etc. but then vma_start_read()
> > should do something like refcount_inc_more_than_one() instead of
> > refcount_inc_not_zero(). Would you be ok with such an approach?
>
> Urgh, I would strongly suggest ditching refcount_t if we go this route.
> The thing is; refcount_t should remain a 'simple' straight forward
> interface and not allow people to do the wrong thing. Its not meant to
> be the kitchen sink -- we have atomic_t for that.

Ack. If we go this route I'll use atomics directly.

>
> Anyway, the more common scheme at that point is using -1 for 'free', I
> think folio->_mapcount uses that even. For that see:
> atomic_add_negative*().

Thanks for the reference.

>
> > > Additionally, having vma_end_write() would allow you to put a lockdep
> > > annotation in vma_{start,end}_write() -- which was I think the original
> > > reason I proposed it a while back, that and having improved clarity when
> > > reading the code, since explicitly marking the end of a section is
> > > helpful.
> >
> > The vma->vmlock_dep_map is tracking vma->vm_refcnt, not the
> > vma->vm_lock_seq (similar to how today vma->vm_lock has its lockdep
> > tracking that rw_semaphore). If I implement vma_end_write() then it
> > will simply be something like:
> >
> > void vma_end_write(vma)
> > {
> >          vma_assert_write_locked(vma);
> >          vma->vm_lock_seq = UINT_MAX;
> > }
> >
> > so, vmlock_dep_map would not be involved.
>
> That's just weird; why would you not track vma_{start,end}_write() with
> the exclusive side of the 'rwsem' dep_map ?
>
> > If you want to track vma->vm_lock_seq with a separate lockdep, that
> > would be more complicated. Specifically for vma_end_write_all() that
> > would require us to call rwsem_release() on all locked vmas, however
> > we currently do not track individual locked vmas. vma_end_write_all()
> > allows us not to worry about tracking them, knowing that once we do
> > mmap_write_unlock() they all will get unlocked with one increment of
> > mm->mm_lock_seq. If your suggestion is to replace vma_end_write_all()
> > with vma_end_write() and unlock vmas individually across the mm code,
> > that would be a sizable effort. If that is indeed your ultimate goal,
> > I can do that as a separate project: introduce vma_end_write(),
> > gradually add them in required places (not yet sure how complex that
> > would be), then retire vma_end_write_all() and add a lockdep for
> > vma->vm_lock_seq.
>
> Yeah, so ultimately I think it would be clearer if you explicitly mark
> the point where the vma modification is 'done'. But I don't suppose we
> have to do that here.

Ack.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ