[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d770b94d-eded-48a4-9d9f-063704e91139@lucifer.local>
Date: Thu, 14 Nov 2024 20:23:51 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Jann Horn <jannh@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Jonathan Corbet <corbet@....net>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Alice Ryhl <aliceryhl@...gle.com>,
Boqun Feng <boqun.feng@...il.com>,
Matthew Wilcox <willy@...radead.org>, Mike Rapoport <rppt@...nel.org>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, Suren Baghdasaryan <surenb@...gle.com>,
Hillf Danton <hdanton@...a.com>, Qi Zheng <zhengqi.arch@...edance.com>,
SeongJae Park <sj@...nel.org>, Bagas Sanjaya <bagasdotme@...il.com>
Subject: Re: [PATCH v2] docs/mm: add VMA locks documentation
On Wed, Nov 13, 2024 at 08:46:39PM +0100, Jann Horn wrote:
> On Fri, Nov 8, 2024 at 2:57 PM Lorenzo Stoakes
> <lorenzo.stoakes@...cle.com> wrote:
> > Locking around VMAs is complicated and confusing. While we have a number of
> > disparate comments scattered around the place, we seem to be reaching a
> > level of complexity that justifies a serious effort at clearly documenting
> > how locks are expected to be used when it comes to interacting with
> > mm_struct and vm_area_struct objects.
> >
> > This is especially pertinent as regards the efforts to find sensible
> > abstractions for these fundamental objects in kernel rust code whose
> > compiler strictly requires some means of expressing these rules (and
> > through this expression, self-document these requirements as well as
> > enforce them).
> >
> > The document limits scope to mmap and VMA locks and those that are
> > immediately adjacent and relevant to them - so additionally covers page
> > table locking as this is so very closely tied to VMA operations (and relies
> > upon us handling these correctly).
> >
> > The document tries to cover some of the nastier and more confusing edge
> > cases and concerns especially around lock ordering and page table teardown.
> >
> > The document is split between generally useful information for users of mm
> > interfaces, and separately a section intended for mm kernel developers
> > providing a discussion around internal implementation details.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
>
> Reviewed-by: Jann Horn <jannh@...gle.com>
Thanks! Will do a quick respin to fixup what you raised (+ pull in the fix-patch
there too obv.)
>
> except some typos and one inaccuracy:
>
> > +* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
> > + which locks at a process address space granularity which can be acquired via
> > + :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
> > +* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
> > + as a read/write semaphore in practice. A VMA read lock is obtained via
> > + :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
> > + write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
> > + automatically when the mmap write lock is released). To take a VMA write lock
> > + you **must** have already acquired an :c:func:`!mmap_write_lock`.
> > +* **rmap locks** - When trying to access VMAs through the reverse mapping via a
> > + :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
> > + (reachable from a folio via :c:member:`!folio->mapping`) VMAs must be stabilised via
>
> missing dot between sentences?
Ack
>
> > +These fields describes the size, start and end of the VMA, and as such cannot be
> > +modified without first being hidden from the reverse mapping since these fields
> > +are used to locate VMAs within the reverse mapping interval trees.
>
> still a typo here, "these fields describes"
This is how we speak in Devon ;) will fix (the typo that is, not my
Devonshire dialect).
>
> > +.. note:: In instances where the architecture supports fewer page tables than
> > + five the kernel cleverly 'folds' page table levels, that is stubbing
> > + out functions related to the skipped levels. This allows us to
> > + conceptually act is if there were always five levels, even if the
>
> typo: s/is if/as if/
Ack fixed.
>
> > +1. **Traversing** page tables - Simply reading page tables in order to traverse
> > + them. This only requires that the VMA is kept stable, so a lock which
> > + establishes this suffices for traversal (there are also lockless variants
> > + which eliminate even this requirement, such as :c:func:`!gup_fast`).
> > +2. **Installing** page table mappings - Whether creating a new mapping or
> > + modifying an existing one. This requires that the VMA is kept stable via an
> > + mmap or VMA lock (explicitly not rmap locks).
>
> Arguably clearing A/D bits through the rmap, and switching PTEs
> between present entries and migration entries, counts as "modifying an
> existing one", but the locking for that is like for zapping/unmapping
> (both page_idle_clear_pte_refs and try_to_migrate go through the
> rmap). So "modifying an existing one" either needs more caveats or
> more likely should just be moved to point three.
Yeah I think this is a terminology thing, maybe better to say modifying which
changes _identity_.
I don't think it's right to classify these as zapping. Better to just put a
separate note underneath about this.
Hopefully that should suffice to keep things comprehensible here :)
>
> > +3. **Zapping/unmapping** page table entries - This is what the kernel calls
> > + clearing page table mappings at the leaf level only, whilst leaving all page
> > + tables in place. This is a very common operation in the kernel performed on
> > + file truncation, the :c:macro:`!MADV_DONTNEED` operation via
> > + :c:func:`!madvise`, and others. This is performed by a number of functions
> > + including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`
> > + among others. The VMA need only be kept stable for this operation.
> > +4. **Freeing** page tables - When finally the kernel removes page tables from a
> > + userland process (typically via :c:func:`!free_pgtables`) extreme care must
> > + be taken to ensure this is done safely, as this logic finally frees all page
> > + tables in the specified range, ignoring existing leaf entries (it assumes the
> > + caller has both zapped the range and prevented any further faults or
> > + modifications within it).
> > +
> > +**Traversing** and **zapping** ranges can be performed holding any one of the
> > +locks described in the terminology section above - that is the mmap lock, the
> > +VMA lock or either of the reverse mapping locks.
> > +
> > +That is - as long as you keep the relevant VMA **stable** - you are good to go
> > +ahead and perform these operations on page tables (though internally, kernel
> > +operations that perform writes also acquire internal page table locks to
> > +serialise - see the page table implementation detail section for more details).
> > +
> > +When **installing** page table entries, the mmap or VMA lock mut be held to keep
>
> typo: "must be held"
Ack.
>
> > +When performing a page table traversal and keeping the VMA stable, whether a
> > +read must be performed once and only once or not depends on the architecture
> > +(for instance x86-64 does not require any special precautions).
>
> Nitpick: In theory that'd still be a data race with other threads
> concurrently installing page tables, though in practice compilers
> don't seem to do anything bad with stuff like that.
There's nothing to prevent anything at all... presumably there's some
characteristic of x86-64 that makes all this _somehow_ ok. Maybe somethinmg
to expand on in future.
>
> > +A typical pattern taken when traversing page table entries to install a new
> > +mapping is to optimistically determine whether the page table entry in the table
> > +above is empty, if so, only then acquiring the page table lock and checking
> > +again to see if it was allocated underneath is.
>
> s/ is// ?
s/is/us/ :>) fixed.
>
> > +This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
> > +for the PTE, carefully checking it is as expected, before acquiring the
> > +PTE-specific lock, and then *again* checking that the PMD lock is as expected.
>
> s/PMD lock is/PMD entry is/ ?
ack, fixed.
>
> > +In these instances, it is required that **all** locks are taken, that is
> > +the mmap lock, the VMA lock and the relevant rmap lock.
>
> s/rmap lock/rmap locks/
Hm... you mean ref: the MAP_PRIVATE weirdness meaning we might need
two. Sigh.
I _hate_ anon_vma's. I may do something about them.
But also ack, fixing.
>
> > +VMA read locking is entirely optimistic - if the lock is contended or a competing
> > +write has started, then we do not obtain a read lock.
> > +
> > +A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu` function, which
>
> "is obtained by lock_vma_under_rcu function" sounds weird, maybe
> either "is obtained by lock_vma_under_rcu" or "is obtained by the
> lock_vma_under_rcu function"?
Ack agreed, will fix.
Powered by blists - more mailing lists