[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAG48ez2=oP15_X_MqtDG22P6TZYpTr07-TZBk3Z_DvuwB6nJFQ@mail.gmail.com>
Date: Wed, 13 Nov 2024 20:46:39 +0100
From: Jann Horn <jannh@...gle.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Jonathan Corbet <corbet@....net>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>, Vlastimil Babka <vbabka@...e.cz>,
Alice Ryhl <aliceryhl@...gle.com>, Boqun Feng <boqun.feng@...il.com>,
Matthew Wilcox <willy@...radead.org>, Mike Rapoport <rppt@...nel.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
Suren Baghdasaryan <surenb@...gle.com>, Hillf Danton <hdanton@...a.com>,
Qi Zheng <zhengqi.arch@...edance.com>, SeongJae Park <sj@...nel.org>,
Bagas Sanjaya <bagasdotme@...il.com>
Subject: Re: [PATCH v2] docs/mm: add VMA locks documentation
On Fri, Nov 8, 2024 at 2:57 PM Lorenzo Stoakes
<lorenzo.stoakes@...cle.com> wrote:
> Locking around VMAs is complicated and confusing. While we have a number of
> disparate comments scattered around the place, we seem to be reaching a
> level of complexity that justifies a serious effort at clearly documenting
> how locks are expected to be used when it comes to interacting with
> mm_struct and vm_area_struct objects.
>
> This is especially pertinent as regards the efforts to find sensible
> abstractions for these fundamental objects in kernel rust code whose
> compiler strictly requires some means of expressing these rules (and
> through this expression, self-document these requirements as well as
> enforce them).
>
> The document limits scope to mmap and VMA locks and those that are
> immediately adjacent and relevant to them - so additionally covers page
> table locking as this is so very closely tied to VMA operations (and relies
> upon us handling these correctly).
>
> The document tries to cover some of the nastier and more confusing edge
> cases and concerns especially around lock ordering and page table teardown.
>
> The document is split between generally useful information for users of mm
> interfaces, and separately a section intended for mm kernel developers
> providing a discussion around internal implementation details.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Reviewed-by: Jann Horn <jannh@...gle.com>
except some typos and one inaccuracy:
> +* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
> + which locks at a process address space granularity which can be acquired via
> + :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
> +* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
> + as a read/write semaphore in practice. A VMA read lock is obtained via
> + :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
> + write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
> + automatically when the mmap write lock is released). To take a VMA write lock
> + you **must** have already acquired an :c:func:`!mmap_write_lock`.
> +* **rmap locks** - When trying to access VMAs through the reverse mapping via a
> + :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
> + (reachable from a folio via :c:member:`!folio->mapping`) VMAs must be stabilised via
missing dot between sentences?
> +These fields describes the size, start and end of the VMA, and as such cannot be
> +modified without first being hidden from the reverse mapping since these fields
> +are used to locate VMAs within the reverse mapping interval trees.
still a typo here, "these fields describes"
> +.. note:: In instances where the architecture supports fewer page tables than
> + five the kernel cleverly 'folds' page table levels, that is stubbing
> + out functions related to the skipped levels. This allows us to
> + conceptually act is if there were always five levels, even if the
typo: s/is if/as if/
> +1. **Traversing** page tables - Simply reading page tables in order to traverse
> + them. This only requires that the VMA is kept stable, so a lock which
> + establishes this suffices for traversal (there are also lockless variants
> + which eliminate even this requirement, such as :c:func:`!gup_fast`).
> +2. **Installing** page table mappings - Whether creating a new mapping or
> + modifying an existing one. This requires that the VMA is kept stable via an
> + mmap or VMA lock (explicitly not rmap locks).
Arguably clearing A/D bits through the rmap, and switching PTEs
between present entries and migration entries, counts as "modifying an
existing one", but the locking for that is like for zapping/unmapping
(both page_idle_clear_pte_refs and try_to_migrate go through the
rmap). So "modifying an existing one" either needs more caveats or
more likely should just be moved to point three.
> +3. **Zapping/unmapping** page table entries - This is what the kernel calls
> + clearing page table mappings at the leaf level only, whilst leaving all page
> + tables in place. This is a very common operation in the kernel performed on
> + file truncation, the :c:macro:`!MADV_DONTNEED` operation via
> + :c:func:`!madvise`, and others. This is performed by a number of functions
> + including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`
> + among others. The VMA need only be kept stable for this operation.
> +4. **Freeing** page tables - When finally the kernel removes page tables from a
> + userland process (typically via :c:func:`!free_pgtables`) extreme care must
> + be taken to ensure this is done safely, as this logic finally frees all page
> + tables in the specified range, ignoring existing leaf entries (it assumes the
> + caller has both zapped the range and prevented any further faults or
> + modifications within it).
> +
> +**Traversing** and **zapping** ranges can be performed holding any one of the
> +locks described in the terminology section above - that is the mmap lock, the
> +VMA lock or either of the reverse mapping locks.
> +
> +That is - as long as you keep the relevant VMA **stable** - you are good to go
> +ahead and perform these operations on page tables (though internally, kernel
> +operations that perform writes also acquire internal page table locks to
> +serialise - see the page table implementation detail section for more details).
> +
> +When **installing** page table entries, the mmap or VMA lock mut be held to keep
typo: "must be held"
> +When performing a page table traversal and keeping the VMA stable, whether a
> +read must be performed once and only once or not depends on the architecture
> +(for instance x86-64 does not require any special precautions).
Nitpick: In theory that'd still be a data race with other threads
concurrently installing page tables, though in practice compilers
don't seem to do anything bad with stuff like that.
> +A typical pattern taken when traversing page table entries to install a new
> +mapping is to optimistically determine whether the page table entry in the table
> +above is empty, if so, only then acquiring the page table lock and checking
> +again to see if it was allocated underneath is.
s/ is// ?
> +This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
> +for the PTE, carefully checking it is as expected, before acquiring the
> +PTE-specific lock, and then *again* checking that the PMD lock is as expected.
s/PMD lock is/PMD entry is/ ?
> +In these instances, it is required that **all** locks are taken, that is
> +the mmap lock, the VMA lock and the relevant rmap lock.
s/rmap lock/rmap locks/
> +VMA read locking is entirely optimistic - if the lock is contended or a competing
> +write has started, then we do not obtain a read lock.
> +
> +A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu` function, which
"is obtained by lock_vma_under_rcu function" sounds weird, maybe
either "is obtained by lock_vma_under_rcu" or "is obtained by the
lock_vma_under_rcu function"?
Powered by blists - more mailing lists