Message-ID: <CAG48ez2=oP15_X_MqtDG22P6TZYpTr07-TZBk3Z_DvuwB6nJFQ@mail.gmail.com>
Date: Wed, 13 Nov 2024 20:46:39 +0100
From: Jann Horn <jannh@...gle.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Jonathan Corbet <corbet@....net>, 
	"Liam R . Howlett" <Liam.Howlett@...cle.com>, Vlastimil Babka <vbabka@...e.cz>, 
	Alice Ryhl <aliceryhl@...gle.com>, Boqun Feng <boqun.feng@...il.com>, 
	Matthew Wilcox <willy@...radead.org>, Mike Rapoport <rppt@...nel.org>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org, 
	Suren Baghdasaryan <surenb@...gle.com>, Hillf Danton <hdanton@...a.com>, 
	Qi Zheng <zhengqi.arch@...edance.com>, SeongJae Park <sj@...nel.org>, 
	Bagas Sanjaya <bagasdotme@...il.com>
Subject: Re: [PATCH v2] docs/mm: add VMA locks documentation

On Fri, Nov 8, 2024 at 2:57 PM Lorenzo Stoakes
<lorenzo.stoakes@...cle.com> wrote:
> Locking around VMAs is complicated and confusing. While we have a number of
> disparate comments scattered around the place, we seem to be reaching a
> level of complexity that justifies a serious effort at clearly documenting
> how locks are expected to be used when it comes to interacting with
> mm_struct and vm_area_struct objects.
>
> This is especially pertinent as regards the efforts to find sensible
> abstractions for these fundamental objects in kernel rust code whose
> compiler strictly requires some means of expressing these rules (and
> through this expression, self-document these requirements as well as
> enforce them).
>
> The document limits scope to mmap and VMA locks and those that are
> immediately adjacent and relevant to them - so additionally covers page
> table locking as this is so very closely tied to VMA operations (and relies
> upon us handling these correctly).
>
> The document tries to cover some of the nastier and more confusing edge
> cases and concerns especially around lock ordering and page table teardown.
>
> The document is split between generally useful information for users of mm
> interfaces, and separately a section intended for mm kernel developers
> providing a discussion around internal implementation details.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>

Reviewed-by: Jann Horn <jannh@...gle.com>

except for some typos and one inaccuracy:

> +* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
> +  which locks at a process address space granularity which can be acquired via
> +  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
> +* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
> +  as a read/write semaphore in practice. A VMA read lock is obtained via
> +  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
> +  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
> +  automatically when the mmap write lock is released). To take a VMA write lock
> +  you **must** have already acquired an :c:func:`!mmap_write_lock`.
> +* **rmap locks** - When trying to access VMAs through the reverse mapping via a
> +  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
> +  (reachable from a folio via :c:member:`!folio->mapping`) VMAs must be stabilised via

missing dot between sentences?
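
For illustration, the acquisition patterns described above look
roughly like this (sketch only; "mm", "vma" and "addr" are a
hypothetical caller's variables, error handling elided):

        struct vm_area_struct *vma;

        /* mmap read lock: whole-address-space, for readers. */
        mmap_read_lock(mm);
        vma = find_vma(mm, addr);
        /* ... inspect the VMA ... */
        mmap_read_unlock(mm);

        /* VMA write lock: the mmap write lock must already be held;
         * the VMA write lock is dropped implicitly when the mmap
         * write lock is released, so there is no vma_end_write(). */
        mmap_write_lock(mm);
        vma_start_write(vma);
        /* ... modify the VMA ... */
        mmap_write_unlock(mm);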

> +These fields describes the size, start and end of the VMA, and as such cannot be
> +modified without first being hidden from the reverse mapping since these fields
> +are used to locate VMAs within the reverse mapping interval trees.

still a typo here, "these fields describes"

> +.. note:: In instances where the architecture supports fewer page tables than
> +         five the kernel cleverly 'folds' page table levels, that is stubbing
> +         out functions related to the skipped levels. This allows us to
> +         conceptually act is if there were always five levels, even if the

typo: s/is if/as if/
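
(For anyone unfamiliar with the folding trick: with a folded level
the stubs are essentially pass-throughs, in the style of
asm-generic/pgtable-nop4d.h, e.g.:)

        static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
        {
                /* The "P4D" is just the PGD entry in disguise. */
                return (p4d_t *)pgd;
        }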

> +1. **Traversing** page tables - Simply reading page tables in order to traverse
> +   them. This only requires that the VMA is kept stable, so a lock which
> +   establishes this suffices for traversal (there are also lockless variants
> +   which eliminate even this requirement, such as :c:func:`!gup_fast`).
> +2. **Installing** page table mappings - Whether creating a new mapping or
> +   modifying an existing one. This requires that the VMA is kept stable via an
> +   mmap or VMA lock (explicitly not rmap locks).

Arguably clearing A/D bits through the rmap, and switching PTEs
between present entries and migration entries, count as "modifying an
existing one", but the locking for that is like the locking for
zapping/unmapping (both page_idle_clear_pte_refs and try_to_migrate
go through the rmap). So "modifying an existing one" either needs
more caveats or, more likely, should just be moved to point three.

> +3. **Zapping/unmapping** page table entries - This is what the kernel calls
> +   clearing page table mappings at the leaf level only, whilst leaving all page
> +   tables in place. This is a very common operation in the kernel performed on
> +   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
> +   :c:func:`!madvise`, and others. This is performed by a number of functions
> +   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`
> +   among others. The VMA need only be kept stable for this operation.
> +4. **Freeing** page tables - When finally the kernel removes page tables from a
> +   userland process (typically via :c:func:`!free_pgtables`) extreme care must
> +   be taken to ensure this is done safely, as this logic finally frees all page
> +   tables in the specified range, ignoring existing leaf entries (it assumes the
> +   caller has both zapped the range and prevented any further faults or
> +   modifications within it).
> +
> +**Traversing** and **zapping** ranges can be performed holding any one of the
> +locks described in the terminology section above - that is the mmap lock, the
> +VMA lock or either of the reverse mapping locks.
> +
> +That is - as long as you keep the relevant VMA **stable** - you are good to go
> +ahead and perform these operations on page tables (though internally, kernel
> +operations that perform writes also acquire internal page table locks to
> +serialise - see the page table implementation detail section for more details).
> +
> +When **installing** page table entries, the mmap or VMA lock mut be held to keep

typo: "must be held"

> +When performing a page table traversal and keeping the VMA stable, whether a
> +read must be performed once and only once or not depends on the architecture
> +(for instance x86-64 does not require any special precautions).

Nitpick: In theory that'd still be a data race with other threads
concurrently installing page tables, though in practice compilers
don't seem to do anything bad with stuff like that.
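
A pedantic traversal would mark the lockless read explicitly,
something like:

        /* Sketch: an explicitly-marked lockless read of a PMD entry,
         * sidestepping the theoretical data race with a concurrent
         * page table installation. */
        pmd_t pmdval = READ_ONCE(*pmd);

        if (pmd_none(pmdval))
                return NULL;    /* no page table at this level */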

> +A typical pattern taken when traversing page table entries to install a new
> +mapping is to optimistically determine whether the page table entry in the table
> +above is empty, if so, only then acquiring the page table lock and checking
> +again to see if it was allocated underneath is.

s/ is// ?
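
(Sketched out, with "new" standing in for a freshly allocated PTE
table, roughly as in __pte_alloc():)

        if (pmd_none(*pmd)) {                   /* optimistic, unlocked */
                spinlock_t *ptl = pmd_lock(mm, pmd);

                if (pmd_none(*pmd))             /* recheck under the lock */
                        pmd_populate(mm, pmd, new);
                spin_unlock(ptl);
        }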

> +This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
> +for the PTE, carefully checking it is as expected, before acquiring the
> +PTE-specific lock, and then *again* checking that the PMD lock is as expected.

s/PMD lock is/PMD entry is/ ?
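
From a caller's perspective that ends up looking roughly like this
("retry" being some hypothetical fallback path):

        pte_t *pte;
        spinlock_t *ptl;

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                goto retry;     /* the PMD entry changed under us */
        /* ... operate on the PTE under the PTE lock ... */
        pte_unmap_unlock(pte, ptl);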

> +In these instances, it is required that **all** locks are taken, that is
> +the mmap lock, the VMA lock and the relevant rmap lock.

s/rmap lock/rmap locks/
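
(For a file-backed VMA that looks roughly like the sketch below; note
the ordering, mmap lock before VMA lock before rmap lock:)

        mmap_write_lock(mm);
        vma_start_write(vma);
        i_mmap_lock_write(vma->vm_file->f_mapping);     /* file rmap lock */
        /* ... modify fields the rmap interval tree relies on ... */
        i_mmap_unlock_write(vma->vm_file->f_mapping);
        mmap_write_unlock(mm);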

> +VMA read locking is entirely optimistic - if the lock is contended or a competing
> +write has started, then we do not obtain a read lock.
> +
> +A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu` function, which

"is obtained by lock_vma_under_rcu function" sounds weird, maybe
either "is obtained by lock_vma_under_rcu" or "is obtained by the
lock_vma_under_rcu function"?
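
FWIW the usual caller pattern, including the fallback that this
optimism implies, is roughly:

        struct vm_area_struct *vma;

        vma = lock_vma_under_rcu(mm, address);
        if (vma) {
                /* ... handle the fault under the VMA read lock ... */
                vma_end_read(vma);
        } else {
                /* contended, or a competing write is in flight */
                mmap_read_lock(mm);
                vma = find_vma(mm, address);
                /* ... */
                mmap_read_unlock(mm);
        }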
