lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <le3ykvrau2lbncrjsqll7z6ck43bf3shon4g5ohchxcvcs4fuy@h3pq646xgoz6>
Date: Tue, 12 Nov 2024 10:15:44 -0500
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
        Jonathan Corbet <corbet@....net>, Vlastimil Babka <vbabka@...e.cz>,
        Jann Horn <jannh@...gle.com>, Alice Ryhl <aliceryhl@...gle.com>,
        Boqun Feng <boqun.feng@...il.com>,
        Matthew Wilcox <willy@...radead.org>, Mike Rapoport <rppt@...nel.org>,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org, Suren Baghdasaryan <surenb@...gle.com>,
        Hillf Danton <hdanton@...a.com>, Qi Zheng <zhengqi.arch@...edance.com>,
        SeongJae Park <sj@...nel.org>, Bagas Sanjaya <bagasdotme@...il.com>
Subject: Re: [PATCH v2] docs/mm: add VMA locks documentation

* Lorenzo Stoakes <lorenzo.stoakes@...cle.com> [241108 08:57]:
> Locking around VMAs is complicated and confusing. While we have a number of
> disparate comments scattered around the place, we seem to be reaching a
> level of complexity that justifies a serious effort at clearly documenting
> how locks are expected to be used when it comes to interacting with
> mm_struct and vm_area_struct objects.
> 
> This is especially pertinent as regards the efforts to find sensible
> abstractions for these fundamental objects in kernel rust code whose
> compiler strictly requires some means of expressing these rules (and
> through this expression, self-document these requirements as well as
> enforce them).
> 
> The document limits scope to mmap and VMA locks and those that are
> immediately adjacent and relevant to them - so additionally covers page
> table locking as this is so very closely tied to VMA operations (and relies
> upon us handling these correctly).
> 
> The document tries to cover some of the nastier and more confusing edge
> cases and concerns especially around lock ordering and page table teardown.
> 
> The document is split between generally useful information for users of mm
> interfaces, and separately a section intended for mm kernel developers
> providing a discussion around internal implementation details.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> ---
> 
> REVIEWERS NOTES:
> * As before, for convenience, I've uploaded a render of this document to my
>   website at https://ljs.io/v2/mm/process_addrs
> * You can speed up doc builds by running `make SPHINXDIRS=mm htmldocs`.
> 
> v2:
> * Fixed grammar and silly typos in various places.
> * Further sharpening up of prose.
> * Updated remark about empty -> populated requiring mmap lock not rmap -
>   this goes for populating _anything_, as we don't want to race the gap
>   between zap and freeing of page tables which _assumes_ you can't do this.
> * Clarified point on installing page table entries with rmap locks only.
> * Updated swap_readahead_info and numab state entries to mention other
>   locks/atomicity as per Kirill.
> * Improved description of vma->anon_vma and vma->anon_vma_chain as per
>   Jann.
> * Expanded vma->anon-vma to add more details.
> * Various typos/small tweaks via Jann.
> * Clarified mremap() higher page table lock requirements as per Jann.
> * Clarified that lock_vma_under_rcu() _looks up_ the VMA under RCU as per
>   Jann.
> * Clarified RCU requirement for VMA read lock in VMA lock implementation
>   detail section as per Jann.
> * Removed reference to seqnumber increment on mmap write lock as out of
>   scope at the moment, and incorrect explanation on this (is intended for
>   speculation going forward) as per Jann.
> * Added filemap.c lock ordering also as per Kirill.
> * Made the reference to anon/file-backed interval tree root nodes more
>   explicit in implementation detail section.
> * Added note about `MAP_PRIVATE` being in both anon_vma and i_mmap trees.
> * Expanded description of page table folding as per Bagas.
> * Added missing details about _traversing_ page tables.
> * Added the caveat that we can just go ahead and read higher page table
>   levels if we are simply _traversing_, but if we are to install page table
>   locks must be acquired and the read double-checked.
> * Corrected the comments about gup-fast - we are simply traversing in
>   gup-fast, which like other page table traversal logic does not acquire
>   page table locks, but _also_ does not keep the VMA stable.
> * Added more details about PMD/PTE lock acquisition in
>   __pte__offset_map_lock().
> 
> v1:
> * Removed RFC tag as I think we are iterating towards something workable
>   and there is interest.
> * Cleaned up and sharpened the language, structure and layout. Separated
>   into top-level details and implementation sections as per Alice.
> * Replaced links with rather more readable formatting.
> * Improved valid mmap/VMA lock state table.
> * Put VMA locks section into the process addresses document as per SJ and
>   Mike.
> * Made clear as to read/write operations against VMA object rather than
>   userland memory, as per Mike's suggestion, also that it does not refer to
>   page tables as per Jann.
> * Moved note into main section as per Mike's suggestion.
> * Fixed grammar mistake as per Mike.
> * Converted list-table to table as per Mike.
> * Corrected various typos as per Jann, Suren.
> * Updated reference to page fault arches as per Jann.
> * Corrected mistaken write lock criteria for vm_lock_seq as per Jann.
> * Updated vm_pgoff description to reference CONFIG_ARCH_HAS_PTE_SPECIAL as
>   per Jann.
> * Updated write lock to mmap read for vma->numab_state as per Jann.
> * Clarified that the write lock is on the mmap and VMA lock at VMA
>   granularity earlier in description as per Suren.
> * Added explicit note at top of VMA lock section to explicitly highlight
>   VMA lock semantics as per Suren.
> * Updated required locking for vma lock fields to N/A to avoid confusion as
>   per Suren.
> * Corrected description of mmap_downgrade() as per Suren.
> * Added a note on gate VMAs as per Jann.
> * Explained that taking mmap read lock under VMA lock is a bad idea due to
>   deadlock as per Jann.
> * Discussed atomicity in page table operations as per Jann.
> * Adapted the well thought out page table locking rules as provided by Jann.
> * Added a comment about pte mapping maintaining an RCU read lock.
> * Added clarification on moving page tables as informed by Jann's comments
>   (though it turns out mremap() doesn't necessarily hold all locks if it
>   can resolve races other ways :)
> * Added Jann's diagram showing lock exclusivity characteristics.
> https://lore.kernel.org/all/20241107190137.58000-1-lorenzo.stoakes@oracle.com/
> 
> RFC:
> https://lore.kernel.org/all/20241101185033.131880-1-lorenzo.stoakes@oracle.com/
> 
>  Documentation/mm/process_addrs.rst | 813 +++++++++++++++++++++++++++++
>  1 file changed, 813 insertions(+)
> 
> diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
> index e8618fbc62c9..5aef4fd0e0e9 100644
> --- a/Documentation/mm/process_addrs.rst
> +++ b/Documentation/mm/process_addrs.rst
> @@ -3,3 +3,816 @@
>  =================
>  Process Addresses
>  =================
> +
> +.. toctree::
> +   :maxdepth: 3
> +
> +
> +Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
> +'VMA's of type :c:struct:`!struct vm_area_struct`.
> +
> +Each VMA describes a virtually contiguous memory range with identical
> +attributes, each described by a :c:struct:`!struct vm_area_struct`
> +object. Userland access outside of VMAs is invalid except in the case where an
> +adjacent stack VMA could be extended to contain the accessed address.
> +
> +All VMAs are contained within one and only one virtual address space, described
> +by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
> +threads) which share the virtual address space. We refer to this as the
> +:c:struct:`!mm`.
> +
> +Each mm object contains a maple tree data structure which describes all VMAs
> +within the virtual address space.
> +
> +.. note:: An exception to this is the 'gate' VMA which is provided by
> +          architectures which use :c:struct:`!vsyscall` and is a global static
> +          object which does not belong to any specific mm.

vvars too?

> +
> +-------
> +Locking
> +-------
> +
> +The kernel is designed to be highly scalable against concurrent read operations
> +on VMA **metadata** so a complicated set of locks are required to ensure memory
> +corruption does not occur.
> +
> +.. note:: Locking VMAs for their metadata does not have any impact on the memory
> +          they describe nor the page tables that map them.
> +
> +Terminology
> +-----------
> +
> +* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
> +  which locks at a process address space granularity which can be acquired via
> +  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
> +* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
> +  as a read/write semaphore in practice. A VMA read lock is obtained via
> +  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
> +  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
> +  automatically when the mmap write lock is released). To take a VMA write lock
> +  you **must** have already acquired an :c:func:`!mmap_write_lock`.
> +* **rmap locks** - When trying to access VMAs through the reverse mapping via a
> +  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
> +  (reachable from a folio via :c:member:`!folio->mapping`) VMAs must be stabilised via
> +  :c:func:`!anon_vma_[try]lock_read` or :c:func:`!anon_vma_[try]lock_write` for
> +  anonymous memory and :c:func:`!i_mmap_[try]lock_read` or
> +  :c:func:`!i_mmap_[try]lock_write` for file-backed memory. We refer to these
> +  locks as the reverse mapping locks, or 'rmap locks' for brevity.
> +
> +We discuss page table locks separately in the dedicated section below.
> +
> +The first thing **any** of these locks achieve is to **stabilise** the VMA
> +within the MM tree. That is, guaranteeing that the VMA object will not be
> +deleted from under you nor modified (except for some specific fields
> +described below).
> +
> +Stabilising a VMA also keeps the address space described by it around.
> +
> +Using address space locks
> +-------------------------
> +
> +If you want to **read** VMA metadata fields or just keep the VMA stable, you
> +must do one of the following:
> +
> +* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
> +  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
> +  you're done with the VMA, *or*
> +* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
> +  acquire the lock atomically so might fail, in which case fall-back logic is
> +  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`,
> +  *or*
> +* Acquire an rmap lock before traversing the locked interval tree (whether
> +  anonymous or file-backed) to obtain the required VMA.
> +
> +If you want to **write** VMA metadata fields, then things vary depending on the
> +field (we explore each VMA field in detail below). For the majority you must:
> +
> +* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
> +  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
> +  you're done with the VMA, *and*
> +* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
> +  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
> +  called.
> +* If you want to be able to write to **any** field, you must also hide the VMA
> +  from the reverse mapping by obtaining an **rmap write lock**.
> +
> +VMA locks are special in that you must obtain an mmap **write** lock **first**
> +in order to obtain a VMA **write** lock. A VMA **read** lock however can be
> +obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
> +release an RCU lock to lookup the VMA for you).

This reduces the impact of a writer on readers by only impacting
conflicting areas of the vma tree.

> +
> +.. note:: The primary users of VMA read locks are page fault handlers, which
> +          means that without a VMA write lock, page faults will run concurrent with
> +          whatever you are doing.

This is the primary user in that it's the most frequent, but as we
unwind other lock messes it is becoming a pattern.

Maybe "the most frequent users" ?


> +
> +Examining all valid lock states:
> +
> +.. table::
> +
> +   ========= ======== ========= ======= ===== =========== ==========
> +   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
> +   ========= ======== ========= ======= ===== =========== ==========
> +   \-        \-       \-        N       N     N           N
> +   \-        R        \-        Y       Y     N           N
> +   \-        \-       R/W       Y       Y     N           N
> +   R/W       \-/R     \-/R/W    Y       Y     N           N
> +   W         W        \-/R      Y       Y     Y           N
> +   W         W        W         Y       Y     Y           Y
> +   ========= ======== ========= ======= ===== =========== ==========
> +
> +.. warning:: While it's possible to obtain a VMA lock while holding an mmap read lock,
> +             attempting to do the reverse is invalid as it can result in deadlock - if
> +             another task already holds an mmap write lock and attempts to acquire a VMA
> +             write lock that will deadlock on the VMA read lock.
> +
> +All of these locks behave as read/write semaphores in practice, so you can
> +obtain either a read or a write lock for each of these.
> +
> +.. note:: Generally speaking, a read/write semaphore is a class of lock which
> +          permits concurrent readers. However a write lock can only be obtained
> +          once all readers have left the critical region (and pending readers
> +          made to wait).
> +
> +          This renders read locks on a read/write semaphore concurrent with other
> +          readers and write locks exclusive against all others holding the semaphore.
> +
> +VMA fields
> +^^^^^^^^^^
> +
> +We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
> +easier to explore their locking characteristics:
> +
> +.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
> +          are in effect an internal implementation detail.
> +
> +.. table:: Virtual layout fields
> +
> +   ===================== ======================================== ===========
> +   Field                 Description                              Write lock
> +   ===================== ======================================== ===========
> +   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
> +                         VMA describes.                           VMA write,
> +                                                                  rmap write.
> +   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
> +                         VMA describes.                           VMA write,
> +                                                                  rmap write.
> +   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
> +                         the original page offset within the      VMA write,
> +                         virtual address space (prior to any      rmap write.
> +                         :c:func:`!mremap`), or PFN if a PFN map
> +                         and the architecture does not support
> +                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
> +   ===================== ======================================== ===========
> +
> +These fields describes the size, start and end of the VMA, and as such cannot be
> +modified without first being hidden from the reverse mapping since these fields
> +are used to locate VMAs within the reverse mapping interval trees.
> +
> +.. table:: Core fields
> +
> +   ============================ ======================================== =========================
> +   Field                        Description                              Write lock
> +   ============================ ======================================== =========================
> +   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
> +                                                                         initial map.
> +   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
> +                                protection bits determined from VMA
> +                                flags.
> +   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
> +                                attributes of the VMA, in union with
> +                                private writable
> +                                :c:member:`!__vm_flags`.
> +   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
> +                                field, updated by
> +                                :c:func:`!vm_flags_*` functions.
> +   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
> +                                struct file object describing the        initial map.
> +                                underlying file, if anonymous then
> +                                :c:macro:`!NULL`.
> +   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - Written once on
> +                                the driver or file-system provides a     initial map by
> +                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
> +                                object describing callbacks to be
> +                                invoked on VMA lifetime events.
> +   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
> +                                driver-specific metadata.
> +   ============================ ======================================== =========================
> +
> +These are the core fields which describe the MM the VMA belongs to and its attributes.
> +
> +.. table:: Config-specific fields
> +
> +   ================================= ===================== ======================================== ===============
> +   Field                             Configuration option  Description                              Write lock
> +   ================================= ===================== ======================================== ===============
> +   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
> +                                                           :c:struct:`!struct anon_vma_name`        VMA write.
> +                                                           object providing a name for anonymous
> +                                                           mappings, or :c:macro:`!NULL` if none
> +                                                           is set or the VMA is file-backed.

These are ref counted and can be shared by more than one vma for
scalability.

> +   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
> +                                                           to perform readahead. This field is      swap-specific
> +                                                           accessed atomically.                     lock.
> +   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
> +                                                           describes the NUMA behaviour of the      VMA write.
> +                                                           VMA.

These are also ref counted for scalability.

> +   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
> +                                                           describes the current state of           numab-specific
> +                                                           NUMA balancing in relation to this VMA.  lock.
> +                                                           Updated under mmap read lock by
> +                                                           :c:func:`!task_numa_work`.
> +   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
> +                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
> +                                                           either of zero size if userfaultfd is
> +                                                           disabled, or containing a pointer
> +                                                           to an underlying
> +                                                           :c:type:`!userfaultfd_ctx` object which
> +                                                           describes userfaultfd metadata.
> +   ================================= ===================== ======================================== ===============
> +
> +These fields are present or not depending on whether the relevant kernel
> +configuration option is set.
> +
> +.. table:: Reverse mapping fields
> +
> +   =================================== ========================================= ============================
> +   Field                               Description                               Write lock
> +   =================================== ========================================= ============================
> +   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write, VMA write,
> +                                       mapping is file-backed, to place the VMA  i_mmap write.
> +                                       in the
> +                                       :c:member:`!struct address_space->i_mmap`
> +                                       red/black interval tree.
> +   :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write, VMA write,
> +                                       interval tree if the VMA is file-backed.  i_mmap write.
> +   :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW’d     mmap read, anon_vma write.
> +                                       :c:type:`!anon_vma` objects and
> +                                       :c:member:`!vma->anon_vma` if it is
> +                                       non-:c:macro:`!NULL`.
> +   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        When :c:macro:`NULL` and
> +                                       anonymous folios mapped exclusively to    setting non-:c:macro:`NULL`:
> +                                       this VMA. Initially set by                mmap read, page_table_lock.
> +                                       :c:func:`!anon_vma_prepare` serialised
> +                                       by the :c:macro:`!page_table_lock`. This  When non-:c:macro:`NULL` and
> +                                       is set as soon as any page is faulted in. setting :c:macro:`NULL`:
> +                                                                                 mmap write, VMA write,
> +                                                                                 anon_vma write.
> +   =================================== ========================================= ============================
> +
> +These fields are used to both place the VMA within the reverse mapping, and for
> +anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
> +and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this VMA should
> +reside.
> +
> +.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
> +          then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap`
> +          trees at the same time, so all of these fields might be utilised at
> +          once.
> +
> +Page tables
> +-----------
> +
> +We won't speak exhaustively on the subject but broadly speaking, page tables map
> +virtual addresses to physical ones through a series of page tables, each of
> +which contain entries with physical addresses for the next page table level
> +(along with flags), and at the leaf level the physical addresses of the
> +underlying physical data pages or a special entry such as a swap entry,
> +migration entry or other special marker. Offsets into these pages are provided
> +by the virtual address itself.
> +
> +In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
> +pages might eliminate one or two of these levels, but when this is the case we
> +typically refer to the leaf level as the PTE level regardless.
> +
> +.. note:: In instances where the architecture supports fewer page tables than
> +	  five the kernel cleverly 'folds' page table levels, that is stubbing
> +	  out functions related to the skipped levels. This allows us to
> +	  conceptually act is if there were always five levels, even if the
> +	  compiler might, in practice, eliminate any code relating to missing
> +	  ones.
> +
> +There are free key operations typically performed on page tables:
> +
> +1. **Traversing** page tables - Simply reading page tables in order to traverse
> +   them. This only requires that the VMA is kept stable, so a lock which
> +   establishes this suffices for traversal (there are also lockless variants
> +   which eliminate even this requirement, such as :c:func:`!gup_fast`).
> +2. **Installing** page table mappings - Whether creating a new mapping or
> +   modifying an existing one. This requires that the VMA is kept stable via an
> +   mmap or VMA lock (explicitly not rmap locks).
> +3. **Zapping/unmapping** page table entries - This is what the kernel calls
> +   clearing page table mappings at the leaf level only, whilst leaving all page
> +   tables in place. This is a very common operation in the kernel performed on
> +   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
> +   :c:func:`!madvise`, and others. This is performed by a number of functions
> +   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`
> +   among others. The VMA need only be kept stable for this operation.
> +4. **Freeing** page tables - When finally the kernel removes page tables from a
> +   userland process (typically via :c:func:`!free_pgtables`) extreme care must
> +   be taken to ensure this is done safely, as this logic finally frees all page
> +   tables in the specified range, ignoring existing leaf entries (it assumes the
> +   caller has both zapped the range and prevented any further faults or
> +   modifications within it).
> +
> +**Traversing** and **zapping** ranges can be performed holding any one of the
> +locks described in the terminology section above - that is the mmap lock, the
> +VMA lock or either of the reverse mapping locks.
> +
> +That is - as long as you keep the relevant VMA **stable** - you are good to go
> +ahead and perform these operations on page tables (though internally, kernel
> +operations that perform writes also acquire internal page table locks to
> +serialise - see the page table implementation detail section for more details).
> +
> +When **installing** page table entries, the mmap or VMA lock mut be held to keep
> +the VMA stable. We explore why this is in the page table locking details section
> +below.
> +
> +**Freeing** page tables is an entirely internal memory management operation and
> +has special requirements (see the page freeing section below for more details).
> +
> +.. warning:: When **freeing** page tables, it must not be possible for VMAs
> +             containing the ranges those page tables map to be accessible via
> +             the reverse mapping.
> +
> +             The :c:func:`!free_pgtables` function removes the relevant VMAs
> +             from the reverse mappings, but no other VMAs can be permitted to be
> +             accessible and span the specified range.
> +
> +Lock ordering
> +-------------
> +
> +As we have multiple locks across the kernel which may or may not be taken at the
> +same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
> +the **order** in which locks are acquired and released becomes very important.
> +
> +.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
> +   but in doing so inadvertently cause a mutual deadlock.
> +
> +   For example, consider thread 1 which holds lock A and tries to acquire lock B,
> +   while thread 2 holds lock B and tries to acquire lock A.
> +
> +   Both threads are now deadlocked on each other. However, had they attempted to
> +   acquire locks in the same order, one would have waited for the other to
> +   complete its work and no deadlock would have occurred.
> +
> +The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required
> +ordering of locks within memory management code:
> +
> +.. code-block::
> +
> +  inode->i_rwsem        (while writing or truncating, not reading or faulting)
> +    mm->mmap_lock
> +      mapping->invalidate_lock (in filemap_fault)
> +        folio_lock
> +          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
> +            vma_start_write
> +              mapping->i_mmap_rwsem
> +                anon_vma->rwsem
> +                  mm->page_table_lock or pte_lock
> +                    swap_lock (in swap_duplicate, swap_info_get)
> +                      mmlist_lock (in mmput, drain_mmlist and others)
> +                      mapping->private_lock (in block_dirty_folio)
> +                          i_pages lock (widely used)
> +                            lruvec->lru_lock (in folio_lruvec_lock_irq)
> +                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> +                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> +                        sb_lock (within inode_lock in fs/fs-writeback.c)
> +                        i_pages lock (widely used, in set_page_dirty,
> +                                  in arch-dependent flush_dcache_mmap_lock,
> +                                  within bdi.wb->list_lock in __sync_single_inode)
> +
> +There is also a file-system specific lock ordering comment located at the top of
> +:c:macro:`!mm/filemap.c`:
> +
> +.. code-block::
> +
> +  ->i_mmap_rwsem                (truncate_pagecache)
> +    ->private_lock              (__free_pte->block_dirty_folio)
> +      ->swap_lock               (exclusive_swap_page, others)
> +        ->i_pages lock
> +
> +  ->i_rwsem
> +    ->invalidate_lock           (acquired by fs in truncate path)
> +      ->i_mmap_rwsem            (truncate->unmap_mapping_range)
> +
> +  ->mmap_lock
> +    ->i_mmap_rwsem
> +      ->page_table_lock or pte_lock     (various, mainly in memory.c)
> +        ->i_pages lock  (arch-dependent flush_dcache_mmap_lock)
> +
> +  ->mmap_lock
> +    ->invalidate_lock           (filemap_fault)
> +      ->lock_page               (filemap_fault, access_process_vm)
> +
> +  ->i_rwsem                     (generic_perform_write)
> +    ->mmap_lock         (fault_in_readable->do_page_fault)
> +
> +  bdi->wb.list_lock
> +    sb_lock                     (fs/fs-writeback.c)
> +    ->i_pages lock              (__sync_single_inode)
> +
> +  ->i_mmap_rwsem
> +    ->anon_vma.lock             (vma_merge)
> +
> +  ->anon_vma.lock
> +    ->page_table_lock or pte_lock       (anon_vma_prepare and various)
> +
> +  ->page_table_lock or pte_lock
> +    ->swap_lock         (try_to_unmap_one)
> +    ->private_lock              (try_to_unmap_one)
> +    ->i_pages lock              (try_to_unmap_one)
> +    ->lruvec->lru_lock  (follow_page_mask->mark_page_accessed)
> +    ->lruvec->lru_lock  (check_pte_range->folio_isolate_lru)
> +    ->private_lock              (folio_remove_rmap_pte->set_page_dirty)
> +    ->i_pages lock              (folio_remove_rmap_pte->set_page_dirty)
> +    bdi.wb->list_lock           (folio_remove_rmap_pte->set_page_dirty)
> +    ->inode->i_lock             (folio_remove_rmap_pte->set_page_dirty)
> +    bdi.wb->list_lock           (zap_pte_range->set_page_dirty)
> +    ->inode->i_lock             (zap_pte_range->set_page_dirty)
> +    ->private_lock              (zap_pte_range->block_dirty_folio)
> +
> +Please check the current state of these comments which may have changed since
> +the time of writing of this document.

hugetlbfs has its own locking and is out of scope.

> +
> +------------------------------
> +Locking Implementation Details
> +------------------------------
> +
> +Page table locking details
> +--------------------------
> +
> +In addition to the locks described in the terminology section above, we have
> +additional locks dedicated to page tables:
> +
> +* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
> +  and PUD each make use of the process address space granularity
> +  :c:member:`!mm->page_table_lock` lock when modified.
> +
> +* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
> +  either kept within the folios describing the page tables or allocated
> +  separated and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
> +  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
> +  mapped into higher memory (if a 32-bit system) and carefully locked via
> +  :c:func:`!pte_offset_map_lock`.
> +
> +These locks represent the minimum required to interact with each page table
> +level, but there are further requirements.
> +
> +Importantly, note that on a **traversal** of page tables, no such locks are
> +taken. Whether care is taken on reading the page table entries depends on the
> +architecture, see the section on atomicity below.
> +
> +Locking rules
> +^^^^^^^^^^^^^
> +
> +We establish basic locking rules when interacting with page tables:
> +
> +* When changing a page table entry the page table lock for that page table
> +  **must** be held, except if you can safely assume nobody can access the page
> +  tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
> +* Reads from and writes to page table entries must be *appropriately*
> +  atomic. See the section on atomicity below for details.
> +* Populating previously empty entries requires that the mmap or VMA locks are
> +  held (read or write), doing so with only rmap locks would be dangerous (see
> +  the warning below).

Which is the rmap lock?  It's not listed as rmap lock in the rmap file.

> +* As mentioned previously, zapping can be performed while simply keeping the VMA
> +  stable, that is holding any one of the mmap, VMA or rmap locks.
> +* Special care is required for PTEs, as on 32-bit architectures these must be
> +  mapped into high memory and additionally, careful consideration must be
> +  applied to racing with THP, migration or other concurrent kernel operations
> +  that might steal the entire PTE table from under us. All this is handled by
> +  :c:func:`!pte_offset_map_lock` (see the section on page table installation
> +  below for more details).
> +
> +.. warning:: Populating previously empty entries is dangerous as, when unmapping
> +             VMAs, :c:func:`!vms_clear_ptes` has a window of time between
> +             zapping (via :c:func:`!unmap_vmas`) and freeing page tables (via
> +             :c:func:`!free_pgtables`), where the VMA is still visible in the
> +             rmap tree. :c:func:`!free_pgtables` assumes that the zap has
> +             already been performed and removes PTEs unconditionally (along with
> +             all other page tables in the freed range), so installing new PTE
> +             entries could leak memory and also cause other unexpected and
> +             dangerous behaviour.
> +
> +There are additional rules applicable when moving page tables, which we discuss
> +in the section on this topic below.
> +
> +.. note:: Interestingly, :c:func:`!pte_offset_map_lock` holds an RCU read lock
> +          while the PTE page table lock is held.
> +
> +Atomicity
> +^^^^^^^^^
> +
> +Regardless of page table locks, the MMU hardware concurrently updates accessed
> +and dirty bits (perhaps more, depending on architecture). Additionally, page
> +table traversal operations in parallel (though holding the VMA stable) and
> +functionality like GUP-fast locklessly traverses (that is reads) page tables,
> +without even keeping the VMA stable at all.
> +
> +When performing a page table traversal and keeping the VMA stable, whether a
> +read must be performed once and only once or not depends on the architecture
> +(for instance x86-64 does not require any special precautions).
> +
> +It is on the write side, or if a read informs whether a write takes place (on an
> +installation of a page table entry say, for instance in
> +:c:func:`!__pud_install`), where special care must always be taken. In these
> +cases we can never assume that page table locks give us entirely exclusive
> +access, and must retrieve page table entries once and only once.
> +
> +If we are reading page table entries, then we need only ensure that the compiler
> +does not rearrange our loads. This is achieved via :c:func:`!pXXp_get`
> +functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`,
> +:c:func:`!pmdp_get`, and :c:func:`!ptep_get`.
> +
> +Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
> +the page table entry only once.
> +
> +However, if we wish to manipulate an existing page table entry and care about
> +the previously stored data, we must go further and use an hardware atomic
> +operation as, for example, in :c:func:`!ptep_get_and_clear`.
> +
> +Equally, operations that do not rely on the VMA being held stable, such as
> +GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like
> +:c:func:`!gup_fast_pte_range`), must very carefully interact with page table
> +entries, using functions such as :c:func:`!ptep_get_lockless` and equivalent for
> +higher level page table levels.
> +
> +Writes to page table entries must also be appropriately atomic, as established
> +by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
> +:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.
> +
> +Equally functions which clear page table entries must be appropriately atomic,
> +as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
> +:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
> +:c:func:`!pte_clear`.
> +
> +Page table installation
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Page table installation is performed with the VMA held stable explicitly by an
> +mmap or VMA lock in read or write mode (see the warning in the locking rules
> +section for details as to why).
> +
> +When allocating a P4D, PUD or PMD and setting the relevant entry in the above
> +PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
> +acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
> +:c:func:`!__pmd_alloc` respectively.
> +
> +.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
> +   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
> +   references the :c:member:`!mm->page_table_lock`.
> +
> +Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
> +:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
> +physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
> +:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
> +:c:func:`!__pte_alloc`.
> +
> +Finally, modifying the contents of the PTE requires special treatment, as the
> +PTE page table lock must be acquired whenever we want stable and exclusive
> +access to entries contained within a PTE, especially when we wish to modify
> +them.
> +
> +This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
> +ensure that the PTE hasn't changed from under us, ultimately invoking
> +:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
> +the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
> +must be released via :c:func:`!pte_unmap_unlock`.
> +
> +.. note:: There are some variants on this, such as
> +   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
> +   for brevity we do not explore this.  See the comment for
> +   :c:func:`!__pte_offset_map_lock` for more details.
> +
> +When modifying data in ranges we typically only wish to allocate higher page
> +tables as necessary, using these locks to avoid races or overwriting anything,
> +and set/clear data at the PTE level as required (for instance when page faulting
> +or zapping).
> +
> +A typical pattern taken when traversing page table entries to install a new
> +mapping is to optimistically determine whether the page table entry in the table
> +above is empty, if so, only then acquiring the page table lock and checking
> +again to see if it was allocated underneath is.
> +
> +This allows for a traversal with page table locks only being taken when
> +required. An example of this is :c:func:`!__pud_alloc`.
> +
> +At the leaf page table, that is the PTE, we can't entirely rely on this pattern
> +as we have separate PMD and PTE locks and a THP collapse for instance might have
> +eliminated the PMD entry as well as the PTE from under us.
> +
> +This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
> +for the PTE, carefully checking it is as expected, before acquiring the
> +PTE-specific lock, and then *again* checking that the PMD lock is as expected.
> +
> +If a THP collapse (or similar) were to occur then the lock on both pages would
> +be acquired, so we can ensure this is prevented while the PTE lock is held.
> +
> +Installing entries this way ensures mutual exclusion on write.
> +

I stopped here, but missed the v1 comment time so I'm sending this now.
...

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ