lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3593f7ab-c37d-4c1d-bf2f-e47c30bb5d2b@lucifer.local>
Date: Wed, 13 Nov 2024 15:44:11 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Jonathan Corbet <corbet@....net>, Vlastimil Babka <vbabka@...e.cz>,
        Jann Horn <jannh@...gle.com>, Alice Ryhl <aliceryhl@...gle.com>,
        Boqun Feng <boqun.feng@...il.com>,
        Matthew Wilcox <willy@...radead.org>, Mike Rapoport <rppt@...nel.org>,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org, Suren Baghdasaryan <surenb@...gle.com>,
        Hillf Danton <hdanton@...a.com>, Qi Zheng <zhengqi.arch@...edance.com>,
        SeongJae Park <sj@...nel.org>, Bagas Sanjaya <bagasdotme@...il.com>
Subject: Re: [PATCH v2] docs/mm: add VMA locks documentation

On Tue, Nov 12, 2024 at 10:15:44AM -0500, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@...cle.com> [241108 08:57]:
[snip]
> > +Each mm object contains a maple tree data structure which describes all VMAs
> > +within the virtual address space.
> > +
> > +.. note:: An exception to this is the 'gate' VMA which is provided by
> > +          architectures which use :c:struct:`!vsyscall` and is a global static
> > +          object which does not belong to any specific mm.
>
> vvars too?

I'm not sure if that's the case? For instance for x86-64 we have:

/*
 * A pseudo VMA to allow ptrace access for the vsyscall page.  This only
 * covers the 64bit vsyscall page now. 32bit has a real VMA now and does
 * not need special handling anymore:
 */
static const char *gate_vma_name(struct vm_area_struct *vma)
{
	return "[vsyscall]";
}

With no reference to vvar. Also vsyscall exists in a kernel range, and vvar
in a userland mapping (along with the vdso).

I am not hugely expert about this area however and may be missing something.

[snip]

> > +
> > +VMA locks are special in that you must obtain an mmap **write** lock **first**
> > +in order to obtain a VMA **write** lock. A VMA **read** lock however can be
> > +obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
> > +release an RCU lock to lookup the VMA for you).
>
> This reduces the impact of a writer on readers by only impacting
> conflicting areas of the vma tree.
>
> > +
> > +.. note:: The primary users of VMA read locks are page fault handlers, which
> > +          means that without a VMA write lock, page faults will run concurrent with
> > +          whatever you are doing.
>
> This is the primary user in that it's the most frequent, but as we
> unwind other lock messes it is becoming a pattern.
>
> Maybe "the most frequent users" ?
>
>

I mean the only other users (at least in 6.11.y) are uffd and ipv4 tcp, so
I think 'primary' is fine, this doesn't preclude other users but I think
'the most frequent users' vs. 'primary users' is likely to be equally
outdated should we move sufficient other stuff to using VMA locks (and then
docs can be updated too :)

[snip]

> > +.. table:: Config-specific fields
> > +
> > +   ================================= ===================== ======================================== ===============
> > +   Field                             Configuration option  Description                              Write lock
> > +   ================================= ===================== ======================================== ===============
> > +   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
> > +                                                           :c:struct:`!struct anon_vma_name`        VMA write.
> > +                                                           object providing a name for anonymous
> > +                                                           mappings, or :c:macro:`!NULL` if none
> > +                                                           is set or the VMA is file-backed.
>
> These are ref counted and can be shared by more than one vma for
> scalability.

I'm talking about the field, I think this can be treated as an implementation
detail, perhaps?

In this opening section I tried to avoid too much detail as v1 of this
series sort of went all-in and people (rightly!) raised the issue that you
couldn't pull usable information out of it right away and it was too
weighted towards internal mm kernel devs.

I can add a little thing about it here if you want but I feel probably not
necessary?

>
> > +   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
> > +                                                           to perform readahead. This field is      swap-specific
> > +                                                           accessed atomically.                     lock.
> > +   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
> > +                                                           describes the NUMA behaviour of the      VMA write.
> > +                                                           VMA.
>
> These are also ref counted for scalability.

See above.

>
> > +   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
> > +                                                           describes the current state of           numab-specific
> > +                                                           NUMA balancing in relation to this VMA.  lock.
> > +                                                           Updated under mmap read lock by
> > +                                                           :c:func:`!task_numa_work`.
> > +   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
> > +                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
> > +                                                           either of zero size if userfaultfd is
> > +                                                           disabled, or containing a pointer
> > +                                                           to an underlying
> > +                                                           :c:type:`!userfaultfd_ctx` object which
> > +                                                           describes userfaultfd metadata.
> > +   ================================= ===================== ======================================== ===============

[snip]

> > +There is also a file-system specific lock ordering comment located at the top of
> > +:c:macro:`!mm/filemap.c`:
> > +
> > +.. code-block::
> > +
> > +  ->i_mmap_rwsem                (truncate_pagecache)
> > +    ->private_lock              (__free_pte->block_dirty_folio)
> > +      ->swap_lock               (exclusive_swap_page, others)
> > +        ->i_pages lock
> > +
> > +  ->i_rwsem
> > +    ->invalidate_lock           (acquired by fs in truncate path)
> > +      ->i_mmap_rwsem            (truncate->unmap_mapping_range)
> > +
> > +  ->mmap_lock
> > +    ->i_mmap_rwsem
> > +      ->page_table_lock or pte_lock     (various, mainly in memory.c)
> > +        ->i_pages lock  (arch-dependent flush_dcache_mmap_lock)
> > +
> > +  ->mmap_lock
> > +    ->invalidate_lock           (filemap_fault)
> > +      ->lock_page               (filemap_fault, access_process_vm)
> > +
> > +  ->i_rwsem                     (generic_perform_write)
> > +    ->mmap_lock         (fault_in_readable->do_page_fault)
> > +
> > +  bdi->wb.list_lock
> > +    sb_lock                     (fs/fs-writeback.c)
> > +    ->i_pages lock              (__sync_single_inode)
> > +
> > +  ->i_mmap_rwsem
> > +    ->anon_vma.lock             (vma_merge)
> > +
> > +  ->anon_vma.lock
> > +    ->page_table_lock or pte_lock       (anon_vma_prepare and various)
> > +
> > +  ->page_table_lock or pte_lock
> > +    ->swap_lock         (try_to_unmap_one)
> > +    ->private_lock              (try_to_unmap_one)
> > +    ->i_pages lock              (try_to_unmap_one)
> > +    ->lruvec->lru_lock  (follow_page_mask->mark_page_accessed)
> > +    ->lruvec->lru_lock  (check_pte_range->folio_isolate_lru)
> > +    ->private_lock              (folio_remove_rmap_pte->set_page_dirty)
> > +    ->i_pages lock              (folio_remove_rmap_pte->set_page_dirty)
> > +    bdi.wb->list_lock           (folio_remove_rmap_pte->set_page_dirty)
> > +    ->inode->i_lock             (folio_remove_rmap_pte->set_page_dirty)
> > +    bdi.wb->list_lock           (zap_pte_range->set_page_dirty)
> > +    ->inode->i_lock             (zap_pte_range->set_page_dirty)
> > +    ->private_lock              (zap_pte_range->block_dirty_folio)
> > +
> > +Please check the current state of these comments which may have changed since
> > +the time of writing of this document.
>
> hugetlbfs has its own locking and is out of scope.
>

Yeah, I've generally avoided this here as kinda implicitly out of scope, I
mean hugetlbfs has its own VMA locking too.

I think somebody reading this document interested in hugetlbfs internals
would kinda get the point that the lack of reference to it implies we don't
cover it.

But can add a note somewhere in doc to say 'except hugetlbfs'? :>)

> > +
> > +------------------------------
> > +Locking Implementation Details
> > +------------------------------
> > +
> > +Page table locking details
> > +--------------------------
> > +
> > +In addition to the locks described in the terminology section above, we have
> > +additional locks dedicated to page tables:
> > +
> > +* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
> > +  and PUD each make use of the process address space granularity
> > +  :c:member:`!mm->page_table_lock` lock when modified.
> > +
> > +* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
> > +  either kept within the folios describing the page tables or allocated
> > +  separated and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
> > +  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
> > +  mapped into higher memory (if a 32-bit system) and carefully locked via
> > +  :c:func:`!pte_offset_map_lock`.
> > +
> > +These locks represent the minimum required to interact with each page table
> > +level, but there are further requirements.
> > +
> > +Importantly, note that on a **traversal** of page tables, no such locks are
> > +taken. Whether care is taken on reading the page table entries depends on the
> > +architecture, see the section on atomicity below.
> > +
> > +Locking rules
> > +^^^^^^^^^^^^^
> > +
> > +We establish basic locking rules when interacting with page tables:
> > +
> > +* When changing a page table entry the page table lock for that page table
> > +  **must** be held, except if you can safely assume nobody can access the page
> > +  tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
> > +* Reads from and writes to page table entries must be *appropriately*
> > +  atomic. See the section on atomicity below for details.
> > +* Populating previously empty entries requires that the mmap or VMA locks are
> > +  held (read or write), doing so with only rmap locks would be dangerous (see
> > +  the warning below).
>
> Which is the rmap lock?  It's not listed as rmap lock in the rmap file.
>

I explicitly describe what the rmap locks are in the terminology
section. It's just shorthand so I don't have to say 'i_mmap_rwsem read,
write, mmap_lock read, write locks' over + over again :>)

I know this was a bugbear of yours with recent stuff relating to these
locks and I totally get that it's annoying to have this sort of non
standard way of referring to them but I somewhat gave in.

I sort of have a (crazy) desire to do something about this separation at
some point and make it one lock by eliminating the anon_vma stuff... but
that's perhaps an RFC in the future...

[snip]

> > +If a THP collapse (or similar) were to occur then the lock on both pages would
> > +be acquired, so we can ensure this is prevented while the PTE lock is held.
> > +
> > +Installing entries this way ensures mutual exclusion on write.
> > +
>
> I stopped here, but missed the v1 comment time so I'm sending this now.
> ...

Thanks, appreciate your feedback!

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ