Message-ID: <lvirfevzcrnkddmdsp456dzbb2f7ahd547zv4yy5syq3en6sjz@htyzuesvvezr>
Date: Mon, 11 Nov 2024 21:48:23 -0500
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Suren Baghdasaryan <surenb@...gle.com>
Cc: akpm@...ux-foundation.org, willy@...radead.org, lorenzo.stoakes@...cle.com,
mhocko@...e.com, vbabka@...e.cz, hannes@...xchg.org, mjguzik@...il.com,
oliver.sang@...el.com, mgorman@...hsingularity.net, david@...hat.com,
peterx@...hat.com, oleg@...hat.com, dave@...olabs.net,
paulmck@...nel.org, brauner@...nel.org, dhowells@...hat.com,
hdanton@...a.com, hughd@...gle.com, minchan@...gle.com,
jannh@...gle.com, shakeel.butt@...ux.dev, souravpanda@...gle.com,
pasha.tatashin@...een.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, kernel-team@...roid.com
Subject: Re: [PATCH 0/4] move per-vma lock into vm_area_struct
* Suren Baghdasaryan <surenb@...gle.com> [241111 16:41]:
> On Mon, Nov 11, 2024 at 12:55 PM Suren Baghdasaryan <surenb@...gle.com> wrote:
> >
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> > This patchset moves vm_lock back into vm_area_struct, aligning it at the
> > cacheline boundary and changing the vm_area_struct cache to be
> > cacheline-aligned as well.
> > This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> > (vm_lock) bytes to 256 bytes:
> >
> > slabinfo before:
> > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > vma_lock ... 40 102 1 : ...
> > vm_area_struct ... 160 51 2 : ...
> >
> > slabinfo after moving vm_lock:
> > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > vm_area_struct ... 256 32 2 : ...
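
Side note on the "cacheline-aligned" cache mentioned above: a minimal
sketch of what the slab cache setup could look like once the lock is
embedded. vm_area_cachep does live in kernel/fork.c, but the exact flags
and arguments below are assumptions for illustration, not the patch itself:

/* Hypothetical sketch: create the vm_area_struct cache with the struct's
 * own (64-byte) alignment so that the embedded lock lands on a cacheline
 * boundary in every object. */
vm_area_cachep = kmem_cache_create("vm_area_struct",
				   sizeof(struct vm_area_struct),
				   __alignof__(struct vm_area_struct),
				   SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
				   NULL);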
> >
> > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> > which is 5.5MB per 100000 VMAs.
> > To minimize memory overhead, the vm_lock implementation is changed from
> > using an rw_semaphore (40 bytes) to an atomic (8 bytes), and several
> > vm_area_struct members are moved into the last cacheline, resulting
> > in a less fragmented structure:
Wait a second, this is taking 40B down to 8B, but the alignment of the
vma will surely absorb that 32B difference? The struct sum is 153B
according to what you have below so we won't go over 192B. What am I
missing?
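
For anyone skimming the thread, a minimal sketch of the sort of scheme an
rw_semaphore -> atomic conversion implies; the encoding and helper names
below are invented for illustration and are not taken from the patches:

#include <linux/atomic.h>

/*
 * Hypothetical encoding: 0 = unlocked, positive = reader count,
 * negative = write-locked.  The real series presumably also has to
 * deal with waiting/wakeups and keep vm_lock_seq in sync.
 */
struct vma_lock {
	atomic_t count;
};

/* Readers only trylock; on failure the fault path falls back to
 * mmap_lock, as per-vma locking already does today. */
static inline bool vma_lock_read_trylock(struct vma_lock *lock)
{
	int c = atomic_read(&lock->count);

	while (c >= 0) {
		/* atomic_try_cmpxchg() refreshes 'c' on failure,
		 * so we retry until a writer shows up. */
		if (atomic_try_cmpxchg(&lock->count, &c, c + 1))
			return true;
	}
	return false;
}

static inline void vma_lock_read_unlock(struct vma_lock *lock)
{
	atomic_dec(&lock->count);
}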
> >
> > struct vm_area_struct {
> > union {
> > struct {
> > long unsigned int vm_start; /* 0 8 */
> > long unsigned int vm_end; /* 8 8 */
> > }; /* 0 16 */
> > struct callback_head vm_rcu ; /* 0 16 */
> > } __attribute__((__aligned__(8))); /* 0 16 */
> > struct mm_struct * vm_mm; /* 16 8 */
> > pgprot_t vm_page_prot; /* 24 8 */
> > union {
> > const vm_flags_t vm_flags; /* 32 8 */
> > vm_flags_t __vm_flags; /* 32 8 */
> > }; /* 32 8 */
> > bool detached; /* 40 1 */
> >
> > /* XXX 3 bytes hole, try to pack */
> >
> > unsigned int vm_lock_seq; /* 44 4 */
> > struct list_head anon_vma_chain; /* 48 16 */
> > /* --- cacheline 1 boundary (64 bytes) --- */
> > struct anon_vma * anon_vma; /* 64 8 */
> > const struct vm_operations_struct * vm_ops; /* 72 8 */
> > long unsigned int vm_pgoff; /* 80 8 */
> > struct file * vm_file; /* 88 8 */
> > void * vm_private_data; /* 96 8 */
> > atomic_long_t swap_readahead_info; /* 104 8 */
> > struct mempolicy * vm_policy; /* 112 8 */
> >
> > /* XXX 8 bytes hole, try to pack */
> >
> > /* --- cacheline 2 boundary (128 bytes) --- */
> > struct vma_lock vm_lock (__aligned__(64)); /* 128 4 */
> >
> > /* XXX 4 bytes hole, try to pack */
> >
> > struct {
> > struct rb_node rb (__aligned__(8)); /* 136 24 */
> > long unsigned int rb_subtree_last; /* 160 8 */
> > } __attribute__((__aligned__(8))) shared; /* 136 32 */
> > struct vm_userfaultfd_ctx vm_userfaultfd_ctx; /* 168 0 */
> >
> > /* size: 192, cachelines: 3, members: 17 */
> > /* sum members: 153, holes: 3, sum holes: 15 */
> > /* padding: 24 */
> > /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> > } __attribute__((__aligned__(64)));
> >
> > Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
> > to the 50 pages in the baseline:
> >
> > slabinfo after vm_area_struct changes:
> > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > vm_area_struct ... 192 42 2 : ...
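
FWIW, the page counts quoted in the series are consistent with simply
counting whole slabs per 1000 VMAs (4K pages, objs-per-slab taken from the
slabinfo snippets above):

  baseline (160B + 40B):         ceil(1000/51)*2 + ceil(1000/102)*1 = 40 + 10 = 50 pages
  after moving vm_lock (256B):   ceil(1000/32)*2                    = 64 pages
  after the full series (192B):  ceil(1000/42)*2                    = 48 pages

so the interim growth is 14 pages (56KB) per 1000 VMAs, i.e. ~5.5MB per
100000 VMAs, and the final state saves 2 pages per 1000 VMAs over baseline.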
> >
> > Performance measurements using the pft test on x86 do not show a
> > considerable difference; on Pixel 6 running Android, the change results
> > in a 3-5% improvement in faults per second.
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
>
> And of course I forgot to update Lorenzo's new locking documentation :/
> Will add that in the next version.
>
> >
> > Suren Baghdasaryan (4):
> > mm: introduce vma_start_read_locked{_nested} helpers
> > mm: move per-vma lock into vm_area_struct
> > mm: replace rw_semaphore with atomic_t in vma_lock
> > mm: move lesser used vma_area_struct members into the last cacheline
> >
> > include/linux/mm.h | 163 +++++++++++++++++++++++++++++++++++---
> > include/linux/mm_types.h | 59 +++++++++-----
> > include/linux/mmap_lock.h | 3 +
> > kernel/fork.c | 50 ++----------
> > mm/init-mm.c | 2 +
> > mm/userfaultfd.c | 14 ++--
> > 6 files changed, 205 insertions(+), 86 deletions(-)
> >
> >
> > base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
> > --
> > 2.47.0.277.g8800431eea-goog
> >