linux-kernel - Re: [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit mm_lock_seq/vm_lock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEf4BzbOXrbixQA=fpg17QPBv+4myAQrHvCX42hVye0Ww9W2Aw@mail.gmail.com>
Date: Thu, 17 Oct 2024 11:55:35 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Suren Baghdasaryan <surenb@...gle.com>
Cc: Shakeel Butt <shakeel.butt@...ux.dev>, Andrii Nakryiko <andrii@...nel.org>, 
	linux-trace-kernel@...r.kernel.org, linux-mm@...ck.org, peterz@...radead.org, 
	oleg@...hat.com, rostedt@...dmis.org, mhiramat@...nel.org, 
	bpf@...r.kernel.org, linux-kernel@...r.kernel.org, jolsa@...nel.org, 
	paulmck@...nel.org, willy@...radead.org, akpm@...ux-foundation.org, 
	mjguzik@...il.com, brauner@...nel.org, jannh@...gle.com, mhocko@...nel.org, 
	vbabka@...e.cz, hannes@...xchg.org, Liam.Howlett@...cle.com, 
	lorenzo.stoakes@...cle.com
Subject: Re: [PATCH v3 tip/perf/core 2/4] mm: switch to 64-bit
 mm_lock_seq/vm_lock_seq on 64-bit architectures

On Wed, Oct 16, 2024 at 7:02 PM Suren Baghdasaryan <surenb@...gle.com> wrote:
>
> On Sun, Oct 13, 2024 at 12:56 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> >
> > On Thu, Oct 10, 2024 at 01:56:42PM GMT, Andrii Nakryiko wrote:
> > > To increase mm->mm_lock_seq robustness, switch it from int to long, so
> > > that it's a 64-bit counter on 64-bit systems and we can stop worrying
> > > about it wrapping around in just ~4 billion iterations. Same goes for
> > > VMA's matching vm_lock_seq, which is derived from mm_lock_seq.
>
> vm_lock_seq does not need to be long but for consistency I guess that

How come, we literally assign vm_lock_seq from mm_lock_seq and do
direct comparisons. They have to be exactly the same type, no?

> makes sense. While at it, can you please change these seq counters to
> be unsigned?

There is `vma->vm_lock_seq = -1;` in kernel/fork.c, should it be
switched to ULONG_MAX then? In general, unless this is critical for
correctness, I'd very much like stuff like this to be done in the mm
tree afterwards, but it seems trivial enough, so if you insist I'll do
it.

> Also, did you check with pahole if the vm_area_struct layout change
> pushes some members into a difference cacheline or creates new gaps?
>

Just did. We had 3 byte hole after `bool detached;`, it now grew to 7
bytes (so +4) and then vm_lock_seq itself is now 8 bytes (so +4),
which now does push rb and rb_subtree_last into *THE SAME* cache line
(which sounds like an improvement to me). vm_lock_seq and vm_lock stay
in the same cache line. vm_pgoff and vm_file are now in the same cache
line, and given they are probably always accessed together, seems like
a good accidental change as well. See below pahole outputs before and
after.

That singular detached bool looks like a complete waste, tbh. Maybe it
would be better to roll it into vm_flags and save 8 bytes? (not that I
want to do those mm changes in this patch set, of course...).
vm_area_struct is otherwise nicely tightly packed.

tl;dr, seems fine, and detached would be best to get rid of, if
possible (but that's a completely separate thing)

BEFORE
======
struct vm_area_struct {
        union {
                struct {
                        long unsigned int vm_start;      /*     0     8 */
                        long unsigned int vm_end;        /*     8     8 */
                };                                       /*     0    16 */
                struct callback_head vm_rcu;             /*     0    16 */
        } __attribute__((__aligned__(8)));               /*     0    16 */
        struct mm_struct *         vm_mm;                /*    16     8 */
        pgprot_t                   vm_page_prot;         /*    24     8 */
        union {
                const vm_flags_t   vm_flags;             /*    32     8 */
                vm_flags_t         __vm_flags;           /*    32     8 */
        };                                               /*    32     8 */
        bool                       detached;             /*    40     1 */

        /* XXX 3 bytes hole, try to pack */

        int                        vm_lock_seq;          /*    44     4 */
        struct vma_lock *          vm_lock;              /*    48     8 */
        struct {
                struct rb_node     rb;                   /*    56    24 */
                /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
                long unsigned int  rb_subtree_last;      /*    80     8 */
        }                                                /*    56    32 */
        struct list_head           anon_vma_chain;       /*    88    16 */
        struct anon_vma *          anon_vma;             /*   104     8 */
        const struct vm_operations_struct  * vm_ops;     /*   112     8 */
        long unsigned int          vm_pgoff;             /*   120     8 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        struct file *              vm_file;              /*   128     8 */
        void *                     vm_private_data;      /*   136     8 */
        atomic_long_t              swap_readahead_info;  /*   144     8 */
        struct mempolicy *         vm_policy;            /*   152     8 */
        struct vma_numab_state *   numab_state;          /*   160     8 */
        struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     8 */

        /* size: 176, cachelines: 3, members: 18 */
        /* sum members: 173, holes: 1, sum holes: 3 */
        /* forced alignments: 2 */
        /* last cacheline: 48 bytes */
} __attribute__((__aligned__(8)));

AFTER
=====
struct vm_area_struct {
        union {
                struct {
                        long unsigned int vm_start;      /*     0     8 */
                        long unsigned int vm_end;        /*     8     8 */
                };                                       /*     0    16 */
                struct callback_head vm_rcu;             /*     0    16 */
        } __attribute__((__aligned__(8)));               /*     0    16 */
        struct mm_struct *         vm_mm;                /*    16     8 */
        pgprot_t                   vm_page_prot;         /*    24     8 */
        union {
                const vm_flags_t   vm_flags;             /*    32     8 */
                vm_flags_t         __vm_flags;           /*    32     8 */
        };                                               /*    32     8 */
        bool                       detached;             /*    40     1 */

        /* XXX 7 bytes hole, try to pack */

        long int                   vm_lock_seq;          /*    48     8 */
        struct vma_lock *          vm_lock;              /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct {
                struct rb_node     rb;                   /*    64    24 */
                long unsigned int  rb_subtree_last;      /*    88     8 */
        }                                                /*    64    32 */
        struct list_head           anon_vma_chain;       /*    96    16 */
        struct anon_vma *          anon_vma;             /*   112     8 */
        const struct vm_operations_struct  * vm_ops;     /*   120     8 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        long unsigned int          vm_pgoff;             /*   128     8 */
        struct file *              vm_file;              /*   136     8 */
        void *                     vm_private_data;      /*   144     8 */
        atomic_long_t              swap_readahead_info;  /*   152     8 */
        struct mempolicy *         vm_policy;            /*   160     8 */
        struct vma_numab_state *   numab_state;          /*   168     8 */
        struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   176     8 */

        /* size: 184, cachelines: 3, members: 18 */
        /* sum members: 177, holes: 1, sum holes: 7 */
        /* forced alignments: 2 */
        /* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));


> > >
> > > I didn't use __u64 outright to keep 32-bit architectures unaffected, but
> > > if it seems important enough, I have nothing against using __u64.
> > >
> > > Suggested-by: Jann Horn <jannh@...gle.com>
> > > Signed-off-by: Andrii Nakryiko <andrii@...nel.org>
> >
> > Reviewed-by: Shakeel Butt <shakeel.butt@...ux.dev>
>