linux-kernel - Re: [PATCH v9 17/24] mm: Protect mm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Wed, 14 Mar 2018 09:48:44 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Laurent Dufour <ldufour@...ux.vnet.ibm.com>
Cc:     paulmck@...ux.vnet.ibm.com, akpm@...ux-foundation.org,
        kirill@...temov.name, ak@...ux.intel.com, mhocko@...nel.org,
        dave@...olabs.net, jack@...e.cz,
        Matthew Wilcox <willy@...radead.org>, benh@...nel.crashing.org,
        mpe@...erman.id.au, paulus@...ba.org,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, hpa@...or.com,
        Will Deacon <will.deacon@....com>,
        Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>,
        kemi.wang@...el.com, sergey.senozhatsky.work@...il.com,
        Daniel Jordan <daniel.m.jordan@...cle.com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        haren@...ux.vnet.ibm.com, khandual@...ux.vnet.ibm.com,
        npiggin@...il.com, bsingharora@...il.com,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        linuxppc-dev@...ts.ozlabs.org, x86@...nel.org
Subject: Re: [PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock

On Tue, Mar 13, 2018 at 06:59:47PM +0100, Laurent Dufour wrote:
> This change is inspired by the Peter's proposal patch [1] which was
> protecting the VMA using SRCU. Unfortunately, SRCU is not scaling well in
> that particular case, and it is introducing major performance degradation
> due to excessive scheduling operations.

Do you happen to have a little more detail on that?

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 34fde7111e88..28c763ea1036 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -335,6 +335,7 @@ struct vm_area_struct {
>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>  #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>  	seqcount_t vm_sequence;
> +	atomic_t vm_ref_count;		/* see vma_get(), vma_put() */
>  #endif
>  } __randomize_layout;
>  
> @@ -353,6 +354,9 @@ struct kioctx_table;
>  struct mm_struct {
>  	struct vm_area_struct *mmap;		/* list of VMAs */
>  	struct rb_root mm_rb;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +	rwlock_t mm_rb_lock;
> +#endif
>  	u32 vmacache_seqnum;                   /* per-thread vmacache */
>  #ifdef CONFIG_MMU
>  	unsigned long (*get_unmapped_area) (struct file *filp,

When I tried this, it simply traded contention on mmap_sem for
contention on these two cachelines.

This was for the concurrent fault benchmark, where mmap_sem is only ever
acquired for reading (so no blocking ever happens) and the bottle-neck
was really pure cacheline access.

Only by using RCU can you avoid that thrashing.

Also note that if your database allocates the one giant mapping, it'll
be _one_ VMA and that vm_ref_count gets _very_ hot indeed.