Message-ID: <20191119160456.GD382712@cmpxchg.org>
Date: Tue, 19 Nov 2019 11:04:56 -0500
From: Johannes Weiner <hannes@...xchg.org>
To: Alex Shi <alex.shi@...ux.alibaba.com>
Cc: cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, akpm@...ux-foundation.org,
mgorman@...hsingularity.net, tj@...nel.org, hughd@...gle.com,
khlebnikov@...dex-team.ru, daniel.m.jordan@...cle.com,
yang.shi@...ux.alibaba.com, willy@...radead.org,
shakeelb@...gle.com, Michal Hocko <mhocko@...nel.org>,
Vladimir Davydov <vdavydov.dev@...il.com>,
Roman Gushchin <guro@...com>,
Chris Down <chris@...isdown.name>,
Thomas Gleixner <tglx@...utronix.de>,
Vlastimil Babka <vbabka@...e.cz>, Qian Cai <cai@....pw>,
Andrey Ryabinin <aryabinin@...tuozzo.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Jérôme Glisse <jglisse@...hat.com>,
Andrea Arcangeli <aarcange@...hat.com>,
David Rientjes <rientjes@...gle.com>,
"Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
swkhack <swkhack@...il.com>,
"Potyra, Stefan" <Stefan.Potyra@...ktrobit.com>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
Stephen Rothwell <sfr@...b.auug.org.au>,
Colin Ian King <colin.king@...onical.com>,
Jason Gunthorpe <jgg@...pe.ca>,
Mauro Carvalho Chehab <mchehab+samsung@...nel.org>,
Peng Fan <peng.fan@....com>,
Nikolay Borisov <nborisov@...e.com>,
Ira Weiny <ira.weiny@...el.com>,
Kirill Tkhai <ktkhai@...tuozzo.com>,
Yafang Shao <laoar.shao@...il.com>
Subject: Re: [PATCH v4 3/9] mm/lru: replace pgdat lru_lock with lruvec lock
On Tue, Nov 19, 2019 at 08:23:17PM +0800, Alex Shi wrote:
> This patchset moves lru_lock into the lruvec, giving each lruvec its
> own lru_lock, and thus one lru_lock per memcg per node.
>
> This is the main patch to replace per node lru_lock with per memcg
> lruvec lock.
>
> We introduce the function lock_page_lruvec: without memcg it is the
> same as the vanilla pgdat lock; otherwise the function keeps
> re-taking the lruvec's lock to guard against page->mem_cgroup
> changes from page migration between memcgs. (Thanks to Hugh Dickins
> and Konstantin Khlebnikov for the reminder on this; the core logic
> is the same as in their previous patches.)
>
> Following Daniel Jordan's suggestion, I ran 64 'dd' tasks in 32
> containers on my 2-socket, 8-core, HT box with the modified case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>
> With this and the later patches, dd throughput is 144MB/s versus
> 123MB/s on the vanilla kernel, a 17% increase.
>
> Signed-off-by: Alex Shi <alex.shi@...ux.alibaba.com>
> Cc: Johannes Weiner <hannes@...xchg.org>
> Cc: Michal Hocko <mhocko@...nel.org>
> Cc: Vladimir Davydov <vdavydov.dev@...il.com>
> Cc: Andrew Morton <akpm@...ux-foundation.org>
> Cc: Roman Gushchin <guro@...com>
> Cc: Shakeel Butt <shakeelb@...gle.com>
> Cc: Chris Down <chris@...isdown.name>
> Cc: Thomas Gleixner <tglx@...utronix.de>
> Cc: Mel Gorman <mgorman@...hsingularity.net>
> Cc: Vlastimil Babka <vbabka@...e.cz>
> Cc: Qian Cai <cai@....pw>
> Cc: Andrey Ryabinin <aryabinin@...tuozzo.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
> Cc: "Jérôme Glisse" <jglisse@...hat.com>
> Cc: Andrea Arcangeli <aarcange@...hat.com>
> Cc: Yang Shi <yang.shi@...ux.alibaba.com>
> Cc: David Rientjes <rientjes@...gle.com>
> Cc: "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>
> Cc: swkhack <swkhack@...il.com>
> Cc: "Potyra, Stefan" <Stefan.Potyra@...ktrobit.com>
> Cc: Mike Rapoport <rppt@...ux.vnet.ibm.com>
> Cc: Stephen Rothwell <sfr@...b.auug.org.au>
> Cc: Colin Ian King <colin.king@...onical.com>
> Cc: Jason Gunthorpe <jgg@...pe.ca>
> Cc: Mauro Carvalho Chehab <mchehab+samsung@...nel.org>
> Cc: Matthew Wilcox <willy@...radead.org>
> Cc: Peng Fan <peng.fan@....com>
> Cc: Nikolay Borisov <nborisov@...e.com>
> Cc: Ira Weiny <ira.weiny@...el.com>
> Cc: Kirill Tkhai <ktkhai@...tuozzo.com>
> Cc: Yafang Shao <laoar.shao@...il.com>
> Cc: Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
> Cc: Hugh Dickins <hughd@...gle.com>
> Cc: Tejun Heo <tj@...nel.org>
> Cc: linux-kernel@...r.kernel.org
> Cc: linux-mm@...ck.org
> Cc: cgroups@...r.kernel.org
> ---
> include/linux/memcontrol.h | 24 +++++++++++++++
> include/linux/mmzone.h | 2 ++
> mm/compaction.c | 67 ++++++++++++++++++++++++++++-------------
> mm/huge_memory.c | 15 ++++------
> mm/memcontrol.c | 75 +++++++++++++++++++++++++++++++++++-----------
> mm/mlock.c | 31 ++++++++++---------
> mm/mmzone.c | 1 +
> mm/page_idle.c | 5 ++--
> mm/swap.c | 74 +++++++++++++++++++--------------------------
> mm/vmscan.c | 58 +++++++++++++++++------------------
> 10 files changed, 214 insertions(+), 138 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5b86287fa069..9538253998a6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -418,6 +418,10 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>
> struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
>
> +struct lruvec *lock_page_lruvec_irq(struct page *, struct pglist_data *);
> +struct lruvec *lock_page_lruvec_irqsave(struct page *, struct pglist_data *,
> + unsigned long*);
> +
> struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>
> struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
> @@ -901,6 +905,26 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> return &pgdat->__lruvec;
> }
>
> +static inline struct lruvec *lock_page_lruvec_irq(struct page *page,
> + struct pglist_data *pgdat)
> +{
> + struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
> + spin_lock_irq(&lruvec->lru_lock);
> +
> + return lruvec;
While this works in practice, it looks wrong because it doesn't follow
the mem_cgroup_page_lruvec() rules.
Please open-code spin_lock_irq(&pgdat->__lruvec.lru_lock) instead.
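
To illustrate the suggestion, here is a userspace sketch (not kernel code) of the open-coded !CONFIG_MEMCG variant: with memory cgroups disabled there is exactly one lruvec per node, embedded in pglist_data, so the helper can take pgdat->__lruvec.lru_lock directly instead of routing through mem_cgroup_page_lruvec(). A pthread mutex stands in for spin_lock_irq, and the struct layouts are minimal stand-ins for the kernel's:

```c
#include <pthread.h>

struct lruvec {
	pthread_mutex_t lru_lock;	/* stands in for the kernel spinlock */
};

struct pglist_data {
	struct lruvec __lruvec;		/* the node's single lruvec, !CONFIG_MEMCG */
};

/* !CONFIG_MEMCG case: exactly one lruvec per node, so lock it
 * directly rather than going through the memcg lookup helper. */
static struct lruvec *lock_page_lruvec_irq(struct pglist_data *pgdat)
{
	pthread_mutex_lock(&pgdat->__lruvec.lru_lock);
	return &pgdat->__lruvec;
}
```

The point of open-coding is readability: it makes explicit that no memcg lookup rules are involved on this path.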
> @@ -1246,6 +1245,46 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
> return lruvec;
> }
>
> +struct lruvec *lock_page_lruvec_irq(struct page *page,
> + struct pglist_data *pgdat)
> +{
> + struct lruvec *lruvec;
> +
> +again:
> + rcu_read_lock();
> + lruvec = mem_cgroup_page_lruvec(page, pgdat);
> + spin_lock_irq(&lruvec->lru_lock);
> + rcu_read_unlock();
The spinlock doesn't prevent the lruvec from being freed.
You deleted the rules from the mem_cgroup_page_lruvec() documentation,
but they still apply: if the page is already !PageLRU() by the time
you get here, it could get reclaimed or migrated to another cgroup,
and that can free the memcg/lruvec. Merely having the lru_lock held
does not prevent this.
Either the page needs to be locked, or the page needs to be PageLRU
with the lru_lock held to prevent somebody else from isolating
it. Otherwise, the lruvec is not safe to use.
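
For reference, the "repin" pattern the patch's again: loop implements can be modeled in userspace as follows (a sketch, not kernel code: a pthread mutex stands in for the irq-disabling spinlock, and an atomic lruvec pointer stands in for page->mem_cgroup; the RCU/refcount lifetime issue raised above is deliberately not modeled here):

```c
#include <pthread.h>
#include <stdatomic.h>

struct lruvec {
	pthread_mutex_t lru_lock;
};

struct page {
	_Atomic(struct lruvec *) lruvec;	/* stands in for page->mem_cgroup */
};

/* Lock the page's current lruvec, retrying if the page was migrated
 * to another lruvec between the lookup and the lock acquisition. */
static struct lruvec *lock_page_lruvec(struct page *page)
{
	struct lruvec *lruvec;

	for (;;) {
		lruvec = atomic_load(&page->lruvec);
		pthread_mutex_lock(&lruvec->lru_lock);
		/* Recheck under the lock: did the page move away? */
		if (lruvec == atomic_load(&page->lruvec))
			return lruvec;	/* still ours; caller unlocks */
		pthread_mutex_unlock(&lruvec->lru_lock);
	}
}
```

Note this recheck only guarantees the association is stable while the lock is held; as the review points out, it does nothing to keep the lruvec itself alive, which is why the page lock or PageLRU-under-lock is still required in the kernel.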