linux-kernel - Re: [PATCH v8 03/10] mm/lru: replace pgdat lru

Message-ID: <20200116215222.GA64230@cmpxchg.org>
Date:   Thu, 16 Jan 2020 16:52:22 -0500
From:   Johannes Weiner <hannes@...xchg.org>
To:     Alex Shi <alex.shi@...ux.alibaba.com>
Cc:     cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, akpm@...ux-foundation.org,
        mgorman@...hsingularity.net, tj@...nel.org, hughd@...gle.com,
        khlebnikov@...dex-team.ru, daniel.m.jordan@...cle.com,
        yang.shi@...ux.alibaba.com, willy@...radead.org,
        shakeelb@...gle.com, Michal Hocko <mhocko@...nel.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Roman Gushchin <guro@...com>,
        Chris Down <chris@...isdown.name>,
        Thomas Gleixner <tglx@...utronix.de>,
        Vlastimil Babka <vbabka@...e.cz>, Qian Cai <cai@....pw>,
        Andrey Ryabinin <aryabinin@...tuozzo.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Jérôme Glisse <jglisse@...hat.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        David Rientjes <rientjes@...gle.com>,
        "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
        swkhack <swkhack@...il.com>,
        "Potyra, Stefan" <Stefan.Potyra@...ktrobit.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Stephen Rothwell <sfr@...b.auug.org.au>,
        Colin Ian King <colin.king@...onical.com>,
        Jason Gunthorpe <jgg@...pe.ca>,
        Mauro Carvalho Chehab <mchehab+samsung@...nel.org>,
        Peng Fan <peng.fan@....com>,
        Nikolay Borisov <nborisov@...e.com>,
        Ira Weiny <ira.weiny@...el.com>,
        Kirill Tkhai <ktkhai@...tuozzo.com>,
        Yafang Shao <laoar.shao@...il.com>
Subject: Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock

On Thu, Jan 16, 2020 at 11:05:02AM +0800, Alex Shi wrote:
> @@ -948,10 +956,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
>  			goto isolate_fail;
>  
> +		lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
>  		/* If we already hold the lock, we can skip some rechecking */
> -		if (!locked) {
> -			locked = compact_lock_irqsave(&pgdat->lru_lock,
> -								&flags, cc);
> +		if (lruvec != locked_lruvec) {
> +			struct mem_cgroup *memcg = lock_page_memcg(page);
> +
> +			if (locked_lruvec) {
> +				unlock_page_lruvec_irqrestore(locked_lruvec, flags);
> +				locked_lruvec = NULL;
> +			}
> +			/* reget lruvec with a locked memcg */
> +			lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
> +			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
> +			locked_lruvec = lruvec;
>  
>  			/* Try get exclusive access under lock */
>  			if (!skip_updated) {

In a previous review, I pointed out the following race condition
between page charging and compaction:

compaction:				generic_file_buffered_read:

					page_cache_alloc()

!PageBuddy()

lock_page_lruvec(page)
  lruvec = mem_cgroup_page_lruvec()
  spin_lock(&lruvec->lru_lock)
  if lruvec != mem_cgroup_page_lruvec()
    goto again

					add_to_page_cache_lru()
					  mem_cgroup_commit_charge()
					    page->mem_cgroup = foo
					  lru_cache_add()
					    __pagevec_lru_add()
					      SetPageLRU()

if PageLRU(page):
  __isolate_lru_page()

As far as I can see, you have not addressed this. You have added
lock_page_memcg(), but that only prevents charged pages from moving
between cgroups; it does not prevent newly allocated pages from being
charged.

It doesn't matter how many times you check the lruvec before and after
locking - if you're looking at a free page, it might get allocated,
charged and put on a new lruvec after you're done checking, and then
you isolate a page from an unlocked lruvec.

You simply cannot serialize on page->mem_cgroup->lruvec when
page->mem_cgroup isn't stable. You need to serialize on the page
itself, one way or another, to make this work.
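
To make the interleaving above concrete, here is a minimal single-threaded
userspace replay of it. This is only a sketch: struct lruvec_model,
struct page_model and the helper names are hypothetical stand-ins, not
kernel code, and it assumes an uncharged page resolves to the root
lruvec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not kernel types. */
struct lruvec_model {
	bool locked;			/* models lruvec->lru_lock being held */
};

struct page_model {
	struct lruvec_model *lruvec;	/* models page->mem_cgroup's lruvec */
	bool lru;			/* models PageLRU */
};

/* Lock, then recheck: returns the lruvec whose lock is now held. */
static struct lruvec_model *lock_page_lruvec_model(struct page_model *page)
{
	struct lruvec_model *lruvec;

	for (;;) {
		lruvec = page->lruvec;
		lruvec->locked = true;
		if (lruvec == page->lruvec)
			return lruvec;
		lruvec->locked = false;	/* page moved, retry */
	}
}

/* Models the charge + lru_cache_add() path of the allocating task. */
static void charge_and_add_to_lru(struct page_model *page,
				  struct lruvec_model *target)
{
	/* The target's lock is free, so nothing blocks this path. */
	assert(!target->locked);
	page->lruvec = target;		/* page->mem_cgroup = foo */
	page->lru = true;		/* SetPageLRU() */
}

/*
 * Replays the interleaving. Returns true if compaction would go on to
 * isolate the page while holding a lock other than the one belonging
 * to the page's current lruvec.
 */
static bool replay_race(void)
{
	struct lruvec_model root = { false }, foo = { false };
	struct page_model page = { &root, false };	/* free, uncharged */
	struct lruvec_model *held;

	held = lock_page_lruvec_model(&page);	/* compaction: lock + recheck */
	charge_and_add_to_lru(&page, &foo);	/* allocator runs afterwards */

	/* compaction: PageLRU is set, so it would call __isolate_lru_page() */
	return page.lru && held != page.lruvec;
}
```

In this model replay_race() reports the mismatch: the recheck passed
because the page was still uncharged at that point, and nothing stopped
the charge from landing afterwards.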


So here is a crazy idea that may be worth exploring:

Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
linked list.

Can we make PageLRU atomic and use it to stabilize the lru_lock
instead, and then use the lru_lock only to serialize list operations?

I.e. in compaction, you'd do

	if (!TestClearPageLRU(page))
		goto isolate_fail;
	/*
	 * We isolated the page's LRU state and thereby locked out all
	 * other isolators, including cgroup page moving, page reclaim,
	 * page freeing etc. That means page->mem_cgroup is now stable
	 * and we can safely look up the correct lruvec and take the
	 * page off its physical LRU list.
	 */
	lruvec = mem_cgroup_page_lruvec(page);
	spin_lock_irq(&lruvec->lru_lock);
	del_page_from_lru_list(page, lruvec, page_lru(page));
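
For illustration, the test-and-clear step can be modeled in userspace with
C11 atomics. This is a rough sketch under stated assumptions: the PG_LRU
bit value, struct page_model, struct lruvec_model and isolate_page_model()
are all hypothetical, and the list and its lock are reduced to a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define PG_LRU 0x1u	/* hypothetical bit value for this model */

struct lruvec_model {
	int nr_pages;	/* stands in for the LRU list guarded by lru_lock */
};

struct page_model {
	atomic_uint flags;
	struct lruvec_model *lruvec;	/* stands in for page->mem_cgroup's lruvec */
};

/* Atomically clear PG_LRU and report whether it was set beforehand. */
static bool test_clear_page_lru(struct page_model *page)
{
	unsigned int old = atomic_fetch_and(&page->flags, ~PG_LRU);

	return (old & PG_LRU) != 0;
}

/* Returns true only for the one caller that wins the LRU bit. */
static bool isolate_page_model(struct page_model *page)
{
	if (!test_clear_page_lru(page))
		return false;	/* not on an LRU, or already isolated */

	/*
	 * We own the page's LRU state now, so the lruvec lookup is
	 * stable. The real code would take lruvec->lru_lock around
	 * the list update.
	 */
	assert(page->lruvec != NULL);
	page->lruvec->nr_pages--;
	return true;
}
```

A second isolate_page_model() call on the same page fails, and so does a
call on a page that was never put on an LRU, which is the property the
lock-and-recheck scheme cannot provide.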

Putback would mostly remain the same (although you could take the
PageLRU setting out of the list update locked section, as long as it's
set after the page is physically linked):

	/* LRU isolation pins page->mem_cgroup */
	lruvec = mem_cgroup_page_lruvec(page);
	spin_lock_irq(&lruvec->lru_lock);
	add_page_to_lru_list(...);
	spin_unlock_irq(&lruvec->lru_lock);

	SetPageLRU(page);

And you'd have to carefully review and rework other sites that rely on
PageLRU: reclaim, __page_cache_release(), __activate_page() etc.

Especially things like activate_page(), which used to only check
PageLRU to shuffle the page on the LRU list, would now have to briefly
clear PageLRU and then set it again afterwards.
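
Roughly, such a site might end up with the shape below. Again this is
only a userspace sketch with C11 atomics: the flag values,
struct page_model and activate_page_model() are hypothetical, and the
list move under lru_lock is reduced to setting a flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PG_LRU		0x1u	/* hypothetical bit values for this model */
#define PG_ACTIVE	0x2u

struct page_model {
	atomic_uint flags;
};

/* Atomically clear PG_LRU and report whether it was set beforehand. */
static bool test_clear_page_lru(struct page_model *page)
{
	unsigned int old = atomic_fetch_and(&page->flags, ~PG_LRU);

	return (old & PG_LRU) != 0;
}

/* Returns true if the page was moved, false if we lost the LRU bit. */
static bool activate_page_model(struct page_model *page)
{
	if (!test_clear_page_lru(page))
		return false;	/* not on an LRU, or isolated by someone else */

	/* Nobody else can set PG_LRU while we hold the isolation. */
	assert(!(atomic_load(&page->flags) & PG_LRU));

	/*
	 * The real code would look up the lruvec here, take its
	 * lru_lock and move the page to the active list.
	 */
	atomic_fetch_or(&page->flags, PG_ACTIVE);

	/* Set PG_LRU again only after the page is back on a list. */
	atomic_fetch_or(&page->flags, PG_LRU);
	return true;
}
```

The cost is visible here: one extra atomic to clear the bit and one to
set it again, where the old code only had to test it under the lock.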

However, aside from a bit more churn in those cases, and the
unfortunate additional atomic operations, I currently can't think of a
fundamental reason why this wouldn't work.

Hugh, what do you think?