linux-kernel - Re: [PATCHv12 34/37] thp: introduce deferred_split_huge

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20160121012237.GE7119@redhat.com>
Date:	Thu, 21 Jan 2016 02:22:37 +0100
From:	Andrea Arcangeli <aarcange@...hat.com>
To:	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Hugh Dickins <hughd@...gle.com>,
	Dave Hansen <dave.hansen@...el.com>,
	Mel Gorman <mgorman@...e.de>, Rik van Riel <riel@...hat.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	Christoph Lameter <cl@...two.org>,
	Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
	Steve Capper <steve.capper@...aro.org>,
	"Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Michal Hocko <mhocko@...e.cz>,
	Jerome Marchand <jmarchan@...hat.com>,
	Sasha Levin <sasha.levin@...cle.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCHv12 34/37] thp: introduce deferred_split_huge_page()

Hello Kirill,

On Tue, Oct 06, 2015 at 06:24:01PM +0300, Kirill A. Shutemov wrote:
> +static unsigned long deferred_split_scan(struct shrinker *shrink,
> +		struct shrink_control *sc)
> +{
> +	unsigned long flags;
> +	LIST_HEAD(list), *pos, *next;
> +	struct page *page;
> +	int split = 0;
> +
> +	spin_lock_irqsave(&split_queue_lock, flags);
> +	list_splice_init(&split_queue, &list);
> +
> +	/* Take pin on all head pages to avoid freeing them under us */
> +	list_for_each_safe(pos, next, &list) {
> +		page = list_entry((void *)pos, struct page, mapping);
> +		page = compound_head(page);
> +		/* race with put_compound_page() */
> +		if (!get_page_unless_zero(page)) {
> +			list_del_init(page_deferred_list(page));
> +			split_queue_len--;
> +		}
> +	}
> +	spin_unlock_irqrestore(&split_queue_lock, flags);

While rebasing I noticed this loop looks a bit too heavy. There's no
lockbreak and no cap on the list size, and million of THP pages could
have been partially unmapped but not be entirely freed yet, and sit
there for a while (there are other scenarios but this is the one that
could more realistically happen with certain allocators). Then as
result of random memory pressure we'd be calling millions of
get_page_unless_zero across multiple NUMA nodes thrashing cachelines
at every list entry, with irq disabled too for the whole period.

I haven't verified it, but I guess that in some large NUMA (i.e. 4TiB)
system that could take down a CPU for a second or more with irq
disabled.

I think it needs to isolate a certain number of pages, not splice
(userland programs can invoke the shrinker through direct reclaim too
and they can't stuck there for too long) and perhaps use
sc->nr_to_scan to achieve that.

The split_queue can also be moved from global to the "struct
pglist_data" and then you can do NODE_DATA(sc->nid)->split_queue, same
for the spinlock. That will make it more scalable for the lock and
more efficient in freeing memory so we don't split THP from nodes
reclaim isn't currently interested about (reclaim will later try again
on the zones in the other nodes by itself if needed).

Thanks,
Andrea