Message-ID: <20160225195613.GZ2854@techsingularity.net>
Date:	Thu, 25 Feb 2016 19:56:13 +0000
From:	Mel Gorman <mgorman@...hsingularity.net>
To:	Andrea Arcangeli <aarcange@...hat.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Vlastimil Babka <vbabka@...e.cz>,
	Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Linux-MM <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour
 disable it by default

On Thu, Feb 25, 2016 at 08:01:44PM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 25, 2016 at 05:12:19PM +0000, Mel Gorman wrote:
> > some cases, this will reduce THP usage but the benefit of THP is hard to
> > measure and not a universal win, whereas a stall to reclaim/compaction is
> 
> It depends on the workload: with virtual machines THP is essential
> from the start, without having to wait half a khugepaged cycle on
> average, especially on large systems.

Which is a specialised case that does not apply to all users. Remember
that the data showed that a basic streaming write of an anon mapping on
a freshly booted NUMA system was enough to stall the process for long
periods of time.

Even in the specialised case, a single VM reaching its peak performance
may rely on getting THP, but if that comes at the cost of reclaiming other
pages that may be hot to a second VM, then it's an overall loss.

Finally, for the specialised case, if it really is that critical then
pages could be freed preemptively from userspace before the VM starts.
For example, allocate and free X hugetlbfs pages before the migration.
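
Something along these lines, as a rough and untested sketch only (it
assumes the hugetlb pool starts empty, uses the default huge page size,
needs root, and the page count argument is the "X" above):

	#include <stdio.h>
	#include <stdlib.h>

	static int write_nr_hugepages(long n)
	{
		FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");

		if (!f)
			return -1;
		fprintf(f, "%ld\n", n);
		return fclose(f);
	}

	int main(int argc, char **argv)
	{
		long npages = argc > 1 ? atol(argv[1]) : 512;	/* the "X" */

		/* Growing the pool forces reclaim/compaction of X huge pages */
		if (write_nr_hugepages(npages))
			return 1;

		/* Shrinking it back returns the now-contiguous memory as free */
		return write_nr_hugepages(0) ? 1 : 0;
	}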

Right now, there are numerous tuning guides out there that suggest
disabling THP entirely due to the stalls. On my own desktop, I have
occasionally seen a new process halt the system for a few seconds, and it
was possible to see that THP allocations were happening at the time.
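
For reference, what those guides typically recommend boils down to
flipping the sysfs knobs under /sys/kernel/mm/transparent_hugepage. A
trivial, untested C equivalent of the defrag part (the "enabled" file in
the same directory turns THP off entirely):

	#include <stdio.h>

	int main(void)
	{
		/* The knob the guides point at; needs root */
		FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/defrag", "w");

		if (!f)
			return 1;
		fputs("never\n", f);
		return fclose(f) ? 1 : 0;
	}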

> We see this effect for example
> in postcopy live migration, where --postcopy-after-precopy is essential
> to reach peak performance during database workloads in the guest,
> immediately after postcopy completes. With --postcopy-after-precopy,
> only those pages that may be triggering userfaults will need to be
> collapsed with khugepaged, and all the rest that was previously passed
> over with precopy has a high probability of being immediately THP backed,
> also thanks to defrag/direct-compaction. Failing to start
> the destination node largely THP backed is very visible in benchmarks
> (even if a full precopy pass is done first). Later on the performance
> increases again as khugepaged fixes things, but it takes some time.
> 

If it's critical that the performance is identical, then I would suggest
a pre-migration step that allocates and frees hugetlbfs pages to force the
defragmentation. Alternatively, trigger compaction from proc and, if
necessary, use memhog to allocate and free the required memory, followed
by another proc-triggered compaction. It's a little less tidy, but it
solves the corner case while leaving the common case free of stalls.
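
Again only as a rough, untested sketch of that sequence (the mapping size
is a placeholder to be tuned to the guest's footprint, and writing the
proc trigger needs root):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>

	#define SIZE	(1UL << 30)	/* placeholder: size to the guest */

	/* Equivalent of echoing 1 to the global compaction trigger */
	static void compact(void)
	{
		FILE *f = fopen("/proc/sys/vm/compact_memory", "w");

		if (f) {
			fputs("1\n", f);
			fclose(f);
		}
	}

	int main(void)
	{
		unsigned long off;
		char *p;

		compact();

		/* Crude memhog: dirty every page of a large anon mapping,
		 * then give it all back so the second compaction pass has
		 * free pages to work with. */
		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		for (off = 0; off < SIZE; off += 4096)
			p[off] = 1;
		munmap(p, SIZE);

		compact();
		return 0;
	}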

> So unless we have a very good kcompactd or a workqueue doing the job of
> providing enough THP for page faults, I'm skeptical of this.

Unfortunately, it'll never be perfect. We went through a cycle of having
really high allocation success rates back in the 3.0 days, and the cost in
reclaim and disruption was way too high.

> Another problem is that khugepaged isn't able to collapse shared
> readonly anon pages, mostly because of the rmap complexities.  I agree
> with Kirill that we should be looking into how to make this work, although
> I doubt the simpler refcounting is going to help much in this regard, as
> the problem is in dealing with rmap, not so much with refcounts.

I think that's important but I'm not seeing right now how it's related
to preventing processes stalling for long periods of time in direct
reclaim and compaction.

-- 
Mel Gorman
SUSE Labs
