linux-kernel - Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20190219024611.GC24502@redhat.com>
Date:   Mon, 18 Feb 2019 21:46:11 -0500
From:   Andrea Arcangeli <aarcange@...hat.com>
To:     Alexander Duyck <alexander.duyck@...il.com>
Cc:     David Hildenbrand <david@...hat.com>,
        "Michael S. Tsirkin" <mst@...hat.com>,
        Nitesh Narayan Lal <nitesh@...hat.com>,
        kvm list <kvm@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Paolo Bonzini <pbonzini@...hat.com>, lcapitulino@...hat.com,
        pagupta@...hat.com, wei.w.wang@...el.com,
        Yang Zhang <yang.zhang.wz@...il.com>,
        Rik van Riel <riel@...riel.com>, dodgen@...gle.com,
        Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
        dhildenb@...hat.com
Subject: Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

Hello,

On Mon, Feb 18, 2019 at 03:47:22PM -0800, Alexander Duyck wrote:
> essentially fragmented them. I guess hugepaged went through and
> started trying to reassemble the huge pages and as a result there have
> been apps that ended up consuming more memory than they would have
> otherwise since they were using fragments of THP pages after doing an
> MADV_DONTNEED on sections of the page.

With relatively recent kernels MADV_DONTNEED doesn't necessarily free
anything when it's applied to a THP subpage, it only splits the
pagetables and queues the THP for deferred splitting. If there's
memory pressure a shrinker will be invoked and the queue is scanned
and the THPs are physically splitted, but to be reassembled/collapsed
after a physical split it requires at least one young pte.

If this is particularly problematic for page hinting, this behavior
where the MADV_DONTNEED can be undoed by khugepaged (if some subpage is
being frequently accessed), can be turned off by setting
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none to
0. Then the THP will only be collapsed if all 512 subpages are mapped
(i.e. they've all be re-allocated by the guest).

Regardless of the max_ptes_none default, keeping the smaller guest
buddy orders as the last target for page hinting should be good for
performance.

> Yeah, no problem. The only thing I don't like about MADV_FREE is that
> you have to have memory pressure before the pages really start getting
> scrubbed with is both a benefit and a drawback. Basically it defers
> the freeing until you are under actual memory pressure so when you hit
> that case things start feeling much slower, that and it limits your
> allocations since the kernel doesn't recognize the pages as free until
> it would have to start trying to push memory to swap.

The guest allocation behavior should not be influenced by MADV_FREE vs
MADV_DONTNEED, the guest can't see the difference anyway, so why
should it limit the allocations?

The benefit of MADV_FREE should be that when the same guest frees and
reallocates an huge amount of RAM (i.e. guest app allocating and
freeing lots of RAM in a loop, not so uncommon), there will be no KVM
page fault during guest re-allocations. So in absence of memory
pressure in the host it should be a major win. Overall it sounds like
a good tradeoff compared to MADV_DONTNEED that forcefully invokes MMU
notifiers and forces host allocations and KVM page faults in order to
reallocate the same RAM in the same guest.

When there's memory pressure it's up to the host Linux VM to notice
there's plenty of MADV_FREE material to free at zero I/O cost before
starting swapping.

Thanks,
Andrea