linux-kernel - Re: On guest free page hinting and OOM

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f6332928-d6a4-7a75-245d-2c534cf6e710@redhat.com>
Date:   Fri, 29 Mar 2019 15:24:24 +0100
From:   David Hildenbrand <david@...hat.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>,
        Nitesh Narayan Lal <nitesh@...hat.com>
Cc:     kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, pbonzini@...hat.com, lcapitulino@...hat.com,
        pagupta@...hat.com, wei.w.wang@...el.com, yang.zhang.wz@...il.com,
        riel@...riel.com, dodgen@...gle.com, konrad.wilk@...cle.com,
        dhildenb@...hat.com, aarcange@...hat.com, alexander.duyck@...il.com
Subject: Re: On guest free page hinting and OOM

On 29.03.19 14:26, Michael S. Tsirkin wrote:
> On Wed, Mar 06, 2019 at 10:50:42AM -0500, Nitesh Narayan Lal wrote:
>> The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables the guests with no page cache to rapidly free and reclaims memory to and from the host respectively.
> 
> Sorry about breaking the thread: the original subject was
> 	KVM: Guest Free Page Hinting
> but the following isn't in a response to a specific patch
> so I thought it's reasonable to start a new one.
> 
> What bothers both me (and others) with both Nitesh's asynchronous approach
> to hinting and the hinting that is already supported in the balloon
> driver right now is that it seems to have the potential to create a fake OOM situation:
> the page that is in the process of being hinted can not be used.  How
> likely that is would depend on the workload so is hard to predict.

We had a very simple idea in mind: As long as a hinting request is
pending, don't actually trigger any OOM activity, but wait for it to be
processed. Can be done using simple atomic variable.

This is a scenario that will only pop up when already pretty low on
memory. And the main difference to ballooning is that we *know* we will
get more memory soon.


> 
> Alex's patches do not have this problem as they block the
> VCPUs from attempting to get new pages during hinting. Solves the fake OOM
> issue but adds blocking which most of the time is not necessary.

+ not going via QEMU which I consider problematic in the future when it
comes to various things
1) VFIO notifications if we ever want to support it
2) Verifying that the memory may actually be hinted. Remember where
people started to madvise(DONTNEED) the BIOS and we had to fix that in QEMU.

> 
> With both approaches there's a tradeoff: hinting is more efficient if it
> hints about large sized chunks of memory at a time, but as that size
> increases, chances of being able to hold on to that much memory at a
> time decrease. One can claim that this is a regular performance/memory
> tradeoff however there is a difference here: normally
> guest performance is traded off for host memory (which host
> knows how much there is of), this trades guest performance
> for guest memory, but the benefit is on the host, not on
> the guest. Thus this is harder to manage.

One nice thing is that, when only hinting larger chunks, the probability
of smaller chunks being available is more likely. It would be more of an
issue when hinting any granularity.

> 
> I have an idea: how about allocating extra guest memory on the host?  An
> extra hinting buffer would be appended to guest memory, with the
> understanding that it is destined specifically to improve page hinting.
> Balloon device would get an extra parameter specifying the
> hinting buffer size - e.g. in the config space of the driver.
> At driver startup, it would get hold of the amount of
> memory specified by host as the hinting buffer size, and keep it around in a
> buffer list - if no action is taken - forever.  Whenever balloon would
> want to get hold of a page of memory and send it to host for hinting, it
> would release a page of the same size from the buffer into the free
> list: a new page swaps places with a page in the buffer.
> 
> In this way the amount of useful free memory stays constant.
> 
> Once hinting is done page can be swapped back - or just stay
> in the hinting buffer until the next hint.
> 
> Clearly this is a memory/performance tradeoff: the more memory host can
> allocate for the hinting buffer, the more batching we'll get so hints
> become cheaper. One notes that:
> - if guest memory isn't pinned, this memory is virtual and can
>   be reclaimed by host. In partucular guest can hint about the
>   memory within the hinting buffer at startup.
> - guest performance/host memory tradeoffs are reasonably well understood, and
>   so it's easier to manage: host knows how much memory it can
>   sacrifice to gain the benefit of hinting.
> 
> Thoughts?
> 

I first want to

a) See it being a real issue. Reproduce it.
b) See that we can't fix it using a simple approach (loop when requests
not processed yet, always keep X pages ...).
c) See that an easy fix is not sufficient and actually an issue.
d) See if we can document it and people that care about can life without
hinting, like they would live without ballooning.

What you describe sounds interesting, but really involved. And really
problematic. I consider many things about your approach not realistic.

"appended to guest memory", "global list of memory", malicious guests
always using that memory like what about NUMA? What about different page
granularity? What about malicious guests? What about more hitning
requests than the buffer is capable to handle? and much much much more.

Honestly, requiring page hinting to make use of actual ballooning or
additional memory makes me shiver. I hope I don't get nightmares ;) In
the long term we might want to get rid of the inflation/deflation side
of virtio-balloon, not require it.

Please don't over-engineer an issue we haven't even see yet. Especially
not using a mechanism that sounds more involved than actual hinting.


As always, I might be very wrong, but this sounds way too complicated to
me, both on the guest and the hypervisor side.

-- 

Thanks,

David / dhildenb