linux-kernel - Re: On guest free page hinting and OOM

Message-ID: <7a3baa90-5963-e6e2-c862-9cd9cc1b5f60@redhat.com>
Date:   Fri, 29 Mar 2019 16:37:46 +0100
From:   David Hildenbrand <david@...hat.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     Nitesh Narayan Lal <nitesh@...hat.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        pbonzini@...hat.com, lcapitulino@...hat.com, pagupta@...hat.com,
        wei.w.wang@...el.com, yang.zhang.wz@...il.com, riel@...riel.com,
        dodgen@...gle.com, konrad.wilk@...cle.com, dhildenb@...hat.com,
        aarcange@...hat.com, alexander.duyck@...il.com
Subject: Re: On guest free page hinting and OOM

On 29.03.19 16:08, Michael S. Tsirkin wrote:
> On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote:
>>
>> We had a very simple idea in mind: As long as a hinting request is
>> pending, don't actually trigger any OOM activity, but wait for it to be
>> processed. This can be done using a simple atomic variable.
>>
>> This is a scenario that will only pop up when already pretty low on
>> memory. And the main difference to ballooning is that we *know* we will
>> get more memory soon.
> 
> No we don't.  If we keep polling we are quite possibly keeping the CPU
> busy, thereby delaying the hint request processing.  Again, the issue is that it's a

You can always yield. But that's a different topic.

> tradeoff. One performance for the other. Very hard to know in advance which
> path you will hit, and in the real world no one has the time to profile
> and tune things. By comparison trading memory for performance is well
> understood.
> 
> 
>> "appended to guest memory", "global list of memory", malicious guests
>> always using that memory. And what about NUMA?
> 
> This can be up to the guest. A good approach would be to take
> a chunk out of each node and add to the hints buffer.

This might lead to you not using the buffer efficiently. But also,
different topic.

> 
>> What about different page
>> granularity?
> 
> Seems like an orthogonal issue to me.

It is similar, yes. But if you support multiple granularities (e.g.
MAX_ORDER - 1, MAX_ORDER - 2 ...) you might have to implement some sort
of buddy for the buffer. This is different than just a list for each node.
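
To illustrate why this is more than one list per node, here is a rough
userspace sketch of the bookkeeping a multi-granularity buffer would need.
Everything in it is an assumption made for illustration: the names
(hint_free, hint_buffer_take), the node and order counts, and the choice to
track only block counts instead of real pages. Merging freed buddies back
together is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define HINT_NODES  2   /* hypothetical number of NUMA nodes */
#define HINT_ORDERS 3   /* hypothetical number of supported block orders */

/*
 * Free buffer blocks per node and per order. Index 0 is the smallest
 * supported order. Only counts are tracked in this sketch.
 */
static unsigned int hint_free[HINT_NODES][HINT_ORDERS];

/*
 * Take one block of the given order from a node's buffer, splitting a
 * larger block if no block of the requested order is free.
 * Returns false if the node's buffer cannot satisfy the request.
 */
bool hint_buffer_take(int node, int order)
{
    int o;

    assert(node >= 0 && node < HINT_NODES);
    assert(order >= 0 && order < HINT_ORDERS);

    /* Find the smallest order >= the requested one that has a free block. */
    for (o = order; o < HINT_ORDERS; o++)
        if (hint_free[node][o] > 0)
            break;
    if (o == HINT_ORDERS)
        return false;

    hint_free[node][o]--;

    /* Split downwards: every split leaves one free buddy one order lower. */
    while (o > order) {
        o--;
        hint_free[node][o]++;
    }
    return true;
}
```

Even in this reduced form, a request can touch several per-order counters on
one node, which is the part a plain per-node list does not have to deal with.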

> 
>> What about malicious guests?
> 
> That's an interesting question.  Host can actually enforce that # of
> hinted free pages at least matches the hint buffer size.

Well, you will have to assume that the hinting buffer might be
completely used by any guest. Just like ballooning, there are no
guarantees. If the buffer memory is part of initial boot memory, any
guest without proper modifications will simply make use of it. See more
on that below. At first, my impression was that you would actually
somehow want to expose new memory ("buffer") to the guest, which is why
I shivered.

> 
> 
>> What about more hinting
>> requests than the buffer can handle?
> 
> The idea is that we don't send more hints than in the buffer.
> In this way host can actually control the overhead which
> is probably a good thing - host knows how much benefit
> can be derived from hinting. Guest doesn't.

So the buffer will require a global lock, just so we are on the same
page. "this way host can actually control the overhead" - you can
implement that more easily by specifying a desired limit via virtio and
tracking it in the guest. Which is what your buffer would implicitly do.
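
A minimal sketch of what "tracking it in the guest" could look like, assuming
the host hands the guest a limit on outstanding hinted pages. The names
(hint_limit_pages, hint_try_reserve, hint_release) and the way the limit
reaches the guest are hypothetical; this is plain C11 userspace code, not
kernel or virtio code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit on outstanding hinted pages, as set by the host. */
static atomic_size_t hint_limit_pages;
/* Pages currently hinted and not yet released again. */
static atomic_size_t hinted_pages;

/* Would be called when the host updates the limit. */
void hint_set_limit(size_t pages)
{
    atomic_store(&hint_limit_pages, pages);
}

/* Reserve room for hinting `pages` pages; fails if the limit would be exceeded. */
bool hint_try_reserve(size_t pages)
{
    size_t cur = atomic_load(&hinted_pages);

    for (;;) {
        size_t limit = atomic_load(&hint_limit_pages);

        if (pages > limit || cur > limit - pages)
            return false;
        /* On failure, cur is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(&hinted_pages, &cur, cur + pages))
            return true;
    }
}

/* Give back a previous reservation once the pages are no longer hinted. */
void hint_release(size_t pages)
{
    size_t prev = atomic_fetch_sub(&hinted_pages, pages);

    assert(prev >= pages);  /* releasing more than was reserved is a bug */
    (void)prev;
}
```

With a limit of 4 pages, reserving 3 succeeds and a further request for 2 is
refused until something is released. No global lock is involved, only the
counter.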

> 
>> Honestly, requiring page hinting to make use of actual ballooning or
>> additional memory makes me shiver. I hope I don't get nightmares ;) In
>> the long term we might want to get rid of the inflation/deflation side
>> of virtio-balloon, not require it.
>>
>> Please don't over-engineer an issue we haven't even seen yet.
> 
> All hinting patches are very lightly tested as it is. OOM especially is
> very hard to test properly.  So *I* will sleep better at night if we
> don't have corner cases.  Balloon is already involved in MM for
> isolation and somehow we live with that.  So wait until you see actual
> code before worrying about nightmares.

Yes, but I want to see attempts with simple solutions first.
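
For reference, the simple approach from the top of this mail (an atomic
variable counting pending hinting requests, with the OOM path waiting on it
and yielding instead of busy-polling) could be sketched roughly as below.
This is a userspace approximation in C11; the function names are made up for
illustration and sched_yield() merely stands in for whatever yielding
primitive the real code path would use.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter of hinting requests not yet processed by the host. */
static atomic_int hints_pending;

/* Called when a hinting request is handed to the host. */
void hint_request_start(void)
{
    atomic_fetch_add(&hints_pending, 1);
}

/* Called once the host has processed a hinting request. */
void hint_request_done(void)
{
    int prev = atomic_fetch_sub(&hints_pending, 1);

    assert(prev > 0);   /* a completion without a matching start is a bug */
    (void)prev;
}

/*
 * Called before triggering any OOM activity.
 * Returns true if hinting requests were pending and we waited for them
 * (the caller should retry the allocation), false if nothing was pending
 * (OOM handling may proceed).
 */
bool oom_wait_for_hints(void)
{
    bool waited = false;

    while (atomic_load(&hints_pending) > 0) {
        sched_yield();  /* give up the CPU instead of spinning */
        waited = true;
    }
    return waited;
}
```

The sketch waits without a timeout, so it relies on every started request
eventually being completed.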

> 
>> Especially
>> not using a mechanism that sounds more involved than actual hinting.
> 
> That would depend on the implementation.
> It's just moving a page between two lists.

Except NUMA and different granularities.

> 
> 
>>
>> As always, I might be very wrong, but this sounds way too complicated to
>> me, both on the guest and the hypervisor side.
> 
> On the hypervisor side it can be literally nothing if we don't
> want to enforce buffer size.


Just so we understand each other: what you mean by "appended to guest
memory" is "append to the guest memory size", not actually "append
memory via virtio-balloon", like adding memory regions and stuff.

Instead of "-m 4G" you would do "-m 5G -device virtio-balloon,hint_size=1G".


So to summarize my opinion, this looks like a *possible* improvement for
the future *if* we realize that a simple approach does not work. If we
simply add more memory for the hinting buffer to the VM memory size, we
will have to assume it will be used by guests, even if they are not
malicious (e.g. a guest without the virtio-balloon driver or without the
enhancement). NUMA and different page granularities in the guest are
the tricky parts.

-- 

Thanks,

David / dhildenb