linux-kernel - Re: On guest free page hinting and OOM

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <29e11829-c9ac-a21b-b2f1-ed833e4ca449@redhat.com>
Date:   Mon, 1 Apr 2019 16:09:32 +0200
From:   David Hildenbrand <david@...hat.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     Nitesh Narayan Lal <nitesh@...hat.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        pbonzini@...hat.com, lcapitulino@...hat.com, pagupta@...hat.com,
        wei.w.wang@...el.com, yang.zhang.wz@...il.com, riel@...riel.com,
        dodgen@...gle.com, konrad.wilk@...cle.com, dhildenb@...hat.com,
        aarcange@...hat.com, alexander.duyck@...il.com
Subject: Re: On guest free page hinting and OOM

>> Thinking about your approach, there is one elementary thing to notice:
>>
>> Giving the guest pages from the buffer while hinting requests are being
>> processed means that the guest can and will temporarily make use of more
>> memory than desired. Essentially up to the point where MADV_FREE is
>> finally called for the hinted pages.
> 
> Right - but that seems like exactly the reverse of the issue with the current
> approach which is guest can temporarily use less memory than desired.
> 
>> Even then the guest will logicall
>> make use of more memory than desired until core MM takes pages away.
> 
> That sounds more like a host issue though. If it wants to
> it can use e.g. MAD_DONTNEED.

Indeed. But MADV_DONTNEED is somewhat undesired for performance reasons.
You want to do the work when swapping not when hinting.

But what I wanted to say here: Looking at the pure size of your guest
will at least not help you to identify if more memory than desired will
be used.

> 
>> So:
>> 1) Unmodified guests will make use of more memory than desired.
> 
> One interesting possibility for this is to add the buffer memory
> by hotplug after the feature has been negotiated.
> I agree this sounds complex.

Yes it is, and it goes into the direction of virtio-mem that essentially
does that. But bad news: memory hotplug is complicated stuff, both on
the hypervisor and guest side. And things like NUMA make it more involved.

But even then, malicious guest can simply fake feature negotiation and
make use of all hotplugged memory. Won't work, at least not for
malicious guests.

> 
> But I have an idea: how about we include the hint size in the
> num_pages counter? Then unmodified guests put
> it in the balloon and don't use it. Modified ones
> will know to use it just for hinting.

These are the nightmares I was talking about. I would like to decouple
this feature as far as possible from balloon inflation/deflation.
Ballooning is 4k based and might have other undesirable side effect.
Just because somebody wants to use page hinting does not mean he wants
to use ballooning. Effectively, many people will want to avoid
ballooning completely by using page hinting for their use case.

> 
> 
>> 2) Malicious guests will make use of more memory than desired.
> 
> Well this limitation is fundamental to balloon right?

Yep, it is the fundamental issue of ballooning. If memory is available
right from the boot, the system is free to do with it whatever it wants.
(one of the main things virtio-mem will do differently/better)

> If host wants to add tracking of balloon memory, it
> can enforce the limits. So far no one bothered,
> but maybe with this feature we should start to do that.

I think I already had endless rants about why this is not possible.
Ballooning as it is currently implemented by virtio-balloon is broken by
design. Period. You can and never will be able to distinguish unmodified
guests from malicious guests. Please don't design new approaches based
on broken design.

> 
>> 3) Sane, modified guests will make use of more memory than desired.
>>
>> Instead, we could make our life much easier by doing the following:
>>
>> 1) Introduce a parameter to cap the amount of memory concurrently hinted
>> similar like you suggested, just don't consider it a buffer value.
>> "-device virtio-balloon,hinting_size=1G". This gives us control over the
>> hinting proceess.
>>
>> hinting_size=0 (default) disables hinting
>>
>> The admin can tweak the number along with memory requirements of the
>> guest. We can make suggestions (e.g. calculate depending on #cores,#size
>> of memory, or simply "1GB")
> 
> So if it's all up to the guest and for the benefit of the guest, and
> with no cost/benefit to the host, then why are we supplying this value
> from the host?

See 3), the admin has to be aware of hinting behavior.

> 
>> 2) In the guest, track the size of hints in progress, cap at the
>> hinting_size.
>>
>> 3) Document hinting behavior
>>
>> "When hinting is enabled, memory up to hinting_size might temporarily be
>> removed from your guest in order to be hinted to the hypervisor. This is
>> only for a very short time, but might affect applications. Consider the
>> hinting_size when sizing your guest. If your application was tested with
>> XGB and a hinting size of 1G is used, please configure X+1GB for the
>> guest. Otherwise, performance degradation might be possible."
> 
> OK, so let's start with this. Now let us assume that guest follows
> the advice.  We thus know that 1GB is not needed for guest applications.
> So why do we want to allow applications to still use this extra memory?

If the application does not need the 1GB, the 1GB will be hinted to the
hypervisor and are effectively only a buffer for the OOM scenario.
(ignoring page cache discussions for now).

"So why do we want to allow applications to still use this extra memory"
is the EXACT same issue you have with your buffer approach. Any guest
can make use of the buffer and you won't be able to detect it. Very same
problem. Only in your approach, the guest might agree to play nicely by
not making use of the 1G you provided. Just as if the application does
not need/use the additional 1GB.

The interesting thing is most probably: Will the hinting size usually be
reasonable small? At least I guess a guest with 4TB of RAM will not
suddenly get a hinting size of hundreds of GB. Most probably also only
something in the range of 1GB. But this is an interesting question to
look into.

Also, if the admin does not care about performance implications when
already close to hinting, no need to add the additional 1Gb to the ram size.

> 
>> 4) Do the loop/yield on OOM as discussed to improve performance when OOM
>> and avoid false OOM triggers just to be sure.
> 
> Yes, I'm not against trying the simpler approach as a first step.  But
> then we need this path actually tested so see whether hinting introduced
> unreasonable overhead on this path.  And it is tricky to test oom as you
> are skating close to system's limits. That's one reason I prefer
> avoiding oom handler if possible.

The issue with the actual issue we are chasing is that it can only
happen if (as far as I see)

1) Application uses X MAX_ORDER - 1 pages
2) Application frees X MAX_ORDER - 1 pages
3) Application reallocates X MAX_ORDER - 1 pages

AND

a) There are not enough MAX_ORDER - 1 pages remaining while hinting
b) The allocation request cannot be satisfied from other free pages of
smaller order
c) We actually trigger hinting with the X freed pages
d) Time between 2 and 3 is not enough to complete hinting

Only then the OOM handler will get active. If between 2) and 3)
reasonable time has passed, it is not an issue.

And as I said right from the beginning, reproducing this might be
difficult. And especially for this reason, I prefer simpler approaches
if possible. Document it for applications that might be affected, let
the admin enable the feature explicitly, avoid complexity.

> 
> When you say yield, I would guess that would involve config space access
> to the balloon to flush out outstanding hints?

I rather meant yield your CPU to the hypervisor, so it can process
hinting requests faster (like waiting for a spinlock). This is the
simple case. More involved approaches might somehow indicate to the
hypervisor to not process queued requests but simply return them to the
guest so the guest can add the isolated pages to the buddy. If this is
"config space access to the balloon to flush out outstanding hints" then
yes, something like that might be a good idea if it doesn't harm
performance.

-- 

Thanks,

David / dhildenb