linux-kernel - Re: On guest free page hinting and OOM

Message-ID: <d105e3c7-52b4-de94-9f61-0aee5442d463@redhat.com>
Date:   Tue, 2 Apr 2019 20:21:30 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Alexander Duyck <alexander.duyck@...il.com>
Cc:     "Michael S. Tsirkin" <mst@...hat.com>,
        Nitesh Narayan Lal <nitesh@...hat.com>,
        kvm list <kvm@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-mm <linux-mm@...ck.org>,
        Paolo Bonzini <pbonzini@...hat.com>, lcapitulino@...hat.com,
        pagupta@...hat.com, Yang Zhang <yang.zhang.wz@...il.com>,
        Rik van Riel <riel@...riel.com>, dodgen@...gle.com,
        Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
        dhildenb@...hat.com, Andrea Arcangeli <aarcange@...hat.com>,
        Dave Hansen <dave.hansen@...el.com>
Subject: Re: On guest free page hinting and OOM

On 02.04.19 19:45, Alexander Duyck wrote:
> On Tue, Apr 2, 2019 at 10:09 AM David Hildenbrand <david@...hat.com> wrote:
>>
>> On 02.04.19 18:18, Alexander Duyck wrote:
>>> On Tue, Apr 2, 2019 at 8:57 AM David Hildenbrand <david@...hat.com> wrote:
>>>>
>>>> On 02.04.19 17:25, Michael S. Tsirkin wrote:
>>>>> On Tue, Apr 02, 2019 at 08:04:00AM -0700, Alexander Duyck wrote:
>>>>>> Basically what we would be doing is providing a means for
>>>>>> incrementally transitioning the buddy memory into the idle/offline
>>>>>> state to reduce guest memory overhead. It would require one function
>>>>>> that would walk the free page lists and pluck out pages that don't
>>>>>> have the "Offline" page type set,
>>>>>
>>>>> I think we will need an interface that gets
>>>>> an offline page and returns the next online free page.
>>>>>
>>>>> If we restart the list walk each time we can't guarantee progress.
>>>>
>>>> Yes, and essentially we are scanning all the time for chunks vs. we get
>>>> notified which chunks are possible hinting candidates. Totally different
>>>> design.
>>>
>>> The problem as I see it is that we can miss notifications if we become
>>> too backlogged, and that will lead to us having to fall back to
>>> scanning anyway. So instead of trying to implement both why don't we
>>> just focus on the scanning approach. Otherwise the only other option
>>> is to hold up the guest and make it wait until the hint processing has
>>> completed and at that point we are back to what is essentially just a
>>> synchronous solution with batching anyway.
>>>
>>
>> In general I am not a fan of "there might be a problem, let's try
>> something completely different". Expect the unexpected. At this point, I
>> prefer to think about easy solutions to eventual problems, not
>> completely new designs. As I said, we've been there already.
> 
> The solution as we have it is not "easy". There are a number of race
> conditions contained within the code and it doesn't practically scale
> when you consider we are introducing multiple threads in both the
> isolation and returning of pages to/from the buddy allocator that will
> have to function within the zone lock.

We are freeing pages already, so we are already using the zone lock. We
essentially only care about two zones (maybe three). There is a lot of
optimization potential.

Regarding the scaling issue, I'd love to have a benchmark showcase this
issue.

> 
>> Related to "falling behind" with hinting. If this is indeed possible
>> (and I'd like to know under which conditions), I wonder at which point
>> we no longer care about missed hints. If our guest has a lot of MM
>> activity, it could be a good thing that we are dropping hints: the
>> guest is so busy that it will reuse the pages soon anyway.
> 
> This is making a LOT of assumptions. There are a few scenarios that
> can hold up hinting on the host side. One of the limitations of
> madvise is that we have to take the mm read semaphore. So if something
> is sitting on the write semaphore all of the hints will be blocked
> until it is released.
> 
>> One important point is - I think - that free page hinting does not have
>> to fit all possible setups. In certain environments it just makes sense
>> to disable it. Or live with it not giving you "all the hints". E.g.
>> databases that eat up all free memory either way. The other extreme
>> would be a simple webserver that is mostly idle.
> 
> My concern is we are introducing massive buffer bloat in the mm
> subsystem and it still has the potential for stalling VCPUs if we
> don't have room in the VQs. We went through this back in the day with
> networking. Adding more buffers is not the solution. The solution is
> to have a way to gracefully recover and keep our hinting latency and
> buffer bloat to a minimum.

I think the main point is that, in contrast to real requests, we can
always skip or delay hinting if it "just isn't the right time". I think
that is what is special about "hinting". So much, at least, for
"stalling VCPUs if we don't have room in the VQs".
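
To illustrate the "skip or delay" idea, here is a minimal user-space sketch
in Python. It is not code from any posted patch series: the `HintQueue`
class, its method names, and the fixed `capacity` are all hypothetical
stand-ins for a virtqueue with a limited number of slots. The only point
it makes is that a full queue leads to a dropped hint, not to a blocked
caller.

```python
from collections import deque


class HintQueue:
    """Toy stand-in for a virtqueue with a fixed number of slots (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()
        self.dropped = 0

    def try_hint(self, pfn):
        """Queue a hint if there is room; otherwise drop it instead of blocking."""
        if len(self.pending) >= self.capacity:
            self.dropped += 1
            return False
        self.pending.append(pfn)
        return True

    def host_process(self, budget):
        """Model the host consuming up to `budget` hints, in FIFO order."""
        done = []
        while self.pending and len(done) < budget:
            done.append(self.pending.popleft())
        return done
```

With `capacity=2`, a third `try_hint()` call returns False and bumps
`dropped` without ever waiting; once `host_process()` has drained a slot,
later hints are accepted again. Whether dropping is acceptable for a given
workload is exactly the open question in this thread.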

I agree, adding buffers is not the solution; that's why I also didn't
like Michael's approach.

> 
>> We are already losing hints for quite some free memory due to the MAX_ORDER
>> - X discussion. Dropping a couple of other hints shouldn't really hurt.
>> The question is, are there scenarios where we can completely screw up.
> 
> My concern is that it can hurt a ton. In my mind the target for a
> feature like this is a guest that has something like an application
> that will fire up a few times a day, eat up a massive amount of memory,
> and then free it all when it is done. Now if that application is
> freeing a massive block of memory and for whatever reason the QEMU
> thread that is translating our hint requests to madvise calls cannot
> keep up then we are going to spend the next several hours with that
> memory still assigned to an idle guest.
> 

Very special case I was mentioning. This is *AFAIK* not your usual
workload. It is an interesting use case to have in mind, though, and a
good corner case for where things might go wrong.

The other extreme is a system that barely frees any (MAX_ORDER - X) pages,
yet your thread will waste cycles scanning for them.
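
As a rough illustration of that scanning cost, and of the earlier point
that a scan has to resume where it left off to guarantee progress, here
is a toy Python model. It is only a sketch under simplifying assumptions:
the free list is a plain Python list of chunk orders, and
`scan_free_list`, `min_order`, `cursor` and `batch` are hypothetical names,
not kernel interfaces.

```python
def scan_free_list(orders, min_order, cursor, batch):
    """Scan `orders` (the order of each free chunk) starting at `cursor`.

    Returns a tuple of:
      - indices of up to `batch` chunks whose order is >= min_order,
      - the cursor to resume from on the next call,
      - the number of entries inspected during this call.
    """
    found = []
    inspected = 0
    i = cursor
    while i < len(orders) and len(found) < batch:
        inspected += 1
        if orders[i] >= min_order:
            found.append(i)
        i += 1
    return found, i, inspected
```

Passing the returned cursor back in means every call moves forward, so the
walk cannot get stuck re-visiting the same entries. On a list with no
chunk of sufficient order, a single call inspects every entry and returns
nothing, which is the wasted work described above.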

-- 

Thanks,

David / dhildenb