linux-kernel - Re: On guest free page hinting and OOM

Message-ID: <CAKgT0UdHA66z1j=3H06AfgtiF4ThFdXwQ6i8p1MszdL2bRHeZQ@mail.gmail.com>
Date:   Tue, 2 Apr 2019 10:45:43 -0700
From:   Alexander Duyck <alexander.duyck@...il.com>
To:     David Hildenbrand <david@...hat.com>
Cc:     "Michael S. Tsirkin" <mst@...hat.com>,
        Nitesh Narayan Lal <nitesh@...hat.com>,
        kvm list <kvm@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-mm <linux-mm@...ck.org>,
        Paolo Bonzini <pbonzini@...hat.com>, lcapitulino@...hat.com,
        pagupta@...hat.com, Yang Zhang <yang.zhang.wz@...il.com>,
        Rik van Riel <riel@...riel.com>, dodgen@...gle.com,
        Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
        dhildenb@...hat.com, Andrea Arcangeli <aarcange@...hat.com>,
        Dave Hansen <dave.hansen@...el.com>
Subject: Re: On guest free page hinting and OOM

On Tue, Apr 2, 2019 at 10:09 AM David Hildenbrand <david@...hat.com> wrote:
>
> On 02.04.19 18:18, Alexander Duyck wrote:
> > On Tue, Apr 2, 2019 at 8:57 AM David Hildenbrand <david@...hat.com> wrote:
> >>
> >> On 02.04.19 17:25, Michael S. Tsirkin wrote:
> >>> On Tue, Apr 02, 2019 at 08:04:00AM -0700, Alexander Duyck wrote:
> >>>> Basically what we would be doing is providing a means for
> >>>> incrementally transitioning the buddy memory into the idle/offline
> >>>> state to reduce guest memory overhead. It would require one function
> >>>> that would walk the free page lists and pluck out pages that don't
> >>>> have the "Offline" page type set,
> >>>
> >>> I think we will need an interface that gets
> >>> an offline page and returns the next online free page.
> >>>
> >>> If we restart the list walk each time we can't guarantee progress.
> >>
> >> Yes, and essentially we are scanning all the time for chunks vs. we get
> >> notified which chunks are possible hinting candidates. Totally different
> >> design.
> >
> > The problem as I see it is that we can miss notifications if we become
> > too backlogged, and that will lead to us having to fall back to
> > scanning anyway. So instead of trying to implement both why don't we
> > just focus on the scanning approach. Otherwise the only other option
> > is to hold up the guest and make it wait until the hint processing has
> > completed and at that point we are back to what is essentially just a
> > synchronous solution with batching anyway.
> >
>
> In general I am not a fan of "there might be a problem, let's try
> something completely different". Expect the unexpected. At this point, I
> prefer to think about easy solutions to eventual problems, not
> completely new designs. As I said, we've been there already.

The solution as we have it is not "easy". There are a number of race
conditions contained within the code and it doesn't practically scale
when you consider we are introducing multiple threads in both the
isolation and returning of pages to/from the buddy allocator that will
have to function within the zone lock.

> Related to "falling behind" with hinting. If this is indeed possible
> (and I'd like to know under which conditions), I wonder at which point
> we no longer care about missed hints. If our guest has a lot of MM
> activity, it could be good that we are dropping hints, because our
> guest is so busy that it will reuse the pages again soon.

This is making a LOT of assumptions. There are a few scenarios that
can hold up hinting on the host side. One of the limitations of
madvise is that we have to take the mm read semaphore. So if something
is sitting on the write semaphore all of the hints will be blocked
until it is released.
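
To make the host-side step concrete, here is a minimal sketch of the kind of
madvise call a hint ends up being translated into. It assumes a Linux host and
Python 3.8 or later. MADV_DONTNEED is an assumption (the thread only says
"madvise"), the anonymous mapping merely stands in for guest memory, and
discard_range is a hypothetical helper name.

```python
import mmap

PAGE = mmap.PAGESIZE


def discard_range(buf, offset, length):
    """Tell the kernel the given range is no longer needed.

    Assumption: MADV_DONTNEED is the advice used. On Linux this call has
    to take the mm semaphore for read, which is the blocking point
    described above.
    """
    buf.madvise(mmap.MADV_DONTNEED, offset, length)


# Private anonymous mapping standing in for a slice of guest memory.
buf = mmap.mmap(-1, 4 * PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
buf[:] = b"\xaa" * len(buf)          # touch every page so it is backed

# "Hint" that the two middle pages are free.
discard_range(buf, PAGE, 2 * PAGE)

# For private anonymous memory the discarded pages read back as zeroes,
# while the untouched pages keep their contents.
middle = buf[PAGE:3 * PAGE]
first = buf[:PAGE]
```

Each such call is synchronous from the caller's point of view, so one writer
holding the semaphore delays every hint queued behind it.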

> One important point is - I think - that free page hinting does not have
> to fit all possible setups. In certain environments it just makes sense
> to disable it. Or live with it not giving you "all the hints". E.g.
> databases that eat up all free memory either way. The other extreme
> would be a simple webserver that is mostly idle.

My concern is we are introducing massive buffer bloat in the mm
subsystem and it still has the potential for stalling VCPUs if we
don't have room in the VQs. We went through this back in the day with
networking. Adding more buffers is not the solution. The solution is
to have a way to gracefully recover and keep our hinting latency and
buffer bloat to a minimum.
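
The recovery idea can be sketched as a small model: a bounded hint queue that
drops notifications instead of blocking when it is full, and a consumer that
falls back to scanning the free set when drops have occurred. This is a toy
illustration under stated assumptions, not kernel or QEMU code; HintQueue,
notify_free and process_hints are hypothetical names, and pages are modeled as
plain integers.

```python
from collections import deque


class HintQueue:
    """Bounded queue of free-page hints (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()
        self.overflowed = False      # set when a notification was dropped

    def notify_free(self, pfn):
        """Record a freed page; never blocks the caller."""
        if len(self.pending) >= self.capacity:
            self.overflowed = True   # remember that hints were lost
            return False
        self.pending.append(pfn)
        return True


def process_hints(queue, free_pages, hinted):
    """Drain the queue; if notifications were dropped, rescan the free set."""
    while queue.pending:
        pfn = queue.pending.popleft()
        if pfn in free_pages:        # the page may have been reused meanwhile
            hinted.add(pfn)
    if queue.overflowed:
        # Fallback scan: pick up every free page that has not been hinted.
        hinted.update(free_pages - hinted)
        queue.overflowed = False
    return hinted


# Ten pages are freed in a burst, but the queue only has room for four.
free_pages = set(range(10))
queue = HintQueue(capacity=4)
accepted = [queue.notify_free(pfn) for pfn in sorted(free_pages)]
hinted = process_hints(queue, free_pages, set())
```

In this model the queue size bounds the buffering, the freeing path never
waits, and the scan is what guarantees that dropped hints are eventually
recovered.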

> We are already losing hinting of quite some free memory due to the MAX_ORDER
> - X discussion. Dropping a couple of other hints shouldn't really hurt.
> The question is, are there scenarios where we can completely screw up.

My concern is that it can hurt a ton. In my mind the target for a
feature like this is a guest that has something like an application
that will fire up a few times a day, eat up a massive amount of memory,
and then free it all when it is done. Now if that application is
freeing a massive block of memory and for whatever reason the QEMU
thread that is translating our hint requests to madvise calls cannot
keep up then we are going to spend the next several hours with that
memory still assigned to an idle guest.