linux-kernel - Thoughts on simple scanner approach for free page hinting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKgT0Ue4LufT4q4dLwjqhGRpDbVnucNWhmhwWxbwtytgjxx+Kw@mail.gmail.com>
Date:   Fri, 5 Apr 2019 17:09:45 -0700
From:   Alexander Duyck <alexander.duyck@...il.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>,
        David Hildenbrand <david@...hat.com>,
        Nitesh Narayan Lal <nitesh@...hat.com>
Cc:     kvm list <kvm@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-mm <linux-mm@...ck.org>,
        Paolo Bonzini <pbonzini@...hat.com>, lcapitulino@...hat.com,
        pagupta@...hat.com, wei.w.wang@...el.com,
        Yang Zhang <yang.zhang.wz@...il.com>,
        Rik van Riel <riel@...riel.com>, dodgen@...gle.com,
        Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
        dhildenb@...hat.com, Andrea Arcangeli <aarcange@...hat.com>
Subject: Thoughts on simple scanner approach for free page hinting

So I am starting this thread as a spot to collect my thoughts on the
current guest free page hinting design as well as point out a few
possible things we could do to improve upon it.

1. The current design isn't likely going to scale well to multiple
VCPUs. The issue specifically is that the zone lock must be held to
pull pages off of the free list and to place them back there once they
have been hinted upon. As a result it would likely make sense to try
to limit ourselves to only having one thread performing the actual
hinting so that we can avoid running into issues with lock contention
between threads.

2. There are currently concerns about the hinting triggering false OOM
situations if too much memory is isolated while it is being hinted. My
thought on this is to simply avoid the issue by only hint on a limited
amount of memory at a time. Something like 64MB should be a workable
limit without introducing much in the way of regressions. However as a
result of this we can easily be overrun while waiting on the host to
process the hinting request. As such we will probably need a way to
walk the free list and free pages after they have been freed instead
of trying to do it as they are freed.

3. Even with the current buffering which is still on the larger side
it is possible to overrun the hinting limits if something causes the
host to stall and a large swath of memory is released. As such we are
still going to need some sort of scanning mechanism or will have to
live with not providing accurate hints.

4. In my opinion, the code overall is likely more complex then it
needs to be. We currently have 2 allocations that have to occur every
time we provide a hint all the way to the host, ideally we should not
need to allocate more memory to provide hints. We should be able to
hold the memory use for a memory hint device constant and simply map
the page address and size to the descriptors of the virtio-ring.

With that said I have a few ideas that may help to address the 4
issues called out above. The basic idea is simple. We use a high water
mark based on zone->free_area[order].nr_free to determine when to wake
up a thread to start hinting memory out of a given free area. From
there we allocate non-"Offline" pages from the free area and assign
them to the hinting queue up to 64MB at a time. Once the hinting is
completed we mark them "Offline" and add them to the tail of the
free_area. Doing this we should cycle the non-"Offline" pages slowly
out of the free_area. In addition the search cost should be minimal
since all of the "Offline" pages should be aggregated to the tail of
the free_area so all pages allocated off of the free_area will be the
non-"Offline" pages until we shift over to them all being "Offline".
This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
since the only real consumer of add_to_free_area_tail is
__free_one_page which uses it to place a page with an order less than
MAX_ORDER - 2 on the tail of a free_area assuming that it should be
freeing the buddy of that page shortly. The only other issue with
adding to tail would be the memory shuffling which was recently added,
but I don't see that as being something that will be enabled in most
cases so we could probably just make the features mutually exclusive,
at least for now.

So if I am not mistaken this would essentially require a couple
changes to the mm infrastructure in order for this to work.

First we would need to split nr_free into two counters, something like
nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
value currently used for nr_free. When we pulled the pages for hinting
we would reduce the nr_freed value and then add back to it when the
pages are returned. When pages are allocated they would increment the
nr_bound value. The idea behind this is that we can record nr_free
when we collect the pages and save it to some local value. This value
could then tell us how many new pages have been added that have not
been hinted upon.

In addition we will need some way to identify which pages have been
hinted on and which have not. The way I believe easiest to do this
would be to overload the PageType value so that we could essentially
have two values for "Buddy" pages. We would have our standard "Buddy"
pages, and "Buddy" pages that also have the "Offline" value set in the
PageType field. Tracking the Online vs Offline pages this way would
actually allow us to do this with almost no overhead as the mapcount
value is already being reset to clear the "Buddy" flag so adding a
"Offline" flag to this clearing should come at no additional cost.

Lastly we would need to create a specialized function for allocating
the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
"Offline" pages. I'm thinking the alloc function it would look
something like __rmqueue_smallest but without the "expand" and needing
to modify the !page check to also include a check to verify the page
is not "Offline". As far as the changes to __free_one_page it would be
a 2 line change to test for the PageType being offline, and if it is
to call add_to_free_area_tail instead of add_to_free_area.

Anyway this email ended up being pretty massive by the time I was
done. Feel free to reply to parts of it and we can break it out into
separate threads of discussion as necessary. I will start working on
coding some parts of this next week.

Thanks.

- Alex