Message-ID: <23bf2735-c61e-9ade-30f9-4d314617283e@redhat.com>
Date:   Mon, 8 Apr 2019 20:27:41 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Nitesh Narayan Lal <nitesh@...hat.com>
Cc:     kvm list <kvm@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-mm <linux-mm@...ck.org>,
        Paolo Bonzini <pbonzini@...hat.com>, lcapitulino@...hat.com,
        pagupta@...hat.com, wei.w.wang@...el.com,
        Yang Zhang <yang.zhang.wz@...il.com>,
        Rik van Riel <riel@...riel.com>, dodgen@...gle.com,
        Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
        dhildenb@...hat.com, Andrea Arcangeli <aarcange@...hat.com>,
        Alexander Duyck <alexander.duyck@...il.com>,
        "Michael S. Tsirkin" <mst@...hat.com>
Subject: Re: Thoughts on simple scanner approach for free page hinting

On 08.04.19 20:09, Nitesh Narayan Lal wrote:
> On 4/8/19 12:36 PM, David Hildenbrand wrote:
>> On 06.04.19 02:09, Alexander Duyck wrote:
>>> So I am starting this thread as a spot to collect my thoughts on the
>>> current guest free page hinting design as well as point out a few
>>> possible things we could do to improve upon it.
>>>
>>> 1. The current design isn't likely going to scale well to multiple
>>> VCPUs. The issue specifically is that the zone lock must be held to
>>> pull pages off of the free list and to place them back there once they
>>> have been hinted upon. As a result it would likely make sense to try
>>> to limit ourselves to only having one thread performing the actual
>>> hinting so that we can avoid running into issues with lock contention
>>> between threads.
>> Makes sense.
>>
>>> 2. There are currently concerns about the hinting triggering false OOM
>>> situations if too much memory is isolated while it is being hinted. My
>>> thought on this is to simply avoid the issue by only hinting on a
>>> limited amount of memory at a time. Something like 64MB should be a
>>> workable limit without introducing much in the way of regressions.
>>> However as a result of this we can easily be overrun while waiting on
>>> the host to process the hinting request. As such we will probably need
>>> a way to walk the free list and hint on pages after they have been
>>> freed instead of trying to do it as they are freed.
>> We will need such a way in case we care about dropped hinting requests, yes.
>>
>>> 3. Even with the current buffering which is still on the larger side
>>> it is possible to overrun the hinting limits if something causes the
>>> host to stall and a large swath of memory is released. As such we are
>>> still going to need some sort of scanning mechanism or will have to
>>> live with not providing accurate hints.
>> Yes - usually, if there is a lot of guest activity, you could assume
>> that free pages might get reused again soon anyway. Of course, a
>> special case is "freeing X GB and being idle afterwards".
>>
>>> 4. In my opinion, the code overall is likely more complex than it
>>> needs to be. We currently have 2 allocations that have to occur every
>>> time we provide a hint all the way to the host; ideally we should not
>>> need to allocate more memory to provide hints. We should be able to
>>> hold the memory use for a memory hint device constant and simply map
>>> the page address and size to the descriptors of the virtio-ring.
>> I don't think the two allocations are that complex. The only thing I
>> consider complex is isolating a lot of pages from different zones etc.
>> Two allocations, nobody really cares about that. Of course, the fact
>> that we have to allocate memory from the VCPUs where we just freed
>> a page is not optimal. I would rather consider that the actual
>> problem/complexity.
>>
>> Especially you have a point regarding scalability and multiple VCPUs.
>>
>>> With that said I have a few ideas that may help to address the 4
>>> issues called out above. The basic idea is simple. We use a high water
>>> mark based on zone->free_area[order].nr_free to determine when to wake
>>> up a thread to start hinting memory out of a given free area. From
>>> there we allocate non-"Offline" pages from the free area and assign
>>> them to the hinting queue up to 64MB at a time. Once the hinting is
>>> completed we mark them "Offline" and add them to the tail of the
>>> free_area. Doing this we should cycle the non-"Offline" pages slowly
>>> out of the free_area. In addition the search cost should be minimal
>>> since all of the "Offline" pages should be aggregated to the tail of
>>> the free_area so all pages allocated off of the free_area will be the
>>> non-"Offline" pages until we shift over to them all being "Offline".
>>> This should be effective for MAX_ORDER - 1 and MAX_ORDER - 2 pages
>>> since the only real consumer of add_to_free_area_tail is
>>> __free_one_page which uses it to place a page with an order less than
>>> MAX_ORDER - 2 on the tail of a free_area assuming that it should be
>>> freeing the buddy of that page shortly. The only other issue with
>>> adding to tail would be the memory shuffling which was recently added,
>>> but I don't see that as being something that will be enabled in most
>>> cases so we could probably just make the features mutually exclusive,
>>> at least for now.
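
(Just to make sure I follow, the hinting loop you describe would look
roughly like the sketch below. get_unhinted_page(), hint_pages_to_host()
and HINTING_BATCH_PAGES are names I made up, and restoring the buddy
metadata (order, PageBuddy) on the way back is glossed over - this is an
illustration, not a tested patch.)

static void hint_zone_free_area(struct zone *zone, unsigned int order)
{
	struct page *page, *next;
	unsigned long isolated = 0;
	LIST_HEAD(batch);

	spin_lock_irq(&zone->lock);
	/* pull not-yet-hinted (non-"Offline") pages, up to ~64MB worth */
	while (isolated < HINTING_BATCH_PAGES) {
		page = get_unhinted_page(zone, order);	/* hypothetical */
		if (!page)
			break;
		list_add(&page->lru, &batch);
		isolated += 1UL << order;
	}
	spin_unlock_irq(&zone->lock);

	/* report the isolated pages to the host, e.g. via the virtio ring */
	hint_pages_to_host(&batch);			/* hypothetical */

	spin_lock_irq(&zone->lock);
	list_for_each_entry_safe(page, next, &batch, lru) {
		list_del(&page->lru);
		__SetPageOffline(page);		/* remember: already hinted */
		add_to_free_area_tail(page, &zone->free_area[order],
				      get_pageblock_migratetype(page));
	}
	spin_unlock_irq(&zone->lock);
}
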
>>>
>>> So if I am not mistaken this would essentially require a couple
>>> changes to the mm infrastructure in order for this to work.
>>>
>>> First we would need to split nr_free into two counters, something like
>>> nr_freed and nr_bound. You could use nr_freed - nr_bound to get the
>>> value currently used for nr_free. When we pull the pages for hinting
>>> we would reduce the nr_freed value and then add back to it when the
>>> pages are returned. When pages are allocated they would increment the
>>> nr_bound value. The idea behind this is that we can record nr_free
>>> when we collect the pages and save it to some local value. This value
>>> could then tell us how many new pages have been added that have not
>>> been hinted upon.
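
(Purely for illustration, I read the counter split as something like the
following - the field and helper names are made up, nothing of this
exists today:)

struct free_area {
	struct list_head	free_list[MIGRATE_TYPES];
	unsigned long		nr_freed;	/* bumped when pages are freed/returned */
	unsigned long		nr_bound;	/* bumped when pages are allocated */
};

/* the value nr_free reports today */
static inline unsigned long free_area_nr_free(struct free_area *area)
{
	return area->nr_freed - area->nr_bound;
}
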
>> I can imagine that quite some people will have problems with such
>> "virtualization-specific changes" splattered around core memory
>> management. Would there be a way to manage this data somewhere else,
>> outside of core-mm, and somehow work on it via callbacks?
>>
>>> In addition we will need some way to identify which pages have been
>>> hinted on and which have not. The easiest way to do this, I believe,
>>> would be to overload the PageType value so that we could essentially
>>> have two values for "Buddy" pages. We would have our standard "Buddy"
>>> pages, and "Buddy" pages that also have the "Offline" value set in the
>>> PageType field. Tracking the Online vs Offline pages this way would
>>> actually allow us to do this with almost no overhead as the mapcount
>>> value is already being reset to clear the "Buddy" flag so adding an
>>> "Offline" flag to this clearing should come at no additional cost.
>> Just noting here that this will require modifications to kdump
>> (makedumpfile to be precise, and the vmcore information exposed from
>> the kernel), as kdump only checks for the actual mapcount value to
>> detect buddy and offline pages (to exclude them from dumps); they are
>> not treated as flags.
>>
>> For now, the mapcount values are really only distinct values, meaning
>> the individual bits are not of interest the way flags would be. Reusing
>> other flags would make our life a lot easier, e.g. PG_young or so. But
>> clearing those is then the problematic part.
>>
>> Of course, we could use two values in the kernel, Buddy and BuddyOffline.
>> But then we would have to check for two different values whenever we want
>> to identify a buddy page in the kernel.
>>
>>> Lastly we would need to create a specialized function for allocating
>>> the non-"Offline" pages, and to tweak __free_one_page to tail enqueue
>>> "Offline" pages. I'm thinking the alloc function it would look
>>> something like __rmqueue_smallest but without the "expand" and needing
>>> to modify the !page check to also include a check to verify the page
>>> is not "Offline". As far as the changes to __free_one_page it would be
>>> a 2 line change to test for the PageType being offline, and if it is
>>> to call add_to_free_area_tail instead of add_to_free_area.
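
(Spelled out, the free path change you describe would be roughly the
following, using the helpers you mention - untested, and the exact spot
in __free_one_page() where this ends up is hand-waved:)

	if (PageOffline(page))	/* already hinted: queue at the tail */
		add_to_free_area_tail(page, &zone->free_area[order],
				      migratetype);
	else
		add_to_free_area(page, &zone->free_area[order], migratetype);
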
>> As already mentioned, there might be scenarios where the additional
>> hinting thread consumes too many CPU cycles, especially if there is
>> little guest activity and you mostly spend time scanning a handful of
>> free pages and reporting them. I wonder if we can somehow limit the
>> number of wakeups/scans in a given period to mitigate this issue.
>>
>> One main issue I see with your approach is that we need quite a lot of
>> core memory management changes. This is a problem. I wonder if we can
>> factor out most parts into callbacks.
>>
>> E.g. in order to decide where to queue a certain page (front/tail), call
>> a callback if one is registered, mark/check pages as offline in a way
>> unknown to core-mm, etc.
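
To illustrate what I mean by callbacks: something as small as the
following might already be enough to keep the virtualization specifics
out of core-mm. This is completely made up, just to show the direction:

/* hypothetical hook set, registered by the hinting driver */
struct page_hinting_hooks {
	/* should this freed page be queued at the tail of the free list? */
	bool (*queue_to_tail)(struct page *page, unsigned int order);
	/* record that a page of the given order was freed */
	void (*page_freed)(struct page *page, unsigned int order);
};

int register_page_hinting_hooks(const struct page_hinting_hooks *hooks);
void unregister_page_hinting_hooks(void);
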
>>
>> I still wonder if there could be an easier way to combine recording of
>> hints and one hinting thread, essentially avoiding scanning and some of
>> the required core-mm changes.
> In order to resolve the scalability issues associated with my
> patch-series without compromising on free memory hints, I may explore
> the idea described below:
> - Use an xbitmap (if possible - earlier suggested by Rik and Wei),
> corresponding to each zone, at a granularity of MAX_ORDER - 2, to track
> the freed PFNs.
> - Define and use counters corresponding to each zone to monitor the
> amount of memory freed.
> - As soon as the 64MB free memory threshold is hit, wake up the kernel
> thread which will scan this xbitmap and try to isolate the pages and
> clear the corresponding bits. (We still have to acquire the zone lock to
> protect the respective xbitmap.)
> - Report the isolated pages back to the host in a synchronous manner.
> I still have to work on several details of this idea, including the
> xbitmap, but would first like to hear any suggestions/thoughts.

As discussed offline, I think this is the right approach to follow.

a) We have an asynchronous hinting thread as suggested by Alex.
b) We record hints and only have to scan an xbitmap, not some random lists.
c) We have minimal core-mm changes.

Best of both worlds and sounds simple.
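
Just to make sure we mean the same thing, the recording side I picture is
roughly the sketch below. All names (free_pfn_bitmap, freed_for_hinting,
hinting_wq) are invented, a plain bitmap stands in for the xbitmap, and
none of this is tested:

#define HINTING_ORDER		(MAX_ORDER - 2)
#define HINTING_THRESHOLD_PAGES	((64UL << 20) >> PAGE_SHIFT)	/* 64MB */

/* called from the freeing path, under zone->lock */
static void record_freed_page(struct zone *zone, struct page *page,
			      unsigned int order)
{
	unsigned long bit = (page_to_pfn(page) - zone->zone_start_pfn)
				>> HINTING_ORDER;

	set_bit(bit, zone->free_pfn_bitmap);		/* invented field */
	zone->freed_for_hinting += 1UL << order;	/* invented counter */

	/* once ~64MB has been freed, kick the hinting kthread to scan,
	 * isolate the pages and report them to the host */
	if (zone->freed_for_hinting >= HINTING_THRESHOLD_PAGES)
		wake_up(&hinting_wq);
}
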

-- 

Thanks,

David / dhildenb
