[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190912092611.GN4023@dhcp22.suse.cz>
Date: Thu, 12 Sep 2019 11:26:11 +0200
From: Michal Hocko <mhocko@...nel.org>
To: David Hildenbrand <david@...hat.com>
Cc: "Michael S. Tsirkin" <mst@...hat.com>,
Alexander Duyck <alexander.h.duyck@...ux.intel.com>,
Alexander Duyck <alexander.duyck@...il.com>,
virtio-dev@...ts.oasis-open.org, kvm list <kvm@...r.kernel.org>,
Catalin Marinas <catalin.marinas@....com>,
Dave Hansen <dave.hansen@...el.com>,
LKML <linux-kernel@...r.kernel.org>,
Matthew Wilcox <willy@...radead.org>,
linux-mm <linux-mm@...ck.org>,
Andrew Morton <akpm@...ux-foundation.org>, will@...nel.org,
linux-arm-kernel@...ts.infradead.org,
Oscar Salvador <osalvador@...e.de>,
Yang Zhang <yang.zhang.wz@...il.com>,
Pankaj Gupta <pagupta@...hat.com>,
Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
Nitesh Narayan Lal <nitesh@...hat.com>,
Rik van Riel <riel@...riel.com>, lcapitulino@...hat.com,
"Wang, Wei W" <wei.w.wang@...el.com>,
Andrea Arcangeli <aarcange@...hat.com>, ying.huang@...el.com,
Paolo Bonzini <pbonzini@...hat.com>,
Dan Williams <dan.j.williams@...el.com>,
Fengguang Wu <fengguang.wu@...el.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
Subject: Re: [PATCH v9 0/8] stg mail -e --version=v9 \
On Thu 12-09-19 09:47:30, David Hildenbrand wrote:
> On 12.09.19 09:16, Michal Hocko wrote:
> > On Wed 11-09-19 18:09:18, David Hildenbrand wrote:
> >> On 11.09.19 15:51, Michal Hocko wrote:
> >>> On Wed 11-09-19 15:20:02, Michal Hocko wrote:
> >>> [...]
> >>>>> 4. Continuously report, not the "one time report everything" approach.
> >>>>
> >>>> So you mean the allocator reporting this rather than an external code to
> >>>> poll right? I do not know, how much this is nice to have than must have?
> >>>
> >>> Another idea that I haven't really thought through so it might turned
> >>> out to be completely bogus but let's try anyway. Your "report everything"
> >>> just made me look and realize that free_pages_prepare already performs
> >>> stuff that actually does something similar yet unrelated.
> >>>
> >>> We do report to special page poisoning, zeroying or
> >>> CONFIG_DEBUG_PAGEALLOC to unmap the address from the kernel address
> >>> space. This sounds like something fitting your model no?
> >>>
> >>
> >> AFAIKS, the poisoning/unmapping is done whenever a page is freed. I
> >> don't quite see yet how that would help to remember if a page was
> >> already reported.
> >
> > Do you still have to differ that state when each page is reported?
>
> Ah, very good point. I can see that the reason for this was not
> discussed in this thread so far. (Alexander, Nitesh, please correct me
> if I am wrong). It's buried in the long history of free page
> hinting/reporting.
It would really be preferable to summarize such a previous discussion
ideally with some references.
> Some early patch sets tried to report during every free synchronously.
> Free a page, report them to the hypervisor. This resulted in some issues
> (especially, locking-related and the virtio + the hypervisor being
> involved, resulting in unpredictable delays, quite some overhead ...).
> It was no good.
>
> One design decision then was to not report single pages, but a bunch of
> pages at once. This made it necessary to "remember" the pages to be
> reported and to temporarily block them from getting allocated while
> reporting.
>
> Nitesh implemented (at least) two "capture PFNs of free pages in an
> array when freeing" approaches. One being synchronous from the freeing
> CPU once the list was full (having similar issues as plain synchronous
> reporting) and one being asynchronous by a separate thread (which solved
> many locking issues).
>
> Turned out the a simple array can quickly lead to us having to drop
> "reports" to the hypervisor because the array is full and the reporting
> thread was not able to keep up. Not good as well. Especially, if some
> process frees a lot of memory this can happen quickly and Nitesh wa
> sable to trigger this scenario frequently.
>
> Finally, Nitesh decided to use the bitmap instead to keep track of pages
> to report. I'd like to note that this approach could still be combined
> with an "array of potentially free PFNs". Only when the array/circular
> buffer runs out of entries ("reporting thread cannot keep up"), we would
> have to go back to scanning the bitmap.
>
> That was also the point where Alexander decided to look into integrating
> tracking/handling reported/unreported pages directly in the buddy.
OK, this gives at least some background which is really appreciated.
Explaining _why_ you need something in the core MM is essential to move
forward.
> >> After reporting the page we would have to switch some
> >> state (Nitesh: bitmap bit, Alexander: page flag) to identify that.
> >
> > Yes, you can either store the state somewhere.
> >
> >> Of course, we could map the page and treat that as "the state" when we
> >> reported it, but I am not sure that's such a good idea :)
> >>
> >> As always, I might be very wrong ...
> >
> > I still do not fully understand the usecase so I might be equally wrong.
> > My thinking is along these lines. Why should you scan free pages when
> > you can effectively capture each freed page? If you go one step further
> > then post_alloc_hook would be the counterpart to know that your page has
> > been allocated.
>
> I'd like to note that Nitesh's patch set contains the following hunk,
> which is roughly what you were thinking :)
>
>
> -static inline void __free_one_page(struct page *page,
> +inline void __free_one_page(struct page *page,
> unsigned long pfn,
> struct zone *zone, unsigned int order,
> - int migratetype)
> + int migratetype, bool hint)
> {
> unsigned long combined_pfn;
> unsigned long uninitialized_var(buddy_pfn);
> @@ -980,7 +981,8 @@ static inline void __free_one_page(struct page *page,
> migratetype);
> else
> add_to_free_area(page, &zone->free_area[order], migratetype);
> -
> + if (hint)
> + page_hinting_enqueue(page, order);
> }
>
>
> (ignore the hint parameter, when he would switch to a isolate vs.
> alloc/free, that can go away and all we left is the enqueue part)
>
>
> Inside that callback we can remember the pages any way we want. Right
> now in a bitmap. Maybe later in a array + bitmap (as discussed above).
> Another idea I had was to simply go over all pages and report them when
> running into this "array full" condition. But I am not yet sure about
> the performance implications on rather large machines. So the bitmap
> idea might have some other limitations but seems to do its job.
>
> Hoe that makes things clearer and am not missing something.
It certainly helped me to get a better idea. I have commented on my
reservations regarding the approach in this thread elsewhere but at
least I _think_ I am getting a point of what you guys try to achieve.
Thanks!
--
Michal Hocko
SUSE Labs
Powered by blists - more mailing lists