[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b4be42a4-cbfc-8706-cc94-26211ddcbe4a@redhat.com>
Date: Tue, 22 Oct 2019 16:02:09 +0200
From: David Hildenbrand <david@...hat.com>
To: Michal Hocko <mhocko@...nel.org>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
virtualization@...ts.linux-foundation.org,
Andrea Arcangeli <aarcange@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Juergen Gross <jgross@...e.com>,
Pavel Tatashin <pavel.tatashin@...rosoft.com>,
Alexander Duyck <alexander.h.duyck@...ux.intel.com>,
Anthony Yznaga <anthony.yznaga@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Johannes Weiner <hannes@...xchg.org>,
Oscar Salvador <osalvador@...e.de>,
Pingfan Liu <kernelfans@...il.com>, Qian Cai <cai@....pw>,
Dan Williams <dan.j.williams@...el.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
Wei Yang <richardw.yang@...ux.intel.com>,
Alexander Potapenko <glider@...gle.com>,
Anshuman Khandual <anshuman.khandual@....com>,
Jason Gunthorpe <jgg@...pe.ca>,
Stephen Rothwell <sfr@...b.auug.org.au>,
Mauro Carvalho Chehab <mchehab+samsung@...nel.org>,
Matthew Wilcox <willy@...radead.org>,
Yu Zhao <yuzhao@...gle.com>, Minchan Kim <minchan@...nel.org>,
Yang Shi <yang.shi@...ux.alibaba.com>,
Ira Weiny <ira.weiny@...el.com>,
Andrey Ryabinin <aryabinin@...tuozzo.com>
Subject: Re: [PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with
a reference count of 0
>> Please note that we have other users that use PG_offline + refcount >= 1
>> (HyperV balloon, XEN). We should not affect these users (IOW,
>> has_unmovable_pages() has to stop right there if we see one of these pages).
>
> OK, this is exactly what I was worried about. I can see why you might
> want to go an easier way and rule those users out but wouldn't be it
> actually more reasonable to explicitly request PageOffline users to
> implement MEM_GOING_OFFLINE and prepare their offlined pages for the
> offlining operation or fail right there if that is not possible.
> If you fail right there during the isolation phase then there is no way
> to allow the offlining to proceed from that context.
I am not sure I agree. But let's discuss the details. See below.
>
>>>> 2) memory_notify(MEM_GOING_OFFLINE, &arg);
>>>> -> Here, we could release all pages to the buddy, clearing PG_offline
>>>> -> PF_offline must not be cleared so dumping tools will not touch
>>>> these pages. There is a time where pages are !PageBuddy() and
>>>> !PageOffline().
>>>
>>> Well, this is fully under control of the driver, no? Reference count
>>> shouldn't play any role here AFAIU.
>>
>> Yes, this is more a PG_offline issue. The reference count is an issue of
>> reaching this call :) If we want to go via the buddy:
>>
>> 1. Clear PG_offline
>> 2. Free page (gets set PG_buddy)
>>
>> Between 1 and 2, a dumping tool could not exclude these pages and therefore
>> try to read from these pages. That is an issue IFF we want to return the
>> pages back to the buddy instead of doing what I propose here.
>
> If the driver is going to free page to the allocator then it would have
> to claim the page back and so it is usable again. If it cannot free it
> then it would simply set the reference count to 0. It can even keep the
> PG_offline if necessary although I have to admit I am not really sure it
> is necessary.
Yes it is necessary to keep PG_offline set to avoid anybody touching the
page until the section is offline. (especially, dumping tools)
But that's another discussion. The important part is to not go via the buddy.
>
>>>> 3) scan_movable_pages() ...
>>
>> Please note that when we don't put the pages back to the buddy and don't
>> implement something like I have in this patch, we'll loop/fail here.
>> Especially if we have pages with PG_offline + refcount >= 1 .
>
> You should have your reference count 0 at this stage as it is after
> MEM_GOING_OFFLINE.
>
>>> MEM_CANCEL_OFFLINE could gain the reference back to balance the
>>> MEM_GOING_OFFLINE step.
>>
>> The pages are already unisolated and could be used by the buddy. But again,
>> I think you have an idea that tries to avoid putting pages to the buddy.
>
> Yeah, set_page_count(page, 0) if you do not want to release that page
> from the notifier context to reflect that the page is ok to be offlined
> with the rest.
>
I neither see how you deal with __test_page_isolated_in_pageblock() nor with
__offline_isolated_pages(). Sorry, but what I read is incomplete and you
probably have a full proposal in your head. Please read below how I think
you want to solve it.
>
>>> explicit control via the reference count which is the standard way to
>>> control the struct page life cycle.
>>>
>>> Anyway hooking into __put_page (which tends to be a hot path with
>>> something that is barely used on most systems) doesn't sound nice to me.
>>> This is the whole point which made me think about the whole reference
>>> count approach in the first place.
>>
>> Again, the race I think that is possible
>>
>> somebody: get_page_unless_zero(page)
>> virtio_mem: page_ref_dec(pfn_to_page(pfn)
>> somebody: put_page() -> straight to the buddy
>
> Who is that somebody? I thought that it is only the owner/driver to have
> a control over the page. Also the above is not possible as long as the
> owner/driver keeps a reference to the PageOffline page throughout the
> time it is marked that way.
>
I was reading
include/linux/mm_types.h:
"If you want to use the refcount field, it must be used in such a way
that other CPUs temporarily incrementing and then decrementing the
refcount does not cause problems"
And that made me think "anybody can go ahead and try get_page_unless_zero()".
If I am missing something here and this can indeed not happen (e.g.,
because PageOffline() pages are never mapped to user space), then I'll
happily remove this code.
>
>>>>> If you can let the page go then just drop the reference count. The page
>>>>> is isolated already by that time. If you cannot let it go for whatever
>>>>> reason you can fail the offlining.
>>>>
>>>> We do have one hack in current MM code, which is the memory isolation
>>>> notifier only used by CMM on PPC. It allows to "cheat" has_unmovable_pages()
>>>> to skip over unmovable pages. But quite frankly, I rather want to get rid of
>>>> that crap (something I am working on right now) than introduce new users.
>>>> This stuff is racy as hell and for CMM, if memory offlining fails, the
>>>> ballooned pages are suddenly part of the buddy. Fragile.
>>>
>>> Could you be more specific please?
>>
>> Let's take a look at how arch/powerpc/platforms/pseries/cmm.c handles it:
>>
>> cmm_memory_isolate_cb() -> cmm_count_pages(arg):
>> - Memory Isolation notifier callback
>> - Count how many pages in the range to be isolated are in the ballooon
>> - This makes has_unmovable_pages() succeed. Pages can be isolated.
>>
>> cmm_memory_cb -> cmm_mem_going_offline(arg):
>> - Memory notifier (online/offline)
>> - Release all pages in the range to the buddy
>>
>> If offlining fails, the pages are now in the buddy, no longer in the
>> balloon. MEM_CANCEL_ONLINE is too late, because the range is already
>> unisolated again and the pages might be in use.
>>
>> For CMM it might not be that bad, because it can actually "reloan" any
>> pages. In contrast, virtio-mem cannot simply go ahead and reuse random
>> memory in unplugged. Any access to these pages would be evil. Giving them
>> back to the buddy is dangerous.
>
> Thanks, I was not aware of that code. But from what I understood this is
> an outright bug in this code because cmm_mem_going_offline releases
> pages to the buddy allocator which is something that is not recoverable
> on a later failure.
>
Yes, and that should be gone if we switch to balloon compaction.
Let's recap what I suggest:
"PageOffline() pages that have a reference count of 0 will be treated
like free pages when offlining pages, allowing the containing memory
block to get offlined. In case a driver wants to revive such a page, it
has to synchronize against memory onlining/offlining (e.g., using memory
notifiers) while incrementing the reference count. Also, a driver that
relies in this feature is aware that re-onlining the memory will require
to re-set the pages PageOffline() - e.g., via the online_page_callback_t."
a) has_unmovable_pages() already skips over pages with a refcount of zero.
The code I add to not skip over these pages when !MEMORY_OFFLINE is a pure
optimization to fail early when trying to allocate from that range.
b) __test_page_isolated_in_pageblock() is modified to skip over
PageOffline() pages with a refcount of zero
c) __offline_isolated_pages() is modified to skip over
PageOffline() pages with a refcount of zero
d) __put_page() is modified to not return pages to the buddy in any
case as a safety net. We might be able to get rid of that.
What I think you suggest:
a) has_unmovable_pages() skips over all PageOffline() pages.
This results in a lot of false negatives when trying to offline. Might be ok.
b) The driver decrements the reference count of the PageOffline pages
in MEM_GOING_OFFLINE.
c) The driver increments the reference count of the PageOffline pages
in MEM_CANCEL_OFFLINE. One issue might be that the pages are no longer
isolated once we get that call. Might be ok.
d) How to make __test_page_isolated_in_pageblock() succeed?
Like I propose in this patch (PageOffline() + refcount == 0)?
e) How to make __offline_isolated_pages() succeed?
Like I propose in this patch (PageOffline() + refcount == 0)?
In summary, is what you suggest simply delaying setting the reference count to 0
in MEM_GOING_OFFLINE instead of right away when the driver unpluggs the pages?
What's the big benefit you see and I fail to see?
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists