Message-ID: <6fa6b7aa-731e-891c-3efb-a03d6a700efa@redhat.com>
Date: Tue, 19 Jul 2022 17:19:34 +0200
From: David Hildenbrand <david@...hat.com>
To: Michal Hocko <mhocko@...e.com>,
Charan Teja Kalla <quic_charante@...cinc.com>
Cc: akpm@...ux-foundation.org, pasha.tatashin@...een.com,
sjpark@...zon.de, sieberf@...zon.com, shakeelb@...gle.com,
dhowells@...hat.com, willy@...radead.org, vbabka@...e.cz,
minchan@...nel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
"iamjoonsoo.kim@....com" <iamjoonsoo.kim@....com>
Subject: Re: [PATCH] mm: fix use-after free of page_ext after race with
memory-offline
On 18.07.22 16:54, Michal Hocko wrote:
> On Mon 18-07-22 19:28:13, Charan Teja Kalla wrote:
>> Thanks Michal for the comments!!
>>
>> On 7/18/2022 5:20 PM, Michal Hocko wrote:
>>>> The above mentioned race is just one example __but the problem persists
>>>> in the other paths too involving page_ext->flags access(eg:
>>>> page_is_idle())__. Since offline waits till the last reference on the
>>>> page goes down i.e. any path that took the refcount on the page can make
>>>> the memory offline operation to wait. Eg: In the migrate_pages()
>>>> operation, we do take the extra refcount on the pages that are under
>>>> migration and then we do copy page_owner by accessing page_ext. For
>>>>
>>>> Fix those paths where offline races with page_ext access by maintaining
>>>> synchronization with rcu lock.
>>> Please be much more specific about the synchronization. How does RCU
>>> actually synchronize the offlining and access? Higher level description
>>> of all the actors would be very helpful not only for the review but also
>>> for future readers.
>>
>> I will improve the commit message about this synchronization change
>> using RCU's.
>
> Thanks! The most important part is how the exclusion is actually achieved
> because that is not really clear at first sight
>
> CPU1                                    CPU2
> lookup_page_ext(PageA)                  offlining
>                                           offline_page_ext
>                                             __free_page_ext(addrA)
>                                               get_entry(addrA)
>                                               ms->page_ext = NULL
>                                               synchronize_rcu()
>                                               free_page_ext
>                                                 free_pages_exact (now addrA is unusable)
>
>   rcu_read_lock()
>   entryA = get_entry(addrA)
>     base + page_ext_size * index # an address not invalidated by the freeing path
>   do_something(entryA)
>   rcu_read_unlock()
>
> CPU1 never checks ms->page_ext so it cannot bail out early when the
> thing is torn down. Or maybe I am missing something. I am not familiar
> with page_ext much.
>
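
To make the concern about the reader side concrete, here is a minimal userspace sketch (not kernel code) of a lookup that re-reads the section pointer inside the read-side critical section and bails out once it has been cleared. All model_* names, the rcu_depth counter and the simplified structs are assumptions made up for illustration; the real code indexes with page_ext_size rather than plain array arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel's definitions. */
struct page_ext { unsigned long flags; };
struct mem_section { struct page_ext *page_ext; };

static int rcu_depth;	/* models rcu_read_lock() nesting, nothing more */

static void model_rcu_read_lock(void)
{
	rcu_depth++;
}

static void model_rcu_read_unlock(void)
{
	assert(rcu_depth > 0);
	rcu_depth--;
}

/*
 * Look up the entry for 'index' while inside the read-side section.
 * Re-reading ms->page_ext here is what lets the reader notice teardown.
 */
static struct page_ext *model_lookup(struct mem_section *ms, size_t index)
{
	struct page_ext *base;

	assert(rcu_depth > 0);	/* caller must hold the read lock */
	base = ms->page_ext;	/* kernel code would use an RCU-aware load */
	if (!base)
		return NULL;	/* section already unpublished by offlining */
	return base + index;
}

/* Returns 0 on success, -1 if the section was already torn down. */
static int model_read_flags(struct mem_section *ms, size_t index,
			    unsigned long *out)
{
	struct page_ext *ext;
	int ret = -1;

	model_rcu_read_lock();
	ext = model_lookup(ms, index);
	if (ext) {
		*out = ext->flags;
		ret = 0;
	}
	model_rcu_read_unlock();
	return ret;
}
```

The point of the sketch is only that both the NULL check and the dereference happen between lock and unlock; a pointer computed before rcu_read_lock() gets no protection from the grace period.
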
>>> Also, more specifically
>>> [...]
>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
>>>> index 3dc715d..5ccd3ee 100644
>>>> --- a/mm/page_ext.c
>>>> +++ b/mm/page_ext.c
>>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn)
>>>> if (!ms || !ms->page_ext)
>>>> return;
>>>> base = get_entry(ms->page_ext, pfn);
>>>> - free_page_ext(base);
>>>> ms->page_ext = NULL;
>>>> + synchronize_rcu();
>>>> + free_page_ext(base);
>>>> }
>>> So you are imposing the RCU grace period for each page_ext! This can get
>>> really expensive. Have you tried to measure the effect?
>
> I was wrong here! This is for each memory section, which is not as
> terrible as every single page_ext. This can still be quite a lot of
> memory sections in a single memory block (e.g. on ppc memory sections
> are ridiculously small).
>
>> I didn't really measure the effect. Let me measure it and post these in V2.
>
> I think it would be much more optimal to split the operation into 2
> phases. Invalidate all the page_ext metadata then synchronize_rcu and
> only then free them all. I am not very familiar with page_ext so I am
> not sure this is easy to be done. Maybe page_ext = NULL can be done in
> the first stage.
>
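
The two-phase idea quoted above could look roughly like the following userspace sketch, assuming a simple array of sections. model_synchronize_rcu() is a stub that only counts calls, and MODEL_MAX_SECTIONS and model_offline_sections() are hypothetical names; none of this is taken from mm/page_ext.c.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_MAX_SECTIONS 64	/* arbitrary bound for this sketch */

/* Hypothetical userspace stand-ins; these are not the kernel's definitions. */
struct page_ext { unsigned long flags; };
struct mem_section { struct page_ext *page_ext; };

static unsigned int grace_periods;	/* simulated synchronize_rcu() calls */

static void model_synchronize_rcu(void)
{
	grace_periods++;	/* a real grace period would block here */
}

static void model_offline_sections(struct mem_section *secs, size_t n)
{
	struct page_ext *stale[MODEL_MAX_SECTIONS];
	size_t i;

	assert(n <= MODEL_MAX_SECTIONS);

	/* Phase 1: unpublish every section, remembering the old base. */
	for (i = 0; i < n; i++) {
		stale[i] = secs[i].page_ext;
		secs[i].page_ext = NULL;
	}

	/* One grace period covers all sections unpublished above. */
	model_synchronize_rcu();

	/* Phase 2: readers that saw an old base are done, so free them all. */
	for (i = 0; i < n; i++)
		free(stale[i]);
}
```

Compared with calling synchronize_rcu() once per section, the cost here is a single grace period per offline operation, at the price of holding on to the stale pointers a little longer.
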
>>> Is there any reason why page_ext is freed during offlining rather when
>>> it is hotremoved?
>>
>> This is something I am struggling to find an answer to. IMO, this is
>> even a wrong design, where I don't have page_ext but do have the page.
>> Moving the freeing of page_ext to the hotremove path actually solves the
>> problem, but somehow this idea wasn't liked[1]. Copying the excerpt here:
>
> yes, it certainly adds subtlety to the page_ext thingy. I do agree that
> even situation around struct page is not all that great wrt
> synchronization. We have pfn_to_online_page which even when racy doesn't
> give you a garbage because hotremove happens very rarely or so long
> after offlining that the race window is essentially impractically too
> long for any potential damage. We would have to change a lot to make it
> work "properly". I am not optimistic this is actually feasible.
>
>>> 3) Change the design where the page_ext is valid as long as the struct
>>> page is alive.
>>
>> :/ Doesn't spark joy."
>
> I would be wondering why. It should only take to move the callback to
> happen at hotremove. So it shouldn't be very involved of a change. I can
> imagine somebody would be relying on releasing resources when offlining
> memory but is that really the case?

Various reasons:

1) There was a discussion in the past to eventually also use RCU
protection for handling pfn_to_online_page(). So doing it cleanly here
is certainly an improvement.

2) I really dislike having to scatter section online checks all over the
place in page ext code. Once there is a difference between active vs.
stale page ext data, things get a bit messy and error-prone. This is
already ugly enough in our generic memmap handling code IMHO.

3) Having on-demand allocations, such as KASAN or page ext, from the
memory online notifier is at least currently cleaner, because we don't
have to handle each and every subsystem that hooks into that during the
core memory hotadd/remove phase, which primarily only sets up the
vmemmap, direct map and memory block devices.

Personally, I think what we have in this patch is quite nice and clean.
But I won't object if it can be similarly done in a clean way from
hot(un)plug code.

That is, I ack this patch but don't object to similarly clean approaches.

--
Thanks,
David / dhildenb