Message-ID: <cf33f9c8-4ea9-a6ef-b445-83344729d4e4@linux.alibaba.com>
Date: Thu, 19 Mar 2020 10:22:46 -0700
From: Yang Shi <yang.shi@...ux.alibaba.com>
To: "Kirill A. Shutemov" <kirill@...temov.name>
Cc: kirill.shutemov@...ux.intel.com, hughd@...gle.com,
aarcange@...hat.com, akpm@...ux-foundation.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: khugepaged: fix potential page state corruption
On 3/19/20 9:57 AM, Yang Shi wrote:
>
>
> On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
>> On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
>>>
>>> On 3/18/20 5:55 PM, Yang Shi wrote:
>>>>
>>>> On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
>>>>> On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
>>>>>> When khugepaged collapses anonymous pages, the base pages are
>>>>>> freed via pagevec or free_page_and_swap_cache(). But the anonymous
>>>>>> page may be added back to the LRU first, which might result in the
>>>>>> race below:
>>>>>>
>>>>>>            CPU A                          CPU B
>>>>>> khugepaged:
>>>>>>   unlock page
>>>>>>   putback_lru_page
>>>>>>     add to lru
>>>>>>                                 page reclaim:
>>>>>>                                   isolate this page
>>>>>>                                   try_to_unmap
>>>>>>                                     page_remove_rmap <-- corrupt _mapcount
>>>>>>
>>>>>> It looks like nothing would prevent the pages from being isolated
>>>>>> by the reclaimer.
>>>>> Hm. Why should it?
>>>>>
>>>>> try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
>>>>> protected by ptl. And this particular _mapcount pin is not reachable
>>>>> for reclaim as it's not part of the usual page table tree. Basically
>>>>> try_to_unmap() will never succeed until we give up the _mapcount on
>>>>> the khugepaged side.
>>>> I don't quite get it. What does "not part of the usual page table
>>>> tree" mean?
>>>>
>>>> How about if try_to_unmap() acquires the ptl before khugepaged?
>> The page table we are dealing with was detached from the process' page
>> table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
>> pte.
>>
>> try_to_unmap() can only reach the ptl if split ptl is disabled
>> (mm->page_table_lock is used), but it still will not be able to reach
>> the pte.
>
> Aha, got it. Thanks for explaining. I definitely missed this point.
> Yes, pmdp_collapse_flush() would clear the pmd, then others won't see
> the page table.
>
> However, it looks like vmscan would not stop at try_to_unmap() at all;
> try_to_unmap() would just return true since pmd_present() should
> return false in pvmw. Then it would go all the way down to
> __remove_mapping(), but freezing the page would fail since
> try_to_unmap() doesn't actually drop the refcount from the pte map.
>
> It would not result in any critical problem AFAICT, but it is
> suboptimal and may cause some unnecessary I/O due to swap.
To correct myself: it would not reach __remove_mapping(), since the
refcount check in pageout() would fail.
>
>>
>>>>> I don't see the issue right away.
>>>>>
>>>>>> The other problem is that the page's active or unevictable flag
>>>>>> might still be set when freeing the page via
>>>>>> free_page_and_swap_cache().
>>>>> So what?
>>>> The flags may leak into the page free path, and then the kernel may
>>>> complain if DEBUG_VM is set.
>> Could you elaborate on what codepath you are talking about?
>
> __put_page ->
> __put_single_page ->
> free_unref_page ->
> free_unref_page_prepare ->
> free_pcp_prepare ->
> free_pages_prepare ->
> free_pages_check
>
> This check is only run when DEBUG_VM is enabled.
>
>>
>>>>>> The putback_lru_page() would not clear those two flags if the
>>>>>> pages are
>>>>>> released via pagevec, it sounds nothing prevents from isolating
>>>>>> active
>>> Sorry, this is a typo. If the page is freed via pagevec, the active
>>> and unevictable flags would get cleared by page_off_lru() before
>>> freeing.
>>>
>>> But, if the page is freed by free_page_and_swap_cache(), these two
>>> flags are not cleared. However, it seems this path is rarely hit;
>>> the pages are freed via pagevec in most cases.
>>>
>>>>>> or unevictable pages.
>>>>> Again, why should it? vmscan is equipped to deal with this.
>>>> I don't mean vmscan; I mean khugepaged may isolate active and
>>>> unevictable pages since it just simply walks the page table.
>> Why is it wrong? lru_cache_add() only complains if both flags are
>> set, which shouldn't happen.
>
> Nothing wrong about isolating active or unevictable pages. I just mean
> it seems possible the active or unevictable flag may still be there if
> the page is freed via the free_page_and_swap_cache() path.
>
>>
>>>>>> However, I didn't really run into these problems; they are just
>>>>>> theoretical, found by visual inspection.
>>>>>>
>>>>>> And it also seems unnecessary to add the pages back to the LRU
>>>>>> again since they are about to be freed when reaching this point.
>>>>>> So, clear the active and unevictable flags, unlock, and drop the
>>>>>> refcount taken at isolation instead of calling putback_lru_page(),
>>>>>> as the page cache collapse path does.
>>>>> Hm? But we do call putback_lru_page() on the way out. I do not
>>>>> follow.
>>>> It just calls putback_lru_page() on the error path, not the success
>>>> path. Putting pages back to the LRU on the error path definitely
>>>> makes sense. Here it is the success path.
>> I agree that putting the page on the LRU just before freeing it is
>> suboptimal, but I don't see it as a critical issue.
>
> Yes, given the code analysis above, I agree. If you think the patch is
> a fine micro-optimization, I would like to re-submit it with a
> rectified commit log. Thank you for your time.
>
>>
>>
>