Message-ID: <e716c8c6-898e-5199-019c-161ea3ec06c3@linux.alibaba.com>
Date:   Thu, 19 Mar 2020 09:57:47 -0700
From:   Yang Shi <yang.shi@...ux.alibaba.com>
To:     "Kirill A. Shutemov" <kirill@...temov.name>
Cc:     kirill.shutemov@...ux.intel.com, hughd@...gle.com,
        aarcange@...hat.com, akpm@...ux-foundation.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: khugepaged: fix potential page state corruption



On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
> On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
>>
>> On 3/18/20 5:55 PM, Yang Shi wrote:
>>>
>>> On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
>>>> On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
>>>>> When khugepaged collapses anonymous pages, the base pages would be freed
>>>>> via pagevec or free_page_and_swap_cache().  But, the anonymous page may
>>>>> be added back to the LRU, and then it might result in the race below:
>>>>>
>>>>>      CPU A                CPU B
>>>>> khugepaged:
>>>>>     unlock page
>>>>>     putback_lru_page
>>>>>       add to lru
>>>>>                  page reclaim:
>>>>>                    isolate this page
>>>>>                    try_to_unmap
>>>>>     page_remove_rmap <-- corrupt _mapcount
>>>>>
>>>>> It looks like nothing would prevent the pages from being isolated by the
>>>>> reclaimer.
>>>> Hm. Why should it?
>>>>
>>>> try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
>>>> protected by ptl. And this particular _mapcount pin is not reachable for
>>>> reclaim, as it's not part of the usual page table tree. Basically
>>>> try_to_unmap() will never succeed until we give up the _mapcount on
>>>> the khugepaged side.
>>> I don't quite get it. What does "not part of the usual page table tree"
>>> mean?
>>>
>>> How about try_to_unmap() acquiring the ptl before khugepaged?
> The page table we are dealing with was detached from the process' page
> table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
> pte.
>
> try_to_unmap() can only reach the ptl if split ptl is disabled
> (mm->page_table_lock is used), but it still will not be able to reach pte.

Aha, got it. Thanks for explaining. I definitely missed this point. Yes, 
pmdp_collapse_flush() would clear the pmd, so others won't see the page 
table.
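
For anyone following along, here is a condensed sketch of the relevant 
ordering in collapse_huge_page() (mm/khugepaged.c); locking, error handling 
and most arguments are omitted, and the helper name below is just for 
illustration:

    /* Condensed sketch only, not the actual kernel code. */
    static void collapse_huge_page_sketch(struct vm_area_struct *vma,
                                          unsigned long address, pmd_t *pmd)
    {
        pmd_t _pmd;

        /*
         * Detach the page table from the process' page table tree.
         * After this point rmap walkers such as try_to_unmap() cannot
         * reach the ptes any more, so they cannot drop the _mapcount
         * pins khugepaged still holds.
         */
        _pmd = pmdp_collapse_flush(vma, address, pmd);

        /* ... isolate and copy the base pages into the new huge page ... */

        /*
         * Only after the copy are the base pages unmapped and released,
         * and the detached page table (_pmd) freed.
         */
    }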

However, it looks like vmscan would not stop at try_to_unmap() at all; 
try_to_unmap() would just return true since pmd_present() should return 
false in pvmw. Then it would go all the way down to __remove_mapping(), 
but freezing the page's refcount would fail since try_to_unmap() didn't 
actually drop the reference held by the pte mapping.
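
That is, reclaim would back off at the refcount check in __remove_mapping(); 
roughly (simplified from mm/vmscan.c, most of the function omitted):

    /*
     * Simplified from __remove_mapping(): the page can only be freed if
     * its refcount is exactly what is accounted for here.  The extra
     * reference still held for the (now unreachable) pte mapping makes
     * page_ref_freeze() fail, so reclaim gives up on this page.
     */
    int refcount = 1 + compound_nr(page);

    if (!page_ref_freeze(page, refcount))
        goto cannot_free;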

It would not result in any critical problem AFAICT, but it is suboptimal 
and may cause some unnecessary I/O due to swap.

>
>>>> I don't see the issue right away.
>>>>
>>>>> The other problem is the page's active or unevictable flag might be
>>>>> still set when freeing the page via free_page_and_swap_cache().
>>>> So what?
>>> The flags may leak to page free path then kernel may complain if
>>> DEBUG_VM is set.
> Could you elaborate on what codepath you are talking about?

__put_page ->
     __put_single_page ->
         free_unref_page ->
             free_unref_page_prepare ->
                 free_pcp_prepare ->
                     free_pages_prepare ->
                         free_pages_check

This check is only run when DEBUG_VM is enabled.
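
The complaint would come from the PAGE_FLAGS_CHECK_AT_FREE test, roughly 
(simplified from the free path in mm/page_alloc.c):

    /*
     * Simplified: with DEBUG_VM the free path verifies that none of the
     * PAGE_FLAGS_CHECK_AT_FREE bits (which include PG_active and
     * PG_unevictable) is still set, and reports a bad page otherwise.
     */
    if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_FREE))
        bad_page(page, "PAGE_FLAGS_CHECK_AT_FREE flag(s) set",
                 PAGE_FLAGS_CHECK_AT_FREE);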

>
>>>>> The putback_lru_page() would not clear those two flags if the pages are
>>>>> released via pagevec, it sounds nothing prevents from isolating active
>> Sorry, this is a typo. If the page is freed via pagevec, the active and
>> unevictable flags would get cleared by page_off_lru() before freeing.
>>
>> But, if the page is freed by free_page_and_swap_cache(), these two flags
>> are not cleared. However, it seems this path is rarely hit; the pages are
>> freed via pagevec in most cases.
>>
>>>>> or unevictable pages.
>>>> Again, why should it? vmscan is equipped to deal with this.
>>> I don't mean vmscan, I mean khugepaged may isolate active and
>>> unevictable pages since it just simply walks page table.
> Why is it wrong? lru_cache_add() only complains if both flags are set, which
> shouldn't happen.

Nothing is wrong with isolating active or unevictable pages. I just mean it 
seems possible that the active or unevictable flag may still be set if the 
page is freed via the free_page_and_swap_cache() path.
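
For comparison, the pagevec path clears those bits in page_off_lru() when 
the page is taken off the LRU right before being freed (condensed from 
include/linux/mm_inline.h):

    /* Condensed from page_off_lru(). */
    enum lru_list lru;

    if (PageUnevictable(page)) {
        __ClearPageUnevictable(page);
        lru = LRU_UNEVICTABLE;
    } else {
        lru = page_lru_base_type(page);
        if (PageActive(page)) {
            __ClearPageActive(page);
            lru += LRU_ACTIVE;
        }
    }

The free_page_and_swap_cache() path doesn't go through this, which is where 
the flags could leak to the free path.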

>
>>>>> However, I didn't really run into these problems; they are just
>>>>> theoretical, found by visual inspection.
>>>>>
>>>>> And, it also seems unnecessary to have the pages added back to the LRU
>>>>> again since they are about to be freed when reaching this point.  So,
>>>>> clear the active and unevictable flags, unlock, and drop the refcount
>>>>> from the isolation instead of calling putback_lru_page(), as the page
>>>>> cache collapse does.
>>>> Hm? But we do call putback_lru_page() on the way out. I do not follow.
>>> It just calls putback_lru_page() on the error path, not the success path.
>>> Putting pages back on the LRU on the error path definitely makes sense.
>>> Here it is the success path.
> I agree that putting the page on the LRU just before freeing it is
> suboptimal, but I don't see it as a critical issue.

Yes, given the code analysis above, I agree. If you think the patch is a 
fine micro-optimization, I would like to re-submit it with a rectified 
commit log. Thank you for your time.
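
For context, the success-path handling described above would look roughly 
like the sketch below (a paraphrase of what is discussed in this thread, not 
the literal patch; the helper name is illustrative and the NR_ISOLATED_* 
accounting is omitted):

    /* Sketch only: release a collapsed base page without putback_lru_page(). */
    static void release_collapsed_page_sketch(struct page *page)
    {
        ClearPageActive(page);
        ClearPageUnevictable(page);
        unlock_page(page);
        /* Drop the reference taken when the page was isolated from the LRU. */
        put_page(page);
    }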

>
>
