linux-kernel - Re: [PATCH v2 01/15] mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <689c199f-9954-524b-7a2d-40f186e87b34@oracle.com>
Date:   Wed, 30 Mar 2022 11:04:24 -0600
From:   Khalid Aziz <khalid.aziz@...cle.com>
To:     David Hildenbrand <david@...hat.com>, linux-kernel@...r.kernel.org
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Hugh Dickins <hughd@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        David Rientjes <rientjes@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        John Hubbard <jhubbard@...dia.com>,
        Jason Gunthorpe <jgg@...dia.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Yang Shi <shy828301@...il.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        Michal Hocko <mhocko@...nel.org>,
        Nadav Amit <namit@...are.com>, Rik van Riel <riel@...riel.com>,
        Roman Gushchin <guro@...com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Peter Xu <peterx@...hat.com>,
        Donald Dutile <ddutile@...hat.com>,
        Christoph Hellwig <hch@....de>,
        Oleg Nesterov <oleg@...hat.com>, Jan Kara <jack@...e.cz>,
        Liang Zhang <zhangliang5@...wei.com>,
        Pedro Gomes <pedrodemargomes@...il.com>,
        Oded Gabbay <oded.gabbay@...il.com>, linux-mm@...ck.org
Subject: Re: [PATCH v2 01/15] mm/rmap: fix missing swap_free() in
 try_to_unmap() after arch_unmap_one() failed

On 3/29/22 14:55, David Hildenbrand wrote:
> On 29.03.22 22:42, Khalid Aziz wrote:
>> On 3/29/22 07:59, David Hildenbrand wrote:
>>> On 15.03.22 11:47, David Hildenbrand wrote:
>>>> In case arch_unmap_one() fails, we already did a swap_duplicate(). let's
>>>> undo that properly via swap_free().
>>>>
>>>> Fixes: ca827d55ebaa ("mm, swap: Add infrastructure for saving page metadata on swap")
>>>> Reviewed-by: Khalid Aziz <khalid.aziz@...cle.com>
>>>> Signed-off-by: David Hildenbrand <david@...hat.com>
>>>> ---
>>>>    mm/rmap.c | 1 +
>>>>    1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 6a1e8c7f6213..f825aeef61ca 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1625,6 +1625,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>    				break;
>>>>    			}
>>>>    			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
>>>> +				swap_free(entry);
>>>>    				set_pte_at(mm, address, pvmw.pte, pteval);
>>>>    				ret = false;
>>>>    				page_vma_mapped_walk_done(&pvmw);
>>>
>>> Hi Khalid,
>>>
>>> I'm a bit confused about the semantics if arch_unmap_one(), I hope you can clarify.
>>>
>>>
>>> See patch #11 in this series, were we can fail unmapping after arch_unmap_one() succeeded. E.g.,
>>>
>>> @@ -1623,6 +1634,24 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>    				page_vma_mapped_walk_done(&pvmw);
>>>    				break;
>>>    			}
>>> +			if (anon_exclusive &&
>>> +			    page_try_share_anon_rmap(subpage)) {
>>> +				swap_free(entry);
>>> +				set_pte_at(mm, address, pvmw.pte, pteval);
>>> +				ret = false;
>>> +				page_vma_mapped_walk_done(&pvmw);
>>> +				break;
>>> +			}
>>> +			/*
>>> +			 * Note: We don't remember yet if the page was mapped
>>> +			 * exclusively in the swap entry, so swapin code has
>>> +			 * to re-determine that manually and might detect the
>>> +			 * page as possibly shared, for example, if there are
>>> +			 * other references on the page or if the page is under
>>> +			 * writeback. We made sure that there are no GUP pins
>>> +			 * on the page that would rely on it, so for GUP pins
>>> +			 * this is fine.
>>> +			 */
>>>    			if (list_empty(&mm->mmlist)) {
>>>    				spin_lock(&mmlist_lock);
>>>    				if (list_empty(&mm->mmlist))
>>>
>>>
>>> For now, I was under the impression that we don't have to undo anything after
>>> arch_unmap_one() succeeded, because we seem to not do anything for two
>>> cases below. But looking into arch_unmap_one() and how it allocates stuff I do
>>> wonder what we would actually want to do here -- I'd assume we'd want to
>>> trigger the del_tag_store() somehow?
>>
>> Hi David,
>>
> 
> Hi,
> 
> thanks for your fast reply.
> 
>> Currently once arch_unmap_one() completes successfully, we are at the point of no return for this pte. It will be
>> replaced by swap pte soon thereafter. Patch 11 adds another case where we may return without replacing current pte with
>> swap pte. For now could you resolve this by moving the above code block in patch 11 to before the call to
>> arch_unmap_one(). That still leaves open the issue having the flexibility of undoing what arch_unmap_one() does for some
>> other reason in future. That will require coming up with a properly architected way to do it.
> 
> I really want clearing PG_anon_exclusive be the last action, without
> eventually having to set it again and overcomplicating
> PG_anon_exclusive/rmap handling. Ideally, we'd have a "arch_remap_one()"
> that just reverts what arch_unmap_one() did.

Hi David,

arch_remap_one() sounds reasonable. Would you like to add that in your patch series, or would you like me to create a 
separate patch for adding this on top of your patch series?

> 
>>
>>>
>>> arch_unmap_one() calls adi_save_tags(), which allocates memory.
>>> adi_restore_tags()->del_tag_store() reverts that operation and ends up
>>> freeing memory conditionally; However, it's only
>>> called from arch_do_swap_page().
>>>
>>>
>>> Here is why I have to scratch my head:
>>>
>>> a) arch_do_swap_page() is only called from do_swap_page(). We don't do anything similar
>>> for mm/swapfile.c:unuse_pte(), aren't we missing something?
>>
>> My understanding of this code path maybe flawed, so do correct me if this does not sound right. unused_pte() is called
>> upon user turning off swap on a device. unused_pte() is called by unused_pte_range() which swaps the page back in from
>> swap device before calling unuse_pte(). Once the page is read back in from swap, ultimately access to the va for the
>> page will result in call to __handle_mm_fault() which in turn will call handle_pte_fault() to insert a new pte for this
>> mapping and handle_pte_fault() will call arch_do_swap_page() which will restore the tags.
> 
> unuse_pte() will replace a swap pte directly by a proper, present pte,
> just like do_swap_page() would. You won't end up in do_swap_page()
> anymore and arch_do_swap_page() won't be called, because there is no
> swap PTE anymore.
> 
>>
>>>
>>> b) try_to_migrate_one() does the arch_unmap_one(), but who will do the
>>> restore+free after migration succeeded or failed, aren't we missing something?
>>
>> try_to_migrate_one() replaces the current pte with a migration pte after calling arch_unmap_one(). This causes
>> __handle_mm_fault() to be called when a reference to the va covered by migration pte is made. This will in turn finally
>> result in a call to arch_do_swap_page() which restores the tags.
> 
> Migration PTEs are restore via mm/migrate.c:remove_migration_ptes().
> arch_do_swap_page() won't be called.
> 
> What you mention is if someone accesses the migration PTE while
> migration is active and the migration PTEs have not been removed yet.
> While we'll end up in do_swap_page(), we'll do a migration_entry_wait(),
> followed by an effective immediate "return 0;". arch_do_swap_page()
> won't get called.
> 
> 
> So in essence, I think this doesn't work as expected yet. In the best
> case we don't immediately free memory. In the worst case we lose the tags.
> 

I see what you mean. I can work on fixing these issues up.

Thanks,
Khalid