linux-kernel - Re: [PATCH/RFC] mm: do not drop unused pages when userfaultd is running


Message-Id: <eaa540cb-2249-72d1-d4b8-a54c2869d7a3@de.ibm.com>
Date:   Thu, 28 Jun 2018 16:51:14 +0200
From:   Christian Borntraeger <borntraeger@...ibm.com>
To:     David Hildenbrand <david@...hat.com>, linux-mm@...ck.org,
        linux-s390@...r.kernel.org
Cc:     kvm@...r.kernel.org, Janosch Frank <frankja@...ux.ibm.com>,
        Cornelia Huck <cohuck@...hat.com>,
        linux-kernel@...r.kernel.org,
        Martin Schwidefsky <schwidefsky@...ibm.com>,
        Andrea Arcangeli <aarcange@...hat.com>
Subject: Re: [PATCH/RFC] mm: do not drop unused pages when userfaultd is
 running



On 06/28/2018 04:49 PM, David Hildenbrand wrote:
> On 28.06.2018 16:39, Christian Borntraeger wrote:
>>
>>
>> On 06/28/2018 03:18 PM, David Hildenbrand wrote:
>>> On 28.06.2018 14:39, Christian Borntraeger wrote:
>>>> KVM guests on s390 can notify the host of unused pages. This can result
>>>> in pte_unused() returning true for KVM guest memory.
>>>>
>>>> If a page is unused (checked with pte_unused) we might drop this page
>>>> instead of paging it out. This can have side effects on userfaultfd when
>>>> the page in question was already migrated:
>>>>
>>>> The next access of that page will trigger a fault and a user fault
>>>> instead of faulting in a new and empty zero page. As QEMU does not
>>>> expect a userfault on an already migrated page, this migration will fail.
>>>>
>>>> The most straightforward solution is to ignore the pte_unused hint if a
>>>> userfault context is active for this VMA.
>>>>
>>>> Cc: Martin Schwidefsky <schwidefsky@...ibm.com>
>>>> Cc: Andrea Arcangeli <aarcange@...hat.com>
>>>> Cc: stable@...r.kernel.org
>>>> Signed-off-by: Christian Borntraeger <borntraeger@...ibm.com>
>>>> ---
>>>>  mm/rmap.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 6db729dc4c50..3f3a72aa99f2 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1481,7 +1481,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>  				set_pte_at(mm, address, pvmw.pte, pteval);
>>>>  			}
>>>>  
>>>> -		} else if (pte_unused(pteval)) {
>>>> +		} else if (pte_unused(pteval) && !vma->vm_userfaultfd_ctx.ctx) {
>>>>  			/*
>>>>  			 * The guest indicated that the page content is of no
>>>>  			 * interest anymore. Simply discard the pte, vmscan
>>>>
>>>
>>> To understand the implications better:
>>>
>>> This is like a MADV_DONTNEED from user space while a userfaultfd
>>> notifier is registered for this vma range.
>>>
>>> While we can block such calls in QEMU ("we registered it, we know it
>>> best"), we can't do the same in the kernel.
>>>
>>> These "internal MADV_DONTNEED" operations can actually trigger "deferred",
>>> so e.g. if the pte_unused() was set before userfaultfd has been registered,
>>> we can still get the same result, right?
>> Not sure I understand your last sentence.
> 
> Rephrased: Trying to stop somebody from setting pte_unused will
> not work, as we might get a userfaultfd registration at some point and
> find a previously set pte_unused afterwards.

Yes, exactly. The unused value can be set before the migration.


> 
>> This place here is called on unmap (e.g. when the host tries to page out).
>> The value was transferred before (and always before) during the page table invalidation.
>> So pte_unused was always set before. This is the place where we decide if we page
>> out (and establish a swap pte) or just drop this page table entry. So if
>> no userfaultfd is registered at that point in time we are good.
> 
> This certainly applies to ordinary userfaultfd we have right now.
> userfaultfd WP (write-protect) or other features to come might be
> different, but it does not seem to do any harm in case we page out
> instead of dropping it. This way we are on the safe side.

Yes.
> 
> In other words: I think this is the right approach.
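
The decision discussed in this thread can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions, not kernel code: `struct vma_model`, `struct uffd_ctx`, `enum unmap_action` and `decide_unmap()` are hypothetical names made up for illustration, and the real `try_to_unmap_one()` has further branches that are left out here. It only mirrors the patched condition, `pte_unused(pteval) && !vma->vm_userfaultfd_ctx.ctx`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a userfaultfd context; not the kernel type. */
struct uffd_ctx {
    int placeholder;
};

/* Hypothetical, heavily reduced model of a VMA. */
struct vma_model {
    struct uffd_ctx *uffd_ctx;  /* models vma->vm_userfaultfd_ctx.ctx */
};

enum unmap_action {
    UNMAP_DROP_PTE,  /* discard the entry; the next access faults in a zero page */
    UNMAP_PAGE_OUT   /* keep the content, e.g. by establishing a swap pte */
};

/*
 * Mirror of the patched check: the "unused" hint is honoured only
 * when no userfaultfd context is attached to the VMA. In every other
 * case the page is treated as if the hint had never been set.
 */
static enum unmap_action decide_unmap(bool pte_is_unused,
                                      const struct vma_model *vma)
{
    if (pte_is_unused && vma->uffd_ctx == NULL)
        return UNMAP_DROP_PTE;
    return UNMAP_PAGE_OUT;
}
```

Because the check happens at unmap time, it does not matter when the unused hint was set: a hint recorded before the userfaultfd registration is still ignored once a context is attached, which is the point made in the exchange above.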