Message-ID: <bcc5cbb6-4ce6-4c01-8b5b-f6e01b306b2d@redhat.com>
Date: Tue, 27 May 2025 11:00:03 +0200
From: David Hildenbrand <david@...hat.com>
To: Barry Song <21cnbao@...il.com>
Cc: aarcange@...hat.com, akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, lokeshgidra@...gle.com,
peterx@...hat.com, ryncsn@...il.com, surenb@...gle.com
Subject: Re: [BUG]userfaultfd_move fails to move a folio when swap-in occurs
concurrently with swap-out
On 27.05.25 10:37, Barry Song wrote:
> On Tue, May 27, 2025 at 4:17 PM Barry Song <21cnbao@...il.com> wrote:
>>
>> On Tue, May 27, 2025 at 12:39 AM David Hildenbrand <david@...hat.com> wrote:
>>>
>>> On 23.05.25 01:23, Barry Song wrote:
>>>> Hi All,
>>>
>>> Hi!
>>>
>>>>
>>>> I'm encountering another bug that can be easily reproduced using the small
>>>> program below[1], which performs swap-out and swap-in in parallel.
>>>>
>>>> The issue occurs when a folio is being swapped out while it is accessed
>>>> concurrently. In this case, do_swap_page() handles the access. However,
>>>> because the folio is under writeback, do_swap_page() completely removes
>>>> its exclusive attribute.
>>>>
>>>> do_swap_page:
>>>> } else if (exclusive && folio_test_writeback(folio) &&
>>>> data_race(si->flags & SWP_STABLE_WRITES)) {
>>>> ...
>>>> exclusive = false;
>>>>
>>>> As a result, userfaultfd_move() will return -EBUSY, even though the
>>>> folio is not shared and is in fact exclusively owned.
>>>>
>>>> folio = vm_normal_folio(src_vma, src_addr,
>>>> orig_src_pte);
>>>> if (!folio || !PageAnonExclusive(&folio->page)) {
>>>> spin_unlock(src_ptl);
>>>> +			pr_err("%s %d folio:%lx exclusive:%d swapcache:%d\n",
>>>> +				__func__, __LINE__, folio,
>>>> +				PageAnonExclusive(&folio->page),
>>>> +				folio_test_swapcache(folio));
>>>> err = -EBUSY;
>>>> goto out;
>>>> }
>>>>
>>>> I understand that shared folios should not be moved. However, in this
>>>> case, the folio is not shared, yet its exclusive flag is not set.
>>>>
>>>> Therefore, I believe PageAnonExclusive is not a reliable indicator of
>>>> whether a folio is truly exclusive to a process.
>>>
>>> It is. The flag *not* being set is not a reliable indicator whether it
>>> is really shared. ;)
>>>
>>> The reason why we have this PAE workaround (dropping the flag) in place
>>> is because the page must not be written to (SWP_STABLE_WRITES). CoW
>>> reuse is not possible.
>>>
>>> uffd moving that page -- and in that same process setting it writable,
>>> see move_present_pte()->pte_mkwrite() -- would be very bad.
>>
>> An alternative approach is to make the folio writable only when we are
>> reasonably certain it is exclusive; otherwise, it remains read-only. If the
>> destination is later written to and the folio has become exclusive, it can
>> be reused directly. If not, a copy-on-write will occur on the destination
>> address, transparently to userspace. This avoids Lokesh’s userspace-based
>> strategy, which requires forcing a write to the source address.
>
> Conceptually, I mean something like this:
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..70eaabf4f1a3 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1047,7 +1047,8 @@ static int move_present_pte(struct mm_struct *mm,
> }
> if (folio_test_large(src_folio) ||
> folio_maybe_dma_pinned(src_folio) ||
> - !PageAnonExclusive(&src_folio->page)) {
> + (!PageAnonExclusive(&src_folio->page) &&
> + folio_mapcount(src_folio) != 1)) {
> err = -EBUSY;
> goto out;
> }
> @@ -1070,7 +1071,8 @@ static int move_present_pte(struct mm_struct *mm,
> #endif
> if (pte_dirty(orig_src_pte))
> orig_dst_pte = pte_mkdirty(orig_dst_pte);
> - orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
> + if (PageAnonExclusive(&src_folio->page))
> + orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
>
> set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> out:
> @@ -1268,7 +1270,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> }
>
> folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> - if (!folio || !PageAnonExclusive(&folio->page)) {
> + if (!folio || (!PageAnonExclusive(&folio->page) &&
> + folio_mapcount(folio) != 1)) {
> spin_unlock(src_ptl);
> err = -EBUSY;
> goto out;
>
> I'm not trying to push this approach—unless Lokesh clearly sees that it
> could reduce userspace noise. I'm mainly just curious how we might make
> the fixup transparent to userspace. :-)
And that reveals the exact problem: it's all *very* complicated. :)
... and dangerous when we use the mapcount without having a complete
understanding of how it all works.
What we would have to do for a small folio is (rough sketch below)
1) Take the folio lock
2) Make sure there is only this present page table mapping:
   folio_mapcount(folio) == 1, or better
   !folio_maybe_mapped_shared(folio)
3) Make sure that there are no swap references:
   if in the swapcache, check the actual swapcount
4) Make sure it is not a KSM folio
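Something like this very rough, untested sketch (the helper name is made
up, error handling and the retry logic are missing, and whether
swp_swapcount() is the right swap-reference check is exactly the subtle
part):

/*
 * Rough sketch only: is this small folio exclusively ours, even though
 * PageAnonExclusive() is clear? Caller holds the PTL of the mapping pte.
 */
static bool small_folio_is_exclusive(struct folio *folio)
{
	bool exclusive = false;

	/* 4) no KSM, and small folios only */
	if (folio_test_large(folio) || folio_test_ksm(folio))
		return false;

	/* 1) take the folio lock (no sleeping under the PTL) */
	if (!folio_trylock(folio))
		return false;

	/* 2) only this present page table mapping */
	if (folio_maybe_mapped_shared(folio))
		goto unlock;

	/* 3) no swap references besides the swapcache entry itself */
	if (folio_test_swapcache(folio) && swp_swapcount(folio->swap))
		goto unlock;

	exclusive = true;
unlock:
	folio_unlock(folio);
	return exclusive;
}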
THPs are way, way, way more complicated to get right that way. Likely,
the scenario described above cannot happen with a PMD-mapped THP for now
at least (we don't have PMD swap entries).
Of course, we'd then also have to handle the case when we have a swap
pte where the marker is not set (e.g., because of swapout after the
described swapin where we dropped the marker).
What could be easier is triggering a FAULT_FLAG_UNSHARE fault. It's
arguably less optimal in case the core decides to swap in / CoW, but
it leaves the magic to get all this right to the core -- and mimics the
approach Lokesh uses.
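For illustration only, in move_pages_pte() that could look roughly like
this (untested sketch; "retry" is a made-up label, and all other locks
would have to be dropped before calling into the fault path):

	if (!folio || !PageAnonExclusive(&folio->page)) {
		spin_unlock(src_ptl);
		/*
		 * Let the core resolve the "maybe shared" state via an
		 * unshare fault instead of failing with -EBUSY.
		 */
		if (handle_mm_fault(src_vma, src_addr, FAULT_FLAG_UNSHARE,
				    NULL) & VM_FAULT_ERROR) {
			err = -EBUSY;
			goto out;
		}
		/* made-up label: redo the whole pte lookup */
		goto retry;
	}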
But then, maybe letting userspace just do a uffdio_copy would be even
faster (only a single TLB shootdown?).
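I.e., roughly this on the userspace side (sketch, error handling trimmed;
assumes uffd is the registered userfaultfd and the ranges are
page-aligned):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Sketch: try UFFDIO_MOVE first, fall back to UFFDIO_COPY on EBUSY. */
static int move_or_copy(int uffd, unsigned long dst, unsigned long src,
			unsigned long len)
{
	struct uffdio_move mv = { .dst = dst, .src = src, .len = len };

	if (ioctl(uffd, UFFDIO_MOVE, &mv) == 0)
		return 0;
	if (errno != EBUSY)
		return -errno;

	/* Source not exclusive (or pinned): copy the contents instead. */
	struct uffdio_copy cp = { .dst = dst, .src = src, .len = len };

	return ioctl(uffd, UFFDIO_COPY, &cp) == 0 ? 0 : -errno;
}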
I am also skeptical of calling this a BUG here. It's documented to behave
exactly like that [1]:
EBUSY
       The pages in the source virtual memory range are either
       pinned or not exclusive to the process. The kernel might
       only perform lightweight checks for detecting whether the
       pages are exclusive. To make the operation more likely to
       succeed, KSM should be disabled, fork() should be avoided
       or MADV_DONTFORK should be configured for the source
       virtual memory area before fork().
Note the "lightweight" and "more likely to succeed".
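(On the userspace side, the documented advice amounts to something like
the following, with src_area/src_len standing in for the real source
range:)

#include <stdio.h>
#include <sys/mman.h>

static void protect_uffd_src(void *src_area, size_t src_len)
{
	/* Keep the uffd source range out of any child created by fork(). */
	if (madvise(src_area, src_len, MADV_DONTFORK))
		perror("madvise(MADV_DONTFORK)");
}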
[1] https://lore.kernel.org/lkml/20231206103702.3873743-3-surenb@google.com/
--
Cheers,
David / dhildenb