Message-ID: <bcc5cbb6-4ce6-4c01-8b5b-f6e01b306b2d@redhat.com>
Date: Tue, 27 May 2025 11:00:03 +0200
From: David Hildenbrand <david@...hat.com>
To: Barry Song <21cnbao@...il.com>
Cc: aarcange@...hat.com, akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, lokeshgidra@...gle.com,
peterx@...hat.com, ryncsn@...il.com, surenb@...gle.com
Subject: Re: [BUG]userfaultfd_move fails to move a folio when swap-in occurs
concurrently with swap-out
On 27.05.25 10:37, Barry Song wrote:
> On Tue, May 27, 2025 at 4:17 PM Barry Song <21cnbao@...il.com> wrote:
>>
>> On Tue, May 27, 2025 at 12:39 AM David Hildenbrand <david@...hat.com> wrote:
>>>
>>> On 23.05.25 01:23, Barry Song wrote:
>>>> Hi All,
>>>
>>> Hi!
>>>
>>>>
>>>> I'm encountering another bug that can be easily reproduced using the small
>>>> program below[1], which performs swap-out and swap-in in parallel.
>>>>
>>>> The issue occurs when a folio is being swapped out while it is accessed
>>>> concurrently. In this case, do_swap_page() handles the access. However,
>>>> because the folio is under writeback, do_swap_page() completely removes
>>>> its exclusive attribute.
>>>>
>>>> do_swap_page:
>>>> } else if (exclusive && folio_test_writeback(folio) &&
>>>> data_race(si->flags & SWP_STABLE_WRITES)) {
>>>> ...
>>>> exclusive = false;
>>>>
>>>> As a result, userfaultfd_move() will return -EBUSY, even though the
>>>> folio is not shared and is in fact exclusively owned.
>>>>
>>>> folio = vm_normal_folio(src_vma, src_addr,
>>>> orig_src_pte);
>>>> if (!folio || !PageAnonExclusive(&folio->page)) {
>>>> spin_unlock(src_ptl);
>>>> +			pr_err("%s %d folio:%lx exclusive:%d swapcache:%d\n",
>>>> +				__func__, __LINE__, folio,
>>>> +				PageAnonExclusive(&folio->page),
>>>> +				folio_test_swapcache(folio));
>>>> err = -EBUSY;
>>>> goto out;
>>>> }
>>>>
>>>> I understand that shared folios should not be moved. However, in this
>>>> case, the folio is not shared, yet its exclusive flag is not set.
>>>>
>>>> Therefore, I believe PageAnonExclusive is not a reliable indicator of
>>>> whether a folio is truly exclusive to a process.
>>>
>>> It is. The flag *not* being set is not a reliable indicator whether it
>>> is really shared. ;)
>>>
>>> The reason why we have this PAE workaround (dropping the flag) in place
>>> is because the page must not be written to (SWP_STABLE_WRITES). CoW
>>> reuse is not possible.
>>>
>>> uffd moving that page -- and in that same process setting it writable,
>>> see move_present_pte()->pte_mkwrite() -- would be very bad.
>>
>> An alternative approach is to make the folio writable only when we are
>> reasonably certain it is exclusive; otherwise, it remains read-only. If the
>> destination is later written to and the folio has become exclusive, it can
>> be reused directly. If not, a copy-on-write will occur on the destination
>> address, transparently to userspace. This avoids Lokesh’s userspace-based
>> strategy, which requires forcing a write to the source address.
>
> Conceptually, I mean something like this:
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..70eaabf4f1a3 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1047,7 +1047,8 @@ static int move_present_pte(struct mm_struct *mm,
> }
> if (folio_test_large(src_folio) ||
> folio_maybe_dma_pinned(src_folio) ||
> - !PageAnonExclusive(&src_folio->page)) {
> + (!PageAnonExclusive(&src_folio->page) &&
> + folio_mapcount(src_folio) != 1)) {
> err = -EBUSY;
> goto out;
> }
> @@ -1070,7 +1071,8 @@ static int move_present_pte(struct mm_struct *mm,
> #endif
> if (pte_dirty(orig_src_pte))
> orig_dst_pte = pte_mkdirty(orig_dst_pte);
> - orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
> + if (PageAnonExclusive(&src_folio->page))
> + orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
>
> set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> out:
> @@ -1268,7 +1270,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> }
>
> folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> - if (!folio || !PageAnonExclusive(&folio->page)) {
> + if (!folio || (!PageAnonExclusive(&folio->page) &&
> + folio_mapcount(folio) != 1)) {
> spin_unlock(src_ptl);
> err = -EBUSY;
> goto out;
>
> I'm not trying to push this approach—unless Lokesh clearly sees that it
> could reduce userspace noise. I'm mainly just curious how we might make
> the fixup transparent to userspace. :-)
And that reveals the exact problem: it's all *very* complicated. :)
... and dangerous when we use the mapcount without having a complete
understanding of how it all works.
What we would have to do for a small folio is (rough sketch below)
1) Take the folio lock
2) Make sure there is only this present page table mapping:
   folio_mapcount(folio) == 1, or better
   !folio_maybe_mapped_shared(folio)
3) Make sure that there are no swap references:
   if in the swapcache, check the actual swapcount
4) Make sure it is not a KSM folio
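Something like this very rough, untested sketch (the helper name is made
up, error handling and the retry logic are missing, and whether
swp_swapcount() is the right swap-reference check is exactly the subtle
part):

/*
 * Rough sketch only: is this small folio exclusively ours, even though
 * PageAnonExclusive() is clear? Caller holds the PTL of the mapping pte.
 */
static bool small_folio_is_exclusive(struct folio *folio)
{
	bool exclusive = false;

	/* 4) no KSM, and small folios only */
	if (folio_test_large(folio) || folio_test_ksm(folio))
		return false;

	/* 1) take the folio lock (no sleeping under the PTL) */
	if (!folio_trylock(folio))
		return false;

	/* 2) only this present page table mapping */
	if (folio_maybe_mapped_shared(folio))
		goto unlock;

	/* 3) no swap references besides the swapcache entry itself */
	if (folio_test_swapcache(folio) && swp_swapcount(folio->swap))
		goto unlock;

	exclusive = true;
unlock:
	folio_unlock(folio);
	return exclusive;
}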
THPs are way, way, way more complicated to get right that way. Likely,
the scenario described above cannot happen with a PMD-mapped THP for now
at least (we don't have PMD swap entries).
Of course, we'd then also have to handle the case when we have a swap
pte where the marker is not set (e.g., because of swapout after the
described swapin where we dropped the marker).
What could be easier is triggering a FAULT_FLAG_UNSHARE fault. It's
arguably less optimal in case the core decides to swap in / CoW, but
it leaves the magic to get all this right to the core -- and mimics the
approach Lokesh uses.
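For illustration only, in move_pages_pte() that could look roughly like
this (untested sketch; "retry" is a made-up label, and all other locks
would have to be dropped before calling into the fault path):

	if (!folio || !PageAnonExclusive(&folio->page)) {
		spin_unlock(src_ptl);
		/*
		 * Let the core resolve the "maybe shared" state via an
		 * unshare fault instead of failing with -EBUSY.
		 */
		if (handle_mm_fault(src_vma, src_addr, FAULT_FLAG_UNSHARE,
				    NULL) & VM_FAULT_ERROR) {
			err = -EBUSY;
			goto out;
		}
		/* made-up label: redo the whole pte lookup */
		goto retry;
	}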
But then, maybe letting userspace just do a uffdio_copy would be even
faster (only a single TLB shootdown?).
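I.e., roughly this on the userspace side (sketch, error handling trimmed;
assumes uffd is the registered userfaultfd and the ranges are
page-aligned):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Sketch: try UFFDIO_MOVE first, fall back to UFFDIO_COPY on EBUSY. */
static int move_or_copy(int uffd, unsigned long dst, unsigned long src,
			unsigned long len)
{
	struct uffdio_move mv = { .dst = dst, .src = src, .len = len };

	if (ioctl(uffd, UFFDIO_MOVE, &mv) == 0)
		return 0;
	if (errno != EBUSY)
		return -errno;

	/* Source not exclusive (or pinned): copy the contents instead. */
	struct uffdio_copy cp = { .dst = dst, .src = src, .len = len };

	return ioctl(uffd, UFFDIO_COPY, &cp) == 0 ? 0 : -errno;
}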
I am also skeptical of calling this a BUG here. It's documented to behave
exactly like that [1]:
EBUSY
       The pages in the source virtual memory range are either
       pinned or not exclusive to the process. The kernel might
       only perform lightweight checks for detecting whether the
       pages are exclusive. To make the operation more likely to
       succeed, KSM should be disabled, fork() should be avoided
       or MADV_DONTFORK should be configured for the source
       virtual memory area before fork().
Note the "lightweight" and "more likely to succeed".
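(On the userspace side, the documented advice amounts to something like
the following, with src_area/src_len standing in for the real source
range:)

#include <stdio.h>
#include <sys/mman.h>

static void protect_uffd_src(void *src_area, size_t src_len)
{
	/* Keep the uffd source range out of any child created by fork(). */
	if (madvise(src_area, src_len, MADV_DONTFORK))
		perror("madvise(MADV_DONTFORK)");
}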
[1] https://lore.kernel.org/lkml/20231206103702.3873743-3-surenb@google.com/
--
Cheers,
David / dhildenb