Message-ID: <7a104117-0b38-4944-8cc3-9942add668b4@arm.com>
Date: Fri, 29 Aug 2025 16:39:51 +0530
From: Dev Jain <dev.jain@....com>
To: Ryan Roberts <ryan.roberts@....com>, akpm@...ux-foundation.org,
david@...hat.com, shuah@...nel.org
Cc: lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com, vbabka@...e.cz,
rppt@...nel.org, surenb@...gle.com, mhocko@...e.com, npache@...hat.com,
linux-mm@...ck.org, linux-kselftest@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] selftests/mm/uffd-stress: Make test operate on less
hugetlb memory
On 28/08/25 8:20 pm, Ryan Roberts wrote:
> On 26/08/2025 08:07, Dev Jain wrote:
>> We observed an uffd-stress selftest failure on arm64 and intermittent
>> failures on x86 too:
>> running ./uffd-stress hugetlb-private 128 32
>>
>> bounces: 17, mode: rnd read, ERROR: UFFDIO_COPY error: -12 (errno=12, @uffd-common.c:617) [FAIL]
>> not ok 18 uffd-stress hugetlb-private 128 32 # exit=1
>>
>> For this particular case, the number of free hugepages computed by
>> run_vmtests.sh will be 128, and the test will allocate 64 hugepages in the
>> source location. The stress() function will then spawn threads which
>> operate on the destination location, triggering uffd operations like
>> UFFDIO_COPY from src to dst, which means we will require 64 more hugepages
>> for the dst location.
>>
>> Consider the locking_thread() function. It locks the mutex kept at dst,
>> triggering uffd-copy. Suppose that 127 hugepages (64 for src and 63 for
>> dst) have already been reserved. In case of BOUNCE_RANDOM, it may happen
>> that two threads trying to lock the mutex at dst pick the same hugepage
>> number. If one thread succeeds in reserving the last hugepage, the other
>> thread may fail in alloc_hugetlb_folio(), returning -ENOMEM. I can confirm
>> that this is indeed the case with this hacky patch:
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 753f99b4c718..39eb21d8a91b 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -6929,6 +6929,11 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>>
>> folio = alloc_hugetlb_folio(dst_vma, dst_addr, false);
>> if (IS_ERR(folio)) {
>> + pte_t *actual_pte = hugetlb_walk(dst_vma, dst_addr, PMD_SIZE);
>> + if (actual_pte) {
>> + ret = -EEXIST;
>> + goto out;
>> + }
>> ret = -ENOMEM;
>> goto out;
>> }
>>
>> This code path gets triggered, indicating that the PMD at which one thread
>> is trying to map a hugepage gets filled by a racing thread.
>>
>> Therefore, instead of using freepgs to compute the amount of memory,
>> use freepgs - 10, so that the test still has some extra hugepages to use.
>> Note that, in case this value underflows, the test itself checks the
>> number of free hugepages and will fail, so we are safe.
>>
>> Signed-off-by: Dev Jain <dev.jain@....com>
>> ---
>> tools/testing/selftests/mm/run_vmtests.sh | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
>> index 471e539d82b8..6a9f435be7a1 100755
>> --- a/tools/testing/selftests/mm/run_vmtests.sh
>> +++ b/tools/testing/selftests/mm/run_vmtests.sh
>> @@ -326,7 +326,7 @@ CATEGORY="userfaultfd" run_test ${uffd_stress_bin} anon 20 16
>> # the size of the free pages we have, which is used for *each*.
>> # uffd-stress expects a region expressed in MiB, so we adjust
>> # half_ufd_size_MB accordingly.
>> -half_ufd_size_MB=$(((freepgs * hpgsize_KB) / 1024 / 2))
>> +half_ufd_size_MB=$((((freepgs - 10) * hpgsize_KB) / 1024 / 2))
> Why 10? I don't know much about uffd-stress but the comment at the top says it
> runs 3 threads per CPU, so does the number of potential races increase with the
> number of CPUs? Perhaps this number needs to be a function of nrcpu?
Yes, the race amplifies with nr_cpus; technically we need nr_cpus - 1 extra
hugepages. The worst case is that all threads try to perform uffd-copy on the
same address: one of them reserves the last hugepage and the others fail. 10
was just an arbitrary number; I see that run_vmtests.sh already has nr_cpus
computed, so I can easily reuse that. I'll send a v2, roughly along the lines
sketched below.
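
A minimal sketch of what I have in mind for v2 (assuming the existing
half_ufd_size_MB computation stays in place and that the nr_cpus variable
already computed in run_vmtests.sh is in scope at that point; the exact
margin may differ in the actual patch):

# Keep some spare hugepages so that racing uffd-copy threads targeting the
# same dst hugepage cannot exhaust the pool (the worst case needs
# nr_cpus - 1 extras, as described above).
half_ufd_size_MB=$((((freepgs - nr_cpus) * hpgsize_KB) / 1024 / 2))

For example, with 128 free 2 MiB hugepages (hpgsize_KB = 2048) and, say, 8
CPUs, this gives (128 - 8) * 2048 / 1024 / 2 = 120 MiB per region, i.e. 60
hugepages each for src and dst, leaving 8 spare.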
>
> I tested it and it works though so:
>
> Tested-by: Ryan Roberts <ryan.roberts@....com>
Thanks.
>
>> CATEGORY="userfaultfd" run_test ${uffd_stress_bin} hugetlb "$half_ufd_size_MB" 32
>> CATEGORY="userfaultfd" run_test ${uffd_stress_bin} hugetlb-private "$half_ufd_size_MB" 32
>> CATEGORY="userfaultfd" run_test ${uffd_stress_bin} shmem 20 16