Message-ID: <a8c813b5-abce-48cf-9d14-2f969d6c8180@redhat.com>
Date: Wed, 7 Aug 2024 13:39:38 +0200
From: David Hildenbrand <david@...hat.com>
To: Dev Jain <dev.jain@....com>, Will Deacon <will@...nel.org>
Cc: akpm@...ux-foundation.org, willy@...radead.org, ryan.roberts@....com,
 anshuman.khandual@....com, catalin.marinas@....com, cl@...two.org,
 vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com, osalvador@...e.de,
 baolin.wang@...ux.alibaba.com, dave.hansen@...ux.intel.com,
 baohua@...nel.org, ioworker0@...il.com, gshan@...hat.com,
 mark.rutland@....com, kirill.shutemov@...ux.intel.com, hughd@...gle.com,
 aneesh.kumar@...nel.org, yang@...amperecomputing.com, peterx@...hat.com,
 broonie@...nel.org, linux-arm-kernel@...ts.infradead.org,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org, mgorman@...hsingularity.net
Subject: Re: Race condition observed between page migration and page fault
 handling on arm64 machines

On 05.08.24 16:14, Dev Jain wrote:
> 
> On 8/5/24 16:16, David Hildenbrand wrote:
>> On 05.08.24 11:51, Dev Jain wrote:
>>>
>>> On 8/1/24 19:18, David Hildenbrand wrote:
>>>> On 01.08.24 15:43, Will Deacon wrote:
>>>>> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote:
>>>>>> On 01.08.24 15:13, David Hildenbrand wrote:
>>>>>>>>>> To dampen the tradeoff, we could do this in shmem_fault()
>>>>>>>>>> instead? But then, this would mean that we do this in all kinds
>>>>>>>>>> of vma->vm_ops->fault, only when we discover another reference
>>>>>>>>>> count race condition :) Doing this in do_fault() should solve
>>>>>>>>>> this once and for all. In fact, do_pte_missing() may call
>>>>>>>>>> do_anonymous_page() or do_fault(), and I just noticed that the
>>>>>>>>>> former already checks this using vmf_pte_changed().
>>>>>>>>>
>>>>>>>>> What I am still missing is why this is (a) arm64 only; and (b)
>>>>>>>>> whether this is something we should really worry about. There are
>>>>>>>>> other reasons (e.g., speculative references) why migration could
>>>>>>>>> temporarily fail; does it happen so often that it is really
>>>>>>>>> something we have to worry about?
>>>>>>>>
>>>>>>>>
>>>>>>>> (a) See discussion at [1]; I guess it passes on x86, which is quite
>>>>>>>> strange since the race is clearly arch-independent.
>>>>>>>
>>>>>>> Yes, I think this is what we have to understand. Is the race simply
>>>>>>> less
>>>>>>> likely to trigger on x86?
>>>>>>>
>>>>>>> I would assume that it would trigger on any arch.
>>>>>>>
>>>>>>> I just ran it on an x86 VM with 2 NUMA nodes and it also seems to
>>>>>>> work here.
>>>>>>>
>>>>>>> Is this maybe related to deferred flushing? Such that the other
>>>>>>> CPU just happens to observe the !pte_none PTE a little less often?
>>>>>>>
>>>>>>> But arm64 also usually defers flushes, right? At least unless
>>>>>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do
>>>>>>> deferred
>>>>>>> flushes.
>>>>>>
>>>>>> Bingo!
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index e51ed44f8b53..ce94b810586b 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
>>>>>>      */
>>>>>>     static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>>>>>>     {
>>>>>> -       if (!(flags & TTU_BATCH_FLUSH))
>>>>>> -               return false;
>>>>>> -
>>>>>> -       return arch_tlbbatch_should_defer(mm);
>>>>>> +       return false;
>>>>>>     }
>>>>>>
>>>>>>
>>>>>> On x86:
>>>>>>
>>>>>> # ./migration
>>>>>> TAP version 13
>>>>>> 1..1
>>>>>> # Starting 1 tests from 1 test cases.
>>>>>> #  RUN           migration.shared_anon ...
>>>>>> Didn't migrate 1 pages
>>>>>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1, self->n2) (-2) == 0 (0)
>>>>>> # shared_anon: Test terminated by assertion
>>>>>> #          FAIL  migration.shared_anon
>>>>>> not ok 1 migration.shared_anon
>>>>>>
>>>>>>
>>>>>> It fails all of the time!
>>>>>
>>>>> Nice work! I suppose that makes sense as, with the eager TLB
>>>>> invalidation, the window between the other CPU faulting and the
>>>>> migration entry being written is fairly wide.
>>>>>
>>>>> Not sure about a fix though :/ It feels a bit overkill to add a new
>>>>> invalid pte encoding just for this.
>>>>
>>>> Something like that might make the test happy in most cases:
>>>>
>>>> diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c
>>>> index 6908569ef406..4c18bfc13b94 100644
>>>> --- a/tools/testing/selftests/mm/migration.c
>>>> +++ b/tools/testing/selftests/mm/migration.c
>>>> @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>>>           int ret, tmp;
>>>>           int status = 0;
>>>>           struct timespec ts1, ts2;
>>>> +       int errors = 0;
>>>>
>>>>           if (clock_gettime(CLOCK_MONOTONIC, &ts1))
>>>>                   return -1;
>>>> @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2)
>>>>                   ret = move_pages(0, 1, (void **) &ptr, &n2, &status,
>>>>                                   MPOL_MF_MOVE_ALL);
>>>>                   if (ret) {
>>>> -                       if (ret > 0)
>>>> +                       if (ret > 0) {
>>>> +                               if (++errors < 100)
>>>> +                                       continue;
>>>>                                   printf("Didn't migrate %d pages\n", ret);
>>>> -                       else
>>>> +                       } else {
>>>>                                   perror("Couldn't migrate pages");
>>>> +                       }
>>>>                           return -2;
>>>>                   }
>>>> +               /* Progress! */
>>>> +               errors = 0;
>>>>
>>>>                   tmp = n2;
>>>>                   n2 = n1;
>>>>
>>>>
>>>> [root@...alhost mm]# ./migration
>>>> TAP version 13
>>>> 1..1
>>>> # Starting 1 tests from 1 test cases.
>>>> #  RUN           migration.shared_anon ...
>>>> #            OK  migration.shared_anon
>>>> ok 1 migration.shared_anon
>>>> # PASSED: 1 / 1 tests passed.
>>>> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
>>>
>>>
>>> This does make the test pass, to my surprise, since what you are doing
>>> from userspace should have been done by the kernel, because it retries
>>> folio unmapping and moving NR_MAX_MIGRATE_(A)SYNC_RETRY times; I had
>>> already tested pumping up these macros and the original test was still
>>> failing. Now, I dug in more, and, if the following assertion is correct:
>>>
>>> Any thread having a reference on a folio will end up calling
>>> folio_lock()
>>>
>>
>> Good point. I suspect concurrent things like read/write would also be
>> able to trigger this (did not check, though).
>>
>>>
>>> then it seems to me that the retry for loop wrapped around
>>> migrate_folio_move(), inside migrate_pages_batch(), is useless; if
>>> migrate_folio_move() fails on the first iteration, it is going to fail
>>> for all iterations since, if I am reading the code path correctly, the
>>> only way it fails is when the actual refcount is not equal to the
>>> expected refcount (in folio_migrate_mapping()), and there is no way
>>> that the extra refcount is going to get released, since the migration
>>> path holds the folio lock.
>>>
>>> And therefore, this begs the question: isn't it logical to assert the
>>> actual refcount against the expected refcount just after we have
>>> changed the PTEs, so that if this assertion fails, we can go to the
>>> next iteration of the for loop for migrate_folio_unmap() inside
>>> migrate_pages_batch(), by calling migrate_folio_undo_src()/dst() to
>>> restore the old state? I am trying to implement this but it is not as
>>> straightforward as it seemed to me this morning.
>>
>> I agree with your assessment that migration code currently doesn't
>> handle the case well when some other thread does an unconditional
>> folio_lock(). folio_trylock() users would be handled, but that's not
>> what we want with FGP_LOCK I assume.
>>
>> So IIUC, your idea would be to unlock the folio in migration code and
>> try again there. Sounds reasonable, without looking into the details :)
> 
> 

BTW, I was trying to find the spot that would do the folio_lock(), but 
filemap_fault() does the lock_folio_maybe_drop_mmap() where we do a 
folio_trylock().

Where exactly is the folio_lock() on the fault path that would prevent
us from making progress?

> (Adding Mel in case he has any comments for a compaction use-case)
> 
> What I was trying to say is this (forgive me for the hard-coded value):

The hard-coded 2 is wrong indeed :)

> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a8c6f466e33a..404af46dd661 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1262,6 +1262,8 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
>          }
>          if (!folio_mapped(src)) {
> +                if (folio_ref_count(src) != 2)
> +                        goto out;
>                  __migrate_folio_record(dst, old_page_state, anon_vma);
>                  return MIGRATEPAGE_UNMAP;
>          }
>
> This will give us more chances to win the race. On average, the test
> now fails after around 100 iterations of move_pages(). If you multiply
> the NR_MAX_MIGRATE_(A)SYNC_RETRY macros by 3, the average goes above
> 2000.
> But if the consensus is that this is just pleasing the test without
> any real use-case (compaction?) then I guess I am alright with making
> the change in the test.

I'd be curious if other activity (besides fault, like concurrent 
read()/write()/...) can similarly make migration fail. I suspect it can.
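
Something along these lines could be used to poke at that from
userspace. Completely untested sketch; the node numbers 0/1, the
iteration count and the use of libnuma's move_pages() wrapper
(build with -lnuma -pthread, MPOL_MF_MOVE_ALL wants CAP_SYS_NICE)
are just assumptions for illustration:

/*
 * Untested sketch, not part of the selftest: one thread keeps read()ing
 * from /dev/zero into a MAP_SHARED|MAP_ANONYMOUS buffer, so the page is
 * accessed from the syscall path, while the main thread bounces the
 * buffer between two NUMA nodes with move_pages().
 */
#include <fcntl.h>
#include <numaif.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void *buf;
static size_t pagesz;

static void *io_thread(void *arg)
{
        int fd = open("/dev/zero", O_RDONLY);

        (void)arg;
        if (fd < 0)
                return NULL;
        /* Touch the mapping from kernel context instead of a plain user fault. */
        for (;;)
                if (read(fd, buf, pagesz) < 0)
                        break;
        return NULL;
}

int main(void)
{
        int n1 = 0, n2 = 1, node, status, i;
        long ret;
        pthread_t t;

        pagesz = getpagesize();
        buf = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;
        memset(buf, 0xab, pagesz);

        pthread_create(&t, NULL, io_thread, NULL);

        for (i = 0; i < 10000; i++) {
                /* Alternate the target node each iteration. */
                node = (i & 1) ? n2 : n1;
                ret = move_pages(0, 1, &buf, &node, &status,
                                 MPOL_MF_MOVE_ALL);
                if (ret > 0)
                        printf("iteration %d: didn't migrate %ld pages\n",
                               i, ret);
                else if (ret < 0)
                        perror("move_pages");
        }
        return 0;
}

If migration keeps failing with the io thread running but succeeds
without it, that would point at the syscall path taking a folio
reference at the wrong time as well.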

-- 
Cheers,

David / dhildenb

