Message-ID: <58025293-c70f-4377-b8be-39994136af83@redhat.com>
Date: Thu, 1 Aug 2024 08:36:32 +0200
From: David Hildenbrand <david@...hat.com>
To: Usama Arif <usamaarif642@...il.com>, akpm@...ux-foundation.org,
linux-mm@...ck.org
Cc: hannes@...xchg.org, riel@...riel.com, shakeel.butt@...ux.dev,
roman.gushchin@...ux.dev, yuzhao@...gle.com, baohua@...nel.org,
ryan.roberts@....com, rppt@...nel.org, willy@...radead.org,
cerasuolodomenico@...il.com, corbet@....net, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH 0/6] mm: split underutilized THPs
>> I just added a bunch of quick printfs to QEMU and ran a precopy+postcopy live migration. Looks like my assumption was right:
>>
>> On the destination:
>>
>> Writing received pages during precopy # ram_load_precopy()
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Disabling THP: MADV_NOHUGEPAGE # postcopy_ram_prepare_discard()
>> Discarding pages # loadvm_postcopy_ram_handle_discard()
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Registering UFFD # postcopy_ram_incoming_setup()
>>
>
> Thanks for this, yes it makes sense after you mentioned postcopy_ram_incoming_setup.
> postcopy_ram_incoming_setup happens in the Listen phase, which is after the discard phase, so I was able to follow the same sequence of events in the QEMU code that the above prints show.
I just added another printf to postcopy_ram_supported_by_host(), where
we temporarily do a UFFDIO_REGISTER on some test area.
Sensing UFFD support # postcopy_ram_supported_by_host()
Sensing UFFD support
Writing received pages during precopy # ram_load_precopy()
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Disabling THP: MADV_NOHUGEPAGE # postcopy_ram_prepare_discard()
Discarding pages # loadvm_postcopy_ram_handle_discard()
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Registering UFFD # postcopy_ram_incoming_setup()
We could think about using this "ever used uffd" to avoid the shared
zeropage in most processes.
Of course, there might be other applications where that wouldn't work,
but I think this behavior (write to area before enabling uffd) might be
fairly QEMU specific already.
Avoiding the shared zeropage has the benefit that a later write fault
won't have to do a TLB flush and can simply install a fresh anon page.
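For reference, the sensing step at the top of the log is what would make an
"ever used uffd" indication true before the first precopy write. A minimal
userspace sketch of such a probe follows. It is not the QEMU implementation:
probe_uffd_support() is a hypothetical name, and only the userfaultfd
syscall plus the UFFDIO_API, UFFDIO_REGISTER and UFFDIO_UNREGISTER ioctls
from <linux/userfaultfd.h> are used.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a temporary UFFDIO_REGISTER on a
 * scratch page succeeds, 0 if userfaultfd is unavailable or not permitted,
 * and -1 if the scratch mapping could not be created.
 */
int probe_uffd_support(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int supported = 0;

    int fd = (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (fd < 0)
        return 0; /* e.g. ENOSYS, or EPERM for unprivileged users */

    /* API handshake, requesting no optional features. */
    struct uffdio_api api;
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    if (ioctl(fd, UFFDIO_API, &api) < 0) {
        close(fd);
        return 0;
    }

    void *area = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Temporarily register the test area for missing-page faults. */
    struct uffdio_register reg;
    memset(&reg, 0, sizeof(reg));
    reg.range.start = (uintptr_t)area;
    reg.range.len = (uint64_t)page;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(fd, UFFDIO_REGISTER, &reg) == 0) {
        struct uffdio_range range = reg.range;
        (void)ioctl(fd, UFFDIO_UNREGISTER, &range);
        supported = 1;
    }

    munmap(area, (size_t)page);
    close(fd);
    return supported;
}
```

The point of the sketch is only the ordering: the registration on the test
area is torn down again, but it has happened in this process before any
guest RAM is written.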
>>
>> Let me know if you need more information.
>>
>>> Thanks for pointing to mm_forbids_zeropage. Incorporating that into the code, and if I am (hopefully :)) right about qemu and kernel above, then I believe the right code should be:
>>
>> I'm afraid you are not right about the qemu code :)
>>
>
> Yes, and I also didn't consider MADV_DONTNEED! Thanks for explaining both of these things clearly. It's clear that pte_clear won't work in this case.
>
> We don't need to clear_pte, just use zero_page for all cases. The original series from Alex did a TLB flush, but looking further at the code, that's not needed. try_to_migrate() flushes the TLB and installs migration entries which are not ‘present’, so they will never be TLB cached. remove_migration_ptes() restores page pointers, so TLB flushing is not needed. When using the zeropage, we don't need to make a distinction whether uffd is used or not, i.e. we can just do the below:
>
> if (contains_data || mm_forbids_zeropage(pvmw->vma->vm_mm))
It's worth noting that on s390x, MMs that forbid the zeropage also have
THPs disabled. So we shouldn't really run into that very often (of
course, it's subject to change in the future, so we better have this
check here).
> return false;
>
> newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
> pvmw->vma->vm_page_prot));
>
> set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
We're replacing a present page by another present page without doing a
TLB flush in between. I *think* this should be fine because the new
present page is R/O and cannot possibly be written to.
--
Cheers,
David / dhildenb