Message-ID: <d302a026-b49b-45d9-ba06-ad2f1c3f14e6@redhat.com>
Date: Tue, 6 Aug 2024 19:30:30 +0200
From: David Hildenbrand <david@...hat.com>
To: Usama Arif <usamaarif642@...il.com>, akpm@...ux-foundation.org,
 linux-mm@...ck.org
Cc: hannes@...xchg.org, riel@...riel.com, shakeel.butt@...ux.dev,
 roman.gushchin@...ux.dev, yuzhao@...gle.com, baohua@...nel.org,
 ryan.roberts@....com, rppt@...nel.org, willy@...radead.org,
 cerasuolodomenico@...il.com, corbet@....net, linux-kernel@...r.kernel.org,
 linux-doc@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH 0/6] mm: split underutilized THPs

On 06.08.24 19:17, Usama Arif wrote:
> 
> 
> On 05/08/2024 00:04, Usama Arif wrote:
>>
>>
>> On 01/08/2024 07:36, David Hildenbrand wrote:
>>>>> I just added a bunch of quick printfs to QEMU and ran a precopy+postcopy live migration. Looks like my assumption was right:
>>>>>
>>>>> On the destination:
>>>>>
>>>>> Writing received pages during precopy # ram_load_precopy()
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Writing received pages during precopy
>>>>> Disabling THP: MADV_NOHUGEPAGE # postcopy_ram_prepare_discard()
>>>>> Discarding pages # loadvm_postcopy_ram_handle_discard()
>>>>> Discarding pages
>>>>> Discarding pages
>>>>> Discarding pages
>>>>> Discarding pages
>>>>> Discarding pages
>>>>> Discarding pages
>>>>> Registering UFFD # postcopy_ram_incoming_setup()
>>>>>
>>>>
>>>> Thanks for this, yes it makes sense after you mentioned postcopy_ram_incoming_setup.
>>>> postcopy_ram_incoming_setup happens in the Listen phase, which is after the discard phase, so I was able to follow in the QEMU code the same sequence of events that the above prints show.
>>>
>>>
>>> I just added another printf to postcopy_ram_supported_by_host(), where we temporarily do a UFFDIO_REGISTER on some test area.
>>>
>>> Sensing UFFD support # postcopy_ram_supported_by_host()
>>> Sensing UFFD support
>>> Writing received pages during precopy # ram_load_precopy()
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Disabling THP: MADV_NOHUGEPAGE # postcopy_ram_prepare_discard()
>>> Discarding pages # loadvm_postcopy_ram_handle_discard()
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Registering UFFD # postcopy_ram_incoming_setup()
>>>
>>> We could think about using this "ever used uffd" to avoid the shared zeropage in most processes.
>>>
>>> Of course, there might be other applications where that wouldn't work, but I think this behavior (writing to an area before enabling uffd) might be fairly QEMU-specific already.
>>>
>>> Avoiding the shared zeropage has the benefit that a later write fault won't have to do a TLB flush and can simply install a fresh anon page.
>>>
>>
>> I checked CRIU and that does a check at the start as well before attempting to use uffd: https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/kerndat.c#L1349
>>
>> If writing to an area before enabling uffd is likely to be QEMU-specific, then you make a good point: clear the PTE instead of using the shared zeropage, to avoid the TLB flush if uffd is ever used.
>>
>> I think "ever used uffd" would need to be tracked using mm_struct. This also won't cause an issue if the check is done in a parent process and the actual use is in a forked process, as copy_mm should take care of it.
>> The possibilities would then be:
>> 1) Have a new bit in mm->flags, set it in new_userfaultfd and test it in try_to_unmap_unused; but unfortunately all the bits in mm->flags are taken.
>> 2) Use mm->def_flags, as there looks to be an unused bit (0x800) just before VM_UFFD_WP. But that makes the code confusing, as it's used to initialize the default flags for VMAs and is not supposed to be used as a "mm flag".
>> 3) Introduce mm->flags2 and set/test it as in 1. This would add an 8-byte overhead to every mm_struct.
>>
>> I am not sure either 2 or 3 is acceptable upstream, unless there is a need for more flags in the near future and the 8-byte overhead starts to make sense. Maybe we go with the shared zeropage?
>>
> There is another option: use bit 32 of mm->flags, for 64-bit kernels only, for ever_used_uffd. But considering the two reasons below, I will send a v2 of the series (in a few days, in case any more comments come up) using the shared zeropage in all circumstances (and addressing the comments in the other patches).
> - "ever used uffd" is not 100% safe, i.e. someone might not check uffd support before using it and do the same sequence of events as qemu precopy + postcopy (+ some bitmap to track and check whether to request a page from the uffd handler). It's very unlikely that anyone else does this, but we have to cater for all current and future use cases.

Possible, but IMHO unlikely :) I'd be willing to take a risk here if we 
make sure CRIU is also fine.

> - If THP shrinker is splitting and pointing pages to a shared zeropage, then the page was considered "unused" and unlikely to have a write fault at some point in the near future, hence the probability of incurring that TLB flush on write fault is low.

Surely evidence might help. But it's easy to find cases where this might 
not be true: we populated a THP for a VM and the VM only used some of 
the memory. Then we split, then the VM wants to use the remaining pieces.


Reading your prior email, I just wanted to say that mm->flags is an 
unsigned long, so you should just make the shrinker 64-bit-specific and 
use bit 33 in mm->flags.

But starting with "all shared zeropage" as a first step also works.

-- 
Cheers,

David / dhildenb

