Message-ID: <87cyi7cjb0.fsf@oracle.com>
Date: Wed, 04 Dec 2024 11:57:55 -0800
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Frank van der Linden <fvdl@...gle.com>
Cc: Ankur Arora <ankur.a.arora@...cle.com>,
        Mateusz Guzik <mjguzik@...il.com>,
        linux-mm@...ck.org, akpm@...ux-foundation.org,
        Muchun Song <muchun.song@...ux.dev>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Oscar Salvador <osalvador@...e.de>,
        David Hildenbrand <david@...hat.com>,
        Peter Xu <peterx@...hat.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm/hugetlb: optionally pre-zero hugetlb pages


Frank van der Linden <fvdl@...gle.com> writes:

> On Tue, Dec 3, 2024 at 4:05 PM Ankur Arora <ankur.a.arora@...cle.com> wrote:
>>
>>
>> Mateusz Guzik <mjguzik@...il.com> writes:
>>
>> > On Mon, Dec 02, 2024 at 08:20:58PM +0000, Frank van der Linden wrote:
>> >> Fresh hugetlb pages are zeroed out when they are faulted in,
>> >> just like with all other page types. This can take up a good
>> >> amount of time for larger page sizes (e.g. around 40
>> >> milliseconds for a 1G page on a recent AMD-based system).
>> >>
>> >> This normally isn't a problem, since hugetlb pages are typically
>> >> mapped by the application for a long time, and the initial
>> >> delay when touching them isn't much of an issue.
>> >>
>> >> However, there are some use cases where a large number of hugetlb
>> >> pages are touched when an application (such as a VM backed by these
>> >> pages) starts. For 256 1G pages and 40ms per page, this would take
>> >> 10 seconds, a noticeable delay.
>> >
>> > The current huge page zeroing code is not that great to begin with.
>>
>> Yeah definitely suboptimal. The current huge page zeroing code is
>> both slow and it trashes the cache while zeroing.
>>
>> > There was a patchset posted some time ago to remedy at least some of it:
>> > https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/
>> >
>> > but it apparently fell through the cracks.
>>
>> As Joao mentioned that got side tracked due to the preempt-lazy stuff.
>> Now that lazy is in, I plan to follow up on the zeroing work.
>>
>> > Any games with "background zeroing" are notoriously crappy and I would
>> > argue one should exhaust other avenues before going there -- at the end
>> > of the day the cost of zeroing will have to get paid.
>>
>> Yeah and the background zeroing has dual cost: the cost in CPU time plus
>> the indirect cost to other processes due to the trashing of L3 etc.
>
> I'm not sure what you mean here - any caching side effects of zeroing
> happen regardless of who does it, right?

Sure.

> It doesn't matter if it's a
> kthread or the calling thread.

As other people point out, it's more a matter of accruing it to the
right context. The noise will always spill over, but userspace can use
cpu cgroups etc. to limit how far these effects are seen.

Additionally, this kthread will be doing bulk zeroing while a user
thread would zero as needed. Though I guess for the VM prefaulting
case it's likely similar.

> If you're concerned about the caching side effects in general, using
> non-temporal instructions helps (e.g. movnti on x86). See the link I
> mentioned for a patch that was sent years ago (
> https://lore.kernel.org/all/20180725023728.44630-1-cannonmatthews@google.com/
> ). Using movnti on x86 definitely helps performance (up to 50% in my
> experiments). Which is great, but it still leaves considerable delay
> for the use case I mentioned.

In my testing, at least on AMD, you can get a lot more than 50%
improvement.

See for instance the CLZERO (or the REP STOS) numbers here: https://lore.kernel.org/lkml/20220606202109.1306034-1-ankur.a.arora@oracle.com/
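
For readers unfamiliar with the non-temporal stores discussed above, here
is a minimal userspace sketch of zeroing a buffer with MOVNTI via the
_mm_stream_si64() intrinsic. It is not the code from either of the linked
patchsets and makes no claim about the numbers quoted in this thread. The
helper name zero_nontemporal() is hypothetical, and the sketch assumes an
8-byte-aligned buffer whose length is a multiple of 8. On non-x86-64
builds it falls back to a plain memset().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_stream_si64() (MOVNTI), _mm_sfence() */
#endif

/*
 * Hypothetical helper: zero len bytes at dst, bypassing the cache
 * hierarchy where the architecture allows it.
 * Assumes dst is 8-byte aligned and len is a multiple of 8.
 */
static void zero_nontemporal(void *dst, size_t len)
{
	assert(((uintptr_t)dst % 8) == 0 && (len % 8) == 0);
#if defined(__x86_64__)
	long long *p = dst;
	size_t i;

	for (i = 0; i < len / 8; i++)
		_mm_stream_si64(&p[i], 0);	/* non-temporal 64-bit store */

	/* Streaming stores are weakly ordered; fence before later stores. */
	_mm_sfence();
#else
	memset(dst, 0, len);			/* fallback: ordinary cached stores */
#endif
}
```

A caller would pass a suitably aligned region, for example
zero_nontemporal(buf, sizeof(buf)) on a uint64_t array. The trailing
fence matters because non-temporal stores are not ordered against
subsequent ordinary stores.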

--
ankur
