Message-ID: <0D144AAE-706F-4674-AB20-1FD3A3537E33@nvidia.com>
Date: Wed, 22 Jan 2025 11:41:24 -0500
From: Zi Yan <ziy@...dia.com>
To: Jiaqi Yan <jiaqiyan@...gle.com>
Cc: nao.horiguchi@...il.com, linmiaohe@...wei.com, tony.luck@...el.com,
 wangkefeng.wang@...wei.com, willy@...radead.org, jane.chu@...cle.com,
 akpm@...ux-foundation.org, osalvador@...e.de, rientjes@...gle.com,
 duenwen@...gle.com, jthoughton@...gle.com, jgg@...dia.com, ankita@...dia.com,
 peterx@...hat.com, sidhartha.kumar@...cle.com, david@...hat.com,
 dave.hansen@...ux.intel.com, muchun.song@...ux.dev, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [RFC PATCH v1 0/3] Userspace MFR Policy via memfd

On 18 Jan 2025, at 18:15, Jiaqi Yan wrote:

<snip>

> MemCycler Benchmarking
> ======================
>
> To follow up on Dave Hansen's question, “If one motivation for this is
> guest performance, then it would be great to have some data to back that
> up, even if it is worst-case data”, we ran MemCycler in a guest and
> compared its performance in the presence of an extremely large number of
> memory errors.
>
> The MemCycler benchmark cycles through memory with multiple threads. On
> each iteration, each thread reads the current value, validates it, and
> writes a counter value. The benchmark continuously outputs rates
> indicating the speed at which it is reading and writing 64-bit integers,
> and aggregates the reads and writes of all threads across multiple
> iterations into a single rate (unit: 64-bit words per microsecond).
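
As a rough illustration of that loop (not the actual MemCycler source;
the function name, validation scheme, and rate accounting below are
assumptions), one worker pass might look like:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of one MemCycler worker pass: read each 64-bit
 * slot, validate it against the value written on the previous pass,
 * then write the new counter value.  The caller divides the returned
 * op count by elapsed microseconds to get the per-thread rate, and the
 * per-thread rates are summed into the aggregate rate. */
static uint64_t cycle_once(uint64_t *buf, size_t nwords, uint64_t pass,
                           uint64_t *mismatches)
{
        uint64_t ops = 0;

        for (size_t i = 0; i < nwords; i++) {
                uint64_t v = buf[i];                    /* read */
                if (pass > 0 && v != pass - 1)          /* validate */
                        (*mismatches)++;
                buf[i] = pass;                          /* write */
                ops += 2;                               /* one read, one write */
        }
        return ops;
}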
>
> MemCycler runs inside a VM with 80 vCPUs and 640 GB of guest memory. The
> host platform uses Intel Emerald Rapids CPUs (120 physical cores in
> total) and 1.5 TB of DDR5 memory. MemCycler allocates its memory with 2M
> transparent hugepages in the guest, and our in-house VMM backs the guest
> memory with 2M transparent hugepages on the host. The final aggregate
> rate after 60 seconds of runtime is 17,204.69 and is referred to as the
> baseline case.
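
For reference, requesting THP-backed memory from the guest side
typically looks like the generic sketch below (mmap() plus an
MADV_HUGEPAGE hint; this is not MemCycler's or the VMM's actual code):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Generic sketch: map anonymous memory and hint the kernel to back it
 * with transparent hugepages; 2M pages are used for suitably aligned
 * 2M extents when THP is enabled and memory is available. */
static void *alloc_thp(size_t bytes)
{
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return NULL;
        madvise(p, bytes, MADV_HUGEPAGE);       /* advisory only */
        return p;
}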
>
> In the experimental case, the setup is identical to the baseline case,
> except that 25% of the guest memory is split from THP into 4K pages by
> the memory failure recovery triggered by MADV_HWPOISON. I made some
> minor kernel changes so that the MADV_HWPOISON-ed pages are unpoisoned
> afterwards, so the in-guest MemCycler is still able to read and write
> its data. The final aggregate rate is 16,355.11, a 5.06% drop compared
> to the baseline case. When 5% of the guest memory is split after
> MADV_HWPOISON, the final aggregate rate is 16,999.14, a 1.20% drop
> compared to the baseline case.
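
For context, this kind of injection can be driven from userspace with
madvise(MADV_HWPOISON), which needs CAP_SYS_ADMIN and
CONFIG_MEMORY_FAILURE. A rough sketch, not the actual test harness; the
unpoisoning step, which required the kernel changes mentioned above, is
not shown:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch: poison one 4K page in each 2M-aligned chunk of the range, so
 * memory failure recovery splits every affected THP into 4K pages.
 * This hits every chunk; the experiments above hit only 25% or 5%. */
static void poison_chunks(char *base, size_t len)
{
        size_t psz = (size_t)sysconf(_SC_PAGESIZE);     /* 4K here */

        for (size_t off = 0; off + psz <= len; off += 2UL << 20)
                madvise(base + off, psz, MADV_HWPOISON);
}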
>
<snip>
>
> Extensibility: THP SHMEM/TMPFS
> ==============================
>
> The current MFR behavior for THP SHMEM/TMPFS is to split the hugepage
> into raw pages and offline only the raw HWPoison-ed page. In most cases
> the THP is 2M and the raw page size is 4K, so userspace loses the “huge”
> property of the 2M region, but the actual data loss is only 4K.

I wonder if a buddy-allocator-like split [1] could help here by splitting
the THP into 1MB, 512KB, 256KB, ..., and two 4KB pieces, so you still
have some mTHPs at the end.

[1] https://lore.kernel.org/linux-mm/20250116211042.741543-1-ziy@nvidia.com/
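
Concretely, assuming a 2M THP with a single poisoned 4K page, such a
split would keep one piece each of 1M, 512K, 256K, 128K, 64K, 32K, 16K,
and 8K plus two 4K pages (the poisoned page and its buddy): only the
poisoned 4K is offlined, one 4K page remains as a raw page, and the
remaining 2040K stays in larger-than-4K pieces rather than all 512
subpages dropping to raw 4K pages.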

Best Regards,
Yan, Zi
