Message-ID: <aUBrwAWqGEgV9GxK@skinsburskii.localdomain>
Date: Mon, 15 Dec 2025 12:12:48 -0800
From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>
To: Michael Kelley <mhklinux@...look.com>
Cc: "kys@...rosoft.com" <kys@...rosoft.com>,
	"haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
	"wei.liu@...nel.org" <wei.liu@...nel.org>,
	"decui@...rosoft.com" <decui@...rosoft.com>,
	"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v7 4/7] Drivers: hv: Fix huge page handling in memory
 region traversal

On Thu, Dec 11, 2025 at 05:37:26PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Thursday, December 4, 2025 1:09 PM
> > 
> > On Thu, Dec 04, 2025 at 04:03:26PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com> Sent: Tuesday, November 25, 2025 6:09 PM
> > > >
> 
> [snip]
> 

<snip>

> > > > +
> > > > +	stride = 1 << page_order;
> > > > +
> > > > +	/* Start at stride since the first page is validated */
> > > > +	for (count = stride; count < page_count; count += stride) {
> > >
> > > This striding doesn't work properly in the general case. Suppose the
> > > page_offset value puts the start of the chunk in the middle of a 2 Meg
> > > page, and that 2 Meg page is then followed by a bunch of single pages.
> > > (Presumably the mmu notifier "invalidate" callback could do this.)
> > > The use of the full stride here jumps over the remaining portion of the
> > > 2 Meg page plus some number of the single pages, which isn't what you
> > > want. For the striding to work, it must figure out how much remains in the
> > > initial large page, and then once the striding is aligned to the large page
> > > boundaries, the full stride length works.
> > >
> > > Also, what do the hypercalls in the handler functions do if a chunk starts
> > > in the middle of a 2 Meg page? It looks like the handler functions will set
> > > the *_LARGE_PAGE flag to the hypercall but then the hv_call_* function
> > > will fail if the page_count isn't 2 Meg aligned.
> > >
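
One way to picture the alignment-aware striding Michael describes is a
small user-space sketch (the constants are illustrative and this is not
the patch's code): consume the tail of an initial, partially covered
huge page first, and only then stride at full huge-page granularity.

/* Sketch: alignment-aware striding over a chunk that may start in the
 * middle of a huge page. page_order = 9 models a 2M page of 4K pages. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const unsigned int page_order = 9;	/* 2M huge page = 512 x 4K */
	const uint64_t stride = 1ULL << page_order;
	uint64_t page_offset = 300;		/* starts mid-huge-page */
	uint64_t page_count = 2000;
	uint64_t count = 0;

	while (count < page_count) {
		/* Pages left in the current (possibly partial) stride */
		uint64_t in_stride = stride -
				     ((page_offset + count) & (stride - 1));
		uint64_t chunk = in_stride < page_count - count ?
				 in_stride : page_count - count;

		printf("process pages [%llu, %llu)\n",
		       (unsigned long long)count,
		       (unsigned long long)(count + chunk));
		count += chunk;
	}
	return 0;
}

The first iteration emits a short chunk (212 pages here) that ends on a
huge-page boundary; every later chunk is then a full, aligned stride.
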
> > 
> > The situation you described is not possible, because the invalidation
> > callback simply can't invalidate part of a huge page, even in the THP
> > case (leaving aside the hugetlb case), without splitting it beforehand,
> > and splitting a huge page requires invalidating the whole huge page
> > first.
> 
> I've been playing around with mmu notifiers and 2 Meg pages. At least in my
> experiment, there's a case where the .invalidate callback is invoked on a
> range *before* the 2 Meg page is split. The kernel code that does this is
> in zap_page_range_single_batched(). Early on this function calls
> mmu_notifier_invalidate_range_start(), which invokes the .invalidate
> callback on the initial range. Later on, unmap_single_vma() is called, which
> does the split and eventually makes a second .invalidate callback for the
> entire 2 Meg page.
> 
> Details:  My experiment is a user space program that does the following:
> 
> 1. Allocates 16 Megs of memory on a 16 Meg boundary using
> posix_memalign(). So this is private anonymous memory. Transparent
> huge pages are enabled.
> 
> 2. Writes to a byte in each 4K page so they are all populated. 
> /proc/meminfo shows eight 2 Meg pages have been allocated.
> 
> 3. Creates an mmu notifier for the allocated 16 Megs, using an ioctl
> hacked into the kernel for experimentation purposes.
> 
> 4. Uses madvise() with the DONTNEED option to free 32 Kbytes on a 4K
> page boundary somewhere in the 16 Meg allocation. This results in an mmu
> notifier invalidate callback for that 32 Kbytes. Then there's a second invalidate
> callback covering the entire 2 Meg page that contains the 32 Kbyte range.
> Kernel stack traces for the two invalidate callbacks show them originating
> in zap_page_range_single_batched().
> 
> 5. Sleeps for 60 seconds. During that time, khugepaged wakes up and does
> hpage_collapse_scan_pmd() -> collapse_huge_page(), which generates a third
> .invalidate callback for the 2 Meg page. I haven't investigated what this is
> all about.
> 
> 6. Interestingly, if Step 4 above does a slightly different operation using
> mprotect() with PROT_READ instead of madvise(), the 2 Meg page is split first.
> The .invalidate callback for the full 2 Meg happens before the .invalidate
> callback for the specified range.
> 
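
For reference, the core of that experiment (steps 1, 2 and 4 above) can
be reproduced with a few lines of user-space C; the notifier-registration
ioctl from step 3 is a local kernel hack and is omitted here, and the
explicit MADV_HUGEPAGE is an addition standing in for "transparent huge
pages are enabled":

/* Sketch of the experiment: allocate 16M aligned anonymous memory,
 * touch every 4K page so THP populates it, then MADV_DONTNEED a 32K
 * sub-range that lands in the middle of a 2M page. */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t sz = 16 * 1024 * 1024;
	char *buf;
	size_t i;

	if (posix_memalign((void **)&buf, sz, sz))	/* 16M on 16M boundary */
		return 1;
	madvise(buf, sz, MADV_HUGEPAGE);

	for (i = 0; i < sz; i += 4096)			/* populate every 4K page */
		buf[i] = 1;

	/* Free 32K inside a 2M page -> partial-range invalidate callback */
	if (madvise(buf + 5 * 1024 * 1024 + 64 * 1024, 32 * 1024,
		    MADV_DONTNEED))
		return 1;

	sleep(60);					/* let khugepaged run (step 5) */
	return 0;
}
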
> The root partition probably isn't doing madvise() with DONTNEED for memory
> allocated for guests. But regardless of what user space does or doesn't do, MSHV's
> invalidate callback path should be made safe for this case. Maybe that's just
> detecting it and returning an error (and maybe a WARN_ON) if user space
> doesn't need it to work.
> 
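
One minimal shape for the detection Michael suggests, assuming the
region code knows the backing page order of each chunk (the function
name and plumbing here are hypothetical, not from the series):

/*
 * Hypothetical sketch: before striding, check whether the invalidated
 * chunk begins or ends inside a huge page. Both the striding loop and
 * the *_LARGE_PAGE hypercall flag assume huge-page-aligned chunks, so
 * fail loudly rather than mishandle a partial range.
 */
static int mshv_check_chunk_alignment(u64 page_offset, u64 page_count,
				      unsigned int page_order)
{
	u64 stride = 1ULL << page_order;

	if ((page_offset | page_count) & (stride - 1)) {
		WARN_ON_ONCE(1);
		return -EINVAL;
	}
	return 0;
}

Whether an error return is acceptable on the notifier path (as opposed
to only for non-blockable ranges) would still need to be sorted out.
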

This is deep research, Michael. Thanks a lot for your effort.
I'll think more about it and will likely follow up.

Thank you,
Stanislav

> Michael
> 
