linux-kernel - Re: [PATCH] mm: fix copy_vma() error handling for hugetlb mappings

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <afba02be-21d0-49f2-9ca1-36ee6f7fe27f@lucifer.local>
Date: Fri, 23 May 2025 11:00:40 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Ricardo Cañuelo Navarro <rcn@...lia.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
        "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        Pedro Falcato <pfalcato@...e.de>, revest@...gle.com,
        kernel-dev@...lia.com, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, stable@...r.kernel.org,
        Oscar Salvador <osalvador@...e.de>
Subject: Re: [PATCH] mm: fix copy_vma() error handling for hugetlb mappings

+cc Oscar - please loop him in on this as he's looking at refactoring and
improving hugetlb as a whole! Thanks :)

Ricardo - thanks very much for this, TL;DR is - this is great work and you found
something painful :)

To be clear and to quell too much concern, this case is one that won't happen
realistically in reality - I'm see (afaict) your syzbot did fault injection to
get this. This is because the only way this could happen is for
vma_iter_prealloc() to fail, and it's really subject to being a 'too small to
fail' allocation, in other words direct reclaim will simply keep going until the
allocation succeeds in this case even under extreme memory pressure.

So this patch is perhaps less urgent than it might seem (though we should
address this, of course).

On Fri, May 23, 2025 at 09:56:18AM +0200, Ricardo Cañuelo Navarro wrote:
> If, during a mremap() operation for a hugetlb-backed memory mapping,
> copy_vma() fails after the source vma has been duplicated and
> opened (ie. vma_link() fails), the error is handled by closing the new

OK so really it is _only_ when vma_link() fails?

This error paths really are a problem. As is vma_close(). I'm pretty sure we've
seen similar issues actually with a miscounted reservation count caused by
vm_ops->close(), hugetlb seems really sensitive to this.

> vma. This updates the hugetlbfs reservation counter of the reservation
> map which at this point is referenced by both the source vma and the new
> copy. As a result, once the new vma has been freed and copy_vma()
> returns, the reservation counter for the source vma will be incorrect.

OK so looking into the code I think I got it:

copy_vma_and_data() already does this:

	if (is_vm_hugetlb_page(vma))
		clear_vma_resv_huge_pages(vma);

But only if copy_vma() succeeds.

Even if the page table move fails, we reverse the operation and _still_ do this.

We do this _before_ unmapping the VMA later in the operation.

But now vma_close() has screwed things up in this one singular case. We need to
fix this _prior to the vma_close()_ so it doesn't screw up this 'two VMAs
referencing the same reservation' special case.

OK so your patch makes sense.

>
> This patch addresses this corner case by clearing the hugetlb private
> page reservation reference for the new vma and decrementing the
> reference before closing the vma, so that vma_close() won't update the
> reservation counter.
>
> The issue was reported by a private syzbot instance, see the error
> report log [1] and reproducer [2]. Possible duplicate of public syzbot
> report [3].

Ordinarily 'private syzbot instance' makes me nervous, but you've made your case
here logically.

>
> Signed-off-by: Ricardo Cañuelo Navarro <rcn@...lia.com>
> Cc: stable@...r.kernel.org # 6.12+

Hm, do we have a Fixes?

This might be a massive pain to backport though due to a certain somebody
initials 'LS' who did a big refactoring... :)

Why 6.12+? It seems this bug has been around for... a while.

THEN AGAIN and importantly - I'm fine with this, as this is a bug that I
really don't believe can be hit in reality.

It is however vital we fix it so the error paths work correctly, especially
if at a later date we change how allocation works and fail more etc.

It matters more for the future than the past, basically.

> Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter_cancel.txt [1]
> Link: https://people.igalia.com/rcn/kernel_logs/20250422__WARNING_in_page_counter_cancel__repro.c [2]
> Link: https://lore.kernel.org/all/67000a50.050a0220.49194.048d.GAE@google.com/ [3]

Thanks for links, though it's better to please provide this information here
even if in succinct form. This is because commit messages are a permanent
record, and these links (other than lore) are ephemeral.

Here it's a bit of a pain, however, as presumably the repro is fairly big.

So, can we please copy/paste the splat from [1] and drop this link, maybe just
keep link [2] as it's not so important (I'm guessing this takes a while to repro
so the failure injection hits the right point?) and of course keep [3].

> ---
>  mm/vma.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index 839d12f02c885d3338d8d233583eb302d82bb80b..9d9f699ace977c9c869e5da5f88f12be183adcfb 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -1834,6 +1834,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  	return new_vma;
>
>  out_vma_link:
> +	if (is_vm_hugetlb_page(new_vma))
> +		clear_vma_resv_huge_pages(new_vma);

So,

Could you implement this slightly differently please? We're duplicating
this code now, so I think this should be in its own function with a copious
comment.

Something like:

static void fixup_hugetlb_reservations(struct vm_area_struct *vma)
{
	if (is_vm_hugetlb_page(new_vma))
		clear_vma_resv_huge_pages(new_vma);
}

And call this from here and also in copy_vma_and_data().

Could you also please update the comment in clear_vma_resv_huge_pages():

/*
 * Reset and decrement one ref on hugepage private reservation.
 * Called with mm->mmap_lock writer semaphore held.
 * This function should be only used by move_vma() and operate on
 * same sized vma. It should never come here with last ref on the
 * reservation.
 */

Drop the mention of the specific function (which is now wrong, but
mentioning _any_ function is asking for bit rot anyway) and replace with
something like 'This function should only be used by mremap and...'


>  	vma_close(new_vma);
>
>  	if (new_vma->vm_file)
>
> ---
> base-commit: 94305e83eccb3120c921cd3a015cd74731140bac
> change-id: 20250523-warning_in_page_counter_cancel-e8c71a6b4c88
>

Thanks!