linux-kernel - Re: [PATCH] mm/vma: Fix hugetlb accounting error in copy

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAHS8izPRyTFBBMkRDtpWK_nuygqfXkwd4-LUyUECZbeNGenfkg@mail.gmail.com>
Date: Mon, 27 Jan 2025 11:37:06 -0800
From: Mina Almasry <almasrymina@...gle.com>
To: Muchun Song <muchun.song@...ux.dev>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Daniil Dulov <d.dulov@...ddin.ru>, 
	lvc-project@...uxtesting.org, Jann Horn <jannh@...gle.com>, 
	linux-kernel@...r.kernel.org, Mike Kravetz <mike.kravetz@...cle.com>, 
	"Matthew Wilcox (Oracle)" <willy@...radead.org>, linux-mm@...ck.org, 
	"Liam R. Howlett" <Liam.Howlett@...cle.com>, Andrew Morton <akpm@...ux-foundation.org>, 
	Vlastimil Babka <vbabka@...e.cz>, stable@...r.kernel.com
Subject: Re: [PATCH] mm/vma: Fix hugetlb accounting error in copy_vma()

On Mon, Jan 27, 2025 at 8:45 AM Fedor Pchelkin <pchelkin@...ras.ru> wrote:
>
> +Cc Mina Almasry and Mike Kravetz
>

+Cc current Hugetlb maintainer as well.

> On Mon, 27. Jan 14:46, Lorenzo Stoakes wrote:
> > Thanks for the report.
> >
> > On Mon, Jan 27, 2025 at 05:32:01PM +0300, Daniil Dulov wrote:
> > > In copy_vma() allocation of maple tree nodes may fail. Since page accounting
> > > takes place at the close() operation for hugetlb, it is called at the error
> > > path against the new_vma to account pages of the vma that was not successfully
> > > copied and that shares the page_counter with the original vma. Then, when the
> > > process is being terminated, vm_ops->close() is called once again against the
> > > original vma, which results in a page_counter underflow.
> >
> > This seems like a bug in hugetlb.
> >
> > I really hate the solution here, it's hacky and assumes only these fields are
> > meaningful for 'close twice' scenarios.
> >
> > We now use vma_close(), which assigns vma->vm_ops to vma_dummy_vm_ops, meaning
> > no further close() invocations can occur.
>
> Does the "close twice" scenario exactly mean ->close() is called twice
> for the same object of struct vm_area_struct?
>
> For the observed case I think that's not true. ->close() is called for
> two different objects of type vm_area_struct - the first time for the
> new_vma on error path of copy_vma(), the second time for the original
> vma. It turns out then these objects share the same reservation map
> holding page_counters at this point of time.
>
> >
> > If hugetlb is _still_ choosing to internally invoke this, it seems like it
> > should have some if (vma->vm_ops == hugetlb_vm_ops) { ... } check first? That
> > way it'll account for the closing twice issue.
> >
> > Can you easily repro in order to check a solution like that fixes your problem?
> > I don't see why it shouldn't
> >
>
> Seems that wouldn't fix the problem (again, two different vma objects).
> There's presumably no obvious place in hugetlb internals where this may
> be fixed elegantly, at the quick glance. But yep, it does look like a bug
> for hugetlb to care about.. Perhaps somehow defer the reservation map
> copying?
>
> > >
> > > page_counter underflow: -1024 nr_pages=1024
> > > WARNING: CPU: 1 PID: 1086 at mm/page_counter.c:55 page_counter_cancel+0xd6/0x130 mm/page_counter.c:55
> > > Modules linked in:
> > > CPU: 1 PID: 1086 Comm: syz-executor200 Not tainted 6.1.108-syzkaller-00078-g9ce77c16947b #0
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > > Call Trace:
> > >  <TASK>
> > >  page_counter_uncharge+0x2e/0x70 mm/page_counter.c:158
> > >  hugetlb_cgroup_uncharge_counter+0xd2/0x420 mm/hugetlb_cgroup.c:430
> > >  hugetlb_vm_op_close+0x435/0x700 mm/hugetlb.c:4886
> > >  remove_vma+0x84/0x130 mm/mmap.c:140
> > >  exit_mmap+0x32f/0x7a0 mm/mmap.c:3249
> > >  __mmput+0x11e/0x430 kernel/fork.c:1199
> > >  mmput+0x61/0x70 kernel/fork.c:1221
> > >  exit_mm kernel/exit.c:565 [inline]
> > >  do_exit+0xa4a/0x2790 kernel/exit.c:858
> > >  do_group_exit+0xd0/0x2a0 kernel/exit.c:1021
> > >  __do_sys_exit_group kernel/exit.c:1032 [inline]
> > >  __se_sys_exit_group kernel/exit.c:1030 [inline]
> > >  __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:1030
> > >  do_syscall_x64 arch/x86/entry/common.c:51 [inline]
> > >  do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
> > >  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > >  </TASK>
> >
> > >
> > > Since there is no sense in vm accounting for a bad copy of vma, set vm_start
> > > to be equal vm_end and vm_pgoff to be equal 0. Previously, a similar issue
> > > has been fixed in __split_vma() in the same way [1].
> > >
> > > [1]: https://lore.kernel.org/all/20220719201523.3561958-1-Liam.Howlett@oracle.com/T/
> >
> > Understood that we do this elsewhere, I think equally we should not do this
> > there either! :)
> >
> > >
> > > Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
> > >
> > > Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
> > > Cc: stable@...r.kernel.com
> > > Signed-off-by: Daniil Dulov <d.dulov@...ddin.ru>
> > > ---
> > >  mm/vma.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/mm/vma.c b/mm/vma.c
> > > index bb2119e5a0d0..dbc68b7cd0ec 100644
> > > --- a/mm/vma.c
> > > +++ b/mm/vma.c
> > > @@ -1772,6 +1772,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> > >     return new_vma;
> > >
> > >  out_vma_link:
> > > +   /* Avoid vm accounting in close() operation */
> > > +   new_vma->vm_start = new_vma->vm_end;
> > > +   new_vma->vm_pgoff = 0;
> > >     vma_close(new_vma);
> > >
> > >     if (new_vma->vm_file)
> > > --
> > > 2.34.1
> > >

-- 
Thanks,
Mina