Message-ID: <2e65cc96-5fb8-4197-b4c2-188c4378c417@lucifer.local>
Date: Mon, 20 Oct 2025 11:58:25 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Sumanth Korikkar <sumanthk@...ux.ibm.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
        Jonathan Corbet <corbet@....net>, Matthew Wilcox <willy@...radead.org>,
        Guo Ren <guoren@...nel.org>,
        Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
        Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>,
        Alexander Gordeev <agordeev@...ux.ibm.com>,
        Christian Borntraeger <borntraeger@...ux.ibm.com>,
        Sven Schnelle <svens@...ux.ibm.com>,
        "David S . Miller" <davem@...emloft.net>,
        Andreas Larsson <andreas@...sler.com>, Arnd Bergmann <arnd@...db.de>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Dan Williams <dan.j.williams@...el.com>,
        Vishal Verma <vishal.l.verma@...el.com>,
        Dave Jiang <dave.jiang@...el.com>, Nicolas Pitre <nico@...xnic.net>,
        Muchun Song <muchun.song@...ux.dev>,
        Oscar Salvador <osalvador@...e.de>,
        David Hildenbrand <david@...hat.com>,
        Konstantin Komarov <almaz.alexandrovich@...agon-software.com>,
        Baoquan He <bhe@...hat.com>, Vivek Goyal <vgoyal@...hat.com>,
        Dave Young <dyoung@...hat.com>, Tony Luck <tony.luck@...el.com>,
        Reinette Chatre <reinette.chatre@...el.com>,
        Dave Martin <Dave.Martin@....com>, James Morse <james.morse@....com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>,
        Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
        Hugh Dickins <hughd@...gle.com>,
        Baolin Wang <baolin.wang@...ux.alibaba.com>,
        Uladzislau Rezki <urezki@...il.com>,
        Dmitry Vyukov <dvyukov@...gle.com>,
        Andrey Konovalov <andreyknvl@...il.com>, Jann Horn <jannh@...gle.com>,
        Pedro Falcato <pfalcato@...e.de>, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-csky@...r.kernel.org, linux-mips@...r.kernel.org,
        linux-s390@...r.kernel.org, sparclinux@...r.kernel.org,
        nvdimm@...ts.linux.dev, linux-cxl@...r.kernel.org, linux-mm@...ck.org,
        ntfs3@...ts.linux.dev, kexec@...ts.infradead.org,
        kasan-dev@...glegroups.com, Jason Gunthorpe <jgg@...dia.com>,
        iommu@...ts.linux.dev, Kevin Tian <kevin.tian@...el.com>,
        Will Deacon <will@...nel.org>, Robin Murphy <robin.murphy@....com>
Subject: Re: [PATCH v4 11/14] mm/hugetlbfs: update hugetlbfs to use
 mmap_prepare

On Tue, Sep 23, 2025 at 01:52:09PM +0200, Sumanth Korikkar wrote:
> Hi Lorenzo,
>
> The following tests causes the kernel to enter a blocked state,
> suggesting an issue related to locking order. I was able to reproduce
> this behavior in certain test runs.
>
> Test case:
> git clone https://github.com/libhugetlbfs/libhugetlbfs.git
> cd libhugetlbfs ; ./configure
> make -j32
> cd tests
> echo 100 > /proc/sys/vm/nr_hugepages
> mkdir -p /test-hugepages && mount -t hugetlbfs nodev /test-hugepages
> ./run_tests.py <in a loop>
> ...
> shm-fork 10 100 (1024K: 64):    PASS
> set shmmax limit to 104857600
> shm-getraw 100 /dev/full (1024K: 32):
> shm-getraw 100 /dev/full (1024K: 64):   PASS
> fallocate_stress.sh (1024K: 64):  <blocked>
>
> Blocked task state below:
>
> task:fallocate_stres state:D stack:0     pid:5106  tgid:5106  ppid:5103
> task_flags:0x400000 flags:0x00000001
> Call Trace:
>  [<00000255adc646f0>] __schedule+0x370/0x7f0
>  [<00000255adc64bb0>] schedule+0x40/0xc0
>  [<00000255adc64d32>] schedule_preempt_disabled+0x22/0x30
>  [<00000255adc68492>] rwsem_down_write_slowpath+0x232/0x610
>  [<00000255adc68922>] down_write_killable+0x52/0x80
>  [<00000255ad12c980>] vm_mmap_pgoff+0xc0/0x1f0
>  [<00000255ad164bbe>] ksys_mmap_pgoff+0x17e/0x220
>  [<00000255ad164d3c>] __s390x_sys_old_mmap+0x7c/0xa0
>  [<00000255adc60e4e>] __do_syscall+0x12e/0x350
>  [<00000255adc6cfee>] system_call+0x6e/0x90
> task:fallocate_stres state:D stack:0     pid:5109  tgid:5106  ppid:5103
> task_flags:0x400040 flags:0x00000001
> Call Trace:
>  [<00000255adc646f0>] __schedule+0x370/0x7f0
>  [<00000255adc64bb0>] schedule+0x40/0xc0
>  [<00000255adc64d32>] schedule_preempt_disabled+0x22/0x30
>  [<00000255adc68492>] rwsem_down_write_slowpath+0x232/0x610
>  [<00000255adc688be>] down_write+0x4e/0x60
>  [<00000255ad1c11ec>] __hugetlb_zap_begin+0x3c/0x70
>  [<00000255ad158b9c>] unmap_vmas+0x10c/0x1a0
>  [<00000255ad180844>] vms_complete_munmap_vmas+0x134/0x2e0
>  [<00000255ad1811be>] do_vmi_align_munmap+0x13e/0x170
>  [<00000255ad1812ae>] do_vmi_munmap+0xbe/0x140
>  [<00000255ad183f86>] __vm_munmap+0xe6/0x190
>  [<00000255ad166832>] __s390x_sys_munmap+0x32/0x40
>  [<00000255adc60e4e>] __do_syscall+0x12e/0x350
>  [<00000255adc6cfee>] system_call+0x6e/0x90
>
>
> Thanks,
> Sumanth

(Been on holiday for a couple of weeks, and last week was a catch-up! :)

So having looked into this, the issue is that hugetlbfs exposes a per-VMA
hugetlbfs lock which can be taken via the rmap.

So, while faults are disallowed until the VMA is fully set up, rmap access is
not, and therefore there's a race between setting up the hugetlbfs lock and the
rmap trying to take/release it.

It's a real edge case as it's kind of unusual to have this requirement during
initial custom mmap, but to account for this and for any other users which might
require it, I have resolved this by introducing the ability to hold on to the
rmap lock until the VMA is fully set up.

The window is very small, but obviously it's one we have to account for :)

This is the most correct solution I think, as it prevents any confusion as to
the state of the lock: rmap users simply cannot access the VMA until it is
established.
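
As a rough userspace analogy of that idea (not the kernel change itself), the
sketch below uses pthreads. Everything in it is hypothetical and chosen only for
illustration: `struct region` stands in for the VMA, `inner_lock` for the
per-VMA hugetlbfs lock, and `struct index` with its `lock` for the rmap and the
rmap lock. The setup path keeps holding the outer lock until the inner lock is
initialised, so a concurrent walker can never observe a half-built region.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA that carries its own lock. */
struct region {
    pthread_mutex_t inner_lock;  /* analogue of the per-VMA hugetlbfs lock */
    bool inner_lock_ready;       /* true only while inner_lock is initialised */
};

/* Hypothetical stand-in for the rmap: a lock-protected lookup slot. */
struct index {
    pthread_mutex_t lock;        /* analogue of the rmap lock */
    struct region *slot;         /* published region, or NULL */
};

struct walker_args {
    struct index *idx;
    int rounds;
    int bad;                     /* times a half-built region was observed */
};

/* Publish the region, holding the outer lock until setup is complete. */
static void publish_region(struct index *idx, struct region *r)
{
    pthread_mutex_lock(&idx->lock);
    assert(idx->slot == NULL);
    idx->slot = r;               /* reachable via the index from here on... */
    pthread_mutex_init(&r->inner_lock, NULL);
    r->inner_lock_ready = true;  /* ...but setup finishes before we unlock */
    pthread_mutex_unlock(&idx->lock);
}

/* Tear down under the same outer lock so walkers never see a dead lock. */
static void unpublish_region(struct index *idx, struct region *r)
{
    pthread_mutex_lock(&idx->lock);
    idx->slot = NULL;
    r->inner_lock_ready = false;
    pthread_mutex_destroy(&r->inner_lock);
    pthread_mutex_unlock(&idx->lock);
}

/* Walker: looks the region up via the index, then takes its inner lock. */
static void *walker(void *p)
{
    struct walker_args *a = p;

    for (int i = 0; i < a->rounds; i++) {
        pthread_mutex_lock(&a->idx->lock);
        struct region *r = a->idx->slot;
        if (r != NULL) {
            if (!r->inner_lock_ready) {
                a->bad++;        /* would be the race described above */
            } else {
                pthread_mutex_lock(&r->inner_lock);
                pthread_mutex_unlock(&r->inner_lock);
            }
        }
        pthread_mutex_unlock(&a->idx->lock);
    }
    return NULL;
}

/* Returns how often the walker saw a half-built region, or -1 on error. */
int run_demo(int rounds)
{
    struct index idx = { .slot = NULL };
    struct region r = { .inner_lock_ready = false };
    struct walker_args args = { &idx, rounds, 0 };
    pthread_t t;

    pthread_mutex_init(&idx.lock, NULL);
    if (pthread_create(&t, NULL, walker, &args) != 0)
        return -1;
    for (int i = 0; i < rounds; i++) {
        publish_region(&idx, &r);
        unpublish_region(&idx, &r);
    }
    pthread_join(t, NULL);
    pthread_mutex_destroy(&idx.lock);
    return args.bad;
}
```

Calling `run_demo(1000)` from a small `main()` should return 0. If
`publish_region()` instead dropped the outer lock right after storing the
pointer and only then initialised the inner lock, the walker could find the
region before its lock exists, which is the shape of the window described above.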

I am putting the finishing touches to a respin with this fix included, and will
cc you on it.

Cheers, Lorenzo
