linux-kernel - Re: [tip: locking/urgent] futex: Allow to resize the private local hash

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aFZeHv72yPGovnRv@mozart.vkv.me>
Date: Sat, 21 Jun 2025 00:24:14 -0700
From: Calvin Owens <calvin@...nvd.org>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: linux-kernel@...r.kernel.org, "Lai, Yi" <yi1.lai@...ux.intel.com>,
	"Peter Zijlstra (Intel)" <peterz@...radead.org>, x86@...nel.org
Subject: Re: [tip: locking/urgent] futex: Allow to resize the private local
 hash

On Friday 06/20 at 18:02 -0700, Calvin Owens wrote:
> On Friday 06/20 at 11:56 -0700, Calvin Owens wrote:
> > On Friday 06/20 at 12:31 +0200, Sebastian Andrzej Siewior wrote:
> > > On 2025-06-19 14:07:30 [-0700], Calvin Owens wrote:
> > > > > Machine #2 oopsed with the GCC kernel after just over an hour:
> > > > > 
> > > > >     BUG: unable to handle page fault for address: ffff88a91eac4458
> > > > >     RIP: 0010:futex_hash+0x16/0x90
> > > …
> > > > >     Call Trace:
> > > > >      <TASK>
> > > > >      futex_wait_setup+0x51/0x1b0
> > > …
> > > 
> > > The futex_hash_bucket pointer has an invalid ->priv pointer.
> > > This could be use-after-free or double-free. I've been looking through
> > > your config and you don't have CONFIG_SLAB_FREELIST_* set. I don't
> > > remember which one but one of the two has a "primitiv" double free
> > > detection. 
> > > 
> > > …
> > > > I am not able to reproduce the oops at all with these options:
> > > > 
> > > >     * DEBUG_PAGEALLOC_ENABLE_DEFAULT
> > > >     * SLUB_DEBUG_ON
> > > 
> > > SLUB_DEBUG_ON is something that would "reliably" notice double free.
> > > If you drop SLUB_DEBUG_ON (but keep SLUB_DEBUG) then you can boot with
> > > slab_debug=f keeping only the consistency checks. The "poison" checks
> > > would be excluded for instance. That allocation is kvzalloc() but it
> > > should be small on your machine to avoid vmalloc() and use only
> > > kmalloc().
> > 
> > I'll try slab_debug=f next.
> 
> I just hit the oops with SLUB_DEBUG and slab_debug=f, but nothing new
> was logged.

I went back to the original GCC config, and set up yocto to log what it
was doing over /dev/kmsg so maybe we can isolate the trigger.

I got a novel oops this time:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0 
    Oops: Oops: 0000 [#1] SMP
    CPU: 6 UID: 0 PID: 12 Comm: kworker/u128:0 Not tainted 6.16.0-rc2-gcc-00269-g11313e2f7812 #1 PREEMPT 
    Hardware name: Gigabyte Technology Co., Ltd. A620I AX/A620I AX, BIOS F3 07/10/2023
    Workqueue: netns cleanup_net
    RIP: 0010:default_device_exit_batch+0xd0/0x2f0
    Code: 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 0f 1f 44 00 00 <49> 8b 94 24 40 01 00 00 4c 89 e5 49 8d 84 24 40 01 00 00 48 39 04
    RSP: 0018:ffffc900001c7d58 EFLAGS: 00010202
    RAX: ffff888f1bacc140 RBX: ffffc900001c7e18 RCX: 0000000000000002
    RDX: ffff888165232930 RSI: 0000000000000000 RDI: ffffffff82a00820
    RBP: ffff888f1bacc000 R08: 0000036dae5dbcdb R09: ffff8881038c5300
    R10: 000000000000036e R11: 0000000000000001 R12: fffffffffffffec0
    R13: dead000000000122 R14: dead000000000100 R15: ffffc900001c7dd0
    FS:  0000000000000000(0000) GS:ffff888cccd6b000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 0000000a414f4000 CR4: 0000000000350ef0
    Call Trace:
     <TASK>
     ops_undo_list+0xd9/0x1e0
     cleanup_net+0x1b2/0x2c0
     process_one_work+0x148/0x240
     worker_thread+0x2d7/0x410
     ? rescuer_thread+0x500/0x500
     kthread+0xd5/0x1e0
     ? kthread_queue_delayed_work+0x70/0x70
     ret_from_fork+0xa0/0xe0
     ? kthread_queue_delayed_work+0x70/0x70
     ? kthread_queue_delayed_work+0x70/0x70
     ret_from_fork_asm+0x11/0x20
     </TASK>
    CR2: 0000000000000000
    ---[ end trace 0000000000000000 ]---
    2025-06-20 23:47:28 - INFO     - ##teamcity[message text='recipe libaio-0.3.113-r0: task do_populate_sysroot: Succeeded' status='NORMAL']
    2025-06-20 23:47:28 - ERROR    - ##teamcity[message text='recipe libaio-0.3.113-r0: task do_populate_sysroot: Succeeded' status='NORMAL']
    RIP: 0010:default_device_exit_batch+0xd0/0x2f0
    Code: 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 0f 1f 44 00 00 <49> 8b 94 24 40 01 00 00 4c 89 e5 49 8d 84 24 40 01 00 00 48 39 04
    RSP: 0018:ffffc900001c7d58 EFLAGS: 00010202
    RAX: ffff888f1bacc140 RBX: ffffc900001c7e18 RCX: 0000000000000002
    RDX: ffff888165232930 RSI: 0000000000000000 RDI: ffffffff82a00820
    RBP: ffff888f1bacc000 R08: 0000036dae5dbcdb R09: ffff8881038c5300
    R10: 000000000000036e R11: 0000000000000001 R12: fffffffffffffec0
    R13: dead000000000122 R14: dead000000000100 R15: ffffc900001c7dd0
    FS:  0000000000000000(0000) GS:ffff888cccd6b000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 000000000361a000 CR4: 0000000000350ef0
    Kernel panic - not syncing: Fatal exception
    Kernel Offset: disabled
    ---[ end Kernel panic - not syncing: Fatal exception ]---

Based on subtracting the set of things that had completed do_compile from
the set of things that started, it was building:

    clang-native, duktape, linux-upstream, nodejs-native, and zstd

...when it oopsed. The whole 5MB log is in "new-different-oops.txt".

> > > > I'm also experimenting with stress-ng as a reproducer, no luck so far.
> > > 
> > > Not sure what you are using there. I think cargo does:
> > > - lock/ unlock in a threads
> > > - create new thread which triggers auto-resize
> > > - auto-resize gets delayed due to lock/ unlock in other threads (the
> > >   reference is held)
> > 
> > I've tried various combinations of --io, --fork, --exec, --futex, --cpu,
> > --vm, and --forkheavy. It's not mixing the operations in threads as I
> > understand it, so I guess it won't ever do anything like what you're
> > describing no matter what stressors I run?
> > 
> > I did get this message once, something I haven't seen before:
> > 
> >     [33024.247423] [    T281] sched: DL replenish lagged too much
> > 
> > ...but maybe that's my fault for overloading it so much.
> > 
> > > And now something happens leading to what we see.
> > > _Maybe_ the cargo application terminates/ execs before the new struct is
> > > assigned in an unexpected way.
> > > The regular hash bucket has reference counting so it should raise
> > > warnings if it goes wrong. I haven't seen those.
> > > 
> > > > A third machine with an older Skylake CPU died overnight, but nothing
> > > > was logged over netconsole. Luckily it actually has a serial header on
> > > > the motherboard, so that's wired up and it's running again, maybe it
> > > > dies in a different way that might be a better clue...
> > > 
> > > So far I *think* that cargo does something that I don't expect and this
> > > leads to a memory double-free. The SLUB_DEBUG_ON hopefully delays the
> > > process long enough that the double free does not trigger.
> > > 
> > > I think I'm going to look for a random rust packet that is using cargo
> > > for building (unless you have a recommendation) and look what it is
> > > doing. It was always cargo after all. Maybe this brings some light.
> > 
> > The list of things in my big build that use cargo is pretty short:
> > 
> >     === Dependendency Snapshot ===
> >     Dep    =mc:house:cargo-native.do_install
> >     Package=mc:house:cargo-native.do_populate_sysroot
> >     RDep   =mc:house:cargo-c-native.do_prepare_recipe_sysroot
> >             mc:house:cargo-native.do_create_spdx
> >             mc:house:cbindgen-native.do_prepare_recipe_sysroot
> >             mc:house:librsvg-native.do_prepare_recipe_sysroot
> >             mc:house:librsvg.do_prepare_recipe_sysroot
> >             mc:house:libstd-rs.do_prepare_recipe_sysroot
> >             mc:house:python3-maturin-native.do_prepare_recipe_sysroot
> >             mc:house:python3-maturin-native.do_populate_sysroot
> >             mc:house:python3-rpds-py.do_prepare_recipe_sysroot
> >             mc:house:python3-setuptools-rust-native.do_prepare_recipe_sysroot
> > 
> > I've tried building each of those targets alone (and all of them
> > together) in a loop, but that hasn't triggered anything. I guess that
> > other concurrent builds are necessary to trigger whatever this is.
> > 
> > I tried using stress-ng --vm and --cpu together to "load up" the machine
> > while running the isolated targets, but that hasn't worked either.
> > 
> > If you want to run *exactly* what I am, clone this unholy mess:
> > 
> >     https://github.com/jcalvinowens/meta-house
> > 
> > ...setup for yocto and install kas as described here:
> > 
> >     https://docs.yoctoproject.org/ref-manual/system-requirements.html#ubuntu-and-debian
> >     https://github.com/jcalvinowens/meta-house/blob/6f6a9c643169fc37ba809f7230261d0e5255b6d7/README.md#kas
> > 
> > ...and run (for the 32-thread machine):
> > 
> >     BB_NUMBER_THREADS="48" PARALLEL_MAKE="-j 36" kas build kas/walnascar.yaml -- -k
> > 
> > Fair warning, it needs a *lot* of RAM at the high concurrency, I have
> > 96GB with 128GB of swap to spill into. It needs ~500GB of disk space if
> > it runs to completion and downloads ~15GB of tarballs when it starts.
> > 
> > Annoyingly it won't work if the system compiler is gcc-15 right now (the
> > verison of glib it has won't build, haven't had a chance to fix it yet).
> > 
> > > > > > Thanks,
> > > > > > Calvin
> > > 
> > > Sebastian