Message-ID: <a72fe0ba-b022-4f6e-b401-78e93aadc5ce@redhat.com>
Date: Mon, 23 Jun 2025 17:11:47 +0200
From: David Hildenbrand <david@...hat.com>
To: Jens Axboe <axboe@...nel.dk>, Alexander Potapenko <glider@...gle.com>
Cc: syzbot <syzbot+1d335893772467199ab6@...kaller.appspotmail.com>,
akpm@...ux-foundation.org, catalin.marinas@....com, jgg@...pe.ca,
jhubbard@...dia.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
peterx@...hat.com, syzkaller-bugs@...glegroups.com,
Pavel Begunkov <asml.silence@...il.com>
Subject: Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
On 23.06.25 16:58, Jens Axboe wrote:
> On 6/23/25 6:22 AM, David Hildenbrand wrote:
>> On 23.06.25 12:10, David Hildenbrand wrote:
>>> On 23.06.25 11:53, Alexander Potapenko wrote:
>>>> On Mon, Jun 23, 2025 at 11:29 AM 'David Hildenbrand' via
>>>> syzkaller-bugs <syzkaller-bugs@...glegroups.com> wrote:
>>>>>
>>>>> On 21.06.25 23:52, syzbot wrote:
>>>>>> syzbot has found a reproducer for the following issue on:
>>>>>>
>>>>>> HEAD commit: 9aa9b43d689e Merge branch 'for-next/core' into for-kernelci
>>>>>> git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1525330c580000
>>>>>> kernel config: https://syzkaller.appspot.com/x/.config?x=27f179c74d5c35cd
>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=1d335893772467199ab6
>>>>>> compiler: Debian clang version 20.1.6 (++20250514063057+1e4d39e07757-1~exp1~20250514183223.118), Debian LLD 20.1.6
>>>>>> userspace arch: arm64
>>>>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16d73370580000
>>>>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=160ef30c580000
>>>>>
>>>>> There is not that much magic in there, I'm afraid.
>>>>>
>>>>> fork() is only used to spin up guests, but before the memory region of
>>>>> interest is actually allocated, IIUC. No threading code that races.
>>>>>
>>>>> IIUC, it triggers fairly fast on aarch64. I've left it running for a
>>>>> while on x86_64 without any luck.
>>>>>
>>>>> So maybe this is really some aarch64-special stuff (pointer tagging?).
>>>>>
>>>>> In particular, there is something very weird in the reproducer:
>>>>>
>>>>> syscall(__NR_madvise, /*addr=*/0x20a93000ul, /*len=*/0x4000ul,
>>>>> /*advice=MADV_HUGEPAGE|0x800000000*/ 0x80000000eul);
>>>>>
>>>>> "advice" is supposed to be a 32-bit int. What does the magical
>>>>> "0x800000000" do?
>>>>
>>>> I am pretty sure this is a red herring.
>>>> Syzkaller sometimes mutates integer flags, even if the result makes no
>>>> sense - because sometimes it can trigger interesting bugs.
>>>> This `advice` argument will be rejected by is_valid_madvise(),
>>>> resulting in -EINVAL.
>>>
>>> I thought the same, but likely the upper bits are discarded, and we end
>>> up with __NR_madvise succeeding.
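>>>
>>> A quick sketch of that suspicion (my own demo, not part of the
>>> reproducer): madvise() takes "int advice", so presumably only the low
>>> 32 bits of the register reach the handler -- and those spell
>>> MADV_HUGEPAGE:
>>>
>>> #include <stdio.h>
>>> #include <sys/mman.h>
>>>
>>> int main(void)
>>> {
>>> 	/* Low 32 bits of the fuzzed value 0x80000000eul. */
>>> 	int advice = (int)0x80000000eul;
>>>
>>> 	/* Prints "14 14" on Linux: the bogus bits collapse to MADV_HUGEPAGE. */
>>> 	printf("%d %d\n", advice, MADV_HUGEPAGE);
>>> 	return 0;
>>> }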
>>>
>>> The kernel config has
>>>
>>> CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y
>>>
>>> So without MADV_HUGEPAGE, we wouldn't get a THP in the first place.
>>>
>>> So likely this really behaves just like dropping the "0x800000000".
>>>
>>> Anyhow, I managed to reproduce in the VM using the provided rootfs on
>>> aarch64. It triggers immediately, so no races involved.
>>>
>>> Running the reproducer on a Fedora 42 debug-kernel in the hypervisor
>>> does not trigger.
>>
>> Simplified reproducer that does not depend on a race with the
>> child process.
>>
>> As suspected previously, we have PAE (PageAnonExclusive) cleared on the
>> head page, because it is/was COW-shared with a child process.
>>
>> We are registering more than one consecutive tail page of that
>> THP through io_uring, GUP-pinning them. These pages are not
>> COW-shared and, therefore, still have PAE set.
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <string.h>
>> #include <stdlib.h>
>> #include <sys/ioctl.h>
>> #include <sys/mman.h>
>> #include <sys/syscall.h>
>> #include <sys/types.h>
>> #include <unistd.h>
>> #include <liburing.h>
>>
>> int main(void)
>> {
>> 	struct io_uring_params params = {
>> 		.wq_fd = -1,
>> 	};
>> 	struct iovec iovec;
>> 	const size_t pagesize = getpagesize();
>> 	size_t size = 2048 * pagesize;
>> 	char *addr;
>> 	int fd;
>>
>> 	/* We need a THP-aligned area. */
>> 	addr = mmap((char *)0x20000000u, size, PROT_WRITE|PROT_READ,
>> 		    MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
>> 	if (addr == MAP_FAILED) {
>> 		perror("MAP_FIXED failed");
>> 		return 1;
>> 	}
>>
>> 	if (madvise(addr, size, MADV_HUGEPAGE)) {
>> 		perror("MADV_HUGEPAGE failed");
>> 		return 1;
>> 	}
>>
>> 	/* Populate a THP. */
>> 	memset(addr, 0, size);
>>
>> 	/* COW-share only the first page ... */
>> 	if (madvise(addr + pagesize, size - pagesize, MADV_DONTFORK)) {
>> 		perror("MADV_DONTFORK failed");
>> 		return 1;
>> 	}
>>
>> 	/* ... using fork(). This will clear PAE on the head page. */
>> 	if (fork() == 0)
>> 		exit(0);
>>
>> 	/* Setup io_uring. */
>> 	fd = syscall(__NR_io_uring_setup, 1024, &params);
>> 	if (fd < 0) {
>> 		perror("__NR_io_uring_setup failed");
>> 		return 1;
>> 	}
>>
>> 	/* Register (GUP-pin) two consecutive tail pages. */
>> 	iovec.iov_base = addr + pagesize;
>> 	iovec.iov_len = 2 * pagesize;
>> 	syscall(__NR_io_uring_register, fd, IORING_REGISTER_BUFFERS, &iovec, 1);
>> 	return 0;
>> }
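>>
>> (For reference: built with something like "gcc repro.c -luring", which
>> needs the liburing development headers.)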
>>
>> [ 108.070381][ T14] kernel BUG at mm/gup.c:71!
>> [ 108.070502][ T14] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
>> [ 108.117202][ T14] Modules linked in:
>> [ 108.119105][ T14] CPU: 1 UID: 0 PID: 14 Comm: kworker/u32:1 Not tainted 6.16.0-rc2-syzkaller-g9aa9b43d689e #0 PREEMPT
>> [ 108.123672][ T14] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250221-8.fc42 02/21/2025
>> [ 108.127458][ T14] Workqueue: iou_exit io_ring_exit_work
>> [ 108.129812][ T14] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 108.133091][ T14] pc : sanity_check_pinned_pages+0x7cc/0x7d0
>> [ 108.135566][ T14] lr : sanity_check_pinned_pages+0x7cc/0x7d0
>> [ 108.138025][ T14] sp : ffff800097ac7640
>> [ 108.139859][ T14] x29: ffff800097ac7660 x28: dfff800000000000 x27: 1fffffbff80d3000
>> [ 108.143185][ T14] x26: 01ffc0000002007c x25: 01ffc0000002007c x24: fffffdffc0698000
>> [ 108.146599][ T14] x23: fffffdffc0698000 x22: ffff800097ac76e0 x21: 01ffc0000002007c
>> [ 108.150025][ T14] x20: 0000000000000000 x19: ffff800097ac76e0 x18: 00000000ffffffff
>> [ 108.153449][ T14] x17: 703e2d6f696c6f66 x16: ffff80008ae33808 x15: ffff700011ed61d4
>> [ 108.156892][ T14] x14: 1ffff00011ed61d4 x13: 0000000000000004 x12: ffffffffffffffff
>> [ 108.160267][ T14] x11: ffff700011ed61d4 x10: 0000000000ff0100 x9 : f6672ecf4f89d700
>> [ 108.163782][ T14] x8 : f6672ecf4f89d700 x7 : 0000000000000001 x6 : 0000000000000001
>> [ 108.167180][ T14] x5 : ffff800097ac6d58 x4 : ffff80008f727060 x3 : ffff80008054c348
>> [ 108.170807][ T14] x2 : 0000000000000000 x1 : 0000000100000000 x0 : 0000000000000061
>> [ 108.174205][ T14] Call trace:
>> [ 108.175649][ T14] sanity_check_pinned_pages+0x7cc/0x7d0 (P)
>> [ 108.178138][ T14] unpin_user_page+0x80/0x10c
>> [ 108.180189][ T14] io_release_ubuf+0x84/0xf8
>> [ 108.182196][ T14] io_free_rsrc_node+0x250/0x57c
>> [ 108.184345][ T14] io_rsrc_data_free+0x148/0x298
>> [ 108.186493][ T14] io_sqe_buffers_unregister+0x84/0xa0
>> [ 108.188991][ T14] io_ring_ctx_free+0x48/0x480
>> [ 108.191057][ T14] io_ring_exit_work+0x764/0x7d8
>> [ 108.193207][ T14] process_one_work+0x7e8/0x155c
>> [ 108.195431][ T14] worker_thread+0x958/0xed8
>> [ 108.197561][ T14] kthread+0x5fc/0x75c
>> [ 108.199362][ T14] ret_from_fork+0x10/0x20
>>
>>
>> When only pinning a single tail page (iovec.iov_len = pagesize), it works as expected.
>>
>> So we pinned two tail pages but end up calling io_release_ubuf()->unpin_user_page()
>> on the head page, meaning that "imu->bvec[i].bv_page" points at the wrong folio page
>> (IOW, one we never pinned).
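>>
>> For context, the check that fires is (paraphrasing mm/gup.c from memory,
>> the exact code in the tree under test may differ): for a pinned anon
>> page of a large non-hugetlb folio, PAE must be set on the head page or
>> on the subpage being unpinned:
>>
>> 	for (; npages; npages--, pages++) {
>> 		struct page *page = *pages;
>> 		struct folio *folio = page_folio(page);
>>
>> 		if (!folio_test_anon(folio))
>> 			continue;
>> 		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
>> 			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
>> 		else
>> 			/* Either a PTE-mapped or a PMD-mapped THP. */
>> 			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) &&
>> 				       !PageAnonExclusive(page), page);
>> 	}
>>
>> Handing in the head page makes both PageAnonExclusive() checks look at
>> the same (COW-shared) page with PAE cleared, hence the BUG.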
>>
>> So it's related to the io_coalesce_buffer() machinery.
>>
>> And in fact, in there, we have this weird logic:
>>
>> /* Store head pages only*/
>> new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL);
>> ...
>>
>>
>> Essentially discarding the subpage information when coalescing tail pages.
>>
>>
>> I am afraid the whole io_check_coalesce_buffer() + io_coalesce_buffer() logic might be
>> flawed (we can -- in theory -- coalesce different folio page ranges in
>> a GUP result?).
>>
>> @Jens, not sure if this only triggers a warning when unpinning or if we actually mess up
>> imu->bvec[i].bv_page, ending up pointing at (reading/writing) pages we didn't even pin in the
>> first place.
>>
>> Can you look into that, as you are more familiar with the logic?
>
> Leaving this all quoted and adding Pavel, who wrote that code. I'm
> currently away, so can't look into this right now.
I did some more digging, but ended up being all confused about
io_check_coalesce_buffer() and io_imu_folio_data().
Assuming we pass a bunch of consecutive tail pages that all belong to
the same folio, then the loop in io_check_coalesce_buffer() will always
run into the
	if (page_folio(page_array[i]) == folio &&
	    page_array[i] == page_array[i-1] + 1) {
		count++;
		continue;
	}
case, making the function return "true" ... in io_coalesce_buffer(), we
then store the head page ... which seems very wrong.
In general, storing head pages when they are not the first page to be
coalesced seems wrong.
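
To make that concrete, here is a tiny userspace model of the suspected
flaw (hypothetical types, not the actual io_uring code):

#include <stdio.h>

/* Toy stand-ins; "index" is the subpage index within the folio. */
struct page { int folio_id; int index; };

/* Models the "store head pages only" step of io_coalesce_buffer(). */
static struct page *coalesce(struct page *folio)
{
	return &folio[0];	/* head page, index 0 */
}

int main(void)
{
	struct page folio[512];
	int i;

	for (i = 0; i < 512; i++)
		folio[i] = (struct page){ .folio_id = 1, .index = i };

	/* We "pinned" tail pages 1 and 2 ... */
	struct page *pinned[2] = { &folio[1], &folio[2] };

	/* ... but after coalescing we later "unpin" subpage 0 instead. */
	struct page *stored = coalesce(folio);

	printf("pinned subpage %d, will unpin subpage %d\n",
	       pinned[0]->index, stored->index);
	return 0;
}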
--
Cheers,
David / dhildenb