Message-ID: <aef06093-6081-460e-8452-87b522b06050@redhat.com>
Date: Mon, 23 Jun 2025 19:36:21 +0200
From: David Hildenbrand <david@...hat.com>
To: Pavel Begunkov <asml.silence@...il.com>, Jens Axboe <axboe@...nel.dk>,
Alexander Potapenko <glider@...gle.com>
Cc: syzbot <syzbot+1d335893772467199ab6@...kaller.appspotmail.com>,
akpm@...ux-foundation.org, catalin.marinas@....com, jgg@...pe.ca,
jhubbard@...dia.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
peterx@...hat.com, syzkaller-bugs@...glegroups.com
Subject: Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
On 23.06.25 18:59, David Hildenbrand wrote:
> On 23.06.25 18:48, Pavel Begunkov wrote:
>> On 6/23/25 16:11, David Hildenbrand wrote:
>>> On 23.06.25 16:58, Jens Axboe wrote:
>>>> On 6/23/25 6:22 AM, David Hildenbrand wrote:
>>>>> On 23.06.25 12:10, David Hildenbrand wrote:
>>>>>> On 23.06.25 11:53, Alexander Potapenko wrote:
>>>>>>> On Mon, Jun 23, 2025 at 11:29 AM 'David Hildenbrand' via
>>>>>>> syzkaller-bugs <syzkaller-bugs@...glegroups.com> wrote:
>>>>>>>>
>> ...>>> When only pinning a single tail page (iovec.iov_len = pagesize), it works as expected.
>>>>>
>>>>> So, we pinned two tail pages but end up calling io_release_ubuf()->unpin_user_page()
>>>>> on the head page, meaning that "imu->bvec[i].bv_page" points at the wrong folio page
>>>>> (IOW, one we never pinned).
>>>>>
>>>>> So it's related to the io_coalesce_buffer() machinery.
>>>>>
>>>>> And in fact, in there, we have this weird logic:
>>>>>
>>>>> /* Store head pages only*/
>>>>> new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL);
>>>>> ...
>>>>>
>>>>>
>>>>> Essentially discarding the subpage information when coalescing tail pages.
>>>>>
>>>>>
>>>>> I am afraid the whole io_check_coalesce_buffer + io_coalesce_buffer() logic might be
>>>>> flawed (we can -- in theory -- coalesce different folio page ranges in
>>>>> a GUP result?).
>>>>>
>>>>> @Jens, not sure if this only triggers a warning when unpinning or if we actually mess up
>>>>> imu->bvec[i].bv_page, ending up pointing at (reading/writing) pages we didn't even pin in the first
>>>>> place.
>>>>>
>>>>> Can you look into that, as you are more familiar with the logic?
>>>>
>>>> Leaving this all quoted and adding Pavel, who wrote that code. I'm
>>>> currently away, so can't look into this right now.
>>
>> Chenliang Li did, but not like it matters
>>
>>> I did some more digging, but ended up being all confused about io_check_coalesce_buffer() and io_imu_folio_data().
>>>
>>> Assuming we pass a bunch of consecutive tail pages that all belong to the same folio, then the loop in io_check_coalesce_buffer() will always
>>> run into the
>>>
>>>		if (page_folio(page_array[i]) == folio &&
>>>		    page_array[i] == page_array[i-1] + 1) {
>>>			count++;
>>>			continue;
>>>		}
>>>
>>> case, making the function return "true" ... in io_coalesce_buffer(), we then store the head page ... which seems very wrong.
>>>
>>> In general, storing head pages when they are not the first page to be coalesced seems wrong.
>>
>> Yes, it stores the head page even if the range passed to
>> pin_user_pages() doesn't cover the head page.
>>
>> It should be converted to unpin_user_folio(), which doesn't seem
>> to do sanity_check_pinned_pages(). Do you think that'll be enough
>> (conceptually)? Nobody is actually touching the head page in those
>> cases apart from the final unpin, and storing the head page is
>> more convenient than keeping folios. I'll take a look if it can
>> be fully converted to folios w/o extra overhead.
>
> Assuming we had from GUP
>
> nr_pages = 2
> pages[0] = folio_page(folio, 1)
> pages[1] = folio_page(folio, 2)
>
> After io_coalesce_buffer() we have
>
> nr_pages = 1
> pages[0] = folio_page(folio, 0)
>
>
> Using unpin_user_folio() in all places where we could see something like
> that would be the right thing to do. The sanity checks are not in
> unpin_user_folio() for exactly that reason: we don't know which folio
> pages we pinned.
>
> But now I wonder where you make sure that "Nobody is actually touching
> the head page"?
>
> How do you get back the "which folio range" information after
> io_coalesce_buffer() ?
>
>
> If you rely on alignment in the virtual address space, combined with
> imu->folio_shift, that might not work reliably ...
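
To spell out that mismatch, here is a minimal sketch (illustrative only; the
function name and surrounding context are made up, this is not the io_uring
code): GUP pinned folio_page(folio, 1) and folio_page(folio, 2), but after
coalescing only the head page is left in the bvec.

#include <linux/mm.h>

/*
 * Illustrative sketch only (assumed context, not the io_uring code): two pin
 * references were taken on the folio, on subpages 1 and 2, but only the head
 * page ended up recorded in the bvec.
 */
static void sketch_release_after_coalescing(struct folio *folio)
{
	/*
	 * Releasing via the recorded head page, e.g.
	 *	unpin_user_page(folio_page(folio, 0));
	 * runs the per-page sanity check against subpage 0, which was never
	 * pinned -> the BUG from the report.
	 *
	 * unpin_user_folio() instead just drops pin references on the folio,
	 * without asserting which subpages they were taken on:
	 */
	unpin_user_folio(folio, 2);
}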
FWIW, applying the following on top of origin/master:
diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index dbbcc5eb3dce5..e62a284dcf906 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -946,6 +946,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize)
 			log_test_result(KSFT_FAIL);
 			goto munmap;
 		}
+		mem = mremap_mem;
 		size = mremap_size;
 		break;
 	case THP_RUN_PARTIAL_SHARED:
and then running the selftest, something is not happy:
...
# [RUN] R/O-mapping a page registered as iouring fixed buffer ... with partially mremap()'ed THP (512 kB)
[34272.021973] Oops: general protection fault, maybe for address 0xffff8bab09d5b000: 0000 [#1] PREEMPT SMP NOPTI
[34272.021980] CPU: 3 UID: 0 PID: 1048307 Comm: iou-wrk-1047940 Not tainted 6.14.9-300.fc42.x86_64 #1
[34272.021983] Hardware name: LENOVO 20WNS1F81N/20WNS1F81N, BIOS N35ET53W (1.53 ) 03/22/2023
[34272.021984] RIP: 0010:memcpy+0xc/0x20
[34272.021989] Code: cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 48 89 f8 48 89 d1 <f3> a4 e9 4d f9 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90
[34272.021991] RSP: 0018:ffffcff459183c20 EFLAGS: 00010206
[34272.021993] RAX: ffff8bab09d5b000 RBX: 0000000000000fff RCX: 0000000000000fff
[34272.021994] RDX: 0000000000000fff RSI: 0021461670800001 RDI: ffff8bab09d5b000
[34272.021995] RBP: ffff8ba794866c40 R08: ffff8bab09d5b000 R09: 0000000000001000
[34272.021996] R10: ffff8ba7a316f9d0 R11: ffff8ba92f133080 R12: 0000000000000fff
[34272.021997] R13: ffff8baa85d5b6a0 R14: 0000000000000fff R15: 0000000000001000
[34272.021998] FS: 00007f16c568a740(0000) GS:ffff8baebf580000(0000) knlGS:0000000000000000
[34272.021999] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[34272.022000] CR2: 00007fffb6a10b00 CR3: 00000003df9eb006 CR4: 0000000000f72ef0
[34272.022001] PKRU: 55555554
[34272.022002] Call Trace:
[34272.022004] <TASK>
[34272.022005] copy_page_from_iter_atomic+0x36f/0x7e0
[34272.022009] ? simple_xattr_get+0x59/0xa0
[34272.022012] generic_perform_write+0x86/0x2e0
[34272.022016] shmem_file_write_iter+0x86/0x90
[34272.022019] io_write+0xe4/0x390
[34272.022023] io_issue_sqe+0x65/0x4f0
[34272.022024] ? lock_timer_base+0x7d/0xc0
[34272.022027] io_wq_submit_work+0xb8/0x320
[34272.022029] io_worker_handle_work+0xd5/0x300
[34272.022032] io_wq_worker+0xda/0x300
[34272.022034] ? finish_task_switch.isra.0+0x99/0x2c0
[34272.022037] ? __pfx_io_wq_worker+0x10/0x10
[34272.022039] ret_from_fork+0x34/0x50
[34272.022042] ? __pfx_io_wq_worker+0x10/0x10
[34272.022044] ret_from_fork_asm+0x1a/0x30
[34272.022047] </TASK>
There, we essentially mremap half of a THP so that it is no longer THP-aligned in VA space, and then
register that half as a fixed buffer.
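
For completeness, a rough userspace sketch of that scenario (loosely modeled
on the selftest; the sizes, the destination-address trick and the liburing
usage here are my assumptions, not the exact test code):

#define _GNU_SOURCE
#include <liburing.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define THP_SIZE	(2UL * 1024 * 1024)

int main(void)
{
	size_t pagesize = getpagesize();
	struct io_uring ring;
	struct iovec iov;
	char *mem, *tmp, *half;

	/* Get a THP-aligned anonymous region and (hopefully) fault in a THP. */
	mem = mmap(NULL, 2 * THP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return 1;
	mem = (char *)(((unsigned long)mem + THP_SIZE - 1) & ~(THP_SIZE - 1));
	madvise(mem, THP_SIZE, MADV_HUGEPAGE);
	memset(mem, 1, THP_SIZE);

	/* Move the second half of the THP to a page-aligned, but (in practice)
	 * not THP-aligned, virtual address. */
	tmp = mmap(NULL, THP_SIZE, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (tmp == MAP_FAILED)
		return 1;
	half = mremap(mem + THP_SIZE / 2, THP_SIZE / 2, THP_SIZE / 2,
		      MREMAP_MAYMOVE | MREMAP_FIXED, tmp + pagesize);
	if (half == MAP_FAILED)
		return 1;

	/* Register the moved half as a fixed buffer; a fixed-buffer write
	 * through such a registration is what oopsed above. */
	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;
	iov.iov_base = half;
	iov.iov_len = THP_SIZE / 2;
	if (io_uring_register_buffers(&ring, &iov, 1) < 0)
		return 1;
	return 0;
}

(Build with -luring; whether the region actually gets THP-backed depends on
the system's THP settings.)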
So ... my suspicion that this is all rather broken grows :)
--
Cheers,
David / dhildenb