[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <da071ca9-ff7f-4c3c-8be9-f01b93abb1ff@gmail.com>
Date: Mon, 23 Jun 2025 19:02:21 +0100
From: Pavel Begunkov <asml.silence@...il.com>
To: David Hildenbrand <david@...hat.com>, Jens Axboe <axboe@...nel.dk>,
Alexander Potapenko <glider@...gle.com>
Cc: syzbot <syzbot+1d335893772467199ab6@...kaller.appspotmail.com>,
akpm@...ux-foundation.org, catalin.marinas@....com, jgg@...pe.ca,
jhubbard@...dia.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
peterx@...hat.com, syzkaller-bugs@...glegroups.com
Subject: Re: [syzbot] [mm?] kernel BUG in sanity_check_pinned_pages
On 6/23/25 18:36, David Hildenbrand wrote:
> On 23.06.25 18:59, David Hildenbrand wrote:
>> On 23.06.25 18:48, Pavel Begunkov wrote:
>>> On 6/23/25 16:11, David Hildenbrand wrote:
...>>> Yes, it stores the head page even if the range passed to
>>> pin_user_pages() doesn't cover the head page.
>> > > It should be converted to unpin_user_folio(), which doesn't seem
>>> to do sanity_check_pinned_pages(). Do you think that'll be enough
>>> (conceptually)? Nobody is actually touching the head page in those
>>> cases apart from the final unpin, and storing the head page is
>>> more convenient than keeping folios. I'll take a look if it can
>>> be fully converted to folios w/o extra overhead.
>>
>> Assuming we had from GUP
>>
>> nr_pages = 2
>> pages[0] = folio_page(folio, 1)
>> pages[1] = folio_page(folio, 2)
>>
>> After io_coalesce_buffer() we have
>>
>> nr_pages = 1
>> pages[0] = folio_page(folio, 0)
>>
>>
>> Using unpin_user_folio() in all places where we could see something like
>> that would be the right thing to do. The sanity checks are not in
>> unpin_user_folio() for exactly that reason: we don't know which folio
>> pages we pinned.
Let's do that for starters
>> But now I wonder where you make sure that "Nobody is actually touching
>> the head page"?
>>
>> How do you get back the "which folio range" information after
>> io_coalesce_buffer() ?
>>
>>
>> If you rely on alignment in virtual address space for you, combined with
>> imu->folio_shift, that might not work reliably ...
>
> FWIW, applying the following on top of origin/master:
>
> diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
> index dbbcc5eb3dce5..e62a284dcf906 100644
> --- a/tools/testing/selftests/mm/cow.c
> +++ b/tools/testing/selftests/mm/cow.c
> @@ -946,6 +946,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize)
> log_test_result(KSFT_FAIL);
> goto munmap;
> }
> + mem = mremap_mem;
> size = mremap_size;
> break;
> case THP_RUN_PARTIAL_SHARED:
>
>
> and then running the selftest, something is not happy:
>
> ...
> # [RUN] R/O-mapping a page registered as iouring fixed buffer ... with partially mremap()'ed THP (512 kB)
> [34272.021973] Oops: general protection fault, maybe for address 0xffff8bab09d5b000: 0000 [#1] PREEMPT SMP NOPTI
> [34272.021980] CPU: 3 UID: 0 PID: 1048307 Comm: iou-wrk-1047940 Not tainted 6.14.9-300.fc42.x86_64 #1
> [34272.021983] Hardware name: LENOVO 20WNS1F81N/20WNS1F81N, BIOS N35ET53W (1.53 ) 03/22/2023
> [34272.021984] RIP: 0010:memcpy+0xc/0x20
> [34272.021989] Code: cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 48 89 f8 48 89 d1 <f3> a4 e9 4d f9 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90
> [34272.021991] RSP: 0018:ffffcff459183c20 EFLAGS: 00010206
> [34272.021993] RAX: ffff8bab09d5b000 RBX: 0000000000000fff RCX: 0000000000000fff
> [34272.021994] RDX: 0000000000000fff RSI: 0021461670800001 RDI: ffff8bab09d5b000
> [34272.021995] RBP: ffff8ba794866c40 R08: ffff8bab09d5b000 R09: 0000000000001000
> [34272.021996] R10: ffff8ba7a316f9d0 R11: ffff8ba92f133080 R12: 0000000000000fff
> [34272.021997] R13: ffff8baa85d5b6a0 R14: 0000000000000fff R15: 0000000000001000
> [34272.021998] FS: 00007f16c568a740(0000) GS:ffff8baebf580000(0000) knlGS:0000000000000000
> [34272.021999] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [34272.022000] CR2: 00007fffb6a10b00 CR3: 00000003df9eb006 CR4: 0000000000f72ef0
> [34272.022001] PKRU: 55555554
> [34272.022002] Call Trace:
> [34272.022004] <TASK>
> [34272.022005] copy_page_from_iter_atomic+0x36f/0x7e0
> [34272.022009] ? simple_xattr_get+0x59/0xa0
> [34272.022012] generic_perform_write+0x86/0x2e0
> [34272.022016] shmem_file_write_iter+0x86/0x90
> [34272.022019] io_write+0xe4/0x390
> [34272.022023] io_issue_sqe+0x65/0x4f0
> [34272.022024] ? lock_timer_base+0x7d/0xc0
> [34272.022027] io_wq_submit_work+0xb8/0x320
> [34272.022029] io_worker_handle_work+0xd5/0x300
> [34272.022032] io_wq_worker+0xda/0x300
> [34272.022034] ? finish_task_switch.isra.0+0x99/0x2c0
> [34272.022037] ? __pfx_io_wq_worker+0x10/0x10
> [34272.022039] ret_from_fork+0x34/0x50
> [34272.022042] ? __pfx_io_wq_worker+0x10/0x10
> [34272.022044] ret_from_fork_asm+0x1a/0x30
> [34272.022047] </TASK>
>
>
> There, we essentially mremap a THP to not be aligned in VA space, and then register half the
> THP as a fixed buffer.
>
> So ... my suspicion that this is all rather broken grows :)
It's supposed to calculate the offset from a user pointer and
then work with that, but I guess there are masking that violate
it, I'll check.
--
Pavel Begunkov
Powered by blists - more mailing lists