linux-kernel - Re: [syzbot] [btrfs?] kernel BUG in __folio_start

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ac211a22-c93a-4443-9053-1d09d79fcb1c@suse.com>
Date: Tue, 26 Nov 2024 19:13:37 +1030
From: Qu Wenruo <wqu@...e.com>
To: Aleksandr Nogikh <nogikh@...gle.com>
Cc: Matthew Wilcox <willy@...radead.org>,
 syzbot <syzbot+aac7bff85be224de5156@...kaller.appspotmail.com>,
 akpm@...ux-foundation.org, clm@...com, dsterba@...e.com,
 josef@...icpanda.com, linux-btrfs@...r.kernel.org,
 linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org, syzkaller-bugs@...glegroups.com
Subject: Re: [syzbot] [btrfs?] kernel BUG in __folio_start_writeback



在 2024/11/25 21:14, Aleksandr Nogikh 写道:
> On Mon, Nov 25, 2024 at 1:30 AM 'Qu Wenruo' via syzkaller-bugs
> <syzkaller-bugs@...glegroups.com> wrote:
>>
>>
>>
>> 在 2024/11/25 07:56, Matthew Wilcox 写道:
>>> On Sun, Nov 24, 2024 at 05:45:18AM -0800, syzbot wrote:
>>>>
>>>>    __fput+0x5ba/0xa50 fs/file_table.c:458
>>>>    task_work_run+0x24f/0x310 kernel/task_work.c:239
>>>>    resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
>>>>    exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
>>>>    exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
>>>>    __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
>>>>    syscall_exit_to_user_mode+0x13f/0x340 kernel/entry/common.c:218
>>>>    do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89
>>>>    entry_SYSCALL_64_after_hwframe+0x77/0x7f
>>>
>>> This is:
>>>
>>>           VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
>>>
>>> ie we've called __folio_start_writeback() on a folio which is already
>>> under writeback.
>>>
>>> Higher up in the trace, we have the useful information:
>>>
>>>    page: refcount:6 mapcount:0 mapping:ffff888077139710 index:0x3 pfn:0x72ae5
>>>    memcg:ffff888140adc000
>>>    aops:btrfs_aops ino:105 dentry name(?):"file2"
>>>    flags: 0xfff000000040ab(locked|waiters|uptodate|lru|private|writeback|node=0|zone=1|lastcpupid=0x7ff)
>>>    raw: 00fff000000040ab ffffea0001c8f408 ffffea0000939708 ffff888077139710
>>>    raw: 0000000000000003 0000000000000001 00000006ffffffff ffff888140adc000
>>>    page dumped because: VM_BUG_ON_FOLIO(folio_test_writeback(folio))
>>>    page_owner tracks the page as allocated
>>>
>>> The interesting part of the page_owner stacktrace is:
>>>
>>>     filemap_alloc_folio_noprof+0xdf/0x500
>>>     __filemap_get_folio+0x446/0xbd0
>>>     prepare_one_folio+0xb6/0xa20
>>>     btrfs_buffered_write+0x6bd/0x1150
>>>     btrfs_direct_write+0x52d/0xa30
>>>     btrfs_do_write_iter+0x2a0/0x760
>>>     do_iter_readv_writev+0x600/0x880
>>>     vfs_writev+0x376/0xba0
>>>
>>> (ie not very interesting)
>>>
>>>> Workqueue: btrfs-delalloc btrfs_work_helper
>>>> RIP: 0010:__folio_start_writeback+0xc06/0x1050 mm/page-writeback.c:3119
>>>> Call Trace:
>>>>    <TASK>
>>>>    process_one_folio fs/btrfs/extent_io.c:187 [inline]
>>>>    __process_folios_contig+0x31c/0x540 fs/btrfs/extent_io.c:216
>>>>    submit_one_async_extent fs/btrfs/inode.c:1229 [inline]
>>>>    submit_compressed_extents+0xdb3/0x16e0 fs/btrfs/inode.c:1632
>>>>    run_ordered_work fs/btrfs/async-thread.c:245 [inline]
>>>>    btrfs_work_helper+0x56b/0xc50 fs/btrfs/async-thread.c:324
>>>>    process_one_work kernel/workqueue.c:3229 [inline]
>>>
>>> This looks like a race?
>>>
>>> process_one_folio() calls
>>> btrfs_folio_clamp_set_writeback calls
>>> btrfs_subpage_set_writeback:
>>>
>>>           spin_lock_irqsave(&subpage->lock, flags);
>>>           bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits)
>>> ;
>>>           if (!folio_test_writeback(folio))
>>>                   folio_start_writeback(folio);
>>>           spin_unlock_irqrestore(&subpage->lock, flags);
>>>
>>> so somebody else set writeback after we tested for writeback here.
>>
>> The test VM is using X86_64, thus we won't go into the subpage routine,
>> but directly call folio_start_writeback().
>>
>>>
>>> One thing that comes to mind is that _usually_ we take folio_lock()
>>> first, then start writeback, then call folio_unlock() and btrfs isn't
>>> doing that here (afaict).  Maybe that's not the source of the bug?
>>
>> We still hold the folio locked, do submission then unlock.
>>
>> You can check extent_writepage(), where at the entrance we check if the
>> folio is still locked.
>> Then inside extent_writepage_io() we do the submission, setting the
>> folio writeback inside submit_one_sector().
>> Eventually unlock the folio at the end of extent_writepage(), that's for
>> the uncompressed writes.
>>
>> There are a lot of special handling for async submission (compression),
>> but it  still holds the folio locked, do compression and submission, and
>> unlock, just all in another thread (this case).
>>
>> So it looks like something is wrong when transferring the ownership of
>> the page cache folios to the compression path, or some not properly
>> handled error path.
>>
>> Unfortunately I'm not really able to reproduce the case using the
>> reproducer...
> 
> I've just tried to reproduce locally using the downloadable assets and
> the kernel crashed ~ after 1 minute of running the attached C repro.
> 
> [   87.616440][ T9044] ------------[ cut here ]------------
> [   87.617126][ T9044] kernel BUG at mm/page-writeback.c:3119!
> [   87.619308][ T9044] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
> [   87.620174][ T9044] CPU: 1 UID: 0 PID: 9044 Comm: kworker/u10:6 Not
> tainted 6.12.0-syzkaller-08446-g228a1157fb9f #0
> 
> Here are the instructions I followed:
> https://github.com/google/syzkaller/blob/master/docs/syzbot_assets.md#run-a-c-reproducer

Thanks for the confirmation.

I can reproduce it using the exact disk image (around 1min), but not 
inside my usual development VM (over 5min).

So it will a lot tricky to debug now...

Thanks,
Qu