lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <74921cba-6237-4303-bb4c-baa22aaf497b@kernel.dk>
Date:   Tue, 24 Oct 2023 08:30:02 -0600
From:   Jens Axboe <axboe@...nel.dk>
To:     Dave Chinner <david@...morbit.com>,
        Andres Freund <andres@...razel.de>
Cc:     Theodore Ts'o <tytso@....edu>,
        Thorsten Leemhuis <regressions@...mhuis.info>,
        Shreeya Patel <shreeya.patel@...labora.com>,
        linux-ext4@...r.kernel.org,
        Ricardo CaƱuelo <ricardo.canuelo@...labora.com>,
        gustavo.padovan@...labora.com, zsm@...gle.com, garrick@...gle.com,
        Linux regressions mailing list <regressions@...ts.linux.dev>,
        io-uring@...r.kernel.org
Subject: Re: task hung in ext4_fallocate #2

On 10/23/23 7:12 PM, Dave Chinner wrote:
> [cc Jens, io-uring]

Thanks!

> On Tue, Oct 17, 2023 at 07:50:09PM -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2023-10-17 20:43:35 -0400, Theodore Ts'o wrote:
>>> On Mon, Oct 16, 2023 at 08:37:25PM -0700, Andres Freund wrote:
>>>> I just was able to reproduce the issue, after upgrading to 6.6-rc6 - this time
>>>> it took ~55min of high load (io_uring using branch of postgres, running a
>>>> write heavy transactional workload concurrently with concurrent bulk data
>>>> load) to trigger the issue.
>>>>
>>>> For now I have left the system running, in case there's something you would
>>>> like me to check while the system is hung.
>>>>
>>>> The first hanging task that I observed:
>>>>
>>>> cat /proc/57606/stack
>>>> [<0>] inode_dio_wait+0xd5/0x100
>>>> [<0>] ext4_fallocate+0x12f/0x1040
>>>> [<0>] vfs_fallocate+0x135/0x360
>>>> [<0>] __x64_sys_fallocate+0x42/0x70
>>>> [<0>] do_syscall_64+0x38/0x80
>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>
>>> This stack trace is from some process (presumably postgres) trying to
>>> do a fallocate() system call:
>>
>> Yes, it's indeed postgres.
>>
>>
>>>> [ 3194.579297] INFO: task iou-wrk-58004:58874 blocked for more than 122 seconds.
>>>> [ 3194.579304]       Not tainted 6.6.0-rc6-andres-00001-g01edcfe38260 #77
>>>> [ 3194.579310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> [ 3194.579314] task:iou-wrk-58004   state:D stack:0     pid:58874 ppid:52606  flags:0x00004000
>>>> [ 3194.579325] Call Trace:
>>>> [ 3194.579329]  <TASK>
>>>> [ 3194.579334]  __schedule+0x388/0x13e0
>>>> [ 3194.579349]  schedule+0x5f/0xe0
>>>> [ 3194.579361]  schedule_preempt_disabled+0x15/0x20
>>>> [ 3194.579374]  rwsem_down_read_slowpath+0x26e/0x4c0
>>>> [ 3194.579385]  down_read+0x44/0xa0
>>>> [ 3194.579393]  ext4_file_write_iter+0x432/0xa80
>>>> [ 3194.579407]  io_write+0x129/0x420
>>>
>>> This could potentially be a interesting stack trace; but this is where
>>> we really need to map the stack address to line numbers.  Is that
>>> something you could do?
>>
>> Converting the above with scripts/decode_stacktrace.sh yields:
>>
>> __schedule (kernel/sched/core.c:5382 kernel/sched/core.c:6695)
>> schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 13) kernel/sched/core.c:6772 (discriminator 13))
>> schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 (discriminator 10) kernel/sched/core.c:6831 (discriminator 10))
>> rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073 (discriminator 4))
>> down_read (./arch/x86/include/asm/preempt.h:95 kernel/locking/rwsem.c:1257 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
>> ext4_file_write_iter (./include/linux/fs.h:1073 fs/ext4/file.c:57 fs/ext4/file.c:564 fs/ext4/file.c:715)
>> io_write (./include/linux/fs.h:1956 io_uring/rw.c:926)
>>
>> But I'm not sure that that stacktrace is quite right (e.g. what's
>> ./include/linux/fs.h:1073 doing in this stacktrace?). But with an optimized
>> build it's a bit hard to tell what's an optimization artifact and what's an
>> off stack trace...
>>
>>
>> I had the idea to look at the stacks (via /proc/$pid/stack) for all postgres
>> processes and the associated io-uring threads, and then to deduplicate them.
>>
>> 22x:
>> ext4_file_write_iter (./include/linux/fs.h:1073 fs/ext4/file.c:57 fs/ext4/file.c:564 fs/ext4/file.c:715)
>> io_write (./include/linux/fs.h:1956 io_uring/rw.c:926)
>> io_issue_sqe (io_uring/io_uring.c:1878)
>> io_wq_submit_work (io_uring/io_uring.c:1960)
>> io_worker_handle_work (io_uring/io-wq.c:596)
>> io_wq_worker (io_uring/io-wq.c:258 io_uring/io-wq.c:648)
>> ret_from_fork (arch/x86/kernel/process.c:147)
>> ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
> 
> io_uring does some interesting stuff with IO completion and iomap
> now - IIRC IOCB_DIO_CALLER_COMP is new 6.6-rc1 functionality. This
> flag is set by io_write() to tell the iomap IO completion handler
> that the caller will the IO completion, avoiding a context switch
> to run the completion in a kworker thread.
> 
> It's not until the caller runs iocb->dio_complete() that
> inode_dio_end() is called to drop the i_dio_count. This happens when
> io_uring gets completion notification from iomap via
> iocb->ki_complete(iocb, 0);
> 
> It then requires the io_uring layer to process the completion
> and call iocb->dio_complete() before the inode->i_dio_count is
> dropped and inode_dio_wait() will complete.
> 
> So what I suspect here is that we have a situation where the worker
> that would run the completion is blocked on the inode rwsem because
> this isn't a IOCB_NOWAIT IO and the fallocate call holds the rwsem
> exclusive and is waiting on inode_dio_wait() to return.
> 
> Cc Jens, because I'm not sure this is actually an ext4 problem - I
> can't see why XFS won't have the same issue w.r.t.
> truncate/fallocate and IOCB_DIO_CALLER_COMP delaying the
> inode_dio_end() until whenever the io_uring code can get around to
> processing the delayed completion....

I don't think this is related to the io-wq workers doing non-blocking
IO. The callback is eventually executed by the task that originally
submitted the IO, which is the owner and not the async workers. But...
If that original task is blocked in eg fallocate, then I can see how
that would potentially be an issue.

I'll take a closer look.

-- 
Jens Axboe

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ