Message-ID: <c5f32ae1c760f3296c2d2dcff0399f331e94d775.camel@oracle.com>
Date: Thu, 22 Jun 2023 23:22:12 +0000
From: Allison Henderson <allison.henderson@...cle.com>
To: "bongiojp@...il.com" <bongiojp@...il.com>,
"djwong@...nel.org" <djwong@...nel.org>,
"tytso@....edu" <tytso@....edu>
CC: "linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>
Subject: Re: [PATCH 0/1] iomap regression for aio dio 4k writes
On Wed, 2023-06-21 at 10:29 -0700, Jeremy Bongio wrote:
> Hi Darrick and Allison,
>
> There has been a standing performance regression involving AIO DIO
> 4k-aligned writes on ext4 backed by a fast local SSD since the switch
> to iomap. I think it was originally reported and investigated in this
> thread:
> https://lore.kernel.org/all/87lf7rkffv.fsf@collabora.com/
>
>
> Short version:
> Pre-iomap, for ext4 async direct writes, after the bio is written to
> disk the completion function is called directly during the endio
> stage.
>
> Post-iomap, for direct writes, after the bio is written to disk, the
> completion function is deferred to a work queue. This adds latency
> that impacts performance most noticeably on very fast SSDs.
>
> Detailed version:
> A possible explanation is below, followed by a few questions to
> figure out the right way to fix it.
>
> In 4.15, ext4 uses fs/direct-io.c. When an AIO DIO write has completed
> in the nvme driver, the interrupt handler for the write request ends
> up calling bio_endio(), which in turn calls dio_bio_end_aio(). A
> different end_io function is used for async and sync io. If there are
> no pages mapped in memory for the write operation's inode, then the
> completion function for ext4 is called directly. If there are mapped
> pages, they might be dirty and need to be updated, so the work is
> deferred to a work queue.
>
> Here is the relevant 4.15 code:
>
> fs/direct-io.c: dio_bio_end_aio()
>         if (dio->result)
>                 defer_completion = dio->defer_completion ||
>                                    (dio_op == REQ_OP_WRITE &&
>                                     dio->inode->i_mapping->nrpages);
>         if (defer_completion) {
>                 INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>                 queue_work(dio->inode->i_sb->s_dio_done_wq,
>                            &dio->complete_work);
>         } else {
>                 dio_complete(dio, 0, DIO_COMPLETE_ASYNC);
>         }
>
> After ext4 switched to using iomap, the endio function became
> iomap_dio_bio_end_io() in fs/iomap/direct-io.c. In iomap the same
> end_io function is used for both async and sync io. All write requests
> will defer io completion to a work queue even if there are no mapped
> pages for the inode.
>
> Here is the relevant code in the latest kernel (post-iomap):
>
> fs/iomap/direct-io.c: iomap_dio_bio_end_io()
>         if (dio->wait_for_completion) {
>                 struct task_struct *waiter = dio->submit.waiter;
>                 WRITE_ONCE(dio->submit.waiter, NULL);
>                 blk_wake_io_task(waiter);
>         } else if (dio->flags & IOMAP_DIO_WRITE) {
>                 struct inode *inode = file_inode(dio->iocb->ki_filp);
>
>                 WRITE_ONCE(dio->iocb->private, NULL);
>                 INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
>                 queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
>         } else {
>                 WRITE_ONCE(dio->iocb->private, NULL);
>                 iomap_dio_complete_work(&dio->aio.work);
>         }
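>
> The idea in the attached patch is, roughly, to bring that nrpages
> test back into the IOMAP_DIO_WRITE branch above. A minimal sketch of
> the shape of the change (illustrative only, not the exact patch):
>
>         } else if (dio->flags & IOMAP_DIO_WRITE) {
>                 struct inode *inode = file_inode(dio->iocb->ki_filp);
>
>                 WRITE_ONCE(dio->iocb->private, NULL);
>                 /*
>                  * Sketch: only bounce completion to the workqueue when
>                  * the inode has pages mapped, as the old
>                  * dio_bio_end_aio() check did; otherwise complete
>                  * directly in the endio handler.
>                  */
>                 if (inode->i_mapping->nrpages) {
>                         INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
>                         queue_work(inode->i_sb->s_dio_done_wq,
>                                    &dio->aio.work);
>                 } else {
>                         iomap_dio_complete_work(&dio->aio.work);
>                 }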
>
> With the attached patch, I see significantly better performance in
> 5.10 than in 4.15. 5.10 is the latest kernel where I have driver
> support for an SSD fast enough to reproduce the regression. I verified
> that upstream iomap still defers write completions in the same way.
>
> Test results below use the reproduction script from the original
> report, testing with 4k/8k/12k/16k blocksizes and write-only:
> https://people.collabora.com/~krisman/dio/week21/bench.sh
>
>
> fio benchmark command:
> fio --ioengine libaio --size=2G --direct=1 --filename=${MNT}/file \
>     --iodepth=64 --time_based=1 --thread=1 --overwrite=1 --bs=${BS} \
>     --rw=$RW --name "`uname -r`-${TYPE}-${RW}-${BS}-${FS}" \
>     --runtime=100 --output-format=terse >> ${LOG}
>
> For 4.15, with all write completions called in io handler:
> 4k: bw=1056MiB/s
> 8k: bw=2082MiB/s
> 12k: bw=2332MiB/s
> 16k: bw=2453MiB/s
>
> For unmodified 5.10, with all write completions deferred:
> 4k: bw=1004MiB/s
> 8k: bw=2074MiB/s
> 12k: bw=2309MiB/s
> 16k: bw=2465MiB/s
>
> For modified 5.10, with all write completions called in io handler:
> 4k: bw=1193MiB/s
> 8k: bw=2258MiB/s
> 12k: bw=2346MiB/s
> 16k: bw=2446MiB/s
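>
> (For scale: at 4k, unmodified 5.10 is roughly 5% slower than 4.15,
> and modified 5.10 is roughly 19% faster than unmodified 5.10; 8k
> gains about 9%, while 12k and 16k are essentially unchanged.)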
>
> Questions:
>
> Why did iomap, from the beginning, not make the async/sync io and
> mapped/unmapped distinction that fs/direct-io.c did?
>
> Since no issues have been found with ext4 calling the completion work
> directly in the io handler pre-iomap, it is unlikely that this is
> unsafe (sleeping within an io handler callback). However, this may not
> be true for all filesystems. Does XFS potentially sleep in its
> completion code?
>
> Allison, Ted mentioned you saw a similar problem when doing
> performance testing for the latest version of Unbreakable Linux. Can
> you test/verify this patch addresses your performance regression as
> well?

Hi Jeremy,

Sure, I can link this patch to the bug report to see if the tester sees
any improvements. If anything, I think it's a good data point to have,
even if we are still considering other solutions. Thanks!

Allison
>
> Jeremy Bongio (1):
>   For DIO writes with no mapped pages for inode, skip deferring
>     completion.
>
> fs/iomap/direct-io.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>