Message-ID: <80ce3ec9104c4f0abbcb589b03a5f3c7@honor.com>
Date: Fri, 13 Jun 2025 09:43:08 +0000
From: wangtao <tao.wangtao@...or.com>
To: Christoph Hellwig <hch@...radead.org>, Christian König
<christian.koenig@....com>
CC: "sumit.semwal@...aro.org" <sumit.semwal@...aro.org>, "kraxel@...hat.com"
<kraxel@...hat.com>, "vivek.kasireddy@...el.com" <vivek.kasireddy@...el.com>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>, "brauner@...nel.org"
<brauner@...nel.org>, "hughd@...gle.com" <hughd@...gle.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "amir73il@...il.com"
<amir73il@...il.com>, "benjamin.gaignard@...labora.com"
<benjamin.gaignard@...labora.com>, "Brian.Starkey@....com"
<Brian.Starkey@....com>, "jstultz@...gle.com" <jstultz@...gle.com>,
"tjmercier@...gle.com" <tjmercier@...gle.com>, "jack@...e.cz" <jack@...e.cz>,
"baolin.wang@...ux.alibaba.com" <baolin.wang@...ux.alibaba.com>,
"linux-media@...r.kernel.org" <linux-media@...r.kernel.org>,
"dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>,
"linaro-mm-sig@...ts.linaro.org" <linaro-mm-sig@...ts.linaro.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "wangbintian(BintianWang)"
<bintian.wang@...or.com>, yipengxiang <yipengxiang@...or.com>, liulu 00013167
<liulu.liu@...or.com>, hanfeng 00012985 <feng.han@...or.com>
Subject: RE: [PATCH v4 0/4] Implement dmabuf direct I/O via copy_file_range
> On Tue, Jun 10, 2025 at 12:52:18PM +0200, Christian König wrote:
> > >> dma_addr_t/len array now that the new DMA API supporting that has
> > >> been merged. Is there any chance the dma-buf maintainers could
> > >> start to kick this off? I'm of course happy to assist.
> >
> > Work on that is already underway for some time.
> >
> > Most GPU drivers already do sg_table -> DMA array conversion, I need
> > to push on the remaining to clean up.
>
> Do you have a pointer?
>
> > >> Yes, that's really puzzling and should be addressed first.
> > > With high CPU performance (e.g., 3GHz), GUP (get_user_pages)
> > > overhead is relatively low (observed in 3GHz tests).
> >
> > Even on a low end CPU walking the page tables and grabbing references
> > shouldn't be that much of an overhead.
>
> Yes.
>
> >
> > There must be some reason why you see so much CPU overhead. E.g.
> > compound pages are broken up or similar which should not happen in the
> > first place.
>
> pin_user_pages unfortunately outputs an array of PAGE_SIZE (modulo offset
> and a shorter last length) struct pages. The block direct I/O code has
> fairly recently grown code to reassemble folios from them, which did speed
> up some workloads.
>
> Is this test using the block device or iomap direct I/O code? What kernel
> version is it run on?
Here's my analysis on Linux 6.6 with F2FS/iomap.
Comparing udmabuf+memfd direct read vs dmabuf direct copy_file_range:
Systrace: on a high-end 3 GHz CPU, the former occupies >80% of the runtime vs
<20% for the latter; on a low-end 1 GHz CPU, the former becomes CPU-bound.
Perf: for the former, bio_iov_iter_get_pages/get_user_pages dominates the
latency. The latter avoids this via lightweight bvec assignments (see the
sketch after the call graph below).
|- 13.03% __arm64_sys_read
|-|- 13.03% f2fs_file_read_iter
|-|-|- 13.03% __iomap_dio_rw
|-|-|-|- 12.95% iomap_dio_bio_iter
|-|-|-|-|- 10.69% bio_iov_iter_get_pages
|-|-|-|-|-|- 10.53% iov_iter_extract_pages
|-|-|-|-|-|-|- 10.53% pin_user_pages_fast
|-|-|-|-|-|-|-|- 10.53% internal_get_user_pages_fast
|-|-|-|-|-|-|-|-|- 10.23% __gup_longterm_locked
|-|-|-|-|-|-|-|-|-|- 8.85% __get_user_pages
|-|-|-|-|-|-|-|-|-|-|- 6.26% handle_mm_fault
|-|-|-|-|- 1.91% iomap_dio_submit_bio
|-|-|-|-|-|- 1.64% submit_bio
|- 1.13% __arm64_sys_copy_file_range
|-|- 1.13% vfs_copy_file_range
|-|-|- 1.13% dma_buf_copy_file_range
|-|-|-|- 1.13% system_heap_dma_buf_rw_file
|-|-|-|-|- 1.13% f2fs_file_read_iter
|-|-|-|-|-|- 1.13% __iomap_dio_rw
|-|-|-|-|-|-|- 1.13% iomap_dio_bio_iter
|-|-|-|-|-|-|-|- 1.13% iomap_dio_submit_bio
|-|-|-|-|-|-|-|-|- 1.08% submit_bio
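
For reference, the "lightweight bvec assignments" above amount to roughly the
following. This is only a simplified sketch, assuming a system-heap style
sg_table whose pages stay pinned by the heap; dmabuf_read_file_sketch() and
its (missing) locking/error handling are illustrative, not the actual patch
code:

#include <linux/bvec.h>
#include <linux/fs.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/uio.h>

/*
 * Build an ITER_BVEC iov_iter directly from the dmabuf heap's sg_table
 * and hand it to the backing file's ->read_iter(), so no page-table walk
 * or pin_user_pages()/GUP refcounting is needed on the hot path.
 */
static ssize_t dmabuf_read_file_sketch(struct file *file, loff_t pos,
				       struct sg_table *sgt, size_t len)
{
	struct scatterlist *sg;
	struct bio_vec *bvec;
	struct iov_iter iter;
	struct kiocb kiocb;
	unsigned int i;
	ssize_t ret;

	bvec = kvmalloc_array(sgt->orig_nents, sizeof(*bvec), GFP_KERNEL);
	if (!bvec)
		return -ENOMEM;

	/* One bvec per sg entry: plain assignments, no refcount churn. */
	for_each_sgtable_sg(sgt, sg, i)
		bvec_set_page(&bvec[i], sg_page(sg), sg->length, sg->offset);

	iov_iter_bvec(&iter, ITER_DEST, bvec, sgt->orig_nents, len);
	init_sync_kiocb(&kiocb, file);
	kiocb.ki_pos = pos;

	ret = call_read_iter(file, &kiocb, &iter);

	kvfree(bvec);
	return ret;
}

That is why the copy_file_range branch of the trace above spends almost all
of its time in iomap_dio_submit_bio/submit_bio rather than in page pinning.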
Large folios can reduce the GUP overhead, but the path is still significantly
slower than the dmabuf-to-bio_vec conversion.
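
From userspace the two paths under comparison look roughly like this (again
only a sketch; fd setup, buffer mapping and the exact offset semantics of the
dmabuf side are illustrative assumptions; the read direction shown matches
the perf data above):

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define LEN (64 << 20)

/* Path A: udmabuf + memfd, O_DIRECT read() into the mmap'd buffer (GUP). */
static ssize_t read_via_memfd(int file_fd, void *membuf)
{
	return read(file_fd, membuf, LEN);
}

/* Path B (this series): copy_file_range() from the file into the dmabuf. */
static ssize_t read_via_dmabuf_cfr(int file_fd, int dmabuf_fd)
{
	loff_t off_in = 0;

	return copy_file_range(file_fd, &off_in, dmabuf_fd, NULL, LEN, 0);
}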
Regards,
Wangtao.