[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <47beaca6-8118-4458-898a-42fc3286d991@vivo.com>
Date: Wed, 31 Jul 2024 09:47:00 +0800
From: Huan Yang <link@...o.com>
To: "T.J. Mercier" <tjmercier@...gle.com>
Cc: Christian König <christian.koenig@....com>,
Sumit Semwal <sumit.semwal@...aro.org>,
Benjamin Gaignard <benjamin.gaignard@...labora.com>,
Brian Starkey <Brian.Starkey@....com>, John Stultz <jstultz@...gle.com>,
linux-media@...r.kernel.org, dri-devel@...ts.freedesktop.org,
linaro-mm-sig@...ts.linaro.org, linux-kernel@...r.kernel.org,
opensource.kernel@...o.com
Subject: Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
在 2024/7/31 1:19, T.J. Mercier 写道:
> On Tue, Jul 30, 2024 at 1:14 AM Huan Yang <link@...o.com> wrote:
>>
>> 在 2024/7/30 16:03, Christian König 写道:
>>> Am 30.07.24 um 09:57 schrieb Huan Yang:
>>>> Background
>>>> ====
>>>> Some user may need load file into dma-buf, current way is:
>>>> 1. allocate a dma-buf, get dma-buf fd
>>>> 2. mmap dma-buf fd into user vaddr
>>>> 3. read(file_fd, vaddr, fsz)
>>>> Due to dma-buf user map can't support direct I/O[1], the file read
>>>> must be buffer I/O.
>>>>
>>>> This means that during the process of reading the file into dma-buf,
>>>> page cache needs to be generated, and the corresponding content needs to
>>>> be first copied to the page cache before being copied to the dma-buf.
>>>>
>>>> This way worked well when reading relatively small files before, as
>>>> the page cache can cache the file content, thus improving performance.
>>>>
>>>> However, there are new challenges currently, especially as AI models are
>>>> becoming larger and need to be shared between DMA devices and the CPU
>>>> via dma-buf.
>>>>
>>>> For example, our 7B model file size is around 3.4GB. Using the
>>>> previous would mean generating a total of 3.4GB of page cache
>>>> (even if it will be reclaimed), and also requiring the copying of 3.4GB
>>>> of content between page cache and dma-buf.
>>>>
>>>> Due to the limited resources of system memory, files in the gigabyte
>>>> range
>>>> cannot persist in memory indefinitely, so this portion of page cache may
>>>> not provide much assistance for subsequent reads. Additionally, the
>>>> existence of page cache will consume additional system resources due to
>>>> the extra copying required by the CPU.
>>>>
>>>> Therefore, I think it is necessary for dma-buf to support direct I/O.
>>>>
>>>> However, direct I/O file reads cannot be performed using the buffer
>>>> mmaped by the user space for the dma-buf.[1]
>>>>
>>>> Here are some discussions on implementing direct I/O using dma-buf:
>>>>
>>>> mmap[1]
>>>> ---
>>>> dma-buf never support user map vaddr use of direct I/O.
>>>>
>>>> udmabuf[2]
>>>> ---
>>>> Currently, udmabuf can use the memfd method to read files into
>>>> dma-buf in direct I/O mode.
>>>>
>>>> However, if the size is large, the current udmabuf needs to adjust the
>>>> corresponding size_limit(default 64MB).
>>>> But using udmabuf for files at the 3GB level is not a very good
>>>> approach.
>>>> It needs to make some adjustments internally to handle this.[3] Or else,
>>>> fail create.
>>>>
>>>> But, it is indeed a viable way to enable dma-buf to support direct I/O.
>>>> However, it is necessary to initiate the file read after the memory
>>>> allocation
>>>> is completed, and handle race conditions carefully.
>>>>
>>>> sendfile/splice[4]
>>>> ---
>>>> Another way to enable dma-buf to support direct I/O is by implementing
>>>> splice_write/write_iter in the dma-buf file operations (fops) to adapt
>>>> to the sendfile method.
>>>> However, the current sendfile/splice calls are based on pipe. When using
>>>> direct I/O to read a file, the content needs to be copied to the buffer
>>>> allocated by the pipe (default 64KB), and then the dma-buf fops'
>>>> splice_write needs to be called to write the content into the dma-buf.
>>>> This approach requires serially reading the content of file pipe size
>>>> into the pipe buffer and then waiting for the dma-buf to be written
>>>> before reading the next one.(The I/O performance is relatively weak
>>>> under direct I/O.)
>>>> Moreover, due to the existence of the pipe buffer, even when using
>>>> direct I/O and not needing to generate additional page cache,
>>>> there still needs to be a CPU copy.
>>>>
>>>> copy_file_range[5]
>>>> ---
>>>> Consider of copy_file_range, It only supports copying files within the
>>>> same file system. Similarly, it is not very practical.
>>>>
>>>>
>>>> So, currently, there is no particularly suitable solution on VFS to
>>>> allow dma-buf to support direct I/O for large file reads.
>>>>
>>>> This patchset provides an idea to complete file reads when requesting a
>>>> dma-buf.
>>>>
>>>> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>> ===
>>>> This patch provides a method to immediately read the file content after
>>>> the dma-buf is allocated, and only returns the dma-buf file descriptor
>>>> after the file is fully read.
>>>>
>>>> Since the dma-buf file descriptor is not returned, no other thread can
>>>> access it except for the current thread, so we don't need to worry about
>>>> race conditions.
>>> That is a completely false assumption.
>> Can you provide a detailed explanation as to why this assumption is
>> incorrect? thanks.
>>>> Map the dma-buf to the vmalloc area and initiate file reads in kernel
>>>> space, supporting both buffer I/O and direct I/O.
>>>>
>>>> This patch adds the DMA_HEAP_ALLOC_AND_READ heap_flag for user.
>>>> When a user needs to allocate a dma-buf and read a file, they should
>>>> pass this heap flag. As the size of the file being read is fixed,
>>>> there is no
>>>> need to pass the 'len' parameter. Instead, The file_fd needs to be
>>>> passed to
>>>> indicate to the kernel the file that needs to be read.
>>>>
>>>> The file open flag determines the mode of file reading.
>>>> But, please note that if direct I/O(O_DIRECT) is needed to read the
>>>> file,
>>>> the file size must be page aligned. (with patch 2-5, no need)
>>>>
>>>> Therefore, for the user, len and file_fd are mutually exclusive,
>>>> and they are combined using a union.
>>>>
>>>> Once the user obtains the dma-buf fd, the dma-buf directly contains the
>>>> file content.
>>> And I'm repeating myself, but this is a complete NAK from my side to
>>> this approach.
>>>
>>> We pointed out multiple ways of how to implement this cleanly and not
>>> by hacking functionality into the kernel which absolutely doesn't
>>> belong there.
>> In this patchset, I have provided performance comparisons of each of
>> these methods. Can you please provide more opinions?
>>> Regards,
>>> Christian.
>>>
>>>> Patch 1 implement it.
>>>>
>>>> Patch 2-5 provides an approach for performance improvement.
>>>>
>>>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>>>> synchronously read files using direct I/O.
>>>>
>>>> This approach helps to save CPU copying and avoid a certain degree of
>>>> memory thrashing (page cache generation and reclamation)
>>>>
>>>> When dealing with large file sizes, the benefits of this approach become
>>>> particularly significant.
>>>>
>>>> However, there are currently some methods that can improve performance,
>>>> not just save system resources:
>>>>
>>>> Due to the large file size, for example, a AI 7B model of around
>>>> 3.4GB, the
>>>> time taken to allocate DMA-BUF memory will be relatively long. Waiting
>>>> for the allocation to complete before reading the file will add to the
>>>> overall time consumption. Therefore, the total time for DMA-BUF
>>>> allocation and file read can be calculated using the formula
>>>> T(total) = T(alloc) + T(I/O)
>>>>
>>>> However, if we change our approach, we don't necessarily need to wait
>>>> for the DMA-BUF allocation to complete before initiating I/O. In fact,
>>>> during the allocation process, we already hold a portion of the page,
>>>> which means that waiting for subsequent page allocations to complete
>>>> before carrying out file reads is actually unfair to the pages that have
>>>> already been allocated.
>>>>
>>>> The allocation of pages is sequential, and the reading of the file is
>>>> also sequential, with the content and size corresponding to the file.
>>>> This means that the memory location for each page, which holds the
>>>> content of a specific position in the file, can be determined at the
>>>> time of allocation.
>>>>
>>>> However, to fully leverage I/O performance, it is best to wait and
>>>> gather a certain number of pages before initiating batch processing.
>>>>
>>>> The default gather size is 128MB. So, ever gathered can see as a file
>>>> read
>>>> work, it maps the gather page to the vmalloc area to obtain a continuous
>>>> virtual address, which is used as a buffer to store the contents of the
>>>> corresponding file. So, if using direct I/O to read a file, the file
>>>> content will be written directly to the corresponding dma-buf buffer
>>>> memory
>>>> without any additional copying.(compare to pipe buffer.)
>>>>
>>>> Consider other ways to read into dma-buf. If we assume reading after
>>>> mmap
>>>> dma-buf, we need to map the pages of the dma-buf to the user virtual
>>>> address space. Also, udmabuf memfd need do this operations too.
>>>> Even if we support sendfile, the file copy also need buffer, you must
>>>> setup it.
>>>> So, mapping pages to the vmalloc area does not incur any additional
>>>> performance overhead compared to other methods.[6]
>>>>
>>>> Certainly, the administrator can also modify the gather size through
>>>> patch5.
>>>>
>>>> The formula for the time taken for system_heap buffer allocation and
>>>> file reading through async_read is as follows:
>>>>
>>>> T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>>>>
>>>> Compared to the synchronous read:
>>>> T(total) = T(alloc) + T(I/O)
>>>>
>>>> If the allocation time or I/O time is long, the time difference will be
>>>> covered by the maximum value between the allocation and I/O. The other
>>>> party will be concealed.
>>>>
>>>> Therefore, the larger the size of the file that needs to be read, the
>>>> greater the corresponding benefits will be.
>>>>
>>>> How to use
>>>> ===
>>>> Consider the current pathway for loading model files into DMA-BUF:
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>> 4. mmap dma-buf fd, get vaddr
>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>> 6. share, attach, whatever you want
>>>>
>>>> Use DMA_HEAP_ALLOC_AND_READ_FILE JUST a little change:
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(buffer/direct)
>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
>>>> set file_fd
>>>> instead of len. get dma-buf fd(contains file content)
>>>> 4. share, attach, whatever you want
>>>>
>>>> So, test it is easy.
>>>>
>>>> How to test
>>>> ===
>>>> The performance comparison will be conducted for the following
>>>> scenarios:
>>>> 1. normal
>>>> 2. udmabuf with [3] patch
>>>> 3. sendfile
>>>> 4. only patch 1
>>>> 5. patch1 - patch4.
>>>>
>>>> normal:
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(can't use O_DIRECT)
>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>> 4. mmap dma-buf fd, get vaddr
>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>> 6. share, attach, whatever you want
>>>>
>>>> UDMA-BUF step:
>>>> 1. memfd_create
>>>> 2. open file(buffer/direct)
>>>> 3. udmabuf create
>>>> 4. mmap memfd
>>>> 5. read file into memfd vaddr
>>>>
>>>> Sendfile step(need suit splice_write/write_iter, just use to compare):
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(buffer/direct)
>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>> 4. sendfile file_fd to dma-buf fd
>>>> 6. share, attach, whatever you want
>>>>
>>>> patch1/patch1-4:
>>>> 1. open dma-heap, get heap fd
>>>> 2. open file, get file_fd(buffer/direct)
>>>> 3. allocate dma-buf with DMA_HEAP_ALLOC_AND_READ_FILE heap flag,
>>>> set file_fd
>>>> instead of len. get dma-buf fd(contains file content)
>>>> 4. share, attach, whatever you want
>>>>
>>>> You can create a file to test it. Compare the performance gap between
>>>> the two.
>>>> It is best to compare the differences in file size from KB to MB to GB.
>>>>
>>>> The following test data will compare the performance differences
>>>> between 512KB,
>>>> 8MB, 1GB, and 3GB under various scenarios.
>>>>
>>>> Performance Test
>>>> ===
>>>> 12G RAM phone
>>>> UFS4.0(the maximum speed is 4GB/s. ),
>>>> f2fs
>>>> kernel 6.1 with patch[7] (or else, can't support kvec direct I/O
>>>> read.)
>>>> no memory pressure.
>>>> drop_cache is used for each test.
>>>>
>>>> The average of 5 test results:
>>>> | scheme-size | 512KB(ns) | 8MB(ns) | 1GB(ns) |
>>>> 3GB(ns) |
>>>> | ------------------- | ---------- | ---------- | ------------- |
>>>> ------------- |
>>>> | normal | 2,790,861 | 14,535,784 | 1,520,790,492 |
>>>> 3,332,438,754 |
>>>> | udmabuf buffer I/O | 1,704,046 | 11,313,476 | 821,348,000 |
>>>> 2,108,419,923 |
>>>> | sendfile buffer I/O | 3,261,261 | 12,112,292 | 1,565,939,938 |
>>>> 3,062,052,984 |
>>>> | patch1-4 buffer I/O | 2,064,538 | 10,771,474 | 986,338,800 |
>>>> 2,187,570,861 |
>>>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 |
>>>> 9,777,661,077 |
>>>> | patch1 direct I/O | 813,215 | 6,962,092 | 2,364,211,877 |
>>>> 5,648,897,554 |
>>>> | udmabuf direct I/O | 1,289,554 | 8,968,138 | 921,480,784 |
>>>> 2,158,305,738 |
>>>> | patch1-4 direct I/O | 1,957,661 | 6,581,999 | 520,003,538 |
>>>> 1,400,006,107 |
>> With this test, sendfile can't give a good help base on pipe buffer.
>>
>> udmabuf is good, but I think our oem driver can't suit it. (And, AOSP do
>> not open this feature)
> Hi Huan,
>
> We should be able to turn on udmabuf for the Android kernels. We don't
> have CONFIG_UDMABUF because nobody has wanted it so far. It's
> encouraging to see your latest results!
OK, that's greate. I will further study the use of udmabuf and notify
our partners, and make every effort to encourage them to adapt udmabuf.
>
> -T.J.
>
>
>> Anyway, I am sending this patchset in the hope of further discussion.
>>
>> Thanks.
>>
>>>> So, based on the test results:
>>>>
>>>> When the file is large, the patchset has the highest performance.
>>>> Compared to normal, patchset is a 50% improvement;
>>>> Compared to normal, patch1 only showed a degradation of 41%.
>>>> patch1 typical performance breakdown is as follows:
>>>> 1. alloc cost 188,802,693 ns
>>>> 2. vmap cost 42,491,385 ns
>>>> 3. file read cost 4,180,876,702 ns
>>>> Therefore, directly performing a single direct I/O read on a large file
>>>> may not be the most optimal way for performance.
>>>>
>>>> The performance of direct I/O implemented by the sendfile method is
>>>> the worst.
>>>>
>>>> When file size is small, The difference in performance is not
>>>> significant. This is consistent with expectations.
>>>>
>>>>
>>>>
>>>> Suggested use cases
>>>> ===
>>>> 1. When there is a need to read large files and system resources
>>>> are scarce,
>>>> especially when the size of memory is limited.(GB level) In this
>>>> scenario, using direct I/O for file reading can even bring
>>>> performance
>>>> improvements.(may need patch2-3)
>>>> 2. For embedded devices with limited RAM, using direct I/O can
>>>> save system
>>>> resources and avoid unnecessary data copying. Therefore, even
>>>> if the
>>>> performance is lower when read small file, it can still be used
>>>> effectively.
>>>> 3. If there is sufficient memory, pinning the page cache of the
>>>> model files
>>>> in memory and placing file in the EROFS file system for
>>>> read-only access
>>>> maybe better.(EROFS do not support direct I/O)
>>>>
>>>>
>>>> Changlog
>>>> ===
>>>> v1 [8]
>>>> v1->v2:
>>>> Uses the heap flag method for alloc and read instead of adding a new
>>>> DMA-buf ioctl command. [9]
>>>> Split the patchset to facilitate review and test.
>>>> patch 1 implement alloc and read, offer heap flag into it.
>>>> patch 2-4 offer async read
>>>> patch 5 can change gather limit.
>>>>
>>>> Reference
>>>> ===
>>>> [1]
>>>> https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
>>>> [2]
>>>> https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
>>>> [3]
>>>> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>>> [4]
>>>> https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
>>>> [5]
>>>> https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
>>>> [6]
>>>> https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
>>>> [7]
>>>> https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
>>>> [8]
>>>> https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
>>>> [9]
>>>> https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>>>>
>>>> Huan Yang (5):
>>>> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>> dma-buf: heaps: Introduce async alloc read ops
>>>> dma-buf: heaps: support alloc async read file
>>>> dma-buf: heaps: system_heap alloc support async read
>>>> dma-buf: heaps: configurable async read gather limit
>>>>
>>>> drivers/dma-buf/dma-heap.c | 552 +++++++++++++++++++++++++++-
>>>> drivers/dma-buf/heaps/system_heap.c | 70 +++-
>>>> include/linux/dma-heap.h | 53 ++-
>>>> include/uapi/linux/dma-heap.h | 11 +-
>>>> 4 files changed, 673 insertions(+), 13 deletions(-)
>>>>
>>>>
>>>> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
Powered by blists - more mailing lists