Message-ID: <2d94446c-72d1-4c4d-b7e0-696767b98654@vivo.com>
Date: Tue, 30 Jul 2024 19:36:15 +0800
From: Huan Yang <link@...o.com>
To: Christian König <christian.koenig@....com>,
Sumit Semwal <sumit.semwal@...aro.org>,
Benjamin Gaignard <benjamin.gaignard@...labora.com>,
Brian Starkey <Brian.Starkey@....com>, John Stultz <jstultz@...gle.com>,
"T.J. Mercier" <tjmercier@...gle.com>, linux-media@...r.kernel.org,
dri-devel@...ts.freedesktop.org, linaro-mm-sig@...ts.linaro.org,
linux-kernel@...r.kernel.org
Cc: opensource.kernel@...o.com
Subject: Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
On 2024/7/30 18:43, Christian König wrote:
> On 30.07.24 at 10:46, Huan Yang wrote:
>>
>> On 2024/7/30 16:37, Christian König wrote:
>>> On 30.07.24 at 10:14, Huan Yang wrote:
>>>> On 2024/7/30 16:03, Christian König wrote:
>>>>> On 30.07.24 at 09:57, Huan Yang wrote:
>>>>>> Background
>>>>>> ====
>>>>>> Some users need to load a file into a dma-buf. The current way is:
>>>>>> 1. allocate a dma-buf, get dma-buf fd
>>>>>> 2. mmap dma-buf fd into user vaddr
>>>>>> 3. read(file_fd, vaddr, fsz)
>>>>>> Because a dma-buf user mapping cannot support direct I/O[1], the
>>>>>> file read must use buffered I/O.
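>>>>>>
>>>>>> For reference, this path looks roughly like the sketch below
>>>>>> (error handling omitted; needs <linux/dma-heap.h>, <sys/ioctl.h>,
>>>>>> <sys/mman.h>, <fcntl.h>, <unistd.h>):
>>>>>>
>>>>>>     int heap_fd = open("/dev/dma_heap/system", O_RDWR);
>>>>>>     int file_fd = open(path, O_RDONLY);  /* O_DIRECT won't work here */
>>>>>>     struct dma_heap_allocation_data data = {
>>>>>>         .len = file_size,
>>>>>>         .fd_flags = O_RDWR | O_CLOEXEC,
>>>>>>     };
>>>>>>     ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
>>>>>>     void *vaddr = mmap(NULL, file_size, PROT_READ | PROT_WRITE,
>>>>>>                        MAP_SHARED, data.fd, 0);
>>>>>>     read(file_fd, vaddr, file_size);     /* buffered I/O only */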
>>>>>>
>>>>>> This means that while reading the file into the dma-buf, page
>>>>>> cache has to be generated, and the content is first copied into
>>>>>> the page cache before being copied to the dma-buf.
>>>>>>
>>>>>> This worked well for relatively small files, as the page cache
>>>>>> keeps the file content and thus improves performance.
>>>>>>
>>>>>> However, new challenges are emerging, especially as AI models
>>>>>> grow larger and need to be shared between DMA devices and the CPU
>>>>>> via dma-buf.
>>>>>>
>>>>>> For example, our 7B model file is around 3.4GB. Using the
>>>>>> previous method means generating a total of 3.4GB of page cache
>>>>>> (even if it is later reclaimed), and also copying 3.4GB of
>>>>>> content between the page cache and the dma-buf.
>>>>>>
>>>>>> Since system memory is limited, gigabyte-range files cannot stay
>>>>>> in memory indefinitely, so this page cache provides little help
>>>>>> for subsequent reads. In addition, the page cache consumes extra
>>>>>> system resources because of the additional CPU copy it requires.
>>>>>>
>>>>>> Therefore, I think it is necessary for dma-buf to support direct
>>>>>> I/O.
>>>>>>
>>>>>> However, direct I/O file reads cannot be performed into the
>>>>>> buffer that user space mmaps from the dma-buf.[1]
>>>>>>
>>>>>> Here are some discussions on implementing direct I/O using dma-buf:
>>>>>>
>>>>>> mmap[1]
>>>>>> ---
>>>>>> dma-buf has never supported direct I/O on a user-mapped vaddr.
>>>>>>
>>>>>> udmabuf[2]
>>>>>> ---
>>>>>> Currently, udmabuf can use the memfd method to read files into a
>>>>>> dma-buf in direct I/O mode.
>>>>>>
>>>>>> However, for large sizes, udmabuf's size_limit (default 64MB)
>>>>>> must be raised. Even so, udmabuf is not a very good fit for files
>>>>>> at the 3GB level: it needs internal adjustments to handle
>>>>>> this[3], or else creation fails.
>>>>>>
>>>>>> Still, it is a viable way to give dma-buf direct I/O support.
>>>>>> Note, though, that the file read can only be initiated after the
>>>>>> memory allocation completes, and race conditions must be handled
>>>>>> carefully.
>>>>>>
>>>>>> sendfile/splice[4]
>>>>>> ---
>>>>>> Another way to enable dma-buf to support direct I/O is to
>>>>>> implement splice_write/write_iter in the dma-buf file operations
>>>>>> (fops) so that it works with sendfile.
>>>>>> However, the current sendfile/splice calls are pipe-based. When
>>>>>> reading a file with direct I/O, the content must first be copied
>>>>>> into the buffer allocated by the pipe (default 64KB), after which
>>>>>> the dma-buf fops' splice_write is called to write the content
>>>>>> into the dma-buf.
>>>>>> This approach has to serially read one pipe-sized chunk of the
>>>>>> file into the pipe buffer and wait for it to be written into the
>>>>>> dma-buf before reading the next chunk. (I/O performance is
>>>>>> relatively weak under direct I/O.)
>>>>>> Moreover, because of the pipe buffer, a CPU copy is still needed
>>>>>> even when direct I/O avoids generating extra page cache.
>>>>>>
>>>>>> copy_file_range[5]
>>>>>> ---
>>>>>> As for copy_file_range, it only supports copying files within the
>>>>>> same file system, so it is likewise not very practical here.
>>>>>>
>>>>>>
>>>>>> So there is currently no particularly suitable VFS-level solution
>>>>>> that lets dma-buf support direct I/O for large file reads.
>>>>>>
>>>>>> This patchset offers one idea: complete the file read while
>>>>>> requesting the dma-buf.
>>>>>>
>>>>>> Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>>>> ===
>>>>>> This patch provides a way to read the file content immediately
>>>>>> after the dma-buf is allocated, returning the dma-buf file
>>>>>> descriptor only after the file has been fully read.
>>>>>>
>>>>>> Since the dma-buf file descriptor has not yet been returned, no
>>>>>> thread other than the current one can access it, so we don't need
>>>>>> to worry about race conditions.
>>>>>
>>>>> That is a completely false assumption.
>>>> Can you explain in detail why this assumption is incorrect? Thanks.
>>>
>>> File descriptors can be guessed and are available to userspace as
>>> soon as dma_buf_fd() is called.
>>>
>>> What could potentially work is to call system_heap_allocate()
>>> without calling dma_buf_fd(), but I'm not sure you can then do I/O
>>> to the underlying pages.
>>
>> Actually, the code obtains the dma-buf file descriptor only after a
>> successful file read, so there is no issue.
>>
>> If you are interested, you can take a look at the
>> dma_heap_buffer_alloc_and_read function in patch1.
>>
>>>
>>>>>
>>>>>>
>>>>>> Map the dma-buf to the vmalloc area and initiate the file read in
>>>>>> kernel space, supporting both buffered I/O and direct I/O.
>>>>>>
>>>>>> This patch adds the DMA_HEAP_ALLOC_AND_READ_FILE heap flag for
>>>>>> users. When a user needs to allocate a dma-buf and read a file
>>>>>> into it, they pass this heap flag. Since the size of the file
>>>>>> being read is fixed, there is no need to pass the 'len'
>>>>>> parameter; instead, file_fd is passed to tell the kernel which
>>>>>> file to read.
>>>>>>
>>>>>> The file's open flags determine the read mode. Note that if
>>>>>> direct I/O (O_DIRECT) is used to read the file, the file size
>>>>>> must be page aligned. (With patches 2-5, this is no longer
>>>>>> required.)
>>>>>>
>>>>>> Therefore, from the user's perspective, len and file_fd are
>>>>>> mutually exclusive, and they are combined in a union.
>>>>>>
>>>>>> Once the user obtains the dma-buf fd, the dma-buf directly
>>>>>> contains the
>>>>>> file content.
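>>>>>>
>>>>>> For illustration, a minimal sketch of how the uapi could look
>>>>>> (the field layout and the union are taken from this description,
>>>>>> but treat the names as assumptions; patch 1 has the authoritative
>>>>>> definition):
>>>>>>
>>>>>>     struct dma_heap_allocation_data {
>>>>>>         union {
>>>>>>             __u64 len;     /* used for a plain allocation */
>>>>>>             __u64 file_fd; /* used with DMA_HEAP_ALLOC_AND_READ_FILE */
>>>>>>         };
>>>>>>         __u32 fd;          /* out: dma-buf fd, returned after the read */
>>>>>>         __u32 fd_flags;
>>>>>>         __u64 heap_flags;  /* DMA_HEAP_ALLOC_AND_READ_FILE goes here */
>>>>>>     };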
>>>>>
>>>>> And I'm repeating myself, but this is a complete NAK from my side
>>>>> to this approach.
>>>>>
>>>>> We pointed out multiple ways to implement this cleanly rather than
>>>>> hacking functionality into the kernel that absolutely doesn't
>>>>> belong there.
>>>> In this patchset, I have provided performance comparisons of each
>>>> of these methods. Could you please share more of your thoughts?
>>>
>>> Either drop the whole approach or change udmabuf to do what you want
>>> to do.
>> OK, if so, do I need to send a patch to make dma-buf support sendfile?
>
> Well the udmabuf approach doesn't need to use sendfile, so no.
Got it, I won't send it again.
Regarding udmabuf, my testing found that it can't support large file
reads because of the page array allocation.
I already posted a patch for this but haven't received an answer:
https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
Is there anything wrong with my understanding of it?
>
> Regards,
> Christian.
>
>>
>>>
>>> Apart from that I don't see a workable approach which can be
>>> accepted into the kernel.
>> Thanks for your suggestion.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Patch 1 implements it.
>>>>>>
>>>>>> Patches 2-5 provide an approach for performance improvement.
>>>>>>
>>>>>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>>>>>> synchronously read files using direct I/O.
>>>>>>
>>>>>> This saves CPU copying and avoids a certain degree of memory
>>>>>> thrashing (page cache generation and reclamation).
>>>>>>
>>>>>> For large file sizes, the benefits of this approach become
>>>>>> particularly significant.
>>>>>>
>>>>>> However, there are also ways to improve performance, not just
>>>>>> save system resources:
>>>>>>
>>>>>> Because the file is large, for example a 7B AI model of around
>>>>>> 3.4GB, allocating the DMA-BUF memory takes a relatively long
>>>>>> time. Waiting for the allocation to complete before reading the
>>>>>> file adds to the overall time. The total time for DMA-BUF
>>>>>> allocation plus file read is therefore:
>>>>>> T(total) = T(alloc) + T(I/O)
>>>>>>
>>>>>> However, with a different approach, we don't need to wait for
>>>>>> the whole DMA-BUF allocation to complete before initiating I/O.
>>>>>> During the allocation we already hold a portion of the pages, so
>>>>>> waiting for the remaining page allocations to finish before
>>>>>> starting file reads is unfair to the pages that have already been
>>>>>> allocated.
>>>>>>
>>>>>> Page allocation is sequential, and the file read is also
>>>>>> sequential, with the content and size corresponding to the file.
>>>>>> This means the memory location of each page, holding the content
>>>>>> of a specific position in the file, can be determined at
>>>>>> allocation time.
>>>>>>
>>>>>> However, to fully leverage I/O performance, it is best to gather
>>>>>> a certain number of pages before initiating batch processing.
>>>>>>
>>>>>> The default gather size is 128MB. Each gathered batch can be
>>>>>> treated as one file-read work item: it maps the gathered pages
>>>>>> into the vmalloc area to obtain a contiguous virtual address,
>>>>>> which serves as the buffer for the corresponding part of the
>>>>>> file. So when reading the file with direct I/O, the file content
>>>>>> is written directly into the dma-buf's memory without any
>>>>>> additional copying (compare the pipe buffer), as sketched below.
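>>>>>>
>>>>>> A simplified kernel-side sketch of the idea (not the exact code
>>>>>> in the patches; variable names are illustrative):
>>>>>>
>>>>>>     /* Map one gathered batch of heap pages to a contiguous
>>>>>>      * kernel vaddr, then read the matching file range into it. */
>>>>>>     void *vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
>>>>>>     loff_t pos = batch_offset;  /* file offset of this batch */
>>>>>>     ssize_t ret = kernel_read(file, vaddr, batch_size, &pos);
>>>>>>     vunmap(vaddr);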
>>>>>>
>>>>>> Consider the other ways of reading into a dma-buf: reading after
>>>>>> mmap requires mapping the dma-buf's pages into the user virtual
>>>>>> address space, and udmabuf's memfd needs the same kind of
>>>>>> mapping. Even with sendfile support, the file copy still needs a
>>>>>> buffer that must be set up.
>>>>>> So mapping pages into the vmalloc area incurs no additional
>>>>>> performance overhead compared to the other methods.[6]
>>>>>>
>>>>>> The administrator can also change the gather size through
>>>>>> patch 5.
>>>>>>
>>>>>> With async read, the time for system_heap buffer allocation and
>>>>>> file reading is:
>>>>>>
>>>>>> T(total) = T(first gather) + Max(T(remaining alloc), T(I/O))
>>>>>>
>>>>>> Compared to the synchronous read:
>>>>>> T(total) = T(alloc) + T(I/O)
>>>>>>
>>>>>> Whichever of allocation and I/O takes longer hides the cost of
>>>>>> the other, since only the maximum of the two contributes.
>>>>>>
>>>>>> Therefore, the larger the file to be read, the greater the
>>>>>> benefit.
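>>>>>>
>>>>>> For example (illustrative numbers only): with T(alloc) = 1s and
>>>>>> T(I/O) = 4s, the synchronous path takes about 1s + 4s = 5s, while
>>>>>> the async path takes roughly T(first gather) + Max(~1s, 4s),
>>>>>> i.e. close to 4s, because the I/O hides the remaining allocation
>>>>>> time.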
>>>>>>
>>>>>> How to use
>>>>>> ===
>>>>>> Consider the current pathway for loading model files into DMA-BUF:
>>>>>> 1. open dma-heap, get heap fd
>>>>>> 2. open file, get file_fd (can't use O_DIRECT)
>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>> 4. mmap dma-buf fd, get vaddr
>>>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>>> 6. share, attach, whatever you want
>>>>>>
>>>>>> Using DMA_HEAP_ALLOC_AND_READ_FILE needs just a small change:
>>>>>> 1. open dma-heap, get heap fd
>>>>>> 2. open file, get file_fd (buffered/direct)
>>>>>> 3. allocate dma-buf with the DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>>>    flag, setting file_fd instead of len; get a dma-buf fd that
>>>>>>    already contains the file content
>>>>>> 4. share, attach, whatever you want
>>>>>>
>>>>>> So it is easy to test; a sketch follows.
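>>>>>>
>>>>>> A minimal sketch of the new path (field and flag names per the
>>>>>> proposed uapi above; see patch 1 for the authoritative
>>>>>> interface):
>>>>>>
>>>>>>     int heap_fd = open("/dev/dma_heap/system", O_RDWR);
>>>>>>     int file_fd = open(path, O_RDONLY | O_DIRECT); /* or buffered */
>>>>>>     struct dma_heap_allocation_data data = {
>>>>>>         .file_fd = file_fd,  /* union member, replaces .len */
>>>>>>         .fd_flags = O_RDWR | O_CLOEXEC,
>>>>>>         .heap_flags = DMA_HEAP_ALLOC_AND_READ_FILE,
>>>>>>     };
>>>>>>     ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
>>>>>>     /* data.fd is a dma-buf already filled with the file content */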
>>>>>>
>>>>>> How to test
>>>>>> ===
>>>>>> The performance comparison will be conducted for the following
>>>>>> scenarios:
>>>>>> 1. normal
>>>>>> 2. udmabuf with [3] patch
>>>>>> 3. sendfile
>>>>>> 4. only patch 1
>>>>>> 5. patches 1-4
>>>>>>
>>>>>> normal:
>>>>>> 1. open dma-heap, get heap fd
>>>>>> 2. open file, get file_fd (can't use O_DIRECT)
>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>> 4. mmap dma-buf fd, get vaddr
>>>>>> 5. read(file_fd, vaddr, file_size) into dma-buf pages
>>>>>> 6. share, attach, whatever you want
>>>>>>
>>>>>> udmabuf steps:
>>>>>> 1. memfd_create
>>>>>> 2. open file (buffered/direct)
>>>>>> 3. udmabuf create
>>>>>> 4. mmap memfd
>>>>>> 5. read file into memfd vaddr
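>>>>>>
>>>>>> Roughly (a sketch using udmabuf's uapi from <linux/udmabuf.h>;
>>>>>> error handling omitted):
>>>>>>
>>>>>>     int memfd = memfd_create("model", MFD_ALLOW_SEALING);
>>>>>>     ftruncate(memfd, file_size);
>>>>>>     fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK); /* udmabuf requires it */
>>>>>>     int file_fd = open(path, O_RDONLY | O_DIRECT); /* or buffered */
>>>>>>     int dev_fd = open("/dev/udmabuf", O_RDWR);
>>>>>>     struct udmabuf_create create = {
>>>>>>         .memfd = memfd, .offset = 0, .size = file_size,
>>>>>>     };
>>>>>>     int dmabuf_fd = ioctl(dev_fd, UDMABUF_CREATE, &create);
>>>>>>     void *vaddr = mmap(NULL, file_size, PROT_READ | PROT_WRITE,
>>>>>>                        MAP_SHARED, memfd, 0);
>>>>>>     read(file_fd, vaddr, file_size);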
>>>>>>
>>>>>> sendfile steps (needs suitable splice_write/write_iter; used only
>>>>>> for comparison):
>>>>>> 1. open dma-heap, get heap fd
>>>>>> 2. open file, get file_fd (buffered/direct)
>>>>>> 3. use file len to allocate dma-buf, get dma-buf fd
>>>>>> 4. sendfile file_fd to dma-buf fd
>>>>>> 5. share, attach, whatever you want
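>>>>>>
>>>>>> Step 4 would look roughly like this (a sketch; it assumes the
>>>>>> dma-buf fops grow the splice_write/write_iter support described
>>>>>> above, which mainline does not have):
>>>>>>
>>>>>>     off_t off = 0;
>>>>>>     size_t left = file_size;
>>>>>>     while (left > 0) {
>>>>>>         ssize_t n = sendfile(dmabuf_fd, file_fd, &off, left);
>>>>>>         if (n <= 0)
>>>>>>             break;
>>>>>>         left -= n;
>>>>>>     }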
>>>>>>
>>>>>> patch1/patch1-4 steps:
>>>>>> 1. open dma-heap, get heap fd
>>>>>> 2. open file, get file_fd (buffered/direct)
>>>>>> 3. allocate dma-buf with the DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>>>    flag, setting file_fd instead of len; get a dma-buf fd that
>>>>>>    already contains the file content
>>>>>> 4. share, attach, whatever you want
>>>>>>
>>>>>> You can create a file to test it and compare the performance gap
>>>>>> between the approaches. It is best to compare file sizes ranging
>>>>>> from KB to MB to GB.
>>>>>>
>>>>>> The following test data compares 512KB, 8MB, 1GB, and 3GB files
>>>>>> under the various scenarios.
>>>>>>
>>>>>> Performance Test
>>>>>> ===
>>>>>> 12GB RAM phone
>>>>>> UFS 4.0 (maximum speed 4GB/s)
>>>>>> f2fs
>>>>>> kernel 6.1 with patch [7] (otherwise kvec direct I/O reads are
>>>>>> not supported)
>>>>>> no memory pressure
>>>>>> drop_caches is used before each test
>>>>>>
>>>>>> The average of 5 test results:
>>>>>> | scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
>>>>>> | ------------------- | ---------- | ---------- | ------------- | ------------- |
>>>>>> | normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
>>>>>> | udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
>>>>>> | sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
>>>>>> | patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
>>>>>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
>>>>>> | patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
>>>>>> | udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
>>>>>> | patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |
>>>>
>>>> With this test, pipe-based sendfile doesn't help much.
>>>>
>>>> udmabuf is good, but I think our OEM driver can't adapt to it.
>>>> (And AOSP does not enable this feature.)
>>>>
>>>>
>>>> Anyway, I am sending this patchset in the hope of further discussion.
>>>>
>>>> Thanks.
>>>>
>>>>>>
>>>>>> So, based on the test results:
>>>>>>
>>>>>> When the file is large, the full patchset gives the best
>>>>>> performance: compared to normal it is a 50% improvement, while
>>>>>> patch 1 alone shows a 41% degradation.
>>>>>> A typical performance breakdown for patch 1 is:
>>>>>> 1. alloc cost: 188,802,693 ns
>>>>>> 2. vmap cost: 42,491,385 ns
>>>>>> 3. file read cost: 4,180,876,702 ns
>>>>>> Therefore, a single direct I/O read of a large file may not be
>>>>>> the optimal approach for performance.
>>>>>>
>>>>>> Direct I/O implemented via the sendfile method performs worst.
>>>>>>
>>>>>> When the file is small, the performance difference is not
>>>>>> significant, which is consistent with expectations.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Suggested use cases
>>>>>> ===
>>>>>> 1. When large files need to be read and system resources are
>>>>>>    scarce, especially when memory is limited (GB level). In this
>>>>>>    scenario, using direct I/O for the file read can even bring a
>>>>>>    performance improvement. (May need patches 2-3.)
>>>>>> 2. For embedded devices with limited RAM, direct I/O saves
>>>>>>    system resources and avoids unnecessary data copying, so it
>>>>>>    can still be used effectively even if performance is lower
>>>>>>    when reading small files.
>>>>>> 3. If there is sufficient memory, pinning the page cache of the
>>>>>>    model files in memory and placing the files in the EROFS file
>>>>>>    system for read-only access may be better. (EROFS does not
>>>>>>    support direct I/O.)
>>>>>>
>>>>>>
>>>>>> Changelog
>>>>>> ===
>>>>>> v1 [8]
>>>>>> v1->v2:
>>>>>>   Use a heap flag for alloc-and-read instead of adding a new
>>>>>>   dma-buf ioctl command. [9]
>>>>>>   Split the patchset to ease review and testing:
>>>>>>   patch 1 implements alloc and read, adding the heap flag
>>>>>>   patches 2-4 add async read
>>>>>>   patch 5 makes the gather limit configurable
>>>>>>
>>>>>> Reference
>>>>>> ===
>>>>>> [1] https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
>>>>>> [2] https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
>>>>>> [3] https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>>>>> [4] https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
>>>>>> [5] https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
>>>>>> [6] https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
>>>>>> [7] https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
>>>>>> [8] https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
>>>>>> [9] https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>>>>>>
>>>>>> Huan Yang (5):
>>>>>> dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>>>> dma-buf: heaps: Introduce async alloc read ops
>>>>>> dma-buf: heaps: support alloc async read file
>>>>>> dma-buf: heaps: system_heap alloc support async read
>>>>>> dma-buf: heaps: configurable async read gather limit
>>>>>>
>>>>>>  drivers/dma-buf/dma-heap.c          | 552 +++++++++++++++++++++++++++-
>>>>>>  drivers/dma-buf/heaps/system_heap.c |  70 +++-
>>>>>>  include/linux/dma-heap.h            |  53 ++-
>>>>>>  include/uapi/linux/dma-heap.h       |  11 +-
>>>>>>  4 files changed, 673 insertions(+), 13 deletions(-)
>>>>>>
>>>>>>
>>>>>> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
>>>>>
>>>
>