Message-ID: <3ca3588a-fd09-4117-9f96-4d935e0295e5@vivo.com>
Date: Wed, 31 Jul 2024 09:48:08 +0800
From: Huan Yang <link@...o.com>
To: Christian König <christian.koenig@....com>,
 Sumit Semwal <sumit.semwal@...aro.org>,
 Benjamin Gaignard <benjamin.gaignard@...labora.com>,
 Brian Starkey <Brian.Starkey@....com>, John Stultz <jstultz@...gle.com>,
 "T.J. Mercier" <tjmercier@...gle.com>, linux-media@...r.kernel.org,
 dri-devel@...ts.freedesktop.org, linaro-mm-sig@...ts.linaro.org,
 linux-kernel@...r.kernel.org
Cc: opensource.kernel@...o.com
Subject: Re: [PATCH v2 0/5] Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag


On 2024/7/30 21:11, Christian König wrote:
> Am 30.07.24 um 13:36 schrieb Huan Yang:
>>>>> Either drop the whole approach or change udmabuf to do what you 
>>>>> want to do.
>>>> OK, if so, do I need to send a patch to make dma-buf support sendfile?
>>>
>>> Well the udmabuf approach doesn't need to use sendfile, so no.
>>
>> Got it, I won't send it again.
>>
>> About udmabuf, my testing found that it can't support large file reads
>> due to the page array allocation.
>>
>> I already uploaded this patch, but did not receive an answer.
>>
>> https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/ 
>>
>>
>> Is there anything wrong with my understanding of it?
>
> No, that patch was totally fine. Not getting a response is usually 
> something good.
>
> In other words, when maintainers see something which won't work at all
> they immediately react, but when nobody complains it usually means you
> are on the right track.
Thank you for your answer.
>
> As long as nobody has any good arguments against it I'm happy to take
> that one upstream through drm-misc-next immediately since it's clearly
> a stand-alone improvement on its own.

OK, good to know.

Thank you

>
> Regards,
> Christian.
>
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>>>
>>>>> Apart from that I don't see a doable way which can be accepted 
>>>>> into the kernel.
>>>> Thanks for your suggestion.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>> Patch 1 implements it.
>>>>>>>>
>>>>>>>> Patches 2-5 provide an approach for performance improvement.
>>>>>>>>
>>>>>>>> The DMA_HEAP_ALLOC_AND_READ_FILE heap flag patch enables us to
>>>>>>>> synchronously read files using direct I/O.
>>>>>>>>
>>>>>>>> This approach saves CPU copies and avoids a certain degree of
>>>>>>>> memory thrashing (page cache generation and reclamation).
>>>>>>>>
>>>>>>>> When dealing with large file sizes, the benefits of this 
>>>>>>>> approach become
>>>>>>>> particularly significant.
>>>>>>>>
>>>>>>>> However, there are also ways to improve performance, not just
>>>>>>>> save system resources:
>>>>>>>>
>>>>>>>> Due to the large file size, for example an AI 7B model of around
>>>>>>>> 3.4GB, the time taken to allocate DMA-BUF memory will be relatively
>>>>>>>> long. Waiting for the allocation to complete before reading the file
>>>>>>>> adds to the overall time consumption. Therefore, the total time for
>>>>>>>> DMA-BUF allocation and file read can be calculated with the formula:
>>>>>>>>
>>>>>>>>     T(total) = T(alloc) + T(I/O)
>>>>>>>>
>>>>>>>> However, if we change our approach, we don't necessarily need to
>>>>>>>> wait for the DMA-BUF allocation to complete before initiating I/O.
>>>>>>>> In fact, during the allocation process we already hold a portion of
>>>>>>>> the pages, which means that waiting for the remaining page
>>>>>>>> allocations to complete before carrying out file reads is actually
>>>>>>>> unfair to the pages that have already been allocated.
>>>>>>>>
>>>>>>>> The allocation of pages is sequential, and the reading of the file
>>>>>>>> is also sequential, with the content and size corresponding to the
>>>>>>>> file. This means that the memory location of each page, which holds
>>>>>>>> the content of a specific position in the file, can be determined at
>>>>>>>> allocation time.
>>>>>>>>
>>>>>>>> However, to fully leverage I/O performance, it is best to wait and
>>>>>>>> gather a certain number of pages before initiating batch processing.
>>>>>>>>
>>>>>>>> The default gather size is 128MB. Each gathered batch can be seen
>>>>>>>> as one file-read work item: it maps the gathered pages into the
>>>>>>>> vmalloc area to obtain a contiguous virtual address, which is used
>>>>>>>> as the buffer holding the contents of the corresponding file range.
>>>>>>>> So, when using direct I/O to read the file, the file content is
>>>>>>>> written directly into the corresponding dma-buf backing memory
>>>>>>>> without any additional copying (compared to a pipe buffer).
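>>>>>>>>
>>>>>>>> As a rough illustration of one gather/read step (not the patch code;
>>>>>>>> the helper name and error handling here are made up, and direct I/O
>>>>>>>> into a vmalloc'ed buffer additionally needs the kvec support from [7]):
>>>>>>>>
>>>>>>>>     /* Needs <linux/vmalloc.h>, <linux/fs.h>, <linux/mm.h>. */
>>>>>>>>     static int heap_read_gather(struct file *file, struct page **pages,
>>>>>>>>                                 unsigned int npages, loff_t *pos)
>>>>>>>>     {
>>>>>>>>             size_t len = (size_t)npages << PAGE_SHIFT;
>>>>>>>>             /* Map the gathered pages to one contiguous kernel address. */
>>>>>>>>             void *vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
>>>>>>>>             ssize_t ret;
>>>>>>>>
>>>>>>>>             if (!vaddr)
>>>>>>>>                     return -ENOMEM;
>>>>>>>>
>>>>>>>>             /* With O_DIRECT the data lands in the dma-buf pages with
>>>>>>>>              * no page-cache copy in between. */
>>>>>>>>             ret = kernel_read(file, vaddr, len, pos);
>>>>>>>>             vunmap(vaddr);
>>>>>>>>
>>>>>>>>             return ret < 0 ? (int)ret : 0;
>>>>>>>>     }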
>>>>>>>>
>>>>>>>> Consider the other ways to read into a dma-buf. If we read after
>>>>>>>> mmap'ing the dma-buf, we need to map the dma-buf pages into the user
>>>>>>>> virtual address space, and the udmabuf memfd needs the same mapping
>>>>>>>> operation. Even if we supported sendfile, the file copy would still
>>>>>>>> need a buffer that has to be set up. So, mapping pages into the
>>>>>>>> vmalloc area does not incur any additional performance overhead
>>>>>>>> compared to the other methods. [6]
>>>>>>>>
>>>>>>>> Certainly, the administrator can also modify the gather size
>>>>>>>> through patch 5.
>>>>>>>>
>>>>>>>> The formula for the time taken for system_heap buffer allocation
>>>>>>>> and file reading through async_read is as follows:
>>>>>>>>
>>>>>>>>    T(total) = T(first gather page) + Max(T(remain alloc), T(I/O))
>>>>>>>>
>>>>>>>> Compared to the synchronous read:
>>>>>>>>
>>>>>>>>    T(total) = T(alloc) + T(I/O)
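>>>>>>>>
>>>>>>>> As a purely illustrative example (made-up numbers): if T(alloc) is
>>>>>>>> 0.2s and T(I/O) is 1.4s, the synchronous path costs about
>>>>>>>> 0.2 + 1.4 = 1.6s, while the async path costs roughly
>>>>>>>> T(first gather page) + Max(0.2, 1.4), i.e. a little over 1.4s.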
>>>>>>>>
>>>>>>>> Whichever of the allocation time and the I/O time is larger
>>>>>>>> dominates the total; the shorter one is hidden behind it.
>>>>>>>>
>>>>>>>> Therefore, the larger the size of the file that needs to be 
>>>>>>>> read, the
>>>>>>>> greater the corresponding benefits will be.
>>>>>>>>
>>>>>>>> How to use
>>>>>>>> ===
>>>>>>>> Consider the current pathway for loading model files into DMA-BUF
>>>>>>>> (a sketch follows the steps):
>>>>>>>>    1. open dma-heap, get heap fd
>>>>>>>>    2. open file, get file_fd (can't use O_DIRECT)
>>>>>>>>    3. use the file length to allocate the dma-buf, get dma-buf fd
>>>>>>>>    4. mmap the dma-buf fd, get vaddr
>>>>>>>>    5. read(file_fd, vaddr, file_size) into the dma-buf pages
>>>>>>>>    6. share, attach, whatever you want
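>>>>>>>>
>>>>>>>> A minimal userspace sketch of that pathway (heap name and error
>>>>>>>> handling simplified; the file path is a placeholder):
>>>>>>>>
>>>>>>>>     #include <fcntl.h>
>>>>>>>>     #include <sys/ioctl.h>
>>>>>>>>     #include <sys/mman.h>
>>>>>>>>     #include <sys/stat.h>
>>>>>>>>     #include <unistd.h>
>>>>>>>>     #include <linux/dma-heap.h>
>>>>>>>>
>>>>>>>>     int load_file_into_dmabuf(const char *path)
>>>>>>>>     {
>>>>>>>>             int heap_fd = open("/dev/dma_heap/system", O_RDWR);
>>>>>>>>             int file_fd = open(path, O_RDONLY);     /* buffered read */
>>>>>>>>             struct stat st;
>>>>>>>>
>>>>>>>>             fstat(file_fd, &st);
>>>>>>>>
>>>>>>>>             struct dma_heap_allocation_data data = {
>>>>>>>>                     .len = st.st_size,
>>>>>>>>                     .fd_flags = O_RDWR | O_CLOEXEC,
>>>>>>>>             };
>>>>>>>>             ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
>>>>>>>>
>>>>>>>>             /* Map the dma-buf and fill it via the page cache. */
>>>>>>>>             void *vaddr = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
>>>>>>>>                                MAP_SHARED, data.fd, 0);
>>>>>>>>             read(file_fd, vaddr, st.st_size);
>>>>>>>>
>>>>>>>>             munmap(vaddr, st.st_size);
>>>>>>>>             close(file_fd);
>>>>>>>>             close(heap_fd);
>>>>>>>>             return data.fd;     /* share/attach as needed */
>>>>>>>>     }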
>>>>>>>>
>>>>>>>> Using DMA_HEAP_ALLOC_AND_READ_FILE needs just a small change (see
>>>>>>>> the sketch after the steps):
>>>>>>>>    1. open dma-heap, get heap fd
>>>>>>>>    2. open file, get file_fd (buffered/direct)
>>>>>>>>    3. allocate the dma-buf with the DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>>>>>       flag, passing file_fd instead of len; get a dma-buf fd that
>>>>>>>>       already contains the file content
>>>>>>>>    4. share, attach, whatever you want
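>>>>>>>>
>>>>>>>> A sketch of the new flow, assuming kernel headers carrying the
>>>>>>>> DMA_HEAP_ALLOC_AND_READ_FILE flag from patch 1; passing the file fd
>>>>>>>> in the len field follows this cover letter and is not mainline UAPI:
>>>>>>>>
>>>>>>>>     #define _GNU_SOURCE             /* for O_DIRECT */
>>>>>>>>     #include <fcntl.h>
>>>>>>>>     #include <sys/ioctl.h>
>>>>>>>>     #include <unistd.h>
>>>>>>>>     #include <linux/dma-heap.h>
>>>>>>>>
>>>>>>>>     int load_file_into_dmabuf_direct(const char *path)
>>>>>>>>     {
>>>>>>>>             int heap_fd = open("/dev/dma_heap/system", O_RDWR);
>>>>>>>>             int file_fd = open(path, O_RDONLY | O_DIRECT);
>>>>>>>>
>>>>>>>>             struct dma_heap_allocation_data data = {
>>>>>>>>                     .len = (__u64)file_fd,  /* file fd instead of length */
>>>>>>>>                     .fd_flags = O_RDWR | O_CLOEXEC,
>>>>>>>>                     .heap_flags = DMA_HEAP_ALLOC_AND_READ_FILE,
>>>>>>>>             };
>>>>>>>>             /* On return, data.fd is a dma-buf whose pages already hold
>>>>>>>>              * the file content. */
>>>>>>>>             ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
>>>>>>>>
>>>>>>>>             close(file_fd);
>>>>>>>>             close(heap_fd);
>>>>>>>>             return data.fd;
>>>>>>>>     }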
>>>>>>>>
>>>>>>>> So, testing it is easy.
>>>>>>>>
>>>>>>>> How to test
>>>>>>>> ===
>>>>>>>> The performance comparison covers the following scenarios:
>>>>>>>>    1. normal
>>>>>>>>    2. udmabuf with patch [3]
>>>>>>>>    3. sendfile
>>>>>>>>    4. only patch 1
>>>>>>>>    5. patches 1-4
>>>>>>>>
>>>>>>>> normal:
>>>>>>>>    1. open dma-heap, get heap fd
>>>>>>>>    2. open file, get file_fd (can't use O_DIRECT)
>>>>>>>>    3. use the file length to allocate the dma-buf, get dma-buf fd
>>>>>>>>    4. mmap the dma-buf fd, get vaddr
>>>>>>>>    5. read(file_fd, vaddr, file_size) into the dma-buf pages
>>>>>>>>    6. share, attach, whatever you want
>>>>>>>>
>>>>>>>> UDMA-BUF steps (sketch below):
>>>>>>>>    1. memfd_create
>>>>>>>>    2. open file (buffered/direct)
>>>>>>>>    3. udmabuf create
>>>>>>>>    4. mmap the memfd
>>>>>>>>    5. read the file into the memfd vaddr
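>>>>>>>>
>>>>>>>> A sketch of the udmabuf path (mainline UAPI; size is assumed
>>>>>>>> page-aligned, error handling omitted):
>>>>>>>>
>>>>>>>>     #define _GNU_SOURCE             /* memfd_create, F_ADD_SEALS */
>>>>>>>>     #include <fcntl.h>
>>>>>>>>     #include <sys/ioctl.h>
>>>>>>>>     #include <sys/mman.h>
>>>>>>>>     #include <unistd.h>
>>>>>>>>     #include <linux/udmabuf.h>
>>>>>>>>
>>>>>>>>     int load_file_via_udmabuf(const char *path, size_t size)
>>>>>>>>     {
>>>>>>>>             int memfd = memfd_create("model", MFD_ALLOW_SEALING);
>>>>>>>>             ftruncate(memfd, size);
>>>>>>>>             /* udmabuf requires the memfd to be sealed against shrinking. */
>>>>>>>>             fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);
>>>>>>>>
>>>>>>>>             int dev_fd = open("/dev/udmabuf", O_RDWR);
>>>>>>>>             struct udmabuf_create create = {
>>>>>>>>                     .memfd = memfd, .offset = 0, .size = size,
>>>>>>>>             };
>>>>>>>>             int buf_fd = ioctl(dev_fd, UDMABUF_CREATE, &create);
>>>>>>>>
>>>>>>>>             /* Fill the backing memory through the memfd mapping. */
>>>>>>>>             void *vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE,
>>>>>>>>                                MAP_SHARED, memfd, 0);
>>>>>>>>             int file_fd = open(path, O_RDONLY);
>>>>>>>>             read(file_fd, vaddr, size);
>>>>>>>>
>>>>>>>>             munmap(vaddr, size);
>>>>>>>>             close(file_fd);
>>>>>>>>             close(dev_fd);
>>>>>>>>             return buf_fd;      /* dma-buf backed by the memfd pages */
>>>>>>>>     }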
>>>>>>>>
>>>>>>>> Sendfile steps (needs a suitable splice_write/write_iter on the
>>>>>>>> dma-buf side, used only for comparison; sketch below):
>>>>>>>>    1. open dma-heap, get heap fd
>>>>>>>>    2. open file, get file_fd (buffered/direct)
>>>>>>>>    3. use the file length to allocate the dma-buf, get dma-buf fd
>>>>>>>>    4. sendfile from file_fd to the dma-buf fd
>>>>>>>>    5. share, attach, whatever you want
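>>>>>>>>
>>>>>>>> A sketch of the sendfile comparison (only meaningful with the extra
>>>>>>>> splice_write/write_iter support mentioned above; mainline dma-buf
>>>>>>>> does not provide it):
>>>>>>>>
>>>>>>>>     #include <sys/sendfile.h>
>>>>>>>>     #include <sys/types.h>
>>>>>>>>
>>>>>>>>     /* dmabuf_fd was allocated with len == file_size as in the normal
>>>>>>>>      * path; copy the whole file into it through the pipe machinery. */
>>>>>>>>     static int fill_dmabuf_sendfile(int dmabuf_fd, int file_fd,
>>>>>>>>                                     size_t file_size)
>>>>>>>>     {
>>>>>>>>             off_t off = 0;      /* read offset into file_fd */
>>>>>>>>
>>>>>>>>             while ((size_t)off < file_size) {
>>>>>>>>                     ssize_t n = sendfile(dmabuf_fd, file_fd, &off,
>>>>>>>>                                          file_size - off);
>>>>>>>>                     if (n <= 0)
>>>>>>>>                             return -1;
>>>>>>>>             }
>>>>>>>>             return 0;
>>>>>>>>     }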
>>>>>>>>
>>>>>>>> patch 1 / patches 1-4:
>>>>>>>>    1. open dma-heap, get heap fd
>>>>>>>>    2. open file, get file_fd (buffered/direct)
>>>>>>>>    3. allocate the dma-buf with the DMA_HEAP_ALLOC_AND_READ_FILE heap
>>>>>>>>       flag, passing file_fd instead of len; get a dma-buf fd that
>>>>>>>>       already contains the file content
>>>>>>>>    4. share, attach, whatever you want
>>>>>>>>
>>>>>>>> You can create a file to test it and compare the performance gap
>>>>>>>> between the schemes. It is best to compare file sizes ranging from
>>>>>>>> KB to MB to GB.
>>>>>>>>
>>>>>>>> The following test data compares the performance differences between
>>>>>>>> 512KB, 8MB, 1GB, and 3GB files under the various scenarios.
>>>>>>>>
>>>>>>>> Performance Test
>>>>>>>> ===
>>>>>>>>    12GB RAM phone
>>>>>>>>    UFS 4.0 (maximum speed is 4GB/s)
>>>>>>>>    f2fs
>>>>>>>>    kernel 6.1 with patch [7] (otherwise kvec direct I/O reads are not
>>>>>>>>    supported)
>>>>>>>>    no memory pressure
>>>>>>>>    drop_caches is used before each test
>>>>>>>>
>>>>>>>> The average of 5 test results:
>>>>>>>> | scheme-size         | 512KB(ns)  | 8MB(ns)    | 1GB(ns)       | 3GB(ns)       |
>>>>>>>> | ------------------- | ---------- | ---------- | ------------- | ------------- |
>>>>>>>> | normal              | 2,790,861  | 14,535,784 | 1,520,790,492 | 3,332,438,754 |
>>>>>>>> | udmabuf buffer I/O  | 1,704,046  | 11,313,476 | 821,348,000   | 2,108,419,923 |
>>>>>>>> | sendfile buffer I/O | 3,261,261  | 12,112,292 | 1,565,939,938 | 3,062,052,984 |
>>>>>>>> | patch1-4 buffer I/O | 2,064,538  | 10,771,474 | 986,338,800   | 2,187,570,861 |
>>>>>>>> | sendfile direct I/O | 12,844,231 | 37,883,938 | 5,110,299,184 | 9,777,661,077 |
>>>>>>>> | patch1 direct I/O   | 813,215    | 6,962,092  | 2,364,211,877 | 5,648,897,554 |
>>>>>>>> | udmabuf direct I/O  | 1,289,554  | 8,968,138  | 921,480,784   | 2,158,305,738 |
>>>>>>>> | patch1-4 direct I/O | 1,957,661  | 6,581,999  | 520,003,538   | 1,400,006,107 |
>>>>>>
>>>>>> With this test, sendfile, being based on a pipe buffer, does not help
>>>>>> much.
>>>>>>
>>>>>> udmabuf is good, but I think our OEM driver can't adopt it. (And AOSP
>>>>>> does not enable this feature.)
>>>>>>
>>>>>>
>>>>>> Anyway, I am sending this patchset in the hope of further 
>>>>>> discussion.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>>>
>>>>>>>> So, based on the test results:
>>>>>>>>
>>>>>>>> When the file is large, the full patchset gives the best performance:
>>>>>>>> compared to normal it is about a 50% improvement, while patch 1 alone
>>>>>>>> shows only a 41% degradation.
>>>>>>>> A typical performance breakdown for patch 1 is:
>>>>>>>>    1. alloc cost     188,802,693 ns
>>>>>>>>    2. vmap cost       42,491,385 ns
>>>>>>>>    3. file read cost 4,180,876,702 ns
>>>>>>>> Therefore, a single direct I/O read of a large file may not be the
>>>>>>>> optimal way to get good performance.
>>>>>>>>
>>>>>>>> The performance of direct I/O implemented by the sendfile 
>>>>>>>> method is the worst.
>>>>>>>>
>>>>>>>> When the file size is small, the difference in performance is not
>>>>>>>> significant. This is consistent with expectations.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Suggested use cases
>>>>>>>> ===
>>>>>>>>    1. When there is a need to read large files and system resources
>>>>>>>>       are scarce, especially when memory is limited (GB level). In
>>>>>>>>       this scenario, using direct I/O for file reading can even bring
>>>>>>>>       performance improvements. (May need patches 2-3.)
>>>>>>>>    2. For embedded devices with limited RAM, using direct I/O can save
>>>>>>>>       system resources and avoid unnecessary data copying. Therefore,
>>>>>>>>       even if performance is lower when reading small files, it can
>>>>>>>>       still be used effectively.
>>>>>>>>    3. If there is sufficient memory, pinning the page cache of the
>>>>>>>>       model files in memory and placing the files in the EROFS file
>>>>>>>>       system for read-only access may be better. (EROFS does not
>>>>>>>>       support direct I/O.)
>>>>>>>>
>>>>>>>>
>>>>>>>> Changelog
>>>>>>>> ===
>>>>>>>>   v1 [8]
>>>>>>>>   v1->v2:
>>>>>>>>     Use the heap flag method for alloc-and-read instead of adding a
>>>>>>>>     new DMA-buf ioctl command. [9]
>>>>>>>>     Split the patchset to facilitate review and testing:
>>>>>>>>       patch 1 implements alloc-and-read and adds the heap flag
>>>>>>>>       patches 2-4 add async read
>>>>>>>>       patch 5 makes the gather limit configurable
>>>>>>>>
>>>>>>>> Reference
>>>>>>>> ===
>>>>>>>> [1] https://lore.kernel.org/all/0393cf47-3fa2-4e32-8b3d-d5d5bdece298@amd.com/
>>>>>>>> [2] https://lore.kernel.org/all/ZpTnzkdolpEwFbtu@phenom.ffwll.local/
>>>>>>>> [3] https://lore.kernel.org/all/20240725021349.580574-1-link@vivo.com/
>>>>>>>> [4] https://lore.kernel.org/all/Zpf5R7fRZZmEwVuR@infradead.org/
>>>>>>>> [5] https://lore.kernel.org/all/ZpiHKY2pGiBuEq4z@infradead.org/
>>>>>>>> [6] https://lore.kernel.org/all/9b70db2e-e562-4771-be6b-1fa8df19e356@amd.com/
>>>>>>>> [7] https://patchew.org/linux/20230209102954.528942-1-dhowells@redhat.com/20230209102954.528942-7-dhowells@redhat.com/
>>>>>>>> [8] https://lore.kernel.org/all/20240711074221.459589-1-link@vivo.com/
>>>>>>>> [9] https://lore.kernel.org/all/5ccbe705-883c-4651-9e66-6b452c414c74@amd.com/
>>>>>>>>
>>>>>>>> Huan Yang (5):
>>>>>>>>    dma-buf: heaps: Introduce DMA_HEAP_ALLOC_AND_READ_FILE heap flag
>>>>>>>>    dma-buf: heaps: Introduce async alloc read ops
>>>>>>>>    dma-buf: heaps: support alloc async read file
>>>>>>>>    dma-buf: heaps: system_heap alloc support async read
>>>>>>>>    dma-buf: heaps: configurable async read gather limit
>>>>>>>>
>>>>>>>>   drivers/dma-buf/dma-heap.c          | 552 +++++++++++++++++++++++++++-
>>>>>>>>   drivers/dma-buf/heaps/system_heap.c |  70 +++-
>>>>>>>>   include/linux/dma-heap.h            |  53 ++-
>>>>>>>>   include/uapi/linux/dma-heap.h       |  11 +-
>>>>>>>>   4 files changed, 673 insertions(+), 13 deletions(-)
>>>>>>>>
>>>>>>>>
>>>>>>>> base-commit: 931a3b3bccc96e7708c82b30b2b5fa82dfd04890
>>>>>>>
>>>>>
>>>
>
