linux-kernel - Re: [PATCH 1/2] dma-buf: heaps: DMA_HEAP_IOCTL_ALLOC_READ

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ea841c1c-fd7f-4958-add1-d3e96f68a11b@vivo.com>
Date: Fri, 12 Jul 2024 10:14:39 +0800
From: Huan Yang <link@...o.com>
To: Christian König <christian.koenig@....com>,
 Sumit Semwal <sumit.semwal@...aro.org>,
 Benjamin Gaignard <benjamin.gaignard@...labora.com>,
 Brian Starkey <Brian.Starkey@....com>, John Stultz <jstultz@...gle.com>,
 "T.J. Mercier" <tjmercier@...gle.com>, linux-media@...r.kernel.org,
 dri-devel@...ts.freedesktop.org, linaro-mm-sig@...ts.linaro.org,
 linux-kernel@...r.kernel.org
Cc: opensource.kernel@...o.com
Subject: Re: [PATCH 1/2] dma-buf: heaps: DMA_HEAP_IOCTL_ALLOC_READ_FILE
 framework


在 2024/7/12 9:59, Huan Yang 写道:
> Hi Christian,
>
> 在 2024/7/11 19:39, Christian König 写道:
>> Am 11.07.24 um 11:18 schrieb Huan Yang:
>>> Hi Christian,
>>>
>>> Thanks for your reply.
>>>
>>> 在 2024/7/11 17:00, Christian König 写道:
>>>> Am 11.07.24 um 09:42 schrieb Huan Yang:
>>>>> Some user may need load file into dma-buf, current
>>>>> way is:
>>>>>    1. allocate a dma-buf, get dma-buf fd
>>>>>    2. mmap dma-buf fd into vaddr
>>>>>    3. read(file_fd, vaddr, fsz)
>>>>> This is too heavy if fsz reached to GB.
>>>>
>>>> You need to describe a bit more why that is to heavy. I can only 
>>>> assume you need to save memory bandwidth and avoid the extra copy 
>>>> with the CPU.
>>>
>>> Sorry for the oversimplified explanation. But, yes, you're right, we 
>>> want to avoid this.
>>>
>>> As we are dealing with embedded devices, the available memory and 
>>> computing power for users are usually limited.(The maximum available 
>>> memory is currently
>>>
>>> 24GB, typically ranging from 8-12GB. )
>>>
>>> Also, the CPU computing power is also usually in short supply, due 
>>> to limited battery capacity and limited heat dissipation capabilities.
>>>
>>> So, we hope to avoid ineffective paths as much as possible.
>>>
>>>>
>>>>> This patch implement a feature called DMA_HEAP_IOCTL_ALLOC_READ_FILE.
>>>>> User need to offer a file_fd which you want to load into dma-buf, 
>>>>> then,
>>>>> it promise if you got a dma-buf fd, it will contains the file 
>>>>> content.
>>>>
>>>> Interesting idea, that has at least more potential than trying to 
>>>> enable direct I/O on mmap()ed DMA-bufs.
>>>>
>>>> The approach with the new IOCTL might not work because it is a very 
>>>> specialized use case.
>>>
>>> Thank you for your advice. maybe the "read file" behavior can be 
>>> attached to an existing allocation?
>>
>> The point is there are already system calls to do something like that.
>>
>> See copy_file_range() 
>> (https://man7.org/linux/man-pages/man2/copy_file_range.2.html) 
>> and send_file() 
>> (https://man7.org/linux/man-pages/man2/sendfile.2.html).
>
> That's helpfull to learn it, thanks.
>
> In terms of only DMA-BUF supporting direct I/O, 
> copy_file_range/send_file may help to achieve this functionality.
>
> However, my patchset also aims to achieve parallel copying of file 
> contents while allocating the DMA-BUF, which is something that the 
> current set of calls may not be able to accomplish.

You can see cover-letter, here are the normal test and this IOCTL's 
compare in memory pressure, even if buffered I/O in this ioctl can have 
50% improve by  parallel.

dd a 3GB file for test, 12G RAM phone, UFS4.0, stressapptest 4G memory pressure.

1. original
```shel
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached normal

> result is total cost 13087213847ns

```

2.DMA_HEAP_IOCTL_ALLOC_AND_READ O_DIRECT
```shel
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached direct_io

> result is total cost 2902386846ns

# use direct_io_check can check the content if is same to file.
```

3. DMA_HEAP_IOCTL_ALLOC_AND_READ BUFFER I/O
```shel
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached normal_io

> result is total cost 5735579385ns

```

>
> Perhaps simply returning the DMA-BUF file descriptor and then 
> implementing copy_file_range, while populating the memory and content 
> during the copy process, could achieve this? At present, it seems that 
> it will be quite complex - We need to ensure that only the returned 
> DMA-BUF file descriptor will fail in case of memory not fill, like 
> mmap, vmap, attach, and so on.
>
>>
>> What we probably could do is to internally optimize those.
>>
>>> I am currently creating a new ioctl to remind the user that memory 
>>> is being allocated and read, and I am also unsure
>>>
>>> whether it is appropriate to add additional parameters to the 
>>> existing allocate behavior.
>>>
>>> Please, give me more suggestion. Thanks.
>>>
>>>>
>>>> But IIRC there was a copy_file_range callback in the 
>>>> file_operations structure you could use for that. I'm just not sure 
>>>> when and how that's used with the copy_file_range() system call.
>>>
>>> Sorry, I'm not familiar with this, but I will look into it. However, 
>>> this type of callback function is not currently implemented when 
>>> exporting
>>>
>>> the dma_buf file, which means that I need to implement the callback 
>>> for it?
>>
>> If I'm not completely mistaken the copy_file_range, splice_read and 
>> splice_write callbacks on the struct file_operations 
>> (https://elixir.bootlin.com/linux/v6.10-rc7/source/include/linux/fs.h#L1999).
>>
>> Can be used to implement what you want to do.
> Yes.
>>
>> Regards,
>> Christian.
>>
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Notice, file_fd depends on user how to open this file. So, both 
>>>>> buffer
>>>>> I/O and Direct I/O is supported.
>>>>>
>>>>> Signed-off-by: Huan Yang <link@...o.com>
>>>>> ---
>>>>>   drivers/dma-buf/dma-heap.c    | 525 
>>>>> +++++++++++++++++++++++++++++++++-
>>>>>   include/linux/dma-heap.h      |  57 +++-
>>>>>   include/uapi/linux/dma-heap.h |  32 +++
>>>>>   3 files changed, 611 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
>>>>> index 2298ca5e112e..abe17281adb8 100644
>>>>> --- a/drivers/dma-buf/dma-heap.c
>>>>> +++ b/drivers/dma-buf/dma-heap.c
>>>>> @@ -15,9 +15,11 @@
>>>>>   #include <linux/list.h>
>>>>>   #include <linux/slab.h>
>>>>>   #include <linux/nospec.h>
>>>>> +#include <linux/highmem.h>
>>>>>   #include <linux/uaccess.h>
>>>>>   #include <linux/syscalls.h>
>>>>>   #include <linux/dma-heap.h>
>>>>> +#include <linux/vmalloc.h>
>>>>>   #include <uapi/linux/dma-heap.h>
>>>>>     #define DEVNAME "dma_heap"
>>>>> @@ -43,12 +45,462 @@ struct dma_heap {
>>>>>       struct cdev heap_cdev;
>>>>>   };
>>>>>   +/**
>>>>> + * struct dma_heap_file - wrap the file, read task for dma_heap 
>>>>> allocate use.
>>>>> + * @file:        file to read from.
>>>>> + *
>>>>> + * @cred:        kthread use, user cred copy to use for the read.
>>>>> + *
>>>>> + * @max_batch:        maximum batch size to read, if collect 
>>>>> match batch,
>>>>> + *            trigger read, default 128MB, must below file size.
>>>>> + *
>>>>> + * @fsz:        file size.
>>>>> + *
>>>>> + * @direct:        use direct IO?
>>>>> + */
>>>>> +struct dma_heap_file {
>>>>> +    struct file *file;
>>>>> +    struct cred *cred;
>>>>> +    size_t max_batch;
>>>>> +    size_t fsz;
>>>>> +    bool direct;
>>>>> +};
>>>>> +
>>>>> +/**
>>>>> + * struct dma_heap_file_work - represents a dma_heap file read 
>>>>> real work.
>>>>> + * @vaddr:        contigous virtual address alloc by vmap, file 
>>>>> read need.
>>>>> + *
>>>>> + * @start_size:        file read start offset, same to 
>>>>> @dma_heap_file_task->roffset.
>>>>> + *
>>>>> + * @need_size:        file read need size, same to 
>>>>> @dma_heap_file_task->rsize.
>>>>> + *
>>>>> + * @heap_file:        file wrapper.
>>>>> + *
>>>>> + * @list:        child node of @dma_heap_file_control->works.
>>>>> + *
>>>>> + * @refp:        same @dma_heap_file_task->ref, if end of read, 
>>>>> put ref.
>>>>> + *
>>>>> + * @failp:        if any work io failed, set it true, pointp 
>>>>> @dma_heap_file_task->fail.
>>>>> + */
>>>>> +struct dma_heap_file_work {
>>>>> +    void *vaddr;
>>>>> +    ssize_t start_size;
>>>>> +    ssize_t need_size;
>>>>> +    struct dma_heap_file *heap_file;
>>>>> +    struct list_head list;
>>>>> +    atomic_t *refp;
>>>>> +    bool *failp;
>>>>> +};
>>>>> +
>>>>> +/**
>>>>> + * struct dma_heap_file_task - represents a dma_heap file read 
>>>>> process
>>>>> + * @ref:        current file work counter, if zero, allocate and 
>>>>> read
>>>>> + *            done.
>>>>> + *
>>>>> + * @roffset:        last read offset, current prepared work' 
>>>>> begin file
>>>>> + *            start offset.
>>>>> + *
>>>>> + * @rsize:        current allocated page size use to read, if 
>>>>> reach rbatch,
>>>>> + *            trigger commit.
>>>>> + *
>>>>> + * @rbatch:        current prepared work's batch, below 
>>>>> @dma_heap_file's
>>>>> + *            batch.
>>>>> + *
>>>>> + * @heap_file:        current dma_heap_file
>>>>> + *
>>>>> + * @parray:        used for vmap, size is @dma_heap_file's 
>>>>> batch's number
>>>>> + *            pages.(this is maximum). Due to single thread file 
>>>>> read,
>>>>> + *            one page array reuse each work prepare is OK.
>>>>> + *            Each index in parray is PAGE_SIZE.(vmap need)
>>>>> + *
>>>>> + * @pindex:        current allocated page filled in @parray's index.
>>>>> + *
>>>>> + * @fail:        any work failed when file read?
>>>>> + *
>>>>> + * dma_heap_file_task is the production of file read, will 
>>>>> prepare each work
>>>>> + * during allocate dma_buf pages, if match current batch, then 
>>>>> trigger commit
>>>>> + * and prepare next work. After all batch queued, user going on 
>>>>> prepare dma_buf
>>>>> + * and so on, but before return dma_buf fd, need to wait file 
>>>>> read end and
>>>>> + * check read result.
>>>>> + */
>>>>> +struct dma_heap_file_task {
>>>>> +    atomic_t ref;
>>>>> +    size_t roffset;
>>>>> +    size_t rsize;
>>>>> +    size_t rbatch;
>>>>> +    struct dma_heap_file *heap_file;
>>>>> +    struct page **parray;
>>>>> +    unsigned int pindex;
>>>>> +    bool fail;
>>>>> +};
>>>>> +
>>>>> +/**
>>>>> + * struct dma_heap_file_control - global control of dma_heap file 
>>>>> read.
>>>>> + * @works:        @dma_heap_file_work's list head.
>>>>> + *
>>>>> + * @lock:        only lock for @works.
>>>>> + *
>>>>> + * @threadwq:        wait queue for @work_thread, if commit work, 
>>>>> @work_thread
>>>>> + *            wakeup and read this work's file contains.
>>>>> + *
>>>>> + * @workwq:        used for main thread wait for file read end, 
>>>>> if allocation
>>>>> + *            end before file read. @dma_heap_file_task ref 
>>>>> effect this.
>>>>> + *
>>>>> + * @work_thread:    file read kthread. the dma_heap_file_task 
>>>>> work's consumer.
>>>>> + *
>>>>> + * @heap_fwork_cachep:    @dma_heap_file_work's cachep, it's 
>>>>> alloc/free frequently.
>>>>> + *
>>>>> + * @nr_work:        global number of how many work committed.
>>>>> + */
>>>>> +struct dma_heap_file_control {
>>>>> +    struct list_head works;
>>>>> +    spinlock_t lock;
>>>>> +    wait_queue_head_t threadwq;
>>>>> +    wait_queue_head_t workwq;
>>>>> +    struct task_struct *work_thread;
>>>>> +    struct kmem_cache *heap_fwork_cachep;
>>>>> +    atomic_t nr_work;
>>>>> +};
>>>>> +
>>>>> +static struct dma_heap_file_control *heap_fctl;
>>>>>   static LIST_HEAD(heap_list);
>>>>>   static DEFINE_MUTEX(heap_list_lock);
>>>>>   static dev_t dma_heap_devt;
>>>>>   static struct class *dma_heap_class;
>>>>>   static DEFINE_XARRAY_ALLOC(dma_heap_minors);
>>>>>   +/**
>>>>> + * map_pages_to_vaddr - map each scatter page into contiguous 
>>>>> virtual address.
>>>>> + * @heap_ftask:        prepared and need to commit's work.
>>>>> + *
>>>>> + * Cached pages need to trigger file read, this function map each 
>>>>> scatter page
>>>>> + * into contiguous virtual address, so that file read can easy use.
>>>>> + * Now that we get vaddr page, cached pages can return to 
>>>>> original user, so we
>>>>> + * will not effect dma-buf export even if file read not end.
>>>>> + */
>>>>> +static void *map_pages_to_vaddr(struct dma_heap_file_task 
>>>>> *heap_ftask)
>>>>> +{
>>>>> +    return vmap(heap_ftask->parray, heap_ftask->pindex, VM_MAP,
>>>>> +            PAGE_KERNEL);
>>>>> +}
>>>>> +
>>>>> +bool dma_heap_prepare_file_read(struct dma_heap_file_task 
>>>>> *heap_ftask,
>>>>> +                struct page *page)
>>>>> +{
>>>>> +    struct page **array = heap_ftask->parray;
>>>>> +    int index = heap_ftask->pindex;
>>>>> +    int num = compound_nr(page), i;
>>>>> +    unsigned long sz = page_size(page);
>>>>> +
>>>>> +    heap_ftask->rsize += sz;
>>>>> +    for (i = 0; i < num; ++i)
>>>>> +        array[index++] = &page[i];
>>>>> +    heap_ftask->pindex = index;
>>>>> +
>>>>> +    return heap_ftask->rsize >= heap_ftask->rbatch;
>>>>> +}
>>>>> +
>>>>> +static struct dma_heap_file_work *
>>>>> +init_file_work(struct dma_heap_file_task *heap_ftask)
>>>>> +{
>>>>> +    struct dma_heap_file_work *heap_fwork;
>>>>> +    struct dma_heap_file *heap_file = heap_ftask->heap_file;
>>>>> +
>>>>> +    if (READ_ONCE(heap_ftask->fail))
>>>>> +        return NULL;
>>>>> +
>>>>> +    heap_fwork = kmem_cache_alloc(heap_fctl->heap_fwork_cachep, 
>>>>> GFP_KERNEL);
>>>>> +    if (unlikely(!heap_fwork))
>>>>> +        return NULL;
>>>>> +
>>>>> +    heap_fwork->vaddr = map_pages_to_vaddr(heap_ftask);
>>>>> +    if (unlikely(!heap_fwork->vaddr)) {
>>>>> +        kmem_cache_free(heap_fctl->heap_fwork_cachep, heap_fwork);
>>>>> +        return NULL;
>>>>> +    }
>>>>> +
>>>>> +    heap_fwork->heap_file = heap_file;
>>>>> +    heap_fwork->start_size = heap_ftask->roffset;
>>>>> +    heap_fwork->need_size = heap_ftask->rsize;
>>>>> +    heap_fwork->refp = &heap_ftask->ref;
>>>>> +    heap_fwork->failp = &heap_ftask->fail;
>>>>> +    atomic_inc(&heap_ftask->ref);
>>>>> +    return heap_fwork;
>>>>> +}
>>>>> +
>>>>> +static void destroy_file_work(struct dma_heap_file_work *heap_fwork)
>>>>> +{
>>>>> +    vunmap(heap_fwork->vaddr);
>>>>> +    atomic_dec(heap_fwork->refp);
>>>>> +    wake_up(&heap_fctl->workwq);
>>>>> +
>>>>> +    kmem_cache_free(heap_fctl->heap_fwork_cachep, heap_fwork);
>>>>> +}
>>>>> +
>>>>> +int dma_heap_submit_file_read(struct dma_heap_file_task *heap_ftask)
>>>>> +{
>>>>> +    struct dma_heap_file_work *heap_fwork = 
>>>>> init_file_work(heap_ftask);
>>>>> +    struct page *last = NULL;
>>>>> +    struct dma_heap_file *heap_file = heap_ftask->heap_file;
>>>>> +    size_t start = heap_ftask->roffset;
>>>>> +    struct file *file = heap_file->file;
>>>>> +    size_t fsz = heap_file->fsz;
>>>>> +
>>>>> +    if (unlikely(!heap_fwork))
>>>>> +        return -ENOMEM;
>>>>> +
>>>>> +    /**
>>>>> +     * If file size is not page aligned, direct io can't process 
>>>>> the tail.
>>>>> +     * So, if reach to tail, remain the last page use buffer read.
>>>>> +     */
>>>>> +    if (heap_file->direct && start + heap_ftask->rsize > fsz) {
>>>>> +        heap_fwork->need_size -= PAGE_SIZE;
>>>>> +        last = heap_ftask->parray[heap_ftask->pindex - 1];
>>>>> +    }
>>>>> +
>>>>> +    spin_lock(&heap_fctl->lock);
>>>>> +    list_add_tail(&heap_fwork->list, &heap_fctl->works);
>>>>> +    spin_unlock(&heap_fctl->lock);
>>>>> +    atomic_inc(&heap_fctl->nr_work);
>>>>> +
>>>>> +    wake_up(&heap_fctl->threadwq);
>>>>> +
>>>>> +    if (last) {
>>>>> +        char *buf, *pathp;
>>>>> +        ssize_t err;
>>>>> +        void *buffer;
>>>>> +
>>>>> +        buf = kmalloc(PATH_MAX, GFP_KERNEL);
>>>>> +        if (unlikely(!buf))
>>>>> +            return -ENOMEM;
>>>>> +
>>>>> +        start = PAGE_ALIGN_DOWN(fsz);
>>>>> +
>>>>> +        pathp = file_path(file, buf, PATH_MAX);
>>>>> +        if (IS_ERR(pathp)) {
>>>>> +            kfree(buf);
>>>>> +            return PTR_ERR(pathp);
>>>>> +        }
>>>>> +
>>>>> +        buffer = kmap_local_page(last); // use page's kaddr.
>>>>> +        err = kernel_read_file_from_path(pathp, start, &buffer,
>>>>> +                         fsz - start, &fsz,
>>>>> +                         READING_POLICY);
>>>>> +        kunmap_local(buffer);
>>>>> +        kfree(buf);
>>>>> +        if (err < 0) {
>>>>> +            pr_err("failed to use buffer kernel_read_file %s, 
>>>>> err=%ld, [%ld, %ld], f_sz=%ld\n",
>>>>> +                   pathp, err, start, fsz, fsz);
>>>>> +
>>>>> +            return err;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    heap_ftask->roffset += heap_ftask->rsize;
>>>>> +    heap_ftask->rsize = 0;
>>>>> +    heap_ftask->pindex = 0;
>>>>> +    heap_ftask->rbatch = min_t(size_t,
>>>>> +                   PAGE_ALIGN(fsz) - heap_ftask->roffset,
>>>>> +                   heap_ftask->rbatch);
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +bool dma_heap_wait_for_file_read(struct dma_heap_file_task 
>>>>> *heap_ftask)
>>>>> +{
>>>>> +    wait_event_freezable(heap_fctl->workwq,
>>>>> +                 atomic_read(&heap_ftask->ref) == 0);
>>>>> +    return heap_ftask->fail;
>>>>> +}
>>>>> +
>>>>> +bool dma_heap_destroy_file_read(struct dma_heap_file_task 
>>>>> *heap_ftask)
>>>>> +{
>>>>> +    bool fail;
>>>>> +
>>>>> +    dma_heap_wait_for_file_read(heap_ftask);
>>>>> +    fail = heap_ftask->fail;
>>>>> +    kvfree(heap_ftask->parray);
>>>>> +    kfree(heap_ftask);
>>>>> +    return fail;
>>>>> +}
>>>>> +
>>>>> +struct dma_heap_file_task *
>>>>> +dma_heap_declare_file_read(struct dma_heap_file *heap_file)
>>>>> +{
>>>>> +    struct dma_heap_file_task *heap_ftask =
>>>>> +        kzalloc(sizeof(*heap_ftask), GFP_KERNEL);
>>>>> +    if (unlikely(!heap_ftask))
>>>>> +        return NULL;
>>>>> +
>>>>> +    /**
>>>>> +     * Batch is the maximum size which we prepare work will meet.
>>>>> +     * So, direct alloc this number's page array is OK.
>>>>> +     */
>>>>> +    heap_ftask->parray = kvmalloc_array(heap_file->max_batch >> 
>>>>> PAGE_SHIFT,
>>>>> +                        sizeof(struct page *), GFP_KERNEL);
>>>>> +    if (unlikely(!heap_ftask->parray))
>>>>> +        goto put;
>>>>> +
>>>>> +    heap_ftask->heap_file = heap_file;
>>>>> +    heap_ftask->rbatch = heap_file->max_batch;
>>>>> +    return heap_ftask;
>>>>> +put:
>>>>> +    kfree(heap_ftask);
>>>>> +    return NULL;
>>>>> +}
>>>>> +
>>>>> +static void __work_this_io(struct dma_heap_file_work *heap_fwork)
>>>>> +{
>>>>> +    struct dma_heap_file *heap_file = heap_fwork->heap_file;
>>>>> +    struct file *file = heap_file->file;
>>>>> +    ssize_t start = heap_fwork->start_size;
>>>>> +    ssize_t size = heap_fwork->need_size;
>>>>> +    void *buffer = heap_fwork->vaddr;
>>>>> +    const struct cred *old_cred;
>>>>> +    ssize_t err;
>>>>> +
>>>>> +    // use real task's cred to read this file.
>>>>> +    old_cred = override_creds(heap_file->cred);
>>>>> +    err = kernel_read_file(file, start, &buffer, size, 
>>>>> &heap_file->fsz,
>>>>> +                   READING_POLICY);
>>>>> +    if (err < 0) {
>>>>> +        pr_err("use kernel_read_file, err=%ld, [%ld, %ld], 
>>>>> f_sz=%ld\n",
>>>>> +               err, start, (start + size), heap_file->fsz);
>>>>> +        WRITE_ONCE(*heap_fwork->failp, true);
>>>>> +    }
>>>>> +    // recovery to my cred.
>>>>> +    revert_creds(old_cred);
>>>>> +}
>>>>> +
>>>>> +static int dma_heap_file_control_thread(void *data)
>>>>> +{
>>>>> +    struct dma_heap_file_control *heap_fctl =
>>>>> +        (struct dma_heap_file_control *)data;
>>>>> +    struct dma_heap_file_work *worker, *tmp;
>>>>> +    int nr_work;
>>>>> +
>>>>> +    LIST_HEAD(pages);
>>>>> +    LIST_HEAD(workers);
>>>>> +
>>>>> +    while (true) {
>>>>> +        wait_event_freezable(heap_fctl->threadwq,
>>>>> + atomic_read(&heap_fctl->nr_work) > 0);
>>>>> +recheck:
>>>>> +        spin_lock(&heap_fctl->lock);
>>>>> +        list_splice_init(&heap_fctl->works, &workers);
>>>>> +        spin_unlock(&heap_fctl->lock);
>>>>> +
>>>>> +        if (unlikely(kthread_should_stop())) {
>>>>> +            list_for_each_entry_safe(worker, tmp, &workers, list) {
>>>>> +                list_del(&worker->list);
>>>>> +                destroy_file_work(worker);
>>>>> +            }
>>>>> +            break;
>>>>> +        }
>>>>> +
>>>>> +        nr_work = 0;
>>>>> +        list_for_each_entry_safe(worker, tmp, &workers, list) {
>>>>> +            ++nr_work;
>>>>> +            list_del(&worker->list);
>>>>> +            __work_this_io(worker);
>>>>> +
>>>>> +            destroy_file_work(worker);
>>>>> +        }
>>>>> +        atomic_sub(nr_work, &heap_fctl->nr_work);
>>>>> +
>>>>> +        if (atomic_read(&heap_fctl->nr_work) > 0)
>>>>> +            goto recheck;
>>>>> +    }
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +size_t dma_heap_file_size(struct dma_heap_file *heap_file)
>>>>> +{
>>>>> +    return heap_file->fsz;
>>>>> +}
>>>>> +
>>>>> +static int prepare_dma_heap_file(struct dma_heap_file *heap_file, 
>>>>> int file_fd,
>>>>> +                 size_t batch)
>>>>> +{
>>>>> +    struct file *file;
>>>>> +    size_t fsz;
>>>>> +    int ret;
>>>>> +
>>>>> +    file = fget(file_fd);
>>>>> +    if (!file)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    fsz = i_size_read(file_inode(file));
>>>>> +    if (fsz < batch) {
>>>>> +        ret = -EINVAL;
>>>>> +        goto err;
>>>>> +    }
>>>>> +
>>>>> +    /**
>>>>> +     * Selinux block our read, but actually we are reading the 
>>>>> stand-in
>>>>> +     * for this file.
>>>>> +     * So save current's cred and when going to read, override 
>>>>> mine, and
>>>>> +     * end of read, revert.
>>>>> +     */
>>>>> +    heap_file->cred = prepare_kernel_cred(current);
>>>>> +    if (unlikely(!heap_file->cred)) {
>>>>> +        ret = -ENOMEM;
>>>>> +        goto err;
>>>>> +    }
>>>>> +
>>>>> +    heap_file->file = file;
>>>>> +    heap_file->max_batch = batch;
>>>>> +    heap_file->fsz = fsz;
>>>>> +
>>>>> +    heap_file->direct = file->f_flags & O_DIRECT;
>>>>> +
>>>>> +#define DMA_HEAP_SUGGEST_DIRECT_IO_SIZE (1UL << 30)
>>>>> +    if (!heap_file->direct && fsz >= 
>>>>> DMA_HEAP_SUGGEST_DIRECT_IO_SIZE)
>>>>> +        pr_warn("alloc read file better to use O_DIRECT to read 
>>>>> larget file\n");
>>>>> +
>>>>> +    return 0;
>>>>> +
>>>>> +err:
>>>>> +    fput(file);
>>>>> +    return ret;
>>>>> +}
>>>>> +
>>>>> +static void destroy_dma_heap_file(struct dma_heap_file *heap_file)
>>>>> +{
>>>>> +    fput(heap_file->file);
>>>>> +    put_cred(heap_file->cred);
>>>>> +}
>>>>> +
>>>>> +static int dma_heap_buffer_alloc_read_file(struct dma_heap *heap, 
>>>>> int file_fd,
>>>>> +                       size_t batch, unsigned int fd_flags,
>>>>> +                       unsigned int heap_flags)
>>>>> +{
>>>>> +    struct dma_buf *dmabuf;
>>>>> +    int fd;
>>>>> +    struct dma_heap_file heap_file;
>>>>> +
>>>>> +    fd = prepare_dma_heap_file(&heap_file, file_fd, batch);
>>>>> +    if (fd)
>>>>> +        goto error_file;
>>>>> +
>>>>> +    dmabuf = heap->ops->allocate_read_file(heap, &heap_file, 
>>>>> fd_flags,
>>>>> +                           heap_flags);
>>>>> +    if (IS_ERR(dmabuf)) {
>>>>> +        fd = PTR_ERR(dmabuf);
>>>>> +        goto error;
>>>>> +    }
>>>>> +
>>>>> +    fd = dma_buf_fd(dmabuf, fd_flags);
>>>>> +    if (fd < 0) {
>>>>> +        dma_buf_put(dmabuf);
>>>>> +        /* just return, as put will call release and that will 
>>>>> free */
>>>>> +    }
>>>>> +
>>>>> +error:
>>>>> +    destroy_dma_heap_file(&heap_file);
>>>>> +error_file:
>>>>> +    return fd;
>>>>> +}
>>>>> +
>>>>>   static int dma_heap_buffer_alloc(struct dma_heap *heap, size_t len,
>>>>>                    u32 fd_flags,
>>>>>                    u64 heap_flags)
>>>>> @@ -93,6 +545,38 @@ static int dma_heap_open(struct inode *inode, 
>>>>> struct file *file)
>>>>>       return 0;
>>>>>   }
>>>>>   +static long dma_heap_ioctl_allocate_read_file(struct file 
>>>>> *file, void *data)
>>>>> +{
>>>>> +    struct dma_heap_allocation_file_data *heap_allocation_file = 
>>>>> data;
>>>>> +    struct dma_heap *heap = file->private_data;
>>>>> +    int fd;
>>>>> +
>>>>> +    if (heap_allocation_file->fd || !heap_allocation_file->file_fd)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    if (heap_allocation_file->fd_flags & ~DMA_HEAP_VALID_FD_FLAGS)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    if (heap_allocation_file->heap_flags & 
>>>>> ~DMA_HEAP_VALID_HEAP_FLAGS)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    if (!heap->ops->allocate_read_file)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    fd = dma_heap_buffer_alloc_read_file(
>>>>> +        heap, heap_allocation_file->file_fd,
>>>>> +        heap_allocation_file->batch ?
>>>>> +            PAGE_ALIGN(heap_allocation_file->batch) :
>>>>> +            DEFAULT_ADI_BATCH,
>>>>> +        heap_allocation_file->fd_flags,
>>>>> +        heap_allocation_file->heap_flags);
>>>>> +    if (fd < 0)
>>>>> +        return fd;
>>>>> +
>>>>> +    heap_allocation_file->fd = fd;
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>>   static long dma_heap_ioctl_allocate(struct file *file, void *data)
>>>>>   {
>>>>>       struct dma_heap_allocation_data *heap_allocation = data;
>>>>> @@ -121,6 +605,7 @@ static long dma_heap_ioctl_allocate(struct 
>>>>> file *file, void *data)
>>>>>     static unsigned int dma_heap_ioctl_cmds[] = {
>>>>>       DMA_HEAP_IOCTL_ALLOC,
>>>>> +    DMA_HEAP_IOCTL_ALLOC_AND_READ,
>>>>>   };
>>>>>     static long dma_heap_ioctl(struct file *file, unsigned int ucmd,
>>>>> @@ -170,6 +655,9 @@ static long dma_heap_ioctl(struct file *file, 
>>>>> unsigned int ucmd,
>>>>>       case DMA_HEAP_IOCTL_ALLOC:
>>>>>           ret = dma_heap_ioctl_allocate(file, kdata);
>>>>>           break;
>>>>> +    case DMA_HEAP_IOCTL_ALLOC_AND_READ:
>>>>> +        ret = dma_heap_ioctl_allocate_read_file(file, kdata);
>>>>> +        break;
>>>>>       default:
>>>>>           ret = -ENOTTY;
>>>>>           goto err;
>>>>> @@ -316,11 +804,44 @@ static int dma_heap_init(void)
>>>>>         dma_heap_class = class_create(DEVNAME);
>>>>>       if (IS_ERR(dma_heap_class)) {
>>>>> -        unregister_chrdev_region(dma_heap_devt, NUM_HEAP_MINORS);
>>>>> -        return PTR_ERR(dma_heap_class);
>>>>> +        ret = PTR_ERR(dma_heap_class);
>>>>> +        goto fail_class;
>>>>>       }
>>>>>       dma_heap_class->devnode = dma_heap_devnode;
>>>>>   +    heap_fctl = kzalloc(sizeof(*heap_fctl), GFP_KERNEL);
>>>>> +    if (unlikely(!heap_fctl)) {
>>>>> +        ret =  -ENOMEM;
>>>>> +        goto fail_alloc;
>>>>> +    }
>>>>> +
>>>>> +    INIT_LIST_HEAD(&heap_fctl->works);
>>>>> +    init_waitqueue_head(&heap_fctl->threadwq);
>>>>> +    init_waitqueue_head(&heap_fctl->workwq);
>>>>> +
>>>>> +    heap_fctl->work_thread = 
>>>>> kthread_run(dma_heap_file_control_thread,
>>>>> +                         heap_fctl, "heap_fwork_t");
>>>>> +    if (IS_ERR(heap_fctl->work_thread)) {
>>>>> +        ret = -ENOMEM;
>>>>> +        goto fail_thread;
>>>>> +    }
>>>>> +
>>>>> +    heap_fctl->heap_fwork_cachep = KMEM_CACHE(dma_heap_file_work, 
>>>>> 0);
>>>>> +    if (unlikely(!heap_fctl->heap_fwork_cachep)) {
>>>>> +        ret = -ENOMEM;
>>>>> +        goto fail_cache;
>>>>> +    }
>>>>> +
>>>>>       return 0;
>>>>> +
>>>>> +fail_cache:
>>>>> +    kthread_stop(heap_fctl->work_thread);
>>>>> +fail_thread:
>>>>> +    kfree(heap_fctl);
>>>>> +fail_alloc:
>>>>> +    class_destroy(dma_heap_class);
>>>>> +fail_class:
>>>>> +    unregister_chrdev_region(dma_heap_devt, NUM_HEAP_MINORS);
>>>>> +    return ret;
>>>>>   }
>>>>>   subsys_initcall(dma_heap_init);
>>>>> diff --git a/include/linux/dma-heap.h b/include/linux/dma-heap.h
>>>>> index 064bad725061..9c25383f816c 100644
>>>>> --- a/include/linux/dma-heap.h
>>>>> +++ b/include/linux/dma-heap.h
>>>>> @@ -12,12 +12,17 @@
>>>>>   #include <linux/cdev.h>
>>>>>   #include <linux/types.h>
>>>>>   +#define DEFAULT_ADI_BATCH (128 << 20)
>>>>> +
>>>>>   struct dma_heap;
>>>>> +struct dma_heap_file_task;
>>>>> +struct dma_heap_file;
>>>>>     /**
>>>>>    * struct dma_heap_ops - ops to operate on a given heap
>>>>>    * @allocate:        allocate dmabuf and return struct dma_buf ptr
>>>>> - *
>>>>> + * @allocate_read_file: allocate dmabuf and read file, then 
>>>>> return struct
>>>>> + * dma_buf ptr.
>>>>>    * allocate returns dmabuf on success, ERR_PTR(-errno) on error.
>>>>>    */
>>>>>   struct dma_heap_ops {
>>>>> @@ -25,6 +30,11 @@ struct dma_heap_ops {
>>>>>                       unsigned long len,
>>>>>                       u32 fd_flags,
>>>>>                       u64 heap_flags);
>>>>> +
>>>>> +    struct dma_buf *(*allocate_read_file)(struct dma_heap *heap,
>>>>> +                          struct dma_heap_file *heap_file,
>>>>> +                          u32 fd_flags,
>>>>> +                          u64 heap_flags);
>>>>>   };
>>>>>     /**
>>>>> @@ -65,4 +75,49 @@ const char *dma_heap_get_name(struct dma_heap 
>>>>> *heap);
>>>>>    */
>>>>>   struct dma_heap *dma_heap_add(const struct dma_heap_export_info 
>>>>> *exp_info);
>>>>>   +/**
>>>>> + * dma_heap_destroy_file_read - waits for a file read to complete 
>>>>> then destroy it
>>>>> + * Returns: true if the file read failed, false otherwise
>>>>> + */
>>>>> +bool dma_heap_destroy_file_read(struct dma_heap_file_task 
>>>>> *heap_ftask);
>>>>> +
>>>>> +/**
>>>>> + * dma_heap_wait_for_file_read - waits for a file read to complete
>>>>> + * Returns: true if the file read failed, false otherwise
>>>>> + */
>>>>> +bool dma_heap_wait_for_file_read(struct dma_heap_file_task 
>>>>> *heap_ftask);
>>>>> +
>>>>> +/**
>>>>> + * dma_heap_alloc_file_read - Declare a task to read file when 
>>>>> allocate pages.
>>>>> + * @heap_file:        target file to read
>>>>> + *
>>>>> + * Return NULL if failed, otherwise return a struct pointer.
>>>>> + */
>>>>> +struct dma_heap_file_task *
>>>>> +dma_heap_declare_file_read(struct dma_heap_file *heap_file);
>>>>> +
>>>>> +/**
>>>>> + * dma_heap_prepare_file_read - cache each allocated page until 
>>>>> we meet this batch.
>>>>> + * @heap_ftask:        prepared and need to commit's work.
>>>>> + * @page:        current allocated page. don't care which order.
>>>>> + *
>>>>> + * Returns true if reach to batch, false so go on prepare.
>>>>> + */
>>>>> +bool dma_heap_prepare_file_read(struct dma_heap_file_task 
>>>>> *heap_ftask,
>>>>> +                struct page *page);
>>>>> +
>>>>> +/**
>>>>> + * dma_heap_commit_file_read -  prepare collect enough memory, 
>>>>> going to trigger IO
>>>>> + * @heap_ftask:            info that current IO needs
>>>>> + *
>>>>> + * This commit will also check if reach to tail read.
>>>>> + * For direct I/O submissions, it is necessary to pay attention 
>>>>> to file reads
>>>>> + * that are not page-aligned. For the unaligned portion of the 
>>>>> read, buffer IO
>>>>> + * needs to be triggered.
>>>>> + * Returns:
>>>>> + *   0 if all right, -errno if something wrong
>>>>> + */
>>>>> +int dma_heap_submit_file_read(struct dma_heap_file_task 
>>>>> *heap_ftask);
>>>>> +size_t dma_heap_file_size(struct dma_heap_file *heap_file);
>>>>> +
>>>>>   #endif /* _DMA_HEAPS_H */
>>>>> diff --git a/include/uapi/linux/dma-heap.h 
>>>>> b/include/uapi/linux/dma-heap.h
>>>>> index a4cf716a49fa..8c20e8b74eed 100644
>>>>> --- a/include/uapi/linux/dma-heap.h
>>>>> +++ b/include/uapi/linux/dma-heap.h
>>>>> @@ -39,6 +39,27 @@ struct dma_heap_allocation_data {
>>>>>       __u64 heap_flags;
>>>>>   };
>>>>>   +/**
>>>>> + * struct dma_heap_allocation_file_data - metadata passed from 
>>>>> userspace for
>>>>> + *                                      allocations and read file
>>>>> + * @fd:            will be populated with a fd which provides the
>>>>> + *     ��      handle to the allocated dma-buf
>>>>> + * @file_fd:        file descriptor to read from(suggested to use 
>>>>> O_DIRECT open file)
>>>>> + * @batch:        how many memory alloced then file read(bytes), 
>>>>> default 128MB
>>>>> + *            will auto aligned to PAGE_SIZE
>>>>> + * @fd_flags:        file descriptor flags used when allocating
>>>>> + * @heap_flags:        flags passed to heap
>>>>> + *
>>>>> + * Provided by userspace as an argument to the ioctl
>>>>> + */
>>>>> +struct dma_heap_allocation_file_data {
>>>>> +    __u32 fd;
>>>>> +    __u32 file_fd;
>>>>> +    __u32 batch;
>>>>> +    __u32 fd_flags;
>>>>> +    __u64 heap_flags;
>>>>> +};
>>>>> +
>>>>>   #define DMA_HEAP_IOC_MAGIC        'H'
>>>>>     /**
>>>>> @@ -50,4 +71,15 @@ struct dma_heap_allocation_data {
>>>>>   #define DMA_HEAP_IOCTL_ALLOC _IOWR(DMA_HEAP_IOC_MAGIC, 0x0,\
>>>>>                         struct dma_heap_allocation_data)
>>>>>   +/**
>>>>> + * DOC: DMA_HEAP_IOCTL_ALLOC_AND_READ - allocate memory from pool 
>>>>> and both
>>>>> + *                    read file when allocate memory.
>>>>> + *
>>>>> + * Takes a dma_heap_allocation_file_data struct and returns it 
>>>>> with the fd field
>>>>> + * populated with the dmabuf handle of the allocation. When 
>>>>> return, the dma-buf
>>>>> + * content is read from file.
>>>>> + */
>>>>> +#define DMA_HEAP_IOCTL_ALLOC_AND_READ \
>>>>> +    _IOWR(DMA_HEAP_IOC_MAGIC, 0x1, struct 
>>>>> dma_heap_allocation_file_data)
>>>>> +
>>>>>   #endif /* _UAPI_LINUX_DMABUF_POOL_H */
>>>>
>>