Message-ID: <63c40ed8-c28c-df17-d4e1-b823c5508020@amd.com>
Date: Fri, 17 Jun 2022 08:11:20 +0200
From: Christian König <christian.koenig@....com>
To: Daniel Vetter <daniel.vetter@...ll.ch>,
"T.J. Mercier" <tjmercier@...gle.com>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Suren Baghdasaryan <surenb@...gle.com>,
Kalesh Singh <kaleshsingh@...gle.com>,
Minchan Kim <minchan@...gle.com>,
Greg Kroah-Hartman <gregkh@...gle.com>,
John Stultz <jstultz@...gle.com>,
Sumit Semwal <sumit.semwal@...aro.org>,
Hridya Valsaraju <hridya@...gle.com>, kernel-team@...roid.com,
linux-media@...r.kernel.org, dri-devel@...ts.freedesktop.org,
linaro-mm-sig@...ts.linaro.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] dma-buf: Move sysfs work out of DMA-BUF export path
Am 15.06.22 um 20:32 schrieb Daniel Vetter:
> On Wed, 15 Jun 2022 at 19:43, T.J. Mercier <tjmercier@...gle.com> wrote:
>> On Wed, Jun 1, 2022 at 5:40 AM Daniel Vetter <daniel@...ll.ch> wrote:
>>> On Mon, May 30, 2022 at 08:12:16AM +0200, Christian König wrote:
>>>> Am 25.05.22 um 23:05 schrieb T.J. Mercier:
>>>>> On Wed, May 25, 2022 at 7:38 AM Daniel Vetter <daniel@...ll.ch> wrote:
>>>>>> On Tue, May 17, 2022 at 08:13:24AM +0200, Greg Kroah-Hartman wrote:
>>>>>>> On Mon, May 16, 2022 at 05:08:05PM -0700, T.J. Mercier wrote:
>>>>>>>> On Mon, May 16, 2022 at 12:21 PM Christian König
>>>>>>>> <christian.koenig@....com> wrote:
>>>>>>>>> Am 16.05.22 um 20:08 schrieb T.J. Mercier:
>>>>>>>>>> On Mon, May 16, 2022 at 10:20 AM Christian König
>>>>>>>>>> <christian.koenig@....com> wrote:
>>>>>>>>>>> Am 16.05.22 um 19:13 schrieb T.J. Mercier:
>>>>>>>>>>>> Recently, we noticed an issue where a process went into direct reclaim
>>>>>>>>>>>> while holding the kernfs rw semaphore for sysfs in write (exclusive)
>>>>>>>>>>>> mode. This caused processes who were doing DMA-BUF exports and releases
>>>>>>>>>>>> to go into uninterruptible sleep since they needed to acquire the same
>>>>>>>>>>>> semaphore for the DMA-BUF sysfs entry creation/deletion. In order to avoid
>>>>>>>>>>>> blocking DMA-BUF export for an indeterminate amount of time while
>>>>>>>>>>>> another process is holding the sysfs rw semaphore in exclusive mode,
>>>>>>>>>>>> this patch moves the per-buffer sysfs file creation to the default work
>>>>>>>>>>>> queue. Note that this can lead to a short-term inaccuracy in the dmabuf
>>>>>>>>>>>> sysfs statistics, but this is a tradeoff to prevent the hot path from
>>>>>>>>>>>> being blocked. A work_struct is added to dma_buf to achieve this, but as
>>>>>>>>>>>> it is unioned with the kobject in the sysfs_entry, dma_buf does not
>>>>>>>>>>>> increase in size.
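[A minimal sketch of the deferral pattern the commit message describes, reconstructed from the message and the hunks quoted below; context is abbreviated and <linux/kobject.h> plus <linux/workqueue.h> are assumed:]

struct dma_buf_sysfs_entry {
	union {
		struct kobject kobj;               /* valid once the worker has run */
		struct work_struct sysfs_add_work; /* valid until then */
	};
	struct dma_buf *dmabuf;
};

/* Export path: queue the sysfs creation instead of blocking on it. */
INIT_WORK(&sysfs_entry->sysfs_add_work, sysfs_add_workfn);
schedule_work(&sysfs_entry->sysfs_add_work);

Because the work_struct shares storage with the kobject, struct dma_buf does not grow; the kobject is only initialized once the worker runs, outside the export path.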
>>>>>>>>>>> I'm still not very keen on this approach, as it strongly feels like we
>>>>>>>>>>> are working around a shortcoming somewhere else.
>>>>>>>>>>>
>>>>>>>>>> My read of the thread for the last version is that we're running into
>>>>>>>>>> a situation where sysfs is getting used for something it wasn't
>>>>>>>>>> originally intended for, but we're also stuck with this sysfs
>>>>>>>>>> functionality for dmabufs.
>>>>>>>>>>
>>>>>>>>>>>> Fixes: bdb8d06dfefd ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
>>>>>>>>>>>> Originally-by: Hridya Valsaraju <hridya@...gle.com>
>>>>>>>>>>>> Signed-off-by: T.J. Mercier <tjmercier@...gle.com>
>>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>> See the originally submitted patch by Hridya Valsaraju here:
>>>>>>>>>>>> https://lkml.org/lkml/2022/1/4/1066
>>>>>>>>>>>>
>>>>>>>>>>>> v2 changes:
>>>>>>>>>>>> - Defer only sysfs creation instead of creation and teardown per
>>>>>>>>>>>> Christian König
>>>>>>>>>>>>
>>>>>>>>>>>> - Use a work queue instead of a kthread for deferred work per
>>>>>>>>>>>> Christian König
>>>>>>>>>>>> ---
>>>>>>>>>>>> drivers/dma-buf/dma-buf-sysfs-stats.c | 56 ++++++++++++++++++++-------
>>>>>>>>>>>> include/linux/dma-buf.h | 14 ++++++-
>>>>>>>>>>>> 2 files changed, 54 insertions(+), 16 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/dma-buf/dma-buf-sysfs-stats.c b/drivers/dma-buf/dma-buf-sysfs-stats.c
>>>>>>>>>>>> index 2bba0babcb62..67b0a298291c 100644
>>>>>>>>>>>> --- a/drivers/dma-buf/dma-buf-sysfs-stats.c
>>>>>>>>>>>> +++ b/drivers/dma-buf/dma-buf-sysfs-stats.c
>>>>>>>>>>>> @@ -11,6 +11,7 @@
>>>>>>>>>>>> #include <linux/printk.h>
>>>>>>>>>>>> #include <linux/slab.h>
>>>>>>>>>>>> #include <linux/sysfs.h>
>>>>>>>>>>>> +#include <linux/workqueue.h>
>>>>>>>>>>>>
>>>>>>>>>>>> #include "dma-buf-sysfs-stats.h"
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -168,10 +169,46 @@ void dma_buf_uninit_sysfs_statistics(void)
>>>>>>>>>>>> kset_unregister(dma_buf_stats_kset);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> +static void sysfs_add_workfn(struct work_struct *work)
>>>>>>>>>>>> +{
>>>>>>>>>>>> + struct dma_buf_sysfs_entry *sysfs_entry =
>>>>>>>>>>>> + container_of(work, struct dma_buf_sysfs_entry, sysfs_add_work);
>>>>>>>>>>>> + struct dma_buf *dmabuf = sysfs_entry->dmabuf;
>>>>>>>>>>>> +
>>>>>>>>>>>> + /*
>>>>>>>>>>>> + * A dmabuf is ref-counted via its file member. If this handler holds the only
>>>>>>>>>>>> + * reference to the dmabuf, there is no need for sysfs kobject creation. This is an
>>>>>>>>>>>> + * optimization and a race; when the reference count drops to 1 immediately after
>>>>>>>>>>>> + * this check it is not harmful as the sysfs entry will still get cleaned up in
>>>>>>>>>>>> + * dma_buf_stats_teardown, which won't get called until the final dmabuf reference
>>>>>>>>>>>> + * is released, and that can't happen until the end of this function.
>>>>>>>>>>>> + */
>>>>>>>>>>>> + if (file_count(dmabuf->file) > 1) {
>>>>>>>>>>> Please completely drop that. I see absolutely no justification for this
>>>>>>>>>>> additional complexity.
>>>>>>>>>>>
>>>>>>>>>> This case gets hit around 5% of the time in my testing so the else is
>>>>>>>>>> not a completely unused branch.
>>>>>>>>> Well I can only repeat myself: This means that your userspace is
>>>>>>>>> severely broken!
>>>>>>>>>
>>>>>>>>> DMA-bufs are meant to be long-lived objects
>>>>>>>> This patch addresses export *latency* regardless of how long-lived the
>>>>>>>> object is. Even a single, long-lived export will benefit from this
>>>>>>>> change if it would otherwise be blocked on adding an object to sysfs.
>>>>>>>> I think attempting to improve this latency still has merit.
>>>>>>> Fixing the latency is nice, but as it's just pushing the needed work off
>>>>>>> to another code path, it will take longer overall for the sysfs stuff to
>>>>>>> be ready for userspace to see.
>>>>>>>
>>>>>>> Perhaps we need to step back and understand what this code is supposed
>>>>>>> to be doing. As I recall, it was created because some systems do not
>>>>>>> allow debugfs anymore, and they wanted the debugging information that
>>>>>>> the dmabuf code was exposing to debugfs on a "normal" system. Moving
>>>>>>> that logic to sysfs made sense, but now I am wondering why we didn't see
>>>>>>> these issues in the debugfs code previously?
>>>>>>>
>>>>>>> Perhaps we should go just one step further and make a misc device node
>>>>>>> for dmabuf debugging information to be in and just have userspace
>>>>>>> poll/read on the device node and we spit the info that used to be in
>>>>>>> debugfs out through that? That way this only affects systems when they
>>>>>>> want to read the information and not normal code paths? Yeah that's a
>>>>>>> hack, but this whole thing feels overly complex now.
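[A rough sketch of that misc-device idea; the device name and the dmabuf_stats_format() helper are hypothetical, not from any posted patch. The point is that the formatting cost is paid only when userspace actually reads:]

#include <linux/miscdevice.h>
#include <linux/fs.h>

static ssize_t dmabuf_dbg_read(struct file *file, char __user *ubuf,
			       size_t count, loff_t *ppos)
{
	char tmp[128];
	int len;

	/* Format the stats on demand, only when someone actually reads. */
	len = dmabuf_stats_format(tmp, sizeof(tmp)); /* hypothetical helper */
	return simple_read_from_buffer(ubuf, count, ppos, tmp, len);
}

static const struct file_operations dmabuf_dbg_fops = {
	.owner = THIS_MODULE,
	.read  = dmabuf_dbg_read,
};

static struct miscdevice dmabuf_dbg_dev = {
	.minor = MISC_DYNAMIC_MINOR,
	.name  = "dmabuf_stats",	/* hypothetical name */
	.fops  = &dmabuf_dbg_fops,
};
/* misc_register(&dmabuf_dbg_dev) at init time. */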
>>>>>> A bit late on this discussion, but just wanted to add my +1 that we should
>>>>>> either redesign the uapi, or fix the underlying latency issue in sysfs, or
>>>>>> whatever else is deemed the proper fix.
>>>>>>
>>>>>> Making uapi interfaces async in ways that userspace can't discover is a
>>>>>> hack that we really shouldn't consider, at least for upstream. All kinds
>>>>>> of hilarious things might start to happen when an object exists, but not
>>>>>> consistently in all the places where it should be visible. There's a
>>>>>> reason sysfs has all these neat property groups so that absolutely
>>>>>> everything is added atomically. Doing stuff later on just because usually
>>>>>> no one notices that the illusion falls apart isn't great.
>>>>>>
>>>>>> Unfortunately I don't have a clear idea here what would be the right
>>>>>> solution :-/ One idea perhaps: Should we dynamically enumerate the objects
>>>>>> when userspace does a readdir()? That's absolutely not how sysfs works,
>>>>>> but procfs works like that, and there are discussions going around about
>>>>>> moving these optimizations to other kernfs implementations. At least there
>>>>>> was a recent lwn article on this:
>>>>>>
>>>>>> https://lwn.net/Articles/895111/
>>>>>>
>>>>>> But that would be serious amounts of work I guess.
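[A conceptual sketch of that readdir() idea, for illustration only: directory entries are synthesized from the live buffer list during iteration, procfs-style, in an iterate_shared-style handler. The db_list names mirror dma-buf internals but are assumptions here, and seek/resume handling is elided:]

static int dmabuf_stats_iterate(struct file *file, struct dir_context *ctx)
{
	struct dma_buf *buf;
	char name[32];

	if (!dir_emit_dots(file, ctx))
		return 0;

	mutex_lock(&db_list.lock);	/* assumed buffer-list lock */
	list_for_each_entry(buf, &db_list.head, list_node) {
		unsigned long ino = file_inode(buf->file)->i_ino;

		scnprintf(name, sizeof(name), "%lu", ino);
		/* Synthesize the per-buffer entry only now, at readdir time. */
		if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
			break;
		ctx->pos++;
	}
	mutex_unlock(&db_list.lock);
	return 0;
}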
>>>>>> -Daniel
>>>>>> --
>>>>>> Daniel Vetter
>>>>>> Software Engineer, Intel Corporation
>>>>>> http://blog.ffwll.ch/
>>>>> Hi Daniel,
>>>>>
>>>>> My team has been discussing this, and I think we're approaching a
>>>>> consensus on a way forward that involves deprecating the existing
>>>>> uapi.
>>>>>
>>>>> I actually proposed a similar (but less elegant) idea to the readdir()
>>>>> one. A new "dump_dmabuf_data" sysfs file that a user would write to,
>>>>> which would cause a one-time creation of the per-buffer files. These
>>>>> could be left around to become stale, or get cleaned up after first
>>>>> read. However to me it seems impossible to correctly deal with
>>>>> multiple simultaneous users with this technique. We're not currently
>>>>> planning to pursue this.
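[For illustration, the trigger-file variant would have been roughly this sketch; the attribute name comes from the paragraph above, while the one-shot helper is hypothetical. It also makes the multi-user problem apparent: concurrent writers share one global create/cleanup cycle:]

static ssize_t dump_dmabuf_data_store(struct kobject *kobj,
				      struct kobj_attribute *attr,
				      const char *buf, size_t count)
{
	/* One-shot creation of all per-buffer files (hypothetical helper).
	 * There is no way to tell which of several concurrent users the
	 * resulting files belong to, or when to tear them down. */
	dma_buf_stats_populate_all();
	return count;
}
static struct kobj_attribute dump_dmabuf_data_attr = __ATTR_WO(dump_dmabuf_data);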
>>>>>
>>>>> Thanks for the link to the article. That on-demand creation sounds
>>>>> like it would allow us to keep the existing structure and files for
>>>>> DMA-buf, assuming there is not a similar lock contention issue when
>>>>> adding a new node to the virtual tree. :)
>>>> I think that this on-demand creation is even worse than the existing ideas,
>>>> but if you can get Greg to accept the required sysfs changes then that's at
>>>> least outside of my maintenance domain anymore :)
>>> I think doing it cleanly in sysfs without changing the current uapi sounds
>>> pretty good. The hand-rolled "touch a magic file to force update all the
>>> files into existence" sounds like a horror show to me :-) Plus I don't see
>>> how you can actually avoid the locking pain with that since once the files
>>> are created, you have to remove them synchronously again, plus you get to
>>> deal with races on top (and likely some locking inversion fun on top).
>>> -Daniel
>> Yes, lots of reasons not to pursue that angle. :)
>>
>> So I asked Greg about modifying sysfs for this purpose, and he's quite
>> convincing that it's not the right way to approach this problem. So
>> that leaves deprecating the per-buffer statistics. It looks like we
>> can maintain the userspace functionality that depended on this by
>> replacing it with a single sysfs node for "dmabuf_total_size" along
>> with adding exporter information to procfs (via Kalesh's path patch
>> [1]). However there is a separate effort to account dmabufs from heaps
>> with cgroups [2], so if I'm able to make that work then we would not
>> need the new "dmabuf_total_size" file since this same information
>> could be obtained from the root cgroup instead. So I'd like to try
>> that first before falling back to adding a new dmabuf_total_size file.
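[A minimal sketch of what that fallback "dmabuf_total_size" node could look like; the counter and its update sites are assumptions, not part of any posted patch:]

static atomic64_t dmabuf_total_size;	/* assumed updated at export/release */

static ssize_t dmabuf_total_size_show(struct kobject *kobj,
				      struct kobj_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%lld\n",
			  (long long)atomic64_read(&dmabuf_total_size));
}
static struct kobj_attribute dmabuf_total_size_attr =
	__ATTR_RO(dmabuf_total_size);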
> Sounds like a plan.
Yep, totally agree.
Christian.
> -Daniel
>
>> [1] https://lore.kernel.org/lkml/875yll1fp1.fsf@stepbren-lnx.us.oracle.com/T/#m43a3d345f821a02babd4ebb1f4257982d027c9e4
>> [2] https://lore.kernel.org/lkml/CABdmKX1xvm87WMEDkMc9Aye46E4zv1-scenwgaRxHesrOCsaYg@mail.gmail.com/T/#mb82eca5438a4ea7ab157ab9cd7f044cbcfeb5509
>>
>>
>>
>>
>>> --
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch/
>
>