[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <110c944c-7380-4fe4-b10f-49dde79fa553@linux.dev>
Date: Fri, 29 Nov 2024 18:33:47 +0800
From: Sui Jingfeng <sui.jingfeng@...ux.dev>
To: Chen-Yu Tsai <wenst@...omium.org>
Cc: Maxime Ripard <mripard@...nel.org>, Fei Shao <fshao@...omium.org>,
Andrzej Hajda <andrzej.hajda@...el.com>,
Neil Armstrong <neil.armstrong@...aro.org>, Robert Foss <rfoss@...nel.org>,
Laurent Pinchart <Laurent.pinchart@...asonboard.com>,
AngeloGioacchino Del Regno <angelogioacchino.delregno@...labora.com>,
David Airlie <airlied@...il.com>, Jernej Skrabec <jernej.skrabec@...il.com>,
Jonas Karlman <jonas@...boo.se>,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Simona Vetter <simona@...ll.ch>, Thomas Zimmermann <tzimmermann@...e.de>,
dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] drm/bridge: panel: Use devm_drm_bridge_add()
Hi,
On 2024/11/29 12:52, Chen-Yu Tsai wrote:
> On Thu, Nov 28, 2024 at 2:46 AM Sui Jingfeng <sui.jingfeng@...ux.dev> wrote:
>> Hi,
>>
>> On 2024/11/27 17:58, Chen-Yu Tsai wrote:
>>> Revisiting this thread since I just stepped on the same problem on a
>>> different device.
>>>
>>> On Thu, Nov 14, 2024 at 9:12 PM Maxime Ripard <mripard@...nel.org> wrote:
>>>> On Tue, Oct 29, 2024 at 10:53:49PM +0800, Fei Shao wrote:
>>>>> On Thu, Oct 24, 2024 at 8:36 PM Maxime Ripard <mripard@...nel.org> wrote:
>>>>>> On Wed, Oct 09, 2024 at 01:23:31PM +0800, Fei Shao wrote:
>>>>>>> In the mtk_dsi driver, its DSI host attach callback calls
>>>>>>> devm_drm_of_get_bridge() to get the next bridge. If that next bridge is
>>>>>>> a panel bridge, a panel_bridge object is allocated and managed by the
>>>>>>> panel device.
>>>>>>>
>>>>>>> Later, if the attach callback fails with -EPROBE_DEFER from subsequent
>>>>>>> component_add(), the panel device invoking the callback at probe time
>>>>>>> also fails, and all device-managed resources are freed accordingly.
>>>>>>>
>>>>>>> This exposes a drm_bridge bridge_list corruption due to the unbalanced
>>>>>>> lifecycle between the DSI host and the panel devices: the panel_bridge
>>>>>>> object managed by panel device is freed, while drm_bridge_remove() is
>>>>>>> bound to DSI host device and never gets called.
>>>>>>> The next drm_bridge_add() will trigger UAF against the freed bridge list
>>>>>>> object and result in kernel panic.
>>>>>>>
>>>>>>> This bug is observed on a MediaTek MT8188-based Chromebook with MIPI DSI
>>>>>>> outputting to a DSI panel (DT is WIP for upstream).
>>>>>>>
>>>>>>> As a fix, using devm_drm_bridge_add() with the panel device in the panel
>>>>>>> path seems reasonable. This also implies a chain of potential cleanup
>>>>>>> actions:
>>>>>>>
>>>>>>> 1. Removing drm_bridge_remove() means devm_drm_panel_bridge_release()
>>>>>>> becomes hollow and can be removed.
>>>>>>>
>>>>>>> 2. devm_drm_panel_bridge_add_typed() is almost emptied except for the
>>>>>>> `bridge->pre_enable_prev_first` line. Itself can be also removed if
>>>>>>> we move the line into drm_panel_bridge_add_typed(). (maybe?)
>>>>>>>
>>>>>>> 3. drm_panel_bridge_add_typed() now calls all the needed devm_* calls,
>>>>>>> so it's essentially the new devm_drm_panel_bridge_add_typed().
>>>>>>>
>>>>>>> 4. drmm_panel_bridge_add() needs to be updated accordingly since it
>>>>>>> calls drm_panel_bridge_add_typed(). But now there's only one bridge
>>>>>>> object to be freed, and it's already being managed by panel device.
>>>>>>> I wonder if we still need both drmm_ and devm_ version in this case.
>>>>>>> (maybe yes from DRM PoV, I don't know much about the context)
>>>>>>>
>>>>>>> This is a RFC patch since I'm not sure if my understanding is correct
>>>>>>> (for both the fix and the cleanup). It fixes the issue I encountered,
>>>>>>> but I don't expect it to be picked up directly due to the redundant
>>>>>>> commit message and the dangling devm_drm_panel_bridge_release().
>>>>>>> I plan to resend the official patch(es) once I know what I supposed to
>>>>>>> do next.
>>>>>>>
>>>>>>> For reference, here's the KASAN report from the device:
>>>>>>> ==================================================================
>>>>>>> BUG: KASAN: slab-use-after-free in drm_bridge_add+0x98/0x230
>>>>>>> Read of size 8 at addr ffffff80c4e9e100 by task kworker/u32:1/69
>>>>>>>
>>>>>>> CPU: 1 UID: 0 PID: 69 Comm: kworker/u32:1 Not tainted 6.12.0-rc1-next-20241004-kasan-00030-g062135fa4046 #1
>>>>>>> Hardware name: Google Ciri sku0/unprovisioned board (DT)
>>>>>>> Workqueue: events_unbound deferred_probe_work_func
>>>>>>> Call trace:
>>>>>>> dump_backtrace+0xfc/0x140
>>>>>>> show_stack+0x24/0x38
>>>>>>> dump_stack_lvl+0x40/0xc8
>>>>>>> print_report+0x140/0x700
>>>>>>> kasan_report+0xcc/0x130
>>>>>>> __asan_report_load8_noabort+0x20/0x30
>>>>>>> drm_bridge_add+0x98/0x230
>>>>>>> devm_drm_panel_bridge_add_typed+0x174/0x298
>>>>>>> devm_drm_of_get_bridge+0xe8/0x190
>>>>>>> mtk_dsi_host_attach+0x130/0x2b0
>>>>>>> mipi_dsi_attach+0x8c/0xe8
>>>>>>> hx83102_probe+0x1a8/0x368
>>>>>>> mipi_dsi_drv_probe+0x6c/0x88
>>>>>>> really_probe+0x1c4/0x698
>>>>>>> __driver_probe_device+0x160/0x298
>>>>>>> driver_probe_device+0x7c/0x2a8
>>>>>>> __device_attach_driver+0x2a0/0x398
>>>>>>> bus_for_each_drv+0x198/0x200
>>>>>>> __device_attach+0x1c0/0x308
>>>>>>> device_initial_probe+0x20/0x38
>>>>>>> bus_probe_device+0x11c/0x1f8
>>>>>>> deferred_probe_work_func+0x80/0x250
>>>>>>> worker_thread+0x9b4/0x2780
>>>>>>> kthread+0x274/0x350
>>>>>>> ret_from_fork+0x10/0x20
>>>>>>>
>>>>>>> Allocated by task 69:
>>>>>>> kasan_save_track+0x40/0x78
>>>>>>> kasan_save_alloc_info+0x44/0x58
>>>>>>> __kasan_kmalloc+0x84/0xa0
>>>>>>> __kmalloc_node_track_caller_noprof+0x228/0x450
>>>>>>> devm_kmalloc+0x6c/0x288
>>>>>>> devm_drm_panel_bridge_add_typed+0xa0/0x298
>>>>>>> devm_drm_of_get_bridge+0xe8/0x190
>>>>>>> mtk_dsi_host_attach+0x130/0x2b0
>>>>>>> mipi_dsi_attach+0x8c/0xe8
>>>>>>> hx83102_probe+0x1a8/0x368
>>>>>>> mipi_dsi_drv_probe+0x6c/0x88
>>>>>>> really_probe+0x1c4/0x698
>>>>>>> __driver_probe_device+0x160/0x298
>>>>>>> driver_probe_device+0x7c/0x2a8
>>>>>>> __device_attach_driver+0x2a0/0x398
>>>>>>> bus_for_each_drv+0x198/0x200
>>>>>>> __device_attach+0x1c0/0x308
>>>>>>> device_initial_probe+0x20/0x38
>>>>>>> bus_probe_device+0x11c/0x1f8
>>>>>>> deferred_probe_work_func+0x80/0x250
>>>>>>> worker_thread+0x9b4/0x2780
>>>>>>> kthread+0x274/0x350
>>>>>>> ret_from_fork+0x10/0x20
>>>>>>>
>>>>>>> Freed by task 69:
>>>>>>> kasan_save_track+0x40/0x78
>>>>>>> kasan_save_free_info+0x58/0x78
>>>>>>> __kasan_slab_free+0x48/0x68
>>>>>>> kfree+0xd4/0x750
>>>>>>> devres_release_all+0x144/0x1e8
>>>>>>> really_probe+0x48c/0x698
>>>>>>> __driver_probe_device+0x160/0x298
>>>>>>> driver_probe_device+0x7c/0x2a8
>>>>>>> __device_attach_driver+0x2a0/0x398
>>>>>>> bus_for_each_drv+0x198/0x200
>>>>>>> __device_attach+0x1c0/0x308
>>>>>>> device_initial_probe+0x20/0x38
>>>>>>> bus_probe_device+0x11c/0x1f8
>>>>>>> deferred_probe_work_func+0x80/0x250
>>>>>>> worker_thread+0x9b4/0x2780
>>>>>>> kthread+0x274/0x350
>>>>>>> ret_from_fork+0x10/0x20
>>>>>>>
>>>>>>> The buggy address belongs to the object at ffffff80c4e9e000
>>>>>>> which belongs to the cache kmalloc-4k of size 4096
>>>>>>> The buggy address is located 256 bytes inside of
>>>>>>> freed 4096-byte region [ffffff80c4e9e000, ffffff80c4e9f000)
>>>>>>>
>>>>>>> The buggy address belongs to the physical page:
>>>>>>> head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>>>>>>> flags: 0x8000000000000040(head|zone=2)
>>>>>>> page_type: f5(slab)
>>>>>>> page: refcount:1 mapcount:0 mapping:0000000000000000
>>>>>>> index:0x0 pfn:0x104e98
>>>>>>> raw: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>>>>>>> raw: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>>>>>>> head: 8000000000000040 ffffff80c0003040 dead000000000122 0000000000000000
>>>>>>> head: 0000000000000000 0000000000040004 00000001f5000000 0000000000000000
>>>>>>> head: 8000000000000003 fffffffec313a601 ffffffffffffffff 0000000000000000
>>>>>>> head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
>>>>>>> page dumped because: kasan: bad access detected
>>>>>>>
>>>>>>> Memory state around the buggy address:
>>>>>>> ffffff80c4e9e000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>> ffffff80c4e9e080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>> >ffffff80c4e9e100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>> ^
>>>>>>> ffffff80c4e9e180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>> ffffff80c4e9e200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>>>>> ===================================================================
>>>>>>>
>>>>>>> Signed-off-by: Fei Shao <fshao@...omium.org>
>>>>>> I was looking at the driver to try to follow your (awesome btw, thanks)
>>>>>> commit log, and it does have a quite different structure compared to
>>>>>> what we recommend.
>>>>>>
>>>>>> Would following
>>>>>> https://docs.kernel.org/gpu/drm-kms-helpers.html#special-care-with-mipi-dsi-bridges
>>>>>> help?
>>>>> Hi Maxime,
>>>>>
>>>>> Thank you for the pointer.
>>>>> I read the suggested pattern in the doc and compared it with the
>>>>> drivers. If I understand correctly, both the MIPI-DSI host and panel
>>>>> drivers follow the instructions:
>>>>>
>>>>> 1. The MIPI-DSI host driver must run mipi_dsi_host_register() in its probe hook.
>>>>> >> drm/mediatek/mtk_dsi.c runs mipi_dsi_host_register() in the probe hook.
>>>>> 2. In its probe hook, the bridge driver must try to find its MIPI-DSI
>>>>> host, register as a MIPI-DSI device and attach the MIPI-DSI device to
>>>>> its host.
>>>>> >> drm/panel/panel-himax-hx83102.c follows and runs
>>>>> mipi_dsi_attach() at the end of probe hook.
>>>>> 3. In its struct mipi_dsi_host_ops.attach hook, the MIPI-DSI host can
>>>>> now add its component.
>>>>> >> drm/mediatek/mtk_dsi.c calls component_add() in the attach callback.
>>>>>
>>>>> Could you elaborate on the "different structures" you mentioned?
>>>> Yeah, you're right, sorry.
>>>>
>>>>> To clarify my point: the issue is that component_add() may return
>>>>> -EPROBE_DEFER if the component (e.g. DSI encoder) is not ready,
>>>>> causing the panel bridge to be removed. However, drm_bridge_remove()
>>>>> is bound to MIPI-DSI host instead of panel bridge, which owns the
>>>>> actual list_head object.
>>>>>
>>>>> This might be reproducible with other MIPI-DSI host + panel
>>>>> combinations by forcibly returning -EPROBE_DEFER in the host attach
>>>>> hook (verification with another device is needed), so the fix may be
>>>>> required in drm/bridge/panel.c.
>>>> Yeah, I think you're just hitting another bridge lifetime issue, and
>>>> it's not the only one unfortunately. Tying the bridge structure lifetime
>>>> itself to the device is wrong, it should be tied to the DRM device
>>>> lifetime instead.
>>> I think the more immediate issue is that the bridge object's lifetime
>>> and drm_bridge_add/remove are inconsistent when devm_drm_of_get_bridge()
>>> or drmm_of_get_bridge() are used.
>> Well, I think this is more of probe issue of multiple kernel modules.
>>
>> The root issue is that the global bridge list still stores the
>> pointer to *old* the bridge instance which has been freed after
>> the first '-EPROBE_DEFER' happened. The next time the
>> 'drm_bridge_add(&panel_bridge->bridge);' is called, we will deference
>> the *old* NULL pointer. Because it will touch the 'struct drm_bridge::list'
>> field, which's backing memory has been freed.
> Yes. That is what is causing the crash.
>
>>> These helpers tie the bridge add/removal to the device or drm_device
>>> passed in, but internally they call down to drm_panel_bridge_add_typed()
>>> which allocates the bridge object tied to the panel device.
>> When the devm_drm_panel_bridge_add_typed() is passed a pointer of
>> DSI host device, we essentially tie the lifetime of the freshly
>> created drm bridge instance to the DSI host device. But, the
>> 'struct panel_bridge' clearly hint that the bridge instance has
>> the same lifetime with the backing panel, after all, it's the
>> underlying panel baking the bridge.
> Exactly.
>
>>>> But then, the discussion becomes that bridges typically probe outside of
>>>> the "main" DRM device probe path, so you don't have access to the DRM
>>>> device structure until attach at best.
>>>>
>>>> That's why I'm a bit skeptical about your patch. It might workaround
>>>> your issue, but it doesn't actually solve the problem. I guess the best
>>>> way about it would be to convert bridges to reference counting, with the
>>>> device taking a reference at probe time when it allocates the structure
>>>> (and giving it back at remove time), and the DRM device taking one when
>>>> it's attached and one when it's detached.
>>> Without going as far, it's probably better to align the lifecycle of
>>> the two parts. Most other bridge drivers in the kernel have |drm_bridge|
>>> lifecycle tied to their underlying |device|, either with explicit
>>> drm_bridge_{add,remove}() calls in their probe/bind and remove/unbind
>>> callbacks respectively, or with devm_drm_bridge_add in the probe/bind
>>> path. The only ones with a narrower lifecycle are the DSI hosts, which
>>> add the bridge in during host attach and remove it during host detach.
>>>
>>> I'm thinking about fixing the panel_bridge lifecycle such that it is
>>> tied to the panel itself. Maybe that would involve making
>>> devm_drm_of_get_bridge() correctly return bridges even if a panel was
>>> found, and then making the panels create and add panel bridges directly,
>>> possibly within drm_panel_add(). Would that make sense?
>> I think, align the lifetime of the bridge with 'panel->dev' probably helps.
>> Modifying the devm_drm_of_get_bridge() function like the following pattern:
>>
>>
>> ```
>>
>> struct drm_bridge *devm_drm_of_get_bridge(struct device_node *np, u32
>> port, u32 endpoint)
>> {
>> struct drm_bridge *bridge;
>> struct drm_panel *panel;
>> int ret;
>>
>> ret = drm_of_find_panel_or_bridge(np, port, endpoint, &panel, &bridge);
>> if (ret)
>> return ERR_PTR(ret);
>>
>> if (panel)
>> bridge = devm_drm_panel_bridge_add(panel->dev, panel);
>>
>> return bridge;
>> }
> That's one possible solution I thought of, but then the devm_ prefix
> no longer makes sense.
It still is dev managed...
Despite of_get_something() typically increase the reference count of
something. of_find_something() will return a pointer to the underlyinginstance, but will not touch the reference counter.
> Also the panel_bridge is still implicitly created,
Yeah.
And it will always create a new bridge instance when you call the function.
(if the backing panel is already populated).Similar with Dmitry's AUX/HPD bridge.
> and we might as well move that to the panel side.
Your idea is possible and may bring back better lifetime control,
I don't know which method is the best, gonna to run away.
>> ```
>>
>>
>> Or alternatively, inline this to drm/mediatek,
>> rename it as mtk_drm_of_get_bridge().
> I would prefer to not do that, since that only fixes the issue for
> MediaTek, while we have some 30 odd users of devm_drm_of_get_bridge().
>
>> Or alternatively, manage the bridge's lifetime manually.
>> Remove it from the global bridge list if errors happen.
> That's also one way; it's just messy.
Right.
>
> Thanks
> ChenYu
>
>>> Thanks
>>> ChenYu
>>>
>>>> It's much more involved than just another helper though :/
>>>>
>>>> Maxime
>> --
>> Best regards,
>> Sui
>>
--
Best regards,
Sui
Powered by blists - more mailing lists