lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ab83b8ad-2076-425f-974e-6352be7882b5@amd.com>
Date: Wed, 2 Apr 2025 15:06:46 -0500
From: Mario Limonciello <mario.limonciello@....com>
To: Chris Bainbridge <chris.bainbridge@...il.com>,
 amd-gfx@...ts.freedesktop.org
Cc: dakr@...nel.org, christian.koenig@....com, daniel@...ll.ch,
 ville.syrjala@...ux.intel.com, maarten.lankhorst@...ux.intel.com,
 mripard@...nel.org, tzimmermann@...e.de, airlied@...il.com, simona@...ll.ch,
 lyude@...hat.com, sumit.semwal@...aro.org, dri-devel@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org
Subject: Re: [BUG] drm_connector reference counting and USB-C docks

On 3/23/2025 8:09 PM, Chris Bainbridge wrote:
> There is a reference couting / lifecycle issue with drm_connector when used
> with a USB-C dock. The problem has been previously reproduced on both Intel and
> AMD GPUs.
> 
> On both Intel and AMD, the symptoms are:
> 
>    - multiple connectors being listed in sysfs `sys/class/drm/cardX/` (because
>      the old connectors are not removed when the dock is unplugged)
>    - no display on the external monitors.
>    - "Payload for VCPI 1 not in topology, not sending remove" error if drm.debug
>      is enabled
> 
> On AMD, this issue is the root cause of a number of errors when re-plugging in
> a dock:
> 
>    - *ERROR* Failed to get ACT after 3000ms
>    - kernel NULL pointer dereference calling setcrtc
>    - UBSAN: shift-out-of-bounds in drivers/gpu/drm/display/drm_dp_mst_topology.c
>    - use-after-free in dc_stream_release
>    - refcount_t: underflow; use-after-free.
>    - slab-use-after-free in event_property_validate
>    - WARNING display/dc/dcn21/dcn21_link_encoder.c:215 dcn21_link_encoder_acquire_phy
>    - Part 1 of payload creation for DP-2 failed, skipping part 2
>    - probably most bug reports relating to suspend/resume and a dock
> 
> This bug has been reproduced on both Ubuntu/Gnome and Debian/XFCE. The symptoms
> are intermittent and vary (as above), but the consistent initial symptom is
> multiple connectors being listed in sysfs.
> 
> To reproduce, annotate drm_dp_delayed_destroy_port with something like:
> 
> --- a/drivers/gpu/drm/display/drm_dp_mst_topology.c
> +++ b/drivers/gpu/drm/display/drm_dp_mst_topology.c
> @@ -5014,6 +5014,9 @@ drm_dp_delayed_destroy_port(struct drm_dp_mst_port *port)
>   
>          if (port->connector) {
>                  drm_connector_unregister(port->connector);
> +               printk("drm_dp_delayed_destroy_port %s refcount=%d\n",
> +                               port->connector->name,
> +                               kref_read(&port->connector->base.refcount));
>                  drm_connector_put(port->connector);
>          }
>   
> Boot laptop with dock connected, activate external monitors, suspend, unplug
> the dock, and resume. This problem is intermittent, so these steps may need to
> be repeated. But when the problem is hit, the drm_dp_mst_port will be
> destroyed, but the drm_connector will still be alive. (This can also be
> reproduced with just plugging and unplugging without suspend/resume, but, on my
> laptop, it happens almost every time with suspend/resume).
> 
> The cause of this problem appears to be:
> 
>    - calling setcrtc to enable a CRTC results in the drm_connector refcount
>      being incremented:
>    - drm_atomic_get_connector_state appears to add connectors into
>      drm_atomic_state->connectors, and increments the refcount
> 
>    - on disabling the external monitors, a call to drm_mode_setcrtc results in
>      the drm_connector being destroyed via call chain:
> 
>      amdgpu_drm_ioctl
>        drm_ioctl
>          drm_ioctl_kernel
>            drm_mode_setcrtc (via func)
>              drm_atomic_helper_set_config (via crtc->funcs->set_config)
>                drm_atomic_state_put
>                  __drm_atomic_state_free (via kref_put)
>                    drm_atomic_state_clear
>                      drm_atomic_state_default_clear
>                        drm_connector_put
>                          drm_mode_object_put
>                            drm_connector_free (via ->free_cb put destroyer)
>                              dm_dp_mst_connector_destroy
> 
>    - so the drm_connector is not destroyed until/if userspace calls setcrtc to
>      clear the CRTC (set.num_connectors=0). If this does not happen for whatever
>      reason (userspace process is terminated, frozen due to suspend, etc.) then
>      the drm_connector object will still be alive even though the corresponding
>      drm_dp_mst_port is dead.
> 
>    - in normal usage, drm_connector_cleanup releases the connector ID:
> 
>      ida_free(&dev->mode_config.connector_ida, connector->index);
> 
>    - when dock is replugged, a connector ID is allocated:
> 
>      connector->connector_type_id = ida_alloc_min(connector_ida, 1, GFP_KERNEL);
> 
>    - if setcrtc has not been called to free the old ID, then ida_alloc_min
>      allocates a new connector ID instead of reusing the old one. This explains
>      the "multiple connectors being listed in sysfs" problem.
> 
>    - the other problems occur after this, due to the multiple half-dead
>      connector objects.
> 
>    - UBSAN: shift-out-of-bounds in drivers/gpu/drm/display/drm_dp_mst_topology.c:4568
>      occurs because vcpi==0 in this payload, so BIT op is a left-shift by -1.
> 
>    - slab-use-after-free in event_property_validate: looks like it happens
>      because hdcp_update_display, hdcp_remove_display copy references to
>      amdgpu_dm_connector (which contains a nested drm_connector) in to the
>      delayed_work struct hdcp_workqueue without incrementing the reference count
>      (see pair of lines "hdcp_w->aconnector[conn_index] = aconnector;").
>      If the connector is freed, &aconnector[conn_index] will become a dangling
>      pointer. Actually, I can reproduce this easily by just booting to
>      gdm then plugging and unplugging the dock a few times, so it's
>      possible this is an independent issue that also needs fixing.
> 
>    - use-after-free in dc_stream_release - there appears to be a few
>      points where a dc_stream_state pointer is copied without refcounting
>      ("pipe_ctx->stream = stream;") but I don't know if this is the problem. It
>      could also just be that earlier failures have left something in a bad state.
> 
> I'm unsure of the best approach to fix the root cause. One way is to try
> and release the references by disabling the CRTC. I tried calling
> drm_mode_crtc from drm_dp_delayed_destroy_port. This was a bit hacky,
> but did seem to work, the reference count got reduced to 0, and the
> drm_connector was destroyed. Another option would be to call the
> drm_connector destructor from drm_dp_delayed_destroy_port (protected by
> some mutex so that it doesn't get called twice when the actual refcount
> goes to 0) - that might work to free up the connector ID, but I suspect
> there could be other issues with having the drm_connector object still
> alive and potentially holding references to other objects, even though
> the dock has been physically disconnected.

Thanks for the thorough analysis and all your thoughts here.  It sounds 
like fixing this is going to resolve quite a large number of issues.

Working up your call path, drm_dp_destroy_port() is what queues the 
delayed destroy work and specifically has a comment about holding the 
mode config mutex from EDID retrieval being the reason that it can't be 
destroyed immediately.

Perhaps could this be solved by reading EDID immediately when the device
shows up and caching it for all callers?  Then this could be an 
immediate destroy.


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ