[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABBYNZJuic=HfeF1-ybuKELCOEOYk9OWtvqXC4vyrnnZLUp7RQ@mail.gmail.com>
Date: Mon, 16 Dec 2024 22:15:45 -0500
From: Luiz Augusto von Dentz <luiz.dentz@...il.com>
To: "Cheng Jiang (IOE)" <quic_chejiang@...cinc.com>
Cc: Marcel Holtmann <marcel@...tmann.org>, Johan Hedberg <johan.hedberg@...il.com>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
linux-bluetooth@...r.kernel.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, quic_jiaymao@...cinc.com,
quic_shuaz@...cinc.com, quic_zijuhu@...cinc.com, quic_mohamull@...cinc.com
Subject: Re: [PATCH v1] Bluetooth: hci_sync: Fix disconnect complete event
timeout issue
Hi Cheng,
On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE)
<quic_chejiang@...cinc.com> wrote:
>
> Hi Luiz,
>
> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote:
> > Hi Cheng,
> >
> > On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz
> > <luiz.dentz@...il.com> wrote:
> >>
> >> Hi Cheng,
> >>
> >> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@...cinc.com> wrote:
> >>>
> >>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND
> >>> in time, requiring the controller to wait for the supervision timeout,
> >>> which may exceed 2 seconds. In the current implementation, the
> >>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since
> >>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds.
> >>> This causes the mgmt to get stuck, resulting in bluetoothd waiting
> >>> indefinitely for the mgmt response to the disconnect. To recover,
> >>> restarting bluetoothd is necessary.
> >>>
> >>> bluetoothctl log like this:
> >>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
> >>> Attempting to disconnect from D9:B5:6C:F2:51:91
> >>> [Designer Mouse]#
> >>> [Designer Mouse]# power off
> >>> [Designer Mouse]#
> >>> Failed to set power off: org.freedesktop.DBus.Error.NoReply.
> >>>
> >>> Signed-off-by: Cheng Jiang <quic_chejiang@...cinc.com>
> >>> ---
> >>> include/net/bluetooth/hci_core.h | 2 ++
> >>> net/bluetooth/hci_conn.c | 9 +++++++++
> >>> net/bluetooth/hci_event.c | 9 +++++++++
> >>> net/bluetooth/hci_sync.c | 18 ++++++++++++++++++
> >>> 4 files changed, 38 insertions(+)
> >>>
> >>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
> >>> index 734cd50cd..2ab079dcf 100644
> >>> --- a/include/net/bluetooth/hci_core.h
> >>> +++ b/include/net/bluetooth/hci_core.h
> >>> @@ -753,6 +753,8 @@ struct hci_conn {
> >>>
> >>> struct bt_codec codec;
> >>>
> >>> + struct completion disc_ev_comp;
> >>> +
> >>> void (*connect_cfm_cb) (struct hci_conn *conn, u8 status);
> >>> void (*security_cfm_cb) (struct hci_conn *conn, u8 status);
> >>> void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason);
> >>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
> >>> index d097e308a..e0244e191 100644
> >>> --- a/net/bluetooth/hci_conn.c
> >>> +++ b/net/bluetooth/hci_conn.c
> >>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t
> >>>
> >>> hci_conn_init_sysfs(conn);
> >>>
> >>> + /* This disc_ev_comp is inited when we send a disconnect request to
> >>> + * the remote device but fail to receive the disconnect complete
> >>> + * event within the expected time (2 seconds). This occurs because
> >>> + * the remote device doesn't ack the terminate indication, forcing
> >>> + * the controller to wait for the supervision timeout.
> >>> + */
> >>> + init_completion(&conn->disc_ev_comp);
> >>> + complete(&conn->disc_ev_comp);
> >>> +
> >>> return conn;
> >>> }
> >>>
> >>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
> >>> index 2cc7a9306..60ecb2b18 100644
> >>> --- a/net/bluetooth/hci_event.c
> >>> +++ b/net/bluetooth/hci_event.c
> >>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
> >>> if (!conn)
> >>> goto unlock;
> >>>
> >>> + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock
> >>> + * hci_abort_conn_sync will wait hdev lock release to continue.
> >>> + */
> >>> + if (!completion_done(&conn->disc_ev_comp)) {
> >>> + complete(&conn->disc_ev_comp);
> >>> + /* Add some delay for hci_abort_conn_sync to handle the complete */
> >>> + usleep_range(100, 1000);
> >>> + }
> >>> +
> >>> if (ev->status) {
> >>> mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >>> conn->dst_type, ev->status);
> >>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
> >>> index 0badec712..783d04b57 100644
> >>> --- a/net/bluetooth/hci_sync.c
> >>> +++ b/net/bluetooth/hci_sync.c
> >>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason)
> >>> break;
> >>> }
> >>>
> >>> + /* Check whether the connection is successfully disconnected.
> >>> + * Sometimes the remote device doesn't acknowledge the
> >>> + * LL_TERMINATE_IND in time, requiring the controller to wait
> >>> + * for the supervision timeout, which may exceed 2 seconds. In
> >>> + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE
> >>> + * event before cleaning up the connection.
> >>> + */
> >>> + if (err == -ETIMEDOUT) {
> >>> + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout);
> >>> +
> >>> + reinit_completion(&conn->disc_ev_comp);
> >>> + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) {
> >>> + bt_dev_warn(hdev, "Failed to get complete");
> >>> + mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >>> + conn->dst_type, conn->abort_reason);
> >>> + }
> >>> + }
> >>
> >> Why don't we just set the supervision timeout as timeout then? If we
> >> will have to wait for it anyway just change hci_disconnect_sync to use
> >> 10 * conn->le_supv_timeout as timeout instead.
> >>
> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the
> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work.
> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait
> for the event. So even in hci_disconnect_sync we can set the timeout to
> supervision timeout, hci_disconnect_sync still timeout after 2s.
Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use
the supervision timeout instead? If it still timeout after to 2
seconds then there is something still forcing HCI_CMD_TIMEOUT which
shouldn't happen.
> >> That said, we really need to fix bluetoothd if it is not able to be
> >> cleaned up if SET_POWERED command fails, but it looks like it is
> >> handling errors correctly so it sounds like something else is at play.
> >
> The issue arises after a 2-second timeout of hci_disconnect_sync. During
> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed.
> after supervision timeout, the disconnect complete event arrives, but
> it returns at line 3370 since the connection has already been removed.
> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT
> to the application layer (bluetoothctl), causing bluetoothctl to get stuck
> and unable to perform other mgmt commands.
The command shall have completed regardless if disconnect complete has
been received or not, the is the whole point of having a timeout, so
this makes no sense to me, or you are not describing what is happening
here. Also there is no MGMT_OP_DISCONNECT pending, it is
MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event
that is a totally different thing and perhaps that is the source of
the problem because if we do cleanup hci_conn even in case the command
fails/times out then we should be generating a disconnected event as
well.
>
> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
> 3356 struct sk_buff *skb)
> 3357 {
> 3358 struct hci_ev_disconn_complete *ev = data;
> 3359 u8 reason;
> 3360 struct hci_conn_params *params;
> 3361 struct hci_conn *conn;
> 3362 bool mgmt_connected;
> 3363
> 3364 bt_dev_dbg(hdev, "status 0x%2.2x", ev->status);
> 3365
> 3366 hci_dev_lock(hdev);
> 3367
> 3368 conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle));
> 3369 if (!conn)
> 3370 goto unlock;
> 3371
> 3372 if (ev->status) {
> 3373 mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> 3374 conn->dst_type, ev->status);
> 3375 goto unlock;
> 3376 }
> 3377
> 3378 conn->state = BT_CLOSED;
> 3379
> 3380 mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags);
> 3381
>
> > I double checked this and apparently this could no longer fail:
> >
> > + /* Disregard possible errors since hci_conn_del shall have been
> > + * called even in case of errors had occurred since it would
> > + * then cause hci_conn_failed to be called which calls
> > + * hci_conn_del internally.
> > + */
> > + hci_abort_conn_sync(hdev, conn, reason);
> >
> > So it will clean up the hci_conn no matter what is the timeout, so
> > either you don't have this change or it is not working for some
> > reason.
> >
> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is
> blocked. It does not happen during the power off stage. It happened when disconnect
> a BLE device, but the disconnect complete event sent from controller to host 2s later.
> Then it causes the mgmt in bluetoothctl is blocked as decribed as above.
There is a difference about a MGMT command, initiated by bluetoothd,
versus a MGMT event initiated by the kernel, so the daemon is not
blocked it just don't get a disconnection event, which is different
than a command complete.
What is the command sequence that you use to reproduce the problem?
> >>> hci_dev_lock(hdev);
> >>>
> >>> /* Check if the connection has been cleaned up concurrently */
> >>>
> >>> base-commit: e25c8d66f6786300b680866c0e0139981273feba
> >>> --
> >>> 2.34.1
> >>>
> >>
> >>
> >> --
> >> Luiz Augusto von Dentz
> >
> >
> >
>
--
Luiz Augusto von Dentz
Powered by blists - more mailing lists