[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABBYNZJcHvVYBUmS6AeA6OGnV8n1EKh5JxB8aLqP9OBqA-amyg@mail.gmail.com>
Date: Tue, 17 Dec 2024 10:13:40 -0500
From: Luiz Augusto von Dentz <luiz.dentz@...il.com>
To: "Cheng Jiang (IOE)" <quic_chejiang@...cinc.com>
Cc: Marcel Holtmann <marcel@...tmann.org>, Johan Hedberg <johan.hedberg@...il.com>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
linux-bluetooth@...r.kernel.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, quic_jiaymao@...cinc.com,
quic_shuaz@...cinc.com, quic_zijuhu@...cinc.com, quic_mohamull@...cinc.com
Subject: Re: [PATCH v1] Bluetooth: hci_sync: Fix disconnect complete event
timeout issue
Hi Cheng,
On Tue, Dec 17, 2024 at 12:51 AM Cheng Jiang (IOE)
<quic_chejiang@...cinc.com> wrote:
>
> Hi Luiz,
>
> On 12/17/2024 11:15 AM, Luiz Augusto von Dentz wrote:
> > Hi Cheng,
> >
> > On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE)
> > <quic_chejiang@...cinc.com> wrote:
> >>
> >> Hi Luiz,
> >>
> >> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote:
> >>> Hi Cheng,
> >>>
> >>> On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz
> >>> <luiz.dentz@...il.com> wrote:
> >>>>
> >>>> Hi Cheng,
> >>>>
> >>>> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@...cinc.com> wrote:
> >>>>>
> >>>>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND
> >>>>> in time, requiring the controller to wait for the supervision timeout,
> >>>>> which may exceed 2 seconds. In the current implementation, the
> >>>>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since
> >>>>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds.
> >>>>> This causes the mgmt to get stuck, resulting in bluetoothd waiting
> >>>>> indefinitely for the mgmt response to the disconnect. To recover,
> >>>>> restarting bluetoothd is necessary.
> >>>>>
> >>>>> bluetoothctl log like this:
> >>>>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
> >>>>> Attempting to disconnect from D9:B5:6C:F2:51:91
> >>>>> [Designer Mouse]#
> >>>>> [Designer Mouse]# power off
> >>>>> [Designer Mouse]#
> >>>>> Failed to set power off: org.freedesktop.DBus.Error.NoReply.
> >>>>>
> >>>>> Signed-off-by: Cheng Jiang <quic_chejiang@...cinc.com>
> >>>>> ---
> >>>>> include/net/bluetooth/hci_core.h | 2 ++
> >>>>> net/bluetooth/hci_conn.c | 9 +++++++++
> >>>>> net/bluetooth/hci_event.c | 9 +++++++++
> >>>>> net/bluetooth/hci_sync.c | 18 ++++++++++++++++++
> >>>>> 4 files changed, 38 insertions(+)
> >>>>>
> >>>>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
> >>>>> index 734cd50cd..2ab079dcf 100644
> >>>>> --- a/include/net/bluetooth/hci_core.h
> >>>>> +++ b/include/net/bluetooth/hci_core.h
> >>>>> @@ -753,6 +753,8 @@ struct hci_conn {
> >>>>>
> >>>>> struct bt_codec codec;
> >>>>>
> >>>>> + struct completion disc_ev_comp;
> >>>>> +
> >>>>> void (*connect_cfm_cb) (struct hci_conn *conn, u8 status);
> >>>>> void (*security_cfm_cb) (struct hci_conn *conn, u8 status);
> >>>>> void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason);
> >>>>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
> >>>>> index d097e308a..e0244e191 100644
> >>>>> --- a/net/bluetooth/hci_conn.c
> >>>>> +++ b/net/bluetooth/hci_conn.c
> >>>>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t
> >>>>>
> >>>>> hci_conn_init_sysfs(conn);
> >>>>>
> >>>>> + /* This disc_ev_comp is inited when we send a disconnect request to
> >>>>> + * the remote device but fail to receive the disconnect complete
> >>>>> + * event within the expected time (2 seconds). This occurs because
> >>>>> + * the remote device doesn't ack the terminate indication, forcing
> >>>>> + * the controller to wait for the supervision timeout.
> >>>>> + */
> >>>>> + init_completion(&conn->disc_ev_comp);
> >>>>> + complete(&conn->disc_ev_comp);
> >>>>> +
> >>>>> return conn;
> >>>>> }
> >>>>>
> >>>>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
> >>>>> index 2cc7a9306..60ecb2b18 100644
> >>>>> --- a/net/bluetooth/hci_event.c
> >>>>> +++ b/net/bluetooth/hci_event.c
> >>>>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
> >>>>> if (!conn)
> >>>>> goto unlock;
> >>>>>
> >>>>> + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock
> >>>>> + * hci_abort_conn_sync will wait hdev lock release to continue.
> >>>>> + */
> >>>>> + if (!completion_done(&conn->disc_ev_comp)) {
> >>>>> + complete(&conn->disc_ev_comp);
> >>>>> + /* Add some delay for hci_abort_conn_sync to handle the complete */
> >>>>> + usleep_range(100, 1000);
> >>>>> + }
> >>>>> +
> >>>>> if (ev->status) {
> >>>>> mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >>>>> conn->dst_type, ev->status);
> >>>>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
> >>>>> index 0badec712..783d04b57 100644
> >>>>> --- a/net/bluetooth/hci_sync.c
> >>>>> +++ b/net/bluetooth/hci_sync.c
> >>>>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason)
> >>>>> break;
> >>>>> }
> >>>>>
> >>>>> + /* Check whether the connection is successfully disconnected.
> >>>>> + * Sometimes the remote device doesn't acknowledge the
> >>>>> + * LL_TERMINATE_IND in time, requiring the controller to wait
> >>>>> + * for the supervision timeout, which may exceed 2 seconds. In
> >>>>> + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE
> >>>>> + * event before cleaning up the connection.
> >>>>> + */
> >>>>> + if (err == -ETIMEDOUT) {
> >>>>> + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout);
> >>>>> +
> >>>>> + reinit_completion(&conn->disc_ev_comp);
> >>>>> + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) {
> >>>>> + bt_dev_warn(hdev, "Failed to get complete");
> >>>>> + mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >>>>> + conn->dst_type, conn->abort_reason);
> >>>>> + }
> >>>>> + }
> >>>>
> >>>> Why don't we just set the supervision timeout as timeout then? If we
> >>>> will have to wait for it anyway just change hci_disconnect_sync to use
> >>>> 10 * conn->le_supv_timeout as timeout instead.
> >>>>
> >> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the
> >> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work.
> >> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait
> >> for the event. So even in hci_disconnect_sync we can set the timeout to
> >> supervision timeout, hci_disconnect_sync still timeout after 2s.
> >
> > Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use
> > the supervision timeout instead? If it still timeout after to 2
> > seconds then there is something still forcing HCI_CMD_TIMEOUT which
> > shouldn't happen.
> >
> Since the lower layer (hci_cmd_work) has set the timeout to HCI_CMD_TIMEOUT, so
> even the upper layer set to a larger timeout value, like supervision timeout,
> it still get the timeout after HCI_CMD_TIMEOUT. The timeout flow is:
> hci_disconnect_sync -> __hci_cmd_sync_sk(wait_event_interruptible_timeout) ->
> hci_cmd_work -> hci_cmd_timeout -> hci_cmd_sync_cancel_sync -> wake up the
> wait_event_interruptible_timeout in __hci_cmd_sync_sk -> hci_disconnect_sync timeout.
>
> So even if we set a large timeout value in hci_disconnect_sync, it doesn't work
> since it's waked up by other event, not the real timeout.
>
> What's more, in the hci_disconnect_sync, if the reason it not power_off, it waits
> for the disconnect complete event rather than command status event. According to
> BT core spec, disconnect complete event has to wait for remote's ack or wait until
> supervision timeout. It's a valid case that the disconnect complete event taking
> more than 2s.
You seems to be confusing the role of 2 different timers:
err = wait_event_interruptible_timeout(hdev->req_wait_q,
hdev->req_status != HCI_REQ_PEND,
timeout);
and
queue_delayed_work(hdev->workqueue, &hdev->cmd_timer,
HCI_CMD_TIMEOUT);
The former waits for a specific event, while the later waits for
handle_cmd_cnt_and_timer, each can have a distinct timeout as in the
code bellow:
return __hci_cmd_sync_status_sk(hdev, HCI_OP_LE_EXT_CREATE_CONN,
plen, data,
HCI_EV_LE_ENHANCED_CONN_COMPLETE,
conn->conn_timeout, NULL);
The reason there are 2 timers is that we need to track the number of
commands outstanding in the controller, and no you can't delay Command
Status:
When the Controller receives the HCI_Disconnect command, it _shall_ send the
HCI_Command_Status event to the Host.
So even if HCI_Disconnection_Complete is delayed the following shall
still work provided the HCI_Command_Status is still being generated
accordingly:
index 0badec7120ab..0ab607fb6433 100644
--- a/net/bluetooth/hci_sync.c
+++ b/net/bluetooth/hci_sync.c
@@ -5444,7 +5444,7 @@ static int hci_disconnect_sync(struct hci_dev
*hdev, struct hci_conn *conn,
return __hci_cmd_sync_status_sk(hdev, HCI_OP_DISCONNECT,
sizeof(cp), &cp,
HCI_EV_DISCONN_COMPLETE,
- HCI_CMD_TIMEOUT, NULL);
+ conn->disc_timeout, NULL);
return __hci_cmd_sync_status(hdev, HCI_OP_DISCONNECT, sizeof(cp), &cp,
HCI_CMD_TIMEOUT);
Then we need to adjust conn->disc_timeout according to supervision
timeout, that said it appears disc_timeout is actually acting as idle
timeout, so perhaps we need a separate field, also it doesn't look
like we track the supervision timeout for non-LE links.
> >>>> That said, we really need to fix bluetoothd if it is not able to be
> >>>> cleaned up if SET_POWERED command fails, but it looks like it is
> >>>> handling errors correctly so it sounds like something else is at play.
> >>>
> >> The issue arises after a 2-second timeout of hci_disconnect_sync. During
> >> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed.
> >> after supervision timeout, the disconnect complete event arrives, but
> >> it returns at line 3370 since the connection has already been removed.
> >> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT
> >> to the application layer (bluetoothctl), causing bluetoothctl to get stuck
> >> and unable to perform other mgmt commands.
> >
> > The command shall have completed regardless if disconnect complete has
> > been received or not, the is the whole point of having a timeout, so
> > this makes no sense to me, or you are not describing what is happening
> > here. Also there is no MGMT_OP_DISCONNECT pending, it is
> > MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event
> > that is a totally different thing and perhaps that is the source of
> > the problem because if we do cleanup hci_conn even in case the command
> > fails/times out then we should be generating a disconnected event as
> > well.
> >
> Here is the flow describe the issue. For normal case:
> ┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐
> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│
> └─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘
> │ disconnect cmd │ │ │ │
> │─────────────────────>│ │ │ │
> │ │ │ │ │
> │ │ MGMT_OP │ │ │
> │ │ _DISCONNECT │ │ │
> │ │───────────────────>│ │ │
> │ │ │ │ │
> │ │ │ HCI_Disconnect │ │
> │ │ │───────────────────>│ │
> │ │ │ │ │
> │ │ │ │ LL_TERMINATE │
> │ │ │ │ _IND │
> │ │ │ │───────────────────>│
> │ │ │ │ │
> │ │ │ │ ACK │
> │ │ │ │<───────────────────│
> │ │ │ │ │
> │ │ │ Disc_Comp_Evt │ │
> │ │ │<───────────────────│ │
> │ │ │ │ │
> │ │ MGMT Response │ │ │
> │ │<───────────────────│ │ │
> │ │ │ │ │
> │ disc succ │ │ │ │
> │<─────────────────────│ │ │ │
> ┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐
> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│
> └───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘
>
>
> The failure case like this:
>
> ┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐
> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│
> └─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘
> │ disconnect │ │ │ │
> │ cmd │ │ │ │
> │─────────────────────>│ │ │ │
> │ │ │ │ │
> │ │ MGMT_OP_ │ │ │
> │ │ DISCONNECT │ │ │
> │ │───────────────────>│ │ │
> │ │ │ │ │
> │ │ ┌┴┐ HCI_ │ │
> │ │ │ │ Disconnect │ │
> │ │ │ │ ──────────────────>│ │
> │ │ │ │ │ │
> │ │ │ │ │ LL_TERMINATE ┌┴┐
> │ │ │ │ │ _IND │ │
> │ │ │ │ │─────────────────> │ │
> │ │ │ │ │ │ │
> │ │ │ │ 2s │ │ │
> │ │ │ │ │ │ │
> │ │ │ │ │ │ │ More
> │ │ │ │ │ │ │ than
> │ │ │ │ │ │ │ 2s
> │ │ │ │ │ │ │
> │ │ │ │ │ │ │
> │ │ │ │ │ │ │
> │ │ └┬┘ │ │ │
> │ │ │────┐ │ │ │
> │ │ │ │ hci_disconnect │ │ │
> │ │ │<───┘ sync timeout, │ │ │
> │ │ │ del 'con' by │ │ │
> │ │ │ hci_conn_failed│ │ │
> │ │ │ │ └┬┘
> │ │ │ │ ACK │
> │ │ │ │<───────────────────│
> │ │ │ │ │
> │ │ │ Disc_Comp_Evt │ │
> │ │ │<────────────────────│ │
> │ │ │ │ │
> │ │ │────┐ │ │
> │ \│/ │ │ ignore the │ │
> │ X │<───┘ event since │ │
> │ /│\ │ 'con' has been │ │
> │ │ MGMT │ deleted │ │
> │ │ Response │ │ │
> │ │<─ ─ ─ ─ ─ ─ ─ ─ ─ ─│ │ │
> ┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐
> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│
> └───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘
>
>
> So in the failure case, the bluetoothd is blocked by waiting the mgmt response from kernel. From
> our test, bluetoothd can't accept any command related to mgmt from bluetothctl.
>
> >>
> >> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
> >> 3356 struct sk_buff *skb)
> >> 3357 {
> >> 3358 struct hci_ev_disconn_complete *ev = data;
> >> 3359 u8 reason;
> >> 3360 struct hci_conn_params *params;
> >> 3361 struct hci_conn *conn;
> >> 3362 bool mgmt_connected;
> >> 3363
> >> 3364 bt_dev_dbg(hdev, "status 0x%2.2x", ev->status);
> >> 3365
> >> 3366 hci_dev_lock(hdev);
> >> 3367
> >> 3368 conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle));
> >> 3369 if (!conn)
> >> 3370 goto unlock;
> >> 3371
> >> 3372 if (ev->status) {
> >> 3373 mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >> 3374 conn->dst_type, ev->status);
> >> 3375 goto unlock;
> >> 3376 }
> >> 3377
> >> 3378 conn->state = BT_CLOSED;
> >> 3379
> >> 3380 mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags);
> >> 3381
> >>
> >>> I double checked this and apparently this could no longer fail:
> >>>
> >>> + /* Disregard possible errors since hci_conn_del shall have been
> >>> + * called even in case of errors had occurred since it would
> >>> + * then cause hci_conn_failed to be called which calls
> >>> + * hci_conn_del internally.
> >>> + */
> >>> + hci_abort_conn_sync(hdev, conn, reason);
> >>>
> >>> So it will clean up the hci_conn no matter what is the timeout, so
> >>> either you don't have this change or it is not working for some
> >>> reason.
> >>>
> >> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is
> >> blocked. It does not happen during the power off stage. It happened when disconnect
> >> a BLE device, but the disconnect complete event sent from controller to host 2s later.
> >> Then it causes the mgmt in bluetoothctl is blocked as decribed as above.
> >
> > There is a difference about a MGMT command, initiated by bluetoothd,
> > versus a MGMT event initiated by the kernel, so the daemon is not
> > blocked it just don't get a disconnection event, which is different
> > than a command complete.
> >
> > What is the command sequence that you use to reproduce the problem?
> Here is the command log:
> [CHG] Controller 8C:FD:F0:21:81:87 Pairable: yes
> [bluetooth]# power on
> Changing power on succeeded
> [bluetooth]# connect CF:90:67:3C:7A:56
> Attempting to connect to CF:90:67:3C:7A:56
> [CHG] Device CF:90:67:3C:7A:56 Connected: yes
> Connection successful
> [CHG] Device CF:90:67:3C:7A:56 ServicesResolved: yes
> [Designer Mouse]#
> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
> Attempting to disconnect from D9:B5:6C:F2:51:91 ## no disconnection success response
> [Designer Mouse]#
> [Designer Mouse]# power off
> [Designer Mouse]#
> Failed to set power off: org.freedesktop.DBus.Error.NoReply
>
> To easily reproduce this issue, we use a firmware which always send the disconnect
> complete event more than 2s. Then the issue occurred 100%.
>
> Actually, the root cause is the hci_disconnect_sync doesn't handle this case since it
> relies on __hci_cmd_sync_status_sk, which maximum timeout value is constrained to 2s.
>
> >
> >>>>> hci_dev_lock(hdev);
> >>>>>
> >>>>> /* Check if the connection has been cleaned up concurrently */
> >>>>>
> >>>>> base-commit: e25c8d66f6786300b680866c0e0139981273feba
> >>>>> --
> >>>>> 2.34.1
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Luiz Augusto von Dentz
> >>>
> >>>
> >>>
> >>
> >
> >
>
--
Luiz Augusto von Dentz
Powered by blists - more mailing lists