lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABBYNZJuic=HfeF1-ybuKELCOEOYk9OWtvqXC4vyrnnZLUp7RQ@mail.gmail.com>
Date: Mon, 16 Dec 2024 22:15:45 -0500
From: Luiz Augusto von Dentz <luiz.dentz@...il.com>
To: "Cheng Jiang (IOE)" <quic_chejiang@...cinc.com>
Cc: Marcel Holtmann <marcel@...tmann.org>, Johan Hedberg <johan.hedberg@...il.com>, 
	"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, 
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, 
	linux-bluetooth@...r.kernel.org, netdev@...r.kernel.org, 
	linux-kernel@...r.kernel.org, quic_jiaymao@...cinc.com, 
	quic_shuaz@...cinc.com, quic_zijuhu@...cinc.com, quic_mohamull@...cinc.com
Subject: Re: [PATCH v1] Bluetooth: hci_sync: Fix disconnect complete event
 timeout issue

Hi Cheng,

On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE)
<quic_chejiang@...cinc.com> wrote:
>
> Hi Luiz,
>
> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote:
> > Hi Cheng,
> >
> > On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz
> > <luiz.dentz@...il.com> wrote:
> >>
> >> Hi Cheng,
> >>
> >> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@...cinc.com> wrote:
> >>>
> >>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND
> >>> in time, requiring the controller to wait for the supervision timeout,
> >>> which may exceed 2 seconds. In the current implementation, the
> >>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since
> >>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds.
> >>> This causes the mgmt to get stuck, resulting in bluetoothd waiting
> >>> indefinitely for the mgmt response to the disconnect. To recover,
> >>> restarting bluetoothd is necessary.
> >>>
> >>> bluetoothctl log like this:
> >>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
> >>> Attempting to disconnect from D9:B5:6C:F2:51:91
> >>> [Designer Mouse]#
> >>> [Designer Mouse]# power off
> >>> [Designer Mouse]#
> >>> Failed to set power off: org.freedesktop.DBus.Error.NoReply.
> >>>
> >>> Signed-off-by: Cheng Jiang <quic_chejiang@...cinc.com>
> >>> ---
> >>>  include/net/bluetooth/hci_core.h |  2 ++
> >>>  net/bluetooth/hci_conn.c         |  9 +++++++++
> >>>  net/bluetooth/hci_event.c        |  9 +++++++++
> >>>  net/bluetooth/hci_sync.c         | 18 ++++++++++++++++++
> >>>  4 files changed, 38 insertions(+)
> >>>
> >>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
> >>> index 734cd50cd..2ab079dcf 100644
> >>> --- a/include/net/bluetooth/hci_core.h
> >>> +++ b/include/net/bluetooth/hci_core.h
> >>> @@ -753,6 +753,8 @@ struct hci_conn {
> >>>
> >>>         struct bt_codec codec;
> >>>
> >>> +       struct completion disc_ev_comp;
> >>> +
> >>>         void (*connect_cfm_cb)  (struct hci_conn *conn, u8 status);
> >>>         void (*security_cfm_cb) (struct hci_conn *conn, u8 status);
> >>>         void (*disconn_cfm_cb)  (struct hci_conn *conn, u8 reason);
> >>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
> >>> index d097e308a..e0244e191 100644
> >>> --- a/net/bluetooth/hci_conn.c
> >>> +++ b/net/bluetooth/hci_conn.c
> >>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t
> >>>
> >>>         hci_conn_init_sysfs(conn);
> >>>
> >>> +       /* This disc_ev_comp is inited when we send a disconnect request to
> >>> +        * the remote device but fail to receive the disconnect complete
> >>> +        * event within the expected time (2 seconds). This occurs because
> >>> +        * the remote device doesn't ack the terminate indication, forcing
> >>> +        * the controller to wait for the supervision timeout.
> >>> +        */
> >>> +       init_completion(&conn->disc_ev_comp);
> >>> +       complete(&conn->disc_ev_comp);
> >>> +
> >>>         return conn;
> >>>  }
> >>>
> >>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
> >>> index 2cc7a9306..60ecb2b18 100644
> >>> --- a/net/bluetooth/hci_event.c
> >>> +++ b/net/bluetooth/hci_event.c
> >>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
> >>>         if (!conn)
> >>>                 goto unlock;
> >>>
> >>> +       /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock
> >>> +        * hci_abort_conn_sync will wait hdev lock release to continue.
> >>> +        */
> >>> +       if (!completion_done(&conn->disc_ev_comp)) {
> >>> +               complete(&conn->disc_ev_comp);
> >>> +               /* Add some delay for hci_abort_conn_sync to handle the complete */
> >>> +               usleep_range(100, 1000);
> >>> +       }
> >>> +
> >>>         if (ev->status) {
> >>>                 mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >>>                                        conn->dst_type, ev->status);
> >>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
> >>> index 0badec712..783d04b57 100644
> >>> --- a/net/bluetooth/hci_sync.c
> >>> +++ b/net/bluetooth/hci_sync.c
> >>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason)
> >>>                 break;
> >>>         }
> >>>
> >>> +       /* Check whether the connection is successfully disconnected.
> >>> +        * Sometimes the remote device doesn't acknowledge the
> >>> +        * LL_TERMINATE_IND in time, requiring the controller to wait
> >>> +        * for the supervision timeout, which may exceed 2 seconds. In
> >>> +        * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE
> >>> +        * event before cleaning up the connection.
> >>> +        */
> >>> +       if (err == -ETIMEDOUT) {
> >>> +               u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout);
> >>> +
> >>> +               reinit_completion(&conn->disc_ev_comp);
> >>> +               if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) {
> >>> +                       bt_dev_warn(hdev, "Failed to get complete");
> >>> +                       mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >>> +                                              conn->dst_type, conn->abort_reason);
> >>> +               }
> >>> +       }
> >>
> >> Why don't we just set the supervision timeout as timeout then? If we
> >> will have to wait for it anyway just change hci_disconnect_sync to use
> >> 10 * conn->le_supv_timeout as timeout instead.
> >>
> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the
> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work.
> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait
> for the event. So even in hci_disconnect_sync we can set the timeout to
> supervision timeout, hci_disconnect_sync still timeout after 2s.

Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use
the supervision timeout instead? If it still timeout after to 2
seconds then there is something still forcing HCI_CMD_TIMEOUT which
shouldn't happen.

> >> That said, we really need to fix bluetoothd if it is not able to be
> >> cleaned up if SET_POWERED command fails, but it looks like it is
> >> handling errors correctly so it sounds like something else is at play.
> >
> The issue arises after a 2-second timeout of hci_disconnect_sync. During
> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed.
> after supervision timeout, the disconnect complete event arrives, but
> it returns at line 3370 since the connection has already been removed.
> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT
> to the application layer (bluetoothctl), causing bluetoothctl to get stuck
> and unable to perform other mgmt commands.

The command shall have completed regardless if disconnect complete has
been received or not, the is the whole point of having a timeout, so
this makes no sense to me, or you are not describing what is happening
here. Also there is no MGMT_OP_DISCONNECT pending, it is
MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event
that is a totally different thing and perhaps that is the source of
the problem because if we do cleanup hci_conn even in case the command
fails/times out then we should be generating a disconnected event as
well.

>
> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
> 3356              struct sk_buff *skb)
> 3357 {
> 3358   struct hci_ev_disconn_complete *ev = data;
> 3359   u8 reason;
> 3360   struct hci_conn_params *params;
> 3361   struct hci_conn *conn;
> 3362   bool mgmt_connected;
> 3363
> 3364   bt_dev_dbg(hdev, "status 0x%2.2x", ev->status);
> 3365
> 3366   hci_dev_lock(hdev);
> 3367
> 3368   conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle));
> 3369   if (!conn)
> 3370     goto unlock;
> 3371
> 3372   if (ev->status) {
> 3373     mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> 3374                conn->dst_type, ev->status);
> 3375     goto unlock;
> 3376   }
> 3377
> 3378   conn->state = BT_CLOSED;
> 3379
> 3380   mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags);
> 3381
>
> > I double checked this and apparently this could no longer fail:
> >
> > +               /* Disregard possible errors since hci_conn_del shall have been
> > +                * called even in case of errors had occurred since it would
> > +                * then cause hci_conn_failed to be called which calls
> > +                * hci_conn_del internally.
> > +                */
> > +               hci_abort_conn_sync(hdev, conn, reason);
> >
> > So it will clean up the hci_conn no matter what is the timeout, so
> > either you don't have this change or it is not working for some
> > reason.
> >
> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is
> blocked. It does not happen during the power off stage. It happened when disconnect
> a BLE device, but the disconnect complete event sent from controller to host 2s later.
> Then it causes the mgmt in bluetoothctl is blocked as decribed as above.

There is a difference about a MGMT command, initiated by bluetoothd,
versus a MGMT event initiated by the kernel, so the daemon is not
blocked it just don't get a disconnection event, which is different
than a command complete.

What is the command sequence that you use to reproduce the problem?

> >>>         hci_dev_lock(hdev);
> >>>
> >>>         /* Check if the connection has been cleaned up concurrently */
> >>>
> >>> base-commit: e25c8d66f6786300b680866c0e0139981273feba
> >>> --
> >>> 2.34.1
> >>>
> >>
> >>
> >> --
> >> Luiz Augusto von Dentz
> >
> >
> >
>


-- 
Luiz Augusto von Dentz

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ