[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <a4dc4f221a6e1de559e1a24a93e7b13d3fed130e.camel@iki.fi>
Date: Thu, 18 Dec 2025 20:11:00 +0200
From: Pauli Virtanen <pav@....fi>
To: Luiz Augusto von Dentz <luiz.dentz@...il.com>
Cc: Cihangir Akturk <cakturk@...il.com>, linux-bluetooth@...r.kernel.org,
Marcel Holtmann <marcel@...tmann.org>, Johan Hedberg
<johan.hedberg@...il.com>, linux-kernel@...r.kernel.org,
syzbot+87badbb9094e008e0685@...kaller.appspotmail.com
Subject: Re: [PATCH] Bluetooth: hci: fix refcounts in LE remote features
command
Hi,
to, 2025-12-18 kello 12:09 -0500, Luiz Augusto von Dentz kirjoitti:
> Hi Pauli,
>
> On Thu, Dec 18, 2025 at 11:33 AM Pauli Virtanen <pav@....fi> wrote:
> >
> > Hi,
> >
> > ke, 2025-12-17 kello 16:36 -0500, Luiz Augusto von Dentz kirjoitti:
> > > Hi Cihangir,
> > >
> > > On Tue, Dec 16, 2025 at 2:13 PM Cihangir Akturk <cakturk@...il.com> wrote:
> > > >
> > > > KASAN reported a slab-use-after-free in le_read_features_complete()
> > > > running from hci_cmd_sync_work. le_read_features_complete() can run
> > > > after hci_rx_work/hci_conn_del() has removed the link, so the destroy
> > > > callback may touch a freed hci_conn and trigger a KASAN use-after-free.
> > > > Duplicate submissions also need to drop the references to avoid leaking
> > > > the hold and blocking teardown.
> > > >
> > > > Fix this by taking hci_conn_get() before queueing, passing the conn
> > > > directly as work data, and balancing hci_conn_hold()/drop() and
> > > > hci_conn_get()/put() in the completion and all error/-EEXIST paths so
> > > > the connection stays referenced while the work runs and is released
> > > > afterwards.
> > > >
> > > > Reported-by: syzbot+87badbb9094e008e0685@...kaller.appspotmail.com
> > > > Signed-off-by: Cihangir Akturk <cakturk@...il.com>
> > > > ---
> > > > net/bluetooth/hci_sync.c | 37 ++++++++++++++++++++++++++-----------
> > > > 1 file changed, 26 insertions(+), 11 deletions(-)
> > > >
> > > > I am not entirely sure that removing the err == -ECANCELED early return
> > > > is strictly correct. My assumption is that, with the changes in this
> > > > patch, there does not appear to be another cancellation path that
> > > > reliably balances hci_conn_drop() and hci_conn_put() for this command.
> > > > Based on that assumption, keeping the early return could leave the
> > > > references taken before queuing unbalanced on cancellation, so I opted
> > > > to remove it to keep the lifetime handling consistent.
> > > >
> > > > diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
> > > > index a9f5b1a68356..5a3d288e7015 100644
> > > > --- a/net/bluetooth/hci_sync.c
> > > > +++ b/net/bluetooth/hci_sync.c
> > > > @@ -776,14 +776,23 @@ _hci_cmd_sync_lookup_entry(struct hci_dev *hdev, hci_cmd_sync_work_func_t func,
> > > > * - Lookup if an entry already exist and only if it doesn't creates a new entry
> > > > * and queue it.
> > > > */
> > > > -int hci_cmd_sync_queue_once(struct hci_dev *hdev, hci_cmd_sync_work_func_t func,
> > > > +static int __hci_cmd_sync_queue_once(struct hci_dev *hdev, hci_cmd_sync_work_func_t func,
> > > > void *data, hci_cmd_sync_work_destroy_t destroy)
> > > > {
> > > > if (hci_cmd_sync_lookup_entry(hdev, func, data, destroy))
> > > > - return 0;
> > > > + return -EEXIST;
> > > >
> > > > return hci_cmd_sync_queue(hdev, func, data, destroy);
> > > > }
> > > > +
> > > > +int hci_cmd_sync_queue_once(struct hci_dev *hdev, hci_cmd_sync_work_func_t func,
> > > > + void *data, hci_cmd_sync_work_destroy_t destroy)
> > > > +{
> > > > + int ret;
> > > > +
> > > > + ret = __hci_cmd_sync_queue_once(hdev, func, data, destroy);
> > > > + return ret == -EEXIST ? 0 : ret;
> > > > +}
> > > > EXPORT_SYMBOL(hci_cmd_sync_queue_once);
> > > >
> > > > /* Run HCI command:
> > > > @@ -7338,10 +7347,8 @@ static void le_read_features_complete(struct hci_dev *hdev, void *data, int err)
> > > >
> > > > bt_dev_dbg(hdev, "err %d", err);
> > > >
> > > > - if (err == -ECANCELED)
> > > > - return;
> > > > -
> > > > hci_conn_drop(conn);
> > > > + hci_conn_put(conn);
> > > > }
> > > >
> > > > static int hci_le_read_all_remote_features_sync(struct hci_dev *hdev,
> > > > @@ -7408,12 +7415,20 @@ int hci_le_read_remote_features(struct hci_conn *conn)
> > > > * role is possible. Otherwise just transition into the
> > > > * connected state without requesting the remote features.
> > > > */
> > > > - if (conn->out || (hdev->le_features[0] & HCI_LE_PERIPHERAL_FEATURES))
> > > > - err = hci_cmd_sync_queue_once(hdev,
> > > > - hci_le_read_remote_features_sync,
> > > > - hci_conn_hold(conn),
> > > > - le_read_features_complete);
> > > > - else
> > > > + if (conn->out || (hdev->le_features[0] & HCI_LE_PERIPHERAL_FEATURES)) {
> > > > + hci_conn_get(conn);
> > > > + hci_conn_hold(conn);
> > > > + err = __hci_cmd_sync_queue_once(hdev,
> > > > + hci_le_read_remote_features_sync,
> > > > + conn,
> > > > + le_read_features_complete);
> > > > + if (err) {
> > > > + hci_conn_drop(conn);
> > > > + hci_conn_put(conn);
> > > > + if (err == -EEXIST)
> > > > + err = 0;
> > > > + }
> > > > + } else
> > >
> > > Sort of overkill, why do we have to use 2 references? Also we do have
> > > code for dequeuing callbacks using conn as user_data so either that is
> > > not working or there is something else at play here. Maybe we need to
> > > change the order so that dequeue happens before hci_conn_cleanup:
> > > diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
> > > index dc085856f5e9..b64c0e53d9cd 100644
> > > --- a/net/bluetooth/hci_conn.c
> > > +++ b/net/bluetooth/hci_conn.c
> > > @@ -1232,15 +1232,15 @@ void hci_conn_del(struct hci_conn *conn)
> > > skb_queue_purge(&conn->data_q);
> > > skb_queue_purge(&conn->tx_q.queue);
> > >
> > > + /* Dequeue callbacks using connection pointer as data */
> > > + hci_cmd_sync_dequeue(hdev, NULL, conn, NULL);
> > > +
> >
> > hci_cmd_sync_dequeue() does not (i) cancel + wait for a job that is
> > already running, (ii) prevent further jobs for this conn from being
> > queued. So it's not guaranteed to work here AFAICS.
> >
> > For (i): note running hci_sync job may be blocked on taking hdev->lock,
> > which is held here, so trying to cancel + wait deadlocks. Does not seem
> > straightforward to fix.
>
> Hmm, there is a lock though that is used when running the callbacks:
>
> hci_req_sync_lock(hdev);
> err = entry->func(hdev, entry->data);
> if (entry->destroy)
> entry->destroy(hdev, entry->data, err);
> hci_req_sync_unlock(hdev);
>
> We could attempt to acquire hci_req_sync_lock on dequeue, but it looks
> like there are code paths that already do call hci_conn_del with that
> lock so the likes of mgmt-tester deadlock, anyway if there are code
> paths already doing with hci_req_sync_lock held then it should be safe
> to already require it when doing hci_conn_del or maybe rename
> hci_conn_del to hci_conn_del_sync and then have hci_conn_del
> performing the hci_req_sync_lock before calling hci_conn_del_sync then
> work out the code paths where hci_req_sync_lock is already held to use
> hci_conn_del_sync.
Taking hci_req_sync() before hci_conn_del() seems to face some issues:
[hdev->req_workqueue] hci_disconnect_sync()
[hdev->req_workqueue] __hci_cmd_sync_status() -- req_lock still held
[hdev->workqueue] hci_disconn_complete_evt()
[hdev->workqueue] hci_conn_del() -- blocks on req_lock
since the command may end up waiting for hci_conn_del. The disconnect
may also be spontaneous from controller, so looks like any
__hci_cmd_sync_status() could lock up.
AFAICS, you'd need something like a dedicated hdev->workqueue work that
does the final hci_conn put under req_lock, so that there is no job
running in either workqueue during the final put. And get it also right
in the hdev teardown.
***
For how the other refcount / lock approach would look in practice, iirc
the earlier RFC had all hci_lookup callsites dealt with (aside from
6lowpan)
https://lore.kernel.org/linux-bluetooth/cover.1758481869.git.pav@iki.fi/
> > For (ii): one would need to audit the places where these jobs are
> > queued, and make sure they are all done with hdev->lock held, to avoid
> > racing with the code here. Maybe doable with separate queueing function
> > that has lockdep asserts.
> >
> > I suggested some time ago to always hold either refcount or lock to
> > keep the hci_conn alive everywhere, also in these hci_sync callbacks:
> >
> > https://lore.kernel.org/linux-bluetooth/cover.1762100290.git.pav@iki.fi/
> >
> > with similar changes as suggested in this patch. This may be the
> > simpler fix.
> >
> > > /* Remove the connection from the list and cleanup its remaining
> > > * state. This is a separate function since for some cases like
> > > * BT_CONNECT_SCAN we *only* want the cleanup part without the
> > > * rest of hci_conn_del.
> > > */
> > > hci_conn_cleanup(conn);
> > > -
> > > - /* Dequeue callbacks using connection pointer as data */
> > > - hci_cmd_sync_dequeue(hdev, NULL, conn, NULL);
> > > }
> > > struct hci_dev *hci_get_route(bdaddr_t *dst, bdaddr_t *src, uint8_t src_type)
> > >
> > >
> > > > err = -EOPNOTSUPP;
> > > >
> > > > return err;
> > > > --
> > > > 2.52.0
> > > >
> > >
> >
> > --
> > Pauli Virtanen
>
>
Powered by blists - more mailing lists