netdev - Re: [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource destruction upon firmware failure


Message-ID: <20230321120259.GT36557@unreal>
Date:   Tue, 21 Mar 2023 14:02:59 +0200
From:   Leon Romanovsky <leon@...nel.org>
To:     Jason Gunthorpe <jgg@...dia.com>
Cc:     Patrisious Haddad <phaddad@...dia.com>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>, linux-rdma@...r.kernel.org,
        netdev@...r.kernel.org, Paolo Abeni <pabeni@...hat.com>,
        Saeed Mahameed <saeedm@...dia.com>
Subject: Re: [PATCH rdma-next v1 2/3] RDMA/mlx5: Handling dct common resource
 destruction upon firmware failure

On Tue, Mar 21, 2023 at 08:53:35AM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 21, 2023 at 09:54:58AM +0200, Leon Romanovsky wrote:
> > On Mon, Mar 20, 2023 at 04:18:14PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Mar 16, 2023 at 03:39:27PM +0200, Leon Romanovsky wrote:
> > > > From: Patrisious Haddad <phaddad@...dia.com>
> > > > 
> > > > Previously, when destroying a DCT, the common resource was destroyed
> > > > even if the firmware destroy command failed, since it was destroyed
> > > > before the firmware object.
> > > > A second destruction attempt then destroyed the common resource again
> > > > and executed refcount_dec_and_test(&common->refcount), which triggered
> > > > the kernel warning "refcount_t: underflow", indicating a possible
> > > > use-after-free.
> > > > 
> > > > So now, before destroying the common resource, we check its refcount
> > > > and continue with the destruction only if it isn't zero.
> > > 
> > > This seems super sketchy
> > > 
> > > If the destruction fails why not set the refcount back to 1?
> > 
> > Because destruction will fail in destroy_rq_tracked(), which is called
> > after destroy_resource_common().
> > 
> > In the first destruction attempt, we delete the qp from the radix tree and
> > wait for all references to drop. To avoid undoing all of this logic (setting
> > the refcount to 1 alone is not enough), it is much safer to simply skip
> > destroy_resource_common() in the reentry case.
> 
> This is the bug I pointed out a long time ago: it is the wrong order to
> remove restrack before destruction is assured

It is not restrack, but a structure internal to mlx5_core.

static void destroy_resource_common(struct mlx5_ib_dev *dev,
                                    struct mlx5_core_qp *qp)
{
        struct mlx5_qp_table *table = &dev->qp_table;
        unsigned long flags;

....

        spin_lock_irqsave(&table->lock, flags);
        radix_tree_delete(&table->tree,
                          qp->qpn | (qp->common.res << MLX5_USER_INDEX_LEN));
        spin_unlock_irqrestore(&table->lock, flags);
        mlx5_core_put_rsc((struct mlx5_core_rsc_common *)qp);
        wait_for_completion(&qp->common.free);
}
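
To make the reentry problem concrete, here is a minimal userspace sketch of the pattern, not the driver code. Everything in it is a simplified stand-in chosen for illustration: struct rsc_common, put_rsc(), destroy_common() and destroy_dct() are hypothetical names, a plain int replaces refcount_t, a bool replaces the completion, and the fw_fails flag models a failing firmware destroy command. It only shows why a second pass through the common teardown underflows the refcount unless that pass is skipped.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the common resource (not the mlx5 struct). */
struct rsc_common {
        int refcount;    /* the kernel uses refcount_t */
        bool freed;      /* stands in for the "free" completion */
        bool in_table;   /* stands in for the radix tree entry */
        int underflows;  /* counts would-be "refcount_t: underflow" warnings */
};

/* Drop one reference; record an underflow instead of warning. */
void put_rsc(struct rsc_common *c)
{
        if (c->refcount == 0) {
                c->underflows++;
                return;
        }
        if (--c->refcount == 0)
                c->freed = true;  /* the kernel would signal the completion */
}

/* Common teardown. With guarded == true, a reentry is skipped. */
void destroy_common(struct rsc_common *c, bool guarded)
{
        if (guarded && c->refcount == 0)
                return;  /* already torn down by a failed first attempt */

        c->in_table = false;  /* remove from the lookup table */
        put_rsc(c);
        /* the kernel would wait here until the last reference is dropped */
}

/*
 * Destroy flow in the order discussed in this thread: the common
 * resource goes first, then the firmware object. Returns -1 when the
 * (simulated) firmware command fails, so the caller may retry.
 */
int destroy_dct(struct rsc_common *c, bool fw_fails, bool guarded)
{
        destroy_common(c, guarded);
        if (fw_fails)
                return -1;
        return 0;
}
```

With guarded == false, a failed attempt followed by a retry increments underflows once; with guarded == true the retry skips the common teardown and the counter stays at zero. The sketch does not address the ordering question raised above, it only illustrates the skip-on-reentry behaviour.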


> 
> Jason