netdev - Re: [BUG] mlx5_core general protection fault in mlx5_cmd_comp_handler

Message-ID: <Y0fJ6P943FuVZ3k1@unreal>
Date:   Thu, 13 Oct 2022 11:18:48 +0300
From:   Leon Romanovsky <leon@...nel.org>
To:     Jinpu Wang <jinpu.wang@...os.com>
Cc:     netdev <netdev@...r.kernel.org>,
        RDMA mailing list <linux-rdma@...r.kernel.org>,
        Moshe Shemesh <moshe@...dia.com>,
        Saeed Mahameed <saeedm@...dia.com>,
        Tariq Toukan <tariqt@...dia.com>,
        Maor Gottlieb <maorg@...dia.com>, Shay Drory <shayd@...dia.com>
Subject: Re: [BUG] mlx5_core general protection fault in mlx5_cmd_comp_handler

On Wed, Oct 12, 2022 at 01:55:55PM +0200, Jinpu Wang wrote:
> Hi Leon, hi Saeed,
> 
> We have seen crashes during server shutdown on both kernel 5.10 and
> kernel 5.15, with a GPF in the mlx5 function mlx5_cmd_comp_handler.
> 
> All of the crashes point to
> 
> 1606         memcpy(ent->out->first.data, ent->lay->out, sizeof(ent->lay->out));
> 
> I guess it's a kind of use-after-free of the ent buffer. I tried to
> reproduce it by repeatedly rebooting the test servers, but with no
> success so far.

My guess is that the command interface is not flushed, but Moshe and I
didn't see how that can happen.

  1206         INIT_DELAYED_WORK(&ent->cb_timeout_work, cb_timeout_handler);
  1207         INIT_WORK(&ent->work, cmd_work_handler);
  1208         if (page_queue) {
  1209                 cmd_work_handler(&ent->work);
  1210         } else if (!queue_work(cmd->wq, &ent->work)) {
                          ^^^^^^^ this is what causes the splat
  1211                 mlx5_core_warn(dev, "failed to queue work\n");
  1212                 err = -EALREADY;
  1213                 goto out_free;
  1214         }
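
The flush hypothesis above can be illustrated with a minimal userspace
sketch. It is a toy model under stated assumptions, not the mlx5 code: the
names cmd_ent, cmd_queue, cmd_queue_submit, cmd_complete, cmd_queue_flush
and cmd_queue_teardown are all hypothetical, and the "queue" is a plain
array drained synchronously rather than a kernel workqueue. It only shows
the ordering that matters: every pending entry is completed before the
buffers that the completion path copies into are freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model; none of these names are kernel or mlx5 APIs. */
#define MAX_PENDING 8

struct cmd_ent {
    char out[16];    /* stand-in for the reply layout (ent->lay->out) */
    char *dst;       /* stand-in for the caller's buffer (ent->out->first.data) */
    bool completed;
};

struct cmd_queue {
    struct cmd_ent *pending[MAX_PENDING];
    size_t count;
};

/* Queue an entry; returns false when the queue is full. */
static bool cmd_queue_submit(struct cmd_queue *q, struct cmd_ent *ent)
{
    if (q->count == MAX_PENDING)
        return false;
    q->pending[q->count++] = ent;
    return true;
}

/* Completion path: copies the reply into the caller's buffer.
 * This is the step that faults if ent or ent->dst was already freed. */
static void cmd_complete(struct cmd_ent *ent)
{
    assert(ent->dst != NULL);
    memcpy(ent->dst, ent->out, sizeof(ent->out));
    ent->completed = true;
}

/* Run every pending completion; returns how many entries were completed. */
static size_t cmd_queue_flush(struct cmd_queue *q)
{
    size_t done = 0;

    for (size_t i = 0; i < q->count; i++) {
        cmd_complete(q->pending[i]);
        q->pending[i] = NULL;
        done++;
    }
    q->count = 0;
    return done;
}

/* Teardown: flush first, and only then free the entries and their buffers.
 * Freeing before flushing would leave the queue holding dangling pointers. */
static size_t cmd_queue_teardown(struct cmd_queue *q,
                                 struct cmd_ent **ents, size_t n)
{
    size_t flushed = cmd_queue_flush(q);

    for (size_t i = 0; i < n; i++) {
        free(ents[i]->dst);
        free(ents[i]);
        ents[i] = NULL;
    }
    return flushed;
}
```

In this model, calling cmd_queue_teardown() with entries still queued
completes them first and then frees them. Swapping the two steps inside
cmd_queue_teardown() would make cmd_complete() copy into freed memory,
which is the kind of failure suspected here.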

<...>
> 
> Is this problem known, maybe already fixed?

I don't see any fixes that exist in 6.0 but are missing from 5.5.32.
Is it possible to reproduce this on latest upstream code?
And what is your FW version?


> I briefly checked git and didn't see anything. Could you give me a hint?
> 
> 
> Thanks!
> Jinpu Wang @ IONOS