netdev - Re: [BUG] mlx5_core general protection fault in mlx5_cmd_comp


Message-ID: <CAMGffEmFCgKv-6XNXjAKzr5g6TtT_=wj6H62AdGCUXx4hruxBQ@mail.gmail.com>
Date:   Thu, 13 Oct 2022 10:32:55 +0200
From:   Jinpu Wang <jinpu.wang@...os.com>
To:     Leon Romanovsky <leon@...nel.org>
Cc:     netdev <netdev@...r.kernel.org>,
        RDMA mailing list <linux-rdma@...r.kernel.org>,
        Moshe Shemesh <moshe@...dia.com>,
        Saeed Mahameed <saeedm@...dia.com>,
        Tariq Toukan <tariqt@...dia.com>,
        Maor Gottlieb <maorg@...dia.com>, Shay Drory <shayd@...dia.com>
Subject: Re: [BUG] mlx5_core general protection fault in mlx5_cmd_comp_handler

On Thu, Oct 13, 2022 at 10:18 AM Leon Romanovsky <leon@...nel.org> wrote:
>
> On Wed, Oct 12, 2022 at 01:55:55PM +0200, Jinpu Wang wrote:
> > Hi Leon, hi Saeed,
> >
> > We have seen crashes during server shutdown on both kernel 5.10 and
> > kernel 5.15, with a GPF in the mlx5 function mlx5_cmd_comp_handler.
> >
> > All of the crashes point to
> >
> > 1606         memcpy(ent->out->first.data, ent->lay->out, sizeof(ent->lay->out));
> >
> > I guess it's some kind of use-after-free of the ent buffer. I tried to
> > reproduce it by repeatedly rebooting the test servers, but no success so far.
>
> My guess is that the command interface is not flushed, but Moshe and I
> didn't see how that can happen.
>
>         INIT_DELAYED_WORK(&ent->cb_timeout_work, cb_timeout_handler);
>         INIT_WORK(&ent->work, cmd_work_handler);
>         if (page_queue) {
>                 cmd_work_handler(&ent->work);
>         } else if (!queue_work(cmd->wq, &ent->work)) {
>                     ^^^^^^^^^^ this is what is causing the splat
>                 mlx5_core_warn(dev, "failed to queue work\n");
>                 err = -EALREADY;
>                 goto out_free;
>         }
>
> <...>
> >
> > Is this problem known, maybe already fixed?
>
> I don't see any missing Fixes that exist in 6.0 and don't exist in 5.5.32.
> Is it possible to reproduce this on latest upstream code?
I haven't been able to reproduce it. As mentioned above, I tried to
reproduce it by simply rebooting in a loop, with no luck yet.
Do you have any suggestions to speed up the reproduction?
Once I can reproduce it, I can also try with kernel 6.0.
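
To make the suspected failure mode concrete, here is a minimal userspace sketch of the lifetime problem guessed at above: a completion handler copying into a command entry that another path may already have released. This is not driver code. Every name in it (sketch_ent, sketch_ent_get, sketch_ent_put, sketch_comp_handler, sketch_demo) is hypothetical, the reference count is a plain int, and it runs single-threaded, so it only illustrates the ordering that would keep the copy safe. Whether the real command entry lifetime matches this model is exactly the open question.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for a command entry; not an mlx5 type. */
struct sketch_ent {
    int refcnt;                 /* plain int: single-threaded illustration only */
    unsigned char lay_out[16];  /* area the "device" writes its reply into */
    unsigned char *out;         /* caller-visible output buffer */
    int *freed_flag;            /* set when the entry is released (demo only) */
};

static struct sketch_ent *sketch_ent_alloc(unsigned char *out, int *freed_flag)
{
    struct sketch_ent *ent = calloc(1, sizeof(*ent));

    if (!ent)
        return NULL;
    ent->refcnt = 1;            /* reference held by the submitter */
    ent->out = out;
    ent->freed_flag = freed_flag;
    *freed_flag = 0;
    return ent;
}

static void sketch_ent_get(struct sketch_ent *ent)
{
    ent->refcnt++;
}

static void sketch_ent_put(struct sketch_ent *ent)
{
    if (--ent->refcnt == 0) {
        *ent->freed_flag = 1;
        free(ent);
    }
}

/* Completion path: copies the reply, then drops the completion's reference. */
static void sketch_comp_handler(struct sketch_ent *ent)
{
    memcpy(ent->out, ent->lay_out, sizeof(ent->lay_out));
    sketch_ent_put(ent);
}

/* Returns 1 if the entry survived until the completion ran, 0 or -1 otherwise. */
static int sketch_demo(void)
{
    unsigned char out[16] = {0};
    int freed = 0;
    struct sketch_ent *ent = sketch_ent_alloc(out, &freed);

    if (!ent)
        return -1;
    memset(ent->lay_out, 0xab, sizeof(ent->lay_out));
    sketch_ent_get(ent);        /* reference for the pending completion */
    sketch_ent_put(ent);        /* submitter lets go early, e.g. at teardown */
    if (freed)
        return 0;               /* this would be the use-after-free case */
    sketch_comp_handler(ent);   /* safe: the completion still holds a reference */
    return freed == 1 && out[0] == 0xab && out[15] == 0xab;
}
```

Calling sketch_demo() exercises the "submitter releases first, completion runs later" ordering. Without the sketch_ent_get() call, the entry would be freed before the memcpy, which is the pattern the crash location suggests.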

> And what is your FW version?
Here is the ibstat output:
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.27.2008
        Hardware version: 0
        Node GUID: 0x08c0eb030054b372
        System image GUID: 0x08c0eb030054b372
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 15
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x08c0eb030054b372
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.27.2008
        Hardware version: 0
        Node GUID: 0x08c0eb030054b373
        System image GUID: 0x08c0eb030054b372
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 12
                LMC: 0
                SM lid: 4
                Capability mask: 0x2651e848
                Port GUID: 0x08c0eb030054b373
                Link layer: InfiniBand
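
If it helps to compare firmware across many hosts, the version can be pulled out of saved ibstat text with a small helper. This is only a sketch under the assumption that each CA block contains a "Firmware version:" line as in the output above. The function name sketch_fw_version and its interface are made up for illustration and are not part of any existing tool.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the firmware version of the n-th CA (0-based)
 * found in ibstat-style text into buf. Returns 0 on success, -1 if there is
 * no such entry or buf is too small. Leading indentation is ignored because
 * the search is by substring.
 */
static int sketch_fw_version(const char *text, int n, char *buf, size_t buflen)
{
    static const char key[] = "Firmware version:";
    const char *p = text;

    while ((p = strstr(p, key)) != NULL) {
        size_t len;

        p += sizeof(key) - 1;           /* step past the key */
        if (n-- > 0)
            continue;                   /* not the requested CA yet */
        while (*p == ' ' || *p == '\t')
            p++;                        /* skip spacing after the colon */
        len = strcspn(p, "\r\n");       /* value runs to the end of the line */
        if (len >= buflen)
            return -1;
        memcpy(buf, p, len);
        buf[len] = '\0';
        return 0;
    }
    return -1;
}
```

For the output above, asking for index 0 and index 1 should both yield 16.27.2008, since both CAs report the same firmware.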


Thanks for your help!
>
>
> > I briefly checked the git, don't see anything, could you give me some hint?
> >
> >
> > Thanks!
> > Jinpu Wang @ IONOS