Date:   Wed, 9 Nov 2022 10:51:11 +0100
From:   Jinpu Wang <jinpu.wang@...os.com>
To:     Leon Romanovsky <leon@...nel.org>
Cc:     netdev <netdev@...r.kernel.org>,
        RDMA mailing list <linux-rdma@...r.kernel.org>,
        Moshe Shemesh <moshe@...dia.com>,
        Saeed Mahameed <saeedm@...dia.com>,
        Tariq Toukan <tariqt@...dia.com>,
        Maor Gottlieb <maorg@...dia.com>, Shay Drory <shayd@...dia.com>
Subject: Re: [BUG] mlx5_core general protection fault in mlx5_cmd_comp_handler

On Mon, Oct 17, 2022 at 7:54 AM Jinpu Wang <jinpu.wang@...os.com> wrote:
>
> On Thu, Oct 13, 2022 at 12:27 PM Leon Romanovsky <leon@...nel.org> wrote:
> >
> > On Thu, Oct 13, 2022 at 10:32:55AM +0200, Jinpu Wang wrote:
> > > On Thu, Oct 13, 2022 at 10:18 AM Leon Romanovsky <leon@...nel.org> wrote:
> > > >
> > > > On Wed, Oct 12, 2022 at 01:55:55PM +0200, Jinpu Wang wrote:
> > > > > Hi Leon, hi Saeed,
> > > > >
> > > > > We have seen crashes during server shutdown on both kernel 5.10 and
> > > > > kernel 5.15 with GPF in mlx5 mlx5_cmd_comp_handler function.
> > > > >
> > > > > All of the crashes point to
> > > > >
> > > > > 1606                         memcpy(ent->out->first.data,
> > > > > ent->lay->out, sizeof(ent->lay->out));
> > > > >
> > > > > > I guess it's a kind of use-after-free of the ent buffer. I tried to
> > > > > > reproduce it by repeatedly rebooting the test servers, but no success so far.
> > > >
> > > > > My guess is that the command interface is not flushed, but Moshe and I
> > > > > didn't see how it can happen.
> > > >
> > > >   1206         INIT_DELAYED_WORK(&ent->cb_timeout_work, cb_timeout_handler);
> > > >   1207         INIT_WORK(&ent->work, cmd_work_handler);
> > > >   1208         if (page_queue) {
> > > >   1209                 cmd_work_handler(&ent->work);
> > > >   1210         } else if (!queue_work(cmd->wq, &ent->work)) {
> > > >                           ^^^^^^^ this is what is causing to the splat
> > > >   1211                 mlx5_core_warn(dev, "failed to queue work\n");
> > > >   1212                 err = -EALREADY;
> > > >   1213                 goto out_free;
> > > >   1214         }
> > > >
> > > > <...>
> > > > >
> > > > > Is this problem known, maybe already fixed?
> > > >
> > > > I don't see any missing Fixes that exist in 6.0 and don't exist in 5.5.32.
> >
> > Sorry it is 5.15.32
> >
> > > > Is it possible to reproduce this on latest upstream code?
> > > > I haven't been able to reproduce it. As mentioned above, I tried to
> > > > reproduce it by simply rebooting in a loop, no luck yet.
> > > > Do you have suggestions to speed up the reproduction?
> >
> > > Maybe try to shut down while the command interface is being filled.
> > > I think that any query command will do the trick.
> Just an update.
> I tried running "saquery" in a loop in one session and "modprobe -r
> mlx5_ib && modprobe mlx5_ib" in a loop in another session over the last
> few days, but still no luck.
> >
> > > Once I can reproduce, I can also try with kernel 6.0.
> >
> > It will be great.
> >
> > Thanks
> Thanks!
Just want to mention that we are seeing more crashes during reboot. All of
the crashes we have seen so far were on servers with Intel(R) Xeon(R) Gold
6338 CPUs. We use the same HCA on different servers, so I suspect the bug
is related to Ice Lake servers.

In case it matters, the lspci output is attached.

Thx!

View attachment "lspci.txt" of type "text/plain" (337649 bytes)
