Message-ID: <CAMGffEnWmVb_qZFq6_rhZGH5q1Wq=n5ciJmp6uxxE6JLctywng@mail.gmail.com>
Date: Mon, 17 Oct 2022 07:54:46 +0200
From: Jinpu Wang <jinpu.wang@...os.com>
To: Leon Romanovsky <leon@...nel.org>
Cc: netdev <netdev@...r.kernel.org>,
RDMA mailing list <linux-rdma@...r.kernel.org>,
Moshe Shemesh <moshe@...dia.com>,
Saeed Mahameed <saeedm@...dia.com>,
Tariq Toukan <tariqt@...dia.com>,
Maor Gottlieb <maorg@...dia.com>, Shay Drory <shayd@...dia.com>
Subject: Re: [BUG] mlx5_core general protection fault in mlx5_cmd_comp_handler
On Thu, Oct 13, 2022 at 12:27 PM Leon Romanovsky <leon@...nel.org> wrote:
>
> On Thu, Oct 13, 2022 at 10:32:55AM +0200, Jinpu Wang wrote:
> > On Thu, Oct 13, 2022 at 10:18 AM Leon Romanovsky <leon@...nel.org> wrote:
> > >
> > > On Wed, Oct 12, 2022 at 01:55:55PM +0200, Jinpu Wang wrote:
> > > > Hi Leon, hi Saeed,
> > > >
> > > > We have seen crashes during server shutdown on both kernel 5.10 and
> > > > kernel 5.15 with GPF in mlx5 mlx5_cmd_comp_handler function.
> > > >
> > > > All of the crashes point to
> > > >
> > > > 1606 memcpy(ent->out->first.data,
> > > > ent->lay->out, sizeof(ent->lay->out));
> > > >
> > > > I guess it's some kind of use-after-free of the ent buffer. I tried to
> > > > reproduce it by repeatedly rebooting the test servers, but no success so far.
> > >
> > > My guess is that command interface is not flushed, but Moshe and me
> > > didn't see how it can happen.
> > >
> > > 1206 INIT_DELAYED_WORK(&ent->cb_timeout_work, cb_timeout_handler);
> > > 1207 INIT_WORK(&ent->work, cmd_work_handler);
> > > 1208 if (page_queue) {
> > > 1209 cmd_work_handler(&ent->work);
> > > 1210 } else if (!queue_work(cmd->wq, &ent->work)) {
> > > ^^^^^^^ this is what is causing the splat
> > > 1211 mlx5_core_warn(dev, "failed to queue work\n");
> > > 1212 err = -EALREADY;
> > > 1213 goto out_free;
> > > 1214 }
> > >
> > > <...>
> > > >
> > > > Is this problem known, maybe already fixed?
> > >
> > > I don't see any missing Fixes that exist in 6.0 and don't exist in 5.5.32.
>
> Sorry it is 5.15.32
>
> > > Is it possible to reproduce this on latest upstream code?
> > I haven't been able to reproduce it. As mentioned above, I tried to
> > reproduce it by simply rebooting in a loop, but no luck yet.
> > Do you have suggestions to speed up the reproduction?
>
> Maybe try to shut down while the command interface is being filled.
> I think that any query command will do the trick.
Just an update.
I ran "saquery" in a loop in one session and "modprobe -r
mlx5_ib && modprobe mlx5_ib" in a loop in another session over the last
few days, but still no luck.
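
The two-session test described above could be scripted roughly as follows.
This is only a sketch, not the exact commands that were run: the function
names and the QUERY_CMD, RELOAD_CMD and ITERATIONS variables are made up
for illustration, and the default of 100 iterations is arbitrary. It needs
root and should only be run on a disposable test machine.

```shell
#!/bin/sh
# Hypothetical reproduction sketch. The variable and function names below
# are illustrative assumptions, not part of the original report.
QUERY_CMD=${QUERY_CMD:-saquery}
RELOAD_CMD=${RELOAD_CMD:-"modprobe -r mlx5_ib && modprobe mlx5_ib"}
ITERATIONS=${ITERATIONS:-100}

# "Session 1": keep issuing query commands until killed.
query_loop() {
    while :; do
        sh -c "$QUERY_CMD" >/dev/null 2>&1
    done
}

# "Session 2": reload the module ITERATIONS times, stop on first failure.
reload_loop() {
    i=0
    while [ "$i" -lt "$ITERATIONS" ]; do
        sh -c "$RELOAD_CMD" || return 1
        i=$((i + 1))
    done
}

# Run both loops concurrently, then stop the query loop.
run_repro() {
    query_loop &
    query_pid=$!
    status=0
    reload_loop || status=$?
    kill "$query_pid" 2>/dev/null || true
    wait "$query_pid" 2>/dev/null || true
    return "$status"
}
```

Calling run_repro starts the query loop in the background and the reload
loop in the foreground; a non-zero return means a reload step failed.
Watch dmesg on a second console for the GPF while it runs.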
>
> > Once I can reproduce, I can also try with kernel 6.0.
>
> It will be great.
>
> Thanks
Thanks!