Message-ID: <CA+sbYW3VdewdCrU+PtvAksXXyi=zgGm6Yk=BHNNfbp1DDjRKcQ@mail.gmail.com>
Date: Mon, 24 Feb 2025 14:30:04 +0530
From: Selvin Xavier <selvin.xavier@...adcom.com>
To: Leon Romanovsky <leon@...nel.org>
Cc: jgg@...pe.ca, linux-rdma@...r.kernel.org, andrew.gospodarek@...adcom.com,
kalesh-anakkur.purayil@...adcom.com, netdev@...r.kernel.org,
davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org, abeni@...hat.com,
horms@...nel.org, michael.chan@...adcom.com
Subject: Re: [PATCH rdma-next 0/9] RDMA/bnxt_re: Driver Debug Enhancements
On Sun, Feb 23, 2025 at 7:05 PM Leon Romanovsky <leon@...nel.org> wrote:
>
> On Thu, Feb 20, 2025 at 10:34:47AM -0800, Selvin Xavier wrote:
> > For debugging issues in the field, we need to track some of
> > the resources destroyed in the past. This is primarily required
> > for tracking certain QPs that encountered errors, leading to
> > application exits. A framework has been implemented to
> > save this information and retrieve it during coredump collection.
> >
> > The Broadcom bnxt L2 driver supports collecting driver dumps
> > using the ethtool -w option. This feature now also supports
> > collecting coredump information from the bnxt_re auxiliary driver.
> > Two new callbacks have been implemented to exchange dump
> > information supported by the auxbus bnxt_re driver.
> >
> > The bnxt_re driver caches certain hardware information before
> > resources are destroyed in the HW.
>
> Unfortunately, no. The idea that you will cache kernel objects and they
> live beyond their HW counterpart doesn't fit RDMA object model.
The number of resources usually runs into the thousands, so we cannot
dump the debug information to the system logs. As a result we have very
little context about a failure, which is the reason for this new mechanism.
>
> I'm aware that you are not keeping objects itself, but their shadow
> copy. So if you want, your FW can store these failed objects and you
> will retrieve them through existing netdev side (ethtool -w ...).
FW doesn't have enough memory to back up this info. It needs to be
backed up in host memory, and the FW has to write it there when an
error happens. This is possible in some newer FW versions.
But it is not just the HW context that we are caching here. We also need
to back up some host-side driver/lib info to correlate with the HW context.
We have been debugging issues like this using our out-of-box driver, and
we find it useful to get the context of the failure. Some of our internal
tools can decode this information, and we want the same behavior between
the inbox and out-of-box drivers.
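
To make the idea concrete, here is a minimal userspace sketch of a bounded
shadow cache, not the actual bnxt_re code. Every name in it (qp_shadow,
shadow_ring, SHADOW_SLOTS and the individual fields) is hypothetical and
chosen only for illustration. It assumes a fixed-size ring that keeps the
most recently destroyed QPs, with each entry pairing a bit of HW context
with host-side queue indices so the two can be correlated at dump time.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SHADOW_SLOTS 4 /* hypothetical capacity, kept tiny for illustration */

/* Hypothetical snapshot taken just before a QP is destroyed in HW. */
struct qp_shadow {
	uint32_t qp_id;    /* QP number at destroy time */
	uint32_t hw_state; /* last HW state read before destroy */
	uint32_t sq_prod;  /* host-side send queue producer index */
	uint32_t sq_cons;  /* host-side send queue consumer index */
};

/* Fixed-size ring: memory use is bounded no matter how many QPs die. */
struct shadow_ring {
	struct qp_shadow slot[SHADOW_SLOTS];
	uint32_t head;  /* next slot to overwrite */
	uint32_t count; /* number of valid entries, capped at SHADOW_SLOTS */
};

static void shadow_ring_init(struct shadow_ring *r)
{
	memset(r, 0, sizeof(*r));
}

/* Save one snapshot, overwriting the oldest entry once the ring is full. */
static void shadow_ring_save(struct shadow_ring *r, const struct qp_shadow *s)
{
	assert(r && s);
	r->slot[r->head] = *s;
	r->head = (r->head + 1) % SHADOW_SLOTS;
	if (r->count < SHADOW_SLOTS)
		r->count++;
}

/* Copy up to max entries into out, oldest first; returns the number copied. */
static uint32_t shadow_ring_dump(const struct shadow_ring *r,
				 struct qp_shadow *out, uint32_t max)
{
	uint32_t n = r->count < max ? r->count : max;
	uint32_t start = (r->head + SHADOW_SLOTS - r->count) % SHADOW_SLOTS;
	uint32_t i;

	assert(r && out);
	for (i = 0; i < n; i++)
		out[i] = r->slot[(start + i) % SHADOW_SLOTS];
	return n;
}
```

In this sketch the destroy path would call shadow_ring_save() and the dump
path would call shadow_ring_dump(). Only the last SHADOW_SLOTS snapshots
survive, which is the trade-off against logging thousands of objects.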
>
> Thanks