Message-ID: <20260114172832.GA822909@bhelgaas>
Date: Wed, 14 Jan 2026 11:28:32 -0600
From: Bjorn Helgaas <helgaas@...nel.org>
To: Johnny-CC Chang (張晋嘉) <Johnny-CC.Chang@...iatek.com>
Cc: "lukas@...ner.de" <lukas@...ner.de>,
Project_Global_Digits_Upstream_Group <Project_Global_Digits_Upstream_Group@...iatek.com>,
AngeloGioacchino Del Regno <angelogioacchino.delregno@...labora.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-arm-kernel@...ts.infradead.org" <linux-arm-kernel@...ts.infradead.org>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
"linux-mediatek@...ts.infradead.org" <linux-mediatek@...ts.infradead.org>,
"bhelgaas@...gle.com" <bhelgaas@...gle.com>,
"matthias.bgg@...il.com" <matthias.bgg@...il.com>,
Jason Gunthorpe <jgg@...dia.com>,
Alex Williamson <alex@...zbot.org>
Subject: Re: [PATCH] PCI: Mark Nvidia GB10 to avoid bus reset
[+cc Jason, Alex for Nvidia input]
On Wed, Jan 14, 2026 at 06:39:24AM +0000, Johnny-CC Chang (張晋嘉) wrote:
> On Tue, 2025-11-18 at 17:39 +0800, Johnny-CC Chang wrote:
> > On Thu, 2025-11-13 at 10:39 +0100, Lukas Wunner wrote:
> > > On Thu, Nov 13, 2025 at 04:44:06PM +0800, Johnny Chang wrote:
> > > > Nvidia GB10 PCIe hosts will encounter problem occasionally
> > > > after SBR(secondary bus reset) is applied.
> > >
> > > Could you elaborate what kinds of problems occur, how often they
> > > occur, etc?
> >
> > There is about 1/1000 chance that after SBR is applied, any further
> > access via this root port will be blocked and make system crash.
What sort of crash happens?  It's useful if we can include a
breadcrumb that will help people identify the crash and find a fix.

What I would expect is some kind of PCIe error like a config read
timeout or an Unsupported Request error.  But usually those just
result in ~0 data back to the CPU, which usually doesn't directly
cause a crash.
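
To illustrate the "~0 data" point: a minimal userspace sketch, not
kernel code, of how a caller might tell a failed config read from real
data.  The helper names (cfg_read_looks_failed, decode_ids) are
hypothetical; the kernel's PCI_POSSIBLE_ERROR() macro is similar in
spirit.  The only assumption is that a failed 32-bit read is
synthesized as all ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All-ones is what the CPU typically sees when a config read fails. */
#define CFG_READ_ERROR_DATA 0xFFFFFFFFu

/* Hypothetical helper: true if a 32-bit read looks like an error response. */
static bool cfg_read_looks_failed(uint32_t val)
{
	return val == CFG_READ_ERROR_DATA;
}

/*
 * Hypothetical helper: split config dword 0 into Vendor ID (low 16 bits)
 * and Device ID (high 16 bits).  Returns false and leaves the outputs
 * untouched if the dword looks like an error response.
 */
static bool decode_ids(uint32_t dword0, uint16_t *vendor, uint16_t *device)
{
	assert(vendor && device);

	if (cfg_read_looks_failed(dword0))
		return false;

	*vendor = (uint16_t)(dword0 & 0xFFFFu);
	*device = (uint16_t)(dword0 >> 16);
	return true;
}
```

The caller gets an ordinary "no device" style failure rather than a
fault, which is why an error like this usually doesn't crash the
machine by itself.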
> I would like to update below description to replace original comment in
> v1 patch, is this information sufficient?
> --------
> /*
> * After SBR(secondary bus reset) is applied on an Nvidia GB10
> * PCIe root port, there is 1/1000 chance that further requests
> * via this root port will be blocked and cause system unstable.
I'm confused about what the topology is. I first assumed GB10 was a
PCIe Endpoint, since Secondary Bus Reset only applies to devices below
a bridge, so SBR would be applied to a device by a config write to
that bridge.
But you mention a GB10 Root Port here, which obviously is not an
Endpoint, so there's no bridge upstream from the GB10 that could
initiate SBR to the GB10.
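
For reference, the SBR mechanism described above can be sketched as
follows.  This is a userspace simulation, not the kernel's
implementation: struct fake_bridge and the cfg_read16/cfg_write16
helpers are hypothetical stand-ins for a bridge's config space.  The
register offset (0x3E) and bit (0x0040) are the Bridge Control
register and its Secondary Bus Reset bit from the type 1 header.

```c
#include <assert.h>
#include <stdint.h>

#define BRIDGE_CONTROL        0x3E    /* Bridge Control register offset */
#define BRIDGE_CTL_BUS_RESET  0x0040  /* Secondary Bus Reset bit */

/* Simulated bridge: 256 bytes of config space plus a reset counter. */
struct fake_bridge {
	uint8_t cfg[256];
	unsigned int resets;	/* times the SBR bit went 0 -> 1 */
};

static uint16_t cfg_read16(const struct fake_bridge *b, unsigned int off)
{
	return (uint16_t)(b->cfg[off] | (b->cfg[off + 1] << 8));
}

static void cfg_write16(struct fake_bridge *b, unsigned int off, uint16_t val)
{
	/* Model the hardware side effect: a rising SBR bit resets the bus. */
	if (off == BRIDGE_CONTROL) {
		uint16_t old = cfg_read16(b, off);

		if (!(old & BRIDGE_CTL_BUS_RESET) && (val & BRIDGE_CTL_BUS_RESET))
			b->resets++;
	}
	b->cfg[off] = (uint8_t)(val & 0xFF);
	b->cfg[off + 1] = (uint8_t)(val >> 8);
}

/* Assert, then deassert, Secondary Bus Reset on the bridge. */
static void secondary_bus_reset(struct fake_bridge *b)
{
	uint16_t ctrl;

	assert(b);
	ctrl = cfg_read16(b, BRIDGE_CONTROL);
	cfg_write16(b, BRIDGE_CONTROL, (uint16_t)(ctrl | BRIDGE_CTL_BUS_RESET));
	/* Real code must hold reset for a minimum time here ... */
	cfg_write16(b, BRIDGE_CONTROL, (uint16_t)(ctrl & ~BRIDGE_CTL_BUS_RESET));
	/* ... and wait for devices below the bridge to become ready again. */
}
```

Note that the write always targets the bridge (here, the Root Port)
and the reset lands on whatever is on its secondary side, which is why
the topology matters for deciding which device should carry the quirk.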
If this is actually a GB10 issue, it sounds like a hardware erratum
that lots of users would see and Nvidia would likely be aware of.
Bjorn