linux-kernel - RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges

Message-ID: <CH2PR19MB4024F88716768EC49BCA08CCA0879@CH2PR19MB4024.namprd19.prod.outlook.com>
Date:   Fri, 29 Oct 2021 10:52:37 +0000
From:   Li Chen <lchen@...arella.com>
To:     Keith Busch <kbusch@...nel.org>
CC:     Bjorn Helgaas <helgaas@...nel.org>,
        "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
        Lorenzo Pieralisi <lorenzo.pieralisi@....com>,
        Rob Herring <robh@...nel.org>, "kw@...ux.com" <kw@...ux.com>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Tom Joseph <tjoseph@...ence.com>, Jens Axboe <axboe@...com>,
        Christoph Hellwig <hch@....de>,
        Sagi Grimberg <sagi@...mberg.me>,
        "linux-nvme@...ts.infradead.org" <linux-nvme@...ts.infradead.org>
Subject: RE: [EXT] Re: nvme may get timeout from dd when using different
 non-prefetch mmio outbound/ranges

> -----Original Message-----
> From: Keith Busch [mailto:kbusch@...nel.org]
> Sent: Tuesday, October 26, 2021 12:16 PM
> To: Li Chen
> Cc: Bjorn Helgaas; linux-pci@...r.kernel.org; Lorenzo Pieralisi; Rob Herring;
> kw@...ux.com; Bjorn Helgaas; linux-kernel@...r.kernel.org; Tom Joseph; Jens
> Axboe; Christoph Hellwig; Sagi Grimberg; linux-nvme@...ts.infradead.org
> Subject: Re: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
> 
> On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> > My nvme is " 05:00.0 Non-Volatile memory controller: Samsung Electronics Co
> Ltd NVMe SSD Controller 980". From its datasheet,
> https://urldefense.com/v3/__https://s3.ap-northeast-
> 2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_R
> ev.1.1.pdf__;!!PeEy7nZLVv0!3MU3LdTWuzON9JMUkq29zwJM4d7g7wKtkiZszTu-
> PVepWchI_uLHpQGgdR_LEZM$ , it says nothing about CMB/SQEs, so I'm not sure.
> Is there other ways/tools(like nvme-cli) to query?
> 
> The driver will export a sysfs property for it if it is supported:
> 
>   # cat /sys/class/nvme/nvme0/cmb
> 
> If the file doesn't exist, then /dev/nvme0 doesn't have the capability.
> 
> > > > I don't know how to interpret "ranges".  Can you supply the dmesg and
> > > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > > >
> > > >   pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
> > > >   pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
> > > >   pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > > >   pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > > >
> > > > > Question:
> > > > > 1.  Why dd can cause nvme timeout? Is there more debug ways?
> > >
> > > That means the nvme controller didn't provide a response to a posted
> > > command within the driver's latency tolerance.
> >
> > FYI, with the help of pci bridger's vendor, they find something interesting:
> "From catc log, I saw some memory read pkts sent from SSD card, but its memory
> range is within the memory range of switch down port. So, switch down port will
> replay UR pkt. It seems not normal." and "Why SSD card send out some memory
> pkts which memory address is within switch down port's memory range. If so,
> switch will response UR pkts". I also don't understand how can this happen?
> 
> I think we can safely assume you're not attempting peer-to-peer, so that
> behavior as described shouldn't be happening. It sounds like the memory
> windows may be incorrect. The dmesg may help to show if something appears
> wrong.

Hi, Keith

Agreed, this does not involve peer-to-peer DMA. I confirmed with the switch vendor today: the two URs (Unsupported Requests) occur because the nvme is trying to DMA-read DRAM at bus addresses 80d5000 and 80d5100. Both bus addresses fall inside the switch down port's memory range, so the switch down port reports UR.

In our SoC, DMA/bus/PCI addresses map 1:1 to physical/AXI addresses, and DRAM occupies 000000.0000 - 0fffff.ffff (64G) of the physical address space. Bus addresses 80d5000 and 80d5100 therefore correspond to CPU addresses 80d5000 and 80d5100, both of which are inside the DRAM space.

Neither our bootloader nor our ROM code enumerates or configures PCIe devices and switches, so switch configuration is left to the kernel.

Coming back to the subject of this thread, "nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges", I found:

1. For <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>;
(which makes nvme time out)

Resource window of the switch (the bridge above the nvme):
Memory behind bridge: 08000000-080fffff [size=1M]

80d5000 and 80d5100 are both inside this range.

2. For <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>; 
(which does not make nvme time out)

Resource window of the switch (the bridge above the nvme):
Memory behind bridge: 00400000-004fffff [size=1M]

80d5000 and 80d5100 are not inside this range, so if nvme tries to read 80d5000 and 80d5100, UR won't happen.
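
To make the comparison above concrete, here is a rough Python sketch that decodes the two ranges entries and tests the two DMA addresses against each bridge window. The cell layout (3 PCI address cells, 2 parent address cells, 2 size cells) is an assumption based on the usual PCI devicetree binding, and decode_range() and in_window() are just illustrative helper names, not kernel functions.

```python
# Sketch only: assumes a 7-cell ranges entry laid out as
# <phys.hi phys.mid phys.lo  cpu.hi cpu.lo  size.hi size.lo>.

def decode_range(cells):
    """Decode one 7-cell PCI ranges entry into plain integers."""
    flags, pci_hi, pci_lo, cpu_hi, cpu_lo, size_hi, size_lo = cells
    return {
        "space": (flags >> 24) & 0x3,            # 2 means 32-bit memory space
        "prefetchable": bool(flags & 0x40000000),
        "pci_addr": (pci_hi << 32) | pci_lo,
        "cpu_addr": (cpu_hi << 32) | cpu_lo,
        "size": (size_hi << 32) | size_lo,
    }

def in_window(addr, start, end):
    """True if addr lies inside the inclusive window [start, end]."""
    return start <= addr <= end

# The two ranges entries from this mail.
range_timeout = decode_range(
    [0x02000000, 0x00, 0x08000000, 0x20, 0x08000000, 0x00, 0x04000000])
range_ok = decode_range(
    [0x02000000, 0x00, 0x00400000, 0x20, 0x00400000, 0x00, 0x08000000])

# "Memory behind bridge" windows reported by lspci in each case.
window_timeout = (0x08000000, 0x080FFFFF)
window_ok = (0x00400000, 0x004FFFFF)

# Bus addresses the switch vendor saw in the nvme memory reads.
dma_addrs = [0x80D5000, 0x80D5100]

hits_timeout = [in_window(a, *window_timeout) for a in dma_addrs]
hits_ok = [in_window(a, *window_ok) for a in dma_addrs]

print(hex(range_timeout["pci_addr"]), hits_timeout)
print(hex(range_ok["pci_addr"]), hits_ok)
```

With the first entry both addresses land inside the bridge window, and with the second entry neither does, which matches the UR behaviour described above.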

From /proc/iomem:
# cat /proc/iomem
01200000-ffffffff : System RAM
  01280000-022affff : Kernel code
  022b0000-0295ffff : reserved
  02960000-040cffff : Kernel data
  05280000-0528ffff : reserved
  41cc0000-422c0fff : reserved
  422c1000-4232afff : reserved
  4232d000-667bbfff : reserved
  667bc000-667bcfff : reserved
  667bd000-667c0fff : reserved
  667c1000-ffffffff : reserved
2000000000-2000000fff : cfg

Nothing uses the range below 0x01200000, so "Memory behind bridge: 00400000-004fffff [size=1M]" never overlaps System RAM (because 0x01200000 > 0x004fffff).
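
The same overlap check can be sketched in Python against the /proc/iomem text. This is only an illustration: the sample string below is a shortened copy of the output above, and system_ram_ranges() and window_hits_ram() are hypothetical helper names. On a real system the text could be read from /proc/iomem instead (as root, otherwise the addresses read as zero).

```python
# Sketch only: finds top-level "System RAM" entries in /proc/iomem-style text
# and tests whether a bridge memory window overlaps any of them.

IOMEM_SAMPLE = """\
01200000-ffffffff : System RAM
  01280000-022affff : Kernel code
  02960000-040cffff : Kernel data
2000000000-2000000fff : cfg
"""

def system_ram_ranges(text):
    """Return (start, end) tuples for top-level System RAM entries."""
    ranges = []
    for line in text.splitlines():
        if not line or line.startswith(" "):
            continue  # skip blank lines and nested (indented) entries
        span, _, name = line.partition(" : ")
        if name.strip() != "System RAM":
            continue
        start, end = (int(part, 16) for part in span.split("-"))
        ranges.append((start, end))
    return ranges

def window_hits_ram(window, ram_ranges):
    """True if the inclusive window overlaps any inclusive RAM range."""
    w_start, w_end = window
    return any(w_start <= r_end and r_start <= w_end
               for r_start, r_end in ram_ranges)

ram = system_ram_ranges(IOMEM_SAMPLE)

# Bridge windows from the two ranges configurations in this mail.
print(window_hits_ram((0x08000000, 0x080FFFFF), ram))  # window inside System RAM
print(window_hits_ram((0x00400000, 0x004FFFFF), ram))  # window below 0x01200000
```

A window that overlaps System RAM is one where an nvme DMA read aimed at DRAM can be claimed by the switch down port instead.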


The above answers the question in the subject. One question remains: what is the right way to resolve this problem? Should I use the ranges property to configure the switch memory window indirectly (which is what I did), or something else?

I don't think changing the ranges property is the right way. If my PCIe topology becomes more complex and has more endpoints or switches, I may have to reserve more MMIO through the ranges property (please correct me if I'm wrong), and the end of the switch's memory window may then exceed 0x01200000. To avoid getting UR again, I would have to reserve more physical address space for them (e.g. move the kernel start address from 0x01200000 to 0x02000000), which makes my visible DRAM smaller (I have verified this with "free -m"). That is not acceptable.


So, is there any better solution?

Regards,
Li 
