Message-ID: <YPqYDY9/VAhfHNfU@T590>
Date: Fri, 23 Jul 2021 18:21:01 +0800
From: Ming Lei <ming.lei@...hat.com>
To: Robin Murphy <robin.murphy@....com>
Cc: John Garry <john.garry@...wei.com>, linux-kernel@...r.kernel.org,
linux-nvme@...ts.infradead.org, iommu@...ts.linux-foundation.org,
Will Deacon <will@...nel.org>,
linux-arm-kernel@...ts.infradead.org
Subject: Re: [bug report] iommu_dma_unmap_sg() is very slow when running IO
from remote numa node
On Thu, Jul 22, 2021 at 06:40:18PM +0100, Robin Murphy wrote:
> On 2021-07-22 16:54, Ming Lei wrote:
> [...]
> > > If you are still keen to investigate more, then you can try either of these:
> > >
> > > - add iommu.strict=0 to the cmdline
> > >
> > > - use perf record+annotate to find the hotspot
> > >   - For this you need to enable pseudo-NMI with two steps:
> > >     CONFIG_ARM64_PSEUDO_NMI=y in defconfig
> > >     Add irqchip.gicv3_pseudo_nmi=1 to the kernel cmdline
> > >
> > > See https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/Kconfig#n1745
> > > Your kernel log should show:
> > > [ 0.000000] GICv3: Pseudo-NMIs enabled using forced ICC_PMR_EL1 synchronisation
> >
> > OK, will try the above tomorrow.
>
> Thanks, I was also going to suggest the latter, since what
> arm_smmu_cmdq_issue_cmdlist() does with IRQs masked should be most
> indicative of where the slowness stems from.
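Before I get to that, here is roughly what I plan to run once the kernel is
rebuilt with CONFIG_ARM64_PSEUDO_NMI=y and booted with
irqchip.gicv3_pseudo_nmi=1; the sampled CPU and the symbol to annotate are
only assumptions based on the remote-node test below, I haven't run this yet:

    # profile CPU 80 system-wide, i.e. the CPU the remote-node fio job is pinned to
    perf record -a -C 80 -g -- sleep 10

    # locate the hotspot, then annotate the suspect function
    perf report --stdio
    perf annotate --stdio arm_smmu_cmdq_issue_cmdlist
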
The improvement from 'iommu.strict=0' is very small:
[root@...ere-mtjade-04 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd2,gpt2)/vmlinuz-5.14.0-rc2_linus root=UUID=cff79b49-6661-4347-b366-eb48273fe0c1 ro nvme.poll_queues=2 iommu.strict=0
[root@...ere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=1530MiB/s][r=392k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=2999: Fri Jul 23 06:05:15 2021
read: IOPS=392k, BW=1530MiB/s (1604MB/s)(14.9GiB/10001msec)
[root@...ere-mtjade-04 ~]# taskset -c 80 ~/git/tools/test/nvme/io_uring 20 1 /dev/nvme1n1 4k
+ fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=150MiB/s][r=38.4k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3063: Fri Jul 23 06:05:49 2021
read: IOPS=38.4k, BW=150MiB/s (157MB/s)(3000MiB/20002msec)
>
> FWIW I would expect iommu.strict=0 to give a proportional reduction in SMMU
> overhead for both cases since it should effectively mean only 1/256 as many
> invalidations are issued.
>
> Could you also check whether the SMMU platform devices have "numa_node"
> properties exposed in sysfs (and if so whether the values look right), and
> share all the SMMU output from the boot log?
I did not find a numa_node attribute for the SMMU platform device, and the
whole dmesg log is attached.
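The check I did was along these lines; the sysfs glob is only a guess at how
the SMMU platform devices are named on this box, so adjust as needed:

    # illustrative only - the platform device names are an assumption
    for d in /sys/devices/platform/*smmu*; do
            [ -e "$d" ] || continue
            if [ -f "$d/numa_node" ]; then
                    echo "$d: numa_node=$(cat "$d/numa_node")"
            else
                    echo "$d: no numa_node attribute"
            fi
    done
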
Thanks,
Ming
Download attachment "arm64.log.tar.gz" of type "application/gzip" (34200 bytes)