Date: Tue, 2 Apr 2024 16:22:31 -0500
From: Andrew Halaney <ahalaney@...hat.com>
To: linux-arm-msm@...r.kernel.org
Cc: robdclark@...il.com, will@...nel.org, iommu@...ts.linux.dev, 
	joro@...tes.org, linux-arm-kernel@...ts.infradead.org, 
	linux-kernel@...r.kernel.org, quic_c_gdjako@...cinc.com, quic_cgoldswo@...cinc.com, 
	quic_sukadev@...cinc.com, quic_pdaly@...cinc.com, quic_sudaraja@...cinc.com
Subject: sa8775p-ride: What's a normal SMMU TLB sync time?

Hey,

Sorry for the wide email, but I figured someone recently contributing
to / maintaining the Qualcomm SMMU driver may have some proper insights
into this.

Recently I remembered that performance on some Qualcomm platforms
takes a major hit when you use iommu.strict=1/CONFIG_IOMMU_DEFAULT_DMA_STRICT.

On the sa8775p-ride, most TLB sync calls take about 150 us, with some
spiking to 500 us:

    [root@...-snapdragon-ride4-sa8775p-09 ~]# trace-cmd start -p function_graph -g qcom_smmu_tlb_sync --max-graph-depth 1
      plugin 'function_graph'
    [root@...-snapdragon-ride4-sa8775p-09 ~]# trace-cmd show
    # tracer: function_graph
    #
    # CPU  DURATION                  FUNCTION CALLS
    # |     |   |                     |   |   |   |
     0) ! 144.062 us  |  qcom_smmu_tlb_sync();
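
To get numbers from a longer capture rather than eyeballing it, the durations can be pulled out of the function_graph output and summarized. The following is a minimal sketch, not part of the original measurements: it assumes the text from `trace-cmd show` has been captured into a string, and the helper names `parse_durations` and `summarize` are made up for illustration. It only handles lines in the leaf-call form shown above (CPU, optional overhead marker, duration in us, function name).

```python
import re
import statistics

# Matches function_graph leaf-call lines such as:
#    0) ! 144.062 us  |  qcom_smmu_tlb_sync();
# The optional character before the duration is the tracer's overhead marker.
LINE_RE = re.compile(r"^\s*\d+\)\s*[+!#*@$]?\s*([\d.]+) us\s*\|\s*(\w+)\(\);")


def parse_durations(text, func="qcom_smmu_tlb_sync"):
    """Return the durations (in microseconds) recorded for `func`."""
    durations = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(2) == func:
            durations.append(float(match.group(1)))
    return durations


def summarize(durations):
    """Return count, min, median and max of a non-empty list of durations."""
    return {
        "count": len(durations),
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # The single sample line from the trace above; a real capture would
    # contain many such lines.
    sample = " 0) ! 144.062 us  |  qcom_smmu_tlb_sync();\n"
    print(summarize(parse_durations(sample)))
```

The median is used rather than the mean so that a handful of 500 us spikes does not hide what a typical call looks like.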

On my sc8280xp-lenovo-thinkpad-x13s (only other Qualcomm platform I can compare
with) I see around 2-15 us with spikes up to 20-30 us. That's thanks to this
patch[0], which I guess improved the platform from 1-2 ms to the ~10 us number.

It's not entirely clear to me how DPU-specific programming affects system-wide
SMMU performance, but I'm curious whether this is the only way to achieve this.
The sa8775p doesn't even have the DPU described right now, which is a bummer,
as there's no way to make a similar immediate optimization. I'm still struggling
to understand what that patch really did to improve things, though, so maybe
I'm missing something.

I'm honestly not even sure what a "typical" range for TLB sync time would be,
but on sa8775p-ride it's bad enough that some IRQs like UFS can cause RCU stalls
(pretty easy to reproduce on the platform with fio basic-verify.fio, for example).
It also makes running with iommu.strict=1 impractical, as performance for UFS,
ethernet, etc. drops 75-80%.

Does anyone have any bright ideas on how to improve this, or on whether I'm
even right to assume that time is suspiciously long?

Thanks,
Andrew

[0] https://lore.kernel.org/linux-arm-msm/CAF6AEGs9PLiCZdJ-g42-bE6f9yMR6cMyKRdWOY5m799vF9o4SQ@mail.gmail.com/

