Message-ID: <kord5qq6mywc7rbkzeoliz2cklrlljxm74qmrfwwjf6irx4fp7@6f5wsonafstt>
Date: Tue, 2 Apr 2024 16:22:31 -0500
From: Andrew Halaney <ahalaney@...hat.com>
To: linux-arm-msm@...r.kernel.org
Cc: robdclark@...il.com, will@...nel.org, iommu@...ts.linux.dev,
joro@...tes.org, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, quic_c_gdjako@...cinc.com, quic_cgoldswo@...cinc.com,
quic_sukadev@...cinc.com, quic_pdaly@...cinc.com, quic_sudaraja@...cinc.com
Subject: sa8775p-ride: What's a normal SMMU TLB sync time?
Hey,
Sorry for the wide email, but I figured someone who has recently been contributing
to or maintaining the Qualcomm SMMU driver might have some insight into this.
Recently I remembered that performance on some Qualcomm platforms
takes a major hit when you use iommu.strict=1/CONFIG_IOMMU_DEFAULT_DMA_STRICT.
On the sa8775p-ride, I see most TLB sync calls take about 150 us, with some
spiking up to 500 us or so:
[root@...-snapdragon-ride4-sa8775p-09 ~]# trace-cmd start -p function_graph -g qcom_smmu_tlb_sync --max-graph-depth 1
plugin 'function_graph'
[root@...-snapdragon-ride4-sa8775p-09 ~]# trace-cmd show
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
0) ! 144.062 us | qcom_smmu_tlb_sync();
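For anyone who wants a distribution rather than individual function_graph
samples, something like the below bpftrace snippet (a rough sketch, assuming
bpftrace is available on the target and qcom_smmu_tlb_sync is a probeable
symbol) should bucket the sync times into a histogram:
  # histogram of qcom_smmu_tlb_sync duration, in microseconds
  bpftrace -e '
  kprobe:qcom_smmu_tlb_sync { @start[tid] = nsecs; }
  kretprobe:qcom_smmu_tlb_sync /@start[tid]/ {
          @sync_us = hist((nsecs - @start[tid]) / 1000);
          delete(@start[tid]);
  }'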
On my sc8280xp-lenovo-thinkpad-x13s (the only other Qualcomm platform I can compare
with) I see around 2-15 us, with spikes up to 20-30 us. That's thanks to this
patch[0], which as I understand it improved the platform from 1-2 ms down to the
~10 us range.
It's not entirely clear to me how DPU-specific programming affects system-wide
SMMU performance, and I'm curious whether that is the only way to achieve this.
sa8775p doesn't even have the DPU described right now, which is a bummer since
that means there's no way to make a similar immediate optimization, but I'm still
struggling to understand what that patch really did to improve things, so maybe
I'm missing something.
I'm honestly not even sure what a "typical" range for TLB sync time would be,
but on sa8775p-ride it's bad enough that some IRQs, like UFS's, can cause RCU
stalls (pretty easy to reproduce on the platform with fio's basic-verify.fio, for
example). It also makes running with iommu.strict=1 impractical, as performance
for UFS, ethernet, etc. drops by 75-80%.
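In case it helps anyone reproduce, this is roughly what I'm running (a sketch;
basic-verify.fio is the job file shipped in fio's examples/ directory, pointed
at the UFS-backed storage, and the sysfs read is only there to confirm which
flushing mode each IOMMU group actually ended up with):
  # DMA = strict flushing, DMA-FQ = lazy/deferred flushing
  grep -H . /sys/kernel/iommu_groups/*/type
  # run fio's stock verify job against the UFS-backed device/filesystem
  fio basic-verify.fio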
Does anyone have any bright ideas on how to improve this, or am I even right
in assuming that time is suspiciously long?
Thanks,
Andrew
[0] https://lore.kernel.org/linux-arm-msm/CAF6AEGs9PLiCZdJ-g42-bE6f9yMR6cMyKRdWOY5m799vF9o4SQ@mail.gmail.com/