Message-ID: <ca5a0f5f-91fb-4c11-f158-44e16343cdb2@huawei.com>
Date: Fri, 13 Nov 2020 10:43:06 +0000
From: John Garry <john.garry@...wei.com>
To: Will Deacon <will@...nel.org>
CC: <robin.murphy@....com>, <joro@...tes.org>,
<linux-arm-kernel@...ts.infradead.org>,
<iommu@...ts.linux-foundation.org>, <maz@...nel.org>,
<linuxarm@...wei.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
On 21/09/2020 14:58, John Garry wrote:
> On 21/09/2020 14:43, Will Deacon wrote:
>> On Fri, Aug 21, 2020 at 09:54:20PM +0800, John Garry wrote:
>>> As mentioned in [0], the CPU may consume many cycles processing
>>> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
>>> get space on the queue takes a lot of time once we start getting many
>>> CPUs contending - from experiment, for 64 CPUs contending the cmdq,
>>> success rate is ~ 1 in 12, which is poor, but not totally awful.
>>>
>>> This series removes that cmpxchg() and replaces with an atomic_add,
>>> same as how the actual cmdq deals with maintaining the prod pointer.
>>
>> I'm still not a fan of this.
>
> :(
>
>> Could you try to adapt the hacks I sent before,
>> please? I know they weren't quite right (I have no hardware to test on), but
>> the basic idea is to fall back to a spinlock if the cmpxchg() fails. The
>> queueing in the spinlock implementation should avoid the contention.
>
> OK, so if you're asking me to try this again, then I can do that, and
> see what it gives us.
>
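For readers following the thread, here is a minimal user-space sketch of the three space-allocation strategies discussed above: the cmpxchg() loop, an unconditional atomic add, and a cmpxchg() with a lock fallback. It is an illustration only, not the driver code. The names (toy_queue, alloc_cmpxchg, alloc_fetch_add, alloc_cmpxchg_lock_fallback) are hypothetical, it uses C11 atomics rather than the kernel's primitives, it ignores queue-full checks and index wrapping, and the fallback uses a simple test-and-set lock where the kernel would use a queued spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a command queue's producer index. */
struct toy_queue {
	_Atomic uint32_t prod;	/* next free slot; wrap and queue-full are ignored */
	atomic_flag lock;	/* toy test-and-set lock for the fallback path */
};

/*
 * Strategy 1: cmpxchg loop. Every failed compare-exchange means another
 * CPU moved prod first (or the weak variant failed spuriously), so we retry.
 * Returns the first slot of the reserved range.
 */
static uint32_t alloc_cmpxchg(struct toy_queue *q, uint32_t n, unsigned *retries)
{
	assert(n > 0);
	uint32_t old = atomic_load_explicit(&q->prod, memory_order_relaxed);

	while (!atomic_compare_exchange_weak_explicit(&q->prod, &old, old + n,
						      memory_order_relaxed,
						      memory_order_relaxed)) {
		if (retries)
			(*retries)++;	/* 'old' now holds the fresh value */
	}
	return old;
}

/*
 * Strategy 2: unconditional atomic add. There is no retry loop, so every
 * caller makes progress in one atomic operation.
 */
static uint32_t alloc_fetch_add(struct toy_queue *q, uint32_t n)
{
	assert(n > 0);
	return atomic_fetch_add_explicit(&q->prod, n, memory_order_relaxed);
}

/*
 * Strategy 3: try cmpxchg once; if it fails, serialise the retries behind
 * a lock so that fewer CPUs hammer prod at the same time.
 */
static uint32_t alloc_cmpxchg_lock_fallback(struct toy_queue *q, uint32_t n)
{
	assert(n > 0);
	uint32_t old = atomic_load_explicit(&q->prod, memory_order_relaxed);

	if (atomic_compare_exchange_strong_explicit(&q->prod, &old, old + n,
						    memory_order_relaxed,
						    memory_order_relaxed))
		return old;	/* fast path: no contention seen */

	/* Slow path: spin until we own the lock, then retry the cmpxchg. */
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;
	while (!atomic_compare_exchange_weak_explicit(&q->prod, &old, old + n,
						      memory_order_relaxed,
						      memory_order_relaxed))
		;
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
	return old;
}
```

In the cmpxchg variants a contended caller can fail an unbounded number of times, which is the "1 in 12" success rate mentioned above; the atomic add always succeeds, at the cost of having to handle a full queue after the space has already been claimed.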
JFYI, to prove that this is not a problem which affects only our HW, I
managed to test an arm64 platform from another vendor. Generally I see
the same issue, and this patchset actually helps that platform even more.
            CPUs   Before   After   % Increase
Huawei D06  8      282K     302K    7%
Other              379K     420K    11%
Huawei D06  16     115K     193K    68%
Other              102K     291K    185%
Huawei D06  32     36K      80K     122%
Other              41K      156K    280%
Huawei D06  64     11K      30K     172%
Other              6K       47K     683%
I tested with something like [1], so unit is map+unmaps per cpu per
second - higher is better.
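The "% Increase" column is (after - before) / before. A small sketch of that arithmetic, using a hypothetical helper pct_increase() and the rounded values (in thousands) from the table; because the inputs are already rounded, a result can differ from the table by a point:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Percentage increase from 'before' to 'after', rounded to the nearest
 * integer. Assumes after >= before and before > 0.
 */
static uint32_t pct_increase(uint32_t before, uint32_t after)
{
	assert(before > 0 && after >= before);
	return (100u * (after - before) + before / 2) / before;
}
```

For example, pct_increase(115, 193) gives 68 and pct_increase(102, 291) gives 185, matching the 16-CPU rows.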
My D06 is memory poor, so would expect higher results otherwise (with
more memory). Indeed, my D05 has memory on all nodes and performs better.
Anyway, I see that the implementation here is not perfect, and I could
not get the suggested approach to improve performance significantly. So
back to the drawing board...
Thanks,
John
[1]
https://lore.kernel.org/linux-iommu/20201102080646.2180-1-song.bao.hua@hisilicon.com/