Message-ID: <22e2728f-d51a-bf92-4791-a7df5c4a2c15@arm.com>
Date: Tue, 25 Apr 2023 16:49:49 +0100
From: Robin Murphy <robin.murphy@....com>
To: Geoff Blake <blakgeof@...zon.com>
Cc: will@...nel.org, mark.rutland@....com,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
ilkka@...amperecomputing.com
Subject: Re: [PATCH] perf/arm-cmn: Fix DTC reset

On 14/04/2023 3:19 pm, Robin Murphy wrote:
> On 2023-04-06 22:25, Geoff Blake wrote:
>> Ran this patch on an AWS C6g.metal and unfortunately still see the
>> spurious IRQs trigger quickly (within 10 tries) when using the following
>> flow:
>>
>> perf stat -a -e arm_cmn_0/event=0x5,type=0x5/ -- sleep 600
>> kexec -e
>>
>> Adding in the simple shutdown routine, I have run over 100 of the
>> above cycles and the spurious IRQs haven't triggered. I think we still
>> need both for now.
>
> There is no "need both" - if this patch doesn't work to reset the PMU as
> intended then we still need a better patch that does. After yet more
> trying, I still cannot reproduce your results, but I do suspect this
> patch isn't as good as it initially seemed.
>
> I got my hands on a C6g.metal instance, and I'm building the mainline
> version of arm-cmn.c from my cmn-dev branch (including the two other
> pending fixes that I've sent recently) against the 5.15.0-1031-aws
> kernel that it came with, as a standalone module with a trivial
> makefile. Even running "stress -m 60" in the background, as the most
> effective thing I've found so far, that hnf_pocq_reqs_recvd event takes
> well over 8 minutes to overflow, so I have failed to achieve the
> necessary timing to kexec at just the right point for the residual
> interconnect traffic to add up and overflow the event during the handful
> of seconds that the kexec takes. For completeness, I have managed to run
> the perf stat/kexec, then run stress for 10 minutes under the new
> kernel, *then* finally load the module to achieve the right conditions,
> but that's so utterly contrived and long-winded that I don't really have
> the patience to do it more than the twice that I already did.
>
> What I can do instantly and reliably is reproduce equivalent conditions
> with my (now even more stripped-down) remove hack[1] and a simple
> rmmod/insmod (with a few seconds in between for good measure), leading
> to demonstrable latent overflows on all 4 DTCs every time. The existing
> code does seem to manage to reset DTC0 such that its interrupt (IRQ 27)
> does not fire, consistent with what I've observed on other machines,
> while I see the secondary DTCs (IRQs 28, 29 and 30) each fire 100000
> times spuriously and get disabled. With this patch on top[2], that
> consistently does not happen over 100 unload/reload cycles.
>
> Given that you say the same write to clear DTC_CTL does seem to work
> when issued a few seconds earlier, in the form of the shutdown hook, I
> have still been wary of some kind of weird timing issue all along, but
> the fact that I was getting such consistent behaviour even on C6g
> seemed to be pointing away from that :/
>
> The closest I've got so far is by leaving this even more involved test
> loop (with real PMU programming in between) running overnight:
>
> for i in {1..10000}; do sudo insmod arm-cmn.ko && sudo perf stat -e
> arm_cmn_0/eventid=5,type=5/ sleep 1 && sudo rmmod arm-cmn && sleep 4; done
>
> and now coming back to find /proc/interrupts saying this:
>
> 27: 1 0 0...
> 28: 1 0 0...
> 29: 2 0 0...
> 30: 1 0 0...
>
> I've quite often seen a single IRQ firing earlier than expected (not
> necessarily spuriously), so I still need to check what's up with that -
> it may be that writing to the counters doesn't always take either.
> However, the single extra incidence of IRQ 29 which has happened at some
> point after I went home is more of a smoking gun:
>
> [84581.790043] WARNING: CPU: 0 PID: 0 at /home/ubuntu/arm-cmn.c:1828
> arm_cmn_handle_irq+0x148/0x1cc [arm_cmn]
>
> So something still snuck through reset, but it *was* at least visible
> and clearable by the time the IRQ was enabled. Interestingly the other
> warning for !dtc->cycles did not fire at the same time, even though the
> hack normally overflows PMCCNTR before PMEVCNTR(0). I'll keep digging...

I realise I neglected to follow up on this - where I got to was adding
an extra read back of CMN_DTC_CTL after the write, tweaking my remove
hack to generate overflows even more reliably for good measure, and that
then ran for ~56,000 test cycles (until my time on the instance ran out)
without any stray interrupts at all. However, I wasn't keen on posting
that as a v2 without any better justification than it being an "add
random things to change the timing a bit" bodge. Since we're in the
merge window now, I'll see if I can get a better answer by -rc1.
Thanks,
Robin.