[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <afa66fb5cb2f28765b03d26dba9bb217@codeaurora.org>
Date: Fri, 24 Jan 2020 20:22:10 +0530
From: Sai Prakash Ranjan <saiprakash.ranjan@...eaurora.org>
To: James Morse <james.morse@....com>
Cc: Borislav Petkov <bp@...en8.de>, Andy Gross <agross@...nel.org>,
Bjorn Andersson <bjorn.andersson@...aro.org>,
Mark Rutland <mark.rutland@....com>,
Rob Herring <robh+dt@...nel.org>, devicetree@...r.kernel.org,
Mauro Carvalho Chehab <mchehab@...nel.org>,
Tony Luck <tony.luck@...el.com>,
Robert Richter <rrichter@...vell.com>,
linux-edac@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, linux-arm-msm@...r.kernel.org,
Stephen Boyd <swboyd@...omium.org>,
Evan Green <evgreen@...omium.org>, tsoni@...eaurora.org,
psodagud@...eaurora.org, baicar@...amperecomputing.com
Subject: Re: [PATCH 2/2] drivers: edac: Add EDAC support for Kryo CPU caches
Hi James,
On 2020-01-16 00:19, James Morse wrote:
> Hi guys,
>
> (CC: +Tyler)
>
> On 13/01/2020 05:44, Sai Prakash Ranjan wrote:
>> On 2019-12-30 17:20, Borislav Petkov wrote:
>>> On Thu, Dec 05, 2019 at 09:53:18AM +0000, Sai Prakash Ranjan wrote:
>>>> Kryo{3,4}XX CPU cores implement RAS extensions to support
>>>> Error Correcting Code(ECC). Currently all Kryo{3,4}XX CPU
>>>> cores (gold/silver a.k.a big/LITTLE) support ECC via RAS.
>>>
>>> via RAS what? ARM64_RAS_EXTN?
>>>
>>> In any case, this needs James to look at and especially if there's
>>> some
>>> ARM-generic functionality in there which should be shared, of course.
>
>> Yes it is ARM64_RAS_EXTN and I have been hoping if James can provide
>> the feedback,
>> it has been some time now since I posted this out.
>
> Sorry, I was out of the office for most of November/December, and I'm
> slowly catching up...
>
>
>>>> +
>>>> +config EDAC_QCOM_KRYO_POLL
>>>> + depends on EDAC_QCOM_KRYO
>>>> + bool "Poll on Kryo ECC registers"
>>>> + help
>>>> + This option chooses whether or not you want to poll on the
>>>> Kryo ECC
>>>> + registers. When this is enabled, the polling rate can be set
>>>> as a
>>>> + module parameter. By default, it will call the polling
>>>> function every
>>>> + second.
>>>
>>> Why is this a separate option and why should people use that?
>>>
>>> Can the polling/irq be switched automatically?
>
>> No it cannot be switched automatically. It is used in case some SoCs
>> do not support an irq
>> based mechanism for EDAC.
>> But I am contradicting myself because I am telling that atleast one
>> interrupt should be
>> specified in bindings,
>> so it is best if I drop this polling option for now.
>
> For now, sure. But I think this will come back for systems with
> embarrassing amounts of
> RAM that would rather scrub the errors than take a flood of IRQs. I'd
> like this to be
> controllable from user-space.
>
Ok so we should have an option to switch between polling and irq.
>
>>>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
>>>> index d77200c9680b..29edcfa6ec0e 100644
>>>> --- a/drivers/edac/Makefile
>>>> +++ b/drivers/edac/Makefile
>>>> @@ -85,5 +85,6 @@ obj-$(CONFIG_EDAC_SYNOPSYS) +=
>>>> synopsys_edac.o
>>>> obj-$(CONFIG_EDAC_XGENE) += xgene_edac.o
>>>> obj-$(CONFIG_EDAC_TI) += ti_edac.o
>>>> obj-$(CONFIG_EDAC_QCOM) += qcom_edac.o
>>>> +obj-$(CONFIG_EDAC_QCOM_KRYO) += qcom_kryo_edac.o
>>>
>>> What is the difference between this new driver and the qcom_edac one?
>>> Can
>>> functionality be shared?
>
> High-level story time:
> Until the 'v8.2' revision of the 'v8' Arm-architecture (the 64bit
> one), arm didn't
> describe how RAS should work. Partners implemented what they needed,
> and we ended up with
> this collection of drivers because they were all different.
>
> v8.2 fixed all this, the good news is once its done, we should never
> need another edac
> driver. (at least, not for SoCs built for v8.2). The downside is there
> is quite a lot in
> there, and we need to cover ACPI machines as well as DT.
>
That is true but the qcom_edac one which is merged is for LLC(system
cache) which is a QCOM IP.
>> qcom_edac driver is for QCOM system cache(last level cache), it should
>> be renamed to
>> qcom_llcc_edac.c.
>> This new driver is for QCOM Kryo CPU core caches(L1,L2,L3).
>>
>> Functionality cannot be shared as these two are different IP blocks
>> and best kept separate.
>
> The qcom_edac will be Qualcomm's pre-v8.2 support.
> This series is about the v8.2 support which all looks totally
> different to Linux.
>
As said before qcom_edac is for LLC which is not available on all SoCs.
QCOM's pre v8.2 support is not upstreamed.
>
>>>> + * ARM Cortex-A55, Cortex-A75, Cortex-A76 TRM Chapter B3.3
>>>
>>> Chapter? Where? URL?
>>>
>>
>> I chose this because these TRMs are openly available and if you search
>> for these above
>> terms like
>> "Cortex-A76 TRM Chapter B3.3" in google, then the first search result
>> will be the TRM pdf,
>> otherwise
>> I would have to specify the long URL for the pdf and we do not know
>> how long that URL link
>> will be active.
>
> These are SoC/CPU specific. Using these we can't solve the whole
> problem.
>
> The architecture all those should fit into is here:
> https://static.docs.arm.com/ddi0587/cb/2019_07_05_DD_0587_C_b.pdf
> (or https://developer.arm.com/docs/ and look for 'RAS')
>
> ... and the arm-arm.
>
Thanks for the link.
>
>>>> +static void dump_syndrome_reg(int error_type, int level,
>>>> + u64 errxstatus, u64 errxmisc,
>>>> + struct edac_device_ctl_info *edev_ctl)
>>>> +{
>>>> + char msg[KRYO_EDAC_MSG_MAX];
>>>> + const char *error_msg;
>>>> + int cpu;
>>>> +
>>>> + cpu = raw_smp_processor_id();
>>>
>>> Why raw_?
>>>
>>
>> Because we will be calling smp_processor_id in preemptible context and
>> if we enable
>> CONFIG_DEBUG_PREEMPT,
>> we would get a nice backtrace.
>>
>> [ 3.747468] BUG: using smp_processor_id() in preemptible [00000000]
>> code: swapper/0/1
>> [ 3.755527] caller is qcom_kryo_edac_probe+0x138/0x2b8
>> [ 3.760819] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G
>> S
>> 5.4.0-rc7-next-20191113-00009-g8666855d6a5b-dirty #107
>> [ 3.772323] Hardware name: Qualcomm Technologies, Inc. SM8150 MTP
>> (DT)
>> [ 3.779030] Call trace:
>> [ 3.781556] dump_backtrace+0x0/0x158
>> [ 3.785331] show_stack+0x14/0x20
>> [ 3.788741] dump_stack+0xb0/0xf4
>> [ 3.792164] debug_smp_processor_id+0xd8/0xe0
>> [ 3.796639] qcom_kryo_edac_probe+0x138/0x2b8
>> [ 3.801116] platform_drv_probe+0x50/0xa8
>> [ 3.805236] really_probe+0x108/0x360
>> [ 3.808999] driver_probe_device+0x58/0x100
>> [ 3.813304] device_driver_attach+0x6c/0x78
>> [ 3.817606] __driver_attach+0xb0/0xf0
>> [ 3.821459] bus_for_each_dev+0x68/0xc8
>> [ 3.825407] driver_attach+0x20/0x28
>> [ 3.829083] bus_add_driver+0x160/0x1f0
>> [ 3.833030] driver_register+0x60/0x110
>> [ 3.836976] __platform_driver_register+0x40/0x48
>> [ 3.841813] qcom_kryo_edac_driver_init+0x18/0x20
>> [ 3.846645] do_one_initcall+0x58/0x1a0
>> [ 3.850596] kernel_init_freeable+0x19c/0x240
>> [ 3.855075] kernel_init+0x10/0x108
>> [ 3.858665] ret_from_fork+0x10/0x1c
>
> and raw_ stops the backtrace? You are still preemptible. The problem
> still exists, you've
> just suppressed the warning.
>
> At any time in dump_syndrome_reg(), you could get an interrupt and
> another task gets
> scheduled. Later your thread is started on another cpu... but not the
> one whose cpu number
> you read from smp_processor_id(). Whatever you needed it for, might
> have the wrong value.
>
Ok will correct this.
>
>>>> +static int kryo_l1_l2_setup_irq(struct platform_device *pdev,
>>>> + struct edac_device_ctl_info *edev_ctl)
>>>> +{
>>>> + int cpu, errirq, faultirq, ret;
>>>> +
>>>> + edac_dev = devm_alloc_percpu(&pdev->dev, *edac_dev);
>>>> + if (!edac_dev)
>>>> + return -ENOMEM;
>>>> +
>>>> + for_each_possible_cpu(cpu) {
>>>> + preempt_disable();
>>>> + per_cpu(edac_dev, cpu) = edev_ctl;
>>>> + preempt_enable();
>>>> + }
>>>
>>> That sillyness doesn't belong here, if at all.
>
>> Sorry but I do not understand the sillyness here. Could you please
>> explain?
>
> preempt_disable() prevents another task being scheduled instead of
> you, avoiding the risk
> that you get scheduled on another cpu. In this case it doesn't matter
> which cpu you are
> running on as you aren't accessing _this_ cpu's edac_dev, you are
> accessing each one in a
> loop.
>
Thanks for the explanation James, now I get the sillyness.
Thanks,
Sai
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member
of Code Aurora Forum, hosted by The Linux Foundation
Powered by blists - more mailing lists