[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <26acef2760e843ea92edeba199812d73@huawei.com>
Date: Mon, 11 May 2020 11:20:50 +0000
From: Shiju Jose <shiju.jose@...wei.com>
To: James Morse <james.morse@....com>, Borislav Petkov <bp@...en8.de>
CC: "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"rjw@...ysocki.net" <rjw@...ysocki.net>,
"helgaas@...nel.org" <helgaas@...nel.org>,
"lenb@...nel.org" <lenb@...nel.org>,
"tony.luck@...el.com" <tony.luck@...el.com>,
"gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
"zhangliguang@...ux.alibaba.com" <zhangliguang@...ux.alibaba.com>,
"tglx@...utronix.de" <tglx@...utronix.de>,
Linuxarm <linuxarm@...wei.com>,
Jonathan Cameron <jonathan.cameron@...wei.com>,
tanxiaofei <tanxiaofei@...wei.com>,
yangyicong <yangyicong@...wei.com>
Subject: RE: [PATCH v6 1/2] ACPI / APEI: Add support to notify the vendor
specific HW errors
Hi Boris, Hi James,
>-----Original Message-----
>From: linux-pci-owner@...r.kernel.org [mailto:linux-pci-
>owner@...r.kernel.org] On Behalf Of James Morse
>Sent: 08 April 2020 11:03
>To: Borislav Petkov <bp@...en8.de>; Shiju Jose <shiju.jose@...wei.com>
>Cc: linux-acpi@...r.kernel.org; linux-pci@...r.kernel.org; linux-
>kernel@...r.kernel.org; rjw@...ysocki.net; helgaas@...nel.org;
>lenb@...nel.org; tony.luck@...el.com; gregkh@...uxfoundation.org;
>zhangliguang@...ux.alibaba.com; tglx@...utronix.de; Linuxarm
><linuxarm@...wei.com>; Jonathan Cameron
><jonathan.cameron@...wei.com>; tanxiaofei <tanxiaofei@...wei.com>;
>yangyicong <yangyicong@...wei.com>
>Subject: Re: [PATCH v6 1/2] ACPI / APEI: Add support to notify the vendor
>specific HW errors
>
>Hi Boris, Shiju,
>
>Sorry for not spotting this reply earlier: Its in-reply to v1, so gets buried.
>
>On 27/03/2020 18:22, Borislav Petkov wrote:
>> On Wed, Mar 25, 2020 at 04:42:22PM +0000, Shiju Jose wrote:
>>> Presently APEI does not support reporting the vendor specific HW
>>> errors, received in the vendor defined table entries, to the vendor
>>> drivers for any recovery.
>>>
>>> This patch adds the support to register and unregister the
>>
>> Avoid having "This patch" or "This commit" in the commit message. It
>> is tautologically useless.
>>
>> Also, do
>>
>> $ git grep 'This patch' Documentation/process
>>
>> for more details.
>>
>>> error handling function for the vendor specific HW errors and notify
>>> the registered kernel driver.
>
>>> @@ -526,10 +552,17 @@ static void ghes_do_proc(struct ghes *ghes,
>>> log_arm_hw_error(err);
>>> } else {
>>> void *err = acpi_hest_get_payload(gdata);
>>> + u8 error_handled = false;
>>> + int ret;
>>> +
>>> + ret =
>atomic_notifier_call_chain(&ghes_event_notify_list, 0,
>>> +gdata);
>>
>> Well, this is a notifier with standard name for a non-standard event.
>> Not optimal.
>>
>> Why does only this event need a notifier? Because your driver is
>> interested in only those events?
>
>Its the 'else' catch-all for stuff drivers/acpi/apei doesn't know to handle.
>
>In this case its because its a vendor specific GUID that only the vendor driver
>knows how to parse.
>
>
>>> + if (ret & NOTIFY_OK)
>>> + error_handled = true;
>>>
>>> log_non_standard_event(sec_type, fru_id, fru_text,
>>> sec_sev, err,
>>> - gdata->error_data_length);
>>> + gdata->error_data_length,
>>> + error_handled);
>>
>> What's that error_handled thing for? That's just silly.
>>
>> Your notifier returns NOTIFY_STOP when it has queued the error. If you
>> don't want to log it, just test == NOTIFY_STOP and do not log it then.
>
>My thinking for this being needed was so user-space consumers of those
>tracepoints keep working. Otherwise you upgrade, get this feature, and your
>user-space counters stop working.
>
>You'd need to know this error source was now managed by an in-kernel
>driver, which may report the errors somewhere else...
>
>
>> Then your notifier callback is queuing the error into a kfifo for
>> whatever reason and then scheduling a workqueue to handle it in user
>> context...
>>
>> So I'm thinking that it would be better if you:
>>
>> * make that kfifo generic and part of ghes.c and queue all types of
>> error records into it in ghes_do_proc() - not just the non-standard
>> ones.
>
>Move the drop to process context into ghes.c? This should result in less code.
>
>I asked for this hooking to only be for the 'catch all' don't-know case so that
>we don't get drivers trying to hook and handle memory errors. (if we ever
>wanted that, it should be from part of memory_failure() so it catches all the
>ways of reporting memory-failure) 32bit arm has prior in this area.
>
>
>> * then, when you're done queuing, you kick a workqueue.
>>
>> * that workqueue runs a normal, blocking notifier to which drivers
>> register.
>>
>> Your driver can register to that notifier too and do the normal
>> handling then and not have this ad-hoc, semi-generic, semi-vendor-specific
>thing.
>
>As long as we don't walk a list of things that might handle a memory-error,
>and have some random driver try and NOTIFY_STOP it....
>
>aer_recover_queue() would be replaced by this. memory_failure_queue() has
>one additional caller in drivers/ras/cec.c.
Can you suggest whether the standard errors can report through the
notifier (kfifo + blocking notifier), [which implemented in V7 patch], or not
so that we can proceed with the changes to notify the vendor specific errors?
>
>
>Thanks,
>
>James
Thanks,
Shiju
Powered by blists - more mailing lists