[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c886c628-551a-4bda-a68f-87549dfc6831@oracle.com>
Date: Mon, 20 May 2024 11:30:38 -0700
From: Jane Chu <jane.chu@...cle.com>
To: Oscar Salvador <osalvador@...e.de>
Cc: linmiaohe@...wei.com, nao.horiguchi@...il.com, akpm@...ux-foundation.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 3/5] mm/memory-failure: improve memory failure
action_result messages
On 5/16/2024 2:46 AM, Oscar Salvador wrote:
> On Fri, May 10, 2024 at 12:26:00AM -0600, Jane Chu wrote:
>> Added two explicit MF_MSG messages describing failure in get_hwpoison_page.
>> Attemped to document the definition of various action names, and made a few
>> adjustment to the action_result() calls.
>>
>> Signed-off-by: Jane Chu <jane.chu@...cle.com>
> ...
>> +/*
>> + * MF_IGNORED - Either the m-f() handler did nothing to recover, or it did
> "or if it"
>> + * something, then caught in a race condition which renders the effort sort
> "it was caught"
>
> I would also add to MF_IGNORED that we mark the page hwpoisoned anyway.
Thanks! Will fix, and add comments.
>
>> @@ -1018,7 +1034,7 @@ static int me_unknown(struct page_state *ps, struct page *p)
>> {
>> pr_err("%#lx: Unknown page state\n", page_to_pfn(p));
>> unlock_page(p);
>> - return MF_FAILED;
>> + return MF_IGNORED;
>> }
> I was confused because I saw you replaced all MF_MSG_UNKNOWN, so I
> wondered how we can end up here until I saw this is a catch-all in case
> we fail to make sense of the page->flags.
> While you are improving this, I would suggest to add a little comment
> above the function explaining how we can reach it.
Yes, it's a catch-all versus something that suppose to happen, will add
comments.
>
>> /*
>> @@ -2055,6 +2071,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>> if (flags & MF_ACTION_REQUIRED) {
>> folio = page_folio(p);
>> res = kill_accessing_process(current, folio_pfn(folio), flags);
>> + return action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
> I do not understand why are you doing this.
>
> First of all, why is this considered a failure? We did not fail this
> time, did we? We went right ahead and kill the process which was
> re-accessing the hwpoisoned page. Is that considered a failure?
Given that the goal for the m-f() handler is to isolate the poisoned
page, so in this case,
even if the kernel has killed the page, nothing else could be done to
prevent the page from
being accessed and triggering another MCE, which is, I suppose more
sever than triggering a page fault.
>
> Second, you are know supressing -EHWPOISON with whatever action_result()
> will gives us, which judging from the actual code would be -EBUSY?
> I do not think that that is right, and we should be returning
> -EHWPOISON. Or whatever error code kill_accessing_process() gives us.
>
>
>> @@ -2231,6 +2248,7 @@ int memory_failure(unsigned long pfn, int flags)
>> res = kill_accessing_process(current, pfn, flags);
>> if (flags & MF_COUNT_INCREASED)
>> put_page(p);
>> + action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
> This is not coherent with what you did in try_memory_failure_hugetlb
> for MF_MSG_ALREADY_POISONED, I __think__ that in there we should be
> doing the same as we do here.
>
>
Good catch, thanks. In both cases, 'res' from kill_accessing_process()
should be returned.
Thanks!
-jane
>
Powered by blists - more mailing lists