Message-ID: <20191022102457.GJ9379@dhcp22.suse.cz>
Date: Tue, 22 Oct 2019 12:24:57 +0200
From: Michal Hocko <mhocko@...nel.org>
To: Oscar Salvador <osalvador@...e.de>
Cc: n-horiguchi@...jp.nec.com, mike.kravetz@...cle.com,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v2 10/16] mm,hwpoison: Rework soft offline for free
pages
On Tue 22-10-19 11:58:52, Oscar Salvador wrote:
> On Tue, Oct 22, 2019 at 11:22:56AM +0200, Michal Hocko wrote:
> > Hmm, that might be a misunderstanding on my end. I thought that it is
> > the MCE handler to say whether the failure is recoverable or not. If yes
> > then we can touch the content of the memory (that would imply the
> > migration). Other than that both paths should be essentially the same,
> > no? Well unrecoverable case would be essentially force migration failure
> > path.
> >
> > MADV_HWPOISON is explicitly documented to test MCE handling IIUC:
> > : This feature is intended for testing of memory error-handling
> > : code; it is available only if the kernel was configured with
> > : CONFIG_MEMORY_FAILURE.
> >
> > There is no explicit note about the type of the error that is injected
> > but I think it is reasonably safe to assume this is a recoverable one.
>
> MADV_HWPOISON stands for hard-offline.
> MADV_SOFT_OFFLINE stands for soft-offline.
>
> MADV_SOFT_OFFLINE (since Linux 2.6.33)
> Soft offline the pages in the range specified by addr and
> length. The memory of each page in the specified range is
> preserved (i.e., when next accessed, the same content will be
> visible, but in a new physical page frame), and the original
> page is offlined (i.e., no longer used, and taken out of
> normal memory management). The effect of the
> MADV_SOFT_OFFLINE operation is invisible to (i.e., does not
> change the semantics of) the calling process.
>
> This feature is intended for testing of memory error-handling
> code; it is available only if the kernel was configured with
> CONFIG_MEMORY_FAILURE.
I missed that one somehow. Thanks for pointing it out.
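
A minimal sketch of exercising MADV_SOFT_OFFLINE from userspace, to
illustrate the "same content, new physical page frame" wording quoted
above. It is only an illustration under assumptions: the helper name
soft_offline_roundtrip is made up here, mmap.madvise needs Python 3.8 or
newer, and the MADV_SOFT_OFFLINE constant is only exposed where the
platform headers define it. The request is expected to fail (typically
with EPERM) without CAP_SYS_ADMIN, or with EINVAL when the kernel lacks
CONFIG_MEMORY_FAILURE.

```python
import mmap


def soft_offline_roundtrip():
    """Fill one anonymous page, request soft offline, report the result.

    Returns (err, preserved): err is None on success, an errno value if
    the kernel refused the request, or -1 if this Python build does not
    expose the call. preserved tells whether the buffer still holds the
    bytes written before the request.
    """
    # Private anonymous mapping of exactly one page (hypothetical test buffer).
    buf = mmap.mmap(-1, mmap.PAGESIZE, flags=mmap.MAP_PRIVATE)
    expected = b"\xa5" * mmap.PAGESIZE
    buf[:] = expected  # touch the page so it is actually populated

    advice = getattr(mmap, "MADV_SOFT_OFFLINE", None)
    err = None
    if advice is None or not hasattr(buf, "madvise"):
        err = -1  # assumption: treat "not exposed" as a soft failure
    else:
        try:
            buf.madvise(advice, 0, mmap.PAGESIZE)
        except OSError as exc:
            err = exc.errno  # e.g. EPERM or EINVAL

    # Whether or not the page was migrated, the caller should see the
    # same bytes; soft offline is meant to be invisible to the process.
    preserved = buf[:] == expected
    buf.close()
    return err, preserved
```

Calling soft_offline_roundtrip() as an unprivileged user should report an
errno and preserved == True; on a privileged run with
CONFIG_MEMORY_FAILURE the expectation is err == None and, again,
preserved == True.
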
[...]
> AFAICS, for hard-offline case, a recovered event would be if:
>
> - the page to shut down is already free
> - the page was unmapped
>
> In some cases we need to kill the process if it holds dirty pages.
Yes, I would expect that the page table would be poisoned and the
process would receive a SIGBUS when accessing that memory.
> But we never migrate contents in hard-offline path.
> I guess it is because we cannot really trust the contents anymore.
Yes, that makes perfect sense. What I am saying is that the migration
(aka trying to recover) is the main and only difference. The soft
offline should poison page tables when not able to migrate as well
IIUC.
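
The reasoning in this exchange can be written down as a toy decision
model. This is not kernel code and does not mirror mm/memory-failure.c;
Page, Outcome, hard_offline and soft_offline are hypothetical names, and
the branches only encode what is said in this thread: hard offline never
migrates, soft offline tries migration first, and a failed migration ends
in poisoned page tables (the "IIUC" part above).

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    RECOVERED = "page was free, simply taken offline"
    MIGRATED = "contents moved to a new page, old page offlined"
    POISONED = "page unmapped, page table poisoned, SIGBUS on access"
    KILLED = "owner killed because dirty data cannot be trusted"


@dataclass
class Page:
    """Hypothetical stand-in for a page's state, not struct page."""
    free: bool = False
    dirty: bool = False
    migratable: bool = True


def hard_offline(page: Page) -> Outcome:
    # The contents are not trusted, so migration is never attempted.
    if page.free:
        return Outcome.RECOVERED
    if page.dirty:
        # "In some cases we need to kill the process if it holds dirty pages."
        return Outcome.KILLED
    return Outcome.POISONED


def soft_offline(page: Page) -> Outcome:
    if page.free:
        return Outcome.RECOVERED
    if page.migratable:
        # Migration is the one step that hard offline does not have.
        return Outcome.MIGRATED
    # Assumption taken from the reply above: a failed migration should
    # also end with poisoned page tables.
    return Outcome.POISONED
```

For example, soft_offline(Page()) yields Outcome.MIGRATED, while
soft_offline(Page(migratable=False)) and hard_offline(Page()) both yield
Outcome.POISONED, which is the "essentially the same" fallback being
argued for.
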
--
Michal Hocko
SUSE Labs