[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <c7b8e28d-cca0-93c7-c1cb-5a9b07e2332d@redhat.com>
Date: Wed, 27 Oct 2021 09:15:35 +0200
From: David Hildenbrand <david@...hat.com>
To: Peter Xu <peterx@...hat.com>,
Naoya Horiguchi <naoya.horiguchi@...ux.dev>
Cc: Dave Hansen <dave.hansen@...ux.intel.com>, linux-mm@...ck.org,
Andrew Morton <akpm@...ux-foundation.org>,
Alistair Popple <apopple@...dia.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Konstantin Khlebnikov <koct9i@...il.com>,
Bin Wang <wangbin224@...wei.com>,
Yang Shi <shy828301@...il.com>,
Naoya Horiguchi <naoya.horiguchi@....com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1] mm, pagemap: expose hwpoison entry
On 27.10.21 09:02, Peter Xu wrote:
> On Wed, Oct 27, 2021 at 03:45:13PM +0900, Naoya Horiguchi wrote:
>> On Wed, Oct 27, 2021 at 10:09:03AM +0800, Peter Xu wrote:
>>> On Wed, Oct 27, 2021 at 08:27:36AM +0900, Naoya Horiguchi wrote:
>>>> On Mon, Oct 04, 2021 at 11:32:28PM +0900, Naoya Horiguchi wrote:
>>>>> On Mon, Oct 04, 2021 at 01:55:30PM +0200, David Hildenbrand wrote:
>>>>>> On 04.10.21 13:50, Naoya Horiguchi wrote:
>>>> ...
>>>>>>>
>>>>>>> Hwpoison entry for hugepage is also exposed by this patch. The below
>>>>>>> example shows how pagemap is visible in the case where a memory error
>>>>>>> hit a hugepage mapped to a process.
>>>>>>>
>>>>>>> $ ./page-types --no-summary --pid $PID --raw --list --addr 0x700000000+0x400
>>>>>>> voffset offset len flags
>>>>>>> 700000000 12fa00 1 ___U_______Ma__H_G_________________f_______1
>>>>>>> 700000001 12fa01 1ff ___________Ma___TG_________________f_______1
>>>>>>> 700000200 12f800 1 __________B________X_______________f______w_
>>>>>>> 700000201 12f801 1 ___________________X_______________f______w_ // memory failure hit this page
>>>>>>> 700000202 12f802 1fe __________B________X_______________f______w_
>>>>>>>
>>>>>>> The entries with both of "X" flag (hwpoison flag) and "w" flag (swap
>>>>>>> flag) are considered as hwpoison entries. So all pages in 2MB range
>>>>>>> are inaccessible from the process. We can get actual error location
>>>>>>> by page-types in physical address mode.
>>>>>>>
>>>>>>> $ ./page-types --no-summary --addr 0x12f800+0x200 --raw --list
>>>>>>> offset len flags
>>>>>>> 12f800 1 __________B_________________________________
>>>>>>> 12f801 1 ___________________X________________________
>>>>>>> 12f802 1fe __________B_________________________________
>>>>>>>
>>>>>>> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@....com>
>>>>>>> ---
>>>>>>> fs/proc/task_mmu.c | 41 ++++++++++++++++++++++++++++++++---------
>>>>>>> include/linux/swapops.h | 13 +++++++++++++
>>>>>>> tools/vm/page-types.c | 7 ++++++-
>>>>>>> 3 files changed, 51 insertions(+), 10 deletions(-)
>>>>>>
>>>>>>
>>>>>> Please also update the documentation located at
>>>>>>
>>>>>> Documentation/admin-guide/mm/pagemap.rst
>>>>>
>>>>> I will do this in the next post.
>>>>
>>>> Reading the document, I found that swap type is already exported so we
>>>> could identify hwpoison entry with it (without new PM_HWPOISON bit).
>>>> One problem is that the format of swap types (like SWP_HWPOISON) depends
>>>> on a few config macros like CONFIG_DEVICE_PRIVATE and CONFIG_MIGRATION,
>>>> so we also need to export how the swap type field is interpreted.
>>>
>>> I had similar question before.. though it was more on the generic swap entries
>>> not the special ones yet.
>>>
>>> The thing is I don't know how the userspace could interpret normal swap device
>>> indexes out of reading pagemap, say if we have two swap devices with "swapon
>>> -s" then I've no idea how do we know which device has which swap type index
>>> allocated. That seems to be a similar question asked above on special swap
>>> types - the interface seems to be incomplete, if not unused at all.
>>>
>>> AFAIU the information on "this page is swapped out to device X on offset Y" is
>>> not reliable too, because the pagein/pageout from kernel is transparent to the
>>> userspace and not under control of userspace at all. IOW, if the user reads
>>> that swap entry, then reads data upon the disk of that offset out and put it
>>> somewhere else, then it means the data read could already be old if kernel
>>> paged in the page after userspace reading the pagemap but before it reading the
>>> disk, and I don't see any way to make it right unless the userspace could stop
>>> the kernel from page-in a swap entry. That's why I really wonder whether we
>>> should expose normal swap entry at all, as I don't know how it could be helpful
>>> and used in the 100% right way.
>>
>> Thank you for the feedback.
>>
>> I think that a process interested in controlling swap-in/out behavior in its own
>> typically calls mincore() to get current status and madvise() to trigger swap-in/out.
>> That's not 100% solution for the same reason, but it mostly works well because
>> calling madvise(MADV_PAGEOUT) to already swapped out is not a big issue (although
>> some CPU/memory resource is wasted, but the amount of the waste is small if the
>> returned info is new enough).
>> So my point is that the concern around information newness might be more generic
>> issue rather than just for pagemap. If we need 100% accurate in-kernel info,
>> maybe it had better be done in kernel (or some cooler stuff like eBPF)?
>
> I fully agree the solution you mentioned with mincore() and madvise(), that is
> very sane and working approach. Though IMHO the major thing I wanted to point
> out is for generic swap devices we exposed (disk_index, disk_offset) tuple as
> the swap entry (besides "whether this page is swapped out or not"; that's
> PM_SWAP, and as you mentioned people'll need to rely on mincore() to make it
> right for shmem), though to use it we need to either record the index/offset or
> read/write data from it. However none of them will make sense, IMHO.. So I
> think exposing PM_SWAP makes sense, not the swap entries on swap devices.
>
>>
>>>
>>> Special swap entries seem a bit different - at least for is_pfn_swap_entry()
>>> typed swap entries we can still expose the PFN which might be helpful, which I
>>> can't tell.
>>
>> I'm one who think it helpful for testing, although I know testing might not be
>> considered as a real usecase.
>
> I think testing is valid use case too.
>
>>
>>>
>>> I used to send an email to Matt Mackall <mpm@...enic.com> and Dave Hansen
>>> <dave.hansen@...ux.intel.com> asking about above but didn't get a reply. Ccing
>>> again this time with the list copied.
>>>
>>>>
>>>> I thought of adding new interfaces for example under /sys/kernel/mm/swap/type_format/,
>>>> which shows info like below (assuming that all CONFIG_{DEVICE_PRIVATE,MIGRATION,MEMORY_FAILURE}
>>>> is enabled):
>>>>
>>>> $ ls /sys/kernel/mm/swap/type_format/
>>>> hwpoison
>>>> migration_read
>>>> migration_write
>>>> device_write
>>>> device_read
>>>> device_exclusive_write
>>>> device_exclusive_read
>>>>
>>>> $ cat /sys/kernel/mm/swap/type_format/hwpoison
>>>> 25
>>>>
>>>> $ cat /sys/kernel/mm/swap/type_format/device_write
>>>> 28
>>>>
>>>> Does it make sense or any better approach?
>>>
>>> Then I'm wondering whether we care about the rest of the normal swap devices
>>> too with pagemap so do we need to expose some information there too (only if
>>> there's a real use case, though..)? Or... should we just don't expose swap
>>> entries at all, at least generic swap entries? We can still expose things like
>>> hwpoison via PM_* bits well defined in that case.
>>
>> I didn't think about normal swap devices for no reason. I'm OK to stop exposing
>> normal swap device part. I don't have strong option yet about which approach
>> (in swaptype or PM_HWPOISON) I'll suggest next (so wait a little more for feedback).
>
> No strong opinion here too. It's just that the new interface proposed reminded
> me that it's partially complete if considering we're also exposing swap entries
> on swap devices, so the types didn't cover those entries. However it's more
> like a pure question because I never figured out how those entries will work
> anyway. I'd be willing to know whether Dave Hanson would comment on this.
>
> While the PM_HWPOISON approach looks always sane to me.
I consider that somehow cleaner, because how HWPOISON entries are
implemented ("fake swap entries") is somewhat an internal implementation
detail.
(I also agree that PM_SWAP makes sense, but maybe really only when we're
actually dealing with something that has been/is currently being swapped
out. Maybe we should just not expose fake swap entries via PM_SWAP and
instead use proper PM_ types for that. PM_MIGRATION, PM_HWPOISON, ...)
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists