Message-ID: <CAG48ez10dYpom22cQNgj62wkztbjpJiuuSroE5BahNkpnN-y3Q@mail.gmail.com>
Date: Mon, 18 Nov 2024 22:55:24 +0100
From: Jann Horn <jannh@...gle.com>
To: Pasha Tatashin <pasha.tatashin@...een.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-doc@...r.kernel.org, linux-fsdevel@...r.kernel.org,
cgroups@...r.kernel.org, linux-kselftest@...r.kernel.org,
akpm@...ux-foundation.org, corbet@....net, derek.kiernan@....com,
dragan.cvetic@....com, arnd@...db.de, gregkh@...uxfoundation.org,
viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz, tj@...nel.org,
hannes@...xchg.org, mhocko@...nel.org, roman.gushchin@...ux.dev,
shakeel.butt@...ux.dev, muchun.song@...ux.dev, Liam.Howlett@...cle.com,
lorenzo.stoakes@...cle.com, vbabka@...e.cz, shuah@...nel.org,
vegard.nossum@...cle.com, vattunuru@...vell.com, schalla@...vell.com,
david@...hat.com, willy@...radead.org, osalvador@...e.de,
usama.anjum@...labora.com, andrii@...nel.org, ryan.roberts@....com,
peterx@...hat.com, oleg@...hat.com, tandersen@...flix.com,
rientjes@...gle.com, gthelen@...gle.com
Subject: Re: [RFCv1 4/6] misc/page_detective: Introduce Page Detective
On Sat, Nov 16, 2024 at 6:59 PM Pasha Tatashin
<pasha.tatashin@...een.com> wrote:
> Page Detective is a kernel debugging tool that provides detailed
> information about the usage and mapping of physical memory pages.
>
> It operates through the Linux debugfs interface, providing access
> to both virtual and physical address inquiries. The output, presented
> via kernel log messages (accessible with dmesg), will help
> administrators and developers understand how specific pages are
> utilized by the system.
>
> This tool can be used to investigate various memory-related issues,
> such as checksum failures during live migration, filesystem journal
> failures, general segfaults, or other corruptions.
[...]
> +/*
> + * Walk kernel page table, and print all mappings to this pfn, return 1 if
> + * pfn is mapped in direct map, return 0 if not mapped in direct map, and
> + * return -1 if operation canceled by user.
> + */
> +static int page_detective_kernel_map_info(unsigned long pfn,
> + unsigned long direct_map_addr)
> +{
> + struct pd_private_kernel pr = {0};
> + unsigned long s, e;
> +
> + pr.direct_map_addr = direct_map_addr;
> + pr.pfn = pfn;
> +
> + for (s = PAGE_OFFSET; s != ~0ul; ) {
> + e = s + PD_WALK_MAX_RANGE;
> + if (e < s)
> + e = ~0ul;
> +
> + if (walk_page_range_kernel(s, e, &pd_kernel_ops, &pr)) {
I think which parts of the kernel virtual address range can safely be
pagewalked is somewhat architecture-specific; for example, x86 can run
under Xen PV, in which case I think parts of the page tables may not be
walkable because they're owned by the hypervisor for its own use?
Notably, the x86 version of ptdump_walk_pgd_level_core starts walking
at GUARD_HOLE_END_ADDR instead.
See also https://kernel.org/doc/html/latest/arch/x86/x86_64/mm.html
for an ASCII table reference on address space regions.
> + pr_info("Received a cancel signal from user, while scanning kernel mappings\n");
> + return -1;
> + }
> + cond_resched();
> + s = e;
> + }
> +
> + if (!pr.vmalloc_maps) {
> + pr_info("The page is not mapped into kernel vmalloc area\n");
> + } else if (pr.vmalloc_maps > 1) {
> + pr_info("The page is mapped into vmalloc area: %ld times\n",
> + pr.vmalloc_maps);
> + }
> +
> + if (!pr.direct_map)
> + pr_info("The page is not mapped into kernel direct map\n");
> +
> + pr_info("The page mapped into kernel page table: %ld times\n", pr.maps);
> +
> + return pr.direct_map ? 1 : 0;
> +}
> +
> +/* Print kernel information about the pfn, return -1 if canceled by user */
> +static int page_detective_kernel(unsigned long pfn)
> +{
> + unsigned long *mem = __va((pfn) << PAGE_SHIFT);
> + unsigned long sum = 0;
> + int direct_map;
> + u64 s, e;
> + int i;
> +
> + s = sched_clock();
> + direct_map = page_detective_kernel_map_info(pfn, (unsigned long)mem);
> + e = sched_clock() - s;
> + pr_info("Scanned kernel page table in [%llu.%09llus]\n",
> + e / NSEC_PER_SEC, e % NSEC_PER_SEC);
> +
> + /* Canceled by user or no direct map */
> + if (direct_map < 1)
> + return direct_map;
> +
> + for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++)
> + sum |= mem[i];
If the purpose of this interface is to inspect pages in weird states,
I wonder if it would make sense to use something like
copy_mc_to_kernel() in case that helps avoid kernel crashes due to
uncorrectable 2-bit ECC errors or such. But maybe that's not the kind
of error you're concerned about here? And I also don't have any idea
if copy_mc_to_kernel() actually does anything sensible for ECC errors.
So don't treat this as a fix suggestion, more as a random idea that
should probably be ignored unless someone who understands ECC errors
says it makes sense.
But I think you should at least be using READ_ONCE(), since you're
reading from memory that can change concurrently.
> + if (sum == 0)
> + pr_info("The page contains only zeroes\n");
> + else
> + pr_info("The page contains some data\n");
> +
> + return 0;
> +}
[...]
> +/*
> + * print information about mappings of pfn by mm, return -1 if canceled
> + * return number of mappings found.
> + */
> +static long page_detective_user_mm_info(struct mm_struct *mm, unsigned long pfn)
> +{
> + struct pd_private_user pr = {0};
> + unsigned long s, e;
> +
> + pr.pfn = pfn;
> + pr.mm = mm;
> +
> + for (s = 0; s != TASK_SIZE; ) {
TASK_SIZE does not make sense when inspecting another task, because
TASK_SIZE depends on the virtual address space size of the *current*
task (i.e. whether the caller is a 32-bit or 64-bit process), not the
one being inspected. Please use TASK_SIZE_MAX for remote process
access.
> + e = s + PD_WALK_MAX_RANGE;
> + if (e > TASK_SIZE || e < s)
> + e = TASK_SIZE;
> +
> + if (mmap_read_lock_killable(mm)) {
> + pr_info("Received a cancel signal from user, while scanning user mappings\n");
> + return -1;
> + }
> + walk_page_range(mm, s, e, &pd_user_ops, &pr);
> + mmap_read_unlock(mm);
> + cond_resched();
> + s = e;
> + }
> + return pr.maps;
> +}