Message-ID: <CAG48ez10dYpom22cQNgj62wkztbjpJiuuSroE5BahNkpnN-y3Q@mail.gmail.com>
Date: Mon, 18 Nov 2024 22:55:24 +0100
From: Jann Horn <jannh@...gle.com>
To: Pasha Tatashin <pasha.tatashin@...een.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-doc@...r.kernel.org, linux-fsdevel@...r.kernel.org,
cgroups@...r.kernel.org, linux-kselftest@...r.kernel.org,
akpm@...ux-foundation.org, corbet@....net, derek.kiernan@....com,
dragan.cvetic@....com, arnd@...db.de, gregkh@...uxfoundation.org,
viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz, tj@...nel.org,
hannes@...xchg.org, mhocko@...nel.org, roman.gushchin@...ux.dev,
shakeel.butt@...ux.dev, muchun.song@...ux.dev, Liam.Howlett@...cle.com,
lorenzo.stoakes@...cle.com, vbabka@...e.cz, shuah@...nel.org,
vegard.nossum@...cle.com, vattunuru@...vell.com, schalla@...vell.com,
david@...hat.com, willy@...radead.org, osalvador@...e.de,
usama.anjum@...labora.com, andrii@...nel.org, ryan.roberts@....com,
peterx@...hat.com, oleg@...hat.com, tandersen@...flix.com,
rientjes@...gle.com, gthelen@...gle.com
Subject: Re: [RFCv1 4/6] misc/page_detective: Introduce Page Detective
On Sat, Nov 16, 2024 at 6:59 PM Pasha Tatashin
<pasha.tatashin@...een.com> wrote:
> Page Detective is a kernel debugging tool that provides detailed
> information about the usage and mapping of physical memory pages.
>
> It operates through the Linux debugfs interface, providing access
> to both virtual and physical address inquiries. The output, presented
> via kernel log messages (accessible with dmesg), will help
> administrators and developers understand how specific pages are
> utilized by the system.
>
> This tool can be used to investigate various memory-related issues,
> such as checksum failures during live migration, filesystem journal
> failures, general segfaults, or other corruptions.
[...]
> +/*
> + * Walk kernel page table, and print all mappings to this pfn, return 1 if
> + * pfn is mapped in direct map, return 0 if not mapped in direct map, and
> + * return -1 if operation canceled by user.
> + */
> +static int page_detective_kernel_map_info(unsigned long pfn,
> + unsigned long direct_map_addr)
> +{
> + struct pd_private_kernel pr = {0};
> + unsigned long s, e;
> +
> + pr.direct_map_addr = direct_map_addr;
> + pr.pfn = pfn;
> +
> + for (s = PAGE_OFFSET; s != ~0ul; ) {
> + e = s + PD_WALK_MAX_RANGE;
> + if (e < s)
> + e = ~0ul;
> +
> + if (walk_page_range_kernel(s, e, &pd_kernel_ops, &pr)) {
I think which parts of the kernel virtual address range can safely be
pagewalked is somewhat architecture-specific; for example, x86 can run
under Xen PV, in which case I think parts of the page tables may not be
walkable because they're owned by the hypervisor for its own use?
Notably, the x86 version of ptdump_walk_pgd_level_core starts walking
at GUARD_HOLE_END_ADDR instead.
See also https://kernel.org/doc/html/latest/arch/x86/x86_64/mm.html
for an ASCII table reference on address space regions.
> + pr_info("Received a cancel signal from user, while scanning kernel mappings\n");
> + return -1;
> + }
> + cond_resched();
> + s = e;
> + }
> +
> + if (!pr.vmalloc_maps) {
> + pr_info("The page is not mapped into kernel vmalloc area\n");
> + } else if (pr.vmalloc_maps > 1) {
> + pr_info("The page is mapped into vmalloc area: %ld times\n",
> + pr.vmalloc_maps);
> + }
> +
> + if (!pr.direct_map)
> + pr_info("The page is not mapped into kernel direct map\n");
> +
> + pr_info("The page mapped into kernel page table: %ld times\n", pr.maps);
> +
> + return pr.direct_map ? 1 : 0;
> +}
> +
> +/* Print kernel information about the pfn, return -1 if canceled by user */
> +static int page_detective_kernel(unsigned long pfn)
> +{
> + unsigned long *mem = __va((pfn) << PAGE_SHIFT);
> + unsigned long sum = 0;
> + int direct_map;
> + u64 s, e;
> + int i;
> +
> + s = sched_clock();
> + direct_map = page_detective_kernel_map_info(pfn, (unsigned long)mem);
> + e = sched_clock() - s;
> + pr_info("Scanned kernel page table in [%llu.%09llus]\n",
> + e / NSEC_PER_SEC, e % NSEC_PER_SEC);
> +
> + /* Canceled by user or no direct map */
> + if (direct_map < 1)
> + return direct_map;
> +
> + for (i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++)
> + sum |= mem[i];
If the purpose of this interface is to inspect pages in weird states,
I wonder if it would make sense to use something like
copy_mc_to_kernel() in case that helps avoid kernel crashes due to
uncorrectable 2-bit ECC errors or such. But maybe that's not the kind
of error you're concerned about here? And I also don't have any idea
if copy_mc_to_kernel() actually does anything sensible for ECC errors.
So don't treat this as a fix suggestion, more as a random idea that
should probably be ignored unless someone who understands ECC errors
says it makes sense.
But I think you should at least be using READ_ONCE(), since you're
reading from memory that can change concurrently.
> + if (sum == 0)
> + pr_info("The page contains only zeroes\n");
> + else
> + pr_info("The page contains some data\n");
> +
> + return 0;
> +}
[...]
> +/*
> + * print information about mappings of pfn by mm, return -1 if canceled
> + * return number of mappings found.
> + */
> +static long page_detective_user_mm_info(struct mm_struct *mm, unsigned long pfn)
> +{
> + struct pd_private_user pr = {0};
> + unsigned long s, e;
> +
> + pr.pfn = pfn;
> + pr.mm = mm;
> +
> + for (s = 0; s != TASK_SIZE; ) {
TASK_SIZE does not make sense when inspecting another task, because
TASK_SIZE depends on the virtual address space size of the *current*
task (i.e. whether the caller is a 32-bit or 64-bit process), not the
one being inspected. Please use TASK_SIZE_MAX for remote process
access.
> + e = s + PD_WALK_MAX_RANGE;
> + if (e > TASK_SIZE || e < s)
> + e = TASK_SIZE;
> +
> + if (mmap_read_lock_killable(mm)) {
> + pr_info("Received a cancel signal from user, while scanning user mappings\n");
> + return -1;
> + }
> + walk_page_range(mm, s, e, &pd_user_ops, &pr);
> + mmap_read_unlock(mm);
> + cond_resched();
> + s = e;
> + }
> + return pr.maps;
> +}