linux-kernel - Re: [PATCH] coredump: allow interrupting dumps of large anonymous regions

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAGudoHFg4BgeygyKV8tY_2Dk4cv9zwQnU6-n7jSxjwyyXzau6g@mail.gmail.com>
Date: Thu, 16 Jan 2025 13:04:42 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: Tavian Barnes <tavianator@...ianator.com>, linux-fsdevel@...r.kernel.org, 
	Alexander Viro <viro@...iv.linux.org.uk>, Christian Brauner <brauner@...nel.org>, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] coredump: allow interrupting dumps of large anonymous regions

On Thu, Jan 16, 2025 at 10:56 AM Jan Kara <jack@...e.cz> wrote:
>
> On Thu 16-01-25 08:46:48, Mateusz Guzik wrote:
> > On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> > > dump_user_range() supports sparse core dumps by skipping anonymous pages
> > > which have not been modified.  If get_dump_page() returns NULL, the page
> > > is skipped rather than written to the core dump with dump_emit_page().
> > >
> > > Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> > > so when dumping a very large sparse region, the core dump becomes
> > > effectively uninterruptible.  This can be observed with the following
> > > test program:
> > >
> > >     #include <stdlib.h>
> > >     #include <stdio.h>
> > >     #include <sys/mman.h>
> > >
> > >     int main(void) {
> > >         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
> > >                 MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
> > >         printf("%p %m\n", mem);
> > >         if (mem != MAP_FAILED) {
> > >                 mem[0] = 1;
> > >         }
> > >         abort();
> > >     }
> > >
> > > The program allocates 1 TiB of anonymous memory, touches one page of it,
> > > and aborts.  During the core dump, SIGKILL has no effect.  It takes
> > > about 30 seconds to finish the dump, burning 100% CPU.
> > >
> >
> > While the patch makes sense to me, this should not be taking anywhere
> > near this much time and plausibly after unscrewing it will stop being a
> > factor.
> >
> > So I had a look with a profiler:
> > -   99.89%     0.00%  a.out
> >      entry_SYSCALL_64_after_hwframe
> >      do_syscall_64
> >      syscall_exit_to_user_mode
> >      arch_do_signal_or_restart
> >    - get_signal
> >       - 99.89% do_coredump
> >          - 99.88% elf_core_dump
> >             - dump_user_range
> >                - 98.12% get_dump_page
> >                   - 64.19% __get_user_pages
> >                      - 40.92% gup_vma_lookup
> >                         - find_vma
> >                            - mt_find
> >                                 4.21% __rcu_read_lock
> >                                 1.33% __rcu_read_unlock
> >                      - 3.14% check_vma_flags
> >                           0.68% vma_is_secretmem
> >                        0.61% __cond_resched
> >                        0.60% vma_pgtable_walk_end
> >                        0.59% vma_pgtable_walk_begin
> >                        0.58% no_page_table
> >                   - 15.13% down_read_killable
> >                        0.69% __cond_resched
> >                     13.84% up_read
> >                  0.58% __cond_resched
> >
> >
> > Almost 29% of time is spent relocking the mmap semaphore in
> > __get_user_pages. This most likely can operate locklessly in the fast
> > path. Even if somehow not, chances are the lock can be held across
> > multiple calls.
> >
> > mt_find spends most of it's time issuing a rep stos of 48 bytes (would
> > be faster to rep mov 6 times instead). This is the compiler being nasty,
> > I'll maybe look into it.
> >
> > However, I strongly suspect the current iteration method is just slow
> > due to repeat mt_find calls and The Right Approach(tm) would make this
> > entire thing finish within miliseconds by iterating the maple tree
> > instead, but then the mm folk would have to be consulted on how to
> > approach this and it may be time consuming to implement.
> >
> > Sorting out relocking should be an easily achievable & measurable win
> > (no interest on my end).
>
> As much as I agree the code is dumb, doing what you suggest with mmap_sem
> isn't going to be easy. You cannot call dump_emit_page() with mmap_sem held
> as that will cause lock inversion between mmap_sem and whatever filesystem
> locks we have to take. So the fix would have to involve processing larger
> batches of address space at once (which should also somewhat amortize the
> __get_user_pages() setup costs). Not that hard to do but I wanted to spell
> it out in case someone wants to pick up this todo item :)
>

Is the lock really needed to begin with?

Suppose it is.

In this context there are next to no pages found, but there is a
gazillion relocks as the entire VA is being walked.

Bare minimum patch which would already significantly help would start
with the lock held and only relock if there is a page to dump, should
be very easy to add.

I however vote for someone mm-savvy to point out an easy way (if any)
to just iterate pages which are there instead.
-- 
Mateusz Guzik <mjguzik gmail.com>