Message-ID: <CAGudoHFg4BgeygyKV8tY_2Dk4cv9zwQnU6-n7jSxjwyyXzau6g@mail.gmail.com>
Date: Thu, 16 Jan 2025 13:04:42 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: Tavian Barnes <tavianator@...ianator.com>, linux-fsdevel@...r.kernel.org,
Alexander Viro <viro@...iv.linux.org.uk>, Christian Brauner <brauner@...nel.org>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] coredump: allow interrupting dumps of large anonymous regions

On Thu, Jan 16, 2025 at 10:56 AM Jan Kara <jack@...e.cz> wrote:
>
> On Thu 16-01-25 08:46:48, Mateusz Guzik wrote:
> > On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> > > dump_user_range() supports sparse core dumps by skipping anonymous pages
> > > which have not been modified. If get_dump_page() returns NULL, the page
> > > is skipped rather than written to the core dump with dump_emit_page().
> > >
> > > Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> > > so when dumping a very large sparse region, the core dump becomes
> > > effectively uninterruptible. This can be observed with the following
> > > test program:
> > >
> > >     #include <stdlib.h>
> > >     #include <stdio.h>
> > >     #include <sys/mman.h>
> > >
> > >     int main(void) {
> > >             char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
> > >                              MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
> > >             printf("%p %m\n", mem);
> > >             if (mem != MAP_FAILED) {
> > >                     mem[0] = 1;
> > >             }
> > >             abort();
> > >     }
> > >
> > > The program allocates 1 TiB of anonymous memory, touches one page of it,
> > > and aborts. During the core dump, SIGKILL has no effect. It takes
> > > about 30 seconds to finish the dump, burning 100% CPU.
> > >
> >
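For context, the fix presumably boils down to also checking
dump_interrupted() on the skip path; an untested sketch (not necessarily
the actual diff) against the skip branch of dump_user_range() in
fs/coredump.c:

		} else {
			dump_skip(cprm, PAGE_SIZE);
			/* the skip path otherwise never checks for a pending SIGKILL */
			if (dump_interrupted())
				return 0;
		}
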
> > While the patch makes sense to me, this should not be taking anywhere
> > near this much time to begin with, and plausibly, once that is
> > unscrewed, the interruptibility will stop being a factor.
> >
> > So I had a look with a profiler:
> > -   99.89%     0.00%  a.out
> >      entry_SYSCALL_64_after_hwframe
> >      do_syscall_64
> >      syscall_exit_to_user_mode
> >      arch_do_signal_or_restart
> >      - get_signal
> >         - 99.89% do_coredump
> >            - 99.88% elf_core_dump
> >               - dump_user_range
> >                  - 98.12% get_dump_page
> >                     - 64.19% __get_user_pages
> >                        - 40.92% gup_vma_lookup
> >                           - find_vma
> >                              - mt_find
> >                                   4.21% __rcu_read_lock
> >                                   1.33% __rcu_read_unlock
> >                        - 3.14% check_vma_flags
> >                             0.68% vma_is_secretmem
> >                          0.61% __cond_resched
> >                          0.60% vma_pgtable_walk_end
> >                          0.59% vma_pgtable_walk_begin
> >                          0.58% no_page_table
> >                     - 15.13% down_read_killable
> >                          0.69% __cond_resched
> >                       13.84% up_read
> >                       0.58% __cond_resched
> >
> >
> > Almost 29% of the time is spent relocking the mmap semaphore in
> > __get_user_pages. This most likely can operate locklessly in the fast
> > path. Even if somehow not, chances are the lock can be held across
> > multiple calls.
> >
> > mt_find spends most of its time issuing a rep stos of 48 bytes (it
> > would be faster to issue 6 plain movs instead). This is the compiler
> > being nasty; I'll maybe look into it.
> >
> > However, I strongly suspect the current iteration method is just slow
> > due to repeated mt_find calls, and The Right Approach(tm) would make
> > this entire thing finish within milliseconds by iterating the maple
> > tree instead. But then the mm folk would have to be consulted on how
> > to approach this, and it may be time-consuming to implement.
> >
> > Sorting out relocking should be an easily achievable & measurable win
> > (no interest on my end).
>
> As much as I agree the code is dumb, doing what you suggest with mmap_sem
> isn't going to be easy. You cannot call dump_emit_page() with mmap_sem held
> as that will cause lock inversion between mmap_sem and whatever filesystem
> locks we have to take. So the fix would have to involve processing larger
> batches of address space at once (which should also somewhat amortize the
> __get_user_pages() setup costs). Not that hard to do but I wanted to spell
> it out in case someone wants to pick up this todo item :)
>
Is the lock really needed to begin with? Suppose it is.

In this context next to no pages are found, but there are a gazillion
relocks as the entire VA range is being walked.

A bare-minimum patch which would already help significantly would start
with the lock held and only relock when there is a page to dump; that
should be very easy to add.
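
Modulo details, something like this (untested; get_dump_page_locked() is
made up here -- the current get_dump_page() takes and drops mmap_sem
internally on every call):

	mmap_read_lock(mm);
	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
		struct page *page;

		/* hypothetical variant which expects mmap_sem to be held */
		page = get_dump_page_locked(addr);
		if (!page) {
			/*
			 * Fine for seekable output; for pipes dump_skip()
			 * writes zeroes and would need the same unlock
			 * treatment as below.
			 */
			dump_skip(cprm, PAGE_SIZE);
			continue;
		}
		/* drop mmap_sem before dump_emit_page() takes fs locks */
		mmap_read_unlock(mm);
		stop = !dump_emit_page(cprm, page);
		put_page(page);
		if (stop)
			return 0;
		mmap_read_lock(mm);
	}
	mmap_read_unlock(mm);

Batching several found pages per lock trip, as suggested above, would
fall out of the same structure.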
However, I vote for someone mm-savvy to point out an easy way (if there
is one) to just iterate the pages which are actually there instead.
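
What I have in mind, very roughly (untested, the callback and ops names
are made up, and mmap_sem would need to be held across the walk):

	static int dump_pte_entry(pte_t *pte, unsigned long addr,
				  unsigned long next, struct mm_walk *walk)
	{
		/* empty slots in an existing page table still show up here */
		if (pte_none(ptep_get(pte)))
			return 0;
		/* stash addr so the caller can emit the page after unlocking */
		return 0;
	}

	static const struct mm_walk_ops dump_walk_ops = {
		.pte_entry	= dump_pte_entry,
	};

	/*
	 * Ranges with no page tables at all are skipped wholesale, so a
	 * 1 TiB VMA with one touched page means visiting a handful of
	 * entries instead of probing 2^28 addresses with get_dump_page().
	 */
	walk_page_range(mm, start, start + len, &dump_walk_ops, cprm);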
--
Mateusz Guzik <mjguzik gmail.com>