Message-ID: <63wvjel64hsft4clgeayaorx3v7txvqh264mw7ionlbmmve7pj@eblpknd677zf>
Date: Thu, 16 Jan 2025 10:56:53 +0100
From: Jan Kara <jack@...e.cz>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: Tavian Barnes <tavianator@...ianator.com>,
linux-fsdevel@...r.kernel.org, Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] coredump: allow interrupting dumps of large anonymous
regions
On Thu 16-01-25 08:46:48, Mateusz Guzik wrote:
> On Wed, Jan 15, 2025 at 11:05:38PM -0500, Tavian Barnes wrote:
> > dump_user_range() supports sparse core dumps by skipping anonymous pages
> > which have not been modified. If get_dump_page() returns NULL, the page
> > is skipped rather than written to the core dump with dump_emit_page().
> >
> > Sadly, dump_emit_page() contains the only check for dump_interrupted(),
> > so when dumping a very large sparse region, the core dump becomes
> > effectively uninterruptible. This can be observed with the following
> > test program:
> >
> > #include <stdlib.h>
> > #include <stdio.h>
> > #include <sys/mman.h>
> >
> > int main(void) {
> >         char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
> >                          MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
> >         printf("%p %m\n", mem);
> >         if (mem != MAP_FAILED) {
> >                 mem[0] = 1;
> >         }
> >         abort();
> > }
> >
> > The program allocates 1 TiB of anonymous memory, touches one page of it,
> > and aborts. During the core dump, SIGKILL has no effect. It takes
> > about 30 seconds to finish the dump, burning 100% CPU.
> >
>
> While the patch makes sense to me, this should not be taking anywhere
> near this much time and plausibly after unscrewing it will stop being a
> factor.
>
> So I had a look with a profiler:
> -   99.89%     0.00%  a.out
>        entry_SYSCALL_64_after_hwframe
>        do_syscall_64
>        syscall_exit_to_user_mode
>        arch_do_signal_or_restart
>      - get_signal
>         - 99.89% do_coredump
>            - 99.88% elf_core_dump
>               - dump_user_range
>                  - 98.12% get_dump_page
>                     - 64.19% __get_user_pages
>                        - 40.92% gup_vma_lookup
>                           - find_vma
>                              - mt_find
>                                   4.21% __rcu_read_lock
>                                   1.33% __rcu_read_unlock
>                        - 3.14% check_vma_flags
>                             0.68% vma_is_secretmem
>                          0.61% __cond_resched
>                          0.60% vma_pgtable_walk_end
>                          0.59% vma_pgtable_walk_begin
>                          0.58% no_page_table
>                     - 15.13% down_read_killable
>                          0.69% __cond_resched
>                       13.84% up_read
>                       0.58% __cond_resched
>
>
> Almost 29% of time is spent relocking the mmap semaphore in
> __get_user_pages. This most likely can operate locklessly in the fast
> path. Even if somehow not, chances are the lock can be held across
> multiple calls.
>
> mt_find spends most of its time issuing a rep stos of 48 bytes (would
> be faster to rep mov 6 times instead). This is the compiler being nasty,
> I'll maybe look into it.
>
> However, I strongly suspect the current iteration method is just slow
> due to repeated mt_find calls, and The Right Approach(tm) would make this
> entire thing finish within milliseconds by iterating the maple tree
> instead, but then the mm folk would have to be consulted on how to
> approach this and it may be time-consuming to implement.
>
> Sorting out relocking should be an easily achievable & measurable win
> (no interest on my end).

As much as I agree the code is dumb, doing what you suggest with mmap_sem
isn't going to be easy. You cannot call dump_emit_page() with mmap_sem held
as that will cause lock inversion between mmap_sem and whatever filesystem
locks we have to take. So the fix would have to involve processing larger
batches of address space at once (which should also somewhat amortize the
__get_user_pages() setup costs). Not that hard to do but I wanted to spell
it out in case someone wants to pick up this todo item :)
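
For the record, the batching idea might look roughly like the sketch below.
This is only pseudocode to illustrate the shape of the fix, not a tested
patch: the names dump_user_range_batched(), get_dump_page_locked() and
DUMP_BATCH are made up (in particular, the real get_dump_page() takes the
mmap lock itself, so a locked variant would have to be introduced), and
error handling plus hole coalescing via dump_skip_to() are elided.

```c
/* Sketch only: pin a batch of pages under the mmap lock, then write
 * them out with the lock dropped, so ->write_iter() can take fs locks
 * without inverting against mmap_lock. Checks dump_interrupted() once
 * per batch, which also covers fully sparse regions. */
#define DUMP_BATCH 32	/* hypothetical batch size */

static int dump_user_range_batched(struct coredump_params *cprm,
				   unsigned long start, unsigned long len)
{
	unsigned long addr = start;

	while (addr < start + len) {
		struct page *pages[DUMP_BATCH];
		int i, n = 0;

		if (dump_interrupted())
			return 0;

		/* One lock round-trip per batch instead of per page,
		 * amortizing the __get_user_pages() setup cost. */
		if (mmap_read_lock_killable(current->mm))
			return 0;
		while (n < DUMP_BATCH && addr < start + len) {
			/* hypothetical lock-held variant of get_dump_page();
			 * NULL means a hole to be skipped */
			pages[n++] = get_dump_page_locked(addr);
			addr += PAGE_SIZE;
		}
		mmap_read_unlock(current->mm);

		/* Write with no mm locks held. */
		for (i = 0; i < n; i++) {
			if (pages[i]) {
				int ok = dump_emit_page(cprm, pages[i]);
				put_page(pages[i]);
				if (!ok)
					return 0;
			} else if (!dump_skip(cprm, PAGE_SIZE)) {
				return 0;
			}
		}
	}
	return 1;
}
```

Whether a fixed batch size or per-VMA batching is the better granularity
is exactly the sort of question the mm folk would need to weigh in on.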
Honza
--
Jan Kara <jack@...e.com>
SUSE Labs, CR