[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <009e7658-1377-cc79-7a42-4dda8fec5af0@virtuozzo.com>
Date: Wed, 10 May 2023 12:36:42 +0800
From: Pavel Tikhomirov <ptikhomirov@...tuozzo.com>
To: Thomas Gleixner <tglx@...utronix.de>,
Frederic Weisbecker <frederic@...nel.org>
Cc: LKML <linux-kernel@...r.kernel.org>,
Anna-Maria Behnsen <anna-maria@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>,
syzbot+5c54bd3eb218bb595aa9@...kaller.appspotmail.com,
Dmitry Vyukov <dvyukov@...gle.com>,
Sebastian Siewior <bigeasy@...utronix.de>,
Michael Kerrisk <mtk.manpages@...il.com>,
Andrei Vagin <avagin@...nvz.org>,
Christian Brauner <brauner@...nel.org>,
Alexander Mikhalitsyn <aleksandr.mikhalitsyn@...onical.com>,
Pavel Emelyanov <xemul@...nvz.org>
Subject: Re: [RFD] posix-timers: CRIU woes
On 10.05.2023 05:42, Thomas Gleixner wrote:
> Hi!
>
> This is a summary of several mails so that the CRIU people have the full
> picture.
>
> A recent syzbot report made me look at the timer ID management, which
> was modified 7 years ago to accomodate CRIU:
>
> 5ed67f05f66c ("posix timers: Allocate timer id per process (v2)")
>
> and introduced that reported issue along with a bogus loop termination
> problem. See
>
> https://lore.kernel.org/lkml/000000000000a3723305f9d98fc3@google.com/
> https://lore.kernel.org/lkml/20230425183312.932345089@linutronix.de
>
> for details.
>
> The intent to make the timer IDs per process is definitely correct, but
> the implementation is beyond suboptimal. I really regret that I did not
> catch this back then when picking those changes up.
>
> The way it works is that each process keeps a 'next_timer_id' which it
> uses to allocate the next timer. That allows CRIU to reconstruct timers
> with the same ID by walking the ID space via
>
> do {
> timer_create(&timer, ...., &id);
> if (id == original_id)
> goto success;
> timer_delete(&timer);
> } while (idspace_not_exhausted());
>
> That works by some definition of works, but it is problematic in two ways:
>
> 1) As the timer ID space is up to INT_MAX, a process which creates and
> deletes timers frequently, can easily move up close to the INT_MAX
> id space over time.
>
> If such a process is checkpointed and restored, then the above loop
> will run for at least an hour to restore a single timer.
>
> And no, this is not only a hypothetical issue. There are legitimate
> scenarios where threads are created and the control thread arms a
> posix CPU timer on them. Such threads can be torn down on a regular
> base due to thread pool consolidations. As CRIU does not happen
> every 5 minutes it's not completely unlikely that such a process
> surives quite some time on a particular host and thereby approaches
> the ID space limit.
>
> Sure we can restrict the ID space to a way smaller number so the
> search wraps around earlier, but what's a sensible limit?
>
> Though restricting the ID space has its own issue vs. backwards
> compability. A process which created a timer on an older kernel with
> an ID larger than the newer kernels ID limit cannot longer be
> restored on that newer kernel.
>
> Aside of that it does not solve the other problem this created:
>
> 2) That change created an user space ABI, which means that the kernel
> side has to stick with this next ID search mechanism forever.
>
> That prevents to get rid of that global lock and hash table by
> sticking an xarray into task::signal which makes a lot of sense in
> terms of cache locality and gets rid of the extra timer list
> management in task::signal. Making this change would be very useful
> to address some other issues in the posix-timer code without
> creating yet more duct tape horrors.
>
> Such a change obviously has to aim for a dense ID space to keep the
> memory overhead low, but that breaks existing CRIU userspace because
> dense ID space and next ID search does not fit together.
>
> Next ID search is obviously creating non-recoverable holes in the
> case that timers are deleted afterwards.
>
> A dense ID space approach can create holes too, but they are
> recoverable and well within the resource limits, because the process
> has to be able to create enough timers in the first place in order
> to release those in the middle.
>
> With the next ID search brute force recovery is not possible on a
> kernel with dense ID as there is no way to create all intermediate
> timers first before reaching the one at the far end due to resource
> limits.
>
> So because of that half thought out user space ABI we are now up the
> regression creek without a paddle, unless CRIU can accomodate to a
> different restore mechanism to lift this restriction from the kernel.
>
> Thoughts?
Maybe we can do something similar to /proc/sys/kernel/ns_last_pid?
Switch to per-(process->signal) idr based approach with idr_set_cursor
to set next id for next posix timer from new sysctl?
>
> Thanks,
>
> tglx
>
>
--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
Powered by blists - more mailing lists