[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cc8aa6a4-a187-f3ad-fec9-05f037a3886d@virtuozzo.com>
Date: Thu, 11 May 2023 11:17:07 +0800
From: Pavel Tikhomirov <ptikhomirov@...tuozzo.com>
To: Andrey Vagin <avagin@...nvz.org>,
Thomas Gleixner <tglx@...utronix.de>
Cc: Frederic Weisbecker <frederic@...nel.org>,
LKML <linux-kernel@...r.kernel.org>,
Anna-Maria Behnsen <anna-maria@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>,
syzbot+5c54bd3eb218bb595aa9@...kaller.appspotmail.com,
Dmitry Vyukov <dvyukov@...gle.com>,
Sebastian Siewior <bigeasy@...utronix.de>,
Michael Kerrisk <mtk.manpages@...il.com>,
Christian Brauner <brauner@...nel.org>,
Alexander Mikhalitsyn <aleksandr.mikhalitsyn@...onical.com>,
Pavel Emelyanov <xemul@...nvz.org>
Subject: Re: [RFD] posix-timers: CRIU woes
On 10.05.2023 16:16, Andrey Vagin wrote:
> On Tue, May 9, 2023 at 2:42 PM Thomas Gleixner <tglx@...utronix.de> wrote:
>>
>> Hi!
>>
>> This is a summary of several mails so that the CRIU people have the full
>> picture.
>>
>> A recent syzbot report made me look at the timer ID management, which
>> was modified 7 years ago to accomodate CRIU:
>>
>> 5ed67f05f66c ("posix timers: Allocate timer id per process (v2)")
>>
>> and introduced that reported issue along with a bogus loop termination
>> problem. See
>>
>> https://lore.kernel.org/lkml/000000000000a3723305f9d98fc3@google.com/
>> https://lore.kernel.org/lkml/20230425183312.932345089@linutronix.de
>>
>> for details.
>>
>> The intent to make the timer IDs per process is definitely correct, but
>> the implementation is beyond suboptimal. I really regret that I did not
>> catch this back then when picking those changes up.
>>
>> The way it works is that each process keeps a 'next_timer_id' which it
>> uses to allocate the next timer. That allows CRIU to reconstruct timers
>> with the same ID by walking the ID space via
>>
>> do {
>> timer_create(&timer, ...., &id);
>> if (id == original_id)
>> goto success;
>> timer_delete(&timer);
>> } while (idspace_not_exhausted());
>>
>> That works by some definition of works, but it is problematic in two ways:
>>
>> 1) As the timer ID space is up to INT_MAX, a process which creates and
>> deletes timers frequently, can easily move up close to the INT_MAX
>> id space over time.
>>
>> If such a process is checkpointed and restored, then the above loop
>> will run for at least an hour to restore a single timer.
>>
>> And no, this is not only a hypothetical issue. There are legitimate
>> scenarios where threads are created and the control thread arms a
>> posix CPU timer on them. Such threads can be torn down on a regular
>> base due to thread pool consolidations. As CRIU does not happen
>> every 5 minutes it's not completely unlikely that such a process
>> surives quite some time on a particular host and thereby approaches
>> the ID space limit.
>>
>> Sure we can restrict the ID space to a way smaller number so the
>> search wraps around earlier, but what's a sensible limit?
>>
>> Though restricting the ID space has its own issue vs. backwards
>> compability. A process which created a timer on an older kernel with
>> an ID larger than the newer kernels ID limit cannot longer be
>> restored on that newer kernel.
>>
>> Aside of that it does not solve the other problem this created:
>>
>> 2) That change created an user space ABI, which means that the kernel
>> side has to stick with this next ID search mechanism forever.
>>
>> That prevents to get rid of that global lock and hash table by
>> sticking an xarray into task::signal which makes a lot of sense in
>> terms of cache locality and gets rid of the extra timer list
>> management in task::signal. Making this change would be very useful
>> to address some other issues in the posix-timer code without
>> creating yet more duct tape horrors.
>>
>> Such a change obviously has to aim for a dense ID space to keep the
>> memory overhead low, but that breaks existing CRIU userspace because
>> dense ID space and next ID search does not fit together.
>>
>> Next ID search is obviously creating non-recoverable holes in the
>> case that timers are deleted afterwards.
>>
>> A dense ID space approach can create holes too, but they are
>> recoverable and well within the resource limits, because the process
>> has to be able to create enough timers in the first place in order
>> to release those in the middle.
>>
>> With the next ID search brute force recovery is not possible on a
>> kernel with dense ID as there is no way to create all intermediate
>> timers first before reaching the one at the far end due to resource
>> limits.
>>
>> So because of that half thought out user space ABI we are now up the
>> regression creek without a paddle, unless CRIU can accomodate to a
>> different restore mechanism to lift this restriction from the kernel.
>>
>> Thoughts?
>
> Hi Thomas,
>
> If you give us a new API to create timers with specified id-s, we will
> figure out how to live with it. It isn't good to ask users to update
> CRIU to work on new kernels, but here are reasons and event improvements
> for CRIU, so I think it's worth it.
I agree, any API to create timers with specified id-s would work for new
CRIU versions.
>
> As for API, we can use one bit of sigevent.sigev_notify to request a
> timer with a specified id.
Yes, I agree, this kind of API is a good option.
>
> Thanks,
> Andrei
--
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.
Powered by blists - more mailing lists