lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cc8aa6a4-a187-f3ad-fec9-05f037a3886d@virtuozzo.com>
Date:   Thu, 11 May 2023 11:17:07 +0800
From:   Pavel Tikhomirov <ptikhomirov@...tuozzo.com>
To:     Andrey Vagin <avagin@...nvz.org>,
        Thomas Gleixner <tglx@...utronix.de>
Cc:     Frederic Weisbecker <frederic@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Anna-Maria Behnsen <anna-maria@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        syzbot+5c54bd3eb218bb595aa9@...kaller.appspotmail.com,
        Dmitry Vyukov <dvyukov@...gle.com>,
        Sebastian Siewior <bigeasy@...utronix.de>,
        Michael Kerrisk <mtk.manpages@...il.com>,
        Christian Brauner <brauner@...nel.org>,
        Alexander Mikhalitsyn <aleksandr.mikhalitsyn@...onical.com>,
        Pavel Emelyanov <xemul@...nvz.org>
Subject: Re: [RFD] posix-timers: CRIU woes



On 10.05.2023 16:16, Andrey Vagin wrote:
> On Tue, May 9, 2023 at 2:42 PM Thomas Gleixner <tglx@...utronix.de> wrote:
>>
>> Hi!
>>
>> This is a summary of several mails so that the CRIU people have the full
>> picture.
>>
>> A recent syzbot report made me look at the timer ID management, which
>> was modified 7 years ago to accomodate CRIU:
>>
>>      5ed67f05f66c ("posix timers: Allocate timer id per process (v2)")
>>
>> and introduced that reported issue along with a bogus loop termination
>> problem. See
>>
>>      https://lore.kernel.org/lkml/000000000000a3723305f9d98fc3@google.com/
>>      https://lore.kernel.org/lkml/20230425183312.932345089@linutronix.de
>>
>> for details.
>>
>> The intent to make the timer IDs per process is definitely correct, but
>> the implementation is beyond suboptimal. I really regret that I did not
>> catch this back then when picking those changes up.
>>
>> The way it works is that each process keeps a 'next_timer_id' which it
>> uses to allocate the next timer. That allows CRIU to reconstruct timers
>> with the same ID by walking the ID space via
>>
>>    do {
>>       timer_create(&timer, ...., &id);
>>       if (id == original_id)
>>          goto success;
>>       timer_delete(&timer);
>>    } while (idspace_not_exhausted());
>>
>> That works by some definition of works, but it is problematic in two ways:
>>
>>   1) As the timer ID space is up to INT_MAX, a process which creates and
>>      deletes timers frequently, can easily move up close to the INT_MAX
>>      id space over time.
>>
>>      If such a process is checkpointed and restored, then the above loop
>>      will run for at least an hour to restore a single timer.
>>
>>      And no, this is not only a hypothetical issue. There are legitimate
>>      scenarios where threads are created and the control thread arms a
>>      posix CPU timer on them. Such threads can be torn down on a regular
>>      base due to thread pool consolidations. As CRIU does not happen
>>      every 5 minutes it's not completely unlikely that such a process
>>      surives quite some time on a particular host and thereby approaches
>>      the ID space limit.
>>
>>      Sure we can restrict the ID space to a way smaller number so the
>>      search wraps around earlier, but what's a sensible limit?
>>
>>      Though restricting the ID space has its own issue vs. backwards
>>      compability. A process which created a timer on an older kernel with
>>      an ID larger than the newer kernels ID limit cannot longer be
>>      restored on that newer kernel.
>>
>>      Aside of that it does not solve the other problem this created:
>>
>>   2) That change created an user space ABI, which means that the kernel
>>      side has to stick with this next ID search mechanism forever.
>>
>>      That prevents to get rid of that global lock and hash table by
>>      sticking an xarray into task::signal which makes a lot of sense in
>>      terms of cache locality and gets rid of the extra timer list
>>      management in task::signal. Making this change would be very useful
>>      to address some other issues in the posix-timer code without
>>      creating yet more duct tape horrors.
>>
>>      Such a change obviously has to aim for a dense ID space to keep the
>>      memory overhead low, but that breaks existing CRIU userspace because
>>      dense ID space and next ID search does not fit together.
>>
>>      Next ID search is obviously creating non-recoverable holes in the
>>      case that timers are deleted afterwards.
>>
>>      A dense ID space approach can create holes too, but they are
>>      recoverable and well within the resource limits, because the process
>>      has to be able to create enough timers in the first place in order
>>      to release those in the middle.
>>
>>      With the next ID search brute force recovery is not possible on a
>>      kernel with dense ID as there is no way to create all intermediate
>>      timers first before reaching the one at the far end due to resource
>>      limits.
>>
>> So because of that half thought out user space ABI we are now up the
>> regression creek without a paddle, unless CRIU can accomodate to a
>> different restore mechanism to lift this restriction from the kernel.
>>
>> Thoughts?
> 
> Hi Thomas,
> 
> If you give us a new API to create timers with specified id-s, we will
> figure out how to live with it. It isn't good to ask users to update
> CRIU to work on new kernels, but here are reasons and event improvements
> for CRIU, so I think it's worth it.

I agree, any API to create timers with specified id-s would work for new 
CRIU versions.

> 
> As for API, we can use one bit of sigevent.sigev_notify to request a
> timer with a specified id.

Yes, I agree, this kind of API is a good option.

> 
> Thanks,
> Andrei



-- 
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ