linux-kernel - Re: [RFD] posix-timers: CRIU woes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <009e7658-1377-cc79-7a42-4dda8fec5af0@virtuozzo.com>
Date:   Wed, 10 May 2023 12:36:42 +0800
From:   Pavel Tikhomirov <ptikhomirov@...tuozzo.com>
To:     Thomas Gleixner <tglx@...utronix.de>,
        Frederic Weisbecker <frederic@...nel.org>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Anna-Maria Behnsen <anna-maria@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        syzbot+5c54bd3eb218bb595aa9@...kaller.appspotmail.com,
        Dmitry Vyukov <dvyukov@...gle.com>,
        Sebastian Siewior <bigeasy@...utronix.de>,
        Michael Kerrisk <mtk.manpages@...il.com>,
        Andrei Vagin <avagin@...nvz.org>,
        Christian Brauner <brauner@...nel.org>,
        Alexander Mikhalitsyn <aleksandr.mikhalitsyn@...onical.com>,
        Pavel Emelyanov <xemul@...nvz.org>
Subject: Re: [RFD] posix-timers: CRIU woes



On 10.05.2023 05:42, Thomas Gleixner wrote:
> Hi!
> 
> This is a summary of several mails so that the CRIU people have the full
> picture.
> 
> A recent syzbot report made me look at the timer ID management, which
> was modified 7 years ago to accomodate CRIU:
> 
>      5ed67f05f66c ("posix timers: Allocate timer id per process (v2)")
> 
> and introduced that reported issue along with a bogus loop termination
> problem. See
> 
>      https://lore.kernel.org/lkml/000000000000a3723305f9d98fc3@google.com/
>      https://lore.kernel.org/lkml/20230425183312.932345089@linutronix.de
> 
> for details.
> 
> The intent to make the timer IDs per process is definitely correct, but
> the implementation is beyond suboptimal. I really regret that I did not
> catch this back then when picking those changes up.
> 
> The way it works is that each process keeps a 'next_timer_id' which it
> uses to allocate the next timer. That allows CRIU to reconstruct timers
> with the same ID by walking the ID space via
> 
>    do {
>       timer_create(&timer, ...., &id);
>       if (id == original_id)
>          goto success;
>       timer_delete(&timer);
>    } while (idspace_not_exhausted());
> 
> That works by some definition of works, but it is problematic in two ways:
> 
>   1) As the timer ID space is up to INT_MAX, a process which creates and
>      deletes timers frequently, can easily move up close to the INT_MAX
>      id space over time.
> 
>      If such a process is checkpointed and restored, then the above loop
>      will run for at least an hour to restore a single timer.
> 
>      And no, this is not only a hypothetical issue. There are legitimate
>      scenarios where threads are created and the control thread arms a
>      posix CPU timer on them. Such threads can be torn down on a regular
>      base due to thread pool consolidations. As CRIU does not happen
>      every 5 minutes it's not completely unlikely that such a process
>      surives quite some time on a particular host and thereby approaches
>      the ID space limit.
> 
>      Sure we can restrict the ID space to a way smaller number so the
>      search wraps around earlier, but what's a sensible limit?
> 
>      Though restricting the ID space has its own issue vs. backwards
>      compability. A process which created a timer on an older kernel with
>      an ID larger than the newer kernels ID limit cannot longer be
>      restored on that newer kernel.
> 
>      Aside of that it does not solve the other problem this created:
> 
>   2) That change created an user space ABI, which means that the kernel
>      side has to stick with this next ID search mechanism forever.
> 
>      That prevents to get rid of that global lock and hash table by
>      sticking an xarray into task::signal which makes a lot of sense in
>      terms of cache locality and gets rid of the extra timer list
>      management in task::signal. Making this change would be very useful
>      to address some other issues in the posix-timer code without
>      creating yet more duct tape horrors.
> 
>      Such a change obviously has to aim for a dense ID space to keep the
>      memory overhead low, but that breaks existing CRIU userspace because
>      dense ID space and next ID search does not fit together.
> 
>      Next ID search is obviously creating non-recoverable holes in the
>      case that timers are deleted afterwards.
> 
>      A dense ID space approach can create holes too, but they are
>      recoverable and well within the resource limits, because the process
>      has to be able to create enough timers in the first place in order
>      to release those in the middle.
> 
>      With the next ID search brute force recovery is not possible on a
>      kernel with dense ID as there is no way to create all intermediate
>      timers first before reaching the one at the far end due to resource
>      limits.
> 
> So because of that half thought out user space ABI we are now up the
> regression creek without a paddle, unless CRIU can accomodate to a
> different restore mechanism to lift this restriction from the kernel.
> 
> Thoughts?

Maybe we can do something similar to /proc/sys/kernel/ns_last_pid? 
Switch to per-(process->signal) idr based approach with idr_set_cursor 
to set next id for next posix timer from new sysctl?

> 
> Thanks,
> 
>          tglx
> 
> 

-- 
Best regards, Tikhomirov Pavel
Senior Software Developer, Virtuozzo.