linux-kernel - Re: [RFD] reboot / shutdown of a container

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110113225059.6f46cd44@neptune.home>
Date:	Thu, 13 Jan 2011 22:50:59 +0100
From:	Bruno Prémont <bonbons@...ux-vserver.org>
To:	Daniel Lezcano <daniel.lezcano@...e.fr>
Cc:	Linux Containers <containers@...ts.linux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [RFD] reboot / shutdown of a container

On Thu, 13 January 2011 Daniel Lezcano <daniel.lezcano@...e.fr> wrote:

> On 01/13/2011 09:09 PM, Bruno Prémont wrote:
> > On Thu, 13 January 2011 Daniel Lezcano<daniel.lezcano@...e.fr>  wrote:
> >> in the container implementation, we are facing the problem of a process
> >> calling the sys_reboot syscall which of course makes the host to
> >> poweroff/reboot.
> >>
> >> If we drop the cap_sys_reboot capability, sys_reboot fails and the
> >> container reach a shutdown state but the init process stay there, hence
> >> the container becomes stuck waiting indefinitely the process '1' to exit.
> >>
> >> The current implementation to make the shutdown / reboot of the
> >> container to work is we watch, from a process outside of the container,
> >> the<rootfs>/var/run/utmp file and check the runlevel each time the file
> >> changes. When the 'reboot' or 'shutdown' level is detected, we wait for
> >> a single remaining in the container and then we kill it.
> >>
> >> That works but this is not efficient in case of a large number of
> >> containers as we will have to watch a lot of utmp files. In addition,
> >> the /var/run directory must *not* mounted as tmpfs in the distro.
> >> Unfortunately, it is the default setup on most of the distros and tends
> >> to generalize. That implies, the rootfs init's scripts must be modified
> >> for the container when we put in place its rootfs and as /var/run is
> >> supposed to be a tmpfs, most of the applications do not cleanup the
> >> directory, so we need to add extra services to wipeout the files.
> >>
> >> More problems arise when we do an upgrade of the distro inside the
> >> container, because all the setup we made at creation time will be lost.
> >> The upgrade overwrite the scripts, the fstab and so on.
> >>
> >> We did what was possible to solve the problem from userspace but we
> >> reach always a limit because there are different implementations of the
> >> 'init' process and the init's scripts differ from a distro to another
> >> and the same with the versions.
> >>
> >> We think this problem can only be solved from the kernel.
> >>
> >> The idea was to send a signal SIGPWR to the parent of the pid '1' of the
> >> pid namespace when the sys_reboot is called. Of course that won't occur
> >> for the init pid namespace.
> > Wouldn't sending SIGKILL to the pid '1' process of the originating PID
> > namespace be sufficient (that would trigger a SIGCHLD for the parent
> > process in the outer PID namespace.
> 
> This is already the case. The question is : when do we send this signal ?
> We have to wait for the container system shutdown before killing it.

I meant that sys_reboot() would kill the namespace's init if it's not
called from boot namespace.

See below

> > (as far as I remember the PID namespace is killed when its 'init' exits,
> > if this is not the case all other processes in the given namespace would
> > have to be killed as well)
> 
> Yes, absolutely but this is not the point, reaping the container is not 
> a problem.
> 
> What we are trying to achieve is to shutdown properly the container from 
> inside (from outside will be possible too with the setns syscall).
> 
> Assuming the process '1234' creates a new process in a new namespace set 
> and wait for it.
> 
> The new process '1' will exec /sbin/init and the system will boot up. 
> But, when the system is shutdown or rebooted, after the down scripts are 
> executed the kill -15 -1 will be invoked, killing all the processes 
> expect the process '1' and the caller. This one will then call 
> 'sys_reboot' and exit. Hence we still have the init process idle and its 
> parent '1234' waiting for it to die.

This call to sys_reboot() would kill "new process '1'" instead of trying to
operate on the HW box.
This also has the advantage that a container would not require an informed
parent "monitoring" it from outside (though it would not be restarted even if
requested without such informed outside parent).

> If we are able to receive the information in the process '1234' : "the 
> sys_reboot was called in the child pid namespace", we can take then kill 
> our child pid.  If this information is raised via a signal sent by the 
> kernel with the proper information in the siginfo_t (eg. si_code 
> contains "LINUX_REBOOT_CMD_RESTART", "LINUX_REBOOT_CMD_HALT", ... ), the 
> solution will be generic for all the shutdown/reboot of any kind of 
> container and init version.

Could this be passed for a SIGCHLD? (when namespace is reaped, and received
by 1234 from above example assuming sys_reboot() kills the "new process '1'")

Looks like yes, but with the need to define new values for si_code (reusing
LINUX_REBOOT_CMD_* would certainly clash, no matter which signal is choosen).

> > Only issue is how to differentiate the various reboot() modes (restart,
> > power-off/halt) from outside, though that one also exists with the SIGPWR
> > signal.

Bruno
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/