linux-kernel - Re: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn

Message-ID: <CAEJqkggwbNP3Q+7AWr4i64CvAdCp15jVLOtUsnMo6rLk3ExTkA@mail.gmail.com>
Date:   Sat, 22 Dec 2018 21:57:42 +0100
From:   Gabriel C <nix.or.die@...il.com>
To:     Ellie Reeves <ellierevves@...il.com>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Al Viro <viro@...iv.linux.org.uk>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        Seth Forshee <seth.forshee@...onical.com>
Subject: Re: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on
 userns mounts, breaking systemd-nspawn

Added some people to CC who might want to see this.

On Sat, 22 Dec 2018 at 19:14, Ellie Reeves <ellierevves@...il.com> wrote:
>
> Hi,
> first off, allow me to express that this is my first time ever writing
> on such a mailing list, and that if something is unclear or you would
> need more information, just let me know.
> I am writing to this list in the hope of seeing this change reverted. The
> Linux kernel has always said it would avoid breaking userspace as much as
> possible, and yet this is what happened. I was hence very much surprised
> when my perfectly working containers on systemd-nspawn, which makes use
> of userns by default, stopped working from one day to the next, until I
> identified the problem as being kernel >= 4.18. This container is in
> production, hence the annoyance. From one day to the next the
> container started failing with strange problems:
>
> * nginx, dovecot, postgresql, and postfix complained about getting
> permission denied on /dev/null, even though it looked perfectly normal
> to me, with the correct permissions
> * /var was also acting very strangely, producing a lot of permission
> denied or operation not supported messages
> * I could not delete a file in /var that my user had the right to create,
> write to and read; I needed root
>
> Here is the pull request that was made to systemd, along with a small
> amount of talk around the issue:
>
> https://github.com/systemd/systemd/pull/9483
>
> It was ultimately decided among the systemd folks to bail out of the
> issue, as shown in the news entry for systemd 240:
>
>          * KERNEL API BREAKAGE: Linux kernel 4.18 changed behaviour regarding
>            mknod() handling in user namespaces. Previously mknod() would always
>            fail with EPERM in user namespaces. Since 4.18 mknod() will succeed
>            but device nodes generated that way cannot be opened, and attempts to
>            open them result in EPERM. This breaks the "graceful fallback" logic
>            in systemd's PrivateDevices= sand-boxing option. This option is
>            implemented defensively, so that when systemd detects it runs in a
>            restricted environment (such as a user namespace, or an environment
>            where mknod() is blocked through seccomp or absence of CAP_SYS_MKNOD)
>            where device nodes cannot be created the effect of PrivateDevices= is
>            bypassed (following the logic that 2nd-level sand-boxing is not
>            essential if the system systemd runs in is itself already sand-boxed
>            as a whole). This logic breaks with 4.18 in container managers where
>            user namespacing is used: suddenly PrivateDevices= succeeds setting
>            up a private /dev/ file system containing device nodes, but when
>            these are opened they don't work.
>
>            At this point it is recommended that container managers utilizing
>            user namespaces that intend to run systemd in the payload explicitly
>            block mknod() with seccomp or similar, so that the graceful fallback
>            logic works again.
>
>            We are very sorry for the breakage and the requirement to change
>            container configurations for newer kernels. It's purely caused by an
>            incompatible kernel change. The relevant kernel developers have been
>            notified about this userspace breakage quickly, but they chose to
>            ignore it.
>
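The behaviour described in the NEWS entry above can be checked with a small
probe. This is only a sketch, assuming Python 3 on Linux; the function name
probe_device_nodes and the three result strings are made up for illustration
and come from neither systemd nor the kernel. It treats any OSError from
open() as a denial, because the NEWS entry mentions EPERM while the reports
above say "permission denied", so no particular errno is assumed.

```python
import os
import stat
import tempfile


def probe_device_nodes(directory):
    """Classify how mknod() and open() behave for a /dev/null-like node.

    Returns one of (illustrative labels, not kernel or systemd terms):
      "mknod-denied"  mknod() itself failed
      "open-denied"   mknod() succeeded but the node could not be opened
      "usable"        the node was created and opened
    """
    path = os.path.join(directory, "null-probe")
    try:
        # Character device 1:3 is /dev/null on Linux.
        os.mknod(path, 0o600 | stat.S_IFCHR, os.makedev(1, 3))
    except OSError:
        # A fallback that only checks mknod() would stop here.
        return "mknod-denied"
    try:
        try:
            fd = os.open(path, os.O_WRONLY)
        except OSError:
            # Node exists but is unusable: the case reported in this thread.
            return "open-denied"
        os.close(fd)
        return "usable"
    finally:
        # Always remove the probe node again.
        os.unlink(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(probe_device_nodes(tmp))
```

A fallback that distinguishes "open-denied" from "usable", instead of
relying on mknod() failing, would not depend on the pre-4.18 behaviour.
Whether that fits systemd's PrivateDevices= logic is not something this
sketch settles.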
> Here's an email that was sent to lkml about the subject:
>
> https://lkml.org/lkml/2018/7/5/742
>
> I also link this one, quoting its last part:
>
> https://lkml.org/lkml/2018/7/5/701
>
> It has never been the case that mknod on a device node will guarantee
> that you even can open the device node.  The applications that regress
> are broken.  It doesn't mean we shouldn't be bug compatible, but we darn
> well should document very clearly the bugs we are being bug compatible with.
>
> I'm of the opinion that it is a kernel bug, and I quote someone from the
> systemd IRC channel:
>
> ewb said applications were broken. But the rule is, if userspace breaks,
> it's a bug. The kernel *has* to revert it. And honestly, this change
> doesn't make much sense. You can set nodev yourself, but then you know
> mknod will not allow you to open the object. Here, the kernel does it
> without your knowledge.
>
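For comparison, a nodev option that was requested explicitly at mount time is
visible from userspace. A small sketch, assuming Python 3 on Linux;
is_mounted_nodev is a hypothetical helper name. It only reports the mount
flag returned by statvfs(). Whether the implicit SB_I_NODEV superblock flag
discussed in this thread shows up there is not established here.

```python
import os


def is_mounted_nodev(path):
    """Return True if the filesystem containing path reports the nodev flag."""
    flags = os.statvfs(path).f_flag
    return bool(flags & os.ST_NODEV)


if __name__ == "__main__":
    for mount_point in ("/", "/dev", "/tmp"):
        if os.path.exists(mount_point):
            print(mount_point, is_mounted_nodev(mount_point))
```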
> Also, it seems that if this change is reverted, things that were fixed
> to work around the issue this breakage caused will not be broken again;
> they should simply go back to their previous way of working. I
> understand there may be a security reason why this change was made in the
> first place, but it is not so big a problem, is it? I can mknod
> arbitrary devices in userns and open them as userns root. But my point
> is, several things broke. My *working* stuff was broken from one day to
> the next.
>
> I am not trying to pick a fight. I want to understand the reasoning
> behind this change in the first place, and I'm simply making an attempt
> at getting it reverted, because it is true that I don't much fancy
> blocking the mknod() syscall in every template unit on every machine we
> administer here, and staying on kernel < 4.18 is not a good
> solution either.
>
> I would also like to be personally CC'ed on any comments or answers posted
> to this mailing list in response to this message.
>
> Thanks