linux-kernel - [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn

Message-ID: <e62d4c3a-b89b-2380-7b39-66733b389145@gmail.com>
Date:   Sat, 22 Dec 2018 06:39:05 -0500
From:   Ellie Reeves <ellierevves@...il.com>
To:     linux-kernel@...r.kernel.org
Subject: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns
 mounts, breaking systemd-nspawn

Hi,
first off, this is my first time ever writing to a mailing list like
this one, so if something is unclear or you need more information, just
let me know.
I am writing to this list in the hope of seeing this change reverted.
The Linux kernel has always said it would avoid breaking userspace as
much as possible, and yet that is what happened. I was hence very
surprised when my perfectly working containers on systemd-nspawn, which
makes use of userns by default, stopped working from one day to the
next, until I identified the problem as being kernel >= 4.18. This
container is in production, hence the annoyance. From one day to the
next the container started failing with strange problems:

* nginx, dovecot, postgresql, and postfix complained about getting 
permission denied on /dev/null even though it appeared perfectly normal 
to me, the correct permissions, all that
* /var was also acting very strangely, getting a lot of permission 
denied or operation not supported messages.
* I could not delete a file that my user had the right to create, write 
to and read in /var, I needed root
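
To make the first symptom concrete, here is a minimal Python sketch of
the check I mean: it compares what stat() reports for a device node with
what open() actually does. The function name describe_device_node is
made up for this illustration, and I am assuming the failure shows up as
EACCES or EPERM; treat it as a diagnostic aid, not as a statement of
what the kernel does internally.

```python
import errno
import os
import stat


def describe_device_node(path="/dev/null"):
    """Return (is_char_device, open_error) for a device node.

    is_char_device -- True if stat() reports a character device,
                      i.e. the node "looks normal" in ls -l.
    open_error     -- None if open() succeeded, otherwise the symbolic
                      errno name (for example "EACCES" or "EPERM").
    """
    st = os.stat(path)
    is_char_device = stat.S_ISCHR(st.st_mode)
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        # Report the symbolic name so the result is easy to read in logs.
        return is_char_device, errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return is_char_device, None


if __name__ == "__main__":
    print(describe_device_node("/dev/null"))
```

A result where the first element is True and the second is not None is
the "looks fine but cannot be opened" situation described above.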

Here is the pull request that was made to systemd, along with a small 
amount of talk around the issue:

https://github.com/systemd/systemd/pull/9483

It was ultimately decided among the systemd folks to bail out of the 
issue, as shown in the news entry for systemd 240:

         * KERNEL API BREAKAGE: Linux kernel 4.18 changed behaviour regarding
           mknod() handling in user namespaces. Previously mknod() would always
           fail with EPERM in user namespaces. Since 4.18 mknod() will succeed
           but device nodes generated that way cannot be opened, and attempts to
           open them result in EPERM. This breaks the "graceful fallback" logic
           in systemd's PrivateDevices= sand-boxing option. This option is
           implemented defensively, so that when systemd detects it runs in a
           restricted environment (such as a user namespace, or an environment
           where mknod() is blocked through seccomp or absence of CAP_SYS_MKNOD)
           where device nodes cannot be created the effect of PrivateDevices= is
           bypassed (following the logic that 2nd-level sand-boxing is not
           essential if the system systemd runs in is itself already sand-boxed
           as a whole). This logic breaks with 4.18 in container managers where
           user namespacing is used: suddenly PrivateDevices= succeeds setting
           up a private /dev/ file system containing device nodes, but when
           these are opened they don't work.

           At this point it is recommended that container managers utilizing
           user namespaces that intend to run systemd in the payload explicitly
           block mknod() with seccomp or similar, so that the graceful fallback
           logic works again.

           We are very sorry for the breakage and the requirement to change
           container configurations for newer kernels. It's purely caused by an
           incompatible kernel change. The relevant kernel developers have been
           notified about this userspace breakage quickly, but they chose to
           ignore it.
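
The difference between the two behaviours described in that entry can be
probed from inside a container. The following is only a sketch in
Python, under my own assumptions: the function name probe_mknod and the
three result strings are invented for this example, the node is created
in a temporary directory, and both EPERM and EACCES are treated as "open
refused" since I am not certain which one a given kernel returns. Note
that a temporary directory on a file system mounted with nodev will also
report "open-refused", even outside a user namespace.

```python
import errno
import os
import stat
import tempfile


def probe_mknod(directory=None):
    """Classify how mknod() on a device node behaves in this environment.

    Returns one of:
      "mknod-refused" -- mknod() itself failed (the old userns behaviour,
                         or mknod blocked by seccomp / missing capability)
      "open-refused"  -- mknod() succeeded but the node cannot be opened
                         (the behaviour reported for 4.18+ with userns)
      "usable"        -- the node was created and could be opened
    """
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        node = os.path.join(tmp, "null")
        try:
            # Character device 1:3 is /dev/null on Linux.
            os.mknod(node, stat.S_IFCHR | 0o600, os.makedev(1, 3))
        except OSError:
            return "mknod-refused"
        try:
            fd = os.open(node, os.O_RDWR)
        except OSError as exc:
            if exc.errno in (errno.EPERM, errno.EACCES):
                return "open-refused"
            raise  # Anything else is unexpected; do not hide it.
        os.close(fd)
        return "usable"


if __name__ == "__main__":
    print(probe_mknod())
```

A fallback that only checks whether mknod() fails, as the entry above
describes for PrivateDevices=, cannot tell "open-refused" apart from
"usable", which is where the breakage comes from.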

Here's an email that was sent to lkml about the subject:

https://lkml.org/lkml/2018/7/5/742

I am also linking this message, and quoting the end of it:

https://lkml.org/lkml/2018/7/5/701

It has never been the case that mknod on a device node will guarantee 
that you even can open the device node.  The applications that regress 
are broken.  It doesn't mean we shouldn't be bug compatible, but we darn 
well should document very clearly the bugs we are being bug compatible with.

I'm of the opinion that it is a kernel bug, and I quote someone from the
systemd IRC channel:

ewb said applications were broken. But the rule is, if userspace breaks,
it's a bug. The kernel *has* to revert it. And honestly, this change
doesn't make much sense. You can set nodev yourself but then you know
mknod will not allow you to open the object. Here, the kernel does it
without your knowledge.

Also, it seems that if this change is reverted, things that were fixed
to work around the issue this breakage caused will not be broken again;
they should simply go back to their previous way of working. I
understand there may be security reasons why this change was made in the
first place, but it is not so big a problem, is it? I can mknod
arbitrary devices in userns and open them as userns root. But my point
is, several things broke. My *working* stuff was broken from one day to
the next.

I am not trying to pick a fight. I want to understand the reasoning
behind this change in the first place, and I'm simply making an attempt
at getting it reverted, because it is true that I don't much fancy
blocking the mknod() syscall in every template unit on every machine we
administer here, and staying on kernel < 4.18 is not a good solution
either.

I would also like to be personally CC'ed on any comments or answers
posted to this mailing list in response to this message.

Thanks