netdev - Re: userns, netns, and quick physical memory consumption by unprivileged user

Message-ID: <m3io0riucg.fsf@gmail.com>
Date:	Sat, 12 Mar 2016 16:35:55 +0300
From:	yumkam@...il.com (Yuriy M. Kaminskiy)
To:	netdev@...r.kernel.org
Cc:	linux-kernel@...r.kernel.org, containers@...ts.osdl.org
Subject: Re: userns, netns, and quick physical memory consumption by unprivileged user

On 03/11/16 18:34 , Florian Westphal wrote:
> Yuriy M. Kaminskiy <yumkam@...il.com> wrote:
>> BTW, all those hash/conntrack/etc default sizes were calculated from
>> physical memory size on the assumption there will be only *one* instance
>> of those tables. Obviously, the introduction of network namespaces (and
>> especially unprivileged user-ns) threw this assumption out the window
>> (and here comes that "falling back to vmalloc" message again; in pre-netns
>> world, those tables were allocated *once* on early system startup, with
>> typically plenty of free and unfragmented memory).
>
> No idea how to fix this except by removing conntrack support in net
> namespaces completely.

Well, it is not *only* conntrack. Conntrack eats big chunks at once, but
there are other things that eat kernel memory too: *any* iptables
rules, 'ip address', 'ip link' (at the very least, 'type dummy' and
/dev/net/tun are available inside unprivileged userns/netns), 'ip
tunnel', 'ip rule', 'ip route', etc.

Just add *a lot* of them (and over several netns to avoid potential
O(n^2) behaviour on adding), and it will be painful, regardless of
memcg/ulimits/free swap/etc.

E.g. something like

  unshare -rn sh -c 'setsid sleep inf & for i in $(seq 1 1024); do
  ip li add d$i type dummy; ip li set d$i up; done'

eats ~40M of kernel memory per instance (unswappable and not curbed by
memcg); that's way more than the conntrack hash tables alone.
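
A rough way to check such a figure is to compare kernel-side counters
from /proc/meminfo before and after running the one-liner. The sketch
below is only an approximation: the helper name kmem_kb is made up for
illustration, and the choice of the Slab and Percpu fields as a proxy
for netdev memory is an assumption (Percpu is absent on older kernels
and then simply counts as 0).

```shell
#!/bin/sh
# kmem_kb [FILE]: sum the Slab and Percpu fields (in kB) of a
# meminfo-style file. Defaults to /proc/meminfo.
# ASSUMPTION: these two fields are only a rough proxy for the kernel
# memory taken by network devices; they do not cover everything.
kmem_kb() {
    awk '$1 == "Slab:" || $1 == "Percpu:" { sum += $2 }
         END { print sum + 0 }' "${1:-/proc/meminfo}"
}

# Print the current value when /proc/meminfo is available.
if [ -r /proc/meminfo ]; then
    echo "Slab+Percpu: $(kmem_kb) kB"
fi
```

Taking one reading before the unshare command and one after it, then
subtracting, gives a per-instance estimate; other activity on the
machine will add noise to the difference.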

At the *very minimum*, all of that must be limited by memcg (it
currently is not!). And maybe by ulimits too (of the process that
created the userns? Well, the natural choice, RLIMIT_MEMLOCK, practically
forbids netns: no chance it will fit in 64k).
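
To put the 64k remark in numbers, here is a small arithmetic sketch.
The helper name times_over_limit is hypothetical; the ~40M input is the
per-namespace figure quoted earlier in this message, not a new
measurement.

```shell
#!/bin/sh
# times_over_limit FOOTPRINT_KIB LIMIT_KIB
# How many times a kernel memory footprint exceeds a memlock-style
# limit. Both arguments are in KiB (integer division).
times_over_limit() {
    footprint_kib=$1
    limit_kib=$2
    echo $(( footprint_kib / limit_kib ))
}

# ~40M per namespace versus a 64k limit:
times_over_limit $((40 * 1024)) 64    # prints 640
```

For comparison, `ulimit -l` prints the locked-memory limit of the
current shell in KiB (or "unlimited").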

Specifically for conntrack, separate limits on hash size/entries for
non-initns wouldn't hurt, but that's more "flexibility to avoid
senseless waste of memory" (in case a specific container won't use many
connections or won't use conntrack at all) than protection against abuse.
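
For reference, the current hash size and entry limit can be read from
the nf_conntrack_buckets and nf_conntrack_max sysctls. The sketch below
also estimates the size of the bucket array; it assumes each bucket is
a single pointer-sized list head (8 bytes on 64-bit), and the helper
name ct_hash_bytes is made up for illustration. It does not account for
the conntrack entries themselves.

```shell
#!/bin/sh
# ct_hash_bytes BUCKETS [PTR_SIZE]
# Estimated size in bytes of a conntrack bucket array.
# ASSUMPTION: one pointer-sized list head per bucket; PTR_SIZE
# defaults to 8 (64-bit).
ct_hash_bytes() {
    buckets=$1
    ptr_size=${2:-8}
    echo $(( buckets * ptr_size ))
}

# Report the live values when the conntrack sysctls are present.
dir=/proc/sys/net/netfilter
if [ -r "$dir/nf_conntrack_buckets" ] && [ -r "$dir/nf_conntrack_max" ]; then
    b=$(cat "$dir/nf_conntrack_buckets")
    echo "buckets=$b (~$(ct_hash_bytes "$b") bytes)"
    echo "max entries=$(cat "$dir/nf_conntrack_max")"
fi
```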

By the way, there is unrestrained kernel memory consumption in *other*
namespace types too. E.g., let's look at the mount namespace: it looks
like tmpfs contents are at least curbed by memcg [but *not* curbed by
ulimits!], however the *mounts themselves* are not; e.g.

  unshare -rm sh -c 'while :; do seq -f /tmp/foo%g 1 1024 |
  while read d; do mkdir -p $d; mount --bind $d $d; done; done'

is a bit slower [try with several instances?], but the end result
would be the same.
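
One way to watch the mount count grow from inside such a namespace is to
count the lines of /proc/self/mountinfo, which has one line per mount.
This is a minimal sketch; the helper name count_mounts is hypothetical.

```shell
#!/bin/sh
# count_mounts [FILE]: number of mount entries in a mountinfo-style
# file (one line per mount). Defaults to /proc/self/mountinfo.
count_mounts() {
    wc -l < "${1:-/proc/self/mountinfo}" | tr -d ' '
}

# Print the count for the current mount namespace when available.
if [ -r /proc/self/mountinfo ]; then
    echo "mounts: $(count_mounts)"
fi
```

Run inside the unshared shell, the count shows only the mounts of that
namespace; it says nothing about how much kernel memory each mount takes.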

> I'd disallow all write accesses to skb->nfct (NAT, CONNMARK,
> CONNSECMARK, ...) and then no longer clear skb->nfct when forwarding
> packet from init_ns to container.
>
> Containers could then still test conntrack as seen from init namespace pov
> in PREROUTING/FORWARD/INPUT (but not OUTPUT, obviously).
>
> [ OUTPUT *might* be doable as well by allowing NEW creation in output
>   but skipping nat and deferring the confirmation/commit of the new
>   entry to the table until skb leaves initns ]
>
> We could key conntrack entries to initns conntrack table
> instead of adding one new table per netns, but seems like this only
> replaces one problem with a new one (filling/blocking initns table from
> another netns).
>
> Maybe we could go with a compromise and skip/disallow conntrack in
> unpriv userns only?