linux-kernel - Re: [PATCH] user namespaces: bump idmap limits

Date:   Wed, 4 Oct 2017 16:45:25 +0200
From:   Christian Brauner <christian.brauner@...onical.com>
To:     "Serge E. Hallyn" <serge@...lyn.com>
Cc:     Christian Brauner <christian.brauner@...ntu.com>,
        ebiederm@...ssion.com, stgraber@...ntu.com,
        linux-kernel@...r.kernel.org, tycho@...ho.ws
Subject: Re: [PATCH] user namespaces: bump idmap limits

On Wed, Oct 04, 2017 at 09:28:57AM -0500, Serge Hallyn wrote:
> Quoting Christian Brauner (christian.brauner@...ntu.com):
> > We have quite a few use cases where users already run into the current limit for
> > {g,u}id mappings. Consider a user requesting us to map everything but 999 and
> > 1001 for a given range of 1000000000 with a sub{g,u}id layout of:
> > 
> > some-user:100000:1000000000
> > some-user:999:1
> > some-user:1000:1
> > some-user:1001:1
> > some-user:1002:1
> > 
> > This translates to:
> > 
> > MAPPING-TYPE CONTAINER HOST        RANGE
> > uid           999          999         1
> > uid          1001         1001         1
> > uid             0      1000000       999
> > uid          1000      1001000         1
> > uid          1002      1001002 999998998
> > 
> > gid           999          999         1
> > gid          1001         1001         1
> > gid             0      1000000       999
> > gid          1000      1001000         1
> > gid          1002      1001002 999998998
> > 
> > which is already the current limit.
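
To make the table above concrete, here is a minimal sketch of how a container id is translated through a list of extents by a linear scan. The field names first, lower_first and count follow the kernel's struct uid_gid_extent; the type name struct extent, example_map, ID_UNMAPPED and map_id_down_sketch() are hypothetical names chosen for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One idmap extent: ids [first, first + count) inside the namespace map to
 * [lower_first, lower_first + count) outside of it. */
struct extent {
	uint32_t first;       /* first id inside the container */
	uint32_t lower_first; /* corresponding first id on the host */
	uint32_t count;       /* length of the range */
};

/* The five uid mappings from the table above (illustrative copy). */
const struct extent example_map[] = {
	{  999,     999,         1 },
	{ 1001,    1001,         1 },
	{    0, 1000000,       999 },
	{ 1000, 1001000,         1 },
	{ 1002, 1001002, 999998998 },
};

/* Hypothetical sentinel for "no mapping found". */
#define ID_UNMAPPED UINT32_MAX

/* Linear scan over n extents: return the host id for a container id,
 * or ID_UNMAPPED if no extent covers it. */
uint32_t map_id_down_sketch(const struct extent *map, size_t n, uint32_t id)
{
	assert(map != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		uint32_t first = map[i].first;
		uint32_t last = first + map[i].count - 1;

		if (id >= first && id <= last)
			return id - first + map[i].lower_first;
	}
	return ID_UNMAPPED;
}
```

With this table, container uid 1000 resolves to host uid 1001000, while 999 and 1001 map to themselves. Every lookup walks the extents in order, which is why the cost grows with the number of mappings.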
> > 
> > Design Notes:
> > As discussed at LPC, simply bumping the limit is not going to work since this
> > would mean that struct uid_gid_map won't fit into a single cache line anymore,
> > thereby regressing performance for the base cases. The same problem seems to
> > arise when using a single pointer. So the idea is to keep the base cases (0-3
> > mappings) directly in struct uid_gid_map so they fit into a single cache line
> > of 64 bytes. For the two removed mappings we place three pointers in the
> > struct that mimic the behavior of traditional filesystems:
> > 1. a direct pointer to a struct uid_gid_extent holding 5 mappings (60 bytes)
> > 2. an indirect pointer to a 64-byte array of direct pointers, each to a struct
> >    uid_gid_extent holding 5 mappings (60 bytes each)
> > 3. a double indirect pointer to a 64-byte array of indirect pointers, each
> >    to a 64-byte array of direct pointers (and so on)
> > Fixing a pointer size of 8 bytes, this gives us 3 + 5 + (8 * 5) + (8 * (8 * 5)) =
> > 368 mappings, which should really be enough. The idea of this approach is to
> > always have each extent of idmaps (struct uid_gid_extent) be 60 bytes (5 * (4 +
> > 4 + 4)) and thus 4 bytes smaller than the size of a single cache line. This
> > should only incur a linear performance impact caused by iterating through
> > the idmappings in a for-loop. Note that the base cases shouldn't see any
> > performance degradation, which is the most important part.
> 
> Sounds like a good plan.
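
The size and capacity arithmetic in the Design Notes can be checked with a short sketch. It assumes 8-byte pointers and 64-byte pointer arrays as stated in the notes; the names struct extent, struct extent_block, the enum constants and max_mappings_sketch() are hypothetical and only mirror the description, not the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One mapping: three 32-bit fields, 4 + 4 + 4 = 12 bytes. */
struct extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

enum {
	INLINE_EXTENTS     = 3,                /* base cases kept in the struct */
	EXTENTS_PER_BLOCK  = 5,                /* 5 * 12 = 60 bytes per block */
	POINTER_SIZE       = 8,                /* assumed pointer size in bytes */
	POINTERS_PER_BLOCK = 64 / POINTER_SIZE /* 8 pointers per 64-byte array */
};

/* A block of five extents, meant to stay below one 64-byte cache line. */
struct extent_block {
	struct extent extent[EXTENTS_PER_BLOCK];
};

static_assert(sizeof(struct extent_block) == 60,
	      "five 12-byte extents should occupy 60 bytes");

/* Total mappings: inline + direct + indirect + double indirect. */
size_t max_mappings_sketch(void)
{
	size_t direct = EXTENTS_PER_BLOCK;
	size_t indirect = (size_t)POINTERS_PER_BLOCK * EXTENTS_PER_BLOCK;
	size_t double_indirect = (size_t)POINTERS_PER_BLOCK * indirect;

	return INLINE_EXTENTS + direct + indirect + double_indirect;
}
```

The terms are 3 + 5 + 40 + 320, which matches the 368 mappings quoted in the Design Notes.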
> 
> > Performance Testing:
> > When Eric introduced the extent-based struct uid_gid_map approach he measured
> > the performance impact of his idmap changes:
> > 
> > > My benchmark consisted of going to single user mode where nothing else was
> > > running. On an ext4 filesystem opening 1,000,000 files and looping through all
> > > of the files 1000 times and calling fstat on the individual files.  This was
> > > to ensure I was benchmarking stat times where the inodes were in the kernel's
> > > cache, but the inode values were not in the processor's cache. My results:
> > 
> > > v3.4-rc1:         ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
> > > v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
> > > v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
> > 
> > I used an identical approach on my laptop. Here's a thorough description of what
> > I did. I built three kernels and used an additional "control" kernel:
> > 
> > 1. v4.14-rc2-vanilla (unmodified v4.14-rc2)
> > 2. v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
> > 3. v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits patch)
> 
> ^ you mean *with* your patch but with CONFIG_USER_NS=n ?

Yes, exactly. Sorry, that was unclear here.

> 
> > 4. v4.12.0-12-generic (v4.12.0-12 standard Ubuntu kernel)
> 
> ^ Just curious, why did you include this?  To show that other factors have a much
> larger impact?  This does not include your patch, right?

Basically I wanted something which I didn't compile myself, to see whether the
numbers line up. In terms of experimentation you could think of this as a second
"control condition".

> 
> > 
> > I booted into single user mode (systemd rescue target in newspeak) and used an
> > ext4 filesystem to open/create 1,000,000 files. Then I looped through all of the
> > files calling fstat() on each of them 1000 times and calculated the mean fstat()
> > time for a single file. (The test program can be found below.)
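
The measurement described above can be sketched roughly as follows. This is not the test program from the original mail; it is a minimal illustration that assumes the files are already open and times repeated passes of fstat() with clock_gettime(CLOCK_MONOTONIC). The function name mean_fstat_ns_sketch() and its parameters are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

/* Mean wall-clock cost in nanoseconds of a single fstat() call, measured
 * over `rounds` passes across all `nfds` descriptors.
 * Returns -1.0 if fstat() or clock_gettime() fails. */
double mean_fstat_ns_sketch(const int *fds, size_t nfds, unsigned int rounds)
{
	struct timespec start, end;
	struct stat st;

	assert(fds != NULL && nfds > 0 && rounds > 0);

	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return -1.0;

	/* Walk all descriptors once per round so that, with many files,
	 * inode data is unlikely to still be in the processor's cache. */
	for (unsigned int r = 0; r < rounds; r++) {
		for (size_t i = 0; i < nfds; i++) {
			if (fstat(fds[i], &st) != 0)
				return -1.0;
		}
	}

	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return -1.0;

	double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9 +
			    (double)(end.tv_nsec - start.tv_nsec);

	return elapsed_ns / ((double)rounds * (double)nfds);
}
```

A caller would open or create the files first, store the descriptors in an array, and run the function inside the user namespace whose mappings are being measured.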
> > 
> > For kernels v4.14-rc2-vanilla, v4.12.0-12-generic I tested the following cases:
> >   0 mappings
> >   1 mapping
> >   2 mappings
> >   3 mappings
> >   5 mappings
> > 
> > For kernel v4.14-rc2-userns+ I tested:
> >     0 mappings
> >     1 mapping
> >     2 mappings
> >     3 mappings
> >     5 mappings
> >    10 mappings
> >    50 mappings
> >   100 mappings
> >   200 mappings
> >   300 mappings
> > 
> > Here are the results:
> > 
> > - v4.14-rc2-vanilla (unmodified v4.14-rc2)
> >   # no unshare:                  312 ns
> >   unshare -U # write 0 mappings: 307 ns
> >   unshare -U # write 1 mappings: 328 ns
> >   unshare -U # write 2 mappings: 328 ns
> >   unshare -U # write 3 mappings: 328 ns
> >   unshare -U # write 5 mappings: 338 ns
> > 
> > - v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
> >   # no unshare:                     158 ns
> >   unshare -U # write   0 mappings:  158 ns
> >   unshare -U # write   1 mappings:  164 ns
> >   unshare -U # write   2 mappings:  170 ns
> >   unshare -U # write   3 mappings:  175 ns
> >   unshare -U # write   5 mappings:  187 ns
> >   unshare -U # write  10 mappings:  218 ns
> >   unshare -U # write  50 mappings:  528 ns
> >   unshare -U # write 100 mappings:  980 ns
> >   unshare -U # write 200 mappings: 1880 ns
> >   unshare -U # write 300 mappings: 2760 ns
> > 
> > - v4.14-rc2-userns- (v4.14-rc2 with my new user namespace idmap limits patch and CONFIG_USER_NS=n)
> >   # no unshare: 161 ns
> > 
> > - 4.12.0-12-generic Ubuntu Kernel:
> >   # no unshare:                  328 ns
> >   unshare -U # write 0 mappings: 327 ns
> >   unshare -U # write 1 mappings: 328 ns
> >   unshare -U # write 2 mappings: 328 ns
> >   unshare -U # write 3 mappings: 328 ns
> >   unshare -U # write 5 mappings: 338 ns
> > 
> 
> ^ This is really weird.  Why does the Ubuntu kernel have near-constant (horrible)
> times?

I actually think that, even in single user mode with the same number of processes
running and so on, there's a lot of fluctuation going on. That's why I ran
the tests multiple times. It might also depend on compilation since I compiled
the three kernels myself and just downloaded the binaries for the Ubuntu kernel.
The tests clearly show that there's an increase with the number of mappings,
which is what I expected.
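
As a rough check of that increase, the incremental cost per additional mapping can be computed from the v4.14-rc2-userns+ numbers quoted above. The sketch below only does that arithmetic; struct sample, userns_plus and ns_per_mapping_sketch() are hypothetical names, and the data points are copied from the quoted results.

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark data point: number of mappings and mean fstat() time. */
struct sample {
	unsigned int mappings;
	double fstat_ns;
};

/* v4.14-rc2-userns+ results as quoted in the mail. */
const struct sample userns_plus[] = {
	{   0,  158 }, {   1,  164 }, {   2,  170 }, {   3,  175 },
	{   5,  187 }, {  10,  218 }, {  50,  528 }, { 100,  980 },
	{ 200, 1880 }, { 300, 2760 },
};

/* Added cost in ns per extra mapping between sample i - 1 and sample i. */
double ns_per_mapping_sketch(const struct sample *s, size_t i)
{
	assert(s != NULL && i > 0);
	assert(s[i].mappings > s[i - 1].mappings);

	return (s[i].fstat_ns - s[i - 1].fstat_ns) /
	       (double)(s[i].mappings - s[i - 1].mappings);
}
```

For these numbers the step from 100 to 200 mappings costs 9 ns per mapping and the first step from 0 to 1 costs 6 ns, so the growth stays within a single-digit number of nanoseconds per mapping across the measured range.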

> 
> > I've tested this multiple times and the numbers hold up. All v4.14-rc2 kernels
> > were built on the same machine with the same .config, the same options and a
> > simple call to make -j 11 bindeb-pkg. The 4.12 kernel was simply installed from
> > the Ubuntu archives.
> > 
> > The most important part seems to me that my idmap patches don't regress performance
> > for the base-cases. I'd actually only consider 0 and 1 mapping to be the proper
> 
> Agreed.  Now personally I probably would have kept 4 direct pointers and then made
> the 5+ case hurt more, but I'm not saying that's the right thing.

Yeah, I thought about that as well but my goal was to basically ramp up the
number of mappings into the hundreds to settle this "once and for all". I
actually don't expect us to go any higher than this. Tbh, users that have a
requirement for many mappings should be prepared to take the performance
hit. Also, I think that the direct pointers won't necessarily give you more
speed since, I'd guess, the slowdown simply comes from the number of
iterations through the map you have to do and not necessarily from cache misses.
But I might be thinking nonsense here. Thanks!

> 
> (haven't looked at the patch itself yet)
> 
> thanks,
> -serge