Message-ID: <20171011203044.GA8207@mail.hallyn.com>
Date: Wed, 11 Oct 2017 15:30:44 -0500
From: "Serge E. Hallyn" <serge@...lyn.com>
To: Christian Brauner <christian.brauner@...ntu.com>
Cc: ebiederm@...ssion.com, serge@...lyn.com, stgraber@...ntu.com,
linux-kernel@...r.kernel.org, tycho@...h.ws
Subject: Re: [PATCH] user namespaces: bump idmap limits
Quoting Christian Brauner (christian.brauner@...ntu.com):
> We have quite a few use cases where users already run into the current limit for
> {g,u}id mappings. Consider a user requesting us to map everything but 999 and
> 1001 for a given range of 1000000000 with a sub{g,u}id layout of:
>
> some-user:100000:1000000000
> some-user:999:1
> some-user:1000:1
> some-user:1001:1
> some-user:1002:1
>
> This translates to:
>
> MAPPING-TYPE    CONTAINER       HOST            RANGE
> uid             999             999             1
> uid             1001            1001            1
> uid             0               1000000         999
> uid             1000            1001000         1
> uid             1002            1001002         999998998
>
> gid             999             999             1
> gid             1001            1001            1
> gid             0               1000000         999
> gid             1000            1001000         1
> gid             1002            1001002         999998998
>
> which is already the current limit.
>
> Design Notes:
> As discussed at LPC, simply bumping the number of limits is not going to work
> since this would mean that struct uid_gid_map won't fit into a single
> cache-line anymore, thereby regressing performance for the base cases. The same
> problem arises when using a single pointer. So the idea is to keep the base
> cases (0-3 mappings) directly in struct uid_gid_map so they fit into a single
> cache-line of 64 bytes. In place of the two removed inline mappings we place
> three pointers in the struct that mimic the block-pointer scheme of traditional
> filesystems:
> 1. a direct pointer to an array of 5 struct uid_gid_extent mappings (60 bytes)
> 2. an indirect pointer to a 64-byte array of direct pointers, each pointing to
>    an array of 5 struct uid_gid_extent mappings at 60 bytes each
> 3. a double indirect pointer to a 64-byte array of indirect pointers, each
>    pointing to a 64-byte array of direct pointers (and so on)
> Assuming 8-byte pointers this gives us 3 + 5 + (8 * 5) + (8 * (8 * 5)) = 368
> mappings, which should really be enough. The idea of this approach is to always
> have each block of idmap extents (5 * struct uid_gid_extent) be 60 bytes
> (5 * (4 + 4 + 4)) and thus 4 bytes smaller than a single cache line. Lookups
> beyond the base cases should only see a linear performance impact caused by
> iterating through the idmappings in a for-loop. Note that the base cases
> shouldn't see any performance degradation, which is the most important part.
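>
> Roughly, the resulting layout could look like the sketch below. Field names
> and the lookup helper are illustrative only (they are not taken from the
> patch); the sizes simply follow the numbers above:
>
> 	#include <linux/types.h>	/* u32 */
>
> 	struct uid_gid_extent {
> 		u32 first;		/* id inside the namespace */
> 		u32 lower_first;	/* id in the parent namespace */
> 		u32 count;
> 	};				/* 12 bytes per mapping */
>
> 	struct uid_gid_map {		/* 4 + 3 * 12 + 3 * 8 = 64 bytes */
> 		u32 nr_extents;
> 		/* base cases: 0-3 mappings stay within one cache line */
> 		struct uid_gid_extent extent[3];
> 		/* direct: one block of 5 extents (5 * 12 = 60 bytes) */
> 		struct uid_gid_extent *direct;
> 		/* indirect: 64-byte block of 8 pointers to direct blocks */
> 		struct uid_gid_extent **indirect;
> 		/* double indirect: 8 pointers to indirect blocks */
> 		struct uid_gid_extent ***dindirect;
> 	};
>
> 	/* Linear lookup over the inline extents; the blocks behind the
> 	 * direct/indirect/double-indirect pointers would be walked the
> 	 * same way, one 5-extent block at a time. */
> 	static u32 map_id_down_sketch(struct uid_gid_map *map, u32 id)
> 	{
> 		u32 i, inline_nr = map->nr_extents < 3 ? map->nr_extents : 3;
>
> 		for (i = 0; i < inline_nr; i++) {
> 			u32 first = map->extent[i].first;
> 			u32 count = map->extent[i].count;
>
> 			if (id >= first && id - first < count)
> 				return map->extent[i].lower_first + (id - first);
> 		}
> 		return (u32) -1;	/* no mapping found */
> 	}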
>
> Performance Testing:
> When Eric introduced the extent-based struct uid_gid_map approach he measured
> the performance impact of his idmap changes:
>
> > My benchmark consisted of going to single user mode where nothing else was
> > running. On an ext4 filesystem opening 1,000,000 files and looping through all
> > of the files 1000 times and calling fstat on the individual files. This was
> > to ensure I was benchmarking stat times where the inodes were in the kernel's
> > cache, but the inode values were not in the processor's cache. My results:
>
> > v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
> > v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
> > v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
>
> I used an identical approach on my laptop. Here's a thorough description of what
> I did. I built three kernels and used an additional "control" kernel:
>
> 1. v4.14-rc2-vanilla (unmodified v4.14-rc2)
> 2. v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
> 3. v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits patch)
> 4. v4.12.0-12-generic (v4.12.0-12 standard Ubuntu kernel)
>
> I booted into single user mode (systemd rescue target in newspeak) and used an
> ext4 filesystem to open/create 1,000,000 files. Then I looped through all of the
> files calling fstat() on each of them 1000 times and calculated the mean fstat()
> time for a single file. (The test program can be found below.)
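>
> In outline, the measurement looks roughly like the following (a simplified,
> illustrative sketch, not the actual test program):
>
> 	/* Open NFILES files, then fstat() each of them ITERATIONS times and
> 	 * report the mean time per fstat() call. RLIMIT_NOFILE needs to be
> 	 * raised above NFILES for this to work. */
> 	#include <fcntl.h>
> 	#include <stdio.h>
> 	#include <stdlib.h>
> 	#include <sys/stat.h>
> 	#include <time.h>
> 	#include <unistd.h>
>
> 	#define NFILES     1000000
> 	#define ITERATIONS 1000
>
> 	static int fds[NFILES];
>
> 	int main(void)
> 	{
> 		struct timespec start, end;
> 		struct stat st;
> 		char path[64];
> 		double total_ns;
> 		long i, j;
>
> 		for (i = 0; i < NFILES; i++) {
> 			snprintf(path, sizeof(path), "testfile-%ld", i);
> 			fds[i] = open(path, O_RDONLY | O_CREAT, 0600);
> 			if (fds[i] < 0) {
> 				perror("open");
> 				exit(EXIT_FAILURE);
> 			}
> 		}
>
> 		clock_gettime(CLOCK_MONOTONIC, &start);
> 		for (j = 0; j < ITERATIONS; j++)
> 			for (i = 0; i < NFILES; i++)
> 				fstat(fds[i], &st);
> 		clock_gettime(CLOCK_MONOTONIC, &end);
>
> 		total_ns = (end.tv_sec - start.tv_sec) * 1e9 +
> 			   (end.tv_nsec - start.tv_nsec);
> 		printf("mean fstat(): %.0f ns\n",
> 		       total_ns / ((double)NFILES * ITERATIONS));
> 		return 0;
> 	}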
>
> For the kernels v4.14-rc2-vanilla and v4.12.0-12-generic I tested the following cases:
> 0 mappings
> 1 mapping
> 2 mappings
> 3 mappings
> 5 mappings
>
> For kernel v4.14-rc2-userns+ I tested:
> 0 mappings
> 1 mapping
> 2 mappings
> 3 mappings
> 5 mappings
> 10 mappings
> 50 mappings
> 100 mappings
> 200 mappings
> 300 mappings
>
> Here are the results:
>
> - v4.14-rc2-vanilla (unmodified v4.14-rc2)
> # no unshare: 312 ns
> unshare -U # write 0 mappings: 307 ns
> unshare -U # write 1 mappings: 328 ns
> unshare -U # write 2 mappings: 328 ns
> unshare -U # write 3 mappings: 328 ns
> unshare -U # write 5 mappings: 338 ns
>
> - v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
> # no unshare: 158 ns
> unshare -U # write 0 mappings: 158 ns
> unshare -U # write 1 mappings: 164 ns
> unshare -U # write 2 mappings: 170 ns
> unshare -U # write 3 mappings: 175 ns
> unshare -U # write 5 mappings: 187 ns
> unshare -U # write 10 mappings: 218 ns
> unshare -U # write 50 mappings: 528 ns
> unshare -U # write 100 mappings: 980 ns
> unshare -U # write 200 mappings: 1880 ns
> unshare -U # write 300 mappings: 2760 ns

These numbers look very good.

Eric, does that address your performance concern?

(If so, I'll take a closer look at the patch itself; I've not yet done so.)

thanks,
-serge