[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <87lgkgz43i.fsf@xmission.com>
Date: Thu, 12 Oct 2017 12:08:17 -0500
From: ebiederm@...ssion.com (Eric W. Biederman)
To: Christian Brauner <christian.brauner@...onical.com>
Cc: "Serge E. Hallyn" <serge@...lyn.com>,
Christian Brauner <christian.brauner@...ntu.com>,
Stéphane Graber <stgraber@...ntu.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
tycho@...h.ws
Subject: Re: [PATCH] user namespaces: bump idmap limits
Christian Brauner <christian.brauner@...onical.com> writes:
> On Wed, Oct 11, 2017 at 03:43:20PM -0500, Eric W. Biederman wrote:
>> "Serge E. Hallyn" <serge@...lyn.com> writes:
>>
>> > Quoting Christian Brauner (christian.brauner@...ntu.com):
>> >> We have quite some use cases where users already run into the current limit for
>> >> {g,u}id mappings. Consider a user requesting us to map everything but 999, and
>> >> 1001 for a given range of 1000000000 with a sub{g,u}id layout of:
>> >>
>> >> some-user:100000:1000000000
>> >> some-user:999:1
>> >> some-user:1000:1
>> >> some-user:1001:1
>> >> some-user:1002:1
>> >>
>> >> This translates to:
>> >>
>> >> MAPPING-TYPE CONTAINER HOST RANGE
>> >> uid 999 999 1
>> >> uid 1001 1001 1
>> >> uid 0 1000000 999
>> >> uid 1000 1001000 1
>> >> uid 1002 1001002 999998998
>> >>
>> >> gid 999 999 1
>> >> gid 1001 1001 1
>> >> gid 0 1000000 999
>> >> gid 1000 1001000 1
>> >> gid 1002 1001002 999998998
>> >>
>> >> which is already the current limit.
>> >>
>> >> Design Notes:
>> >> As discussed at LPC simply bumping the number of limits is not going to work
>> >> since this would mean that struct uid_gid_map won't fit into a single cache-line
>> >> anymore thereby regressing performance for the base-cases. The same problem
>> >> seems to arise when using a single pointer. So the idea is to keep the base
>> >> cases (0-3 mappings) directly in struct uid_gid_map so they fit into a single
>> >> cache-line of 64 byte. For the two removed mappings we place three pointers in
>> >> the struct that mock the behavior of traditional filesystems:
>> >> 1. a direct pointer to a struct uid_gid_extent of 5 mappings of 60 bytes
>> >> 2. an indirect pointer to an array of 64 byte of direct pointers to struct
>> >> uid_gid_extent of 5 mappings a 60 bytes each
>> >> 3. a double indirect pointer to an array of 64 bytes of indirect pointers each
>> >> to an array of 64 bytes of direct pointers (and so on)
>> >> Fixing a pointer size of 8 byte this gives us 3 + 5 + (8 * 5) + (8 * (8 * 5)) =
>> >> 368 mappings which should really be enough. The idea of this approach is to
>> >> always have each extent of idmaps (struct uid_gid_extent) be 60 bytes (5 * (4 +
>> >> 4 + 4) and thus 4 bytes smaller than the size of a single cache line. This
>> >> should only see a (i.e. linear) performance impact caused by iterating through
>> >> the idmappings in a for-loop. Note that the base cases shouldn't see any
>> >> performance degradation which is the most important part.
>> >>
>> >> Performance Testing:
>> >> When Eric introduced the extent-based struct uid_gid_map approach he measured
>> >> the performanc impact of his idmap changes:
>> >>
>> >> > My benchmark consisted of going to single user mode where nothing else was
>> >> > running. On an ext4 filesystem opening 1,000,000 files and looping through all
>> >> > of the files 1000 times and calling fstat on the individuals files. This was
>> >> > to ensure I was benchmarking stat times where the inodes were in the kernels
>> >> > cache, but the inode values were not in the processors cache. My results:
>> >>
>> >> > v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
>> >> > v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
>> >> > v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
>> >>
>> >> I used an identical approach on my laptop. Here's a thorough description of what
>> >> I did. I built three kernels and used an additional "control" kernel:
>> >>
>> >> 1. v4.14-rc2-vanilla (unmodified v4.14-rc2)
>> >> 2. v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
>> >> 3. v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits patch)
>> >> 4. v4.12.0-12-generic (v4.12.0-12 standard Ubuntu kernel)
>> >>
>> >> I booted into single user mode (systemd rescue target in newspeak) and used an
>> >> ext4 filesystem to open/create 1,000,000 files. Then I looped through all of the
>> >> files calling fstat() on each of them 1000 times and calculated the mean fstat()
>> >> time for a single file. (The test program can be found below.)
>> >>
>> >> For kernels v4.14-rc2-vanilla, v4.12.0-12-generic I tested the following cases:
>> >> 0 mappings
>> >> 1 mapping
>> >> 2 mappings
>> >> 3 mappings
>> >> 5 mappings
>> >>
>> >> For kernel v4.4-rc2-userns+ I tested:
>> >> 0 mappings
>> >> 1 mapping
>> >> 2 mappings
>> >> 3 mappings
>> >> 5 mappings
>> >> 10 mappings
>> >> 50 mappings
>> >> 100 mappings
>> >> 200 mappings
>> >> 300 mappings
>> >>
>> >> Here are the results:
>> >>
>> >> - v4.14-rc2-vanilla (unmodified v4.14-rc2)
>> >> # no unshare: 312 ns
>> >> unshare -U # write 0 mappings: 307 ns
>> >> unshare -U # write 1 mappings: 328 ns
>> >> unshare -U # write 2 mappings: 328 ns
>> >> unshare -U # write 3 mappings: 328 ns
>> >> unshare -U # write 5 mappings: 338 ns
>> >>
>> >> - v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
>> >> # no unshare: 158 ns
>> >> unshare -U # write 0 mappings: 158 ns
>> >> unshare -U # write 1 mappings: 164 ns
>> >> unshare -U # write 2 mappings: 170 ns
>> >> unshare -U # write 3 mappings: 175 ns
>> >> unshare -U # write 5 mappings: 187 ns
>> >> unshare -U # write 10 mappings: 218 ns
>> >> unshare -U # write 50 mappings: 528 ns
>> >> unshare -U # write 100 mappings: 980 ns
>> >> unshare -U # write 200 mappings: 1880 ns
>> >> unshare -U # write 300 mappings: 2760 ns
>> >
>> > These numbers look very good.
>> >
>> > Eric, does that address your performance concern?
>> >
>> > (If so I'll take a closer look at the patch itself, I've not yet done so)
>>
>> Apologies it has been a busy couple of days. I really meant to write my
>> reply a while ago.
>>
>> With the current strategy I have no concerns about the new
>> implementation regressing performance in existing use cases.
>>
>> I meant to mention that one way I got out variability when I was timing
>> things was to turn off cpu power management. I don't know if that works
>> on the newest cpus that can occassionally speed up.
>>
>> I am concerned about an implementation that scales linearly with the
>> number of mappings in the large. When it is know we should be able
>> to scale at log(N) of the number of mappings.
>
> Oh, I was under the impression that what you cared about was slowing down the
> base cases. Thanks!
I care very much about the base case because if that slows down we have
a performance regression that must be fixed. At the same time it
doesn't makes sense to extend the data structure and with something that
has poor performance when used heavily. We will just have to fix that later.
>> So perhaps we want the implementation to look something like:
>> struct uid_gid_extent {
>> u32 first;
>> u32 lower_first;
>> u32 count;
>> };
>> struct uid_gid_map {
>> u32 nr_extents;
>> union {
>> extent[UID_GID_MAP_MAX_INLINE_EXTENTS];
>> struct {
>> struct uid_gid_extent *forward;
>> struct uid_gid_extent *reverse;
>> };
>> };
>> };
>
>>
>> So perhaps we want the implementation to be a union with nr_extents
>> being the constant member and having two pointers one for a forward
>> and one for the reverse mappings.
>
> That looks reasonable. I'll code that up soon.
> Just to clarify, what do you mean by forward mappings and reverse mappings? Are
> you just saying that struct uid_gid_extent *forward points to the beginning of
> the sorted 4k kmalloc()ed extent array and struct uid_gid_extent *reverse points
> to the end of the sorted 4k kmalloc()ed extent array for the binary search? Not
> that I'm missing some intricacy here that might be crucial.
No. Intricacies missed. The sort for forward and reverse is different
so we need two different arrays. One sorted on first the other sorted
on lower_first. If the data was larger we could put in a pointer to
their common elements but as only a u32 is different a pointer to a
common element would be a waste of space.
Eric
Powered by blists - more mailing lists