Message-ID: <87a7z5yjbs.fsf@notabene.neil.brown.name>
Date:   Wed, 29 Nov 2017 12:17:27 +1100
From:   NeilBrown <neilb@...e.com>
To:     Mike Marion <mmarion@...lcomm.com>, Ian Kent <raven@...maw.net>
Cc:     autofs mailing list <autofs@...r.kernel.org>,
        Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH 3/3] autofs - fix AT_NO_AUTOMOUNT not being honored

On Tue, Nov 28 2017, Mike Marion wrote:

> On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:
>
>> I think the situation is going to get worse before it gets better.
>> 
>> On recent Fedora and kernel, with a large map and heavy mount activity
>> I see:
>> 
>> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
>> gnome-settings-daemon, packagekitd and gnome-shell
>> 
>> all go crazy consuming large amounts of CPU.
>
> Yep.  I'm not even worried about the CPU usage as much (yet, I'm sure 
> it'll be more of a problem as time goes on).  We have pretty huge
> direct maps and our initial startup tests on a new host with the link vs
> file took >6 hours.  That's not a typo.  We worked with Suse engineering 
> to come up with a fix, which should've been pushed here some time ago.
>
> Then there's shutdowns (and reboots).  They also took a long time (on
> the order of 20+ min) because it would walk the entire /proc/mounts
> "unmounting" things.  Also fixed now.  That one had something to do
> with the SMP code, as with a single CPU/core it didn't take long at all.
>
> We also just got a fix for the SUSE grub2-mkconfig script so that its
> parsing looking for the root dev skips over fstype autofs
> (the probe_nfsroot_device function).
>
>> The symlink change was probably the start; now a number of applications
>> go directly to the proc file system for this information.
>> 
>> For large mount tables and many processes accessing the mount table
>> (probably reading the whole thing, either periodically or on change
>> notification) the current system does not scale well at all.
>
> We use Clearcase in some instances as well, and that's yet another thing
> adding mounts, and its startup is very slow, due to the size of
> /proc/mounts.  
>
> It's definitely something that's more than just autofs and probably
> going to get worse, as you say.

If we assume that applications are going to want to read
/proc/self/mount* a lot, we probably need to make it faster.
I performed a simple experiment where I mounted 1000 tmpfs filesystems,
copied /proc/self/mountinfo to /tmp/mountinfo, then
ran 4 for loops in parallel catting one of these files to /dev/null 1000 times.
On a single-CPU VM:
  /tmp/mountinfo: each group of 1000 cats took about 3 seconds.
  /proc/self/mountinfo: each group of 1000 cats took about 14 seconds.
On a 4-CPU VM:
  /tmp/mountinfo: about 1.5 seconds.
  /proc/self/mountinfo: about 3.5 seconds.

Using "perf record" it appears that most of the cost is repeated calls
to prepend_path, with a small contribution from the fact that each read
only returns 4K rather than the 128K that cat asks for.
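
In case anyone wants to repeat something similar, here is a minimal
single-process sketch of the read loop (the experiment above used cat
in four parallel shell loops; the path and iteration count below are
only example defaults):

/*
 * Rough single-process analogue of the experiment above.
 * Build with: cc -O2 -o readbench readbench.c
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/proc/self/mountinfo";
	int iters = argc > 2 ? atoi(argv[2]) : 1000;
	static char buf[128 * 1024];	/* cat-sized read buffer */
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++) {
		int fd = open(path, O_RDONLY);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* proc files typically hand back at most 4K per read,
		 * however large a buffer we offer. */
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%d reads of %s: %.2f seconds\n", iters, path,
	       (double)(t1.tv_sec - t0.tv_sec) +
	       (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}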

If we could hang a cache off struct mnt_namespace and use it instead of
iterating the mount table - using rcu and ns->event to ensure currency -
we should be able to minimize the cost of this increased use of
/proc/self/mount*.

I suspect that the best approach would be to implement a cache at the
seq_file level.
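
To make the idea concrete, here is a rough sketch of what a
per-namespace cache of the rendered text might look like.  None of the
names below (mounts_cache, mounts_cache_get, render_mount_table) exist
today; this is only meant to illustrate using ns->event as the
invalidation token, and the real thing would still need to decide on
locking (RCU or otherwise) and where the cache hangs.

#include <linux/kref.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include "mount.h"

/*
 * Sketch only: keep the rendered text of the mount table per
 * namespace, tagged with the ns->event value it was built from, and
 * rebuild it only when mount activity has actually changed something.
 * A plain mutex is used here for clarity.
 */
struct mounts_cache_entry {
	struct kref	ref;
	u64		event;		/* ns->event at render time */
	size_t		len;
	char		text[];		/* rendered mount table */
};

struct mounts_cache {
	struct mutex			lock;
	struct mounts_cache_entry	*cur;	/* latest rendering */
};

/* Stand-in for the existing show_mountinfo()-style iteration; it
 * would allocate the entry and kref_init() it. */
static struct mounts_cache_entry *render_mount_table(struct mnt_namespace *ns);

static void mounts_cache_entry_release(struct kref *ref)
{
	kfree(container_of(ref, struct mounts_cache_entry, ref));
}

/*
 * Return a referenced entry whose text is current for this namespace,
 * rebuilding it first if mount activity has bumped ns->event since it
 * was rendered.  The caller drops its reference with kref_put() when
 * the read completes.
 */
static struct mounts_cache_entry *
mounts_cache_get(struct mnt_namespace *ns, struct mounts_cache *cache)
{
	struct mounts_cache_entry *e;

	mutex_lock(&cache->lock);
	e = cache->cur;
	if (!e || e->event != ns->event) {
		e = render_mount_table(ns);
		if (e) {
			e->event = ns->event;
			if (cache->cur)
				kref_put(&cache->cur->ref,
					 mounts_cache_entry_release);
			cache->cur = e;
		}
	}
	if (e)
		kref_get(&e->ref);	/* reference for this reader */
	mutex_unlock(&cache->lock);
	return e;
}

A read that asks for the whole file could then just copy from
e->text, while anything else falls back to the existing per-mount
iteration, which fits the whole-file restriction discussed below.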

One possible problem might be if applications assume that a read will
always return a whole number of lines (it currently does).  To be
sure we remain safe, we would only be able to use the cache for
a read() syscall which reads the whole file.
How big do people see /proc/self/mount* getting?  What size reads
does 'strace' show the various programs using to read it?

Thanks,
NeilBrown
