lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Sun, 5 Jul 2020 13:34:59 -0700
From:   Vito Caputo <vcaputo@...garu.com>
To:     Greg KH <gregkh@...uxfoundation.org>
Cc:     Matthew Wilcox <willy@...radead.org>,
        Jan Ziak <0xe2.0x9a.0x9b@...il.com>, linux-api@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-kselftest@...r.kernel.org, linux-man@...r.kernel.org,
        mtk.manpages@...il.com, shuah@...nel.org, viro@...iv.linux.org.uk
Subject: Re: [PATCH 0/3] readfile(2): a new syscall to make open/read/close
 faster

On Sun, Jul 05, 2020 at 01:44:54PM +0200, Greg KH wrote:
> On Sun, Jul 05, 2020 at 01:07:14AM -0700, Vito Caputo wrote:
> > On Sun, Jul 05, 2020 at 04:27:32AM +0100, Matthew Wilcox wrote:
> > > On Sun, Jul 05, 2020 at 05:18:58AM +0200, Jan Ziak wrote:
> > > > On Sun, Jul 5, 2020 at 5:12 AM Matthew Wilcox <willy@...radead.org> wrote:
> > > > >
> > > > > You should probably take a look at io_uring.  That has the level of
> > > > > complexity of this proposal and supports open/read/close along with many
> > > > > other opcodes.
> > > > 
> > > > Then glibc can implement readfile using io_uring and there is no need
> > > > for a new single-file readfile syscall.
> > > 
> > > It could, sure.  But there's also a value in having a simple interface
> > > to accomplish a simple task.  Your proposed API added a very complex
> > > interface to satisfy needs that clearly aren't part of the problem space
> > > that Greg is looking to address.
> > 
> > I disagree re: "aren't part of the problem space".
> > 
> > Reading small files from procfs was specifically called out in the
> > rationale for the syscall.
> > 
> > In my experience you're rarely monitoring a single proc file in any
> > situation where you care about the syscall overhead.  You're
> > monitoring many of them, and any serious effort to do this efficiently
> > in a repeatedly sampled situation has cached the open fds and already
> > uses pread() to simply restart from 0 on every sample and not
> > repeatedly pay for the name lookup.
> 
> That's your use case, but many other use cases are just "read a bunch of
> sysfs files in one shot".  Examples of that are tools that monitor
> uevents and lots of hardware-information gathering tools.
> 
> Also not all tools sem to be as smart as you think they are, look at
> util-linux for loads of the "open/read/close" lots of files pattern.  I
> had a half-baked patch to convert it to use readfile which I need to
> polish off and post with the next series to show how this can be used to
> both make userspace simpler as well as use less cpu time.
> 
> > Basically anything optimally using the existing interfaces for
> > sampling proc files needs a way to read multiple open file descriptors
> > in a single syscall to move the needle.
> 
> Is psutils using this type of interface, or do they constantly open
> different files?
> 

When I last checked, psutils was not an optimal example, nor did I
suggest it was.

> What about fun tools like bashtop:
> 	https://github.com/aristocratos/bashtop.git
> which thankfully now relies on python's psutil package to parse proc in
> semi-sane ways, but that package does loads of constant open/read/close
> of proc files all the time from what I can tell.
> 
> And lots of people rely on python's psutil, right?

If python's psutil is constantly reopening the same files in /proc,
this is an argument to go improve python's psutil, especially if it's
popular.

Your proposed syscall doesn't magically make everything suboptimally
sampling proc more efficient.  It still requires going out and
modifying everything to use the new syscall.

In order to actually realize a gain comparable to what can be done
using existing interfaces, but with your new syscall, if the code
wasn't already reusing the open fd it still requires a refactor to do
so with your syscall, to eliminate the directory lookup on every
sample.

At the end of the day, if you did all this work, you'd have code that
only works on kernels with the new syscall, didn't enjoy a significant
performance gain over what could have been achieved using the existing
interfaces, and still required basically the same amount of work as
optimizing for the existing interfaces would have.  For what gain?

> 
> > This syscall doesn't provide that.  It doesn't really give any
> > advantage over what we can achieve already.  It seems basically
> > pointless to me, from a monitoring proc files perspective.
> 
> What "good" monitoring programs do you suggest follow the pattern you
> recommend?
> 

"Good" is not generally a word I'd use to describe software, surely
that's not me you're quoting... but I assume you mean "optimal".

I'm sure sysprof is at least reusing open files when sampling proc,
because we discussed the issue when Christian took over maintenance.

It appears he's currently using the lseek()->read() sequence:

https://gitlab.gnome.org/GNOME/sysprof/-/blob/master/src/libsysprof/sysprof-netdev-source.c#L223
https://gitlab.gnome.org/GNOME/sysprof/-/blob/master/src/libsysprof/sysprof-memory-source.c#L210
https://gitlab.gnome.org/GNOME/sysprof/-/blob/master/src/libsysprof/sysprof-diskstat-source.c#L185

It'd be more efficient to just use pread() and lose the lseek(), at
which point it'd be just a single pread() call per sample per proc
file.  Nothing your proposed syscall would improve upon, not that it'd
be eligible for software that wants to work on existing kernels from
distros like Debian and Centos/RHEL anyways.

If this were a conversation about providing something like a better
scatter-gather interface akin to p{read,write}v but with the fd in the
iovec, then we'd be talking about something very lucrative for proc
sampling. But like you've said elsewhere in this thread, io_uring()
may suffice as an alternative solution in that vein.

My personal interest in this topic stems from an experimental window
manager I made, and still use, which monitors every descendant process
for the X session at frequencies up to 60HZ.  The code opens a bunch
of proc files for every process, and keeps them open until the process
goes away or falls out of scope.  See the attachment for some idea of
what /proc/$(pidof wm)/fd looks like.  All those proc files are read
at up to 60HZ continuously.

All top-like tools are really no different, and already shouldn't be
reopening things on every sample.  They should be fixed if not - with
or without your syscall, it's equal effort, but the existing
interfaces... exist.

Regards,
Vito Caputo

View attachment "vwm-fds.txt" of type "text/plain" (18704 bytes)

Powered by blists - more mailing lists