Message-ID: <CANaxB-zLkvXWS3Fg5Ps463iF7Cb1UVr+FwKb65VFRATqbgnW+A@mail.gmail.com>
Date: Wed, 12 Jun 2024 10:48:42 -0700
From: Andrei Vagin <avagin@...il.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: Andrii Nakryiko <andrii@...nel.org>, linux-fsdevel@...r.kernel.org, brauner@...nel.org,
viro@...iv.linux.org.uk, akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org, bpf@...r.kernel.org, gregkh@...uxfoundation.org,
linux-mm@...ck.org, liam.howlett@...cle.com, surenb@...gle.com,
rppt@...nel.org
Subject: Re: [PATCH v3 3/9] fs/procfs: implement efficient VMA querying API
for /proc/<pid>/maps

On Mon, Jun 10, 2024 at 1:17 AM Andrii Nakryiko
<andrii.nakryiko@...il.com> wrote:
>
> On Fri, Jun 7, 2024 at 11:31 PM Andrei Vagin <avagin@...il.com> wrote:
> >
> > On Tue, Jun 04, 2024 at 05:24:48PM -0700, Andrii Nakryiko wrote:
> > > The /proc/<pid>/maps file is extremely useful in practice for various tasks
> > > involving figuring out process memory layout, what files are backing any
> > > given memory range, etc. One important class of applications that
> > > absolutely relies on this is profilers/stack symbolizers (the perf tool
> > > being one of them). Patterns of use differ, but they generally fall into
> > > two categories.
> > >
> > > In the on-demand pattern, a profiler/symbolizer would normally capture a
> > > stack trace containing absolute memory addresses of some functions, and
> > > would then use the /proc/<pid>/maps file to find the corresponding backing
> > > ELF files (normally, only executable VMAs are of interest) and the file
> > > offsets within them, and then continue from there, using further data (ELF
> > > symbols, DWARF information) to produce human-readable symbolic information.
> > > This pattern is used by Meta's fleet-wide profiler, as one example.
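
For context, a minimal sketch of how this on-demand lookup is typically done
today against the text interface (purely illustrative; not code from this
series):

#include <stdio.h>

/* Find the file-backed, executable VMA that covers `addr` and compute the
 * corresponding file offset, by linearly scanning the text of
 * /proc/<pid>/maps. `path` must be at least 256 bytes. */
static int find_vma(int pid, unsigned long addr,
		    char *path, unsigned long *file_off)
{
	unsigned long start, end, off;
	char maps[64], line[512], perms[5];
	FILE *f;

	snprintf(maps, sizeof(maps), "/proc/%d/maps", pid);
	f = fopen(maps, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		path[0] = '\0';
		/* Each line: <start>-<end> <perms> <offset> <dev> <inode> <path> */
		if (sscanf(line, "%lx-%lx %4s %lx %*s %*s %255s",
			   &start, &end, perms, &off, path) < 4)
			continue;
		if (addr >= start && addr < end && perms[2] == 'x') {
			*file_off = addr - start + off;	/* offset within the ELF file */
			fclose(f);
			return 0;
		}
	}
	fclose(f);
	return -1;	/* not mapped, or the VMA changed/went away under us */
}
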
> > >
> > > In the preprocessing pattern, the application doesn't know the set of
> > > addresses of interest, so it has to fetch all relevant VMAs (again, probably
> > > only executable ones), store or cache them, and then proceed with profiling
> > > and stack trace capture. Once done, it does symbolization based on the
> > > stored VMA information. This can happen at a much later point in time.
> > > This pattern is used by the perf tool, as an example.
> > >
> > > In either case, there are both performance and correctness requirements
> > > involved. This address-to-VMA translation has to be done as efficiently as
> > > possible, but it also must not miss any VMA (especially in the case of
> > > loading/unloading shared libraries). In practice, correctness can't be
> > > guaranteed (due to the process dying before VMA data can be captured, a
> > > shared library being unloaded, etc.), but any effort to maximize the chance
> > > of finding the VMA is appreciated.
> > >
> > > Unfortunately, for all the /proc/<pid>/maps file's universality and
> > > usefulness, it doesn't fit the above use cases 100%.
> > >
> > > First, its main purpose is to emit all VMAs sequentially, but in practice
> > > the captured addresses fall only into a small subset of all the process'
> > > VMAs, mainly those containing executable text. Yet, a library would need to
> > > parse most or all of the contents to find the needed VMAs, as there is no
> > > way to skip VMAs that are of no use. An efficient library can do the linear
> > > pass relatively cheaply, but it is still overhead that could be avoided if
> > > there were a way to do more targeted querying of the relevant VMA
> > > information.
> > >
> > > Second, it's a text-based interface, which makes its programmatic use from
> > > applications and libraries more cumbersome and inefficient due to the need
> > > to handle text parsing to get the necessary pieces of information. The
> > > overhead is actually paid twice: by the kernel, formatting inherently
> > > binary VMA data into text, and then by the user space application, parsing
> > > it back into binary data for further use.
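
To make the contrast concrete, here is a purely hypothetical sketch of what a
binary, query-by-address interface could look like; the struct layout, flags,
and ioctl number below are made up for illustration and are not necessarily
what this series proposes:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Hypothetical fixed-size query: user space fills in the address and
 * constraints, the kernel fills in the covering VMA -- no text formatting
 * on the kernel side and no text parsing on the user side. */
struct vma_query {
	uint64_t query_addr;	/* in:  address to look up */
	uint64_t query_flags;	/* in:  e.g. "executable, file-backed only" */
	uint64_t vma_start;	/* out: start of the covering VMA */
	uint64_t vma_end;	/* out: end of the covering VMA */
	uint64_t file_offset;	/* out: file offset backing vma_start */
	uint64_t inode;		/* out: inode of the backing file */
};

/* Made-up ioctl number, for illustration only. */
#define VMA_QUERY	_IOWR('f', 0xff, struct vma_query)

/* Issue the query against an already-open /proc/<pid>/maps fd. */
static int query_vma(int maps_fd, struct vma_query *q)
{
	return ioctl(maps_fd, VMA_QUERY, q);
}
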
> >
> > I was trying to solve all these issues in a more generic way:
> > https://lwn.net/Articles/683371/
> >
>
> Can you please provide a tl;dr summary of that effort?

task_diag is a generic interface designed to efficiently gather
information about running processes. It addresses the limitations of
traditional /proc/PID/* files. This binary interface utilizes the
netlink protocol, inspired by the socket diag interface. Input is
provided as a netlink message detailing the desired information, and the
kernel responds with a set of netlink messages containing the results.
Compared to struct-based interfaces like this one or statx, the
netlink-based approach can be more flexible, particularly when
dealing with numerous optional parameters. BTW, David Ahern made
some adjustments to task_diag to optimize the same things that are
targeted here.
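
As a rough illustration of that request/response flow, here is a minimal
sketch using the existing sock_diag interface that task_diag was modeled on
(task_diag defines its own message types, which I'm not reproducing here):

#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/netlink.h>
#include <linux/sock_diag.h>
#include <linux/inet_diag.h>

/* Dump all TCP sockets via the netlink-based sock_diag interface: send one
 * request describing what we want, then read back a stream of netlink
 * messages until NLMSG_DONE. */
int dump_tcp_sockets(void)
{
	struct {
		struct nlmsghdr nlh;
		struct inet_diag_req_v2 req;
	} msg = {
		.nlh = {
			.nlmsg_len = sizeof(msg),
			.nlmsg_type = SOCK_DIAG_BY_FAMILY,
			.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
		},
		.req = {
			.sdiag_family = AF_INET,
			.sdiag_protocol = IPPROTO_TCP,
			.idiag_states = -1,	/* sockets in any state */
		},
	};
	char buf[8192];
	int fd, len, done = 0;

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_SOCK_DIAG);
	if (fd < 0)
		return -1;
	if (send(fd, &msg, sizeof(msg), 0) < 0) {
		close(fd);
		return -1;
	}

	while (!done && (len = recv(fd, buf, sizeof(buf), 0)) > 0) {
		struct nlmsghdr *h = (struct nlmsghdr *)buf;

		for (; NLMSG_OK(h, len); h = NLMSG_NEXT(h, len)) {
			if (h->nlmsg_type == NLMSG_DONE ||
			    h->nlmsg_type == NLMSG_ERROR) {
				done = 1;
				break;
			}
			/* Each payload is a binary struct inet_diag_msg plus
			 * optional attributes -- no text parsing needed. */
		}
	}
	close(fd);
	return 0;
}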

task_diag hasn't been merged into the kernel. I don't remember all the
arguments; it was some time ago. The primary concern was the
introduction of redundant functionality: it would have been a second
interface offering similar capabilities, without a plan to deprecate the
older one. Furthermore, there wasn't sufficient demand to justify
the addition of a new interface at the time.

Thanks,
Andrei