Message-ID: <CAJuCfpEkV_+pAjxyEpMqY+x7buZhSpj5qDF6KubsS=ObrQKUZg@mail.gmail.com>
Date: Thu, 4 May 2023 08:08:13 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: Michal Hocko <mhocko@...e.com>
Cc: akpm@...ux-foundation.org, kent.overstreet@...ux.dev,
vbabka@...e.cz, hannes@...xchg.org, roman.gushchin@...ux.dev,
mgorman@...e.de, dave@...olabs.net, willy@...radead.org,
liam.howlett@...cle.com, corbet@....net, void@...ifault.com,
peterz@...radead.org, juri.lelli@...hat.com, ldufour@...ux.ibm.com,
catalin.marinas@....com, will@...nel.org, arnd@...db.de,
tglx@...utronix.de, mingo@...hat.com, dave.hansen@...ux.intel.com,
x86@...nel.org, peterx@...hat.com, david@...hat.com,
axboe@...nel.dk, mcgrof@...nel.org, masahiroy@...nel.org,
nathan@...nel.org, dennis@...nel.org, tj@...nel.org,
muchun.song@...ux.dev, rppt@...nel.org, paulmck@...nel.org,
pasha.tatashin@...een.com, yosryahmed@...gle.com,
yuzhao@...gle.com, dhowells@...hat.com, hughd@...gle.com,
andreyknvl@...il.com, keescook@...omium.org,
ndesaulniers@...gle.com, gregkh@...uxfoundation.org,
ebiggers@...gle.com, ytcoode@...il.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
bristot@...hat.com, vschneid@...hat.com, cl@...ux.com,
penberg@...nel.org, iamjoonsoo.kim@....com, 42.hyeyoo@...il.com,
glider@...gle.com, elver@...gle.com, dvyukov@...gle.com,
shakeelb@...gle.com, songmuchun@...edance.com, jbaron@...mai.com,
rientjes@...gle.com, minchan@...gle.com, kaleshsingh@...gle.com,
kernel-team@...roid.com, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, iommu@...ts.linux.dev,
linux-arch@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-mm@...ck.org, linux-modules@...r.kernel.org,
kasan-dev@...glegroups.com, cgroups@...r.kernel.org
Subject: Re: [PATCH 00/40] Memory allocation profiling
On Thu, May 4, 2023 at 2:07 AM Michal Hocko <mhocko@...e.com> wrote:
>
> On Wed 03-05-23 08:09:28, Suren Baghdasaryan wrote:
> > On Wed, May 3, 2023 at 12:25 AM Michal Hocko <mhocko@...e.com> wrote:
> [...]
> > Thanks for summarizing!
> >
> > > At least those I find the most important:
> > > - This is a big change and it adds a significant maintenance burden
> > > because each allocation entry point needs to be handled specifically.
> > > The cost will grow with the intended coverage, especially when
> > > allocation is hidden in library code.
> >
> > Do you mean that with more allocations in the codebase, more codetags
> > will be generated? Is that the concern?
>
> No. I am mostly concerned about the _maintenance_ overhead. For the
> bare tracking (without profiling and thus stack traces) only those
> allocations that are directly inlined into the consumer are really
> of any use. That increases the code impact of the tracing because any
> relevant allocation location has to go through the micro-surgery.
>
> e.g. is it really interesting to know that there is a likely memory
> leak in seq_file proper doing an allocation? No, as it is most likely
> the specific implementation using seq_file that is leaking. There are
> other examples like that. See?
Yes, I see that. One-level tracking does not provide all the
information needed to track down such issues, and something more
informative would cost more. That's why our proposal is a lightweight
mechanism to get a high-level picture, with the ability to zoom into a
specific area using context capture. If you have ideas to improve
this, I'm open to suggestions.
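To make the lightweight part concrete, here is a rough sketch of the
idea (simplified and illustrative, not the exact macros from the
series): each wrapped call site emits a static tag into a dedicated
section at compile time, so the hot path is a direct counter update
with no runtime lookup. This is also why the wrapper has to live at
the consumer's call site rather than inside library code like
seq_file.

#include <linux/atomic.h>

struct alloc_tag {
        const char *filename;
        int lineno;
        atomic64_t bytes;       /* bytes allocated from this site */
};

/* Illustrative wrapper; names and layout are not the real macros. */
#define alloc_hooks(_do_alloc, _size)                                   \
({                                                                      \
        static struct alloc_tag _tag __section("alloc_tags") = {        \
                .filename = __FILE__,                                   \
                .lineno   = __LINE__,                                   \
        };                                                              \
        atomic64_add(_size, &_tag.bytes);                               \
        (_do_alloc);                                                    \
})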
>
> > Or maybe as you commented in
> > another patch that context capturing feature does not limit how many
> > stacks will be captured?
>
> That is a memory overhead which can be really huge and it would be nice
> to be more explicit about that in the cover letter. It is a downside for
> sure but not something that has a code maintenance impact, and it is
> opt-in, so it can be enabled only when necessary.
You are right, I'll add that into the cover letter.
>
> Quite honestly, though, the more I look into the context capturing part
> seems to me that there is much more to be reconsidered there and if you
> really want to move forward with the code tagging part then you should
> drop that for now. It would make the whole series smaller and easier to
> digest.
Sure, I don't see an issue with removing that for now and refining the
mechanism before posting again.
>
> > > - It has been brought up that this is duplicating functionality already
> > > available via existing tracing infrastructure. You should make it very
> > > clear why that is not suitable for the job
> >
> > I experimented with using tracing with _RET_IP_ to implement this
> > accounting. The major issue is the runtime overhead of the
> > _RET_IP_-to-codetag lookup, which is orders of magnitude higher than
> > in the proposed code tagging approach, where that link is resolved at
> > compile time. Since we want this mechanism deployed in production, we
> > want to keep the overhead to the absolute minimum.
> > You asked me before how much overhead would be tolerable and the
> > answer will always be "as small as possible". This is especially true
> > for slab allocators, which are ridiculously fast, so regressing them
> > would be very noticeable (due to their frequent use).
>
> It would have been more convincing if you had some numbers at hand.
> E.g. this is a typical workload we are dealing with. With the compile
> time tags we are able to learn this at this much cost. With dynamic
> tracing we are able to learn this much at that cost. See? "As small as
> possible" is a rather vague term that different people will have very
> different ideas about.
I'm rerunning my tests with the latest kernel to collect the
comparison data. I profiled these solutions before, but the kernel has
changed since then, so the numbers need to be refreshed.
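In the meantime, to show where the cost difference comes from, here is
a simplified sketch of the tracing-based hot path (locking omitted;
struct site_counter and track_alloc() are illustrative, though the
hashtable helpers are the stock <linux/hashtable.h> ones):

#include <linux/hashtable.h>
#include <linux/slab.h>

struct site_counter {
        unsigned long ret_ip;
        atomic64_t bytes;
        struct hlist_node node;
};

static DEFINE_HASHTABLE(site_table, 12);        /* 4096 buckets */

/*
 * Every allocation pays a hash walk before it can bump a counter, and
 * counter structures have to be allocated at runtime because the set
 * of call sites is not known in advance.
 */
static void track_alloc(unsigned long ret_ip, size_t size)
{
        struct site_counter *c;

        hash_for_each_possible(site_table, c, node, ret_ip)
                if (c->ret_ip == ret_ip)
                        goto found;

        c = kzalloc(sizeof(*c), GFP_ATOMIC);
        if (!c)
                return;
        c->ret_ip = ret_ip;
        hash_add(site_table, &c->node, ret_ip);
found:
        atomic64_add(size, &c->bytes);
}

With code tagging the counter address is a compile-time constant, so
the equivalent hot path collapses to a single atomic update.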
>
> > There is another issue, which I think can be solved in a smart way but
> > will either affect performance or require more memory. With the
> > tracing approach we don't know beforehand how many individual
> > allocation sites exist, so we have to allocate code tags (or similar
> > structures for counting) at runtime vs compile time. We can be smart
> > about it and allocate in batches or even preallocate more than we need
> > beforehand but, as I said, it will require some kind of compromise.
>
> I have tried our usual distribution config (only vmlinux without modules
> so the real impact will be larger as we build a lot of stuff into
> modules) just to get an idea:
> text data bss dec hex filename
> 28755345 17040322 19845124 65640791 3e99957 vmlinux.before
> 28867168 17571838 19386372 65825378 3ec6a62 vmlinux.after
>
> Less than 1% for text, 3% for data. This is not all that terrible
> for an initial submission and a more dynamic approach could be added
> later. E.g. with a smaller pre-allocated hash table that could be
> expanded lazily. Anyway, not something I would be losing sleep over. This
> can always be improved later on.
Ah, right. I should have mentioned this overhead too. Thanks for
keeping me honest.
> > I understand that code tagging creates additional maintenance burdens
> > but I hope it also produces enough benefits that people will want
> > this. The cost is also hopefully amortized when additional
> > applications like the ones we presented in RFC [1] are built using the
> > same framework.
>
> TBH I am much more concerned about the maintenance burden on the MM side
> than the actual code tagging itself, which is much more self-contained. I
> haven't seen other potential applications of the same infrastructure and
> maybe the code impact would be much smaller than in the MM proper. Our
> allocator API is really hairy and convoluted.
Yes, other applications are much smaller and cleaner. MM allocation
code is quite complex indeed.
>
> > > - We already have page_owner infrastructure that provides allocation
> > > tracking data. Why can't it be used/extended?
> >
> > 1. The overhead.
>
> Do you have any numbers?
Will post once my tests are completed.
>
> > 2. Covers only page allocators.
>
> Yes this sucks.
> >
> > I didn't think about extending the page_owner approach to slab
> > allocators but I suspect it would not be trivial. I don't see
> > attaching an owner to every slab object as a scalable solution. The
> > overhead would again be of concern here.
>
> This would have been a nice argument to mention in the changelog so that
> we know that you have considered that option at least. Why should I (as
> a reviewer) have to guess at that?
Sorry, it's hard to remember all the decisions, discussions and
conclusions when working on a feature over a long time period. I'll
include more information about that.
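For the record, here is the back-of-the-envelope concern with
per-object owners (the record layout is hypothetical and the sizes
illustrative):

#include <linux/stackdepot.h>

/* A minimal page_owner-style record attached to each slab object: */
struct slab_obj_owner {
        depot_stack_handle_t handle;    /* stack depot handle, 4 bytes */
        pid_t pid;                      /* allocating task, 4 bytes */
};

/*
 * page_owner pays one such record per page. Per slab object, a
 * kmalloc-8 slab on a 4K page would need
 *      512 objects * 8 bytes = 4096 bytes
 * of metadata, i.e. ~100% memory overhead for the smallest caches,
 * plus a stack walk on every allocation.
 */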
>
> > I should point out that there was one important technical concern
> > about lack of a kill switch for this feature, which was an issue for
> > distributions that can't disable the CONFIG flag. In this series we
> > addressed that concern.
>
> Thanks, that is certainly appreciated. I haven't looked deeper into that
> part but from the cover letter I have understood that CONFIG_MEM_ALLOC_PROFILING
> implies unconditional page_ext and therefore the memory overhead
> associated with that. There seems to be a kill switch, nomem_profiling,
> but from a quick look it doesn't seem to disable page_ext allocations. I
> might be missing something there of course. Having a high-level
> description for that would be really nice as well.
Right, I will add a description of that as well.
We eliminate the runtime overhead but not the memory one. However, I
believe the memory overhead can also be eliminated using the
page_ext_operations.need callback. I will look into it.
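Roughly along these lines (untested sketch; mem_profiling_enabled is a
hypothetical flag set while parsing the nomem_profiling option, and
union codetag_ref comes from the series' headers):

#include <linux/page_ext.h>

static bool need_alloc_tagging(void)
{
        /* Skip the page_ext allocation entirely when profiling is
         * disabled on the kernel command line. */
        return mem_profiling_enabled;
}

struct page_ext_operations page_alloc_tagging_ops = {
        .size = sizeof(union codetag_ref),
        .need = need_alloc_tagging,
};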
Thanks,
Suren.
>
> > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/
>
> --
> Michal Hocko
> SUSE Labs