Message-ID: <CAJuCfpEkV_+pAjxyEpMqY+x7buZhSpj5qDF6KubsS=ObrQKUZg@mail.gmail.com>
Date: Thu, 4 May 2023 08:08:13 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: Michal Hocko <mhocko@...e.com>
Cc: akpm@...ux-foundation.org, kent.overstreet@...ux.dev,
vbabka@...e.cz, hannes@...xchg.org, roman.gushchin@...ux.dev,
mgorman@...e.de, dave@...olabs.net, willy@...radead.org,
liam.howlett@...cle.com, corbet@....net, void@...ifault.com,
peterz@...radead.org, juri.lelli@...hat.com, ldufour@...ux.ibm.com,
catalin.marinas@....com, will@...nel.org, arnd@...db.de,
tglx@...utronix.de, mingo@...hat.com, dave.hansen@...ux.intel.com,
x86@...nel.org, peterx@...hat.com, david@...hat.com,
axboe@...nel.dk, mcgrof@...nel.org, masahiroy@...nel.org,
nathan@...nel.org, dennis@...nel.org, tj@...nel.org,
muchun.song@...ux.dev, rppt@...nel.org, paulmck@...nel.org,
pasha.tatashin@...een.com, yosryahmed@...gle.com,
yuzhao@...gle.com, dhowells@...hat.com, hughd@...gle.com,
andreyknvl@...il.com, keescook@...omium.org,
ndesaulniers@...gle.com, gregkh@...uxfoundation.org,
ebiggers@...gle.com, ytcoode@...il.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
bristot@...hat.com, vschneid@...hat.com, cl@...ux.com,
penberg@...nel.org, iamjoonsoo.kim@....com, 42.hyeyoo@...il.com,
glider@...gle.com, elver@...gle.com, dvyukov@...gle.com,
shakeelb@...gle.com, songmuchun@...edance.com, jbaron@...mai.com,
rientjes@...gle.com, minchan@...gle.com, kaleshsingh@...gle.com,
kernel-team@...roid.com, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, iommu@...ts.linux.dev,
linux-arch@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-mm@...ck.org, linux-modules@...r.kernel.org,
kasan-dev@...glegroups.com, cgroups@...r.kernel.org
Subject: Re: [PATCH 00/40] Memory allocation profiling
On Thu, May 4, 2023 at 2:07 AM Michal Hocko <mhocko@...e.com> wrote:
>
> On Wed 03-05-23 08:09:28, Suren Baghdasaryan wrote:
> > On Wed, May 3, 2023 at 12:25 AM Michal Hocko <mhocko@...e.com> wrote:
> [...]
> > Thanks for summarizing!
> >
> > > At least those I find the most important:
> > > - This is a big change and it adds a significant maintenance burden
> > > because each allocation entry point needs to be handled specifically.
> > > The cost will grow with the intended coverage, especially when
> > > allocation is hidden in library code.
> >
> > Do you mean that with more allocations in the codebase, more codetags
> > will be generated? Is that the concern?
>
> No. I am mostly concerned about the _maintenance_ overhead. For the
> bare tracking (without profiling and thus stack traces) only those
> allocations that are directly inlined into the consumer are really
> of any use. That increases the code impact of the tracing because any
> relevant allocation location has to go through the micro-surgery.
>
> e.g. is it really interesting to know that there is a likely memory
> leak in seq_file proper doing an allocation? No, as it is most likely
> the specific implementation using seq_file that is leaking. There are
> other examples like that. See?
Yes, I see that. One-level tracking does not provide all the
information needed to track down such issues, and something more
informative would cost more. That's why our proposal is a lightweight
mechanism to get a high-level picture, with the ability to zoom into a
specific area using context capture. If you have ideas to improve
this, I'm open to suggestions.
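To make the lightweight part concrete, here is a rough sketch of the
idea (simplified and illustrative, not the exact macros from the
series): each wrapped call site emits a static tag into a dedicated
section at compile time, so the hot path is a direct counter update
with no runtime lookup. This is also why the wrapper has to live at
the consumer's call site rather than inside library code like
seq_file.

#include <linux/atomic.h>

struct alloc_tag {
        const char *filename;
        int lineno;
        atomic64_t bytes;       /* bytes allocated from this site */
};

/* Illustrative wrapper; names and layout are not the real macros. */
#define alloc_hooks(_do_alloc, _size)                                   \
({                                                                      \
        static struct alloc_tag _tag __section("alloc_tags") = {        \
                .filename = __FILE__,                                   \
                .lineno   = __LINE__,                                   \
        };                                                              \
        atomic64_add(_size, &_tag.bytes);                               \
        (_do_alloc);                                                    \
})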
>
> > Or maybe as you commented in
> > another patch that context capturing feature does not limit how many
> > stacks will be captured?
>
> That is a memory overhead which can be really huge and it would be nice
> to be more explicit about that in the cover letter. It is a downside for
> sure but not something that has a code maintenance impact, and it is
> opt-in, so it can be enabled only when necessary.
You are right, I'll add that into the cover letter.
>
> Quite honestly, though, the more I look into the context capturing part
> seems to me that there is much more to be reconsidered there and if you
> really want to move forward with the code tagging part then you should
> drop that for now. It would make the whole series smaller and easier to
> digest.
Sure, I don't see an issue with removing that for now and refining the
mechanism before posting again.
>
> > > - It has been brought up that this is duplicating functionality already
> > > available via existing tracing infrastructure. You should make it very
> > > clear why that is not suitable for the job
> >
> > I experimented with using tracing with _RET_IP_ to implement this
> > accounting. The major issue is the runtime overhead of the
> > _RET_IP_-to-codetag lookup, which is orders of magnitude higher than
> > in the proposed code tagging approach, where that link is resolved at
> > compile time. Since we want this mechanism deployed in production, we
> > want to keep the overhead to the absolute minimum.
> > You asked me before how much overhead would be tolerable and the
> > answer will always be "as small as possible". This is especially true
> > for slab allocators, which are ridiculously fast, so regressing them
> > would be very noticeable (due to their frequent use).
>
> It would have been more convincing if you had some numbers at hand.
> E.g. this is a typical workload we are dealing with. With the compile
> time tags we are able to learn this at this much cost. With dynamic
> tracing we are able to learn this much at that cost. See? "As small as
> possible" is a rather vague term that different people will have very
> different ideas about.
I'm rerunning my tests with the latest kernel to collect the
comparison data. I profiled these solutions before, but the kernel has
changed since then, so the numbers need to be refreshed.
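In the meantime, to show where the cost difference comes from, here is
a simplified sketch of the tracing-based hot path (locking omitted;
struct site_counter and track_alloc() are illustrative, though the
hashtable helpers are the stock <linux/hashtable.h> ones):

#include <linux/hashtable.h>
#include <linux/slab.h>

struct site_counter {
        unsigned long ret_ip;
        atomic64_t bytes;
        struct hlist_node node;
};

static DEFINE_HASHTABLE(site_table, 12);        /* 4096 buckets */

/*
 * Every allocation pays a hash walk before it can bump a counter, and
 * counter structures have to be allocated at runtime because the set
 * of call sites is not known in advance.
 */
static void track_alloc(unsigned long ret_ip, size_t size)
{
        struct site_counter *c;

        hash_for_each_possible(site_table, c, node, ret_ip)
                if (c->ret_ip == ret_ip)
                        goto found;

        c = kzalloc(sizeof(*c), GFP_ATOMIC);
        if (!c)
                return;
        c->ret_ip = ret_ip;
        hash_add(site_table, &c->node, ret_ip);
found:
        atomic64_add(size, &c->bytes);
}

With code tagging the counter address is a compile-time constant, so
the equivalent hot path collapses to a single atomic update.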
>
> > There is another issue, which I think can be solved in a smart way but
> > will either affect performance or require more memory. With the
> > tracing approach we don't know beforehand how many individual
> > allocation sites exist, so we have to allocate code tags (or similar
> > structures for counting) at runtime vs compile time. We can be smart
> > about it and allocate in batches or even preallocate more than we need
> > beforehand but, as I said, it will require some kind of compromise.
>
> I have tried our usual distribution config (only vmlinux without modules
> so the real impact will be larger as we build a lot of stuff into
> modules) just to get an idea:
> text data bss dec hex filename
> 28755345 17040322 19845124 65640791 3e99957 vmlinux.before
> 28867168 17571838 19386372 65825378 3ec6a62 vmlinux.after
>
> Less than 1% for text, 3% for data. This is not all that terrible
> for an initial submission and a more dynamic approach could be added
> later. E.g. with a smaller pre-allocated hash table that could be
> expanded lazily. Anyway, not something I would be losing sleep over. This
> can always be improved later on.
Ah, right. I should have mentioned this overhead too. Thanks for
keeping me honest.
> > I understand that code tagging creates additional maintenance burdens
> > but I hope it also produces enough benefits that people will want
> > this. The cost is also hopefully amortized when additional
> > applications like the ones we presented in RFC [1] are built using the
> > same framework.
>
> TBH I am much more concerned about the maintenance burden on the MM side
> than the actual code tagging itself, which is much more self-contained. I
> haven't seen other potential applications of the same infrastructure and
> maybe the code impact would be much smaller than in the MM proper. Our
> allocator API is really hairy and convoluted.
Yes, other applications are much smaller and cleaner. MM allocation
code is quite complex indeed.
>
> > > - We already have page_owner infrastructure that provides allocation
> > > tracking data. Why can't it be used/extended?
> >
> > 1. The overhead.
>
> Do you have any numbers?
Will post once my tests are completed.
>
> > 2. Covers only page allocators.
>
> Yes this sucks.
> >
> > I didn't think about extending the page_owner approach to slab
> > allocators but I suspect it would not be trivial. I don't see
> > attaching an owner to every slab object as a scalable solution. The
> > overhead would again be of concern here.
>
> This would have been a nice argument to mention in the changelog so that
> we know that you have considered that option at least. Why should I (as
> a reviewer) have to guess at that?
Sorry, it's hard to remember all the decisions, discussions and
conclusions when working on a feature over a long time period. I'll
include more information about that.
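For the record, here is the back-of-the-envelope concern with
per-object owners (the record layout is hypothetical and the sizes
illustrative):

#include <linux/stackdepot.h>

/* A minimal page_owner-style record attached to each slab object: */
struct slab_obj_owner {
        depot_stack_handle_t handle;    /* stack depot handle, 4 bytes */
        pid_t pid;                      /* allocating task, 4 bytes */
};

/*
 * page_owner pays one such record per page. Per slab object, a
 * kmalloc-8 slab on a 4K page would need
 *      512 objects * 8 bytes = 4096 bytes
 * of metadata, i.e. ~100% memory overhead for the smallest caches,
 * plus a stack walk on every allocation.
 */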
>
> > I should point out that there was one important technical concern
> > about lack of a kill switch for this feature, which was an issue for
> > distributions that can't disable the CONFIG flag. In this series we
> > addressed that concern.
>
> Thanks, that is certainly appreciated. I haven't looked deeper into that
> part but from the cover letter I have understood that CONFIG_MEM_ALLOC_PROFILING
> implies unconditional page_ext and therefore the memory overhead
> associated with that. There seems to be a kill switch, nomem_profiling,
> but from a quick look it doesn't seem to disable page_ext allocations. I
> might be missing something there of course. Having a high-level
> description for that would be really nice as well.
Right, I will add a description of that as well.
We eliminate the runtime overhead but not the memory one. However, I
believe the memory overhead can also be eliminated using the
page_ext_operations.need callback. I will look into it.
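Roughly along these lines (untested sketch; mem_profiling_enabled is a
hypothetical flag set while parsing the nomem_profiling option, and
union codetag_ref comes from the series' headers):

#include <linux/page_ext.h>

static bool need_alloc_tagging(void)
{
        /* Skip the page_ext allocation entirely when profiling is
         * disabled on the kernel command line. */
        return mem_profiling_enabled;
}

struct page_ext_operations page_alloc_tagging_ops = {
        .size = sizeof(union codetag_ref),
        .need = need_alloc_tagging,
};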
Thanks,
Suren.
>
> > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/
>
> --
> Michal Hocko
> SUSE Labs