linux-kernel - Re: [PATCH] mm/page_owner: print largest memory consumer when OOM panic occurs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1577169946.4959.4.camel@mtkswgap22>
Date:   Tue, 24 Dec 2019 14:45:46 +0800
From:   Miles Chen <miles.chen@...iatek.com>
To:     Qian Cai <cai@....pw>
CC:     Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...e.com>, <linux-kernel@...r.kernel.org>,
        <linux-mm@...ck.org>, <linux-mediatek@...ts.infradead.org>,
        <wsd_upstream@...iatek.com>
Subject: Re: [PATCH] mm/page_owner: print largest memory consumer when OOM
 panic occurs

On Mon, 2019-12-23 at 07:32 -0500, Qian Cai wrote:
> 
> > On Dec 23, 2019, at 6:33 AM, Miles Chen <miles.chen@...iatek.com> wrote:
> > 
> > Motivation:
> > -----------
> > 
> > When debug with a OOM kernel panic, it is difficult to know the
> > memory allocated by kernel drivers of vmalloc() by checking the
> > Mem-Info or Node/Zone info. For example:
> > 
> >  Mem-Info:
> >  active_anon:5144 inactive_anon:16120 isolated_anon:0
> >   active_file:0 inactive_file:0 isolated_file:0
> >   unevictable:0 dirty:0 writeback:0 unstable:0
> >   slab_reclaimable:739 slab_unreclaimable:442469
> >   mapped:534 shmem:21050 pagetables:21 bounce:0
> >   free:14808 free_pcp:3389 free_cma:8128
> > 
> >  Node 0 active_anon:20576kB inactive_anon:64480kB active_file:0kB
> >  inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> >  mapped:2136kB dirty:0kB writeback:0kB shmem:84200kB shmem_thp: 0kB
> >  shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB
> >  all_unr eclaimable? yes
> > 
> >  Node 0 DMA free:14476kB min:21512kB low:26888kB high:32264kB
> >  reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> >  active_file: 0kB inactive_file:0kB unevictable:0kB writepending:0kB
> >  present:1048576kB managed:952736kB mlocked:0kB kernel_stack:0kB
> >  pagetables:0kB bounce:0kB free_pcp:2716kB local_pcp:0kB free_cma:0kB
> > 
> > The information above tells us the memory usage of the known memory
> > categories and we can check the abnormal large numbers. However, if a
> > memory leakage cannot be observed in the categories above, we need to
> > reproduce the issue with CONFIG_PAGE_OWNER.
> > 
> > It is possible to read the page owner information from coredump files.
> > However, coredump files may not always be available, so my approach is
> > to print out the largest page consumer when OOM kernel panic occurs.
> 
> Many of those patches helping debugging special cases had been shot down in the past. I don’t see much difference this time. If you worry about memory leak, enable kmemleak and then to reproduce. Otherwise, we will end up with too many heuristics just for debugging.
> 

Thanks for your comment.

We use kmemleak too, but a memory leakage which is caused by
alloc_pages() in a kernel device driver cannot be caught by kmemleak.
We have fought against this kind of real problems for a few years and 
find a way to make the debugging easier.

We currently have information during OOM: process Node, zone, swap, 
process (pid, rss, name), slab usage, and the backtrace, order, and
gfp flags of the OOM backtrace. 
We can tell many different types of OOM problems by the information
above except the alloc_pages() leakage.

The patch does work and save a lot of debugging time.
Could we consider the "greatest memory consumer" as another useful 
OOM information?


Miles
> > 
> > The heuristic approach assumes that the OOM kernel panic is caused by
> > a single backtrace. The assumption is not always true but it works in
> > many cases during our test.
> > 
> > We have tested this heuristic approach since 2019/5 on android devices.
> > In 38 internal OOM kernel panic reports:
> > 
> > 31/38: can be analyzed by using existing information
> > 7/38: need page owner formatino and the heuristic approach in this patch
> > prints the correct backtraces of abnormal memory allocations. No need to
> > reproduce the issues.