linux-kernel - Re: [PATCH] net/skbuff: silence warnings under memory pressure

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1567546948.5576.68.camel@lca.pw>
Date:   Tue, 03 Sep 2019 17:42:28 -0400
From:   Qian Cai <cai@....pw>
To:     Michal Hocko <mhocko@...nel.org>
Cc:     Eric Dumazet <eric.dumazet@...il.com>, davem@...emloft.net,
        netdev@...r.kernel.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] net/skbuff: silence warnings under memory pressure

On Tue, 2019-09-03 at 20:53 +0200, Michal Hocko wrote:
> On Tue 03-09-19 11:42:22, Qian Cai wrote:
> > On Tue, 2019-09-03 at 15:22 +0200, Michal Hocko wrote:
> > > On Fri 30-08-19 18:15:22, Eric Dumazet wrote:
> > > > If there is a risk of flooding the syslog, we should fix this
> > > > generically
> > > > in mm layer, not adding hundred of __GFP_NOWARN all over the places.
> > > 
> > > We do already ratelimit in warn_alloc. If it isn't sufficient then we
> > > can think of a different parameters. Or maybe it is the ratelimiting
> > > which doesn't work here. Hard to tell and something to explore.
> > 
> > The time-based ratelimit won't work for skb_build() as when a system under
> > memory pressure, and the CPU is fast and IO is so slow, it could take a long
> > time to swap and trigger OOM.
> 
> I really do not understand what does OOM and swapping have to do with
> the ratelimiting here. The sole purpose of the ratelimit is to reduce
> the amount of warnings to be printed. Slow IO might have an effect on
> when the OOM killer is invoked but atomic allocations are not directly
> dependent on IO.

When there is a heavy memory pressure, the system is trying hard to reclaim
memory to fill up the watermark. However, the IO is slow to page out, but the
memory pressure keep draining atomic reservoir, and some of those skb_build()
will fail eventually.

Only if there is a fast IO, it will finish swapping sooner and then invoke the
OOM to end the memory pressure.

> 
> > I suppose what happens is those skb_build() allocations are from softirq,
> > and
> > once one of them failed, it calls printk() which generates more interrupts.
> > Hence, the infinite loop.
> 
> Please elaborate more.
> 

If you look at the original report, the failed allocation dump_stack() is,

 <IRQ>
 warn_alloc.cold.43+0x8a/0x148
 __alloc_pages_nodemask+0x1a5c/0x1bb0
 alloc_pages_current+0x9c/0x110
 allocate_slab+0x34a/0x11f0
 new_slab+0x46/0x70
 ___slab_alloc+0x604/0x950
 __slab_alloc+0x12/0x20
 kmem_cache_alloc+0x32a/0x400
 __build_skb+0x23/0x60
 build_skb+0x1a/0xb0
 igb_clean_rx_irq+0xafc/0x1010 [igb]
 igb_poll+0x4bb/0xe30 [igb]
 net_rx_action+0x244/0x7a0
 __do_softirq+0x1a0/0x60a
 irq_exit+0xb5/0xd0
 do_IRQ+0x81/0x170
 common_interrupt+0xf/0xf
 </IRQ>

Since it has no __GFP_NOWARN to begin with, it will call,

printk
  vprintk_default
    vprintk_emit
      wake_up_klogd
        irq_work_queue
          __irq_work_queue_local
            arch_irq_work_raise
              apic->send_IPI_self(IRQ_WORK_VECTOR)
                smp_irq_work_interrupt
                  exiting_irq
                    irq_exit

and end up processing pending net_rx_action softirqs again which are plenty due
to connected via ssh etc.