linux-kernel - Re: [PATCH] printk: Add best-effort printk() buffering.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20170515131517.GB3383@pathway.suse.cz>
Date:   Mon, 15 May 2017 15:15:17 +0200
From:   Petr Mladek <pmladek@...e.com>
To:     Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
Cc:     sergey.senozhatsky.work@...il.com, sergey.senozhatsky@...il.com,
        linux-kernel@...r.kernel.org, mingo@...hat.com,
        peterz@...radead.org, rostedt@...dmis.org
Subject: Re: [PATCH] printk: Add best-effort printk() buffering.

On Wed 2017-05-10 20:27:32, Tetsuo Handa wrote:
> Sergey Senozhatsky wrote:
> > On (05/09/17 20:41), Tetsuo Handa wrote:
> > [..]
> > > > what I meant was -- "can we sleep under printk_buffered_begin() or not".
> > > > printk-safe disables local IRQs. so what I propose is something like this
> > > > 
> > > > 	printk-safe-enter    //disable local IRQs, use per-CPU buffer
> > > > 	backtrace
> > > > 	printk-safe-exit     //flush per-CPU buffer, enable local IRQs
> > > > 
> > > > except that 'printk-safe-enter/exit' will have new names here, say
> > > > printk-buffered-begin/end, and, probably, handle flush differently.
> > > 
> > > OK. Then, answer is that we are allowed to sleep after get_printk_buffer()
> > > if get_printk_buffer() is called from schedulable context because different
> > > printk_buffer will be assigned by get_printk_buffer() if get_printk_buffer()
> > > is called from non-schedulable context.
> > > 
> > > > 
> > > > 
> > > > > > hm, 16 is rather random, it's too much for UP and probably not enough for
> > > > > > a 240 CPUs system. for the time being there are 3 buffered-printk users
> > > > > > (as far as I can see), but who knows how more will be added in the future.
> > > > > > each CPU can have overlapping printks from process, IRQ and NMI contexts.
> > > > > > for NMI we use printk-nmi buffers, so it's out of the list; but, in general,
> > > > > > *it seems* that we better depend on the number of CPUs the system has.
> > > > > > which, once again, returns us back to printk-safe...
> > > > > > 
> > > > > > thoughts?
> > > > > 
> > > > > I can make 16 a CONFIG_ option.
> > > > 
> > > > but still, why use additional N buffers, when we already have per-CPU
> > > > buffers? what am I missing?
> > > 
> > > Per-CPU buffers need to disable preemption by disabling local hard
> > > IRQ / soft IRQ. But printk_buffers need not to disable preemption.
> > 
> > yes. ok. seems that I can't explain what I want.
> > 
> > my point is:
> > 
> > printk-buffered does not disable preemption and we can sleep under
> > printk-buffered-begin. fine. but why would you want to sleep there anyway?
> > you just want to print a backtrace and be done with it. and backtracing
> > does not sleep, afaiu, or it least it should not, because it must be
> > possible to dump_stack() from atomic context. so why have
> > 
> > 	printk-buffered keeps preemption and irqs enable and uses one
> > 	of aux buffers (if any).
> > 
> > instead of
> > 
> > 	printk-buffered starts an atomic section - it disables preemption
> > 	and local irqs, because it uses per-CPU buffer (which is always,
> > 	and already, there).
> > 
> > ?
> 
> Backtrace is just an example chunk where multiple lines are expected to be
> printed together. In practice, backtraces are surrounded by preceding lines
> and trailing lines. For example, an OOM killer message looks like
> 
> [   88.584836] a.out invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
> [   88.590706] a.out cpuset=/ mems_allowed=0
> [   88.594524] CPU: 1 PID: 2981 Comm: a.out Not tainted 4.11.0-rc7-next-20170421+ #588
> [   88.599445] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [   88.604543] Call Trace:
> [   88.606779]  ? dump_stack+0x5c/0x7d
> [   88.608655]  ? dump_header+0x97/0x233
> [   88.610083]  ? oom_kill_process+0x223/0x3e0
> [   88.611511]  ? has_capability_noaudit+0x17/0x20
> [   88.613046]  ? oom_badness+0xeb/0x160
> [   88.614359]  ? out_of_memory+0x10b/0x490
> [   88.615592]  ? __alloc_pages_slowpath+0x701/0x8e2
> [   88.616947]  ? __alloc_pages_nodemask+0x1ed/0x210
> [   88.618267]  ? alloc_pages_vma+0x9f/0x220
> [   88.619475]  ? __handle_mm_fault+0xc22/0x11e0
> [   88.620793]  ? handle_mm_fault+0xc5/0x220
> [   88.622222]  ? __do_page_fault+0x21e/0x4b0
> [   88.623903]  ? do_page_fault+0x2b/0x70
> [   88.625235]  ? page_fault+0x28/0x30
> [   88.626392] Mem-Info:
> [   88.627314] active_anon:395525 inactive_anon:2096 isolated_anon:0
> [   88.627314]  active_file:60 inactive_file:56 isolated_file:0
> [   88.627314]  unevictable:0 dirty:7 writeback:0 unstable:0
> [   88.627314]  slab_reclaimable:2211 slab_unreclaimable:6459
> [   88.627314]  mapped:411 shmem:2162 pagetables:2418 bounce:0
> [   88.627314]  free:13040 free_pcp:54 free_cma:0
> [   88.637107] Node 0 active_anon:1582100kB inactive_anon:8384kB active_file:240kB inactive_file:224kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1644kB dirty:28kB writeback:0kB shmem:8648kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1349632kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [   88.644027] Node 0 DMA free:7176kB min:408kB low:508kB high:608kB active_anon:8696kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:40kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> [   88.651384] lowmem_reserve[]: 0 1696 1696 1696
> [   88.653015] Node 0 DMA32 free:44984kB min:44644kB low:55804kB high:66964kB active_anon:1573396kB inactive_anon:8384kB active_file:1268kB inactive_file:968kB unevictable:0kB writepending:28kB present:2080640kB managed:1756232kB mlocked:0kB slab_reclaimable:8844kB slab_unreclaimable:25804kB kernel_stack:5856kB pagetables:9632kB bounce:0kB free_pcp:248kB local_pcp:40kB free_cma:0kB
> [   88.661952] lowmem_reserve[]: 0 0 0 0
> [   88.663388] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (M) 2*1024kB (UM) 0*2048kB 1*4096kB (M) = 7140kB
> [   88.666629] Node 0 DMA32: 1041*4kB (UME) 616*8kB (UE) 157*16kB (UE) 95*32kB (UE) 59*64kB (UME) 61*128kB (ME) 27*256kB (UME) 15*512kB (ME) 4*1024kB (UME) 0*2048kB 0*4096kB = 44916kB
> [   88.671430] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
> [   88.674182] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [   88.676448] 2299 total pagecache pages
> [   88.678121] 0 pages in swap cache
> [   88.679472] Swap cache stats: add 0, delete 0, find 0/0
> [   88.681159] Free swap  = 0kB
> [   88.682698] Total swap = 0kB
> [   88.683996] 524157 pages RAM
> [   88.685385] 0 pages HighMem/MovableOnly
> [   88.686970] 81123 pages reserved
> [   88.688221] 0 pages cma reserved
> [   88.689831] 0 pages hwpoisoned
> [   88.691238] Out of memory: Kill process 2982 (a.out) score 999 or sacrifice child
> [   88.693403] Killed process 2982 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> 
> where showing memory related information involves full of pr_cont() usage.
> 
> If sysctl_oom_dump_tasks is set to 1, hundreds to a few thousands of lines
> could be added by dump_tasks() and per-CPU buffers might not be large enough
> (e.g. CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13 is only 8KB, isn't it?).

A solution might be to use separate buffers for this buffered
context but it would be great to reuse some of the printk_safe code.

This is one of my problems with this patch. It adds a lot of
non-trivial code and buffers that are handled very different
way than the others. It might be much more acceptable:

   + if it reuses some of the existing code for temporary
     buffering (from printk_safe.c)

   + if it is usable for continuous lines (transparently?) and allows
     to remove the complex code around struct cont (immediately or
     middle term)


There is one catch with the transparent solution. Using per-CPU
buffers might require disabling preemption. But normal printk()
must not have side effects. I mean that it would be quite error
prone if printk() disables interrupts when it prints a text
without a newline.

A solution might be to use the pool of buffers that might
be assigned to a task. But we still should try to reuse
the printk_safe code for writing and flushing the buffers.


> Surrounding oom_kill_process() with printk_safe_enter()/printk_safe_exit() is not
> enough, and there is no need to disable hard IRQ during oom_kill_process().

As you say, there is no need to disable hard IRQ during oom_kill_process().
But will it cause any harm? As Sergey mentioned, would you want to
sleep with some messages buffered?

One big question if we want to use separate buffers in this
situations at all. Some of these reports are often printed when
the system is in very bad state. We should try to get the messages
to the console immediately when the system is dying.


> I don't want to worry about current thread migrating to different CPUs during
> oom_kill_process() and messages stored during printk_safe_enter()/printk_safe_exit() are
> written to console devices out of order. Can the printk kernel thread guarantee ordering?

printk_safe_enter*() disables interrupts. As a result
the process cannot be migrated in this context and the ordering
is guaranteed.


Best Regards,
Petr