[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 18 Dec 2017 14:31:01 +0100
From:   Petr Mladek <pmladek@...e.com>
To:     Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>
Cc:     Steven Rostedt <rostedt@...dmis.org>, Tejun Heo <tj@...nel.org>,
        Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        Jan Kara <jack@...e.cz>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Rafael Wysocki <rjw@...ysocki.net>,
        Pavel Machek <pavel@....cz>,
        Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC][PATCHv6 00/12] printk: introduce printing kernel thread

On Mon 2017-12-18 18:36:15, Sergey Senozhatsky wrote:
> On (12/15/17 10:08), Petr Mladek wrote:
> 1) it opens both soft and hard lockup vectors
> 
>    I see *a lot* of cases when a CPU that calls printk in a loop does
>    not end up flushing its messages. And the problem seems to be preemption.
> 
> 
>   CPU0						CPU1
> 
>   for_each_process_thread(g, p)
>     printk()

You print one message for each process in a tight loop.
Is there code like this?
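
For concreteness, I imagine you mean something of this shape (a
made-up illustration, not a pointer to real code in the tree):

#include <linux/sched/signal.h>
#include <linux/printk.h>

/* Made-up illustration: one printk() per task, all emitted
 * from a single CPU in a tight loop.
 */
static void dump_all_tasks(void)
{
	struct task_struct *g, *p;

	rcu_read_lock();
	for_each_process_thread(g, p)
		pr_info("task %s/%d\n", p->comm, p->pid);
	rcu_read_unlock();
}

With thousands of tasks, that is thousands of lines pushed through
the console from a single CPU.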

I think that a more realistic scenario is to print a message for
each CPU. But even then, the messages are usually written by each
CPU separately and thus are less synchronized.

One exception might be the backtraces of all CPUs. These
are printed in NMI and flushed synchronously from a
single CPU. But they were recently optimized by
not printing the backtraces of idle threads.
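
If I remember the lib/nmi_backtrace.c change correctly, the shortcut
looks roughly like this (paraphrased from memory, so take the details
with a grain of salt):

	/* Idling CPUs get a one-line notice instead of a full
	 * backtrace, which keeps the all-CPU dump short on mostly
	 * idle systems.
	 */
	if (regs && cpu_in_idle(instruction_pointer(regs))) {
		pr_warn("NMI backtrace for cpu %d skipped: idling at pc %#lx\n",
			cpu, instruction_pointer(regs));
	} else {
		pr_warn("NMI backtrace for cpu %d\n", cpu);
		if (regs)
			show_regs(regs);
		else
			dump_stack();
	}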


> > Why did you completely ignore the paragraph about the step-by-step
> > approach? Is there anything wrong with it?
> 
> frankly, I don't even understand what our plan is. I don't see how we
> are going to verify the patch. over the last 3 years, how many emails
> do you have from facebook or samsung or linaro, or any other company,
> reporting printk-related lockups? I don't have that many in my gmail
> inbox, to be honest. and this is not because there were no printk lockups
> observed. this is because people usually don't report those issues to
> the upstream community, especially vendors that use "outdated" LTS
> kernels, which are approx 1-2 years behind the mainline. and I don't see
> why it should be different this time. it will take years before vendors
> pick up the next LTS kernel, which will have that patch in it. but the really
> big problem here is that we already know that the patch has some problems.
> are we going to conclude that "no emails === no problems"? with all my
> respect, it does seem like, in the grand scheme of things, we are going
> to do the same thing, yet expect a different result. that's my worry.

The patches for offloading printk console work have been floating
around for about 5 years and nothing has gone upstream. Therefore
2 more years look acceptable if we could actually make things a bit
better now. Well, I believe that we would know about the usefulness
earlier than that.


> > You are looking for a perfect solution.
> 
> gentlemen, once again, you really can name it whatever you like.
> the bottom line is [speaking for myself only]:
> 
> -  the patch does not fit my needs.
> -  it does not address the issues I'm dealing with.

I am still missing a real-life reproducer of the problems.

> -  it has significantly worse behaviour compared to the old async printk.
> -  it keeps tasks sleeping on console_sem in TASK_UNINTERRUPTIBLE
>       for a long time.
> -  it makes user space apps time out.
> -  it re-introduces the lockup vector by passing console_sem ownership
>       to atomic tasks.

All of this is in the current upstream code as well. Steven's patch
should make it better in comparison with the current upstream code.

Sure, the printk offload approach handles all these things better.
But there is still the fear that the offload is not reliable
in all situations. The lazy offload should handle this well from
my POV, but I am not 100% sure.

Once we have hard data (real-life reproducers) in hand, we could
start pushing the offloading again.
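
To be explicit about what I mean by the lazy offload (a rough sketch
with made-up names, not the actual patch):

	/* Rough sketch of the lazy offload idea; all names here are
	 * made up.  The printk() caller flushes the console itself
	 * and wakes a dedicated kthread only after it has spent too
	 * long doing so, so the offload path is exercised rarely.
	 */
	#define OFFLOAD_AFTER_NS	(100 * NSEC_PER_MSEC)

	static struct task_struct *printk_kthread;

	static void console_flush_some(void)
	{
		u64 start = local_clock();

		while (more_lines_pending()) {
			flush_one_line();
			if (local_clock() - start > OFFLOAD_AFTER_NS) {
				/* hand the rest over to the kthread */
				wake_up_process(printk_kthread);
				break;
			}
		}
	}

The point is that the direct path stays the default, and the kthread
is only a safety valve when one CPU has been flushing for too long.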


> -  it doubles the amount of time a CPU spins with local IRQs disabled
>       in the worst case.

It happens only during the handshake. And the other CPU waits only
until a single line is flushed. That is much less than the time
a single CPU might spend flushing lines.
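
For reference, the handshake in Steven's patch works roughly like
this (a simplified sketch of the idea, not the patch itself; the
real code protects the owner state with a spinlock):

	/* Simplified sketch of the console owner hand-off: a CPU that
	 * hits printk() while another CPU is flushing spins only until
	 * the current line is written, then takes over the flushing.
	 * The spinning side has interrupts disabled, which is the
	 * doubling mentioned above, but it is bounded by one line.
	 */
	static bool console_waiter;

	/* loser of the console_sem race */
	static void wait_for_handover(void)
	{
		WRITE_ONCE(console_waiter, true);
		while (READ_ONCE(console_waiter))
			cpu_relax();	/* at most one line's worth */
		/* we are the console owner now */
	}

	/* current owner, after flushing each line */
	static void maybe_hand_over(void)
	{
		if (READ_ONCE(console_waiter))
			WRITE_ONCE(console_waiter, false); /* release waiter */
	}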


> -  I probably need to run more tests [I didn't execute any OOM tests,
>       etc.], but, preliminarily, I can't see Samsung using this patch.

Does Samsung already use some offload implementation?

Best Regards,
Petr
