linux-kernel - Re: hostile takeover: Re: [PATCH printk v2 3/8] printk: nbcon: Add acquire/release logic

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZPnu5P3b74K1mYTu@alley>
Date:   Thu, 7 Sep 2023 17:40:20 +0200
From:   Petr Mladek <pmladek@...e.com>
To:     John Ogness <john.ogness@...utronix.de>
Cc:     Sergey Senozhatsky <senozhatsky@...omium.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        linux-kernel@...r.kernel.org,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Subject: Re: hostile takeover: Re: [PATCH printk v2 3/8] printk: nbcon: Add
 acquire/release logic

On Tue 2023-08-29 12:28:10, John Ogness wrote:
> On 2023-08-09, Petr Mladek <pmladek@...e.com> wrote:
> >> Add per console acquire/release functionality. The console 'locked'
> >> state is a combination of multiple state fields:
> >> 
> >>   - Hostile takeover
> >> 
> >>       The new owner takes the console over without 'req_prio'
> >>       handshake.
> >> 
> >>       This is required when friendly handovers are not possible,
> >>       i.e. the higher priority context interrupted the owning
> >>       context on the same CPU or the owning context is not able
> >>       to make progress on a remote CPU.
> >
> > I always expected that there would be only one hostile takeover.
> > It would allow to take the lock a harsh way when the friendly
> > way fails.
> 
> You are referring to the hostile takeover when unsafe. A hostile
> takeover when safe is still considered a hostile takeover. (At least,
> until now that is how it was considered. More below...)

Yes.

Note: I did not distinguish much the "takeover" and "handover"
      words when thinking about the logic. They sound almost
      the same to me. I think that I have some partial dyslexia[*]
      or so.

      Also I had the words "hostile" and "harsh" connected with
      the "takeover when unsafe".

      As a result, I made a simplified mental model where
      "handover" and "takeover when safe" were equivalent.

      I am sorry if it caused some confusion. I see
      the difference now.


[*] This might be the reason why I am so sensitive on naming.

> >> All other policy decisions have to be made at the call sites:
> >> 
> >>   - What is marked as an unsafe section.
> >>   - Whether to spinwait if there is already an owner.
> >>   - Whether to attempt a hostile takeover when safe.
> >>   - Whether to attempt a hostile takeover when unsafe.
> >
> > But there seems to be actually two variants. How they are
> > supposed to be used, please?
> >
> > I would expect that a higher priority context would always
> > be able to takeover the lock when it is in a safe context.
> >
> > By other words, "hostile takeover when safe" would be
> > the standard behavior for context with a higher priority.
> 
> The difference is that with "hostile takeover when safe" there is a
> foreign CPU that still thinks it has the lock, even though it does not.

Strictly speaking, the foreign CPU should not expect that it owns
the lock when it is in a safe region. The "safe" means that
the hostile takeover is safe. And the new owner could use
the console without any risk.

My understanding was:

   + The lock allows to emit one record atomically.

   + The "safe region under the lock" marks code when the console
     is in a consistent state and any synchronized resources
     are not accessed.

   + Only a higher priority context might take over the lock
     in the "safe region". Then it might use the console device
     a safe way. And it is supposed to re-emit the interrupted
     record.


I think that the catch is in the "synchronized resources".
It is not only the console device but also the printk buffer.

I though that we needed the extra printk buffer in panic only
because of the harsh takeover when unsafe when the message is being
read and formatted.

And I though that the console drivers would handle characters
one by one. Then it might be safe to read one character
even in "safe" state.

Hmm, I am actually not sure if nbcon_emit_next_record()
calls printk_get_next_message() and con->write_atomic(con, wctxt)
in "safe" or "unsafe" context.

I mean, I am not sure if the access to the printk buffer
is synchronized by the console lock or not. It would be nice
to document it and maybe add some assert_context_is_safe()/unsafe().

> > By other words, the difference between normal takeover and
> > "hostile takeover when safe" is that the 1st one has to
> > wait until the current owner calls nbcon_enter_unsafe().
> > But the result is the same. The current owner might
> > prematurely end after calling nbcon_enter_unsafe().

OK, there is one difference:

  + The "handover" passes the lock on well defined locations.

  + The "takeover when safe" can happen anywhere in the "safe"
    region.

And the "takeover when safe" is currently allowed only in
console_flush_on_panic(). It means that the lock is
passed to a higher priority context on well defined
locations most of the time. Which is a more conservative approach.

> > Maybe, this another relic from the initial more generic approach?
> 
> I suppose so. But then why not try the "hostile takeover when safe"
> first and only use the handover request as a fallback? Like this:
> 
> 1. try direct
> 2. try hostile takeover when safe
> 3. try handover
> 4. try hostile takeover when unsafe
> 
> Then we should remove the "hostile" label for #2 and just call it:
> 
> 1. try direct
> 2. try safe takeover
> 3. try handover
> 4. try hostile takeover

I rather meant:

  1. try direct
  2. try handover +
     try safe takeover in every waiting cycle
  3. try hostile takeover

But then it won't be a handover anymore.

Anyway, I would keep it as is for now. As mentioned above,
the current handover is more conservative approach because
the lock is passed on well defined locations.


> (I guess this is how you imagined things should be.)
> 
> >> +/**
> >> + * struct nbcon_context - Context for console acquire/release
> >> + * @console:		The associated console
> >> + * @spinwait_max_us:	Limit for spinwait acquire
> >> + * @prio:		Priority of the context
> >> + * @unsafe:		This context is in an unsafe section
> >
> > This seems to be an input value for try_acquire(). It is
> > controversial.
> >
> > I guess that it removes the need to call nbcon_enter_unsafe()
> > right after try_acquire_console().
> 
> Yes. It removes the "safe window" between try_acquire_console() and
> nbcon_enter_unsafe() by allowing to acquire and mark unsafe
> atomically. For example, this would be useful for the port_lock()
> wrapper we discussed. But it is not strictly necessary. I can remove it.
> 
> Below I answer your comments anyway.
> 
> > Hmm, this semantic is problematic:
> >
> >   1. The result would be non-paired enter_unsafe()/exit_unsafe()
> >      calls.
> 
> The paired calls are either:
> 
> try_acquire(unsafe=1) -> release()

We should not create such an API. It would be confusing. And it would
create strange code.

People, should be allowed enter/exit unsafe inside the lock.
If the lock sets unsafe by default then the code would
look like:

   try_acquire(unsafe=1)

     #do something synchronized

     exit_unsafe()
     # do something innocent
     enter_unsafe()

     #do something synchronized again

   release()

In this case, it would make more sense to implement the
opposite enter_safe() and exit_safe().

But this would be ugly. It would create the felling that
the code in this section might do anything because it is
safe. But it is actually the other way around.
The code in the "safe" section must be very careful because
it must be safe for takeover.

> or
> 
> try_acquire(unsafe=0) + enter_unsafe() -> exit_unsafe() + release()

I am not sure what is the difference between "+" and "->"

Anyway, IMHO, try_acquire() should always return with unsafe=0
(except for unsafe_takeover).

It would allow use enter_unsafe()/exit_unsafe() around
the sensitive code. They both will be in the same function
and "enter" will be before "exit".

BTW: The name enter_unsafe() somehow looks more acceptable than
     the reverted enter_safe(). It makes the feeling that we
     do something unsafe in this section which is actually true.

     I though about a better name then "unsafe" but I did
     not found any.


Best Regards,
Petr