Message-ID: <831e023422aa0e4cb3da37ceef6fdcd5bc854682.camel@marvell.com>
Date: Thu, 23 Jul 2020 15:18:42 +0000
From: Alex Belits <abelits@...vell.com>
To: "tglx@...utronix.de" <tglx@...utronix.de>,
"frederic@...nel.org" <frederic@...nel.org>,
"rostedt@...dmis.org" <rostedt@...dmis.org>
CC: "mingo@...nel.org" <mingo@...nel.org>,
Prasun Kapoor <pkapoor@...vell.com>,
"linux-api@...r.kernel.org" <linux-api@...r.kernel.org>,
"peterz@...radead.org" <peterz@...radead.org>,
"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
"catalin.marinas@....com" <catalin.marinas@....com>,
"will@...nel.org" <will@...nel.org>,
"linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>,
"davem@...emloft.net" <davem@...emloft.net>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH v4 00/13] "Task_isolation" mode
On Thu, 2020-07-23 at 15:17 +0200, Thomas Gleixner wrote:
>
> Without going into details of the individual patches, let me give you a
> high level view of this series:
>
> 1) Entry code handling:
>
> That's completely broken vs. the careful ordering and instrumentation
> protection of the entry code. You can't just slap stuff randomly
> into places which you think are safe w/o actually trying to understand
> why this code is ordered in the way it is.
>
> This clearly was never built and tested with any of the relevant
> debug options enabled. Both build and boot would have told you.
This is intended to avoid a race condition when entry to or exit from isolation
happens at the same time as an event that requires synchronization. The idea is
that the core can be insulated from all such events while it runs an isolated
task in userspace; once it breaks isolation and enters the kernel, it receives
those calls normally again and synchronizes itself on kernel entry.
This has two potential problems that I am trying to solve:
1. Without careful ordering, there is a race condition with events that
happen at the same time as kernel entry or exit.
2. The CPU runs some kernel code after entering the kernel but before
synchronizing. That code has to be restricted to the early entry path that is
not affected by the "stale" state, similar to what IPI code that normally
receives synchronization events already assumes. (A minimal model of the
intended scheme is sketched below.)
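To make the intended scheme concrete, here is a minimal user-space model of
it using C11 atomics. All names in it are mine, for illustration only; none
of this is the actual patch code:

/*
 * Minimal user-space model of entry-time synchronization for an
 * isolated core.  Illustrative only; not taken from the series.
 */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool cpu_isolated;   /* set while the core runs isolated */
static atomic_int  pending_sync;   /* events deferred while isolated */

/* Remote side: record the event instead of interrupting the core. */
static void remote_queue_sync(void)
{
	atomic_fetch_add(&pending_sync, 1);
	/* a non-isolated target would get a real IPI here instead */
}

/* Target side: first thing on kernel entry, before any code that
 * depends on up-to-date state. */
static void isolation_kernel_entry(void)
{
	/* seq_cst exchange: a remote writer either saw the core as
	 * isolated and queued the event, or sent a normal IPI */
	if (atomic_exchange(&cpu_isolated, false))
		while (atomic_exchange(&pending_sync, 0))
			;	/* catch up: process deferred events */
}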
I can't say that I am completely happy with the amount of kernel entry
handling that had to be added. The problem is that I am trying to introduce a
feature that allows CPU cores to go into a "de-synchronized" state while
running isolated tasks, not receiving the synchronization events that would
normally reach them. This means there has to be an established point on kernel
entry at which it is safe for the core to catch up with the rest of the
kernel. Such a point may be useful for other purposes as well, but at this
point task isolation is the first feature to need it, so I had to determine
where it lies for every supported architecture and method of kernel entry.
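The kind of catch-up point I mean is illustrated below against a simplified
version of the generic entry code. The placement of the
task_isolation_enter() call is the proposal, not existing mainline code, and
the function body around it is abbreviated:

/* Simplified sketch; placement of task_isolation_enter() is the
 * proposal, the surrounding code is abbreviated from generic entry. */
noinstr void enter_from_user_mode(struct pt_regs *regs)
{
	lockdep_hardirqs_off(CALLER_ADDR0);
	CT_WARN_ON(ct_state() != CONTEXT_USER);
	user_exit_irqoff();

	instrumentation_begin();
	/* earliest point where instrumented code may run: the core can
	 * now safely catch up with deferred synchronization events */
	task_isolation_enter();
	trace_hardirqs_off_finish();
	instrumentation_end();
}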
I have found that each architecture has its own way of handling this,
and sometimes individual interrupt controller drivers vary in their
sequence of calls on early kernel entry. For x86 I also have an
implementation for kernel 5.6, before your changes to the IDT macros.
That version is much less straightforward, so I am grateful for those
relatively recent improvements.
Nevertheless, I believe that the goal of finding those points and using
them for synchronization is valid. If you can recommend a better way, at
least for x86, I will be happy to follow your advice. I have tried to
cover kernel entry in a generic way while keeping the changes minimally
disruptive, which is why the result looks simple and is spread over
multiple places. I had to do the same for arm and arm64 (which I use for
development), and for each architecture I had to map out the sequences of
entry points and function calls to determine the correct placement of
task_isolation_enter() calls. The placement is not random; it reflects
the complex nature of the kernel entry code. I believe the RCU
implementation faced somewhat similar requirements for calls on kernel
entry, and it is not completely unified, either.
> 2) Instruction synchronization
> Trying to do instruction synchronization delayed is a clear recipe
> for hard to diagnose failures. Just because it blew not up in your
> face does not make it correct in any way. It's broken by design and
> violates _all_ rules of safe instruction patching and introduces a
> complete trainwreck in x86 NMI processing.
The idea is that, just as with synchronization events handled by a regular
IPI, some code already runs under the assumption that it is safe to enter it
in a "stale" state before synchronization happens. I have extended that
assumption to allow synchronization points on all kernel entry paths.
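In other words, the sender skips the IPI for an isolated core, records that
an instruction synchronization is owed, and the target performs the
serializing operation itself at the entry-time catch-up point, before
running any other code. A sketch with illustrative names, where instr_sync()
stands in for the architecture's serializing operation (sync_core() on x86,
isb on arm64); this is not the actual patch code:

/* Illustrative model of deferred instruction synchronization. */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool needs_instr_sync;	/* set instead of sending the IPI */

static void instr_sync(void)
{
	/* arch-serializing operation: sync_core(), isb, ... */
}

/* Sender: defer instead of interrupting the isolated core. */
static void defer_sync_to_isolated_core(void)
{
	atomic_store(&needs_instr_sync, true);
}

/* Target: runs at the kernel entry catch-up point, before any code
 * that could execute stale instructions. */
static void catch_up_instr_sync(void)
{
	if (atomic_exchange(&needs_instr_sync, false))
		instr_sync();
}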
> If you really think that this is correct, then please have at least
> the courtesy to come up with a detailed and precise argumentation
> why this is a valid approach.
>
> While writing that up you surely will find out why it is not.
>
I had to document the sequence of calls for every entry point on the three
supported architectures in order to determine the synchronization points. It
is possible that I have missed something, but I don't see a better approach
short of establishing kernel-wide infrastructure for this. And even if we did
that, it would still be possible to implement these synchronization point
calls first and convert them to something more generic later.
>
> 3) Debug calls
>
> Sprinkling debug calls around the codebase randomly is not going to
> happen. That's an unmaintainable mess.
Those calls report the causes of isolation breaking, and are intended for
application and system debugging.
>
> Aside of that none of these dmesg based debug things is necessary.
> This can simply be monitored with tracing.
I think it would be better to make all that information available to the
userspace application; however, I based this on Chris Metcalf's code and
have been gradually updating the mechanisms and interfaces. The original
reporting of isolation-breaking causes had far greater problems, so at first
I wanted something that produces easily visible and correct reporting
without breaking anything in the process.
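If this gets converted to tracing, a tracepoint along these lines could
carry the same information. This is an illustrative sketch only, not part
of the series, and it omits the usual TRACE_SYSTEM / define_trace.h
boilerplate around a trace header:

/* Illustrative tracepoint for isolation-breaking events. */
#include <linux/tracepoint.h>

TRACE_EVENT(task_isolation_break,

	TP_PROTO(int cpu, const char *cause),

	TP_ARGS(cpu, cause),

	TP_STRUCT__entry(
		__field(int, cpu)
		__string(cause, cause)
	),

	TP_fast_assign(
		__entry->cpu = cpu;
		__assign_str(cause, cause);
	),

	TP_printk("cpu=%d cause=%s", __entry->cpu, __get_str(cause))
);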
>
> 4) Tons of undocumented smp barriers
>
> See Documentation/process/submit-checklist.rst #25
>
That should be fixed.
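For example (an illustrative fragment, not from the series), each barrier
would carry a comment naming its pairing counterpart, as the checklist
item asks:

/* Writer: publish the data before setting the flag. */
WRITE_ONCE(shared_data, v);
smp_wmb();	/* pairs with smp_rmb() on the reader side */
WRITE_ONCE(data_ready, 1);

/* Reader: */
if (READ_ONCE(data_ready)) {
	smp_rmb();	/* pairs with smp_wmb() on the writer side */
	v = READ_ONCE(shared_data);
}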
> 5) Signal on page fault
>
> Why is this a magic task isolation feature instead of making it
> something which can be used in general? There are other legit
> reasons why a task might want a notification about an unexpected
> (resolved) page fault.
A page fault breaks isolation. A task runs in isolated mode because it
requires predictable timing, so triggering page faults and expecting them to
be handled along the way would defeat the purpose of isolation. If a page
fault does happen, it is important that the application receive a
notification that isolation was broken, so it can decide what to do about
it, re-enter isolation, and so on.
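From the application's point of view, the flow looks roughly like this. The
prctl constants follow the interface inherited from Chris Metcalf's series,
so treat the exact names as assumptions; they require the patched uapi
headers:

/* Sketch of an isolated application; PR_TASK_ISOLATION* names are
 * assumed from the series' prctl interface, not vanilla headers. */
#include <signal.h>
#include <stdio.h>
#include <sys/prctl.h>

static volatile sig_atomic_t broken;

static void isol_sig(int sig)
{
	broken = 1;	/* e.g. a page fault broke isolation */
}

int main(void)
{
	signal(SIGUSR1, isol_sig);

	/* Enter isolation; ask for SIGUSR1 on any isolation break. */
	if (prctl(PR_TASK_ISOLATION,
		  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_SET_SIG(SIGUSR1),
		  0, 0, 0) < 0)
		return 1;

	while (!broken)
		;	/* latency-critical loop, no kernel entries */

	/* Decide what to do: log, fix the mapping, re-enter isolation. */
	fprintf(stderr, "isolation broken\n");
	return 0;
}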
>
> 6) Coding style violations all over the place
>
> Using checkpatch.pl is mandatory
>
> 7) Not Cc'ed maintainers
>
> While your Cc list is huge, you completely fail to Cc the relevant
> maintainers of various files and subsystems as requested in
> Documentation/process/*
To be honest, I am not sure whom I have missed; I tried to include everyone
from my previous attempt.
>
> 8) Changelogs
>
> Most of the changelogs have something along the lines:
>
> 'task isolation does not want X, so do Y to make it not do X'
>
> without any single line of explanation why this approach was chosen
> and why it is correct under all circumstances and cannot have nasty
> side effects.
This is the same as the previous version, except for the addition of kernel
entry handling. As far as I can tell, the rest was discussed before, and few
questions remained other than the race condition on kernel entry. I agree
that kernel entry handling is a complex issue in itself, so I have included
an explanation of the entry point / function call sequences for each
supported architecture. I also have a longer call diagram that I used to
track each particular function; it should probably be included as a separate
document.
> It's not the job of the reviewers/maintainers to figure this out.
>
> Please come up with a coherent design first and then address the
> identified issues one by one in a way which is palatable and reviewable.
>
> Throwing a big pile of completely undocumented 'works for me' mess over
> the fence does not get you anywhere, not even to the point that people
> are willing to review it in detail.
There is a design, and it is the result of careful tracking of calls in the
kernel source. It has multiple points where task_isolation_enter() is
called, for a reason similar to why RCU-related functions are called in
multiple places.
If someone can recommend a better way to introduce a kernel entry
checkpoint for synchronization that did not exist before, I will be happy
to hear it.
--
Alex