netdev - Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel

Message-ID: <ed60dc426cfd8a2fe5e389d3a7f36bafa6a8439e.camel@marvell.com>
Date:   Tue, 6 Oct 2020 11:01:49 +0000
From:   Alex Belits <abelits@...vell.com>
To:     "frederic@...nel.org" <frederic@...nel.org>
CC:     "mingo@...nel.org" <mingo@...nel.org>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "linux-api@...r.kernel.org" <linux-api@...r.kernel.org>,
        "rostedt@...dmis.org" <rostedt@...dmis.org>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "catalin.marinas@....com" <catalin.marinas@....com>,
        "will@...nel.org" <will@...nel.org>,
        Prasun Kapoor <pkapoor@...vell.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-arm-kernel@...ts.infradead.org" 
        <linux-arm-kernel@...ts.infradead.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard
 isolation from kernel

On Mon, 2020-10-05 at 01:14 +0200, Frederic Weisbecker wrote:
> > > Speaking of which, I agree with Thomas that it's unnecessary. It's
> > > too much code and complexity. We can use the existing trace events
> > > and perform the analysis from userspace to find the source of the
> > > disturbance.
> > 
> > The idea behind this is that isolation breaking events are supposed
> > to be known to the applications while applications run normally, and
> > they should not require any analysis or human intervention to be
> > handled.
> 
> Sure but you can use trace events for that. Just trace interrupts,
> workqueues, timers, syscalls, exceptions and scheduler events and you
> get all the local disturbance. You might want to tune a few filters
> but that's pretty much it.

And keep all tracing enabled all the time, just to be able to figure
out that a disturbance happened at all?

Or do you mean that we can use the kernel entry mechanism to reliably
determine that an isolation-breaking event happened (so the
isolation-breaking procedure can be triggered as early as possible),
yet avoid trying to determine why exactly it happened, and use tracing
if we want to know?

The original patch did the opposite: it triggered the isolation-breaking
procedure only once it was known specifically what kind of event had
happened -- a hardware interrupt, IPI, syscall, page fault, or any
other kind of exception, possibly something architecture-specific.
This, of course, always had a potential problem with coverage -- if
handling of something is missing, isolation breaking is not handled at
all, and there is no obvious way of finding out whether we covered
everything. This also made the patch large and somewhat ugly.

When I added a mechanism for low-level isolation breaking handling
on kernel entry, it also partially solved the problem with
completeness. Only partially, because I have not yet added handling of
an "unknown cause" before returning to userspace; however, that would
be a logical thing to do. Then if we entered the kernel from isolation,
did something, and are returning to userspace still not knowing what
kind of isolation-breaking event happened, we can still trigger
isolation breaking.
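
To illustrate the flow I have in mind, here is a rough userspace model
in C. It is not code from the patch, and every name in it (iso_task,
iso_kernel_entry(), iso_report_cause(), iso_return_to_user(), the cause
codes) is hypothetical. Kernel entry only records that isolation was
broken, a specific handler may optionally fill in the cause, and the
return-to-userspace path reports whatever is known, falling back to
"unknown". The special case of the syscall that exits isolation is left
out.

```c
#include <stdbool.h>

/* Hypothetical cause codes, for illustration only. */
enum iso_cause {
    ISO_CAUSE_NONE = 0,    /* nothing recorded */
    ISO_CAUSE_IRQ,
    ISO_CAUSE_IPI,
    ISO_CAUSE_SYSCALL,
    ISO_CAUSE_PAGE_FAULT,
    ISO_CAUSE_UNKNOWN      /* kernel was entered, nobody reported why */
};

/* Hypothetical per-task isolation state. */
struct iso_task {
    bool isolated;         /* task has requested isolation */
    bool entered_kernel;   /* set on any kernel entry while isolated */
    enum iso_cause cause;  /* filled in by a specific handler, if any */
};

/* Low-level entry hook: only notes that isolation was broken. */
void iso_kernel_entry(struct iso_task *t)
{
    if (t->isolated)
        t->entered_kernel = true;
}

/* Optional reporting by a specific handler; the first cause wins. */
void iso_report_cause(struct iso_task *t, enum iso_cause c)
{
    if (t->entered_kernel && t->cause == ISO_CAUSE_NONE)
        t->cause = c;
}

/*
 * Return-to-userspace hook. Returns the cause to notify userspace
 * about, or ISO_CAUSE_NONE if isolation was not broken. If the kernel
 * was entered but no handler reported a cause, the break is still
 * reported, as ISO_CAUSE_UNKNOWN.
 */
enum iso_cause iso_return_to_user(struct iso_task *t)
{
    enum iso_cause c;

    if (!t->entered_kernel)
        return ISO_CAUSE_NONE;

    c = (t->cause != ISO_CAUSE_NONE) ? t->cause : ISO_CAUSE_UNKNOWN;

    /* Isolation is over; clear the state for the next attempt. */
    t->isolated = false;
    t->entered_kernel = false;
    t->cause = ISO_CAUSE_NONE;
    return c;
}
```

With this split, correctness of isolation breaking depends only on the
entry and return hooks; a missing iso_report_cause() call degrades the
report to "unknown" instead of losing the event.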

Did I get it right that you mean we can remove all specific handling
of isolation breaking causes, except for the syscall that exits
isolation, and report isolation breaking instead of normally returning
to userspace? Then isolation breaking will be handled reliably without
knowing the cause, and we can leave determining the cause to the
tracing mechanism (if enabled)?

This does make sense. However, to me it looks somewhat strange, because
I consider isolation breaking to be a kind of runtime error that
userspace software is supposed to get some basic information about --
like signals distinguishing between, say, SIGSEGV and SIGPIPE, or
write() being able to set errno to ENOSPC or EIO. Then userspace
receives basic information about the cause of the exception or error,
and can do some meaningful reporting, or decide whether the error
should be fatal for the application or handled differently, based on
its internal logic. To get those distinctions, the application does not
have to be aware of anything internal to the kernel.
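
For comparison, this is roughly how an application makes that kind of
decision for write() today. It is a minimal sketch: the verdict names
and the mapping are my illustrative policy, not anything mandated by
the API; only write(), errno and the error codes themselves are real.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical application-level reactions. */
enum io_verdict { IO_OK, IO_RETRY, IO_CLEANUP, IO_FATAL };

/* Map an errno value from a failed write() to a reaction. */
enum io_verdict classify_write_error(int err)
{
    switch (err) {
    case 0:
        return IO_OK;
    case EINTR:
    case EAGAIN:
        return IO_RETRY;    /* transient, try again */
    case ENOSPC:
        return IO_CLEANUP;  /* common problem, free some space */
    case EIO:
    default:
        return IO_FATAL;    /* something is seriously wrong */
    }
}

/*
 * Write once and classify the result. Short writes are treated as
 * success here to keep the example small.
 */
enum io_verdict checked_write(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);

    return classify_write_error(n < 0 ? errno : 0);
}
```

The application needs no knowledge of filesystem or driver internals to
tell "disk full" from "device is failing"; the error code is enough.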

Similarly, distinguishing between, say, a page fault, a device
interrupt and a timer may be important for logic implemented in
userspace, and I think it may be nice to allow userspace to get this
information immediately and without being aware of any additional
details of the kernel implementation. The current patch doesn't do this
yet; however, the intention is to implement reliable isolation breaking
by checking on userspace re-entry, plus make reporting of causes, if
any were found, visible to userspace in some convenient way.

The part that determines the cause can be implemented separately from
the isolation breaking mechanism. Then isolation breaking on kernel
entry (or potentially some other condition on kernel entry that
requires logging the cause) enables reporting; the reporting mechanism,
if it exists, fills in the blanks; and once either the cause is known
or it's time to return to userspace, notification is done with
whatever information is available. For some in-depth analysis, if
necessary for debugging the kernel, we can have tracing check whether
we are in this "suspicious kernel entry" mode, and log things that
otherwise would not be logged.

> As for the source of the disturbances, if you really need that
> information, you can trace the workqueue and timer queue events and
> just filter those that target your isolated CPUs.

For the purpose of a human debugging the kernel or application, more
information is (usually) better, so the only concern here is that
the user is now responsible for the completeness of what he is tracing.
However, from the application's point of view, or for logging in a
production environment, it's usually more important to get the general
type of events, so it's possible to, say, confirm that nothing "really
bad" happened, or to trigger the emergency response if it did. Say, if
the only causes of isolation breaking were an IPI within a few moments
of application startup, or a signal from somewhere else when the
application was restarted, there is no cause for concern. However, if
hardware interrupts arrive at random points in time, something is
clearly wrong. And if page faults happen, most likely the application
forgot to page in and lock its address space.
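
A sketch of what such application-side policy could look like, assuming
the kernel handed the application a cause code. The enum values, the
one-second startup window and the treatment of a late IPI as alarming
are all my assumptions for the sake of the example; mlockall() with
MCL_CURRENT | MCL_FUTURE is the existing call an application would use
to lock its address space.

```c
#include <stdbool.h>
#include <sys/mman.h>

/* Hypothetical cause codes as an application might receive them. */
enum brk_cause { BRK_IPI, BRK_SIGNAL, BRK_HW_IRQ, BRK_PAGE_FAULT, BRK_OTHER };

/* Hypothetical reactions. */
enum brk_verdict { BRK_BENIGN, BRK_APP_BUG, BRK_ALARM };

/* Assumed length of the "few moments after startup" window. */
#define STARTUP_GRACE_MS 1000

/* Decide how seriously to take one isolation-breaking event. */
enum brk_verdict judge_break(enum brk_cause cause, long ms_since_start,
                             bool restarting)
{
    switch (cause) {
    case BRK_IPI:
        /* Expected shortly after startup, suspicious later. */
        return ms_since_start <= STARTUP_GRACE_MS ? BRK_BENIGN : BRK_ALARM;
    case BRK_SIGNAL:
        /* Expected when someone restarts the application. */
        return restarting ? BRK_BENIGN : BRK_ALARM;
    case BRK_PAGE_FAULT:
        /* Most likely the address space was not locked. */
        return BRK_APP_BUG;
    case BRK_HW_IRQ:
    default:
        /* Interrupts at random points: something is clearly wrong. */
        return BRK_ALARM;
    }
}

/*
 * Remedy for the page fault case: lock current and future mappings
 * before entering isolation. Returns 0 on success, -1 on error.
 */
int lock_address_space(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}
```

None of this needs tracing to be enabled; it only needs the general
type of the event to arrive together with the notification.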

Again, in my opinion this is not unlike reporting ENOSPC vs. EIO while
doing file I/O -- the former (usually) indicates a common problem that
may require application-level cleanup, the latter (also usually) means
that something is seriously wrong.

> > A process may exit isolation because some leftover delayed work,
> > for example, a timer or a workqueue, is still present on a CPU, or
> > because a page fault or some other exception, normally handled
> > silently, is caused by the task. It is also possible to direct an
> > interrupt to a CPU that is running an isolated task -- currently
> > it's perfectly valid to set interrupt smp affinity to a CPU running
> > isolated task, and then interrupt will cause breaking isolation.
> > While it's probably not the best way of handling interrupts, I
> > would rather not prohibit this explicitly.
> 
> Sure, but you can trace all these events with the existing tracing
> interface we have.

Right. However, it would require someone to intentionally trace all
those events, all for the purpose of obtaining the type of a runtime
error. As an embedded systems developer who had to look for signs of
unusual bugs on a large number of customers' systems, and had to
distinguish them from reports of hardware malfunctions, I would prefer
something clearly identifiable in the logs (of kernel, application, or
anything else) when no one is specifically investigating any problem.

When anything suspicious happens, often the system is physically
unreachable, and the problem may or may not happen again, so the first
report from a running system may be the only thing available. When
everything is going well, the same systems more often have hardware
failures than report valid software bugs (or, ideally, all reports are
from hardware failures), so it's much better to know that if software
does something wrong, it will be possible to identify the problem
from the first report, rather than guess.

Sometimes equipment gets firmware updates many years after production,
when there are reports of all kinds of failures due to mechanical or
thermal damage, faulty parts, bad repair work, deteriorating flash,
etc. Among those there might be something that indicates new bugs made
by a new generation of developers (occasionally literally),
regressions, etc. In those situations getting useful information from
the error message in the first report can make a difference between
quickly identifying the problem and going on a wild goose chase.

-- 
Alex