lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210810183746.GA32986@fuller.cnet>
Date:   Tue, 10 Aug 2021 15:37:46 -0300
From:   Marcelo Tosatti <mtosatti@...hat.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     linux-kernel@...r.kernel.org, Nitesh Lal <nilal@...hat.com>,
        Nicolas Saenz Julienne <nsaenzju@...hat.com>,
        Frederic Weisbecker <frederic@...nel.org>,
        Christoph Lameter <cl@...ux.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Alex Belits <abelits@...its.com>, Peter Xu <peterx@...hat.com>
Subject: Re: [patch 0/4] extensible prctl task isolation interface and vmstat
 sync (v2)

On Tue, Aug 10, 2021 at 06:40:48PM +0200, Thomas Gleixner wrote:
> Marcelo,
> 
> On Fri, Jul 30 2021 at 17:18, Marcelo Tosatti wrote:
> 
> can you pretty please:
> 
>  1) Add a version number to your patch series right where it belongs:
> 
>     [patch V2 N/M]
> 
>     Just having a (v2) at the end of the subject line of 0/M is sloppy
>     at best.
> 
>  2) Provide a lore link to the previous version
> 
> Thanks,
> 
>         tglx

Thomas,

Sure, will resend -v3 once done with the following:

1) Adding support for KVM.

2) Adding a tool called "chisol" to util-linux, similar 
to chrt/taskset, to prctl+exec (for unmodified applications).

This raises the question whether or not to add an option to preserve
the task parameters across fork (i think the answer is yes).

--

But the following points are unclear to me (in quotes are earlier 
comments you made):

1) "It's about silencing different and largely independent parts of the OS
on a particular CPU. Just defining upfront that there is only the choice
of all or nothing _is_ policy.

There is a very wide range of use case scenarios out there and just
because the ones which you care about needs X does not mean that X is
the right thing for everybody else. You still can have X and let other
people define their own set of things they want to be protected
against.

Aside of that having it selectively is a plus for debugability, testing
etc."

So for the ability to individually select what parts of the OS 
on a particular CPU are quiesced, there is:

+	defmask = defmask | ISOL_F_QUIESCE_VMSTATS;
+
+	ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, defmask,
+                   0, 0);
+	if (ret == -1) {
+               perror("prctl PR_ISOL_SET");
+               return EXIT_FAILURE;
+	}

However there is a feeling that implementation details are being exposed 
to userspace... However that seems to be alright: what could happen is that
the feature ceases to exist (say vmstat sync), in kernel, and the bit
is kept for compability (but the kernel does nothing about it). 

That of course means whatever "vmstat sync" replacement comes up, it should
avoid IPIs as well.

Any thoughts on this?

2) "Again: I fundamentaly disagree with the proposed task isolation patches
approach as they leave no choice at all.

There is a reasonable middle ground where an application is willing to
pay the price (delay) until the reqested quiescing has taken place in
order to run undisturbed (hint: cache ...) and also is willing to take
the addtional overhead of an occacional syscall in the slow path without
tripping some OS imposed isolation safe guard.

Aside of that such a granular approach does not necessarily require the
application to be aware of it. If the admin knows the computational
pattern of the application, e.g.

 1     read_data_set() <- involving syscalls/OS obviously
 2     compute_set()   <- let me alone
 3     save_data_set() <- involving syscalls/OS obviously

       repeat the above...

then it's at his discretion to decide to inflict a particular isolation
set on the task which is obviously ineffective while doing #1 and #3 but
might provide the so desired 0.9% boost for compute_set() which
dominates the judgement.

That's what we need to think about and once we figured out how to do
that it gives Marcelo the mechanism to solve his 'run virt undisturbed
by vmstat or whatever' problem and it allows Alex to build his stuff on
it.

Summary: The problem to be solved cannot be restricted to

    self_defined_important_task(OWN_WORLD);

Policy is not a binary on/off problem. It's manifold across all levels
of the stack and only a kernel problem when it comes down to the last
line of defence.

Up to the point where the kernel puts the line of last defence, policy
is defined by the user/admin via mechanims provided by the kernel.

Emphasis on "mechanims provided by the kernel", aka. user API.

Just in case, I hope that I don't have to explain what level of scrunity
and thought this requires."

OK, so perhaps a handful of use-cases can clarify whether the proposed
interface requires changes?

The example on samples/task_isolation/ is focused on "enter task isolation
and very rarely exit".

There are two other cases i am aware of:

A) Christoph's use-case:

	1) Enter task-isolation.
	2) Latency sensitive loop begins.
	3) Some event interrupts latency sensitive section.

	4) Handling of the event requires N syscalls, which the programmer
	   would be interested in happening without quiescing at 
	   every return to system call. The current scheme would be:


       /*
        * Application can either set the value from ISOL_F_QUIESCE_DEFMASK,
        * which is configurable through
        * /sys/kernel/task_isolation/default_quiesce_activities,
        * or specific values.
        *
        * Using ISOL_F_QUIESCE_DEFMASK allows for the application to
        * take advantage of future quiescing capabilities without
        * modification (provided default_quiesce_activities is
        * configured accordingly).
        */
       defmask = defmask | ISOL_F_QUIESCE_VMSTATS;

       ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, defmask,
                   0, 0);
       if (ret == -1) {
               perror("prctl PR_ISOL_SET");
               return EXIT_FAILURE;
       }

lat_loop:
       ret = prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0);
       if (ret == -1) {
               perror("prctl PR_ISOL_CTRL_SET (ISOL_F_QUIESCE)");
               return EXIT_FAILURE;
       }

       latency sensitive loop

       if (event == 1) {
		/* disables quiescing of all features, while maintaining
		 * other features such as logging and avoidance of
		 * interruptions enabled.
		 */
		ret = prctl(PR_ISOL_CTRL_SET, 0, 0, 0, 0);
		syscall1
		syscall2
		...
		syscallN
		/* reenter isolated mode with quiescing */
		goto lat_loop;
	}
	...

Should it be possible to modify individual quiescing parts individually
while maintaining isolated mode? Yes, that seems to be desired.


The other use-case (from you) seems to be:

 1     read_data_set() <- involving syscalls/OS obviously
 2     compute_set()   <- let me alone
 3     save_data_set() <- involving syscalls/OS obviously

       repeat the above...

Well, the implementation of Christoph's use above seems not
to be that bad as well:

 1     read_data_set() <- involving syscalls/OS obviously
	/* disables quiescing of all (or some, if desired)
	 * features, while maintaining other features such
	 * as logging and avoidance of interruptions enabled.
	 */
	ret = prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0);

 2     compute_set()   <- let me alone

	ret = prctl(PR_ISOL_CTRL_SET, 0, 0, 0, 0);

 3     save_data_set() <- involving syscalls/OS obviously

       repeat the above...

What kind of different behaviour, other than enabling/disabling
quiescing, would be desired in this use-case?

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ