Date:	Mon, 21 Sep 2015 10:49:39 +0800
From:	Chen Fan <chen.fan.fnst@...fujitsu.com>
To:	"Serge E. Hallyn" <serge@...lyn.com>,
	"Eric W. Biederman" <ebiederm@...ssion.com>
CC:	Konstantin Khlebnikov <khlebnikov@...dex-team.ru>,
	Serge Hallyn <serge.hallyn@...ntu.com>,
	Stéphane Graber <stgraber@...ntu.com>,
	<linux-api@...r.kernel.org>,
	<containers@...ts.linux-foundation.org>,
	Oleg Nesterov <oleg@...hat.com>,
	<linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: Re: [PATCH RFC] pidns: introduce syscall getvpid


On 09/17/2015 12:31 AM, Serge E. Hallyn wrote:
> On Wed, Sep 16, 2015 at 09:49:02AM -0500, Eric W. Biederman wrote:
>> "Serge E. Hallyn" <serge@...lyn.com> writes:
>>
>>> On Wed, Sep 16, 2015 at 10:37:33AM +0300, Konstantin Khlebnikov wrote:
>>>> On 15.09.2015 20:41, Serge Hallyn wrote:
>>>>> Quoting Stéphane Graber (stgraber@...ntu.com):
>>>>>> On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote:
>>>>>>> On 15.09.2015 17:27, Eric W. Biederman wrote:
>>>>>>>> Konstantin Khlebnikov <khlebnikov@...dex-team.ru> writes:
>>>>>>>>
>>>>>>>>> pid_t getvpid(pid_t pid, pid_t source, pid_t target);
>>>>>>>>>
>>>>>>>>> This syscall converts pid from one pid-ns into pid in another pid-ns:
>>>>>>>>> it takes @pid in namespace of @source task (zero for current) and
>>>>>>>>> returns related pid in namespace of @target task (zero for current too).
>>>>>>>>> If pid is unreachable from target pid-ns then it returns zero.
>>>>>>>> This interface as presented is inherently racy.  It would be better
>>>>>>>> if source and target were file descriptors referring to the namespaces
>>>>>>>> you wish to translate between.
>>>>>>> Yep, it's racy. As well as any operation with non-child pids.
>>>>>>> With file descriptors for source/target result will be racy anyway.
>>>>>>>
>>>>>>>>> Such conversion is required for interaction between processes from
>>>>>>>>> different pid-namespaces. For example when system service talks with
>>>>>>>>> client from isolated container via socket about task in container:
>>>>>>>> Sockets are already supported.  At least the metadata of sockets is.
>>>>>>>>
>>>>>>>> Maybe we need this but I am not convinced of its utility.
>>>>>>>>
>>>>>>>> What are you trying to do that motivates this?
>>>>>>> I'm working on a hierarchical container management system which
>>>>>>> allows creating and controlling nested sub-containers from containers
>>>>>>> ( https://github.com/yandex/porto ). The main server runs on the host
>>>>>>> and has to interact with all levels of nested namespaces. This syscall
>>>>>>> makes some operations much easier: the server must remember only the
>>>>>>> pid in the host pid namespace and convert it into the right vpid on demand.
>>>>>> Note that as Eric said earlier, sending a PID inside a ucred through a
>>>>>> unix socket will have the pid translated.
>>>>>>
>>>>>> So while your solution certainly should be faster, you can already achieve
>>>>>> what you want today by doing:
>>>>>>
>>>>>> == Translate PID in container to PID in host
>>>>>>   - open a socket
>>>>>>   - setns to container's pidns
>>>>>>   - send ucred from that container containing the requested container PID
>>>>>>   - host sees the host PID
>>>>>>
>>>>>> == Translate PID on host to PID in container
>>>>>>   - open a socket
>>>>>>   - setns to container's pidns
>>>>>>   - send ucred from the host containing the requested host PID
>>>>>>     (send will fail if the host PID isn't part of that container)
>>>>>>   - container sees the container PID
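The ucred mechanism in the steps above can be sketched roughly as follows.
This is only an illustration, not code from lxcfs, cgmanager or porto: the
helper names (send_pid, recv_pid, roundtrip_own_pid) are made up, and the
setns step is left out because it needs privileges. Both ends of the socket
therefore live in the same pid namespace and the pid comes back unchanged.
In the cross-namespace case the sender would be a process running inside the
container's pid namespace, and the kernel would rewrite the pid for the
receiver.

```python
import os
import socket
import struct

# Layout of struct ucred on Linux: pid_t pid, uid_t uid, gid_t gid.
UCRED = struct.Struct("iII")


def send_pid(sock, pid):
    """Send one byte plus an SCM_CREDENTIALS message that names `pid`.

    The kernel validates the credentials: an unprivileged sender may only
    name its own pid, uid and gid.
    """
    creds = UCRED.pack(pid, os.getuid(), os.getgid())
    sock.sendmsg([b"p"], [(socket.SOL_SOCKET, socket.SCM_CREDENTIALS, creds)])


def recv_pid(sock):
    """Return the sender's pid as seen from the receiver's pid namespace."""
    _data, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_SPACE(UCRED.size))
    for level, ctype, payload in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            pid, _uid, _gid = UCRED.unpack(payload[:UCRED.size])
            return pid
    return None


def roundtrip_own_pid():
    """Same-namespace demo: send our own pid and read it back."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        # The receiver must ask for credentials to be delivered.
        receiver.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
        send_pid(sender, os.getpid())
        return recv_pid(receiver)
    finally:
        sender.close()
        receiver.close()
```

Note that setns() into a pid namespace only affects children created
afterwards, so a real helper has to fork after setns and send from the child.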
>>>>> In addition, since commit e4bc332451 : /proc/PID/status: show all sets of pid according to ns
>>>>> we now also have 'NSpid' etc in /proc/$$/status.
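For reference, reading the NSpid field could look something like the sketch
below. The function names are illustrative only. The field lists one pid per
pid namespace the task is a member of, starting with the namespace of the
procfs mount being read and ending with the task's innermost namespace.

```python
def parse_nspid(status_text):
    """Return the pids from the NSpid line, outermost namespace first.

    Returns None when the line is absent (kernels without the field).
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return None


def read_nspid(pid="self"):
    """Read and parse NSpid for `pid` from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as status:
        return parse_nspid(status.read())
```

For a task that is pid 5 inside a container and pid 1234 on the host, the
host-side status file would carry "NSpid: 1234 5", and the last element is
the pid in the task's own namespace.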
>>>>>
>>>> As I see this works perfectly only for converting host pid into virtual.
>>>>
>>>> Backward conversion is troublesome: we have to scan all pids in host
>>>> procfs and somehow filter tasks from container and its sub-pid-ns.
>>>> Or I am missing something trivial?
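The scan described above might look roughly like this sketch. It is an
assumption-laden illustration, not an existing tool: snapshot_tasks and
host_pid_for are made-up names, and the filter only matches tasks whose
innermost pid namespace is exactly the container's (identified by the
/proc/PID/ns/pid link), so tasks in nested sub-pid-namespaces are not
handled. It also inherits the race discussed in this thread, since pids can
be reused between the scan and the use of the result.

```python
import os


def snapshot_tasks(proc="/proc"):
    """Yield (host_pid, pidns_link, nspid_list) for every readable process."""
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            ns_link = os.readlink(f"{proc}/{entry}/ns/pid")
            with open(f"{proc}/{entry}/status") as status:
                nspid = [int(field)
                         for line in status if line.startswith("NSpid:")
                         for field in line.split()[1:]]
        except OSError:
            continue  # task exited meanwhile, or we lack permission
        yield int(entry), ns_link, nspid


def host_pid_for(tasks, target_ns, vpid):
    """Find the host pid of the task that is `vpid` inside `target_ns`.

    `target_ns` is a pid namespace link string such as the one returned by
    os.readlink("/proc/<container init pid>/ns/pid").
    """
    for host_pid, ns_link, nspid in tasks:
        if ns_link == target_ns and nspid and nspid[-1] == vpid:
            return host_pid
    return None
```

A caller on the host would do something like
host_pid_for(snapshot_tasks(), os.readlink(f"/proc/{init_pid}/ns/pid"), vpid),
where init_pid is the host pid of the container's init. Every lookup walks
the whole of /proc, which is the cost being complained about here.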
>>> Ah, no that doesn't help with this.
>>>
>>> What Stéphane describes is what I've done in several projects.
>>> Getting it right is however actually quite tricky.  I'm not
>>> convinced it's at the level of "since you can do (sweep hands)
>>> all this, we don't need a simple syscall to do it."
>>>
>>> So I'd encourage you to resend using namespace inode fds for
>>> source and target as Eric suggested.  We still may decide that
>>> the syscall isn't needed, but it's a trivial change to your
>>> patch and removes that race.  And I'm not convinced it's not
>>> needed.
>> At this point my primary concern is that a pattern that would need to be
>> converting to and from pids quickly is potentially fundamentally racy to
>> the point of broken.
> The cgmanager GetTasks and GetTasksRecursive, and reading of the
> lxcfs cgroup /tasks files, require converting every pid from the
> cgmanager's namespace to the reading task's namespace.
>
>> Especially with unix domain sockets passing and converting pids in a way
>> that covers the common case.
>>
>> I am clearly missing some nuance of this use case.
> lxcfs and cgmanager are imo proof that we *can* do without the new
> syscall.  However, the git history will show that there are some
> complications, and the system load when a few systemds are starting
> will show that it does take a performance toll on the host at some
> point.  Still as I say it's doable.  The syscall implementation was
> very simple, though.

Yes, an earlier thread discussed whether to implement this as a syscall or
through procfs:
http://www.gossamer-threads.com/lists/linux/kernel/1971723?search_string=chen%20hanxiao;#1971723

but implementing it through procfs seems complicated. The original discussion is at:
http://www.gossamer-threads.com/lists/linux/kernel/2076440?search_string=chen%20hanxiao;#2076440

Thanks,
Chen

>
> -serge
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> .
>
