linux-kernel - Re: [PATCH v4] pidns: introduce syscall translate

Message-ID: <bb6fa90b-ffc5-1263-23ef-e99e6480b09c@oracle.com>
Date:   Wed, 1 Nov 2017 17:38:10 -0700
From:   "prakash.sangappa" <prakash.sangappa@...cle.com>
To:     Jann Horn <jannh@...gle.com>
Cc:     Andy Lutomirski <luto@...capital.net>,
        Nagarathnam Muthusamy <nagarathnam.muthusamy@...cle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Konstantin Khlebnikov <khlebnikov@...dex-team.ru>,
        Oleg Nesterov <oleg@...hat.com>,
        Linux API <linux-api@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Serge Hallyn <serge.hallyn@...ntu.com>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        Eugene Syromiatnikov <esyr@...hat.com>
Subject: Re: [PATCH v4] pidns: introduce syscall translate_pid



On 11/01/2017 10:43 AM, Jann Horn wrote:
> On Tue, Oct 17, 2017 at 5:38 PM, Prakash Sangappa
> <prakash.sangappa@...cle.com> wrote:
>>
>> On 10/16/17 5:52 PM, Andy Lutomirski wrote:
>>> On Mon, Oct 16, 2017 at 3:54 PM, prakash.sangappa
>>> <prakash.sangappa@...cle.com> wrote:
>>>>
>>>> On 10/16/2017 03:07 PM, Nagarathnam Muthusamy wrote:
>>>>>
>>>>>
>>>>> On 10/16/2017 02:36 PM, Andrew Morton wrote:
>>>>>> On Sat, 14 Oct 2017 11:17:47 +0300 Konstantin Khlebnikov
>>>>>> <khlebnikov@...dex-team.ru> wrote:
>>>>>>
>>>>>>>>>> pid_t translate_pid(pid_t pid, int source, int target);
>>>>>>>>>>
>>>>>>>>>> This syscall converts pid from source pid-ns into pid in target pid-ns.
>>>>>>>>>> If pid is unreachable from target pid-ns it returns zero.
>>>>>>>>>>
>>>>>>>>>> Pid-namespaces are referred file descriptors opened to proc files
>>>>>>>>>> /proc/[pid]/ns/pid or /proc/[pid]/ns/pid_for_children. Negative argument
>>>>>>>>>> refers to current pid namespace, same as file /proc/self/ns/pid.
>>>>>>>>>>
>>>>>>>>>> Kernel expose virtual pids in /proc/[pid]/status:NSpid, but backward
>>>>>>>>>> translation requires scanning all tasks. Also pids could be translated
>>>>>>>>>> by sending them through unix socket between namespaces, this method is
>>>>>>>>>> slow and insecure because other side is exposed inside pid namespace.
>>>>>>> Andrew asked why we might need this.
>>>>>>>
>>>>>>> Such conversion is required for interaction between processes across
>>>>>>> pid-namespaces.
>>>>>>> For example to identify process in container by pid file looking from
>>>>>>> outside.
>>>>>>>
>>>>>>> Two years ago I've solved this in project of mine with monstrous code which
>>>>>>> forks couple times just to convert pid, lucky for me performance wasn't
>>>>>>> important.
>>>>>> That's a single user who needed this a single time, and found a
>>>>>> userspace-based solution anyway.  This is not exactly compelling!
>>>>>>
>>>>>> Is there a stronger case to be made?  How does this change benefit our
>>>>>> users?  Sell it to us!
>>>>> Oracle database is planning to use pid namespace for sandboxing database
>>>>> instances and they need an API similar to translate_pid to effectively
>>>>> translate process IDs from other pid namespaces. Prakash (cced in mail) can
>>>>> provide more details on this usecase.
>>>>
>>>> As Nagarathnam indicated, Oracle Database will be using pid namespaces and
>>>> needs a direct method of converting pids of processes in the pid namespace
>>>> hierarchy. In this use case multiple nested PID namespaces will be used.
>>>> The currently available mechanism are not very efficient for this use case.
>>>> For ex. as Konstantin described, using /proc/<pid>/status would require the
>>>> application to scan all the pid's status files to determine the pid of
>>>> given process in a child namespace.
>>>>
>>>> Use of SCM_CREDENTIALS's socket message is another way, which would require
>>>> every process starting inside a pid namespace to send this message and the
>>>> receiving process in the target namespace would have to save the converted
>>>> pid and reference it. This mechanism becomes cumbersome especially if the
>>>> application has to deal with multiple nested pid namespaces. Also, the
>>>> Database needs to be able to convert a thread's global pid(gettid()).
>>>> Passing the thread's pid(gettid()) in SCM_CREDENTIALS message requires
>>>> CAP_SYS_ADMIN, which is an issue.
>>>>
>>>> So having a direct method, like the API that Konstantin is proposing, will
>>>> work best for the Database since pid of a process in any of the nested pid
>>>> namespaces can be converted as and when required. I think with the proposed
>>>> API, the application should be able to convert pid of a process or
>>>> tid(gettid()) of a thread as well.
>>>>
>>> Can you explain what Oracle's database is planning to do with this
>>> information?
>>
>> Database uses the PID to programmatically find out if the process/thread is
>> alive(kill 0) also send signals to the processes requesting it to dump
>> status/debug information and kill the processes in case of a shutdown abort
>> of the instance.
> But if kill(pid, 0) returns 0, that doesn't tell you anything, right?
> It could be that the process you're trying to check is still alive, but it
> could also be that it has died, ns_last_pid has wrapped around, and the PID
> is now being reused by another process, right?

That is true. The database checks the process start time by reading the
/proc/<pid>/stat file to verify that it is the correct process.
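
A rough sketch of that kind of check, for illustration only (this is not
the database's actual code; the function names parse_starttime,
read_starttime and same_process are made up here). It relies on starttime
being field 22 of /proc/<pid>/stat as documented in proc(5), and assumes
the monitor recorded the start time when it first learned of the process:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Extract starttime (field 22 in proc(5)) from the contents of a
 * /proc/<pid>/stat file. The comm field (field 2) may itself contain
 * spaces or ')' characters, so skip to the last ')' before counting.
 * Returns 0 on success, -1 if the line is malformed.
 */
int parse_starttime(const char *stat_line, unsigned long long *starttime)
{
    assert(stat_line != NULL && starttime != NULL);

    const char *p = strrchr(stat_line, ')');
    if (p == NULL)
        return -1;
    p++;    /* now just before field 3 (state) */

    /* Skip fields 3..21, leaving p just before field 22. */
    for (int field = 3; field < 22; field++) {
        while (*p == ' ')
            p++;
        if (*p == '\0')
            return -1;
        while (*p != '\0' && *p != ' ')
            p++;
    }

    char *end;
    unsigned long long value = strtoull(p, &end, 10);
    if (end == p)
        return -1;    /* no digits where field 22 should be */
    *starttime = value;
    return 0;
}

/* Read starttime for a pid as seen through the caller's /proc mount. */
int read_starttime(pid_t pid, unsigned long long *starttime)
{
    char path[64];
    char buf[1024];

    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;    /* no such pid in this namespace, or no access */
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_starttime(buf, starttime);
}

/* 1 if pid still names the process whose start time was recorded, else 0. */
int same_process(pid_t pid, unsigned long long recorded_starttime)
{
    unsigned long long now;

    if (read_starttime(pid, &now) != 0)
        return 0;
    return now == recorded_starttime;
}
```

The pid passed in has to be the one valid in the pid namespace of the
/proc mount being read, which is where the translation comes in.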

>
> Wouldn't it be more reliable to open("/proc/self", O_RDONLY)
> (or /proc/thread-self) in the process you want to monitor, then send
> the resulting file descriptor to the monitoring process with SCM_RIGHTS?
> Then something like this should work for checking whether the process
> is still alive without relying on PIDs at all:
>
>      int retval = faccessat(child_proc_self_fd, "stat", F_OK, 0);
>      if (retval == 0) {
>        /* process still exists */
>      } else if (retval == -1 && errno == ESRCH) {
>        /* process is gone */
>      } else {
>        err(1, "unexpected faccessat result");
>      }

Yes, but there will be a large number of processes to deal with and
only a few monitoring processes. All these processes would have to
open /proc/self and send the fd to every monitoring process. In the
database case there is one fixed monitoring process, but the other
monitoring processes can exit and new ones can be started.
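
For reference, the per-process work under the fd-passing approach would
look roughly like the sketch below. send_fd and recv_fd are hypothetical
helper names; the sketch assumes a connected AF_UNIX socket already exists
between each monitored process and each monitor, which is the part that
does not scale well for us:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one fd over a connected AF_UNIX socket. Returns 0 on success. */
int send_fd(int sock, int fd)
{
    assert(sock >= 0 && fd >= 0);

    char byte = 0;    /* one byte of ordinary data carries the cmsg */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent with send_fd(). Returns the new fd, or -1. */
int recv_fd(int sock)
{
    assert(sock >= 0);

    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

Every monitored process would call something like send_fd() once per
monitor with its open /proc/self fd, and would have to do so again
whenever a new monitoring process starts.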