linux-kernel - Re: Futexes and network filesystems.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <m1lk8sp0t9.fsf@ebiederm.dsl.xmission.com>
Date:	Tue, 20 Nov 2007 23:30:58 -0700
From:	ebiederm@...ssion.com (Eric W. Biederman)
To:	Kyle Moffett <mrmacman_g4@....com>
Cc:	Ingo Molnar <mingo@...e.hu>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Dave Hansen <haveblue@...ibm.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Pavel Emelyanov <xemul@...nvz.org>,
	Ulrich Drepper <drepper@...hat.com>,
	linux-kernel@...r.kernel.org,
	"Dinakar Guniguntala [imap]" <dino@...ibm.com>,
	Sripathi Kodi <sripathik@...ibm.com>
Subject: Re: Futexes and network filesystems.

Kyle Moffett <mrmacman_g4@....com> writes:

> On Nov 20, 2007, at 17:53:52, Er ic W. Biederman wrote:
>> I had a chance to think about this a bit more, and realized that the problem
>> is that futexes don't appear to work on network  filesystems, even if the
>> network filesystems provide coherent  shared memory.
>>
>> It seems to me that we need to have a call that gets a unique token for a
>> process for each filesystem per filesystem for use in futexes  (especially
>> robust futexes).  Say get_fs_task_id(const char *path);
>>
>> On local filesystems this could just be the pid as we use today, but for
>> filesystems that can be accessed from contexts with  potentially overlapping
>> pid values this could be something else.   It is an extra syscall in the
>> preparation path, but it should be hardly more expensive the current getpid().
>>
>> Once we have fixed the futex infrastructure to be able to handle futexes on
>> network filesystems, the pid namespace case will be  trivial to implement.
>
> Actually, I would think that get_vm_task_id(void *addr) would be a more useful
> interface.  The call would still be a relatively simple  lookup to find the
> struct file associated with the particular virtual  mapping, but it would be
> race-free from the perspective of userspace  and would not require that we
> somehow figure out the file descriptor  associated with a particular mmap()
> (which may be closed by this  point in time).  Useful extension would be the
> get_fd_task_id(int fd) and get_fs_task_id(const char *path), but those are less
> important.

You are probably right.  The important thing is that we get an interface where
user space can cache the result.  Or else you totally loose the benefit of
avoiding trapping into the kernel.

> The other important thing is to ensure that somehow the numbers are considered
> unique only within the particular domain of a container,  such that you can
> migrate a container from one system to another even  using a simple local ext3
> filesystem (on a networked block device)  and still be able to have things work
> properly even after the  migration.  Naturally this would only work with an
> upgraded libc but  I think that's a reasonable requirement to enforce for
> migration of  futexes and cross-network futexes.

Well. The numbers are unique per filesystem.  Which is what should save you.
The numbers can't simply be unique per container.

> Even for network filesystems which don't implement coherent shared memory, you
> might add a memexcl() system call which (when used by  multiple cooperating
> processes) ensures that a given page is only  ever mapped by at most one
> computer accessing a given network filesystem.  The page-outs and page-ins when
> shuttling that page  across the network would be expensive, but I believe the
> cost would  be reasonable for many applications and it would allow traditional
> atomic ops on the mapped pages to take and release futexes in the  uncontended
> case.

Sounds reasonable to me. 

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/