Message-ID: <m1myh8rnf4.fsf@frodo.ebiederm.org>
Date: Mon, 13 Oct 2008 18:11:11 -0700
From: ebiederm@...ssion.com (Eric W. Biederman)
To: Tejun Heo <tj@...nel.org>
Cc: Greg KH <greg@...ah.com>, Al Viro <viro@...IV.linux.org.uk>,
Benjamin Thery <benjamin.thery@...l.net>,
linux-kernel@...r.kernel.org, "Serge E. Hallyn" <serue@...ibm.com>,
Al Viro <viro@....linux.org.uk>,
Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: sysfs: tagged directories not merged completely yet

Tejun Heo <tj@...nel.org> writes:
> Hello, Greg.
>
> Greg KH wrote:
>> On Tue, Oct 07, 2008 at 01:27:17AM -0700, Eric W. Biederman wrote:
>>> Unless someone will give an example of how having multiple superblocks
>>> sharing inodes is a problem in practice for sysfs and call it good
>>> for 2.6.28. Certainly it shouldn't be an issue if the network namespace
>>> code is compiled out. And it should greatly improve testing of the
>>> network namespace to at least have access to sysfs.
>>
>> But if the network namespace code is in? THen we have problems, right?
>> And that's the whole point here.
>>
>> The fact that you are trying to limit userspace view of in-kernel data
>> structures, based on that specific user, is, in my opinion, crazy.
>
> Well, that's the whole point of all the namespace stuff. If we're
> gonna do namespaces, view of in-kernel data structures need to be
> limited and modified one way or the other.
>
>> Why not just keep all users from seeing sysfs, and then have a user
>> daemon doing something on top of FUSE if you really want to see this
>> kind of stuff.
>
> That sounds nice. Out of ignorance, how is the /proc dealt with?
> Maybe we can have some unified approach for this multiple views of the
> system stuff.

/proc uses just about every trick in the book to make this work.

/proc/sys uses a magic d_compare method.

/proc/net becomes a symlink to /proc/<pid>/net, and we get completely
different directory trees below that.  Shortly that code will use
automounts of a proc_net filesystem, which has different super blocks,
one for each network namespace.

/proc/sysvipc/* simply returns different values from its files
depending upon which process is reading them.

/proc itself has multiple super blocks, one for each pid namespace.
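
Roughly, from userspace the /proc/net arrangement looks like the sketch
below (illustrative only, not code from any patch set; it needs
CAP_SYS_ADMIN and a kernel with CONFIG_NET_NS and unshare(CLONE_NEWNET)
support).  The same /proc/net path resolves through /proc/self/net and
shows different contents depending on the reader's network namespace:

/* Sketch: what a process sees under /proc/net depends on its network
 * namespace.  Needs root and CONFIG_NET_NS; error handling minimal. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void show_proc_net_dev(const char *label)
{
        char line[256];
        FILE *f = fopen("/proc/net/dev", "r");

        if (!f) {
                perror("/proc/net/dev");
                return;
        }
        printf("--- %s ---\n", label);
        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);
        fclose(f);
}

int main(void)
{
        char target[64];
        ssize_t len = readlink("/proc/net", target, sizeof(target) - 1);

        if (len > 0) {
                target[len] = '\0';
                printf("/proc/net -> %s\n", target);   /* typically "self/net" */
        }

        show_proc_net_dev("initial network namespace");

        /* Move this process into a fresh, empty network namespace. */
        if (unshare(CLONE_NEWNET) < 0) {
                perror("unshare(CLONE_NEWNET)");
                exit(1);
        }

        /* Same path, different view: only the new namespace's devices. */
        show_proc_net_dev("after unshare(CLONE_NEWNET)");
        return 0;
}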

The long-term direction is to be able to see everything at once, if
you mount all of the filesystems multiple times in the proper way.
That allows monitoring software to watch what is going on inside of a
container without a challenge, and it makes how much an individual
container sees a user-space policy decision.

For sysfs we don't have the option of putting things under
/proc/<pid>: the directories I am interested in (at least for network
devices) are scattered all over sysfs and come and go with device
hotplug events, so I don't see a realistic way of splitting those
directories out into their own filesystem.

From a user interface design perspective I don't see a good
alternative to having /sys/class/net/, /sys/virtual/net/, and all of
the other affected directories differ based on network namespace, and
then having the network namespace be specified by the super block.
Looking at current and doing the magic d_compare trick almost works,
but it runs into problems with sysfs_get_dentry.

From the perspective of the internal sysfs data structures, tagged
dirents are clean and simple, so I don't see a reason to re-architect
that.
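
To make the tagged-dirent idea concrete, here is a stand-alone sketch
(the struct names are made up for illustration; the real sysfs_dirent
and super block code looks different).  Each directory entry optionally
carries a namespace tag, each super block records which namespace it
was mounted for, and lookup/readdir simply skip entries whose tag does
not match:

/* Stand-alone illustration of tag-filtered directory listing.
 * Types and names here are hypothetical, not the sysfs ones. */
#include <stdio.h>
#include <stddef.h>

struct ns_tag { int id; };                      /* stand-in for a namespace */

struct tagged_dirent {
        const char *name;
        const struct ns_tag *tag;               /* NULL: visible everywhere */
        struct tagged_dirent *next;
};

struct tagged_super {
        const struct ns_tag *tag;               /* namespace this mount shows */
};

/* Core idea: an entry is visible iff it is untagged or its tag matches
 * the tag of the super block through which it is being viewed. */
static int dirent_visible(const struct tagged_super *sb,
                          const struct tagged_dirent *sd)
{
        return sd->tag == NULL || sd->tag == sb->tag;
}

static void list_dir(const struct tagged_super *sb,
                     const struct tagged_dirent *head)
{
        for (; head; head = head->next)
                if (dirent_visible(sb, head))
                        printf("  %s\n", head->name);
}

int main(void)
{
        struct ns_tag ns_a = { 1 }, ns_b = { 2 };
        /* Two devices named "lo", one per namespace, plus a shared entry. */
        struct tagged_dirent lo_b   = { "lo",     &ns_b, NULL    };
        struct tagged_dirent lo_a   = { "lo",     &ns_a, &lo_b   };
        struct tagged_dirent common = { "common", NULL,  &lo_a   };
        struct tagged_super mnt_a = { &ns_a }, mnt_b = { &ns_b };

        printf("mount for namespace A:\n");
        list_dir(&mnt_a, &common);
        printf("mount for namespace B:\n");
        list_dir(&mnt_b, &common);
        return 0;
}
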
I have spent the last several days looking deeply at what the vfs
can do, and how similar situations are handled. My observations
are:

1) exportfs from nfsd is similar to our kobject to sysfs_dirent layer,
and it solves that set of problems cleanly, including remote rename.
So there is no fundamental reason we need inverted, twisted locking in
sysfs, or to otherwise violate existing vfs rules.

2) i_mutex seems to protect very little, if anything, that we care
about.  The dcache has its own set of locks, so we may be able to
completely avoid taking i_mutex in sysfs and simplify things
enormously.  Currently I believe we are very similar to ocfs2 in
terms of locking requirements.

3) i_notify and d_notify seem to require pinning the inode or the
dentry in question, so I see no reason why a d_revalidate style of
update would have problems.

4) For finer locking granularity in readdir, all we need to do is the
semi-expensive restart for each dirent, and the problem is trivially
solved (see the sketch after this list).

5) Large directories are a potential performance problem in sysfs.
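
To illustrate the restart-per-dirent idea from point 4 (again a
stand-alone sketch with made-up types, not the sysfs code): emit one
entry, remember its name, drop the directory lock, and re-walk the
sorted sibling list to find the next name:

/* Sketch of "semi-expensive restart per dirent".  O(n) re-scan per
 * entry, but no lock is held between entries, so siblings can be added
 * or removed concurrently without breaking the walk. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

struct dir_entry {
        const char *name;                 /* siblings kept sorted by name */
        struct dir_entry *next;
};

static pthread_mutex_t dir_lock = PTHREAD_MUTEX_INITIALIZER;

/* Find the first entry whose name sorts strictly after 'last'. */
static struct dir_entry *next_after(struct dir_entry *head, const char *last)
{
        for (; head; head = head->next)
                if (!last || strcmp(head->name, last) > 0)
                        return head;
        return NULL;
}

static void read_dir(struct dir_entry *head)
{
        char last[64];
        const char *pos = NULL;

        for (;;) {
                pthread_mutex_lock(&dir_lock);
                struct dir_entry *de = next_after(head, pos);
                if (!de) {
                        pthread_mutex_unlock(&dir_lock);
                        break;
                }
                snprintf(last, sizeof(last), "%s", de->name);
                pthread_mutex_unlock(&dir_lock);

                /* Emit the entry with no directory lock held. */
                printf("%s\n", last);
                pos = last;
        }
}

int main(void)
{
        struct dir_entry c = { "eth0", NULL };
        struct dir_entry b = { "dummy0", &c };
        struct dir_entry a = { "bonding_masters", &b };

        read_dir(&a);
        return 0;
}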

So it appears that the path forward is:
- Clean up sysfs locking and other issues.
- Return to the network namespace code.
Possibly with an intermediate step of only showing the network
devices in the initial network namespace in sysfs.

> Can somebody hammer the big picture regarding namespaces into my
> small head?

100,000 foot view.  A namespace introduces a scope so multiple
objects can have the same name.  Like network devices.

10,000 foot view.  The network namespace looks to user space
as if the kernel has multiple independent network stacks.

1000 foot view.  I have two network devices named lo, and sysfs
does not currently have a place for me to put them.
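
For example (an illustrative sketch, assuming root and a CONFIG_NET_NS
kernel), a fork plus unshare(CLONE_NEWNET) is enough to end up with two
interfaces both named lo, one per namespace:

/* Sketch: two network devices named "lo" existing at once. */
#define _GNU_SOURCE
#include <net/if.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void print_interfaces(const char *label)
{
        struct if_nameindex *ifs = if_nameindex();

        if (!ifs) {
                perror("if_nameindex");
                return;
        }
        for (struct if_nameindex *i = ifs; i->if_index != 0; i++)
                printf("%s: %s (index %u)\n", label, i->if_name, i->if_index);
        if_freenameindex(ifs);
}

int main(void)
{
        pid_t child = fork();

        if (child == 0) {
                /* Child: create and enter a fresh network namespace. */
                if (unshare(CLONE_NEWNET) < 0) {
                        perror("unshare(CLONE_NEWNET)");
                        exit(1);
                }
                print_interfaces("child  netns");  /* sees its own "lo" */
                exit(0);
        }
        waitpid(child, NULL, 0);
        print_interfaces("parent netns");          /* still has its own "lo" */
        return 0;
}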

Leakage, and being able to fool an application into thinking it has
the entire kernel to itself, are not concerns.  The goal is simply to
cover the entire object-name-to-object translation boundary, at which
point the namespace work is done.  We have largely achieved that, and
the code to do so, once complete, is reasonable enough that it should
be no worse to maintain than dealing with any other kernel bug.

Eric