linux-kernel - [PATCH =-v3 21/21] fanotify: add Documentation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20081112161217.25434.80361.stgit@paris.rdu.redhat.com>
Date:	Wed, 12 Nov 2008 11:12:17 -0500
From:	Eric Paris <eparis@...hat.com>
To:	linux-kernel@...r.kernel.org, malware-list@...ts.printk.net
Cc:	viro@...iv.linux.org.uk, alan@...rguk.ukuu.org.uk,
	arjan@...radead.org, greg@...ah.com, tytso@....edu,
	akpm@...ux-foundation.org
Subject: [PATCH =-v3 21/21] fanotify: add Documentation

Add an example/test program which exercises all of fanotify's abilities and
include all of my writeup about fanotify in Documentation.

Signed-off-by: Eric Paris <eparis@...hat.com>
---

 Documentation/filesystems/fanotify/fanotify.txt    |  214 +++++++++++++++++
 .../filesystems/fanotify/fanotify_tester.c         |  255 ++++++++++++++++++++
 2 files changed, 469 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/fanotify/fanotify.txt
 create mode 100644 Documentation/filesystems/fanotify/fanotify_tester.c

diff --git a/Documentation/filesystems/fanotify/fanotify.txt b/Documentation/filesystems/fanotify/fanotify.txt
new file mode 100644
index 0000000..b07d401
--- /dev/null
+++ b/Documentation/filesystems/fanotify/fanotify.txt
@@ -0,0 +1,214 @@
+fanotify is a new fscking all notify subsystem.  Much as inotify
+provides notification of filesystem activity for some registered subset
+of inodes fanotify provides notification of filesystem activity for ALL
+of the system's S_ISREG() files.  fanotify has a smaller number of
+notifications than inotify.
+
+User interface work can be found in net/fanotify.  Events are pulled from
+the kernel using getsockopt() and responses are sent to the kernel using
+setsockopt.  An example program which implements most if not all of the
+fanotify features can be found in the documentation directory. 
+
+The main implementation is in fs/notify/.  My layout is like so
+
+fanotify.c     --- all functions called from the main kernel
+group.c        --- groups are my implementation of my multiple listeners
+                   you can register different groups to get different
+                   event types.
+notification.c --- implementation surrounding the sending of events to a
+                   userspace listener.
+fastpath.c     --- implementation surrounding the addition of inode
+                   fastpath entries for performance.
+access.c       --- implementation surrounding the processing of
+                   responses from userspace when the events require
+                   a response.
+
+GROUPS AND EVENTS:
+A "group" registers with fanotify and in doing so indicates what that
+group should be called, what types of events it wishes to receive and
+in what order it should receive events relative to other groups.  An
+"event" is simply a notification about some filesystem action.  The list
+of all fanotify events are read, write, open_for_exec, open,
+close_was_not_writable, close_was_writable, open_need_access_decision,
+read_for exec_need_access_decision, read_need access_decisions.  Any number
+of groups may be created for any subset of event types.  One group may
+register to get reads and writes while another maybe register for opens
+and read_need_access_desision.
+
+Any number of userspace listeners may be active in a single group.  Each
+group will get ONE copy of a given filesystem event.  If there are 10
+listeners in a single group and one fanotify event is generated only ONE
+of those listeners will get the event.  If more than one group registers
+for the same type of event one listener in EACH GROUP will get a copy of
+that event.
+
+It all starts when a listener registers a group.  Registering a group
+is as done by binding group information to a fanotify socket.  Simple
+operation is something like.
+
+addr.name = 123
+addr.priority = 456
+addr.mask = 0x002
+
+sock = socket(PF_FANOTIFY);
+fd = bind(sock, addr);
+
+123 is just the name of the group, used so it is unlikely two listeners
+will accidentally use the same priority and mask and stumble into each
+others events.  456 is the priority (only interesting for blocking/access
+events, will describe later) and 0x002 is FAN_WRITE.  If one wanted read
+and write you would use 0x03 = (FAN_ACCESS | FAN_WRITE).  If this is the
+first time a process has bound to this address a struct fanotify_group
+is allocated and initialized (see fanotify_find_group()).  The group is
+added to a kernel global list called groups.
+
+The listener should now call getsockopt(fd, FANOTIFY_GET_EVENT,...).
+Since at this point there are no events this will block waiting for an
+event.
+
+Now lets say the original process calls open().  Open is going to happen
+exactly as before until it gets to the fsnotify code (this is where both
+inotify and dnotify hook into the kernel.)  From fsnotify we will call
+into the function fanotify() with the mask FAN_OPEN.  We will then walk
+to global groups list (which is ordered by priority, low first) looking
+for any groups which want to receive notification about FAN_OPEN and
+let's say we will find the group '123' that was registered above. An
+fanotify_event is allocated and any data we want the listener process to
+get about the original process is added to the fanotify_event.  The
+event contains a struct path with the dentry and vfsmount from the open
+done by the original process.
+
+Now we call add_event_to_group_notification() to add the event to the
+group->notification_list.  This function has a little bit of magic.
+Since an event may be needed in multiple groups notification_list I
+created a helper structure, a struct fanotify_event_holder.  Each entry
+in the group->event_list points to a unique event_holder which in turn
+points to the ref counted event in question.
+
+(assuming 2 groups)
+group1->notification_list ==> fanotify_event_holder1 ==> single_fanotify_event
+group2->notification_list ==> fanotify_event_holder2 ==> single_fanotify_event
+
+The magic is that since we will always need at least 1 holder I embedded
+one fanotify_event_holder inside an fanotity_event.  This means that
+when removing an event from the group->notification_list we may need to
+free the fanotify_event_holder (if it was allocated seperately) or we
+may need to just clear it (if it was the embedded holder.)
+
+After the event is added to the group->notification_list we wake up the
+listener processes.  The original process never blocked and at this
+point and is returning to userspace with the completed open.
+
+Simultaneously the listener process will now remove the event from the
+group->notification_list, see remove_event_from_group_notification().
+Create a new file, fd, and install such in the listener process, see
+fanotify_notification_read().  We will put_event (since this group
+is finished with it) and will return the getsockopt() call to userspace.
+
+The listener process will get a structure filled back from teh
+getsockopt call which will include an fd and some metadata about that
+fd.  This includes things like the original files f_flags, the original
+processes pid and things like that.
+
+The listener process must call close() when it is finished with this
+new fd.  But lets assume the listener, for whatever reason, decides it
+doesn't want to hear any more of this type of message for this inode.
+That means the listener process needs to "create a fastpath" entry.  To
+do this the listener process will need to call setsockopt(fd,
+FANOTIY_SET_FASTPATH, ...) the struct with the setsockopt will include
+things like what kind of events we don't want to hear about and what
+file this is associated with.  All fastpaths will be cleared on the next
+fsnotify write event (happens the same place in the code as mtime
+update)
+
+Inside the kernel (fanotify_fastpath_add()) what happens is that we will
+create a new fanotify_fastpath_entry for the group and mask in question
+and attach it to the inode.  The next time a process opens the inode in
+question we will search the global groups list for a group that matches
+the mask and we will look at the inode to see if there is a fastpath
+entry for this group and mask.  If there is an entry no event will be
+added to the group->event_list.
+
+The need for fastpaths (or calling what it really is, an in kernel
+cache) has been questioned.  I decided to include a little unscientific data
+here.  On a 32 way machine a make -j 32 took about 9 minutes 12 seconds.
+With that same machine running having one group and 32 listeners receiving
+every fanotify event that the kernel could send to userspace while the 
+listeners were responding to accesses as fast as they could (just a very tight get
+event, allow loop) it took 95 minutes 12 seconds.  Same process with in kernel
+fastpaths/cache results took 19 minutes 5 seconds.  More reasonable event
+requirements and a single listener took 10 minutes 35 sec.  So about a 15%
+perf hit to do any kind of permission checking to userspace.  Anyway, the need
+for fastpaths is quite clear.
+
+Assuming the event was for FAN_OPEN_PERM much of the above is the
+same.  Biggest difference is that the place from the kernel we will call
+into the fanotify code is different (fsnotify is not in a good place to
+provide security hooks).  If a group is found that wants this event the
+event is added to the group->access_list AND to the group->event_list.
+The original process is then blocked for a (now fixed 5 second) timeout
+waiting for the event to get a non-zero event->response on the
+group->access_waitq.
+
+The listener process will get a notification exactly as above from the
+notification file but this time will need to write an answer to the
+access file.  The answer is again a simple string indicating the cookie
+and the response (allow/deny).  If a response is received from userspace
+the event is removed from the group->access_list and the original
+process is woken up to continue, either by looking for the next group or
+by returning -EPERM.  Userspace may also return FAN_RESETTIMER which
+will reset the 5 second timeout.  A badly behaving userspace may hang an
+open indeffinetly.
+
+If the original process times out waiting for the listener process to
+give a response we currently just allow the security access.
+
+An interesting part of the code is fastpath cleanup handling.  Any time
+fanotify gets a FAN_MODIFY event we clear the fastpath entries for the
+associated inode.  This means our notification and access decisions are
+NOT race free but the races are small.  This is not perfect security.  A
+'problomatic' sequence of events would be like such
+
+process1 calls read on file
+listener scans file and finds it safe
+process2 writes to file
+listener creates a fastpath entry for file
+
+Maybe I should add an explicit clear fastpath request so userspace can
+close that race when it gets the write notification on file (if it so
+chooses.)  Remember I'm not trying to protect against a rogue process
+intelligently and actively attacking the system.  I'm trying to stop a
+correctly functioning yum from reading a bad rpm that it downloaded.
+I'm trying to stop an NFS server from accepting bad files and then
+apache serving them out on the net.
+
+There is the also a race between when a process calls mmap() (at that
+point we get an event and scan) and when the process actually faults in
+a page which may have been changed by a write from another process.  Its
+not perfect, but it's damn sure better than what we have now.
+
+Object lifetime:
+
+fanotify_group - exists from registration to unregistration.
+Unregistration racing with some the associated special file notification
+open was a bitch to figure out but eventually I based it on the groups
+existence in the global groups list.
+
+fanotify_event - created inside the main fanotify loop which runs the
+events list.  An event lasts until both the main loop ends AND the event
+is no longer needed by all groups for which it was queued.  A refcnt on
+the event is taken every time it is added to a group notification/access
+list and is dropped when the group has removed the event from the list
+and is finished with its contents.
+
+fanotify_event_holder - allocated when an event is added to a group
+notification/access list.  destroyed when an event is removed from a
+notification/access list.  There is the special case of the embedded
+holder inside the fanotify_event.  The embedded holder is assumed to be
+available for use if holder->event_list is empty.
+
+fanotify_fastpath_entry - created when a process writes to the fastpath
+special file and added to the inode list.  This entry is destroy in 3
+possible places.  If an inode has a modify event we flush them all.  If
+an inode is eviced from core we flush them all.  If a group is
+unregistered we flush them all for that group.
diff --git a/Documentation/filesystems/fanotify/fanotify_tester.c b/Documentation/filesystems/fanotify/fanotify_tester.c
new file mode 100644
index 0000000..ba5e830
--- /dev/null
+++ b/Documentation/filesystems/fanotify/fanotify_tester.c
@@ -0,0 +1,255 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <signal.h>
+
+#include "fanotify.h"
+
+typedef void (*sighandler_t)(int);
+
+#ifndef SOL_FANOTIFY
+#define SOL_FANOTIFY 276
+#endif
+
+#ifndef AF_FANOTIFY
+#define AF_FANOTIFY 36
+#endif
+
+#ifndef PF_FANOTIFY
+#define PF_FANOTIFY AF_FANOTIFY
+#endif
+
+#define SEND_FASTPATH		0x01
+#define SEND_ACCESS_ALLOW	0x02
+#define SEND_ACCESS_DENY	0x04
+#define SEND_DELAY		0x08
+#define SEND_SURVIVE_WRITE	0x10
+
+#define SEND_ACCESS		(SEND_ACCESS_ALLOW | SEND_ACCESS_DENY)
+
+/* file descriptor we use for fanotify */
+int fan_fd;
+int get_evicted = 0;
+
+static int __send_fastpath(int fd, unsigned int mask)
+{
+	struct fanotify_so_fastpath fastpath;
+
+	fastpath.fd = fd;
+	fastpath.mask = mask;
+
+	return setsockopt(fan_fd, SOL_FANOTIFY, FANOTIFY_SET_FASTPATH, &fastpath, sizeof(struct fanotify_so_fastpath));
+}
+
+static int send_fastpath(struct fanotify_event_metadata *data)
+{
+	return __send_fastpath(data->fd, data->mask);
+}
+
+static int send_survive_write(struct fanotify_event_metadata *data)
+{
+	return __send_fastpath(data->fd, FAN_SURVIVE_WRITE);
+}
+
+static int send_access(struct fanotify_event_metadata *data, int allow)
+{
+	struct fanotify_so_access access;
+
+	access.cookie = data->cookie;
+	if (allow)
+		access.response = FAN_ALLOW;
+	else
+		access.response = FAN_DENY;
+
+	return setsockopt(fan_fd, SOL_FANOTIFY, FANOTIFY_SEND_RESPONSE, &access, sizeof(struct fanotify_so_access));
+}
+
+static int send_delay(struct fanotify_event_metadata *data)
+{
+	struct fanotify_so_access access;
+
+	access.cookie = data->cookie;
+	access.response = FAN_RESETTIMER;
+
+	return setsockopt(fan_fd, SOL_FANOTIFY, FANOTIFY_SEND_RESPONSE, &access, sizeof(struct fanotify_so_access));
+}
+
+/* timeout is in msec */
+static int send_set_timeout(int timeout)
+{
+	return setsockopt(fan_fd, SOL_FANOTIFY, FANOTIFY_SET_TIMEOUT, &timeout, sizeof(int));
+}
+
+static void clear_all_fastpath(int signum __attribute__((unused)))
+{
+	setsockopt(fan_fd, SOL_FANOTIFY, FANOTIFY_CLEAR_ALL_FASTPATH, NULL, 0);
+}
+
+static int get_response_opts(struct fanotify_event_metadata *data, char *path)
+{
+	unsigned int flags = 0;
+
+	if (data->fd < 0)
+		return flags;
+
+	if (data->mask & FAN_ALL_EVENTS_PERM)
+		flags = SEND_ACCESS_ALLOW;
+
+	if (!strcmp("/root/denyme", path)) {
+		flags &= ~SEND_ACCESS_ALLOW;
+		flags |= SEND_ACCESS_DENY;
+	} else if (!strcmp("/root/ignoreme", path)) {
+		flags = 0;
+	} else if (!strcmp("/root/delayme", path)) {
+		if (data->mask & FAN_ALL_EVENTS_PERM)
+			flags |= SEND_DELAY;
+	} else if (!strcmp("/root/surviveme", path)) {
+		flags |= (SEND_SURVIVE_WRITE | SEND_FASTPATH);
+	} else if (!strcmp("/root/evictme", path)) {
+		get_evicted = 1;
+	} else {
+		flags |= SEND_FASTPATH;
+
+		if (data->mask & FAN_ALL_EVENTS_PERM)
+			flags |= SEND_ACCESS_ALLOW;
+	}
+
+	if (get_evicted)
+		flags = 0;
+
+	return flags;
+}
+
+static int bind_fan_fd(char *argv[])
+{
+	struct fanotify_addr addr;
+	unsigned int priority;
+	unsigned int group_num;
+	unsigned int mask;
+
+	group_num = atoi(argv[1]);
+	priority = atoi(argv[2]);
+	mask = strtol(argv[3], (char **) NULL, 16);
+
+	memset(&addr, 0, sizeof(struct fanotify_addr));
+	addr.group_num = group_num;
+	addr.priority = priority;
+	addr.mask = mask;
+
+	return bind(fan_fd, (struct sockaddr *)&addr, sizeof(struct fanotify_addr));
+}
+
+int main(int argc, char *argv[])
+{
+	int ret;
+	unsigned long cnt;
+	socklen_t len;
+	sighandler_t sig;
+	struct fanotify_event_metadata metadata;
+	char readbuf[4096];
+	char path[4096];
+	unsigned int response_opts;
+
+	if (argc != 4) {
+		fprintf(stderr, "usage: %s group_num priority mask\n", argv[0]);
+		return 1;
+	}
+
+	sig = signal(SIGUSR1, clear_all_fastpath);
+	if (sig == SIG_ERR) {
+		fprintf(stderr, "unable to set the signal handler\n");
+		return 1;
+	}
+
+	fan_fd = socket(PF_FANOTIFY, SOCK_RAW, 0);
+	if (fan_fd == -1) {
+		perror("Creating socket");
+		return 1;
+	}
+
+	ret = bind_fan_fd(argv);
+	if (ret) {
+		perror("Binding");
+		goto out;
+	}
+
+	ret = send_set_timeout(1000);
+	if (ret) {
+		perror("Setting timeout");
+		goto out;
+	}
+
+	cnt = 0;
+
+	while (1) {
+		len = sizeof(struct fanotify_event_metadata);
+		/* get an event */
+		ret = getsockopt(fan_fd, SOL_FANOTIFY, FANOTIFY_GET_EVENT, &metadata, &len);
+		if (ret) {
+			perror("getsockopt");
+			goto out;
+		}
+
+		/* get the path */
+		if (metadata.fd >= 0) {
+			snprintf(readbuf, sizeof(readbuf), "/proc/self/fd/%d", metadata.fd);
+			ret = readlink(readbuf, path, sizeof(path)-1);
+			if (ret > 0)
+				path[ret] = '\0';
+			else
+				strcpy(path, "<error>");
+		} else {
+			strcpy(path, strerror(-metadata.fd));
+		}
+
+		/* using the event and path figure out how to respond */
+		response_opts = get_response_opts(&metadata, path);
+
+		/* should we send a fastpath? */
+		if (response_opts & SEND_FASTPATH) {
+			ret = send_fastpath(&metadata);
+			if (ret)
+				perror("Sending fastpath");
+		}
+
+		/* should out fastpath entry survive writes? */
+		if (response_opts & SEND_SURVIVE_WRITE) {
+			ret = send_survive_write(&metadata);
+			if (ret)
+				perror("Sending survive write");
+		}
+
+		/* should we delay this operation? */
+		if (response_opts & SEND_DELAY) {
+			int i;
+			for (i = 0; i < 14; i++) {
+				ret = send_delay(&metadata);
+				if (ret)
+					perror("Sending delay");
+				/* timeout is 1,000,000 so sleep for 750,000 */
+				usleep(750000);
+			}
+		}
+		/* do we need to send an access response? */
+		if (response_opts & SEND_ACCESS) {
+			ret = send_access(&metadata, !!(response_opts & SEND_ACCESS_ALLOW));
+			if (ret)
+				perror("Sending access");
+		}
+
+		/* dump something on the console so we know what events we took */
+		fprintf(stdout, "#%lu: mask:%x cookie:%llu fd:%d pid:%d tgid:%d f_flags:%x path:%s\n",
+			cnt++, metadata.mask, metadata.cookie, metadata.fd, metadata.pid, metadata.tgid,
+			metadata.f_flags, path);
+
+		/* if we got an open fd we need to close it */
+		if (metadata.fd >= 0)
+			close(metadata.fd);
+	};
+
+out:
+	close(fan_fd);
+	return ret;
+}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/