lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 04 Jun 2018 09:44:47 -0500
From:   ebiederm@...ssion.com (Eric W. Biederman)
To:     "Hatayama\, Daisuke" <d.hatayama@...fujitsu.com>
Cc:     "'gregkh\@linuxfoundation.org'" <gregkh@...uxfoundation.org>,
        "'tj\@kernel.org'" <tj@...nel.org>,
        "Okajima\, Toshiyuki" <toshi.okajima@...fujitsu.com>,
        "linux-kernel\@vger.kernel.org" <linux-kernel@...r.kernel.org>,
        "'ebiederm\@aristanetworks.com'" <ebiederm@...stanetworks.com>
Subject: Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.

"Hatayama, Daisuke" <d.hatayama@...fujitsu.com> writes:

> Hello,
>
> Thanks for this patch, Eric.
>
>> -----Original Message-----
>> From: Eric W. Biederman [mailto:ebiederm@...ssion.com]
>> Sent: Monday, June 4, 2018 3:51 AM
>> To: Hatayama, Daisuke <d.hatayama@...fujitsu.com>
>> Cc: 'gregkh@...uxfoundation.org' <gregkh@...uxfoundation.org>;
>> 'tj@...nel.org' <tj@...nel.org>; Okajima, Toshiyuki 
>> <toshi.okajima@...fujitsu.com>; linux-kernel@...r.kernel.org;
>> 'ebiederm@...stanetworks.com' <ebiederm@...stanetworks.com>
>> Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
>> 
>> 
>> Over the years two bugs have crept into the code that handled the last
>> returned kernfs directory being deleted, and handled general seeking
>> in kernfs directories.  The result is that reading the /sys/module
>> directory while moduled are loading and unloading will sometimes
>> skip over a module that was present for the entire duration of
>> the directory read.
>> 
>> The first bug is that kernfs_dir_next_pos was advancing the position
>> in the directory even when kernfs_dir_pos had found the starting
>> kernfs directory entry was not usable.  In this case kernfs_dir_pos is
>> guaranteed to return the directory entry past the starting directory
>> entry, making it so that advancing the directory position is wrong.
>> 
>> The second bug is that kernfs_dir_pos when the saved kernfs_node
>> did not validate, was not always returning a directory after
>> the the target position, but was sometimes returning the directory
>> entry just before the target position.
>> 
>> To correct these issues and to make the code easily readable and
>> comprehensible several cleanups are made.
>> 
>> - A function kernfs_dir_next is factored out to make it straight-forward
>>   to find the next node in a kernfs directory.
>> 
>> - A function kernfs_dir_skip_pos is factored out to skip over directory
>>   entries that should not be shown to userspace.  This function is called
>>   from kernfs_fop_readdir directly removing the complication of calling
>>   it from the other directory advancement functions.
>> 
>> - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos
>>   to kernfs_fop_readdir.  Keeping it with the kernfs_get when it is saved
>>   and ensuring the directory position advancment functions can reference
>>   the saved node without complications.
>> 
>> - The function kernfs_dir_next_pos is modified to only advance the directory
>>   position in the common case when kernfs_dir_pos validates and returns
>>   the starting kernfs node.  For all other cases kernfs_dir_pos is guaranteed
>>   to return a directory position in advance of the starting directory position.
>> 
>> - kernfs_dir_pos is given a significant make over.  It's essense remains the
>>   same but some siginficant changes were made.
>> 
>>   + The offset is validated to be a valid hash value at the very beginning.
>>     The validation is updated to handle the fact that neither 0 nor 1 are
>>     valid hash values.
>> 
>>   + An extra test is added at the end of the rbtree lookup to ensure
>>     the returned position is at or beyond the target position.
>> 
>>   + kernfs_name_compare is used during the rbtree lookup instead of just
>> comparing
>>     the hash.  This allows the the passed namespace parameter to be used, and
>>     if it is available from the saved entry the old saved name as well.
>> 
>>   + The validation of the saved kernfs node now checks for two cases.
>>     If the saved node is returnable.
>>     If the name of the saved node is usable during lookup.
>> 
>> In net this should result in a easier to understand, and more correct
>> version of directory traversal for kernfs directories.
>>
>> Reported-by: "Hatayama, Daisuke" <d.hatayama@...fujitsu.com>
>> Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number lookup")
>> Fixes: 1e5289c97bba ("sysfs: Cache the last sysfs_dirent to improve readdir
>> scalability v2")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@...ssion.com>
>> ---
>> 
>> Can you test this and please verify it fixes your issue?
>>
>
> I tried this patch on top of v4.17 but the system fails to boot
> without detecting root disks by dracut like this:
>
>     [  196.121294] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  196.672175] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  197.219519] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  197.768997] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  198.312498] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  198.856841] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  199.403190] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  199.945999] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  200.490281] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  201.071177] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
>     [  201.074958] dracut-initqueue[324]: Warning: Could not boot.
> 	     Starting Setup Virtual Console...
>     [  OK  ] Started Setup Virtual Console.
>     [  201.245921] audit: type=1130 audit(1528132266.260:12): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
> 	     [  201.256378] audit: type=1131 audit(1528132266.260:13): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
>     Starting Dracut Emergency Shell...
>     Warning: /dev/fedora/root does not exist
>     Warning: /dev/fedora/swap does not exist
>     Warning: /dev/mapper/fedora-root does not exist
>
>     Generating "/run/initramfs/rdsosreport.txt"
>
>
>     Entering emergency mode. Exit the shell to continue.
>     Type "journalctl" to view system logs.
>     You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
>     after mounting them and attach it to a bug report.
>
>
>     dracut:/# ls /sys/module
>     8139cp        dm_bufio        kernel       psmouse      ttm
>     8139too       dm_mirror       keyboard     pstore       uhci_hcd
>     8250          dm_mod          kgdboc       qemu_fw_cfg  usbcore
>     acpi          dm_snapshot     kgdbts       qxl          usbhid
>     acpiphp       drm             libahci      random       uv_nmi
>     ahci          drm_kms_helper  libata       rcupdate     virtio
>     ata_generic   dynamic_debug   md_mod       rcutree      virtio_console
>     ata_piix      edac_core       mii          rng_core     virtio_pci
>     battery       ehci_hcd        module       scsi_mod     virtio_ring
>     block         fb              mousedev     serio_raw    vt
>     button        firmware_class  pata_acpi    sg           watchdog
>     configfs      fscrypto        pci_hotplug  slab_common  workqueue
>     cpufreq       hid             pcie_aspm    spurious     xhci_hcd
>     cpuidle       hid_magicmouse  pciehp       sr_mod       xz_dec
>     crc32c_intel  hid_ntrig       pcmcia       srcutree     zswap
>     cryptomgr     i8042           pcmcia_core  suspend
>     debug_core    intel_idle      printk       sysrq
>     devres        ipv6            processor    tcp_cubic
>     dracut:/# lsmod
>     Module                  Size  Used by
>     qxl                    77824  1
>     drm_kms_helper        196608  1
>     ttm                   126976  1
>     drm                   454656  4 qxl,ttm
>     virtio_console         32768  0
>     ata_generic            16384  0
>     crc32c_intel           24576  0
>     8139too                40960  0
>     virtio_pci             28672  0
>     8139cp                 32768  0
>     virtio_ring            28672  2 virtio_pci
>     serio_raw              16384  0
>     pata_acpi              16384  0
>     virtio                 16384  2 virtio_pci
>     mii                    16384  2 8139too
>     qemu_fw_cfg            16384  0
>     dracut:/#
>
> OTOH, there's no issue on the pure v4.17 kernel.
>
> As above, ls /sys/module looks apparently good. But I guess any part of
> behavior of getdentries() on sysfs must have changed, affecting the disk
> detection...

I haven't been able to reproduce this yet.  My test system boots fine.
Which fedora are you testing on?


>>  fs/kernfs/dir.c | 109
>> ++++++++++++++++++++++++++++++++++----------------------
>>  1 file changed, 67 insertions(+), 42 deletions(-)
>> 
>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>> index 89d1dc19340b..8148b5fec48d 100644
>> --- a/fs/kernfs/dir.c
>> +++ b/fs/kernfs/dir.c
>> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode *inode,
>> struct file *filp)
>>  	return 0;
>>  }
>> 
>> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
>> +{
>> +	struct rb_node *node = rb_next(&pos->rb);
>> +	return node ? rb_to_kn(node) : NULL;
>> +}
>> +
>>  static struct kernfs_node *kernfs_dir_pos(const void *ns,
>> -	struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
>> +	struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
>>  {
>> -	if (pos) {
>> -		int valid = kernfs_active(pos) &&
>> -			pos->parent == parent && hash == pos->hash;
>> -		kernfs_put(pos);
>> -		if (!valid)
>> -			pos = NULL;
>> -	}
>> -	if (!pos && (hash > 1) && (hash < INT_MAX)) {
>> -		struct rb_node *node = parent->dir.children.rb_node;
>> -		while (node) {
>> -			pos = rb_to_kn(node);
>> -
>> -			if (hash < pos->hash)
>> -				node = node->rb_left;
>> -			else if (hash > pos->hash)
>> -				node = node->rb_right;
>> -			else
>> -				break;
>> +	struct kernfs_node *pos;
>> +	struct rb_node *node;
>> +	unsigned int hash;
>> +	const char *name = "";
>> +
>> +	/* Is off a valid name hash? */
>> +	if ((off < 2) || (off >= INT_MAX))
>> +		return NULL;
>> +	hash = off;
>> +
>> +	/* Is the saved position usable? */
>> +	if (saved) {
>> +		/* Proper parent and hash? */
>> +		if ((parent != saved->parent) || (saved->hash != hash)) {
>> +			saved = NULL;
>
> name is uninitialized in this path.

It is.  name is initialized to "" see above.

>> +		} else {
>> +			if (kernfs_active(saved))
>> +				return saved;
>> +			name = saved->name;
>>  		}
>>  	}
>> -	/* Skip over entries which are dying/dead or in the wrong namespace
>> */
>> -	while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
>> -		struct rb_node *node = rb_next(&pos->rb);
>> -		if (!node)
>> -			pos = NULL;
>> +
>> +	/* Find the closest pos to the hash we are looking for */
>> +	pos = NULL;
>> +	node = parent->dir.children.rb_node;
>> +	while (node) {
>> +		int result;
>> +
>> +		pos = rb_to_kn(node);
>> +		result = kernfs_name_compare(hash, name, ns, pos);
>> +		if (result < 0)
>> +			node = node->rb_left;
>> +		else if (result > 0)
>> +			node = node->rb_right;
>>  		else
>> -			pos = rb_to_kn(node);
>> +			break;
>>  	}
>> +
>> +	/* Ensure pos is at or beyond the target position */
>> +	if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
>> +		pos = kernfs_dir_next(pos);
>> +
>
> Why not trying to find the final target object here?
>
> Looking at the code, I think the operation needed here is to find the
> smallest active object in the same namespace from the objects larger
> than the saved object given as argument. The saved object appears to
> be used as cache. I think dividing the process into kernfs_dir_pos()
> is not necessary.

What you are suggesting below is wrong when restarting readdir.

Ordinarily saved is the next entry we are going to return from readdir.
When dir_emit does not succeed we stop returnning entries and when
we restart readdir we need to attempt dir_emit on saved again.

Always advancing past saved will cause us to skip saved in the
event a single readdir is not enough.

The restart (kernfs_dir_pos) must return saved or if saved is now gone,
the first entry past saved.

Saved dying in the middle is what I believe caused the original issue.

Making all of this clear is part of the reason I moved the
kernfs_dir_skip_pos logic into it's own function.

Eric

>
> I mean like this:
>
> # git diff
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 8148b5f..eeffc8c 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1605,13 +1605,13 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
>
>         /* Is the saved position usable? */
>         if (saved) {
> +               name = saved->name;
>                 /* Proper parent and hash? */
>                 if ((parent != saved->parent) || (saved->hash != hash)) {
>                         saved = NULL;
> -               } else {
> -                       if (kernfs_active(saved))
> -                               return saved;
> -                       name = saved->name;
> +               } else if (kernfs_active(saved)) {
> +                       pos = saved;
> +                       goto skip;
>                 }
>         }
>
> @@ -1631,22 +1631,14 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
>                         break;
>         }
>
> +skip:
>         /* Ensure pos is at or beyond the target position */
> -       if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
> +       if (pos && (kernfs_name_compare(hash, name, ns, pos) <= 0))
>                 pos = kernfs_dir_next(pos);
>
>         return pos;
>  }
>
> -static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> -       struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
> -{
> -       struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
> -       if (likely(pos == start))
> -               pos = kernfs_dir_next(pos);
> -       return pos;
> -}
> -
>  static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
>                                                struct kernfs_node *pos)
>  {
> @@ -1672,7 +1664,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
>
>         for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
>              (pos = kernfs_dir_skip_pos(ns, pos));
> -            pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
> +            pos = kernfs_dir_pos(ns, parent, ctx->pos, pos)) {
>                 const char *name = pos->name;
>                 unsigned int type = dt_type(pos);
>                 int len = strlen(name);
>
>>  	return pos;
>>  }
>> 
>>  static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
>> -	struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
>> +	struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
>>  {
>> -	pos = kernfs_dir_pos(ns, parent, ino, pos);
>> -	if (pos) {
>> -		do {
>> -			struct rb_node *node = rb_next(&pos->rb);
>> -			if (!node)
>> -				pos = NULL;
>> -			else
>> -				pos = rb_to_kn(node);
>> -		} while (pos && (!kernfs_active(pos) || pos->ns != ns));
>> -	}
>> +	struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
>> +	if (likely(pos == start))
>> +		pos = kernfs_dir_next(pos);
>> +	return pos;
>> +}
>> +
>> +static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
>> +					       struct kernfs_node *pos)
>> +{
>> +	/* Skip entries we don't want to return to userspace */
>> +	while (pos && !(kernfs_active(pos) && (pos->ns == ns)))
>> +		pos = kernfs_dir_next(pos);
>>  	return pos;
>>  }
>> 
>> @@ -1638,7 +1660,7 @@ static int kernfs_fop_readdir(struct file *file, struct
>> dir_context *ctx)
>>  {
>>  	struct dentry *dentry = file->f_path.dentry;
>>  	struct kernfs_node *parent = kernfs_dentry_node(dentry);
>> -	struct kernfs_node *pos = file->private_data;
>> +	struct kernfs_node *pos, *saved = file->private_data;
>>  	const void *ns = NULL;
>> 
>>  	if (!dir_emit_dots(file, ctx))
>> @@ -1648,23 +1670,26 @@ static int kernfs_fop_readdir(struct file *file,
>> struct dir_context *ctx)
>>  	if (kernfs_ns_enabled(parent))
>>  		ns = kernfs_info(dentry->d_sb)->ns;
>> 
>> -	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
>> -	     pos;
>> +	for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
>> +	     (pos = kernfs_dir_skip_pos(ns, pos));
>>  	     pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
>>  		const char *name = pos->name;
>>  		unsigned int type = dt_type(pos);
>>  		int len = strlen(name);
>>  		ino_t ino = pos->id.ino;
>> 
>> -		ctx->pos = pos->hash;
>> -		file->private_data = pos;
>> -		kernfs_get(pos);
>> +		kernfs_put(saved);
>> +		saved = pos;
>> +		ctx->pos = saved->hash;
>> +		file->private_data = saved;
>> +		kernfs_get(saved);
>> 
>>  		mutex_unlock(&kernfs_mutex);
>>  		if (!dir_emit(ctx, name, len, ino, type))
>>  			return 0;
>>  		mutex_lock(&kernfs_mutex);
>>  	}
>> +	kernfs_put(saved);
>>  	mutex_unlock(&kernfs_mutex);
>>  	file->private_data = NULL;
>>  	ctx->pos = INT_MAX;
>> --
>> 2.14.1
>> 
>>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ