Message-ID: <m1bpebfrhv.fsf@fess.ebiederm.org>
Date:	Fri, 26 Mar 2010 00:53:48 -0700
From:	ebiederm@...ssion.com (Eric W. Biederman)
To:	Neil Brown <neilb@...e.de>
Cc:	Greg Kroah-Hartman <gregkh@...e.de>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/3] sysfs: simplify handling for s_active refcount

Neil Brown <neilb@...e.de> writes:

> On Thu, 25 Mar 2010 21:24:43 -0700
> ebiederm@...ssion.com (Eric W. Biederman) wrote:
>
>> NeilBrown <neilb@...e.de> writes:
>> 
>> > s_active counts the number of active references to a 'sysfs_dirent'.
>> > When we wish to deactivate a sysfs_dirent, we subtract a large
>> > number from the refcount so it will always appear negative.  When
>> > it is negative, new references will not be taken.
>> > After that subtraction, we wait for all the active references to
>> > drain away.
>> >
>> > The subtraction of the large number contains exactly the same
>> > information as the setting of the flag SYSFS_FLAG_REMOVED.
>> > (We know this as we already assert that SYSFS_FLAG_REMOVED is set
>> > before adding the large-negative-bias).
>> > So doing both is pointless.
>> >
>> > By starting s_active with a value of 1, not 0 (as is typical of
>> > reference counts) and using atomic_inc_not_zero, we can significantly
>> > simplify the code while keeping exactly the same functionality.
>> 
>> Overall your logic appears correct but in detail this patch scares me.
>> 
>> sd->s_flags is protected by the sysfs_mutex, and you aren't
>> taking it when you read it.  So in general I don't see the new check
>> if (sd->s_flags & SYSFS_FLAG_REMOVED) == 0 providing any guarantee of
>> progress whatsoever with user-space applications repeatedly reading from
>> a sysfs file while that sysfs file is being removed.  They could easily
>> have the sd->s_flags value cached and never see the new value, given a
>> crazy enough cache architecture.
>
> As you say, this is only a liveness issue.  The atomic_inc_not_zero
> guarantees that we don't take a new reference after the last one is gone.
> The test on SYSFS_FLAG_REMOVED is only there to ensure that the count does
> eventually get to zero.
> There could only be a problem here if the change to s_flags was not
> propagated to all CPUs in some reasonably small time.

Having this handled correctly is a key part of the abstraction
implemented by sysfs_get_active, sysfs_put_active, and
sysfs_deactivate.  It is a requirement that userspace cannot cause
denial-of-service attacks on the rest of the kernel.

> I'm no expert on these things, but it was my understanding that interesting
> cache architectures can arbitrarily re-order accesses but do not delay
> them indefinitely.
> Inserting barriers could possibly make this more predictable, but that would
> just delay certain loads/stores until a known state was reached - it would
> not make the data visible at another CPU any faster (or would it?).

> So unless there is no cache-coherency protocol at all, I think that
> SYSFS_FLAG_REMOVED will be seen promptly and that s_active will drop to zero
> as quickly as it does today.

This is my problem with your patch.  You have replaced something that
used an existing abstraction properly, with something that does not
use any abstraction at all, and we are reduced to talking about CPU
memory-ordering requirements.

Furthermore, the code you are changing is not used raw; it is the
building block for a well-defined abstraction.  The users of
sysfs_get_active, sysfs_put_active, and sysfs_deactivate don't have to
know the details of how they are implemented in order to reason about
them accurately.

I am very much in favor of a general primitive that we can use for
blocking ref counts.  We have them in sysfs, proc, sysctl, and a few
other places with similar requirements.  Each location has a different
implementation, but essentially the same semantics.

What is really needed is something with the semantics of:
  read_trylock
  read_unlock
  write_lock (or perhaps synchronize?)

With multiple readers inside the read-side critical section, biased
so that the read side is very cheap, and with a data structure whose
size is very cheap.  We have an ugly but otherwise near-optimal
implementation in sysfs.  srcu is close, but its data structure
costs too much to use on every sysfs dirent.
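A primitive with those semantics might look roughly like this (a pthread-based userspace sketch, not kernel code, and with illustrative names):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Sketch of a blocking refcount with rwlock-like semantics:
 * tryget/put are the read_trylock/read_unlock pair, and
 * kill_and_drain is the write_lock/synchronize step that refuses
 * new readers and waits for existing ones to finish. */
struct block_ref {
    pthread_mutex_t lock;
    pthread_cond_t  drained;
    int  readers;
    bool dead;
};

static void block_ref_init(struct block_ref *r)
{
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->drained, NULL);
    r->readers = 0;
    r->dead = false;
}

/* read_trylock(): take a reference unless a writer has shut us down. */
static bool block_ref_tryget(struct block_ref *r)
{
    pthread_mutex_lock(&r->lock);
    bool ok = !r->dead;
    if (ok)
        r->readers++;
    pthread_mutex_unlock(&r->lock);
    return ok;
}

/* read_unlock(): drop the reference, waking a waiting writer if last. */
static void block_ref_put(struct block_ref *r)
{
    pthread_mutex_lock(&r->lock);
    if (--r->readers == 0)
        pthread_cond_broadcast(&r->drained);
    pthread_mutex_unlock(&r->lock);
}

/* write_lock()/synchronize: refuse new readers, wait for the rest. */
static void block_ref_kill_and_drain(struct block_ref *r)
{
    pthread_mutex_lock(&r->lock);
    r->dead = true;
    while (r->readers > 0)
        pthread_cond_wait(&r->drained, &r->lock);
    pthread_mutex_unlock(&r->lock);
}
```

Note that this sketch takes a mutex on the read side, which fails the "read side is very cheap" requirement above; a real implementation would need a lock-free fast path (atomics, or per-CPU counters) with the sleeping confined to the drain step.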

Improving this and getting it right is very attractive.  The code that
exists today is correct and signals that something fishy is going on,
so there is no point in doing something less than right.

Eric

