linux-kernel - Re: [RFC] writev() semantics with invalid iovec in the middle

Message-ID: <CALXu0Uf3wimbDexXbxfu8LQRiE97KzjKsuyOF46x8TVK_T1iBQ@mail.gmail.com>
Date:   Fri, 16 Sep 2016 00:32:46 +0200
From:   Cedric Blancher <cedric.blancher@...il.com>
To:     Al Viro <viro@...iv.linux.org.uk>
Cc:     Mike Marshall <hubcap@...ibond.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [RFC] writev() semantics with invalid iovec in the middle

PAGE_SIZE isn't accurate on architectures which support multiple page
sizes, e.g. 8k, 64k, 512k, 4M, 32M and 256M on SPARC64, and likewise on
PPC64/Power.

Ced

On 16 September 2016 at 00:29, Al Viro <viro@...iv.linux.org.uk> wrote:
> On Thu, Sep 15, 2016 at 06:23:24AM -0400, Mike Marshall wrote:
>> If you squeeze out every byte won't you still have a short
>> write? And the written data wouldn't be cut at the bad
>> place, but it would have a weird hole or discontinuity there.
>
> ???
>
> What I mean is that if we have an invalid address in the middle of a buffer
> (unmapped, for example), we do not attempt to write every byte prior to that
> invalid address.  Of course what we write is going to be contiguous.
>
> Suppose we have a buffer spanning 10 pages (amd64, so these are 4K ones) -
> 7 valid, 3 invalid:
>         VVVVIIIVVV
> and it starts 100 bytes into the first page.  And write goes into a regular
> file on e.g. tmpfs, starting at offset 31.  We _can't_ write more than
> 4*4096-100 bytes, no matter what.  It will be a short write.  As a matter
> of fact, it will be even shorter than that - it will be 3*4096-31 bytes,
> up to the last pagecache boundary we can cover completely.  That obviously
> depends upon the filesystem - not everything uses pagecache, for starters.
> However, the caller is *not* guaranteed that write() with an invalid page
> in the middle of a buffer would write everything up to the very beginning
> of the invalid page.  A short write will happen, but the amount written
> might be up to page size less than the actual length of valid part in the
> beginning of the buffer.
>
> Now, for writev() we could have invalid pages in any iovec; again, we
> obviously can't write anything past the first invalid page - we'll get
> either a short write or -EFAULT (if nothing got written).  That's fine;
> the question is what the caller can count upon wrt shortening.
>
> Again, we are *not* guaranteed writing up to exact boundary.  However, the
> current implementation will end up shortening no more than to the iovec
> boundary.  I.e. if the first iovec contains only valid pages and there's
> an invalid one in the second iovec, the current implementation will write
> at least everything in the first iovec.  That's _not_ promised by POSIX
> or our manpages; moreover, I'm not sure if it's even true for each filesystem.
> And keeping that property is actually inconvenient - if we could discard it,
> we could make partial-copy ->write_end() calls a lot more infrequent.
>
> Unfortunately, some of the LTP writev tests end up checking that writev()
> does behave that way - they feed it a three-element iovec with
> shorter-than-page segments, the second of which is all invalid.  And they
> check that the entire first segment has been written.
>
> I would really like to drop that property, making it "if some addresses
> in the buffer(s) we are asked to write are invalid, the write will be
> shortened by up to a PAGE_SIZE from the first such invalid address", making
> writev() rules exactly the same as write() ones.  Does anybody have objections
> to it?
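
The arithmetic in the quoted write() example can be checked with a short
sketch.  The Python below only illustrates those numbers and is not kernel
code: the helper names valid_prefix_len and pagecache_short_write_len are
hypothetical, and the model assumes a pagecache-backed file (as in the tmpfs
case above) where the write stops at the last pagecache boundary that the
valid data covers completely.

```python
def valid_prefix_len(page_size, buf_offset, valid_pages):
    """Bytes from the start of the buffer up to the first invalid page.

    buf_offset is how far into its first page the buffer starts;
    valid_pages counts the valid pages before the first invalid one.
    """
    return valid_pages * page_size - buf_offset


def pagecache_short_write_len(page_size, buf_offset, valid_pages, file_offset):
    """Bytes written under the assumed model: stop at the last pagecache
    boundary that the valid prefix reaches when written at file_offset."""
    end = file_offset + valid_prefix_len(page_size, buf_offset, valid_pages)
    last_boundary = end - end % page_size
    # If the valid prefix crosses no boundary at all, the quoted text gives
    # no figure; this sketch simply reports 0 for that case (an assumption).
    return max(last_boundary - file_offset, 0)


if __name__ == "__main__":
    # 4K pages, buffer starts 100 bytes into its first page,
    # 4 valid pages before the first invalid one, file offset 31.
    print(valid_prefix_len(4096, 100, 4))               # 4*4096-100
    print(pagecache_short_write_len(4096, 100, 4, 31))  # 3*4096-31
```

For the quoted case this gives 4*4096-100 = 16284 as the upper bound and
3*4096-31 = 12257 for the write that stops at a pagecache boundary.  Passing
a different page_size shows how both figures move when the base page size is
not 4K.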



-- 
Cedric Blancher <cedric.blancher@...il.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur