linux-kernel - Re: [RFC PATCH v2] ceph: ceph: fix out-of-bound array access when doing a file read

Message-ID: <87mshj8dbg.fsf@orpheu.olymp>
Date: Thu, 28 Nov 2024 17:42:43 +0000
From: Luis Henriques <luis.henriques@...ux.dev>
To: Alex Markuze <amarkuze@...hat.com>
Cc: Luis Henriques <luis.henriques@...ux.dev>,  Goldwyn Rodrigues
 <rgoldwyn@...e.de>,  Xiubo Li <xiubli@...hat.com>,  Ilya Dryomov
 <idryomov@...il.com>,  ceph-devel@...r.kernel.org,
  linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v2] ceph: ceph: fix out-of-bound array access when
 doing a file read

Hi Alex,

[ Thank you for looking into this. ]

On Wed, Nov 27 2024, Alex Markuze wrote:

> Hi, Folks.
> AFAIK there is no side effect that can affect MDS with this fix.
> This crash happens following this patch
> "1065da21e5df9d843d2c5165d5d576be000142a6" "ceph: stop copying to iter
> at EOF on sync reads".
>
> Per your fix Luis, it seems to address only the cases when i_size goes
> to zero, but the crash can happen anytime `i_size` goes below `off`.
> I propose fixing it this way:

Hmm... you're probably right.  I didn't see this happening, but I guess it
could indeed happen.

> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 4b8d59ebda00..19b084212fee 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1066,7 +1066,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>         if (ceph_inode_is_shutdown(inode))
>                 return -EIO;
>
> -       if (!len)
> +       if (!len || !i_size)
>                 return 0;
>         /*
>          * flush any page cache pages in this range.  this
> @@ -1200,12 +1200,11 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>                 }
>
>                 idx = 0;
> -               if (ret <= 0)
> -                       left = 0;

Right now I don't have any means for testing this patch.  However, I don't
think this is completely correct.  By removing the above condition you're
discarding cases where an error has occurred (i.e. where ret is negative).

Why not simply modify my patch and do:

		if (i_size < off)
			ret = 0;

instead of:
		if (i_size == 0)
			ret = 0;

?

(Again, totally untested!)
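
To make the suggestion concrete, here is a rough userspace sketch (not
kernel code, and just as untested against a real cluster) of how 'left'
would be computed with the 'i_size < off' check placed before the existing
if/else chain.  The helper name compute_left() and the standalone types are
assumptions for illustration only; in the kernel this logic is inline in
__ceph_sync_read(), and the sketch does not model what overwriting 'ret'
means for the rest of that function.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/*
 * Hypothetical helper: models only the 'left' computation.
 * off and i_size stand in for the kernel's loff_t values.
 */
static size_t compute_left(int64_t off, ssize_t ret, int64_t i_size)
{
	size_t left;

	/* Suggested check: EOF is before the read offset, copy nothing. */
	if (i_size < off)
		ret = 0;

	if (ret <= 0)
		left = 0;		/* errors and empty reads copy nothing */
	else if (off + ret > i_size)
		left = i_size - off;	/* cannot go negative: i_size >= off here */
	else
		left = ret;

	return left;
}
```

With this ordering the original 'ret <= 0' branch is kept, so a negative
'ret' still results in nothing being copied, and the subtraction is only
reached when 'i_size >= off'.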

Cheers,
-- 
Luís

> -               else if (off + ret > i_size)
> -                       left = i_size - off;
> +               if (off + ret > i_size)
> +                       left = (i_size > off) ? i_size - off : 0;
>                 else
> -                       left = ret;
> +                       left = (ret > 0) ? ret : 0;
> +
>                 while (left > 0) {
>                         size_t plen, copied;
>
>
> On Thu, Nov 7, 2024 at 1:09 PM Luis Henriques <luis.henriques@...ux.dev> wrote:
>>
>> (CC'ing Alex)
>>
>> On Wed, Nov 06 2024, Goldwyn Rodrigues wrote:
>>
>> > Hi Xiubo,
>> >
>> >> BTW, so in the following code:
>> >>
>> >> 1202                 idx = 0;
>> >> 1203                 if (ret <= 0)
>> >> 1204                         left = 0;
>> >> 1205                 else if (off + ret > i_size)
>> >> 1206                         left = i_size - off;
>> >> 1207                 else
>> >> 1208                         left = ret;
>> >>
>> >> The 'ret' should be larger than '0', right ?
>> >>
>> >> If so, why not check and fix it in the 'else if' branch instead?
>> >>
>> >> Because currently the read path code won't exit directly and keep
>> >> retrying to read if it found that the real content length is longer than
>> >> the local 'i_size'.
>> >>
>> >> Again I am afraid your current fix will break the MIX filelock semantic ?
>> >
>> > Do you think changing left to ssize_t instead of size_t will
>> > fix the problem?
>> >
>> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> > index 4b8d59ebda00..f8955773bdd7 100644
>> > --- a/fs/ceph/file.c
>> > +++ b/fs/ceph/file.c
>> > @@ -1066,7 +1066,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>> >       if (ceph_inode_is_shutdown(inode))
>> >               return -EIO;
>> >
>> > -     if (!len)
>> > +     if (!len || !i_size)
>> >               return 0;
>> >       /*
>> >        * flush any page cache pages in this range.  this
>> > @@ -1087,7 +1087,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>> >               size_t page_off;
>> >               bool more;
>> >               int idx;
>> > -             size_t left;
>> > +             ssize_t left;
>> >               struct ceph_osd_req_op *op;
>> >               u64 read_off = off;
>> >               u64 read_len = len;
>> >
>>
>> I *think* (although I haven't tested it) that your patch should work as
>> well.  But I also think it's a bit more hacky: the overflow will still be
>> there:
>>
>>                 if (ret <= 0)
>>                         left = 0;
>>                 else if (off + ret > i_size)
>>                         left = i_size - off;
>>                 else
>>                         left = ret;
>>                 while (left > 0) {
>>                         // ...
>>                 }
>>
>> If 'i_size' is '0', 'left' (which is now signed) will now have a negative
>> value in the 'else if' branch and the loop that follows will not be
>> executed.  My version will simply set 'ret' to '0' before this 'if'
>> construct.
>>
>> So, in my opinion, what needs to be figured out is whether this will cause
>> problems on the MDS side or not.  Because on the kernel client, it should
>> be safe to ignore reads to an inode that has size set to '0', even if
>> there's already data available to be read.  Eventually, the inode metadata
>> will get updated and by then we can retry the read.
>>
>> Unfortunately, the MDS continues to be a huge black box for me and the
>> locking code in particular is very tricky.  I'd rather defer this for
>> anyone that is familiar with the code.
>>
>> Cheers,
>> --
>> Luís
>>
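
The wraparound discussed in the quoted exchange above can be reproduced in
userspace.  The following is a minimal sketch under stated assumptions: the
function names left_unsigned() and left_signed() are made up for
illustration, int64_t stands in for loff_t, and only the if/else chain from
the thread is modelled, not the surrounding read loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Original logic: 'left' is size_t, so a negative difference wraps. */
static size_t left_unsigned(int64_t off, ssize_t ret, int64_t i_size)
{
	size_t left;

	if (ret <= 0)
		left = 0;
	else if (off + ret > i_size)
		left = i_size - off;	/* wraps to a huge value if i_size < off */
	else
		left = ret;
	return left;
}

/* Variant with 'left' as ssize_t: the difference stays negative. */
static ssize_t left_signed(int64_t off, ssize_t ret, int64_t i_size)
{
	ssize_t left;

	if (ret <= 0)
		left = 0;
	else if (off + ret > i_size)
		left = i_size - off;	/* negative if i_size < off */
	else
		left = ret;
	return left;
}
```

For off = 4096, ret = 4096 and i_size = 0, the unsigned version yields a
very large 'left', so a 'while (left > 0)' copy loop would run past the
pages that were actually read.  The signed version yields a negative value,
so the loop is skipped, although the out-of-range subtraction itself is
still performed.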