linux-kernel - Re: [PATCH v3 4/4] io_uring: add support for zone-append


Message-ID: <CA+1E3rK9LCmB4Lt8hTLrCx7bXaF6sETWgm=M6=D6grOnGSgiRQ@mail.gmail.com>
Date:   Mon, 20 Jul 2020 22:16:28 +0530
From:   Kanchan Joshi <joshiiitr@...il.com>
To:     Jens Axboe <axboe@...nel.dk>
Cc:     Christoph Hellwig <hch@...radead.org>,
        Kanchan Joshi <joshi.k@...sung.com>, viro@...iv.linux.org.uk,
        bcrl@...ck.org, Damien.LeMoal@....com, asml.silence@...il.com,
        linux-fsdevel@...r.kernel.org, "Matias Bjørling" <mb@...htnvm.io>,
        linux-kernel@...r.kernel.org, linux-aio@...ck.org,
        io-uring@...r.kernel.org, linux-block@...r.kernel.org,
        Selvakumar S <selvakuma.s1@...sung.com>,
        Nitesh Shetty <nj.shetty@...sung.com>,
        Javier Gonzalez <javier.gonz@...sung.com>
Subject: Re: [PATCH v3 4/4] io_uring: add support for zone-append

On Fri, Jul 10, 2020 at 7:39 PM Jens Axboe <axboe@...nel.dk> wrote:
>
> On 7/10/20 7:10 AM, Christoph Hellwig wrote:
> > On Fri, Jul 10, 2020 at 12:35:43AM +0530, Kanchan Joshi wrote:
> >> Append required special treatment (conversion for sector to bytes) for io_uring.
> >> And we were planning a user-space wrapper to abstract that.
> >>
> >> But good part (as it seems now) was: append result went along with cflags at
> >> virtually no additional cost. And uring code changes became super clean/minimal
> >> with further revisions.
> >> While indirect-offset requires doing allocation/mgmt in application,
> >> io-uring submission and in completion path (which seems trickier), and
> >> those CQE flags still get written user-space and serve no purpose for
> >> append-write.
> >
> > I have to say that storing the results in the CQE generally make
> > so much more sense.  I wonder if we need a per-fd "large CQE" flag
> > that adds two extra u64s to the CQE, and some ops just require this
> > version.
>
> I have been pondering the same thing, we could make certain ops consume
> two CQEs if it makes sense. It's a bit ugly on the app side with two
> different CQEs for a request, though. We can't just treat it as a large
> CQE, as they might not be sequential if we happen to wrap. But maybe
> it's not too bad.

I did some work on the two-CQE scheme for zone-append.
The first CQE is the same as before, while the second CQE does not carry
res/flags and instead holds a 64-bit result that reports the append location.
It would look like this:

struct io_uring_cqe {
        __u64   user_data;      /* sqe->data submission passed back */
-       __s32   res;            /* result code for this event */
-       __u32   flags;
+       union {
+               struct {
+                       __s32   res;            /* result code for this event */
+                       __u32   flags;
+               };
+               __u64   append_res;     /* only used for append, in secondary cqe */
+       };
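
As a rough user-space check of the layout above (a sketch, not the kernel header): the struct name cqe_sketch is made up for illustration, and the fixed-width types stand in for __u64/__s32/__u32. It only shows that overlaying append_res on res/flags keeps the entry at 16 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the proposed CQE layout (C11). */
struct cqe_sketch {
	uint64_t user_data;             /* sqe->data submission passed back */
	union {
		struct {
			int32_t  res;   /* result code for this event */
			uint32_t flags;
		};
		uint64_t append_res;    /* only used for append, in secondary cqe */
	};
};

/* The union shares storage with res/flags, so the CQE size is unchanged. */
static_assert(sizeof(struct cqe_sketch) == 16, "CQE must stay 16 bytes");
static_assert(offsetof(struct cqe_sketch, res) ==
	      offsetof(struct cqe_sketch, append_res),
	      "append_res overlays res/flags");
```

Since the two views share storage, a secondary CQE cannot report res/flags and the append location at the same time, which is why the first CQE still carries them.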

And the kernel will produce two CQEs for an append completion:

static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags)
{
-       struct io_uring_cqe *cqe;
+       struct io_uring_cqe *cqe, *cqe2 = NULL;

-       cqe = io_get_cqring(ctx);
+       if (unlikely(req->flags & REQ_F_ZONE_APPEND))
+               /* obtain two CQEs for append. NULL if two CQEs are not available */
+               cqe = io_get_two_cqring(ctx, &cqe2);
+       else
+               cqe = io_get_cqring(ctx);
+
        if (likely(cqe)) {
                WRITE_ONCE(cqe->user_data, req->user_data);
                WRITE_ONCE(cqe->res, res);
                WRITE_ONCE(cqe->flags, cflags);
+               /* update secondary cqe for zone-append */
+               if (req->flags & REQ_F_ZONE_APPEND) {
+                       WRITE_ONCE(cqe2->append_res,
+                               (u64)req->append_offset << SECTOR_SHIFT);
+                       WRITE_ONCE(cqe2->user_data, req->user_data);
+               }
  mutex_unlock(&ctx->uring_lock);
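
The body of io_get_two_cqring() is not shown above. The following is only a user-space model of what such a helper might do, assuming a power-of-two ring with free-running head/tail counters; struct cq_model, struct cqe_model and get_two_cqes() are hypothetical names, not kernel code. It also illustrates the wrap case Jens mentioned: the two slots are adjacent in ring order but need not be adjacent in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures. */
struct cqe_model {
	uint64_t user_data;
	union {
		struct { int32_t res; uint32_t flags; };
		uint64_t append_res;
	};
};

struct cq_model {
	struct cqe_model *cqes;
	uint32_t entries;       /* ring size, must be a power of two */
	uint32_t head;          /* consumer index, free-running */
	uint32_t tail;          /* producer index, free-running */
};

/*
 * Reserve two consecutive ring slots. Returns the first slot and stores
 * the second in *second; returns NULL (and sets *second to NULL) when
 * fewer than two slots are free, leaving the tail untouched.
 */
static struct cqe_model *get_two_cqes(struct cq_model *cq,
				      struct cqe_model **second)
{
	uint32_t mask = cq->entries - 1;
	uint32_t used = cq->tail - cq->head;
	struct cqe_model *first;

	assert((cq->entries & mask) == 0);
	if (cq->entries - used < 2) {
		*second = NULL;
		return NULL;
	}
	first = &cq->cqes[cq->tail & mask];
	*second = &cq->cqes[(cq->tail + 1) & mask];     /* may wrap to slot 0 */
	cq->tail += 2;
	return first;
}
```

With a 4-entry ring at head = 1, tail = 3, this model hands out slot 3 followed by slot 0, so the pair cannot be treated as one contiguous 32-byte entry.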


This seems to work fine in the kernel.
But the application will see a few differences, such as:

- When it submits N appends and decides to wait for all completions,
it needs to specify min_complete as 2*N (or at least 2N-1).
Two appends will produce 4 completion events, and if the application
decides to wait for both, it must specify 4 (or 3).

int io_uring_enter(unsigned int fd, unsigned int to_submit,
                   unsigned int min_complete, unsigned int flags,
                   sigset_t *sig);

- Completion-processing sequence for a mixed workload (a few reads + a few
appends on the same ring).
Currently there is a one-to-one relationship: the application looks at N
CQE entries and treats each as a distinct IO completion, so a simple for
loop does the work.
With the two-CQE scheme, the application has to pick out, from a batch of
completions, the ones for reads (one CQE) and for appends (two CQEs), so
the flow gets somewhat non-linear.
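
To make both points concrete, here is a user-space sketch of what the application side might look like. Everything in it is an assumption for illustration: the application tags append requests itself through a bit in user_data (APPEND_TAG is a made-up convention, nothing in the patch defines it), and cqe_model, completion, cqes_expected() and reap_completions() are hypothetical names rather than liburing or kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the proposed CQE layout. */
struct cqe_model {
	uint64_t user_data;
	union {
		struct { int32_t res; uint32_t flags; };
		uint64_t append_res;
	};
};

/* Assumed app-side convention: top bit of user_data marks an append. */
#define APPEND_TAG (1ULL << 63)

struct completion {
	uint64_t user_data;
	int32_t  res;
	bool     is_append;
	uint64_t append_off;    /* valid only when is_append is true */
};

/* CQEs to wait for: one per read, two per append. */
static unsigned int cqes_expected(unsigned int reads, unsigned int appends)
{
	return reads + 2 * appends;
}

/*
 * Walk ring positions [*head, tail), folding each append's two CQEs into
 * one completion. Stops early if an append's secondary CQE is not yet in
 * the range. Returns the number of completions written to out.
 */
static unsigned int reap_completions(const struct cqe_model *cqes,
				     uint32_t mask, uint32_t *head,
				     uint32_t tail, struct completion *out,
				     unsigned int max_out)
{
	unsigned int n = 0;
	uint32_t h = *head;

	assert(cqes != NULL && out != NULL);
	while (h != tail && n < max_out) {
		const struct cqe_model *c = &cqes[h & mask];
		struct completion *o = &out[n];

		o->user_data = c->user_data;
		o->res = c->res;
		o->is_append = (c->user_data & APPEND_TAG) != 0;
		o->append_off = 0;
		if (o->is_append) {
			if (tail - h < 2)
				break;  /* secondary CQE not visible yet */
			o->append_off = cqes[(h + 1) & mask].append_res;
			h += 2;
		} else {
			h += 1;
		}
		n++;
	}
	*head = h;
	return n;
}
```

The loop is no longer a plain one-CQE-per-iteration walk: the step size depends on the request type, and the application has to know that type from its own bookkeeping, since the secondary CQE has no res/flags to describe itself.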

Perhaps this is not too bad, but I felt it should be stated up front.

-- 
Kanchan Joshi