path: root/io_uring/register.c
Age  Commit message  Author
2025-11-26  io_uring/query: drop unused io_handle_query_entry() ctx arg  (Caleb Sander Mateos)
io_handle_query_entry() doesn't use its struct io_ring_ctx *ctx argument. So remove it from the function and its callers.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18  io_uring/register: use correct location for io_rings_layout  (Jens Axboe)
A previous commit consolidated the ring size etc. calculations into io_prepare_config(), but missed updating io_register_resize_rings() to use the calculated values. As a result, it ended up using uninitialized on-stack values, and hence either failed to validate the sizes correctly, or just failed the resize because the sizes were random. This caused failures in the liburing regression tests:

    [...]
    Running test resize-rings.t
    resize=-7 test_basic 3000 failed
    Test resize-rings.t failed with ret 1
    Running test resize-rings.t /dev/sda
    resize=-7 test_basic 3000 failed
    Test resize-rings.t failed with ret 1
    Running test resize-rings.t /dev/nvme1n1
    resize=-7 test_basic 3000 failed
    Test resize-rings.t failed with ret 1
    Running test resize-rings.t /dev/dm-0
    resize=-7 test_basic 3000 failed
    Test resize-rings.t failed with ret 1

because io_create_region() would return -E2BIG due to uninitialized reg->size values. Adjust the struct io_rings_layout rl pointer to point at the correct location, and remove the (now dead) on-stack __rl struct.

Fixes: eb76ff6a6829 ("io_uring: pre-calculate scq layout")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13  io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL  (Pavel Begunkov)
Implementing each new zcrx feature as a separate io_uring register opcode would be annoying and take a fair amount of boilerplate code. Introduce IORING_REGISTER_ZCRX_CTRL, which will multiplex such calls to zcrx. Note that there are no real users of the opcode in this patch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
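
As an illustration only (struct io_uring_zcrx_ctrl and its op field are assumptions here, not the actual uapi), such a multiplexer could take roughly this shape:

    static int io_zcrx_ctrl(struct io_ring_ctx *ctx, void __user *arg)
    {
        struct io_uring_zcrx_ctrl ctrl;    /* hypothetical uapi struct */

        if (copy_from_user(&ctrl, arg, sizeof(ctrl)))
            return -EFAULT;

        switch (ctrl.op) {
        /* future zcrx sub-commands get dispatched from here, so each
         * new feature doesn't need its own register opcode */
        default:
            return -EOPNOTSUPP;
        }
    }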
2025-11-13  io_uring: pre-calculate scq layout  (Pavel Begunkov)
Move the ring layout calculations into io_prepare_config(), so that more misconfiguration checking can be done earlier, before creating a ctx. It also deduplicates some code with ring resizing. As a bonus, it now initialises params->sq_off.array, which brings it closer to all the other user offset initialisation, and also applies that to ring resizing, which was previously missing it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13  io_uring: keep ring layout in a structure  (Pavel Begunkov)
Add a structure keeping the SQ/CQ sizes and offsets. For now it only records the data previously returned from rings_size() and the SQ size.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
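
For illustration, such a structure might look along these lines (field names assumed, not taken from the patch):

    struct io_rings_layout {
        size_t rings_size;       /* SQ/CQ ring memory, as rings_size() returned it */
        size_t sq_array_offset;  /* offset of the SQ index array, ditto */
        size_t sq_size;          /* size of the SQE array */
    };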
2025-11-13  io_uring: introduce struct io_ctx_config  (Pavel Begunkov)
There will be more information needed during ctx setup, and instead of passing a handful of pointers around, wrap them all into a new structure. Add a helper encapsulating all configuration checks and preparation, which is also reused for ring resizing. Note that this indirectly adds an io_uring_sanitise_params() check to ring resizing, which is a good thing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
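
Schematically (the exact field set is an assumption), the wrapper bundles what used to be loose pointers:

    struct io_ctx_config {
        struct io_uring_params *params;  /* user-requested setup */
        struct io_rings_layout layout;   /* pre-computed sizes/offsets */
    };

    /* one place for all configuration checks and preparation, shared
     * between ring creation and ring resizing */
    static int io_prepare_config(struct io_ctx_config *cfg);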
2025-11-13  io_uring: convert params to pointer in ring resize  (Pavel Begunkov)
The parameters in io_register_resize_rings() will be moved into another structure in a later patch. In preparation for that, convert the params variable to a pointer, but still store the data on the stack.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13  Merge branch 'io_uring-6.18' into for-6.19/io_uring  (Jens Axboe)
Merge 6.18-rc io_uring fixes, as certain coming changes depend on some of these.

* io_uring-6.18:
  io_uring/rsrc: don't use blk_rq_nr_phys_segments() as number of bvecs
  io_uring/query: return number of available queries
  io_uring/rw: ensure allocated iovec gets cleared for early failure
  io_uring: fix regbuf vector size truncation
  io_uring: fix types for region size calulation
  io_uring/zcrx: remove sync refill uapi
  io_uring: fix buffer auto-commit for multishot uring_cmd
  io_uring: correct __must_hold annotation in io_install_fixed_file
  io_uring zcrx: add MAINTAINERS entry
  io_uring: Fix code indentation error
  io_uring/sqpoll: be smarter on when to update the stime usage
  io_uring/sqpoll: switch away from getrusage() for CPU accounting
  io_uring: fix incorrect unlikely() usage in io_waitid_prep()

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-11  io_uring: pass sq entries in the params struct  (Pavel Begunkov)
There is no need to pass the user-requested number of SQ entries separately from the main parameter structure io_uring_params. Initialise it at the beginning and stop passing it in favour of struct io_uring_params::sq_entries.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-06  io_uring/memmap: refactor io_free_region() to take user_struct param  (David Wei)
Refactor io_free_region() to take user_struct directly, instead of accessing it from the ring ctx.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03  io_uring/zcrx: remove sync refill uapi  (Pavel Begunkov)
There is a better way to handle the problem IORING_REGISTER_ZCRX_REFILL solves. The uapi can also be slightly adjusted to accommodate future extensions. Remove the feature for now, it'll be reworked for the next release.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22  io_uring: add support for IORING_SETUP_SQE_MIXED  (Keith Busch)
Normal rings support 64b SQEs for posting submissions, while certain features require the ring to be configured with IORING_SETUP_SQE128, as they need to convey more information per submission. This, in turn, makes ALL the SQEs 128b in size. This is somewhat wasteful and inefficient, particularly when only certain SQEs need to be of the bigger variant.

This adds support for setting up a ring with mixed SQE sizes, using IORING_SETUP_SQE_MIXED. When set up in this mode, SQEs posted to the ring may be either 64b or 128b in size. If an SQE is 128b in size, its opcode will be set to a variant indicating that this is the case. Any other non-128b opcode will assume the SQ's default size.

SQEs on these types of mixed rings may also utilize NOP with skip success set. This can happen if the ring is one (small) SQE entry away from wrapping, and an attempt is made to get a 128b SQE. As SQEs must be contiguous in the SQ ring, a 128b SQE cannot wrap the ring. For this case, a single NOP SQE should be inserted with the SKIP_SUCCESS flag set. The kernel will process this as a normal NOP and without posting a CQE.

Signed-off-by: Keith Busch <kbusch@kernel.org>
[axboe: {} style fix and assign sqe before opcode read]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
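
On the application side, the wrap rule could be handled along these lines (slots_until_wrap() and claim_two_slots() are made-up helpers for illustration, not liburing API; IOSQE_CQE_SKIP_SUCCESS is the existing skip-success flag):

    static struct io_uring_sqe *get_sqe128(struct io_uring *ring)
    {
        /* a 128b SQE needs two contiguous 64b slots and cannot wrap the
         * SQ ring, so burn a lone final slot with a CQE-less NOP */
        if (slots_until_wrap(ring) == 1) {
            struct io_uring_sqe *nop = io_uring_get_sqe(ring);

            io_uring_prep_nop(nop);
            nop->flags |= IOSQE_CQE_SKIP_SUCCESS;  /* no CQE posted */
        }
        return claim_two_slots(ring);  /* two contiguous 64b slots */
    }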
2025-10-20  io_uring: only publish fully handled mem region  (Pavel Begunkov)
io_register_mem_region() can try to remove a region right after publishing it. This non-atomicity is annoying. Do it in two steps instead: create the memory first, and only publish it once the rest of the handling is done. Remove the now unused io_create_region_mmap_safe(), which was assumed to be a temporary solution from day one.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
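
The resulting shape, sketched with approximated signatures (finish_handling() stands in for the rest of the registration work, and isn't a real function):

    struct io_mapped_region region = {};
    int ret;

    ret = io_create_region(ctx, &region, reg);  /* step 1: create, unpublished */
    if (ret)
        return ret;

    ret = finish_handling(ctx, &region);        /* may still fail safely, */
    if (ret) {                                  /* nothing was published yet */
        io_free_region(ctx, &region);
        return ret;
    }

    scoped_guard(mutex, &ctx->mmap_lock)        /* step 2: publish for mmap */
        ctx->param_region = region;
    return 0;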
2025-10-20  io_uring: don't free never created regions  (Pavel Begunkov)
io_free_region() tolerates empty regions, but there is no reason for that either. If the first io_create_region() in io_register_resize_rings() fails, just return the error without attempting to clean it up.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-20  io_uring: remove extra args from io_register_free_rings  (Pavel Begunkov)
io_register_free_rings() doesn't use its "struct io_uring_params" parameter, remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-20  io_uring: use non-mmap-safe region helpers on resizing  (Pavel Begunkov)
io_create_region_mmap_safe() is only needed when the created region is exposed to userspace via mmap. io_register_resize_rings() creates regions locally on the stack, so the non-mmap-safe version of the helper is enough.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-15  io_uring: fix unexpected placement on same size resizing  (Pavel Begunkov)
There might be many reasons why a user is resizing a ring to the same size, e.g. moving to huge pages, or doing some memory compaction using IORING_SETUP_NO_MMAP. Don't bypass such a resize: the user would definitely be surprised to get 0 back while the rings weren't actually moved to a new place.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-15  io_uring: protect mem region deregistration  (Pavel Begunkov)
io_create_region_mmap_safe() protects publishing of a region against concurrent mmap calls, however we should also protect against them when removing a region. There is a gap in io_register_mem_region() where it safely publishes a region, but then, if copy_to_user() goes wrong, it unsafely frees the region.

Cc: stable@vger.kernel.org
Fixes: 087f997870a94 ("io_uring/memmap: implement mmap for regions")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16  io_uring/zcrx: allow synchronous buffer return  (Pavel Begunkov)
Returning buffers via a ring is performant and convenient, but it becomes a problem when/if the user misconfigured the ring size and it becomes full. Add a synchronous way to return buffers back to the page pool via a new register opcode. It's supposed to be a reliable slow path for refilling.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08  io_uring: don't include filetable.h in io_uring.h  (Caleb Sander Mateos)
io_uring/io_uring.h doesn't use anything declared in io_uring/filetable.h, so drop the unnecessary #include. Add filetable.h includes in .c files previously relying on the transitive include from io_uring.h.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-08  io_uring: Replace kzalloc() + copy_from_user() with memdup_user()  (Thorsten Blum)
Replace kzalloc() followed by copy_from_user() with memdup_user() to improve and simplify io_probe(). No functional changes intended.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
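
The transformation is the usual one, roughly:

    /* before */
    p = kzalloc(size, GFP_KERNEL);
    if (!p)
        return -ENOMEM;
    if (copy_from_user(p, arg, size)) {
        kfree(p);
        return -EFAULT;
    }

    /* after: one call, equivalent as long as the whole buffer is
     * either copied from userspace or overwritten afterwards */
    p = memdup_user(arg, size);
    if (IS_ERR(p))
        return PTR_ERR(p);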
2025-09-08  io_uring: introduce io_uring querying  (Pavel Begunkov)
There are many parameters users might want to query about io_uring, like the available request types or the ring sizes. This patch introduces an interface for such slow path queries. It was written with several requirements in mind:

- Can be used with or without an io_uring instance. Asking for supported setup flags before creating an instance, as well as querying info about an already created ring, are both valid use cases.
- Should be moderately fast. For example, users might use it to periodically retrieve ring attributes at runtime. As a consequence, it should be able to query multiple attributes in a single syscall.
- Backward and forward compatible.
- Should be reasonably easy to use.
- Should reduce the kernel code needed to introduce new query types.

It's implemented as a new registration opcode, IORING_REGISTER_QUERY. The user passes one or more query structures linked together, each represented by struct io_uring_query_hdr. The header stores common control fields needed for processing and points to query type specific information. The header contains:

- The query type.
- The result field, which on return contains the error code for the query.
- A pointer to the query type specific information.
- The size of the query structure. The kernel will only populate up to this size, which helps with backward compatibility. The kernel can also reduce the size, so if the current kernel is older than the interface the user tries to use, the user will only get the supported bits.
- A next_entry field used to chain multiple queries.

Apart from common registration syscall failures, it can only immediately return an error code when the headers are incorrect or any of the other addresses are invalid. That usually means the userspace doesn't use the API correctly and should be fixed. All query type specific errors are returned in the header's result field.

As an example, the patch adds a single query type for now, IO_URING_QUERY_OPCODES, which tells what register / request / etc. opcodes are supported, but there are concrete plans to extend it.

Note: there is a request probing interface via IORING_REGISTER_PROBE, but it's a mess. It requires the user to create a ring first, it only works for requests, and it requires dynamic allocations.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
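
In userspace, a single query might then look roughly like this (field names follow the description above but aren't verified against the final uapi):

    struct io_uring_query_opcodes ops = {};
    struct io_uring_query_hdr hdr = {
        .query_op   = IO_URING_QUERY_OPCODES,   /* what to ask about */
        .query_data = (__u64)(uintptr_t)&ops,   /* type-specific payload */
        .size       = sizeof(ops),              /* kernel fills up to this */
        .next_entry = 0,                        /* no further queries chained */
    };

    /* fd == -1: works without ever creating a ring */
    int ret = syscall(__NR_io_uring_register, -1, IORING_REGISTER_QUERY, &hdr, 1);
    if (ret == 0 && hdr.result == 0) {
        /* ops now describes the supported request/register opcodes */
    }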
2025-09-08  io_uring: add helper for *REGISTER_SEND_MSG_RING  (Pavel Begunkov)
Move handling of IORING_REGISTER_SEND_MSG_RING into a separate function in preparation for growing io_uring_register_blind().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-02  io_uring/register: drop redundant submitter_task check  (Caleb Sander Mateos)
For IORING_SETUP_SINGLE_ISSUER io_ring_ctx's, io_register_resize_rings() checks that the current task is the ctx's submitter_task. However, its caller __io_uring_register() already checks this. Drop the redundant check in io_register_resize_rings().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250902215108.1925105-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-27  io_uring: add support for IORING_SETUP_CQE_MIXED  (Jens Axboe)
Normal rings support 16b CQEs for posting completions, while certain features require the ring to be configured with IORING_SETUP_CQE32, as they need to convey more information per completion. This, in turn, makes ALL the CQEs 32b in size. This is somewhat wasteful and inefficient, particularly when only certain CQEs need to be of the bigger variant.

This adds support for setting up a ring with mixed CQE sizes, using IORING_SETUP_CQE_MIXED. When set up in this mode, CQEs posted to the ring may be either 16b or 32b in size. If a CQE is 32b in size, IORING_CQE_F_32 is set in the CQE flags to indicate that this is the case. If this flag isn't set, the CQE is the normal 16b variant.

CQEs on these types of mixed rings may also have IORING_CQE_F_SKIP set. This can happen if the ring is one (small) CQE entry away from wrapping, and an attempt is made to post a 32b CQE. As CQEs must be contiguous in the CQ ring, a 32b CQE cannot wrap the ring. For this case, a single dummy CQE is posted with the SKIP flag set. The application should simply ignore those.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
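
A reaping loop on such a ring then advances by CQE size; schematically (head/tail/mask handling abridged, handle_cqe() is an app-provided stand-in):

    /* cqes[] is indexed in 16b slots; a 32b CQE consumes two of them */
    while (head != tail) {
        struct io_uring_cqe *cqe = &cqes[head & cq_mask];

        if (!(cqe->flags & IORING_CQE_F_SKIP))  /* dummy filler near wrap */
            handle_cqe(cqe);

        head += (cqe->flags & IORING_CQE_F_32) ? 2 : 1;
    }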
2025-06-12  io_uring: consistently use rcu semantics with sqpoll thread  (Keith Busch)
The sqpoll thread is dereferenced with rcu read protection in one place, so it needs to be annotated as an __rcu type, and should consistently use rcu helpers for access and assignment to make sparse happy. Since most of the accesses occur under the sqd->lock, we can use rcu_dereference_protected() without declaring an rcu read section. Provide a simple helper to get the thread from a locked context.

Fixes: ac0b8b327a5677d ("io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250611205343.1821117-1-kbusch@meta.com
[axboe: fold in fix for register.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
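
The locked-context accessor described is essentially (a sketch, assuming the helper name):

    /* sqd->lock is held, so no rcu read section is needed */
    static struct task_struct *sqpoll_task_locked(struct io_sq_data *sqd)
    {
        return rcu_dereference_protected(sqd->thread,
                                         lockdep_is_held(&sqd->lock));
    }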
2025-02-17  io_uring/zcrx: add interface queue and refill queue  (David Wei)
Add a new object called an interface queue (ifq) that represents a net rx queue that has been configured for zero copy. Each ifq is registered using a new registration opcode IORING_REGISTER_ZCRX_IFQ.

The refill queue is allocated by the kernel and mapped by userspace using a new offset IORING_OFF_RQ_RING, in a similar fashion to the main SQ/CQ. It is used by userspace to return buffers that it is done with, which will then be re-used by the netdev again.

The main CQ ring is used to notify userspace of received data by using the upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each entry contains the offset + len to the data.

For now, each io_uring instance only has a single ifq.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Wei <dw@davidwei.uk>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20250215000947.789731-2-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-24  io_uring/register: use atomic_read/write for sq_flags migration  (Jens Axboe)
A previous commit changed all of the migration from the old to the new ring for resizing to use READ/WRITE_ONCE. However, ->sq_flags is an atomic_t, and while most archs won't complain about this, some will indeed flag it:

    io_uring/register.c:554:9: sparse: sparse: cast to non-scalar
    io_uring/register.c:554:9: sparse: sparse: cast from non-scalar

Just use atomic_set/atomic_read for handling this case.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501242000.A2sKqaCL-lkp@intel.com/
Fixes: 2c5aae129f42 ("io_uring/register: document io_register_resize_rings() shared mem usage")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
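
I.e., for the one atomic_t field the migration copy becomes (surrounding variable names assumed, not taken from the diff):

    /* before: sparse flags the casts, ->sq_flags is an atomic_t */
    WRITE_ONCE(new_rings->sq_flags, READ_ONCE(old_rings->sq_flags));

    /* after */
    atomic_set(&new_rings->sq_flags, atomic_read(&old_rings->sq_flags));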
2025-01-21  io_uring: clean up io_uring_register_get_file()  (Pavel Begunkov)
Make it always reference the returned file. It's safer, especially with unregistrations happening under it. And it makes the API cleaner, with no conditional clean-ups by the caller.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0d0b13a63e8edd6b5d360fc821dcdb035cb6b7e0.1736995897.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-20  Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull io_uring updates from Jens Axboe:
"Not a lot in terms of features this time around, mostly just cleanups and code consolidation:

- Support for PI meta data read/write via io_uring, with NVMe and SCSI covered
- Cleanup the per-op structure caching, making it consistent across various command types
- Consolidate the various user mapped features into a concept called regions, making the various users of that consistent
- Various cleanups and fixes"

* tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits)
  io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname
  io_uring: reuse io_should_terminate_tw() for cmds
  io_uring: Factor out a function to parse restrictions
  io_uring/rsrc: require cloned buffers to share accounting contexts
  io_uring: simplify the SQPOLL thread check when cancelling requests
  io_uring: expose read/write attribute capability
  io_uring/rw: don't gate retry on completion context
  io_uring/rw: handle -EAGAIN retry at IO completion time
  io_uring/rw: use io_rw_recycle() from cleanup path
  io_uring/rsrc: simplify the bvec iter count calculation
  io_uring: ensure io_queue_deferred() is out-of-line
  io_uring/rw: always clear ->bytes_done on io_async_rw setup
  io_uring/rw: use NULL for rw->free_iovec assigment
  io_uring/rw: don't mask in f_iocb_flags
  io_uring/msg_ring: Drop custom destructor
  io_uring: Move old async data allocation helper to header
  io_uring/rw: Allocate async data through helper
  io_uring/net: Allocate msghdr async data through helper
  io_uring/uring_cmd: Allocate async data through generic helper
  io_uring/poll: Allocate apoll with generic alloc_cache helper
  ...
2025-01-15  io_uring: Factor out a function to parse restrictions  (Josh Triplett)
Preparation for subsequent work on inherited restrictions.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9bac2b4d1b9b9ab41c55ea3816021be847f354df.1736932318.git.josh@joshtriplett.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-15  io_uring/register: cache old SQ/CQ head reading for copies  (Jens Axboe)
The SQ and CQ ring heads are read twice - once for verifying that they're within bounds, and once inside the loops copying SQE and CQE entries. This is technically incorrect, as the values could get modified between verifying them and using them in the copy loop. While this won't lead to anything truly nefarious, it may cause longer loop times for the copies than expected. Read the ring head values once, and use the verified value in the copy loops.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
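
Schematically, with illustrative names (copy_sqe() stands in for the actual entry copy):

    /* read once, verify the snapshot, then index with that same value */
    unsigned sq_head = READ_ONCE(old_rings->sq.head);
    unsigned sq_tail = READ_ONCE(old_rings->sq.tail);

    if (sq_tail - sq_head > old_sq_entries)
        return -EINVAL;

    while (sq_head != sq_tail) {
        /* no fresh read of the shared head inside the loop */
        copy_sqe(new_sq, old_sq, sq_head & old_sq_mask);
        sq_head++;
    }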
2025-01-15  io_uring/register: document io_register_resize_rings() shared mem usage  (Jens Axboe)
It can be a bit hard to tell which parts of io_register_resize_rings() are operating on shared memory, and which ones are not. And anything reading or writing to those regions should really use the read/write once primitives. Hence add those, ensuring sanity in how this memory is accessed, and helping document the shared nature of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-15  io_uring/register: use stable SQ/CQ ring data during resize  (Jens Axboe)
Normally the kernel would not expect an application to modify any of the data shared with the kernel during a resize operation, but of course the kernel cannot always assume good intent on behalf of the application.

As part of resizing the rings, existing SQEs and CQEs are copied over to the new storage. Resizing uses the masks in the newly allocated shared storage to index the arrays, however it's possible that malicious userspace could modify these after they have been sanity checked. Use the validated and locally stored CQ and SQ ring sizing for masking to ensure the values are both stable and valid.

Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23  io_uring: use region api for CQ  (Pavel Begunkov)
Convert the internal parts of the CQ/SQ array management to the region API.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/46fc3c801290d6b1ac16023d78f6b8e685c87fd6.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23  io_uring: use region api for SQ  (Pavel Begunkov)
Convert the internal parts of the SQ management to the region API.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1fb73ced6b835cb319ab0fe1dc0b2e982a9a5650.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23  io_uring: pass ctx to io_register_free_rings  (Pavel Begunkov)
A preparation patch, pass the context to io_register_free_rings.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c1865fd2b3d4db22d1a1aac7dd06ea22cb990834.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23  io_uring/memmap: implement mmap for regions  (Pavel Begunkov)
The patch implements mmap for the param region and enables the kernel allocation mode. Internally it uses a fixed mmap offset; however, the user has to use the offset returned in struct io_uring_region_desc::mmap_offset.

Note, mmap doesn't and can't take ->uring_lock; the region / ring lookup is protected by ->mmap_lock, and it's directly peeking at ctx->param_region. We can't protect io_create_region() with the mmap_lock as it'd deadlock, which is why io_create_region_mmap_safe() initialises the region for us in a temporary variable and then publishes it with the lock taken. It's intentionally decoupled from the main region helpers, and in the future we might want to have a list of active regions, which could then be protected by the ->mmap_lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0f1212bd6af7fb39b63514b34fae8948014221d1.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23  io_uring/memmap: implement kernel allocated regions  (Pavel Begunkov)
Allow the kernel to allocate memory for a region. That's the classical way SQ/CQ are allocated. It's not yet useful to user space as there is no way to mmap it, which is why it's explicitly disabled in io_register_mem_region().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7b8c40e6542546bbf93f4842a9a42a7373b81e0d.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23  io_uring: rename ->resize_lock  (Pavel Begunkov)
->resize_lock is used for resizing rings, but it's a good idea to reuse it in other cases as well. Rename it to mmap_lock, as it protects against races with mmap.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/68f705306f3ac4d2fb999eb80ea1615015ce9f7f.1732886067.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-19  io_uring/register: limit ring resizing to DEFER_TASKRUN  (Jens Axboe)
With DEFER_TASKRUN, we know the ring can't be both waited upon and resized at the same time. This is important for CQ resizing. Allowing SQ ring resizing is more trivial, but isn't the interesting use case. Hence limit ring resizing in general to DEFER_TASKRUN only for now.

This isn't a huge problem, as CQ ring resizing is generally the most useful on networking type of workloads where it can be hard to size the ring appropriately upfront, and those should be using DEFER_TASKRUN for better performance.

Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-18  io_uring: protect register tracing  (Pavel Begunkov)
Syz reports:

    BUG: KCSAN: data-race in __se_sys_io_uring_register / io_sqe_files_register

    read-write to 0xffff8881021940b8 of 4 bytes by task 5923 on cpu 1:
     io_sqe_files_register+0x2c4/0x3b0 io_uring/rsrc.c:713
     __io_uring_register io_uring/register.c:403 [inline]
     __do_sys_io_uring_register io_uring/register.c:611 [inline]
     __se_sys_io_uring_register+0x8d0/0x1280 io_uring/register.c:591
     __x64_sys_io_uring_register+0x55/0x70 io_uring/register.c:591
     x64_sys_call+0x202/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:428
     do_syscall_x64 arch/x86/entry/common.c:52 [inline]
     do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x77/0x7f

    read to 0xffff8881021940b8 of 4 bytes by task 5924 on cpu 0:
     __do_sys_io_uring_register io_uring/register.c:613 [inline]
     __se_sys_io_uring_register+0xe4a/0x1280 io_uring/register.c:591
     __x64_sys_io_uring_register+0x55/0x70 io_uring/register.c:591
     x64_sys_call+0x202/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:428
     do_syscall_x64 arch/x86/entry/common.c:52 [inline]
     do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
     entry_SYSCALL_64_after_hwframe+0x77/0x7f

This should be due to reading the table size after unlock. We don't care much, as the value is only read to print it in a trace event, but we might as well do it under the lock.

Reported-by: syzbot+5a486fef3de40e0d8c76@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8233af2886a37b57f79e444e3db88fcfda1817ac.1731942203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-15  io_uring: restore back registered wait arguments  (Pavel Begunkov)
Now that we've got a more generic region registration API, place IORING_ENTER_EXT_ARG_REG on top of it and re-enable it.

First, the user has to register a region with the IORING_MEM_REGION_REG_WAIT_ARG flag set. It can only be done for a ring in a disabled state, aka IORING_SETUP_R_DISABLED, to avoid races with already running waiters. With that we should have stable constant values for ctx->cq_wait_{size,arg} in io_get_ext_arg_reg() and hence no READ_ONCE required.

The other API difference is that we're now passing byte offsets instead of indexes. The user _must_ align all offsets / pointers to the native word size; failing to do so might, but does not necessarily have to, lead to a failure, usually returned as -EFAULT. liburing will be hiding these details from users.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/81822c1b4ffbe8ad391b4f9ad1564def0d26d990.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-15  io_uring: add memory region registration  (Pavel Begunkov)
Regions will serve multiple purposes. First, with them we can decouple ring/etc. object creation from the registration / mapping of the memory they will be placed in. We already have hacks that allow putting both SQ and CQ into the same huge page; in the future we should be able to do:

    region = create_region(io_ring);
    create_pbuf_ring(io_uring, region, offset=0);
    create_pbuf_ring(io_uring, region, offset=N);

The second use case is efficiently passing parameters. The following patch re-enables IORING_ENTER_EXT_ARG_REG on top of regions, which optimises wait arguments. It'll also be useful for request arguments replacing iovecs, msghdr, etc. pointers. Eventually it would also be handy for BPF as well, if that comes to fruition.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0798cf3a14fad19cfc96fc9feca5f3e11481691d.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-15  io_uring: temporarily disable registered waits  (Pavel Begunkov)
Disable wait argument registration as it'll be replaced with a more generic feature. We'll still need IORING_ENTER_EXT_ARG_REG parsing in a few commits, so leave it be.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/70b1d1d218c41ba77a76d1789c8641dab0b0563e.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-02  io_uring/rsrc: unify file and buffer resource tables  (Jens Axboe)
For files, there's nr_user_files/file_table/file_data, and buffers have nr_user_bufs/user_bufs/buf_data. There's no reason why file_table and file_data can't be the same thing, and ditto for the buffer side. That gets rid of more io_ring_ctx state that's in two spots rather than just being in one spot, as it should be. Put all the registered file data in one location, and ditto on the buffer front.

This also avoids having both io_rsrc_data->nodes being an allocated array, and ->user_bufs[] or ->file_table.nodes. There's no reason to have this information duplicated. Keep it in one spot, io_rsrc_data, along with how many resources are available.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring: add support for fixed wait regions  (Jens Axboe)
Generally applications have one or just a few ways of waiting, yet they pass in a struct io_uring_getevents_arg every time. This needs to get copied and, in turn, the timeout value needs to get copied. Rather than do this for every invocation, allow the application to register a fixed set of wait regions that can simply be indexed when asking the kernel to wait on events.

At ring setup time, the application can register a number of these wait regions and initialize region/index 0 upfront:

    struct io_uring_reg_wait *reg;

    reg = io_uring_setup_reg_wait(ring, nr_regions, &ret);

    /* set timeout and mark as set, sigmask/sigmask_sz as needed */
    reg->ts.tv_sec = 0;
    reg->ts.tv_nsec = 100000;
    reg->flags = IORING_REG_WAIT_TS;

where nr_regions >= 1 && nr_regions <= PAGE_SIZE / sizeof(*reg). The above initializes index 0, but 63 other regions can be initialized, if needed.

Now, instead of doing:

    struct __kernel_timespec timeout = { .tv_nsec = 100000, };

    io_uring_submit_and_wait_timeout(ring, &cqe, nr, &timeout, NULL);

to wait for events for each submit_and_wait, or just wait, operation, it can just reference the above region at offset 0 and do:

    io_uring_submit_and_wait_reg(ring, &cqe, nr, 0);

to achieve the same goal of waiting 100usec without needing to copy both struct io_uring_getevents_arg (24b) and struct __kernel_timespec (16b) for each invocation. Struct io_uring_reg_wait looks as follows:

    struct io_uring_reg_wait {
        struct __kernel_timespec ts;
        __u32 min_wait_usec;
        __u32 flags;
        __u64 sigmask;
        __u32 sigmask_sz;
        __u32 pad[3];
        __u64 pad2[2];
    };

embedding the timeout itself in the region, rather than passing it as a pointer as well. Note that the signal mask is still passed as a pointer, both for compatibility reasons, but also because there doesn't seem to be a lot of high frequency wait scenarios that involve setting and resetting the signal mask for each wait.

The application is free to modify any region before a wait call, or it can keep multiple regions with different settings to avoid needing to modify the same one for wait calls. Up to a page size of regions is mapped by default, allowing PAGE_SIZE / 64 available regions for use. The registered region must fit within a page. On a 4kb page size system, that allows for 64 wait regions if a full page is used, as the size of struct io_uring_reg_wait is 64b. The region registered must be aligned to io_uring_reg_wait in size. It's valid to register less than 64 entries.

In network performance testing with zero-copy, this reduced the time spent waiting on the TX side from 3.12% to 0.3%, and the RX side from 4.4% to 0.3%.

Wait regions are fixed for the lifetime of the ring - once registered, they are persistent until the ring is torn down. The regions support minimum wait timeout as well as the regular waits.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29  io_uring/register: add IORING_REGISTER_RESIZE_RINGS  (Jens Axboe)
Once a ring has been created, the sizes of the CQ and SQ rings are fixed. Usually this isn't a problem on the SQ ring side, as it merely controls the available number of requests that can be submitted in a single system call, and there's rarely a need to change that.

For the CQ ring, it's a different story. For most efficient use of io_uring, it's important that the CQ ring never overflows. This means that applications must size it for the worst case scenario, which can be wasteful.

Add IORING_REGISTER_RESIZE_RINGS, which allows an application to resize the existing rings. It takes a struct io_uring_params argument, the same one which is used to setup the ring initially, and resizes the rings according to the sizes given. Certain properties are always inherited from the original ring setup, like SQE128/CQE32 and other setup options. The implementation only allows flags associated with how the CQ ring is sized and clamped.

Existing unconsumed SQE and CQE entries are copied as part of the process. If either the resized SQ or CQ destination ring cannot hold the entries already present in the source rings, then the operation is failed with -EOVERFLOW. Any register op holds ->uring_lock, which prevents new submissions, and the internal mapping holds the completion lock as well across moving CQ ring state.

To prevent races between mmap and ring resizing, add a mutex that's solely used to serialize ring resize and mmap. mmap_sem can't be used here, as a fork'ed process may be doing mmaps on the ring as well. The ctx->resize_lock is held across mmap operations, and the resize will grab it before swapping out the already mapped new data.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
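
Usage mirrors ring setup. A sketch of growing the CQ ring (liburing's wrapper name is assumed here; the raw register call is per the text above):

    struct io_uring_params p = {
        .flags      = IORING_SETUP_CQSIZE, /* CQ sizing flag is allowed */
        .sq_entries = 128,
        .cq_entries = 4096,                /* the interesting knob */
    };

    /* raw form: io_uring_register(ring_fd, IORING_REGISTER_RESIZE_RINGS, &p, 1) */
    int ret = io_uring_resize_rings(&ring, &p);
    if (ret == -EOVERFLOW) {
        /* unconsumed SQEs/CQEs don't fit in the requested sizes */
    }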
2024-10-29  io_uring/msg_ring: add support for sending a sync message  (Jens Axboe)
Normally MSG_RING requires both a source and a destination ring. But some users don't always have a ring available to send a message from, yet they still need to notify a target ring.

Add support for using io_uring_register(2) without having a source ring, using a file descriptor of -1 for that. Internally those are called blind registration opcodes. Implement IORING_REGISTER_SEND_MSG_RING as a blind opcode, which simply takes an sqe that the application can put on the stack and use the normal liburing helpers to initialize. Then the app can call:

    io_uring_register(-1, IORING_REGISTER_SEND_MSG_RING, &sqe, 1);

and get the same behavior in terms of the target, where a CQE is posted with the details given in the sqe.

For now this takes a single sqe pointer argument, and hence arg must be set to that, and nr_args must be 1. Could easily be extended to take an array of sqes, but for now let's keep it simple.

Link: https://lore.kernel.org/r/20240924115932.116167-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-16  io_uring: clean up a type in io_uring_register_get_file()  (Dan Carpenter)
Originally "fd" was unsigned int, but it was changed to int when we pulled this code into a separate function in commit 0b6d253e084a ("io_uring/register: provide helper to get io_ring_ctx from 'fd'"). This doesn't really cause a runtime problem because the call to array_index_nospec() will clamp negative fds to 0 and nothing else uses the negative values.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/6f6cb630-079f-4fdf-bf95-1082e0a3fc6e@stanley.mountain
Signed-off-by: Jens Axboe <axboe@kernel.dk>