linux-stable-mirror

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2026-06-21 15:43:21 +02:00

Author	SHA1	Message	Date
Bryam Vargas	2dc0bfd2fe	partitions: aix: bound the pp_count scan to the ppe array aix_partition() reads the physical volume descriptor into a fixed-size struct pvd and then scans its physical-partition-extent array: int numpps = be16_to_cpu(pvd->pp_count); ... for (i = 0; i < numpps; i += 1) { struct ppe *p = pvd->ppe + i; ... lp_ix = be16_to_cpu(p->lp_ix); pvd points at a single kmalloc()'d struct pvd whose ppe[] member holds a fixed ARRAY_SIZE(pvd->ppe) (1016) entries, but the loop runs up to the on-disk pp_count. pp_count is an unvalidated __be16 read straight from the descriptor, so a crafted AIX image with pp_count larger than 1016 drives the loop to read pvd->ppe[i] past the end of the allocation (up to 65535 entries, ~2 MB out of bounds). The partition scan runs without mounting anything, when a block device with a crafted AIX/IBM partition table appears (an attacker-supplied image attached with losetup -P, or a device auto-scanned by udev), via msdos_partition() -> aix_partition(). Clamp the scan to the number of entries the ppe[] array can hold. Fixes: `6ceea22bbb` ("partitions: add aix lvm partition support files") Cc: stable@vger.kernel.org Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me> Acked-by: Philippe De Muyter <phdm@macqel.be> Link: https://patch.msgid.link/20260607064137.302574-1-hexlabsecurity@proton.me Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-08 07:41:21 -06:00
Bart Van Assche	5f0777166e	block: Enable lock context analysis Now that all block/*.c files have been annotated, enable lock context analysis for all these source files. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/e248ca3aeead238bbc489cf3afdafcbff9e41faf.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	f10b2de2af	block/mq-deadline: Make the lock context annotations compatible with Clang While sparse ignores the __acquires() and __releases() arguments, Clang verifies these. Make the arguments of __acquires() and __releases() acceptable for Clang. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/3b6e336ced91e27213608ffce205ccd24f4ba285.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	b4591b9152	block/Kyber: Make the lock context annotations compatible with Clang While sparse ignores the __acquires() and __releases() arguments, Clang verifies these. Make the arguments of __acquires() and __releases() acceptable for Clang. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/91cb8c790fc8b26b8aa742569fbf8c2c1d099dac.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	131f14125a	block/blk-mq-debugfs: Improve lock context annotations Make the existing lock context annotations compatible with Clang. Add the lock context annotations that are missing. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/f58fe220ff98f9dfddfed4573f40005c773b7fb7.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	a255026594	block/blk-iocost: Inline iocg_lock() and iocg_unlock() Both iocg_lock() and iocg_unlock() use conditional locking. Fold these functions into their callers such that unlocking becomes unconditional. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/f8c9867788957d2e40a32e23c6d9b866e480ad9d.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	1ff85a3879	block/blk-iocost: Split ioc_rqos_throttle() Prepare for inlining iocg_lock() and iocg_unlock() by moving the code between these two calls into a new function. No functionality has been changed. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/a6d3ed953cef6669d23a80923bf46600733cbdae.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	73bb2480e3	block/crypto: Annotate the crypto functions Add the lock context annotations required for Clang's thread-safety analysis. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Cc: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/297b40e43a7f9b7d20e91a6c44b41a69d01f5c63.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	998cda78d4	block/cgroup: Inline blkg_conf_{open,close}_bdev_frozen() The blkg_conf_open_bdev_frozen() calling convention is not compatible with lock context annotations. Fold both blkg_conf_open_bdev_frozen() and blkg_conf_close_bdev_frozen() into their only caller. This patch prepares for enabling lock context analysis. The type of 'memflags' has been changed from unsigned long into unsigned int to match the type of current->flags. See also <linux/sched.h>. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/05661d1555decc6dd5389174ba448d803b72ed9a.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	6a7717a2df	block/blk-iocost: Combine two error paths in ioc_qos_write() Reduce code duplication by combining two error paths. No functionality has been changed. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/80d4fc1ecd5eaf187c0a31c63a1033a7326d4c7e.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	9865e41664	block/cgroup: Improve lock context annotations Add lock context annotations where these are missing. Move the blkg_conf_prep() annotation into block/blk-cgroup.h to make it visible to all blkg_conf_prep() callers. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/58ddd6e2b960bdfa03d0007984386bc0ba351391.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	c574c3cc36	block/cgroup: Split blkg_conf_exit() Split blkg_conf_exit() into blkg_conf_unprep() and blkg_conf_close_bdev() because blkg_conf_exit() is not compatible with the Clang thread-safety annotations. Remove blkg_conf_exit(). Rename blkg_conf_exit_frozen() into blkg_conf_close_bdev_frozen(). Add thread-safety annotations to the new functions. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/c1ec1f1c4b675bc5f187f77b3e6436234c6b244c.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	ea4f575e72	block/cgroup: Split blkg_conf_prep() Move the blkg_conf_open_bdev() call out of blkg_conf_prep() to make it possible to add lock context annotations to blkg_conf_prep(). Change an if-statement in blkg_conf_open_bdev() into a WARN_ON_ONCE() call. Export blkg_conf_open_bdev() because it is called by the BFQ I/O scheduler and the BFQ I/O scheduler may be built as a kernel module. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/e6ea0387f413217c8561a0ca54ce7b846aa5c7c5.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	3033c86fa1	block/bdev: Annotate the blk_holder_ops callback functions The four callback functions in blk_holder_ops all release the bd_holder_lock. Annotate these functions accordingly. Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/be51cf81110f691ebd5868ac2f15ceb847805bc8.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Bart Van Assche	08d912bc44	block: Annotate the queue limits functions Let the thread-safety checker verify whether every start of a queue limits update is followed by a call to a function that finishes a queue limits update. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/8f71062b6d0fcf2b80bc8cda701c453224755439.1780682325.git.bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 13:41:11 -06:00
Marco Crivellari	7e712f292e	block: Add WQ_PERCPU to alloc_workqueue users This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in: commit `128ea9f6cc` ("workqueue: Add system_percpu_wq and system_dfl_wq") commit `930c2ea566` ("workqueue: Add new WQ_PERCPU flag") The refactoring is going to alter the default behavior of alloc_workqueue() to be unbound by default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. For more details see the Link tag below. In order to keep alloc_workqueue() behavior identical, explicitly request WQ_PERCPU. Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://patch.msgid.link/20260604105347.168322-1-marco.crivellari@suse.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-05 11:21:39 -06:00
Jens Axboe	ed60c09f29	Merge tag 'nvme-7.2-2026-06-04' of git://git.infradead.org/nvme into for-7.2/block Pull NVMe updates from Keith: "- Per-controller timeouts - Multipath telemetry - Namespace format validation - Various other fixes" * tag 'nvme-7.2-2026-06-04' of git://git.infradead.org/nvme: (34 commits) nvme: export controller reconnect event count via sysfs nvme: export controller reset event count via sysfs nvme: export I/O failure count when no path is available via sysfs nvme: export I/O requeue count when no path is usable via sysfs nvme: export command error counters via sysfs nvme: export multipath failover count via sysfs nvme: export command retry count via sysfs nvme: add diag attribute group under sysfs nvme-tcp: lockdep: use dynamic lockdep keys per socket instance nvme-tcp: move nvme_tcp_reclassify_socket() nvme: validate FDP configuration descriptor sizes nvmet-auth: validate reply message payload bounds against transfer length nvme: refresh multipath head zoned limits from path limits nvme: fix FDP fdpcidx bounds check nvme-tcp: Use WQ_PERCPU explicitly if wq_unbound is false. nvmet: fix pre-auth out-of-bounds heap read in Discovery Get Log Page nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks nvme-multipath: require exact iopolicy names for module parameter nvme-multipath: pass NS head to nvme_mpath_revalidate_paths() nvme-pci: fix out-of-bounds access in nvme_setup_descriptor_pools ...	2026-06-05 05:18:58 -06:00
Nilay Shroff	3c8c284dfc	nvme: export controller reconnect event count via sysfs When an NVMe-oF link goes down, the driver attempts to recover the connection by repeatedly reconnecting to the remote controller at configured intervals. A maximum number of reconnect attempts is also configured, after which recovery stops and the controller is removed if the connection cannot be re-established. The driver maintains a counter, nr_reconnects, which is incremented on each reconnect attempt. However, if the reconnect is successful, this counter is reset to zero. Moreover, currently, this counter is only reported via kernel log messages and is not exposed to userspace. Since dmesg is a circular buffer, this information may be lost over time. So introduce a new accumulator which accumulates nr_reconnect attempts and also expose this accumulator per-fabric ctrl via a new sysfs attribute reconnect_count, under diag attribute group to provide persistent visibility into the number of reconnect attempts made by the host. This information can help users diagnose unstable links or connectivity issues. Furthermore, this sysfs attribute is also writable so user may reset it to zero, if needed. The reconnect_count can also be consumed by monitoring tools such as nvme-top to improve controller-level observability. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-04 01:57:40 -07:00
Nilay Shroff	29aafaaf58	nvme: export controller reset event count via sysfs The NVMe controller transitions into the RESETTING state during error recovery, link instability, firmware activation, or when a reset is explicitly triggered by the user. Expose a per-ctrl sysfs attribute reset_count, under diag attribute group to provide visibility into these RESETTING state transitions. Observing the frequency of reset events can help users identify issues such as PCIe errors or unstable fabric links. This counter is also writable thus allowing user to reset its value, if needed. This counter can also be consumed by monitoring tools such as nvme-top to improve controller-level observability. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-04 01:57:36 -07:00
Nilay Shroff	a8e434cb03	nvme: export I/O failure count when no path is available via sysfs When I/O is submitted to the NVMe namespace head and no available path can handle the request, the driver fails the I/O immediately. Currently, such failures are only reported via kernel log messages, which may be lost over time since dmesg is a circular buffer. Add a new ns-head sysfs counter io_fail_no_available_path_count, under diag attribute group to expose the number of I/Os that failed due to the absence of an available path. This provides persistent visibility into path-related I/O failures and can help users diagnose the cause of I/O errors. This counter is also writable and so user may reset its value, if needed. This counter can also be consumed by monitoring tools such as nvme-top. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-04 01:57:32 -07:00
Nilay Shroff	76b5e1591e	nvme: export I/O requeue count when no path is usable via sysfs When the NVMe namespace head determines that there is no currently available path to handle I/O (for example, while a controller is resetting/connecting or due to a transient link failure), incoming I/Os are added to the requeue list. Currently, there is no visibility into how many I/Os have been requeued in this situation. Add a new ns-head sysfs counter io_requeue_no_usable_path_count, under diag attribute group to expose the number of I/Os that were requeued due to the absence of an available path. This counter is also writable thus allowing user to reset it, if needed. This statistic can help users understand I/O slowdowns or stalls caused by temporary path unavailability, and can be consumed by monitoring tools such as nvme-top for real-time observability. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-04 01:57:28 -07:00
Nilay Shroff	30ab37a128	nvme: export command error counters via sysfs When an NVMe command completes with an error status, the driver logs the error to the kernel log. However, these messages may be lost or overwritten over time since dmesg is a circular buffer. Expose per-path and ctrl sysfs attribute command_error_count, under diag attribute group to provide persistent visibility into error occurrences. This allows users to observe the total number of commands that have failed on a given path over time, which can be useful for diagnosing path health and stability. This attribute is both readable and writable thus allowing user to reset these counters. These counters can also be consumed by observability tools such as nvme-top to provide additional insight into NVMe error behavior. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-04 01:57:25 -07:00
Nilay Shroff	66ee95b3d4	nvme: export multipath failover count via sysfs When an NVMe command completes with a path-specific error, the NVMe driver may retry the command on an alternate controller or path if one is available. These failover events indicate that I/O was redirected away from the original path. Currently, the number of times requests are failed over to another available path is not visible to userspace. Exposing this information can be useful for diagnosing path health and stability. Export per-path sysfs attribute "multipath_failover_count" under diag attribute group. This attribute is both readable and writable and thus allowing user to reset the counter. This counter can be consumed by monitoring tools such as nvme-top to help identify paths that consistently trigger failovers under load. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-04 01:57:19 -07:00
Nilay Shroff	ab5af2903b	nvme: export command retry count via sysfs When Advanced Command Retry Enable (ACRE) is configured, a controller may interrupt command execution and return a completion status indicating command interrupted with the DNR bit cleared. In this case, the driver retries the command based on the Command Retry Delay (CRD) value provided in the completion status. Currently, these command retries are handled entirely within the NVMe driver and are not visible to userspace. As a result, there is no observability into retry behavior, which can be a useful diagnostic signal. Expose a per-namespace sysfs attribute command_retries_count, under diag attribute group to provide visibility into retry activity. This information can help identify controller-side congestion under load and enables comparison across paths in multipath setups (for example, detecting cases where one path experiences significantly more retries than another under identical workloads). This exported metric is intended for diagnostics and monitoring tools such as nvme-top, and does not change command retry behavior. A new sysfs attribute named "command_retries_count" is added for this purpose. This attribute is both readable as well as writable. So user could reset this counter if needed. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-04 01:57:13 -07:00
Nilay Shroff	37afebc79a	nvme: add diag attribute group under sysfs Add a new diag attribute group under: /sys/class/nvme/<ctrl>/ /sys/block/<nvme-path-dev>/ /sys/block/<ns-head-dev>/ This new sysfs attribute group will be used to organize NVMe diagnostic and telemetry-related counters under it. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-04 01:57:06 -07:00
Shin'ichiro Kawasaki	19bdb70c77	nvme-tcp: lockdep: use dynamic lockdep keys per socket instance When NVMe-TCP controller setup and teardown are repeated with lockdep enabled, lockdep reports false positives WARN for the following locks: 1) &q->elevator_lock : IO scheduler change context 2) &q->q_usage_counter(io) : SCSI disk probe context 3) fs_reclaim : CPU hotplug bring-up context 4) cpu_hotplug_lock : socket establishment context 5) sk_lock-AF_INET-NVME : MQ sched dispatch context for the socket 6) set->srcu : NVMe controller delete context The lockdep WARN was observed by running blktests test case nvme/005 for tcp transport on v7.1-rc1 kernel with a patch. Refer to the Link tag for the details of the WARN. This is a false positive because lockdep confuses lock 4) (socket establishment) with lock 5) (socket in use) for different socket instances. The locks belong to different sockets, but lockdep treats them as the same due to shared static lockdep keys. Fix this by using dynamically allocated lockdep keys per socket instance instead of static keys nvme_tcp_sk_key[] and nvme_tcp_slock_key[]. Add nvme_tcp_sk_key and nvme_tcp_slock_key fields to struct nvme_tcp_queue and pass them to sock_lock_init_class_and_name() for proper lockdep tracking. Change the argument of nvme_tcp_reclassify_socket() from 'struct socket *' to 'struct nvme_tcp_queue *' to pass both the socket and the keys. Add CONFIG_DEBUG_LOCK_ALLOC guards to nvme_tcp_alloc_queue() and nvme_tcp_free_queue() to register and unregister the dynamic keys. Additionally, move nvme_tcp_reclassify_socket() inside these guards since it's only needed when lockdep is enabled. Link: https://lore.kernel.org/linux-nvme/afB5syZbUrppgsDQ@shinmob/ Suggested-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-04 01:19:08 -07:00
Shin'ichiro Kawasaki	2caaa52c1a	nvme-tcp: move nvme_tcp_reclassify_socket() Move nvme_tcp_reclassify_socket() in tcp.c after the struct nvme_tcp_queue definition. This is preparation for adding a reference to struct nvme_tcp_queue in the function, which would otherwise cause a compile failure due to the struct being defined after the function. Move the entire CONFIG_DEBUG_LOCK_ALLOC block along with the function to maintain the code organization. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-03 02:43:54 -07:00
liuxixin	0ef4daa653	nvme: validate FDP configuration descriptor sizes Validate descriptor sizes while walking the FDP configurations log so dsze == 0 or a descriptor past the log end cannot cause unbounded iteration or reads past the buffer. Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: liuxixin <gliuxen@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-03 02:42:07 -07:00
Tianchu Chen	3a413ece25	nvmet-auth: validate reply message payload bounds against transfer length nvmet_auth_reply() accesses the variable-length rval[] array using attacker-controlled hl (hash length) and dhvlen (DH value length) fields without verifying they fit within the allocated buffer of tl bytes. A malicious NVMe-oF initiator can craft a DHCHAP_REPLY message with a small transfer length but large hl/dhvlen values, causing out-of-bounds heap reads when the target processes the DH public key (rval + 2*hl) or performs the host response memcmp. With DH authentication configured, the OOB pointer is passed directly to sg_init_one() and read by crypto_kpp_compute_shared_secret(), reaching up to 526 bytes past the buffer. This is exploitable pre-authentication. Add bounds validation ensuring sizeof(data) + 2*hl + dhvlen <= tl before any access to the variable-length fields. Discovered by Atuin - Automated Vulnerability Discovery Engine. Fixes: `db1312dd95` ("nvmet: implement basic In-Band Authentication") Cc: stable@vger.kernel.org Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Tianchu Chen <flynnnchen@tencent.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-03 02:40:33 -07:00
Thorsten Blum	3f1eccd372	n64cart: use strscpy in n64cart_probe strcpy() has been deprecated [1] because it performs no bounds checking on the destination buffer, which can lead to buffer overflows. While the current code works correctly, replace strcpy() with the safer strscpy() to follow secure coding best practices. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20260517172617.3954-2-thorsten.blum@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-02 17:43:32 -06:00
Thorsten Blum	aa528cd12c	block/partitions/acorn: use min in {riscix,linux}_partition Use min() to replace the open-coded implementations and to simplify riscix_partition() and linux_partition(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20260602160757.973736-3-thorsten.blum@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-02 11:14:52 -06:00
Yu Kuai	6636e16e60	block, bfq: release cgroup stats with bfq_group BFQ cgroup stats contain percpu counters embedded in struct bfq_group, but the old free path destroys them from bfq_pd_free(), which is tied to blkg policy-data teardown. That is not the same lifetime as struct bfq_group. BFQ pins bfq_group while bfq_queue entities refer to it, so bfq_pd_free() can drop the policy-data reference while other bfq_group references still exist. The following blkcg change also defers policy-data release through RCU and leaves BFQ to run the final bfqg_put() from an RCU callback. For that conversion, stats teardown must belong to the last bfq_group put, not to policy-data teardown. Move stats teardown to bfqg_put() so the embedded counters are destroyed exactly when the last bfq_group reference is released, before kfree(bfqg). Without this preparatory change, the RCU-delayed policy-data free conversion reproduced the following KASAN report: BUG: KASAN: slab-use-after-free in percpu_counter_destroy_many+0xf1/0x2e0 Write of size 8 at addr ffff88811d9409e0 by task test_blkcg/535 CPU: 0 UID: 0 PID: 535 Comm: test_blkcg Not tainted 7.1.0-rc2-g1e14adca0199 #1 PREEMPT ea13f83d4b74a12510d20db4a7d9a0fe8275f05c Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x54/0x70 print_address_description+0x77/0x200 ? percpu_counter_destroy_many+0xf1/0x2e0 print_report+0x64/0x70 kasan_report+0x118/0x150 ? percpu_counter_destroy_many+0xf1/0x2e0 percpu_counter_destroy_many+0xf1/0x2e0 __mmdrop+0x1d8/0x350 finish_task_switch+0x3f5/0x570 __schedule+0xe8e/0x18a0 schedule+0xfe/0x1c0 schedule_timeout+0x7f/0x1d0 __wait_for_common+0x26c/0x3f0 wait_for_completion_state+0x21/0x40 call_usermodehelper_exec+0x271/0x2c0 __request_module+0x296/0x410 elv_iosched_store+0x1bc/0x2c0 queue_attr_store+0x152/0x1c0 kernfs_fop_write_iter+0x1d7/0x280 vfs_write+0x580/0x630 ksys_write+0xec/0x190 do_syscall_64+0x156/0x490 entry_SYSCALL_64_after_hwframe+0x77/0x7f Allocated by task 535: kasan_save_track+0x3e/0x80 __kasan_kmalloc+0x72/0x90 bfq_pd_alloc+0x60/0x100 [bfq] blkg_create+0x3bb/0xbe0 blkg_lookup_create+0x3a2/0x460 blkg_conf_start+0x24a/0x2d0 bfq_io_set_weight+0x17f/0x430 [bfq] cgroup_file_write+0x1c5/0x4b0 kernfs_fop_write_iter+0x1d7/0x280 vfs_write+0x580/0x630 ksys_write+0xec/0x190 do_syscall_64+0x156/0x490 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 0: kasan_save_track+0x3e/0x80 kasan_save_free_info+0x46/0x50 __kasan_slab_free+0x3a/0x60 kfree+0x14e/0x4f0 rcu_core+0x6f3/0xcd0 handle_softirqs+0x1a0/0x550 __irq_exit_rcu+0x8c/0x150 irq_exit_rcu+0xe/0x20 sysvec_apic_timer_interrupt+0x6e/0x80 asm_sysvec_apic_timer_interrupt+0x1a/0x20 Last potentially related work creation: kasan_save_stack+0x3e/0x60 kasan_record_aux_stack+0x99/0xb0 call_rcu+0x55/0x5c0 blkg_free_workfn+0x130/0x220 process_scheduled_works+0x655/0xb60 worker_thread+0x446/0x600 kthread+0x1f4/0x230 ret_from_fork+0x259/0x420 ret_from_fork_asm+0x1a/0x30 Signed-off-by: Yu Kuai <yukuai@fygo.io> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260601061502.899552-1-yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-02 07:24:58 -06:00
Yao Sang	59c0517123	nvme: refresh multipath head zoned limits from path limits queue_limits_stack_bdev() updates the multipath head limits from the path queue, but it does not propagate max_open_zones or max_active_zones. As a result, a zoned multipath namespace head can keep stale 0/0 values even after a ready path reports finite zoned resource limits. When refreshing the head limits in nvme_update_ns_info(), stack the zoned resource limits directly after stacking the path queue limits. Use min_not_zero() so the block layer's 0 value keeps its "no limit" meaning while finite limits are combined conservatively. This avoids advertising "no limit" on the multipath head while keeping the zoned-limit handling local to the NVMe multipath update path. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yao Sang <sangyao@kylinos.cn> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-02 05:30:59 -07:00
liuxixin	0967074f68	nvme: fix FDP fdpcidx bounds check The fdpcidx bounds check sets n = NUMFDPC + 1 but used > instead of >=, incorrectly accepting fdp_idx when it equals n (i.e. NUMFDPC + 1). Fixes: `30b5f20bb2` ("nvme: register fdp parameters with the block layer") Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: liuxixin <gliuxen@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-02 05:20:57 -07:00
Kuniyuki Iwashima	8757fd9500	nvme-tcp: Use WQ_PERCPU explicitly if wq_unbound is false. Since commit `21c05ca88a` ("workqueue: Add warnings and ensure one among WQ_PERCPU or WQ_UNBOUND is present"), we must explicitly set WQ_PERCPU or WQ_UNBOUND when creating workqueue. nvme_tcp_init_module() sets WQ_UNBOUND when the module param wq_unbound is set, but otherwise, WQ_PERCPU is missing, triggering the warning below: workqueue: nvme_tcp_wq is using neither WQ_PERCPU or WQ_UNBOUND. Setting WQ_PERCPU. WARNING: kernel/workqueue.c:5856 at __alloc_workqueue+0x1d02/0x2070 kernel/workqueue.c:5855, CPU#0: swapper/0/1 Let's set WQ_PERCPU if wq_unbound is false. Reported-by: syzbot+d078cba4418e65f61984@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6a1a9a86.323e8352.141b09.0001.GAE@google.com/ Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-02 03:46:27 -07:00
Bryam Vargas	53cd102a7a	nvmet: fix pre-auth out-of-bounds heap read in Discovery Get Log Page nvmet_execute_disc_get_log_page() validates only the dword alignment of the host-supplied Log Page Offset (lpo). The 64-bit offset is then added to a small kzalloc'd buffer that holds the discovery log page and the result is passed straight to nvmet_copy_to_sgl(), which memcpy()s data_len bytes out to the host with no source-side bound check: u64 offset = nvmet_get_log_page_offset(req->cmd); /* 64-bit host */ size_t data_len = nvmet_get_log_page_len(req->cmd); /* 32-bit host */ ... if (offset & 0x3) { ... } /* only check */ ... alloc_len = sizeof(hdr) + entry_size * discovery_log_entries(req); buffer = kzalloc(alloc_len, GFP_KERNEL); ... status = nvmet_copy_to_sgl(req, 0, buffer + offset, data_len); The Discovery controller is unauthenticated -- nvmet_host_allowed() returns true unconditionally for the discovery subsystem -- so the call is reachable pre-authentication by any TCP/RDMA/FC peer that can reach the nvmet target. With a discovery log page of ~1 KiB, an attacker requesting up to 4 KiB starting at offset == alloc_len reads the next slab page out and gets its content returned over the fabric (an empirical run on a default nvmet-tcp loopback target leaked 81 canonical kernel pointers in one Get Log Page response). Pointing the offset at unmapped kernel memory faults the in-kernel memcpy and crashes (or panics, on panic_on_oops=1) the target host instead. The attacker-controlled source-side offset pattern "nvmet_copy_to_sgl(req, 0, buffer + ATTACKER_OFFSET, ...)" is unique to nvmet_execute_disc_get_log_page in the entire nvmet codebase: every other Get Log Page handler in admin-cmd.c either ignores lpo (and silently starts every response at offset 0) or tracks a local destination offset with a fixed source pointer.
Validate the host-supplied offset against the log page size, cap the copy length to what is actually available, and zero-fill any remainder of the host transfer buffer. The zero-fill matches the existing short-response pattern in nvmet_execute_get_log_changed_ns() (admin-cmd.c) and prevents leaking transport SGL contents when the host asks for more bytes than the log page contains. Fixes: `a07b4970f4` ("nvmet: add a generic NVMe target") Cc: stable@vger.kernel.org Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-02 03:43:27 -07:00
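The three steps named in the commit above (validate the offset, cap the copy length, zero-fill the remainder) can be sketched in userspace C. This is an illustration under assumptions, not the actual patch: `copy_log_page()` and its parameters are hypothetical, and rejecting only offsets strictly greater than the log page size is an assumed policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: copy data_len bytes of a log page of alloc_len bytes
 * into dst, starting at a caller-supplied offset. Returns 0 on success and
 * -1 if the offset is misaligned or lies beyond the log page.
 */
static int copy_log_page(uint8_t *dst, size_t data_len,
			 const uint8_t *log, size_t alloc_len,
			 uint64_t offset)
{
	size_t avail, n;

	if (offset & 0x3)
		return -1;		/* not dword aligned */
	if (offset > alloc_len)
		return -1;		/* assumed policy: reject out-of-range offsets */

	avail = alloc_len - (size_t)offset;	/* bytes actually present */
	n = data_len < avail ? data_len : avail;	/* cap the copy length */

	memcpy(dst, log + offset, n);
	memset(dst + n, 0, data_len - n);	/* zero-fill the rest of the transfer */
	return 0;
}
```

Capping the length keeps the source read inside the allocation, while the zero-fill ensures the destination never carries stale bytes when the request is longer than the remaining log page.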
Achkinazi, Igor	88bac2c1a7	nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks When nvme_ns_head_submit_bio() remaps a bio from the multipath head to a per-path namespace, bio_set_dev() clears BIO_REMAPPED. The remapped bio is then resubmitted through submit_bio_noacct() which calls bio_check_eod() because BIO_REMAPPED is not set. This races with nvme_ns_remove() which zeroes the per-path capacity before synchronize_srcu(): CPU 0 (IO submission) --------------------- srcu_read_lock() nvme_find_path() -> ns [NVME_NS_READY is set] CPU 1 (namespace removal) ------------------------- clear_bit(NVME_NS_READY) set_capacity(ns->disk, 0) synchronize_srcu() <- blocks CPU 0 (IO submission) --------------------- bio_set_dev(bio, ns->disk->part0) [clears BIO_REMAPPED] submit_bio_noacct(bio) -> bio_check_eod() sees capacity=0 -> bio fails with IO error The SRCU read lock prevents synchronize_srcu() from completing, but does not prevent set_capacity(0) from executing. The bio fails the EOD check before it reaches the NVMe driver, so nvme_failover_req() never gets a chance to redirect it to another path of multipath. IO errors are reported to the application despite another path being available. On older kernels (before commit `0b64682e78` "block: skip unnecessary checks for split bio"), the same race was also reachable through split remainders resubmitted via submit_bio_noacct(). Fix this by setting BIO_REMAPPED after bio_set_dev() in nvme_ns_head_submit_bio(). This skips bio_check_eod() on the per-path device; the EOD check already passed on the multipath head. NVMe per-path namespace devices are always whole disks (bd_partno=0), so the blk_partition_remap() skip also gated by BIO_REMAPPED is a no-op. The flag does not persist across failover and cannot go stale if the namespace geometry changes between attempts: nvme_failover_req() calls bio_set_dev() to redirect the bio back to the multipath head, which clears BIO_REMAPPED. 
When nvme_requeue_work() resubmits through submit_bio_noacct(), bio_check_eod() runs normally against the current capacity. Same approach as commit `3a905c37c3` ("block: skip bio_check_eod for partition-remapped bios"). Fixes: `a7c7f7b2b6` ("nvme: use bio_set_dev to assign ->bi_bdev") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Igor Achkinazi <igor.achkinazi@dell.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-02 03:23:05 -07:00
liyouhong	4cf06977bd	nvme-multipath: require exact iopolicy names for module parameter The iopolicy module parameter uses strncmp prefix matching, so values like "numax" are accepted as "numa". The per-subsystem sysfs attribute already requires an exact match via sysfs_streq(). Parse both through a shared helper so invalid values are rejected consistently. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: liyouhong <liyouhong@kylinos.cn> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-02 03:16:26 -07:00
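The difference between prefix matching and exact matching described in the commit above can be shown with a short userspace sketch. The helper names and the policy table are illustrative assumptions; the kernel's shared helper is not reproduced here, and sysfs_streq() additionally tolerates a trailing newline, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative policy table; indices are only meaningful within this sketch. */
static const char *const policies[] = { "numa", "round-robin", "queue-depth" };
#define NR_POLICIES (sizeof(policies) / sizeof(policies[0]))

/* Prefix matching: compares only strlen(policy) bytes, so "numax" matches "numa". */
static int parse_policy_prefix(const char *val)
{
	for (size_t i = 0; i < NR_POLICIES; i++)
		if (!strncmp(val, policies[i], strlen(policies[i])))
			return (int)i;
	return -1;
}

/* Exact matching: the whole string must equal a known policy name. */
static int parse_policy_exact(const char *val)
{
	for (size_t i = 0; i < NR_POLICIES; i++)
		if (!strcmp(val, policies[i]))
			return (int)i;
	return -1;
}
```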
John Garry	f078d1aa52	nvme-multipath: pass NS head to nvme_mpath_revalidate_paths() In nvme_mpath_revalidate_paths(), we are passed a NS pointer and use that to look up the NS head and then use that same NS pointer as an iter variable. It makes more sense to pass the NS head and use a local variable for the NS iter. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2026-06-02 03:15:08 -07:00
Jens Axboe	b30288887c	Merge tag 'md-7.2-20260531' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-7.2/block Pull MD updates and fixes from Yu Kuai: "Bug Fixes: - Only requeue dm-raid bios when dm is suspending. (Benjamin Marzinski) - Reset raid10 read_slot when reusing r10bio for discard. (Chen Cheng) - Fix raid1/raid10 deadlock in read error recovery path. (Abd-Alrhman Masalkhi) - Fix raid1/raid10 error-path detection with md_cloned_bio(). (Abd-Alrhman Masalkhi) - Fix raid1/raid10 bio accounting for split md cloned bios. (Abd-Alrhman Masalkhi) - Fix raid1 nr_pending leak in REQ_ATOMIC bad-block path. (Abd-Alrhman Masalkhi) Improvements: - Skip redundant raid_disks updates when the value is unchanged. (Abd-Alrhman Masalkhi) Cleanups: - Update MAINTAINERS email addresses. (Yu Kuai, Li Nan) - Clean up raid1 read error handling. (Christoph Hellwig) - Move the exceed_read_errors condition out of fix_read_error(). (Christoph Hellwig) - Use str_plural() in raid0 dump_zones(). (Thorsten Blum)" * tag 'md-7.2-20260531' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux: md/raid0: use str_plural helper in dump_zones raid1: fix nr_pending leak in REQ_ATOMIC bad-block error path md/raid1: move the exceed_read_errors condition out of fix_read_error md/raid1: cleanup handle_read_error md/raid1,raid10: fix bio accounting for split md cloned bios md/raid1,raid10: fix error-path detection with md_cloned_bio() md/raid1,raid10: fix deadlock in read error recovery path md/raid10: reset read_slot when reusing r10bio for discard md: skip redundant raid_disks update when value is unchanged dm-raid: only requeue bios when dm is suspending MAINTAINERS: Update Li Nan's E-mail address MAINTAINERS: update Yu Kuai's email address	2026-06-01 12:52:20 -06:00
Christoph Böhmwalder	9310b955c8	MAINTAINERS: use new drbd-dev mailing list We are migrating from our own infrastructure to lists.linux.dev, so change the drbd-dev address to point to the new domain. Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://patch.msgid.link/20260513065557.36042-1-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-06-01 08:23:43 -06:00
Rosen Penev	2e1b3f4c51	rbd: check snap_count against RBD_MAX_SNAP_COUNT snap_count is u32 but the comparison is against a SIZE_MAX-derived value (~2^61 on 64-bit), which clang flags as always false with -Wtautological-constant-out-of-range-compare. The proper check here should be that snap_count does not go over RBD_MAX_SNAP_COUNT. Assisted-by: Opencode:Big-pickle Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Alex Elder <elder@riscstar.com> Link: https://patch.msgid.link/20260530011255.52916-1-rosenp@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-05-31 19:48:53 -06:00
Haoze Xie	2957771379	rust: block: fix GenDisk cleanup paths GenDiskBuilder::build() still has fallible work after __blk_mq_alloc_disk(), but its error path only recovers the foreign queue data. That leaks the temporary gendisk and request_queue until later teardown. If the caller moved the last Arc<TagSet<T>> into build(), the leaked queue can retain blk-mq state after the tag set is dropped. Fix the pre-registration failure path by dropping the temporary gendisk reference with put_disk() before recovering queue_data, so disk_release() can tear down the owned queue. Also pair GenDisk::drop() with put_disk() after del_gendisk(). Once a Rust GenDisk has been added with device_add_disk(), del_gendisk() only unregisters it; the final gendisk reference still has to be dropped to complete the release path. Fixes: `3253aba340` ("rust: block: introduce `kernel::block::mq` module") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Signed-off-by: Haoze Xie <royenheart@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Link: https://patch.msgid.link/b70aff9a920cc42110fe5cf454c3099561863519.1780063368.git.royenheart@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-05-31 19:45:06 -06:00
Thorsten Blum	717359a168	md/raid0: use str_plural helper in dump_zones Replace the manual ternary "s" pluralization with str_plural() to simplify the code. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20260527141932.1243503-2-thorsten.blum@linux.dev Signed-off-by: Yu Kuai <yukuai@fygo.io>	2026-05-31 19:09:20 +08:00
Abd-Alrhman Masalkhi	909d9dc3b5	raid1: fix nr_pending leak in REQ_ATOMIC bad-block error path In raid1_write_request(), each per-mirror loop iteration begins by incrementing rdev->nr_pending. If a REQ_ATOMIC write encounters a badblock within the requested range, the code jumps to err_handle without dropping the reference taken for the current mirror. err_handle's cleanup loop only decrements nr_pending for slots where k < i and r1_bio->bios[k] is non-NULL. The current slot is therefore skipped, leaving its nr_pending reference leaked permanently. The reference prevents the rdev from ever being removed, since raid1_remove_conf() refuses to remove an rdev with nr_pending > 0. Fix this by calling rdev_dec_pending() before jumping to err_handle. Fixes: `f2a38abf5f` ("md/raid1: Atomic write support") Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://patch.msgid.link/20260530151411.4119-1-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fygo.io>	2026-05-31 19:09:19 +08:00
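The reference leak described in the commit above follows a common pattern: a loop takes a reference at the top of each iteration, and an early exit skips the slot currently being processed. The sketch below is a simplified userspace model; `struct mirror` and `submit_atomic_write()` are hypothetical, and the real cleanup loop also checks r1_bio->bios[k], which is omitted here.

```c
#include <assert.h>

/* Hypothetical model of one mirror: a pending-reference count and a bad-block flag. */
struct mirror {
	int nr_pending;
	int has_badblock;
};

/* Returns 0 on success, -1 on error. On error no reference may stay held. */
static int submit_atomic_write(struct mirror *m, int n)
{
	int i, k;

	for (i = 0; i < n; i++) {
		m[i].nr_pending++;		/* reference taken for slot i */
		if (m[i].has_badblock) {
			m[i].nr_pending--;	/* the fix: drop slot i before bailing out */
			goto err_handle;
		}
	}
	return 0;

err_handle:
	for (k = 0; k < i; k++)			/* only covers slots before i */
		m[k].nr_pending--;
	return -1;
}
```

Without the decrement inside the bad-block branch, slot i would keep a count of 1 after the error return, which is the leak the commit fixes.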
Christoph Hellwig	6e3b0b9133	md/raid1: move the exceed_read_errors condition out of fix_read_error This condition much better fits into the only caller, limiting fix_read_error to actually fix up data devices after a read error. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260529054308.2720300-3-hch@lst.de Signed-off-by: Yu Kuai <yukuai@fygo.io>	2026-05-31 19:09:19 +08:00
Christoph Hellwig	fcba803132	md/raid1: cleanup handle_read_error Unwind the main conditional with duplicate conditions and initialize variables at initialization time where possible. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260529054308.2720300-2-hch@lst.de Signed-off-by: Yu Kuai <yukuai@fygo.io>	2026-05-31 19:09:19 +08:00
Abd-Alrhman Masalkhi	ba976e3501	md/raid1,raid10: fix bio accounting for split md cloned bios Use md_cloned_bio() to control bio accounting instead of relying on r1bio_existed in raid1 or the io_accounting flag in raid10. The previous logic does not reliably reflect whether a bio is an md cloned bio. When a failed bio is split and resubmitted via bio_submit_split_bioset() on the error path, this can lead to either double accounting for md cloned bios, or missing accounting for bios returned from bio_submit_split_bioset(). Fix this by using md_cloned_bio() to detect md cloned bios and skip accounting accordingly. Fixes: `bb2a9acefa` ("md/raid1: switch to use md_account_bio() for io accounting") Fixes: `8204552383` ("md/raid10: switch to use md_account_bio() for io accounting") Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Reviewed-by: Xiao Ni <xiao@kernel.org> Link: https://patch.msgid.link/20260501114652.590037-4-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fygo.io>	2026-05-31 19:09:18 +08:00
Abd-Alrhman Masalkhi	811545e092	md/raid1,raid10: fix error-path detection with md_cloned_bio() Detect the error path using md_cloned_bio() instead of relying on r1_bio in raid1 or r10_bio->read_slot in raid10, which may be NULL or -1 after splitting and resubmitting a failed bio. As a result, the error path may not be recognized and memory allocations can incorrectly use GFP_NOIO instead of (GFP_NOIO | __GFP_HIGH), which can lead to a deadlock under memory pressure. Fixes: `689389a06c` ("md/raid1: simplify handle_read_error().") Fixes: `545250f248` ("md/raid10: simplify handle_read_error()") Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Reviewed-by: Xiao Ni <xiao@kernel.org> Link: https://patch.msgid.link/20260501114652.590037-3-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fygo.io>	2026-05-31 19:09:18 +08:00
Abd-Alrhman Masalkhi	7b15c24f80	md/raid1,raid10: fix deadlock in read error recovery path raid1d and raid10d may resubmit a split md cloned bio while handling a read error. In this case, resubmitting the bio can lead to a deadlock if the array is suspended before md_handle_request() acquires an active_io reference via percpu_ref_tryget_live(). Since the cloned bio already holds an active_io reference, trying to acquire another reference via percpu_ref_tryget_live() can lead to a deadlock while the array is suspended. Fix this by using percpu_ref_get() for md cloned bios. Fixes: `bb2a9acefa` ("md/raid1: switch to use md_account_bio() for io accounting") Fixes: `8204552383` ("md/raid10: switch to use md_account_bio() for io accounting") Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Reviewed-by: Xiao Ni <xiao@kernel.org> Reviewed-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/20260501114652.590037-2-abd.masalkhi@gmail.com Signed-off-by: Yu Kuai <yukuai@fygo.io>	2026-05-31 19:09:18 +08:00

1444597 Commits