NVMe over TCP
NVMe over TCP, often written NVMe/TCP, is a network transport protocol defined as part of the NVMe over Fabrics (NVMe-oF) specification. It extends the NVMe standard across TCP/IP networks, enabling NVMe-oF commands to be carried over standard Ethernet-based infrastructure and providing scalable, efficient access to NVMe storage devices without specialized hardware or networking interfaces.
Background
NVMe is a high-performance interface and protocol originally designed for direct-attached PCIe SSDs. While NVMe greatly improved local storage throughput and latency compared to older protocols like AHCI, it was initially limited by PCIe's short reach (within a server or a rack).[1] NVMe-oF was introduced to extend NVMe across network fabrics, enabling remote access to NVMe devices with minimal added latency. The NVMe-oF 1.0 specification was released in mid-2016 by NVM Express, supporting transports such as RDMA (e.g. RoCE, InfiniBand) and later Fibre Channel.[1] This approach preserves the NVMe command structure over a network, incurring as little as ~10 µs of additional latency in early demonstrations.[1]
As NVMe-oF gained traction, there was a push to leverage ubiquitous Ethernet/IP networks for NVMe transport. NVMe/TCP emerged as an NVMe-oF transport binding that uses standard TCP/IP as the fabric. Work on NVMe/TCP began in the late 2010s to offer a flexible, cost-effective alternative to specialized RDMA networks.[1][2] NVMe/TCP encapsulates NVMe commands and data inside TCP packets over Ethernet, avoiding the need for proprietary adapters or lossless networks. By 2018–2019 the NVMe/TCP specification was finalized under the NVMe-oF 1.1 standard, bringing NVMe-oF to any IP-based data center network.[3][4]
EE Times described NVMe/TCP as a means to address the "data tsunami" from AI and IoT by allowing tens of thousands of NVMe drives to appear local to hosts over Ethernet.[5] A standalone NVMe/TCP specification (Revision 1.0) was ratified and published in 2021.
Principles of operation
Architecture
NVMe over TCP operates within the NVMe-oF framework using a client–server architecture consisting of NVMe initiators (hosts) and NVMe targets (subsystems). Targets expose NVMe namespaces (storage volumes) over the network, allowing hosts to access remote NVMe controllers much as they would local storage devices.[1] NVMe/TCP encapsulates standard NVMe command and completion traffic directly within TCP/IP packets transmitted over Ethernet. Unlike RDMA's memory-based approach, NVMe/TCP employs message-based transfers comparable to NVMe over Fibre Channel. Each NVMe queue pair, comprising a submission queue and a completion queue, is mapped onto its own TCP connection, enabling parallelism with support for up to 64K queues, each accommodating up to 64K commands.[2]
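For illustration, a minimal Python sketch of this mapping is shown below. It assumes a reachable target at a placeholder address; because each queue pair uses its own TCP connection, a host that wants one admin queue and several I/O queues simply opens that many connections to the target's NVMe/TCP port (4420 by IANA assignment):

import socket

# Illustrative only: each NVMe queue pair maps to a dedicated TCP connection,
# so a host with one admin queue and N I/O queues opens N+1 connections.
# The address below is a documentation example, not a real target.
TARGET_ADDR = ("192.0.2.10", 4420)   # 4420 is the IANA-assigned NVMe/TCP port

def open_queue_connections(num_io_queues: int) -> list:
    """Open one TCP connection per NVMe queue pair (admin + I/O)."""
    connections = []
    for _ in range(num_io_queues + 1):           # +1 for the admin queue pair
        sock = socket.create_connection(TARGET_ADDR)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # reduce latency
        connections.append(sock)
    return connections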
Data transfer
Data transfers in NVMe/TCP follow standard NVMe-oF operations. Commands and responses are encapsulated in capsules sent over TCP connections.[6] Small data transfers may be included directly within the capsules, while larger transfers use separate TCP segments. For write commands, the host transmits data segments following the command capsule; for read commands, the target sends data segments back to the host. NVMe/TCP relies on TCP's reliable, in-order delivery rather than remote DMA, eliminating the need for specialized hardware but slightly increasing CPU overhead and latency compared to RDMA transports.[7] Although NVMe/TCP generally introduces additional latency, typically tens of microseconds compared to RDMA, performance optimizations such as kernel bypass and hardware offloads can reduce this impact.[8] NVMe/TCP supports high throughput and scalability in standard IP networks without requiring lossless fabrics.[7]
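The simplified ordering of messages for a single read or write can be sketched as follows (Python, illustrative only; the names mirror the PDU types described under NVMe/TCP Protocol Data Units below, and real exchanges may add ready-to-transfer handshakes or split data across multiple PDUs):

from enum import Enum

class Pdu(Enum):
    CAPSULE_CMD = "CapsuleCmd"     # command capsule, host -> target
    H2C_DATA = "H2CData"           # write data, host -> target
    C2H_DATA = "C2HData"           # read data, target -> host
    CAPSULE_RESP = "CapsuleResp"   # completion, target -> host

def expected_pdu_sequence(is_write: bool) -> list:
    """Simplified PDU order for one I/O command over an NVMe/TCP connection."""
    if is_write:
        # Host sends the command, then the data; the target answers with a completion.
        return [Pdu.CAPSULE_CMD, Pdu.H2C_DATA, Pdu.CAPSULE_RESP]
    # Host sends the command; the target returns the data, then a completion.
    return [Pdu.CAPSULE_CMD, Pdu.C2H_DATA, Pdu.CAPSULE_RESP]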
NVMe-oF capsules
Capsules are the basic communication units within NVMe over Fabrics (NVMe-oF), used to transport NVMe commands and completions across networks. A capsule carries either an NVMe command from host to target or an NVMe completion response from target to host, along with optional data payloads and scatter-gather lists (SGLs). Small data transfers may be contained entirely within a capsule ("in-capsule data"), while larger transfers use separate data messages referenced by SGLs. Capsules abstract NVMe operations from underlying network transports, allowing them to span multiple packets or frames as required, independent of the specific transport protocol, such as Ethernet, TCP, or RDMA.[6]
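As a rough illustration, a command capsule can be sketched in Python as a fixed 64-byte NVMe submission queue entry followed by optional in-capsule data. The opcode, command-identifier, and namespace-identifier offsets are standard NVMe submission queue entry fields; the helper itself is a simplified, hypothetical construction:

import struct

def build_command_capsule(opcode: int, command_id: int, nsid: int,
                          in_capsule_data: bytes = b"") -> bytes:
    """Sketch of an NVMe-oF command capsule: 64-byte SQE plus optional data."""
    sqe = bytearray(64)                          # NVMe submission queue entry
    sqe[0] = opcode                              # byte 0: opcode (e.g. 0x02 = read)
    struct.pack_into("<H", sqe, 2, command_id)   # bytes 2-3: command identifier
    struct.pack_into("<I", sqe, 4, nsid)         # bytes 4-7: namespace identifier
    # Remaining SQE fields (SGL descriptor, starting LBA, length, ...) omitted.
    return bytes(sqe) + in_capsule_data          # small writes may ride in-capsule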
NVMe/TCP Protocol Data Units
To transmit NVMe-oF capsules over TCP, NVMe/TCP defines protocol data units (PDUs) that frame the capsules and data within the TCP byte stream. Each PDU contains a header indicating its type (command capsule, response, or data) and length, allowing the receiver to parse the TCP stream into discrete NVMe messages. For example, a host transmits a Command Capsule PDU for each NVMe command, followed by Data PDUs carrying the payload of a write operation. For reads, the target returns Data PDUs with the requested data and a Response PDU containing the NVMe completion status.[2]
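A minimal Python sketch of how a receiver might parse this framing is shown below. It assumes the 8-byte common header layout (PDU type, flags, header length, data offset, total PDU length) defined by the NVMe/TCP transport; the type values listed cover the commonly used PDUs and are shown for illustration:

import struct

# Sketch of the 8-byte common header that begins every NVMe/TCP PDU; the
# field layout lets a receiver split the TCP byte stream into discrete
# NVMe messages.
PDU_TYPE_NAMES = {
    0x00: "ICReq",        # connection initialization request (host -> target)
    0x01: "ICResp",       # connection initialization response
    0x04: "CapsuleCmd",   # command capsule
    0x05: "CapsuleResp",  # response capsule
    0x06: "H2CData",      # host-to-controller data (writes)
    0x07: "C2HData",      # controller-to-host data (reads)
    0x09: "R2T",          # ready to transfer (target solicits write data)
}

def parse_common_header(buf: bytes) -> dict:
    """Parse the common header at the start of a PDU (little-endian fields)."""
    pdu_type, flags, hlen, pdo, plen = struct.unpack_from("<BBBBI", buf, 0)
    return {
        "type": PDU_TYPE_NAMES.get(pdu_type, hex(pdu_type)),
        "flags": flags,
        "header_length": hlen,   # HLEN: length of the PDU header
        "data_offset": pdo,      # PDO: offset of data within the PDU, if any
        "pdu_length": plen,      # PLEN: total PDU length; marks the next PDU boundary
    }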
PDUs also handle session management. After establishing a TCP connection, hosts and controllers perform an NVMe-oF Connect operation. The standard NVMe Keep Alive command is used to maintain connection liveness. Flow control relies primarily on TCP's built-in window mechanisms, with optional host-side queue-depth control.[9]
Applications
NVMe over TCP is frequently used in cloud and hyperscale data centers, supporting resource disaggregation and software-defined storage architectures.[5][10] NVMe/TCP enables pools of NVMe SSDs to be attached and shared over standard Ethernet networks, allowing storage capacity to scale independently of computing resources.[7]
In enterprise environments, NVMe/TCP is adopted to modernize storage area networks (SANs). Because it runs over ordinary IP networks, the protocol is often compared to iSCSI, but it is adapted to the higher performance characteristics of NVMe-based storage.[8]
Implementations
Multiple implementations of NVMe over TCP exist in both hardware and software forms. In the open-source domain, NVMe/TCP support was added to the Linux kernel in version 5.0, released in early 2019, which includes initiator (client) and target (server) drivers.[9] As a result, modern Linux distributions can function as NVMe/TCP hosts or targets using built-in kernel modules. The introduction of NVMe/TCP in Linux simplified configurations compared to previous RDMA-based NVMe-oF implementations.[9]
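As a rough illustration of the kernel interface involved, the following Python sketch attaches a host to an NVMe/TCP target by writing a connect string to /dev/nvme-fabrics, the same mechanism the nvme-cli "nvme connect" command uses internally. The target address and subsystem NQN are placeholders, and the sketch assumes the nvme-tcp module is loaded and root privileges are available:

# Hypothetical example: attach a Linux host to an NVMe/TCP target by writing
# connect options to /dev/nvme-fabrics (created by the nvme-fabrics module).
# The target address and subsystem NQN below are placeholders.
connect_options = ",".join([
    "transport=tcp",                           # select the TCP transport binding
    "traddr=192.0.2.10",                       # target IP address (example)
    "trsvcid=4420",                            # NVMe/TCP port (IANA-assigned)
    "nqn=nqn.2014-08.org.example:subsystem1",  # target subsystem NQN (example)
])

with open("/dev/nvme-fabrics", "w") as fabrics:
    fabrics.write(connect_options)             # on success the kernel creates /dev/nvmeN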
Standardisation
The NVMe over TCP protocol is defined and maintained by NVM Express, which oversees the NVMe family of specifications. NVMe/TCP was developed as an extension of the NVMe over Fabrics (NVMe-oF) specification. The initial NVMe-oF 1.0 specification, ratified in 2016, supported RDMA transports and provided a foundation for later Fibre Channel integration.[1] The NVMe/TCP transport binding was completed and released for member review in July 2019, approved alongside the NVMe-oF 1.1 specification later that year, and issued as a standalone specification (Revision 1.0) in May 2021.[3][4]
Since 2020, NVMe/TCP has been integrated into the broader NVMe specification framework, specifically as the NVMe Transport Specification (TCP Transport) within the NVMe 2.x series. Under this structure, NVMe/TCP is versioned and maintained alongside other NVMe specifications. As of NVMe version 2.2 (2025), NVMe/TCP continues as an official transport method, undergoing periodic updates to improve functionality and add features.[11]
References
External links