Single instruction, multiple threads

Single instruction, multiple threads (SIMT) is an execution model used in parallel computing in which single instruction, multiple data (SIMD) is combined with zero-overhead multithreading, i.e. multithreading where the hardware can switch between threads on a cycle-by-cycle basis. Two models of multithreading are involved. In addition to the zero-overhead multithreading just mentioned, the SIMD execution hardware is virtualized to present itself as a multiprocessor. It is nevertheless less flexible than an SPMD processor: the instructions of all "threads" execute in lock-step in the lanes of the SIMD processor, which can only execute the same instruction across all lanes in a given cycle. The SIMT execution model has been implemented on several GPUs and is relevant for general-purpose computing on graphics processing units (GPGPU); for example, some supercomputers combine CPUs with GPUs.

The processors, say $p$ of them, appear to execute many more than $p$ tasks. This is achieved by giving each processor multiple "threads" (also called "work-items" or "sequences of SIMD lane operations"), which execute in lock-step and are analogous to SIMD lanes.^[1]

The simplest way to understand SIMT is to imagine a multi-core system in which each core has its own register file, its own ALUs (both SIMD and scalar) and its own data cache. A standard multi-core system also has multiple independent instruction caches and decoders, as well as multiple independent program counter registers. In a SIMT system, by contrast, instructions are synchronously broadcast to all SIMT cores from a single unit with a single instruction cache and a single instruction decoder, which reads instructions using a single program counter.
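The broadcast arrangement described above can be sketched as a toy simulation. This is plain Python, not GPU code, and everything in it (the `run_lockstep` function, the two-instruction "instruction set", the register names) is hypothetical and chosen only for illustration. It assumes a straight-line program with no branches.

```python
# Toy model of SIMT instruction broadcast: one program counter, many register files.
# All names here (run_lockstep, the tiny instruction set) are hypothetical.

def run_lockstep(program, register_files):
    """Run `program` on every lane in lock-step using one shared program counter."""
    pc = 0  # the single program counter shared by all lanes
    while pc < len(program):
        op, dst, a, b = program[pc]      # fetched and decoded once per step
        for regs in register_files:      # then broadcast to every lane
            if op == "add":
                regs[dst] = regs[a] + regs[b]
            elif op == "mul":
                regs[dst] = regs[a] * regs[b]
            else:
                raise ValueError(f"unknown op: {op}")
        pc += 1
    return register_files


# Four lanes, each with a private register file holding different data.
lanes = [{"x": i, "y": 10, "z": 0} for i in range(4)]
program = [
    ("mul", "z", "x", "x"),   # z = x * x
    ("add", "z", "z", "y"),   # z = z + y
]
result = run_lockstep(program, lanes)
print([regs["z"] for regs in result])  # [10, 11, 14, 19]
```

Each instruction is decoded once and applied to every register file, so the lanes produce different results only because their private data differ.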

The key difference between SIMT cores and SIMD lanes is that each SIMT core may have a completely different stack pointer (and thus perform computations on completely different data sets), whereas SIMD lanes are simply part of an ALU that knows nothing about memory per se.

However, the SIMT execution model is still only a way of presenting to the programmer what is fundamentally a SIMD core. Programs must be designed with the SIMD architecture in mind. SIMT may allow threads to diverge by branching, but this should be avoided wherever possible. A divergent branch results in the equivalent of executing multiple SIMD instructions, one per branch path, in which the SIMD lanes that did not take that path are masked off and remain idle, which wastes execution capacity. In other words, the multithreading aspect of SIMT is only a way to organize the flow of computation. It is not a feature that the programmer should attempt to exploit to its full extent for its own sake.
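The cost of divergence can be illustrated with another toy simulation. Again this is plain Python rather than GPU code, and the `diverging_branch` function and the arithmetic in the two branch paths are hypothetical. The sketch assumes a simple model in which each side of a branch is issued as one masked pass over all lanes, and a side is skipped only when no lane takes it.

```python
# Toy model of branch divergence under SIMT. Not GPU code; names are hypothetical.

def diverging_branch(values):
    """Apply `x // 2 if x is even else 3 * x + 1` to each lane using masks.

    Returns the per-lane results and the number of masked passes issued.
    """
    lanes = range(len(values))
    take_then = [v % 2 == 0 for v in values]   # per-lane branch condition
    out = list(values)
    passes = 0

    # "then" side: issued only if at least one lane takes it
    if any(take_then):
        passes += 1
        for i in lanes:
            if take_then[i]:               # lanes with the mask off stay idle
                out[i] = values[i] // 2

    # "else" side: issued only if at least one lane did not take "then"
    if not all(take_then):
        passes += 1
        for i in lanes:
            if not take_then[i]:
                out[i] = 3 * values[i] + 1

    return out, passes


uniform = diverging_branch([2, 4, 6, 8])
mixed = diverging_branch([1, 2, 3, 4])
print(uniform)  # ([1, 2, 3, 4], 1)
print(mixed)    # ([4, 1, 10, 2], 2)
```

When all lanes agree on the condition, a single pass suffices. When they disagree, both sides are issued and in each pass some lanes sit idle, which is the inefficiency the paragraph above describes.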

Also important to note is the difference between SIMT and SPMD (single program, multiple data). SPMD, like standard multi-core systems, has multiple program counters, whereas SIMT has a single one.


NVIDIA CUDA	OpenCL	Hennessy & Patterson^[7]
Thread	Work-item	Sequence of SIMD Lane operations
Warp	Sub-group	Thread of SIMD Instructions
Block	Work-group	Body of vectorized loop
Grid	NDRange	Vectorized loop
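To make the hierarchy in the table concrete, the following sketch locates a single thread within its block, warp and grid, using CUDA's naming. It is plain Python arithmetic, not GPU code. The `locate_thread` helper is hypothetical, the layout is assumed to be one-dimensional, and the warp width of 32 is an assumption that matches NVIDIA GPUs but need not hold for other hardware.

```python
WARP_SIZE = 32  # assumption: warp width on NVIDIA GPUs; other hardware may differ


def locate_thread(block_idx, thread_idx, block_dim, warp_size=WARP_SIZE):
    """Locate one thread in a 1-D grid of 1-D blocks (hypothetical helper).

    block_idx, thread_idx and block_dim play the roles of CUDA's
    blockIdx.x, threadIdx.x and blockDim.x.
    """
    if not 0 <= thread_idx < block_dim:
        raise ValueError("thread_idx must lie within the block")
    return {
        "global_id": block_idx * block_dim + thread_idx,  # position in the grid
        "warp_in_block": thread_idx // warp_size,         # which warp of the block
        "lane": thread_idx % warp_size,                   # position within the warp
    }


print(locate_thread(block_idx=2, thread_idx=70, block_dim=128))
# {'global_id': 326, 'warp_in_block': 2, 'lane': 6}
```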
