where β can be constant (usually set to 1) or trainable.
The swish family was designed to smoothly interpolate between a linear function and the ReLU function.
When considering positive values, Swish is a particular case of the doubly parameterized sigmoid shrinkage function defined in [2]: Eq 3. Variants of the swish function include Mish.[3]
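For illustration, a minimal NumPy sketch of the function with β passed as an argument (the helper name swish is chosen for this example, not taken from any particular library):

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x) = x / (1 + exp(-beta * x))."""
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-5.0, 5.0, 11)
print(swish(x))             # beta = 1: the SiLU
print(swish(x, beta=4.0))   # larger beta: sharper, closer to ReLU
```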
Special values
For β=0, the function is linear: f(x) = x/2.
For β=1, the function is the Sigmoid Linear Unit (SiLU).
For β→∞, the sigmoid factor approaches a 0-1 step function, so the function converges to the ReLU function. Thus, the swish family smoothly interpolates between a linear function and the ReLU function.[1]
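These special values can be checked numerically; the sketch below assumes the same standalone swish helper as above:

```python
import numpy as np

def swish(x, beta):
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-4.0, 4.0, 9)

# beta = 0 reduces to the linear function x/2.
print(np.allclose(swish(x, 0.0), x / 2))                   # True

# beta = 1 is the SiLU, x * sigmoid(x).
print(np.allclose(swish(x, 1.0), x / (1 + np.exp(-x))))    # True

# Moderately large beta already lies close to ReLU = max(0, x).
print(np.max(np.abs(swish(x, 20.0) - np.maximum(0.0, x)))) # ~2e-9
```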
Since swish_β(x) = swish_1(βx)/β, all instances of swish have the same shape as the default swish_1, zoomed by 1/β. One usually sets β > 0. When β is trainable, this constraint can be enforced by β = e^b, where b is trainable.
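A minimal sketch of this reparameterization, assuming PyTorch (the class name Swish and the initialization b = 0, i.e. β = 1, are choices made for this example):

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    """Swish activation with trainable beta, kept positive via beta = exp(b)."""
    def __init__(self):
        super().__init__()
        # Unconstrained trainable scalar b; b = 0 gives beta = 1 (the SiLU).
        self.b = nn.Parameter(torch.zeros(()))

    def forward(self, x):
        beta = torch.exp(self.b)            # beta > 0 by construction
        return x * torch.sigmoid(beta * x)

# Usage: acts element-wise like any other activation module.
layer = Swish()
y = layer(torch.randn(3, 4))
```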
Derivatives
Because swish_β(x) = swish_1(βx)/β, it suffices to calculate its derivatives for the default case β = 1. The first derivative is

swish_1′(x) = (1 + e^(−x) + x·e^(−x)) / (1 + e^(−x))^2,

so swish_1′(x) − 1/2 is odd. The second derivative is

swish_1″(x) = e^(−x)·(2(1 + e^(−x)) + x·(e^(−x) − 1)) / (1 + e^(−x))^3,

so swish_1″ is even. Both symmetries follow from the fact that swish_1(x) − x/2 is an even function.
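As a sanity check, the closed form of the first derivative and its symmetry can be verified numerically (a quick sketch; the helper names are again just for this example):

```python
import numpy as np

def swish1(x):
    return x / (1.0 + np.exp(-x))

def swish1_prime(x):
    # Closed form: (1 + e^(-x) + x e^(-x)) / (1 + e^(-x))^2
    e = np.exp(-x)
    return (1 + e + x * e) / (1 + e) ** 2

x = np.linspace(-5.0, 5.0, 101)   # grid symmetric about 0
h = 1e-5

# Central finite difference agrees with the closed form.
numeric = (swish1(x + h) - swish1(x - h)) / (2 * h)
print(np.max(np.abs(numeric - swish1_prime(x))))   # tiny (about 1e-10)

# swish_1'(x) - 1/2 is odd: f(-x) = -f(x) on the symmetric grid.
f = swish1_prime(x) - 0.5
print(np.allclose(f, -f[::-1]))                    # True
```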
History
The SiLU was first proposed alongside the GELU in 2016,[4] and was proposed again in 2017 as the Sigmoid-weighted Linear Unit (SiL) in reinforcement learning.[5][1] Over a year after its initial publication, the same function was proposed once more under the name SWISH, initially without the learnable parameter β (so that β implicitly equaled 1); the swish paper was later updated to propose the activation with the learnable parameter β.