where β can be constant (usually set to 1) or trainable.
The swish family was designed to smoothly interpolate between a linear function and the ReLU function.
When considering positive values, Swish is a particular case of the doubly parameterized sigmoid shrinkage function defined in [2]: Eq 3. Variants of the swish function include Mish.[3]
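For illustration, a minimal NumPy sketch of the function with β passed as an argument (the helper name swish is chosen for this example, not taken from any particular library):

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x) = x / (1 + exp(-beta * x))."""
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-5.0, 5.0, 11)
print(swish(x))             # beta = 1: the SiLU
print(swish(x, beta=4.0))   # larger beta: sharper, closer to ReLU
```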
Special values
For β=0, the function is linear: f(x) = x/2.
For β=1, the function is the Sigmoid Linear Unit (SiLU).
For β→∞, the sigmoid factor approaches a 0-1 step function, so the function converges to the ReLU function. Thus, the swish family smoothly interpolates between a linear function and the ReLU function.[1]
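These special values can be checked numerically; the sketch below assumes the same standalone swish helper as above:

```python
import numpy as np

def swish(x, beta):
    return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-4.0, 4.0, 9)

# beta = 0 reduces to the linear function x/2.
print(np.allclose(swish(x, 0.0), x / 2))                   # True

# beta = 1 is the SiLU, x * sigmoid(x).
print(np.allclose(swish(x, 1.0), x / (1 + np.exp(-x))))    # True

# Moderately large beta already lies close to ReLU = max(0, x).
print(np.max(np.abs(swish(x, 20.0) - np.maximum(0.0, x)))) # ~2e-9
```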
Since swish_β(x) = swish_1(βx)/β, all instances of swish have the same shape as the default swish_1, zoomed by 1/β. One usually sets β > 0. When β is trainable, this constraint can be enforced by β = e^b, where b is trainable.
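A minimal sketch of this reparameterization, assuming PyTorch (the class name Swish and the initialization b = 0, i.e. β = 1, are choices made for this example):

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    """Swish activation with trainable beta, kept positive via beta = exp(b)."""
    def __init__(self):
        super().__init__()
        # Unconstrained trainable scalar b; b = 0 gives beta = 1 (the SiLU).
        self.b = nn.Parameter(torch.zeros(()))

    def forward(self, x):
        beta = torch.exp(self.b)            # beta > 0 by construction
        return x * torch.sigmoid(beta * x)

# Usage: acts element-wise like any other activation module.
layer = Swish()
y = layer(torch.randn(3, 4))
```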
Derivatives
Because swish_β(x) = swish_1(βx)/β, it suffices to calculate its derivatives for the default case β = 1. The first derivative is

swish_1′(x) = (1 + e^(−x) + x·e^(−x)) / (1 + e^(−x))^2,

so swish_1′(x) − 1/2 is odd. The second derivative is

swish_1″(x) = e^(−x)·(2(1 + e^(−x)) + x·(e^(−x) − 1)) / (1 + e^(−x))^3,

so swish_1″ is even. Both symmetries follow from the fact that swish_1(x) − x/2 is an even function.
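As a sanity check, the closed form of the first derivative and its symmetry can be verified numerically (a quick sketch; the helper names are again just for this example):

```python
import numpy as np

def swish1(x):
    return x / (1.0 + np.exp(-x))

def swish1_prime(x):
    # Closed form: (1 + e^(-x) + x e^(-x)) / (1 + e^(-x))^2
    e = np.exp(-x)
    return (1 + e + x * e) / (1 + e) ** 2

x = np.linspace(-5.0, 5.0, 101)   # grid symmetric about 0
h = 1e-5

# Central finite difference agrees with the closed form.
numeric = (swish1(x + h) - swish1(x - h)) / (2 * h)
print(np.max(np.abs(numeric - swish1_prime(x))))   # tiny (about 1e-10)

# swish_1'(x) - 1/2 is odd: f(-x) = -f(x) on the symmetric grid.
f = swish1_prime(x) - 0.5
print(np.allclose(f, -f[::-1]))                    # True
```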
History
The SiLU was first proposed alongside the GELU in 2016,[4] and was proposed again in 2017 as the Sigmoid-weighted Linear Unit (SiL) in reinforcement learning.[5][1] Over a year after its initial publication, the same function was proposed once more under the name SWISH, initially without the learnable parameter β (so that β implicitly equaled 1); the swish paper was later updated to propose the activation with the learnable parameter β.