Jenny Seidenschwarz
Technische Universität München
Seminar Course AutoML
Munich, 4th of July 2019
Searching for Activation Functions
Activation Functions
• Gradient-preserving property
• Easier to optimize
State-of-the-art default scalar activation function: ReLU, $f(x) = \max(0, x)$
Figure: ReLU activation function
Figure: Basic structure of a neural network
Research Goal
Find new scalar activation functions …
… using an automated search technique …
… compare them systematically to existing activation functions …
… across multiple challenging datasets!
Automated Search
Challenge: balance the size and expressivity of the search space
→ Simple binary expression tree [1]
→ Selection of unary and binary functions
Search Space
Unary: $x,\; -x,\; |x|,\; x^2,\; x^3,\; \sqrt{x},\; \beta x,\; x + \beta,\; \log(|x| + \epsilon),\; \exp(x),\; \sin(x),\; \cos(x),\; \sinh(x),\; \cosh(x),\; \tanh(x),\; \tan^{-1}(x),\; \sinh^{-1}(x),\; \mathrm{sinc}(x),\; \max(0, x),\; \min(0, x),\; \sigma(x),\; \log(1 + \exp(x)),\; \exp(-x^2),\; \mathrm{erf}(x),\; \beta$

Binary: $x_1 + x_2,\; x_1 \cdot x_2,\; x_1 - x_2,\; \frac{x_1}{x_2 + \epsilon},\; \max(x_1, x_2),\; \min(x_1, x_2),\; \sigma(x_1) \cdot x_2,\; \exp(-\beta (x_1 - x_2)^2),\; \exp(-\beta |x_1 - x_2|),\; \beta x_1 + (1 - \beta) x_2$

Figure: Core unit, composing unary and binary functions of the input x, adapted from [5]
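To make the core-unit construction concrete, here is a minimal Python sketch (my own illustration, not code from [5]): a candidate activation function is a binary function applied to two unary transformations of the preactivation x. Note that Swish itself is an instance of this space.

```python
import numpy as np

# A small subset of the unary and binary primitives listed above
UNARY = {
    "identity": lambda x: x,
    "neg":      lambda x: -x,
    "tanh":     np.tanh,
    "sigmoid":  lambda x: 1.0 / (1.0 + np.exp(-x)),
    "relu":     lambda x: np.maximum(0.0, x),
}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": np.maximum,
    "min": np.minimum,
}

def core_unit(x, unary1, unary2, binary):
    """One core unit: binary(unary1(x), unary2(x))."""
    return BINARY[binary](UNARY[unary1](x), UNARY[unary2](x))

# Example: identity * sigmoid reproduces Swish with beta = 1
x = np.linspace(-3.0, 3.0, 7)
print(core_unit(x, "identity", "sigmoid", "mul"))
```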
Search Approach
• Small search space → exhaustive search
• Large search space → reinforcement learning-based search
• Search loop: train child networks with the found activation functions → get a list of the best performing ones → update the search algorithm and find the best → empirical evaluation and experiments (see the sketch after this list)

RNN-controller [2] with domain-specific language [1]
Train a batch of generated activation functions:
• ResNet-20
• Image classification on CIFAR-10
• 10k steps
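The outer search loop might look like the following sketch. `Controller` and `train_child_network` are placeholder stand-ins of my own; in the paper the reward is the validation accuracy of a ResNet-20 child network trained for 10k steps on CIFAR-10, and the controller update is a PPO step.

```python
import random

class Controller:
    """Stand-in for the RNN controller that samples candidate activations."""
    def sample(self):
        return random.choice(["x*sigmoid(x)", "max(x, tanh(x))", "cos(x)-x"])

    def update(self, candidates, rewards):
        pass  # the PPO policy-gradient step would go here

def train_child_network(candidate):
    # Stub reward; in [5] this is the CIFAR-10 validation accuracy of a
    # ResNet-20 child network trained for 10k steps with `candidate`
    # as its activation function.
    return random.random()

def search(controller, n_rounds=5, batch_size=4):
    history = []
    for _ in range(n_rounds):
        batch = [controller.sample() for _ in range(batch_size)]
        rewards = [train_child_network(c) for c in batch]
        history += list(zip(batch, rewards))
        controller.update(batch, rewards)  # improve the controller
    # return candidates ordered by reward for the final empirical evaluation
    return sorted(history, key=lambda cr: cr[1], reverse=True)

print(search(Controller())[:3])
```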
RNN-Controller
Figure: RNN-Controller architecture, predicting the components of a core unit [5]
RNN-Controller update
Policy gradient methods: sample $a_t \sim \pi(a_t \mid s_t, \theta_c)$ and update $\Delta\theta \Leftarrow \alpha\,\nabla\mathcal{L}(\theta_c)$

RNN Controller with REINFORCE:
• Objective function: $\mathcal{L}(\theta_c) = \mathbb{E}_{\pi_\theta,\tau}[G_t]$, where $G_t = \sum_{k=t}^{T-1} \gamma^{k-t}\, r_{k+1}$
• Objective gradient for one child network: $\nabla\mathcal{L}(\theta_c) = \sum_{t=1}^{T} \nabla_{\theta_c} \log \pi(a_t \mid s_t, \theta_c)\,(G - b)$
→ $G$ = accuracy of the child network
→ $b$ = exponential moving average of rewards

RNN Controller with PPO:
• Objective function: $L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t\, G_t,\; \mathrm{clip}(r_t,\, 1-\varepsilon,\, 1+\varepsilon)\, G_t\right)\right]$, with $r_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$
→ Clipping keeps updates within a "trust region"
→ Sample efficient

Figure: clipping function, adapted from [3]
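A minimal numpy sketch of the clipped surrogate objective (my own illustration of the formula above, not the paper's code): for each sampled action we compare the unclipped ratio-weighted return with the clipped one and take the pessimistic minimum.

```python
import numpy as np

def ppo_clip_objective(ratio, returns, eps=0.2):
    """L^CLIP = E_t[min(r_t * G_t, clip(r_t, 1-eps, 1+eps) * G_t)].

    ratio:   r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), per action
    returns: G_t, here the baseline-subtracted child-network accuracy
    """
    unclipped = ratio * returns
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * returns
    return np.mean(np.minimum(unclipped, clipped))

# Toy example: large ratios are clipped, so a single update cannot be
# rewarded for moving far away from the old policy
ratio = np.array([0.5, 1.0, 3.0])
returns = np.array([0.2, 0.2, 0.2])
print(ppo_clip_objective(ratio, returns))
```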
Findings on Activation Functions
1. 1–2 core units perform best
2. Top activation functions always take the raw preactivation x as input to the final binary function
3. Periodic functions (sin, cos, etc.) are used by some top-performing activation functions
4. Activation functions that use division perform poorly
Findings on Activation Functions
$x \cdot \sigma(\beta x)$, $x \cdot (\sinh^{-1}(x))^2$, $\min(x, \sin(x))$, $(\tan^{-1}(x))^2 - x$,
$\max(x, \sigma(x))$, $\cos(x) - x$, $\max(x, \tanh(x))$, $\mathrm{sinc}(x) + x$

Figure: The 8 best activation functions found [5]
Experiments to ensure generalization to deeper networks
Validation of Performance
Figure: Generalization to deeper architectures of the 8 best activation functions found [5]
(a) CIFAR-10 accuracy (b) CIFAR-100 accuracy

Swish: $f(x) = x \cdot \sigma(\beta x)$
• Nonlinear interpolation between the linear function and ReLU, controlled by β
• Smooth function
• Non-monotonic function
• Unbounded above and bounded below (like ReLU)
Figure: Swish activation function for different β values
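A minimal sketch of Swish in Python (numpy), illustrating the interpolation behaviour: β → 0 gives the scaled linear function x/2, large β approaches ReLU.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    """Swish: f(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)

x = np.linspace(-5.0, 5.0, 5)
print(swish(x, beta=0.0))   # = x / 2 (scaled linear function)
print(swish(x, beta=1.0))   # default Swish
print(swish(x, beta=10.0))  # close to ReLU: max(0, x)
```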
Benchmark of Swish
Benchmarked Swish against ReLU and other baseline activation functions
• Different models
• Different challenging real world datasets
• Test with fixed β = 1 and trainable β
Further Experiments with Swish
CIFAR-10 and CIFAR-100: ResNet-164, Wide ResNet 28-10, and DenseNet 100-12
• Median of 5 runs for comparison
ImageNet: Inception-ResNet-v2, Inception-v4, Inception-v3, MobileNet and Mobile NASNet-A
• Fixed number of steps, 3 learning rates with RMSProp
• Especially good performance on mobile-sized models; slightly underperforms on Inception-v4
English-German translation: 12-layer Base Transformer
• Two different learning rates, 300K steps with Adam optimizer
Further Experiments with Swish
Figure: Overview of performance in the experiments [5]
Performance of Swish

Swish – learnable parameter β
Figure: Distribution of trained β values on Mobile NASNet-A [5]
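Since the back-up slides mention TensorFlow, here is a hedged sketch of a Swish layer with a trainable per-channel β (my own layout, not necessarily the exact parameterization used in [5]):

```python
import tensorflow as tf

class TrainableSwish(tf.keras.layers.Layer):
    """Swish with a trainable per-channel beta, initialized at 1."""

    def build(self, input_shape):
        self.beta = self.add_weight(
            name="beta",
            shape=(input_shape[-1],),
            initializer="ones",
            trainable=True,
        )

    def call(self, inputs):
        # f(x) = x * sigmoid(beta * x); beta is learned jointly with the
        # network weights by backpropagation
        return inputs * tf.sigmoid(self.beta * inputs)
```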
Swish – Challenging Current Beliefs
• Non-monotonic bump for $x < 0$
Figure: Preactivations for β = 1 on ResNet-32 [5]
Figure: Swish function for different β values, showing the non-monotonic bump
Swish – Challenging Current Beliefs
• No gradient-preserving property of the derivative (unlike ReLU):
Figure: Derivative of the Swish function for different β values
Figure: Derivative of ReLU
Conclusion
Main contributions:
• Used a search space as in [1] to find activation functions with an RNN controller [2] that was updated with PPO [3]
• Systematically compared activation functions
• Found a new activation function that consistently outperforms or is on par with ReLU
Critical aspects:
• The search space restricts the results
• The search space is designed after human intuition
• The restriction of training steps and training on small architectures might suppress even better activation functions
Future research:
• Use only two core units, but more unary and binary functions
• Also take non-scalar activation functions into account
References
[1] Bello, I., Zoph, B., Vasudevan, V., & Le, Q. V. (2017). Neural Optimizer Search with Reinforcement Learning.
[2] Zoph, B., & Le, Q. V. (2017). Neural Architecture Search with Reinforcement Learning. ArXiv, abs/1611.01578.
[3] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. ArXiv, abs/1707.06347.
[4] Elfwing, S., Uchibe, E., & Doya, K. (2018). Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. Neural Networks, 107. 10.1016/j.neunet.2017.12.012.
[5] Ramachandran, P., Zoph, B., & Le, Q. V. (2018). Searching for Activation Functions. ArXiv, abs/1710.05941.
Back-up

Things to note
If you want to use Swish:
• Already implemented in TensorFlow as tf.nn.swish(x)
• When using batch norm: set the scale parameter
• Derivative of Swish: $f'(x) = \beta f(x) + \sigma(\beta x)\,(1 - \beta f(x))$
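A quick numerical check of that derivative formula (a sketch of my own, using central differences):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def swish_grad(x, beta=1.0):
    # f'(x) = beta * f(x) + sigmoid(beta * x) * (1 - beta * f(x))
    f = swish(x, beta)
    return beta * f + sigmoid(beta * x) * (1.0 - beta * f)

x = np.linspace(-4.0, 4.0, 9)
h = 1e-6
numeric = (swish(x + h) - swish(x - h)) / (2.0 * h)  # central difference
print(np.allclose(numeric, swish_grad(x), atol=1e-5))  # True
```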
Experiment Results - CIFAR
Figure: Benchmark experiments comparing Swish to baseline functions on CIFAR [5]
(a) CIFAR-10 accuracy (b) CIFAR-100 accuracy
Experiments on ImageNet
Figure: Benchmark experiments comparing Swish to baseline functions on ImageNet [5]
(a) Training curves of Mobile NASNet-A on ImageNet. Best viewed in color.
(b) Mobile NASNet-A on ImageNet, with 3 different runs ordered by top-1 accuracy. The additional 2 GELU experiments were still training at the time of submission.
Experiments on ImageNet
Figure: Benchmark experiments comparing Swish to baseline functions on ImageNet [5]
(a) Inception-ResNet-v2 on ImageNet with 3 different runs. Note that ELU sometimes has instabilities at the start of training, which accounts for the first result.
(b) MobileNet on ImageNet.
Experiments on ImageNet
Figure: Benchmark experiments comparing Swish to baseline functions on ImageNet [5]
(a) Inception-v3 on ImageNet (b) Inception-v4 on ImageNet
Experiments on Machine Translation
Figure: Benchmark experiments comparing Swish to baseline functions on WMT English→German (BLEU score) [5]