Searching for Activation Functions

Jenny Seidenschwarz

Technische Universität München

Seminar Course AutoML

Munich, 4th of July 2019



Activation Functions

• Gradient-preserving property

• Easier to optimize


State-of-the-art default scalar activation function: ReLU, $\max(0, x)$

Figure: ReLU activation function

Figure: Basic structure of a neural network


Research Goal

Find new scalar activation functions …

… using an automated search technique …

… compare them systematically to existing activation functions …

… across multiple different challenging datasets!

Automated Search

Challenge: balance the size and expressivity of the search space

→ Simple binary expression tree [1]

→ Selection of unary and binary functions


Search Space

A core unit applies two unary functions to its inputs and combines the results with one binary function; the inputs are the preactivation $x$ or the outputs of other core units, so core units can be composed (a code sketch follows below).

Unary: $x$, $-x$, $|x|$, $x^2$, $x^3$, $\sqrt{x}$, $\beta x$, $x + \beta$, $\log(|x| + \epsilon)$, $\exp(x)$, $\sin(x)$, $\cos(x)$, $\sinh(x)$, $\cosh(x)$, $\tanh(x)$, $\tan^{-1}(x)$, $\sinh^{-1}(x)$, $\mathrm{sinc}(x)$, $\max(0, x)$, $\min(0, x)$, $\sigma(x)$, $\log(1 + \exp(x))$, $\exp(-x^2)$, $\mathrm{erf}(x)$, $\beta$

Binary: $x_1 + x_2$, $x_1 \cdot x_2$, $x_1 - x_2$, $\frac{x_1}{x_2 + \epsilon}$, $\max(x_1, x_2)$, $\min(x_1, x_2)$, $\sigma(x_1) \cdot x_2$, $\exp(-\beta (x_1 - x_2)^2)$, $\exp(-\beta |x_1 - x_2|)$, $\beta x_1 + (1 - \beta) x_2$

Figure: Core Unit, adapted from [5]
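To make the core-unit structure concrete, here is a minimal Python sketch (illustrative, not the authors' code) that exhaustively enumerates one-core-unit candidates of the form binary(unary1(x), unary2(x)) from small subsets of the unary and binary sets above; the function names and the reduced subsets are my own choices.

```python
import itertools
import numpy as np

# Illustrative subsets of the unary and binary sets listed above (not the full search space).
UNARY = {
    "identity": lambda x: x,
    "square":   lambda x: x ** 2,
    "tanh":     np.tanh,
    "sigmoid":  lambda x: 1.0 / (1.0 + np.exp(-x)),
    "relu":     lambda x: np.maximum(0.0, x),
}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": np.maximum,
    "min": np.minimum,
}

def make_core_unit(u1, u2, b):
    """One core unit: binary(unary1(x), unary2(x)), both branches fed the preactivation x."""
    return lambda x: BINARY[b](UNARY[u1](x), UNARY[u2](x))

# Exhaustively enumerate every one-core-unit candidate from this reduced space.
candidates = {
    f"{b}({u1}(x), {u2}(x))": make_core_unit(u1, u2, b)
    for u1, u2, b in itertools.product(UNARY, UNARY, BINARY)
}

x = np.linspace(-3, 3, 7)
print(len(candidates), "candidates")                   # 5 * 5 * 4 = 100
print(candidates["mul(identity(x), sigmoid(x))"](x))   # x * sigma(x), i.e. Swish with beta = 1
```

With the full unary and binary sets and composed core units, the space grows far too large for exhaustive enumeration, which motivates the learned search below.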


Search approach

• Small search space → exhaustive search
• Large search space → reinforcement learning-based search: an RNN controller [2] with a domain-specific language [1]

Search loop (a code sketch follows below):
1. Train child networks with the found activation functions
2. Get a list of the best-performing functions
3. Update the search algorithm and find the best candidates
4. Empirical evaluation and experiments on the best candidates

Each batch of generated activation functions is trained as a child network:
• ResNet-20
• Image classification on CIFAR-10
• 10k steps
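As a hedged illustration of this loop, the following is a toy Python sketch: the controller and the child-network training are replaced by stubs (uniform random sampling and a made-up accuracy), so only the control flow of the search is real.

```python
import random

# Toy sketch of the search loop above (stand-in components, not the authors' code).

UNARY = ["identity", "square", "tanh", "sigmoid", "relu"]    # reduced, illustrative
BINARY = ["add", "mul", "max", "min"]

def sample_candidate():
    """Stand-in for the RNN controller: sample one core unit uniformly at random."""
    return (random.choice(UNARY), random.choice(UNARY), random.choice(BINARY))

def evaluate_child(candidate):
    """Stub for: build ResNet-20 with this activation, train it for 10k steps on
    CIFAR-10, return the validation accuracy. Here it just returns a fake number."""
    return random.Random(hash(candidate)).uniform(0.80, 0.92)

def search(num_rounds=20, batch_size=10, top_k=5):
    history = []
    for _ in range(num_rounds):
        batch = [sample_candidate() for _ in range(batch_size)]
        rewards = [evaluate_child(c) for c in batch]
        # A real controller (REINFORCE/PPO, next slides) would be updated here from
        # (batch, rewards); the random sampler above has nothing to update.
        history.extend(zip(batch, rewards))
    # List of the best-performing candidates, passed on to the full-scale experiments.
    return sorted(history, key=lambda pair: pair[1], reverse=True)[:top_k]

print(search())
```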


RNN-Controller


Figure: RNN-Controller architecture [5]


RNN-Controller update

Policy gradient methods: $\pi(a_t \mid s_t, \theta_c) \rightarrow \Delta\theta = \alpha \, \nabla \mathcal{L}(\theta_c)$

RNN Controller with REINFORCE:

• Objective function: $\mathcal{L}(\theta_c) = \mathbb{E}_{\pi_\theta, \tau}\!\left[G_t\right]$, where $G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_{k+1}$

• Objective gradient for one child network: $\nabla \mathcal{L}(\theta_c) = \sum_{t=1}^{T} \nabla_{\theta_c} \log \pi(a_t \mid s_t, \theta_c)\,(G - b)$

RNN Controller with PPO:

• Objective function: $L^{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left(\sigma_t G_t,\ \mathrm{clip}(\sigma_t, 1-\varepsilon, 1+\varepsilon)\, G_t\right)\right]$, with $\sigma_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$

→ Clipping keeps updates inside a "trust region"

→ Sample efficient

Reward and baseline (a numeric sketch follows below):

→ $G$ = accuracy of the child network

→ $b$ = exponential moving average of rewards

Figure: clipping function, adapted from [3]
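To make the update concrete, here is a minimal numpy sketch (illustrative, not the authors' implementation) of the clipped PPO objective and the exponential-moving-average baseline defined above; the return is replaced by the baseline-subtracted advantage $G - b$, and all numeric values are made up.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """L_CLIP = E[min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)],
    where ratio = pi_theta(a|s) / pi_theta_old(a|s) and the advantage A is
    the child-network accuracy G minus the baseline b."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantage, clipped * advantage))

class EMABaseline:
    """b = exponential moving average of past rewards (child-network accuracies)."""
    def __init__(self, decay=0.95):
        self.decay, self.value = decay, None

    def update(self, reward):
        self.value = reward if self.value is None else (
            self.decay * self.value + (1.0 - self.decay) * reward)
        return self.value

baseline = EMABaseline()
baseline.update(0.85)            # accuracy of an earlier child network
G = 0.91                         # accuracy of the current child network
advantage = G - baseline.update(G)
ratio = np.array([1.3, 0.7])     # pi_theta / pi_theta_old for two sampled actions
print(ppo_clip_objective(ratio, advantage))
```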

Findings on Activation Functions

1. Activation functions built from 1-2 core units perform best

2. Top activation functions always take the raw preactivation $x$ as an input to the final binary function

3. Periodic functions (sin, cos, etc.) are used by some top-performing activation functions

4. Activation functions that use division perform poorly


Findings on Activation Functions

• $x \cdot \sigma(\beta x)$
• $x \cdot (\sinh^{-1}(x))^2$
• $\min(x, \sin(x))$
• $(\tan^{-1}(x))^2 - x$
• $\max(x, \sigma(x))$
• $\cos(x) - x$
• $\max(x, \tanh(x))$
• $\mathrm{sinc}(x) + x$

These eight functions are implemented in a short sketch below.

Figure: best 8 activation functions [5]
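For experimentation, the eight functions above are written out here as a small numpy sketch (illustrative, not taken from the paper's code release); $\beta$ is treated as a fixed constant.

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
sinc = lambda z: np.sinc(z / np.pi)   # np.sinc(z) = sin(pi*z)/(pi*z); divide by pi for sin(z)/z

# The eight best activation functions found by the search, as numpy callables.
best_found = {
    "x * sigma(beta*x)":  lambda x, beta=1.0: x * sigma(beta * x),   # Swish
    "x * asinh(x)^2":     lambda x: x * np.arcsinh(x) ** 2,
    "min(x, sin(x))":     lambda x: np.minimum(x, np.sin(x)),
    "atan(x)^2 - x":      lambda x: np.arctan(x) ** 2 - x,
    "max(x, sigma(x))":   lambda x: np.maximum(x, sigma(x)),
    "cos(x) - x":         lambda x: np.cos(x) - x,
    "max(x, tanh(x))":    lambda x: np.maximum(x, np.tanh(x)),
    "sinc(x) + x":        lambda x: sinc(x) + x,
}

x = np.linspace(-2, 2, 5)
for name, fn in best_found.items():
    print(f"{name:20s}", np.round(fn(x), 3))
```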

Experiments to ensure generalization to deeper networks


Validation of Performance

Figure: Generalization to deeper architectures of the 8 best activation functions found [5]

(a) CIFAR-10 accuracy (b) CIFAR-100 accuracy

Swish: $f(x) = x \cdot \sigma(\beta x)$ (sketched in code below)

• Nonlinear interpolation between ReLU and the linear function
• Smooth function
• Non-monotonic function
• Unbounded above and bounded below (like ReLU)

Figure: Swish activation function for different $\beta$ values
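A short numpy sketch of the Swish definition above, evaluated for a few $\beta$ values (the chosen values are illustrative):

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish: f(x) = x * sigmoid(beta * x).
    beta = 0      -> the scaled linear function x / 2
    beta -> +inf  -> approaches ReLU
    """
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

x = np.linspace(-5.0, 5.0, 11)
for beta in (0.1, 1.0, 10.0):
    print(f"beta={beta:>4}:", np.round(swish(x, beta), 3))
```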

Benchmark of Swish

Benchmarked Swish against ReLU and other baseline activation functions

• Different models

• Different challenging real world datasets

• Test with fixed β = 1 and trainable β (a trainable-β layer sketch follows below)
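As a sketch of the trainable-β variant (a minimal Keras layer of my own, not the paper's implementation; the class name TrainableSwish is made up), β is a single trainable scalar initialized to 1:

```python
import tensorflow as tf

class TrainableSwish(tf.keras.layers.Layer):
    """Swish f(x) = x * sigmoid(beta * x) with beta as a trainable scalar.
    With beta fixed at 1 this reduces to the fixed-beta variant (tf.nn.swish)."""

    def build(self, input_shape):
        # One scalar beta shared across the layer, initialized to 1.0.
        self.beta = self.add_weight(name="beta", shape=(),
                                    initializer="ones", trainable=True)

    def call(self, inputs):
        return inputs * tf.sigmoid(self.beta * inputs)

# Usage sketch: drop the layer in wherever an activation would normally go.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(64),
    TrainableSwish(),
    tf.keras.layers.Dense(10),
])
```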


Further Experiments with Swish

CIFAR-10 and CIFAR-100: ResNet-164, Wide ResNet 28-10, and DenseNet 100-12

• Median of 5 runs for comparison

ImageNet: Inception-ResNet-v2, Inception-v4, Inception-v3, MobileNet and Mobile NASNet-A

• Fixed number of steps, 3 learning rates with RMSProp

• Especially good performance on mobile-sized models; slightly underperforms on Inception-v4

English-German translation: 12-layer Base Transformer

• Two different learning rates, 300K steps with Adam optimizer


Further Experiments with Swish

Figure: Overview of performance in experiments [5]

Performance of Swish

Learnable 𝛽:


Swish – learnable parameter β

Figure: distribution of trained 𝛽 on Mobile NASNet-A [5]

Non-monotonic bump for $x < 0$


Swish – Challenging Current Beliefs

Figure: Preactivations for $\beta = 1$ on ResNet-32 [5]

Figure: Swish function for different $\beta$ values, with the non-monotonic bump

The derivative does not have ReLU's gradient-preserving characteristic:


Swish – Challenging Current Beliefs

Figure: Derivative of the Swish function for different $\beta$ values

Figure: Derivative of ReLU

Conclusion

Main contributions:

• Used a search space as in [1] to find activation functions with an RNN controller [2] that was updated with PPO [3]

• Systematically compared activation functions

• Found a new activation function that consistently outperforms or is on par with ReLU

Critical aspects:

• The search space restricts the results

• The search space is designed according to human intuition

• The restriction of training steps and training on small architectures might suppress even better activation functions

Future research:

• Use only two core units, but more unary and binary functions

• Also take non-scalar activation functions into account


References

[1] Bello, I., Zoph, B., Vasudevan, V., & Le, Q. V. (2017). Neural Optimizer Search with Reinforcement Learning.

[2] Zoph, B., & Le, Q. V. (2017). Neural Architecture Search with Reinforcement Learning. arXiv, abs/1611.01578.

[3] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv, abs/1707.06347.

[4] Elfwing, S., Uchibe, E., & Doya, K. (2018). Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. Neural Networks, 107. doi:10.1016/j.neunet.2017.12.012.

[5] Ramachandran, P., Zoph, B., & Le, Q. V. (2018). Searching for Activation Functions. arXiv, abs/1710.05941.

Back-up

Things to note

If you want to use Swish:

• Already implemented in tensorflow as tf.nn.swish(x)

• When using batch norm: set the scale parameter

• Derivative of Swish: $f'(x) = \beta f(x) + \sigma(\beta x)\,(1 - \beta f(x))$
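A quick numeric check of the derivative formula above (a numpy sketch, independent of the TensorFlow implementation): the closed form is compared against a central finite-difference approximation for an arbitrary β.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
swish = lambda x, beta=1.0: x * sigmoid(beta * x)

def swish_grad(x, beta=1.0):
    """Closed form from above: f'(x) = beta * f(x) + sigmoid(beta * x) * (1 - beta * f(x))."""
    f = swish(x, beta)
    return beta * f + sigmoid(beta * x) * (1.0 - beta * f)

# Central finite differences as an independent check of the closed-form derivative.
x = np.linspace(-4.0, 4.0, 9)
h = 1e-5
numeric = (swish(x + h, beta=1.5) - swish(x - h, beta=1.5)) / (2 * h)
print(np.allclose(swish_grad(x, beta=1.5), numeric, atol=1e-6))  # expected: True
```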



Experiment Results - CIFAR

Figure: Benchmark experiments comparing the Swish function to baseline functions on CIFAR [5]

(a) CIFAR-10 accuracy (b) CIFAR-100 accuracy


Experiments on ImageNet

Figure: Benchmark experiments comparing the Swish function to baseline functions on ImageNet [5]

(a) Training curves of Mobile NASNet-A on ImageNet. Best viewed in color.

(b) Mobile NASNet-A on ImageNet, with 3 different runs ordered by top-1 accuracy. The additional 2 GELU experiments were still training at the time of submission.


Experiments on ImageNet

Figure: Benchmark experiments comparing the Swish function to baseline functions on ImageNet [5]

(a) Inception-ResNet-v2 on ImageNet with 3 different runs. Note that ELU sometimes has instabilities at the start of training, which accounts for the first result.

(b) MobileNet on ImageNet.


Experiments on ImageNet

Figure: Benchmark experiments comparing the Swish function to baseline functions on ImageNet [5]

(a) Inception-v3 on ImageNet (b) Inception-v4 on ImageNet


Experiments on Machine Translation

Figure: Benchmark experiments comparing the Swish function to baseline functions on WMT English→German translation (BLEU score) [5]