This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

FTML optimizer implementation #9262

Merged
merged 3 commits into apache:master from ZiyueHuang:ftml on Jan 3, 2018

Conversation

@ZiyueHuang (Member) commented on Dec 30, 2017

Description

FTML optimizer implementation, requested in #9182

The default values of beta1, beta2, and epsilon are the same as in keras-team/keras-contrib#110.

How should I add a test to verify the correctness of the implementation? @sxjscience

Here is the test for FTML in keras-contrib. Is that OK?

I have done only one experiment: FTML (val acc 0.756210 at the 10th epoch) converges faster than momentum SGD (val acc 0.684095 at the 10th epoch) on CIFAR-10, both using lr = 0.001, wd = 0, and resnet18_v1.
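
For context, here is a minimal NumPy sketch of one FTML step following the recursions discussed in this thread (v_t, d_t, sigma_t, z_t); the function name, state layout, and default hyperparameters are illustrative and are not taken from the PR's code.

import numpy as np

def ftml_step(weight, grad, state, t, lr=0.0025, beta1=0.6, beta2=0.999, eps=1e-8):
    """One FTML update. `state` holds (prev_d, prev_v, prev_z); `t` starts at 1."""
    prev_d, prev_v, prev_z = state
    # Second-moment estimate, as in Adam.
    v_t = beta2 * prev_v + (1.0 - beta2) * grad * grad
    # Per-coordinate denominator with the bias corrections folded in.
    d_t = (1.0 - beta1 ** t) / lr * (np.sqrt(v_t / (1.0 - beta2 ** t)) + eps)
    sigma_t = d_t - beta1 * prev_d
    # Running linear term of the surrogate objective.
    z_t = beta1 * prev_z + (1.0 - beta1) * grad - sigma_t * weight
    new_weight = -z_t / d_t
    return new_weight, (d_t, v_t, z_t)

The state would start as three zero arrays of the same shape as the weight.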

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

sigma_t = d_t - self.beta1 * prev_d
z_t = self.beta1 * prev_z + (1 - self.beta1) * grad - sigma_t * weight
# update weight
weight[:] = - z_t / d_t - lr * wd * weight
Member

I think we should merge the wd term into the gradient. @szhengac could you help check this?
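
As a rough illustration of this suggestion (hypothetical helper, not the PR's code): folding wd into the gradient means the FTML recursions for v_t, d_t, sigma_t, and z_t all see the regularized gradient, instead of applying -lr * wd * weight separately in the final update.

import numpy as np

def l2_merged_grad(grad, weight, wd):
    # Treat weight decay as the gradient of the L2 penalty 0.5 * wd * ||weight||^2
    # and add it to the raw gradient before any FTML statistics are computed.
    return grad + wd * weight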

Member

The rest of the formulas look good.

@sxjscience (Member)

Testing the optimizer is a difficult problem and we haven't found a good solution. Currently I think this kind of test, which optimizes a simple problem and checks the error, should be enough: https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_optimizer.py#L648-L672. Also, would you add the C++ version? If it is added, we can test it against the Python version.
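
A hedged sketch of the kind of test described above, assuming a hypothetical `step_fn` update rule (this is not the actual code in test_optimizer.py): optimize a trivial quadratic and assert that the final error is small.

import numpy as np

def check_optimizer_converges(step_fn, dim=10, iters=300, tol=1e-3, seed=0):
    """Minimize f(w) = 0.5 * ||w - target||^2 with the update rule under test.

    `step_fn(weight, grad, state, t)` returns (new_weight, new_state); `state`
    starts as None and is threaded through the loop.
    """
    rng = np.random.RandomState(seed)
    target = rng.randn(dim)
    w = np.zeros(dim)
    state = None
    for t in range(1, iters + 1):
        grad = w - target                      # gradient of the quadratic objective
        w, state = step_fn(w, grad, state, t)
    assert np.mean((w - target) ** 2) < tol

Used with the ftml_step sketch above, the wrapper passed as step_fn would initialize the state to three zero arrays on the first call.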

@szhengac (Contributor)

For weight decay, it may be correct, but for an L2 regularizer, we can either incorporate the gradient w.r.t. the L2 regularizer into the complete gradient or use the following formula:
[screenshot of the update rule with the $\ell_2$ term]
where $\lambda_2$ is the regularization parameter. If an elastic net is considered, the following one can be used:
[screenshot of the update rule with both $\ell_1$ and $\ell_2$ terms]
where $\lambda_1$ is the regularization parameter for the $\ell_1$ part.

Also, I think it is more efficient to update the powers of beta1 and beta2 iteratively.
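
A small sketch of the suggested micro-optimization (illustrative class, not MXNet code): maintain beta1**t and beta2**t as running products, which is valid only when the step counter advances by exactly one per call.

class BetaPowerTracker:
    """Incrementally track beta1**t and beta2**t instead of calling pow(beta, t)."""

    def __init__(self, beta1, beta2):
        self.beta1, self.beta2 = beta1, beta2
        self.beta1_pow, self.beta2_pow = 1.0, 1.0

    def step(self):
        # Assumes exactly one call per optimizer step, i.e. t -> t + 1.
        self.beta1_pow *= self.beta1
        self.beta2_pow *= self.beta2
        return self.beta1_pow, self.beta2_pow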

@ZiyueHuang changed the title from "[WIP] FTML optimizer implementation" to "FTML optimizer implementation" on Dec 31, 2017
@ZiyueHuang (Member Author)

Thanks for your comments.

  • The C++ version is added.
  • I think weight decay is used for L2 regularization in MXNet. The L2 regularizer is now incorporated into the gradients.
  • Right, it is more efficient to update the powers of beta1 and beta2 iteratively. But t = self._index_update_count[index], which is not always strictly increasing from 0 (see the sketch below).
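
To illustrate the constraint in the last bullet (a hypothetical sketch, not MXNet internals): because the update count is tracked per parameter index and different indices are not updated in lockstep, a running power would have to be cached per index rather than kept as a single scalar.

from collections import defaultdict

class PerIndexBetaPow:
    """Cache beta**t separately for each parameter index."""

    def __init__(self, beta):
        self.beta = beta
        self._pow = defaultdict(lambda: 1.0)   # index -> beta**t accumulated so far

    def step(self, index):
        # Advance only the counter of the index that is actually being updated.
        self._pow[index] *= self.beta
        return self._pow[index]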

@@ -529,6 +529,55 @@ def update_multi_precision(self, index, weight, grad, state):
        self._update_impl(index, weight, grad, state,
                          multi_precision=use_multi_precision)


@register
class FTML(Optimizer):
Contributor

I thought we already had this.

Member

We have FTRL (Follow the regularized leader). This PR adds FTML (Follow the moving leader).

grad = grad * self.rescale_grad
if self.clip_gradient is not None:
    grad = mx.nd.clip(grad, -self.clip_gradient, self.clip_gradient)
grad += wd * weight
Member

We should clip after adding the gradient of L2. This is consistent with other optimizers.

Member Author

It seems that the L2 term is outside the clip in other optimizers, such as SGD in https://github.com/apache/incubator-mxnet/blob/master/src/operator/optimizer_op-inl.h#L76-L78.

Member

In Adam, it's clipped outside. So our current optimizers have this kind of inconsistent behavior. I think clipping the gradient without it is wrong.

Member

I mean without the WD part.

Member Author

Got it. Thanks. The WD part is now added into the gradients.
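
For reference, a generic NumPy sketch of the two orderings discussed here (not the PR's code): the results differ whenever wd * weight would push the combined gradient past the clipping threshold.

import numpy as np

def clip_then_add_wd(grad, weight, wd, clip):
    # The L2 gradient is added after clipping, so it is never clipped itself.
    return np.clip(grad, -clip, clip) + wd * weight

def add_wd_then_clip(grad, weight, wd, clip):
    # The L2 gradient is folded in first and the combined gradient is clipped.
    return np.clip(grad + wd * weight, -clip, clip)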

using namespace mshadow_op;
const DType grad_i = clip_grad >= 0.0f
    ? (clip::Map(rescale_grad * grad[i], clip_grad) + wd * weight[i])
    : (rescale_grad * grad[i] + wd * weight[i]);
Member

We should clip after adding the gradient of L2. This is consistent with other optimizers.

@piiswrong merged commit 12cb0d2 into apache:master on Jan 3, 2018
yuxiangw pushed a commit to yuxiangw/incubator-mxnet that referenced this pull request Jan 25, 2018
* ftml implemention

* c++ version and test

* merge WD into gradients
@ZiyueHuang ZiyueHuang deleted the ftml branch January 30, 2018 11:31
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* ftml implemention

* c++ version and test

* merge WD into gradients
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
* ftml implemention

* c++ version and test

* merge WD into gradients