This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

flaky test: test_spatial_transformer_with_type #11839

Closed
larroy opened this issue Jul 20, 2018 · 15 comments

Comments

@larroy
Contributor

larroy commented Jul 20, 2018

Flaky test introduced in master:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1246/pipeline/

#7645

@szha
Member

szha commented Jul 20, 2018

@ddavydenko

@ddavydenko
Contributor

I just finished re-running this test on a P2.8xlarge with 1,000,000 iterations and have not seen a single failure (MXNET_TEST_COUNT=1000000 nosetests --logging-level=DEBUG tests/python/gpu/test_operator_gpu.py:test_spatial_transformer_with_type).

I wonder whether something in the CI environment is causing the flakiness, as I have yet to see this fail despite running it 1M times.

@szha
Member

szha commented Jul 21, 2018

Possibly. Here's a doc that @marcoabreu put together: https://cwiki.apache.org/confluence/display/MXNET/Reproducing+test+results

@marcoabreu
Contributor

======================================================================
FAIL: test_operator_gpu.test_spatial_transformer_with_type
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 175, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 708, in test_spatial_transformer_with_type
    check_consistency(sym, ctx_list, grad_req="add")
  File "/work/mxnet/python/mxnet/test_utils.py", line 1354, in check_consistency
    raise e
  File "/work/mxnet/python/mxnet/test_utils.py", line 1349, in check_consistency
    equal_nan=equal_nan)
  File "/work/mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 229691.785771 exceeds tolerance rtol=0.000010, atol=0.000010.  Location of maximum error:(0, 2, 1, 0), a=-5.896435, b=-1.091782
 a: array([[[[ 0.86260644, -0.58269032, -0.9129366 , ...,  0.        ,
           0.        ,  0.        ],
         [-0.31708283, -0.38723805, -1.35227876, ...,  0.        ,...
 b: array([[[[ 0.69660542,  0.01495369, -0.62585958, ...,  0.        ,
           0.        ,  0.        ],
         [-0.10513522,  0.03817245, -1.14077367, ...,  0.        ,...
-------------------- >> begin captured stdout << ---------------------
Train Err: ctx 1 vs ctx 0 at data
--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=460980419 to reproduce.
--------------------- >> end captured logging << ---------------------

@larroy
Contributor Author

larroy commented Jul 21, 2018

@ddavydenko did you try running the test in Docker, using the ci/build.py tool and its associated container? That should provide a very similar environment, minus hardware differences, but you can always start the same instance type. The base OS is quite vanilla: just Docker and nvidia-docker installed.

@anirudh2290
Member

@ddavydenko it's failing in the NOCUDNN stage. You will have to build with USE_CUDA=ON and USE_CUDNN=OFF.

@ddavydenko
Contributor

Thanks, @anirudh2290. I built with USE_CUDNN=0 and indeed this test fails consistently with such a build. I checked the CI pipeline, and this test seems to be failing under "Python3: MKLDNN-GPU-NOCUDNN". I don't know yet what CUDNN provides that makes this test pass, or why it fails when there is no CUDNN support in MXNet. In the meantime I have a question: does it make sense to remove this test (and maybe others failing for the same reason) from the CI step that uses an MXNet build without CUDNN?
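
If we do go that route for now, the simplest form would be an unconditional skip with a pointer back to this issue. A minimal sketch of what that could look like (the actual disabling change may differ; since detecting a cuDNN-less build from Python is not straightforward, this version skips the test in every stage):

import unittest

# Sketch only: skip the flaky test and leave a pointer to this issue.
# The existing test body and its decorators would stay unchanged below.
@unittest.skip("Flaky test, tracked in https://github.com/apache/incubator-mxnet/issues/11839")
def test_spatial_transformer_with_type():
    pass  # placeholder for the existing body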

@larroy
Contributor Author

larroy commented Jul 24, 2018

The outputs are very different; do we know the cause of this discrepancy? It doesn't seem to be just a tolerance issue.
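
For reference, the number that assert_almost_equal prints is a violation ratio, not an absolute difference; my reading of test_utils.py (an assumption, so treat the formula as such) is that it reports |a - b| / (atol + rtol * |b|) at the worst element. Plugging in the values from the first log reproduces the reported figure:

import numpy as np

# Worst-offending element copied from the first failure log above.
a, b = -5.896435, -1.091782
rtol, atol = 1e-5, 1e-5

# Assumed metric: absolute difference scaled by the (tiny) tolerance.
violation = np.abs(a - b) / (atol + rtol * np.abs(b))
print(violation)  # ~229691.8, matching "Error 229691.785771"

So the two contexts genuinely disagree by about 4.8 at that element; the huge "Error" value is that gap divided by a very tight tolerance, which supports the view that this is a real discrepancy rather than tolerance noise.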


@anirudhacharya
Member

Jenkins log file: http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/PR-11856/runs/5/nodes/859/log/?start=0

======================================================================
FAIL: test_operator_gpu.test_spatial_transformer_with_type
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/test_operator_gpu.py", line 707, in test_spatial_transformer_with_type
    check_consistency(sym, ctx_list)
  File "/work/mxnet/python/mxnet/test_utils.py", line 1354, in check_consistency
    raise e
  File "/work/mxnet/python/mxnet/test_utils.py", line 1349, in check_consistency
    equal_nan=equal_nan)
  File "/work/mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 14350.833283 exceeds tolerance rtol=0.000010, atol=0.000010.  Location of maximum error:(0, 3, 2, 6), a=3.242524, b=2.710094
 a: array([[[[-23.61268397, -14.63544932,   0.74020521, ...,   8.87566184,
          -10.74963933, -11.83532109],
         [ 20.2245003 ,   3.22738566,  21.24695322, ...,   4.78354851,...
 b: array([[[[-21.55476512, -13.13690641,   0.56911162, ...,   8.07749741,
           -9.78960317, -10.73026276],
         [ 18.3851834 ,   3.02351875,  19.34346791, ...,   4.4130619 ,...
-------------------- >> begin captured stdout << ---------------------
Train Err: ctx 1 vs ctx 0 at data

@haojin2
Contributor

haojin2 commented Jul 30, 2018

Another failure on my dev machine:

======================================================================
FAIL: test_operator_gpu.test_spatial_transformer_with_type
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/ubuntu/zhazha/tests/python/gpu/../unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/home/ubuntu/zhazha/tests/python/gpu/test_operator_gpu.py", line 704, in test_spatial_transformer_with_type
    check_consistency(sym, ctx_list)
  File "/home/ubuntu/zhazha/python/mxnet/test_utils.py", line 1354, in check_consistency
    raise e
  File "/home/ubuntu/zhazha/python/mxnet/test_utils.py", line 1349, in check_consistency
    equal_nan=equal_nan)
  File "/home/ubuntu/zhazha/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 56995.695913 exceeds tolerance rtol=0.000010, atol=0.000010.  Location of maximum error:(0, 3, 7, 9), a=0.688489, b=0.075500
 a: array([[[[ 13.85105315,   9.0159893 ,  -3.55307889, ...,  13.3817302 ,
            1.51110181, -27.07006942],
         [ 15.46217654,  29.52412869,   1.12123177, ...,   1.15918658,...
 b: array([[[[ 13.61693537,   8.89983682,  -3.54552055, ...,  13.13456345,
            1.48394349, -26.94436452],
         [ 14.9167828 ,  29.43000904,   1.25100417, ...,   1.76727081,...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=914275233 to reproduce.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------

This is produced by a build with MKLDNN and without CUDNN.

larroy added a commit to larroy/mxnet that referenced this issue Jul 30, 2018
aaronmarkham added a commit to aaronmarkham/incubator-mxnet that referenced this issue Aug 7, 2018
@Ishitori
Contributor

Ishitori commented Aug 8, 2018

I was able to reproduce the failure by compiling MXNet from source with:

make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=0

and then running the following from the project root folder:

export MXNET_TEST_SEED=914275233
nosetests -v tests/python/gpu/test_operator_gpu.py:test_spatial_transformer_with_type

If I use a different seed value, the test passes.
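
To see how seed-dependent this is, one could sweep a handful of seeds and record which ones trip the failure. A rough sketch (assuming the no-cuDNN build above and nosetests on the PATH; apart from the two seeds taken from the CI logs, the values are arbitrary):

import os
import subprocess

# Re-run the single test under several fixed seeds and report pass/fail.
for seed in (914275233, 460980419, 1, 2, 3):
    env = dict(os.environ, MXNET_TEST_SEED=str(seed))
    ret = subprocess.run(
        ["nosetests", "-v",
         "tests/python/gpu/test_operator_gpu.py:test_spatial_transformer_with_type"],
        env=env)
    print(seed, "FAILED" if ret.returncode else "passed")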

mbrookhart pushed a commit to NervanaSystems/ngraph-mxnet that referenced this issue Aug 17, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this issue Aug 29, 2018
@haojin2
Contributor

haojin2 commented Sep 14, 2018

The corresponding fix has been merged; please close the issue.

@haojin2
Contributor

haojin2 commented Sep 14, 2018

@anirudh2290 @eric-haibin-lin @nswamy @sandeep-krishnamurthy The corresponding fix has been merged; please close the issue.
