Update README about sidecar tensorboard and dependency timeout mechanism #622

Merged 1 commit on Dec 1, 2021
5 changes: 5 additions & 0 deletions README.md
@@ -163,6 +163,7 @@ The command line arguments are as follows:
| shell_env | no | --shell_env LD_LIBRARY_PATH=/usr/local/lib64/ | Specifies key-value pairs for environment variables which will be set in your python worker/ps processes. |
| conf_file | no | --conf_file tony-local.xml | Location of a TonY configuration file, also support remote path, like `--conf_file hdfs://nameservice01/user/tony/tony-remote.xml` |
| conf | no | --conf tony.application.security.enabled=false | Override configurations from your configuration file via command line |
| sidecar_tensorboard_log_dir | no | --sidecar_tensorboard_log_dir /hdfs/path/tensorboard_log_dir | HDFS path to the TensorBoard log directory. Setting it enables a sidecar TensorBoard managed by TonY. See the tony-examples/mnist-tensorflow module for a detailed example. |

## TonY configurations

@@ -211,3 +212,7 @@ For more information about TonY, check out the following:
2. How do I configure arbitrary TensorFlow job types?

Please see the [wiki](https://github.com/linkedin/TonY/wiki/TonY-Configurations#task-configuration) on TensorFlow task configuration for details.

3. Some of my TensorFlow workers hang after the chief finishes, or the evaluator hangs after the chief and workers have finished.

Please see [PR #621](https://github.com/tony-framework/TonY/pull/621) on TensorFlow configuration for the fix.
17 changes: 15 additions & 2 deletions tony-examples/mnist-tensorflow/README.md
@@ -113,8 +113,12 @@ We have tested this example with 3 Workers (4 GB RAM + 1 vCPU) using MultiWorke
### Tensorboard Usage
TonY supports two modes (custom and sidecar) for starting TensorBoard.
1. [Custom] Users start TensorBoard in their own code; see the mnist_distributed.py example for details.
2. [Sidecar] TonY starts and manages an extra task executor that runs the built-in TensorBoard.
A failure of the sidecar TensorBoard will not affect the training job.
The only thing the user needs to do is specify the log directory in the TonY XML file or on the TonY CLI, as follows.
Note: a value set on the TonY CLI takes priority over the value in the XML file.

tony.xml
```
<configuration>
....
@@ -123,4 +127,13 @@
<value>/tmp/xxxxxxx</value>
</property>
</configuration>
```
tony cli
```
$ java -cp "`hadoop classpath --glob`:MyJob/*:MyJob/" \
               com.linkedin.tony.cli.ClusterSubmitter \
               -executes models/mnist_distributed.py \
               -task_params '--input_dir /path/to/hdfs/input --output_dir /path/to/hdfs/output' \
               -src_dir src \
               -python_binary_path /home/user_name/python_virtual_env/bin/python \
               -sidecar_tensorboard_log_dir /path/to/hdfs/tensorboard_log_dir
```
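
The CLI-over-XML precedence rule can be illustrated with a minimal sketch. The `merge_confs` helper and the `example.log.dir` key below are purely illustrative, not TonY code or a real TonY configuration key:

```python
# Illustrative sketch of TonY's stated precedence rule: a value passed on
# the CLI overrides the value loaded from tony.xml. Not actual TonY code.

def merge_confs(xml_conf, cli_conf):
    """Return the effective configuration: CLI entries win over XML entries."""
    effective = dict(xml_conf)   # start from the XML file's values
    effective.update(cli_conf)   # CLI values take priority
    return effective

xml_conf = {"example.log.dir": "/tmp/from_xml"}          # hypothetical key
cli_conf = {"example.log.dir": "/hdfs/from_cli"}

print(merge_confs(xml_conf, cli_conf)["example.log.dir"])  # → /hdfs/from_cli
```

Keys present only in the XML file are kept, so the CLI only needs to carry the overrides.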