Part 5: Training and Evaluating
The hard part is done. Now, all we have to do is kick off training and use TensorBoard to monitor the progress.
Starting Training and Evaluation
To start training and evaluating, simply use the following code.
################################
### Inside code/train.py ###
################################
# Training/Evaluation Loop
for e in range(params.train_epochs):
print('Epoch: ' + str(e))
estimator.train(input_fn=lambda: dataset_input_fn('train'))
estimator.evaluate(input_fn=lambda: dataset_input_fn('valid'))
Both train and evaluate run until input_fn raises an OutOfRangeError, which happens at the end of the dataset since we are using the Dataset and Iterator APIs.
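This control flow is the familiar "consume until exhausted" iterator pattern: OutOfRangeError plays the role for TensorFlow iterators that StopIteration plays for plain Python iterators. A TF-free sketch of what a single train or evaluate pass does (the split names and batch values here are made up for illustration):

```python
def dataset_input_fn(split):
    """Stand-in for the tf.data pipeline: returns an iterator of batches."""
    batches = {'train': ['batch0', 'batch1', 'batch2'],
               'valid': ['batch0', 'batch1']}
    return iter(batches[split])

def one_pass(batches):
    """Pull batches until the iterator is exhausted, the way
    Estimator.train/evaluate stop when input_fn raises OutOfRangeError."""
    processed = []
    while True:
        try:
            processed.append(next(batches))
        except StopIteration:  # the analogue of OutOfRangeError
            break
    return processed

# Each call makes exactly one pass over its split:
print(len(one_pass(dataset_input_fn('train'))))  # 3
print(len(one_pass(dataset_input_fn('valid'))))  # 2
```

Because each call to dataset_input_fn builds a fresh iterator, every loop iteration in train.py is exactly one epoch of training followed by one full pass over the validation set.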
Using TensorBoard to Monitor Progress
The best way to check in on the training of your model is with TensorBoard – another tool built by Google for visualization of TensorFlow projects.
It used to be a bit of a pain to get TensorBoard working, but with Estimators it's quite easy. All the metrics we defined in the Part 4 model_fn are automatically written to TensorBoard. Now all we have to do is run the TensorBoard program.
Tip: Right before the training loop I have the following line to make starting TensorBoard one step easier. Copy/paste what it prints out into a terminal to launch TensorBoard.
################################
### Inside code/train.py ###
################################
print('tensorboard --logdir=' + str(model_dir))
To see TensorBoard, open a web browser and go to localhost:6006 (TensorBoard's default port).
Tip: If you have multiple models training at the same time, you can run multiple instances of TensorBoard; just pass each one a separate port number with --port=XXXX
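Mirroring the copy/paste trick above, a small sketch that prints one launch command per model (the log directories and port numbers here are hypothetical examples):

```shell
# Print a TensorBoard launch command per model, each on its own port.
# Copy/paste each line into its own terminal.
echo "tensorboard --logdir=results/model_a --port=6006"
echo "tensorboard --logdir=results/model_b --port=6007"
```

Each instance is then reachable at localhost on the port you assigned it.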
That’s it! Your model is now training, evaluating once at the end of every epoch, and reporting results intermittently to your live updating TensorBoard.
Training Remotely, Monitoring Locally
Oftentimes you will be running your training on a remote server, say your department's or company's cluster. If you start a training job on a remote instance and still want to use TensorBoard to track training progress live, I recommend mounting the remote output directory locally with SSHFS. SSHFS gives you a local copy of a remote directory and continually updates the files in that directory as they change. SSHFS should be installed already if you are using Linux. If you are using a Mac, you probably need to install SSHFS/FUSE here.
I have a directory that’s only for mounting with SSHFS.
mkdir ~/mnt
Let’s say you are training a model on a remote host remote.school.edu, your username is user, and the output_dir is ~/Documents/project/results/2017-12-04_14-19-29/. You can mount the output directory to your local mount directory with:
sshfs user@remote.school.edu:~/Documents/project/results/2017-12-04_14-19-29/ ~/mnt/ -oauto_cache,reconnect,defer_permissions,noappledouble,negative_vncache,volname=MySSHFSMount
I needed the extra flags for connecting between my Mac and the particular server I was working with, but your setup may be slightly different.
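When you are done monitoring, remember to unmount. A hedged sketch of the cleanup step, assuming the ~/mnt mount point from above (fusermount is the Linux FUSE unmount tool; macOS uses plain umount):

```shell
# Unmount the SSHFS mount point once training is done.
# fusermount -u works on Linux; plain umount works on macOS.
if mount | grep -q "$HOME/mnt"; then
    fusermount -u "$HOME/mnt" 2>/dev/null || umount "$HOME/mnt"
else
    echo "~/mnt is not mounted"
fi
```

After unmounting, ~/mnt is an empty local directory again, ready for the next training run.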
Running Example: the complete (up to this point) train.py file can be found here.
Continue Reading
In Part 6 we will export and load the model back in with a few different methods.