## Using TensorFlow’s Supervisor with TensorBoard summary groups

One of TensorFlow’s more awesome parts is definitely TensorBoard, i.e. the capability of collecting and visualizing data from the TensorFlow graph as the network is running while also being able to display and browse the graph itself. Coming from Caffe, where I eventually wrote my own tooling just to visualize the training loss from logs of the raw console output and had to copy-paste the graph’s prototxt to some online service in order to visualize it, this is a massive step in the best possible direction. To get some of Caffe’s checkpointing features back, you can use TensorFlow’s Supervisor. This blog post is about using both TensorBoard and the Supervisor for fun and profit.
TL;DR: Scroll to the end for an example of using grouped summaries with the Supervisor.

Apart from just storing scalar data for TensorBoard, the histogram feature turned out to be especially valuable to me for observing the performance of a probability inference step.

Here, the left half shows the distribution of ground truth probability values in the training and validation sets over time, whereas the right half shows the actual inferred probabilities over time. It’s not hard to see that the network is getting better, but there is more to it:

• The histogram of the ground truth values (here on the left) allows you to verify that your training data is indeed correct. If the data is not balanced, you might learn a network that is biased towards one outcome.
• If the network does indeed obtain some biased view of the data, you’ll clearly see patterns emerging in the inferred histogram that do not match the expected ground truth distribution. In this example, the right histograms approach the left histograms, so it appears to be working fine.
• However, if you only measure network performance in accuracy, as the ratio of correct guesses over all examples, you might be getting the wrong impression: If the input distribution is skewed towards 95% positive and 5% negative examples, a network guessing “positive” 100% of the time produces only 5% error (see the short sketch after this list). If your total accuracy is an aggregate over multiple different values, you will definitely miss this, especially since randomized mini-batches only further obscure the issue.
• Worse, if the learned coefficients run into saturation, learning will stop for them. Again, this might not be obvious if the total loss and accuracy are actually aggregates of different values.
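
To make the accuracy pitfall above concrete, here is a tiny sketch (the numbers are made up for illustration): on a 95%/5% split, a classifier that always answers “positive” still scores 95% accuracy.

import numpy as np

# hypothetical, heavily skewed label set: 95% positive, 5% negative
labels = np.array([1] * 95 + [0] * 5)

# a degenerate "classifier" that always predicts the majority class
predictions = np.ones_like(labels)

accuracy = np.mean(predictions == labels)
print(accuracy)  # 0.95 -- high accuracy, yet the negative class is never predicted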

### Influence of the learning rate

Let’s take the example of a variable learning rate. If at some point the training slows down, it’s not immediately clear if this is due to the fact that

• a parameter space optimum has been found and training is done,
• the algorithm found a plateau in parameter space and the loss would continue to fall after a few more hundred or thousand iterations, or
• the training is actually diverging because the learning rate is not small enough to enter a local optimum in the first place.

Now optimizers like Adam are tailored to overcome the problems of fixed learning rates, but they too can only go so far: If the learning rate is too big to begin with, it’s still too big after fine-tuning. Or worse, after a couple of iterations the adjusted weights could end up in saturation and no further change would be able to do anything about it.

To rule out at least one part, you can make the learning rate a changeable parameter of the network, e.g. a function of the training iteration. I had some success with Caffe’s “multi-step” approach of changing the learning rate at fixed iteration numbers (say, reducing it by one decade at iterations 1000, 5000 and 16000), where I determined these values over different training runs of the network.
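
Such a multi-step schedule is easy to express as a plain Python helper that maps the current iteration to a learning rate; a minimal sketch, with the boundary iterations taken from the example above and the rates assumed for illustration:

# illustrative multi-step schedule: drop the rate by one decade
# every time a hand-picked boundary iteration is passed
def multi_step_learning_rate(iteration,
                             boundaries=(1000, 5000, 16000),
                             base_rate=0.1):
    rate = base_rate
    for boundary in boundaries:
        if iteration >= boundary:
            rate *= 0.1  # one decade per boundary
    return rate

# e.g. multi_step_learning_rate(1200) -> roughly 0.01 (one drop applied)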

So instead of baking the learning rate into the graph during construction, you would define a placeholder for it and feed the learning rate of the current epoch/iteration into the optimization operation each time you call it, like so:

with tf.Graph().as_default() as graph:
    p_lr = tf.placeholder(tf.float32, (), name='learning_rate')
    t_loss = tf.reduce_mean(...)
    # the optimizer choice is illustrative; the point is feeding p_lr
    op_minimize = tf.train.GradientDescentOptimizer(p_lr) \
                          .minimize(t_loss)

with tf.Session(graph=graph) as sess:
    init = tf.group(tf.global_variables_initializer(),
                    tf.local_variables_initializer())
    sess.run(init)

    for _ in range(0, epochs):
        learning_rate = 0.1
        loss, _ = sess.run([t_loss, op_minimize],
                           feed_dict={p_lr: learning_rate})


Alternatively, you could make it a non-learnable Variable and explicitly assign it whenever it needs to be changed; let’s assume we don’t do that.
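
For completeness, that variant could look roughly like this (a minimal sketch; the variable and op names are made up):

# learning rate as a non-trainable variable instead of a placeholder
v_lr = tf.Variable(0.1, trainable=False, name='learning_rate')

# an explicit assignment op, run whenever the rate should change,
# e.g. sess.run(op_set_lr) at the desired iteration
op_set_lr = tf.assign(v_lr, 0.01)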
The first thing I then usually do is to also add a summary node to track the current learning rate (as well as the training loss):

with tf.Graph().as_default() as graph:
    p_lr = tf.placeholder(tf.float32, (), name='learning_rate')
    t_loss = tf.reduce_mean(...)
    # the optimizer choice is illustrative; the point is feeding p_lr
    op_minimize = tf.train.GradientDescentOptimizer(p_lr) \
                          .minimize(t_loss)

    tf.summary.scalar('learning_rate', p_lr)
    tf.summary.scalar('loss', t_loss)

    # histograms work the same way
    tf.summary.histogram('probability', t_some_batch)

    s_merged = tf.summary.merge_all()

writer = tf.summary.FileWriter('log', graph=graph)
with tf.Session(graph=graph) as sess:
    init = tf.group(tf.global_variables_initializer(),
                    tf.local_variables_initializer())
    sess.run(init)

    for epoch in range(epochs):
        learning_rate = 0.1
        loss, summary, _ = sess.run([t_loss, s_merged, op_minimize],
                                    feed_dict={p_lr: learning_rate})
        # write the merged summaries to the event file
        writer.add_summary(summary, epoch)


Now, for each epoch, the values of the t_loss and p_lr tensors are stored in a protocol buffer file in the log subdirectory. You can then start TensorBoard with the --logdir parameter pointing to it and get a nice visualization of the training progress.

One example where doing this massively helped me track down errors is exactly the network I took the introduction histogram picture from; here, I set the learning rate to 0.1 for about two hundred iterations before dropping it to 0.01. It turned out that having the learning rate this high for my particular network resulted in saturation, and learning effectively stopped. The histogram helped me notice the issue, and the scalar graph helped determine the “correct” learning rates.

### Training and validation set summaries

Suppose now you want to have different summaries that may or may not appear on different instances of the graph. The learning rate, for example, has no influence on the outcome of the validation batch, so including it in validation runs only eats up time, memory and storage. However, the tf.summary.merge_all() operation doesn’t care where the summaries live per se, and since some summaries depend on nodes from the training graph (e.g. the learning rate placeholder), you suddenly create a dependency on nodes you didn’t want to trigger, with effects of very varying levels of fun.

It turns out that summaries can be bundled into collections (e.g. “train” and “test”) by specifying their membership upon construction, so that you can later obtain only those summaries that belong to the specified collections:

with tf.Graph().as_default() as graph:
    p_lr = tf.placeholder(tf.float32, (), name='learning_rate')
    t_loss = tf.reduce_mean(...)
    # the optimizer choice is illustrative; the point is feeding p_lr
    op_minimize = tf.train.GradientDescentOptimizer(p_lr) \
                          .minimize(t_loss)

    tf.summary.scalar('learning_rate', p_lr, collections=['train'])
    tf.summary.scalar('loss', t_loss, collections=['train', 'test'])

    # merge summaries per collection
    s_training = tf.summary.merge_all('train')
    s_test = tf.summary.merge_all('test')

writer = tf.summary.FileWriter('log', graph=graph)
with tf.Session(graph=graph) as sess:
    init = tf.group(tf.global_variables_initializer(),
                    tf.local_variables_initializer())
    sess.run(init)

    for epoch in range(epochs):
        # during training
        learning_rate = 0.1
        loss, summary, _ = sess.run([t_loss, s_training, op_minimize],
                                    feed_dict={p_lr: learning_rate})
        writer.add_summary(summary, epoch)

        # during validation
        loss, summary = sess.run([t_loss, s_test])
        writer.add_summary(summary, epoch)


In combination with liberal use of tf.name_scope(), it could then look like the following image. The graph shows three different training runs where we now have the ability to reason about the choice(s) of the learning rate.

This works, but we can do better.

### Using the Supervisor

One currently (documentation-wise) very underrepresented yet powerful feature of TensorFlow’s Python API is the Supervisor, a manager that basically takes care of writing summaries, taking snapshots, running queues (should you use them, which you probably do), initializing variables and also gracefully stopping training.

In order to use the Supervisor you basically swap out your own session for a managed one, skip variable initialization and tell it when you want which of your custom summaries to be stored. While not required, it is apparently good practice to add a global_step variable to the graph; should the Supervisor find such a variable, it will automatically use it for internal coordination. If you bind the variable to the optimizer, it will also be incremented automatically for each optimization step, freeing you from having to keep track of the iteration yourself. Here’s an example of how to use it:

with tf.Graph().as_default() as graph:
    p_lr = tf.placeholder(tf.float32, (), name='learning_rate')
    t_loss = tf.reduce_mean(...)

    global_step = tf.Variable(0, name='global_step', trainable=False)
    # the optimizer choice is illustrative; binding global_step makes it
    # increment automatically with every optimization step
    op_minimize = tf.train.GradientDescentOptimizer(p_lr) \
                          .minimize(t_loss, global_step=global_step)

    tf.summary.scalar('learning_rate', p_lr, collections=['train'])
    tf.summary.scalar('loss', t_loss, collections=['train', 'test'])

    s_training = tf.summary.merge_all('train')
    s_test = tf.summary.merge_all('test')

# create the supervisor and obtain a managed session;
# variable initialization will now be done automatically.
sv = tf.train.Supervisor(logdir='log', graph=graph)
with sv.managed_session() as sess:

    # run until training should stop
    while not sv.should_stop():
        learning_rate = 0.1
        loss, s, i, _ = sess.run([t_loss, s_training,
                                  global_step, op_minimize],
                                 feed_dict={p_lr: learning_rate})

        # hand over your own summaries to the Supervisor
        sv.summary_computed(sess, s, global_step=i)

        loss, s = sess.run([t_loss, s_test])
        sv.summary_computed(sess, s, global_step=i)

        # ... at some point, request a stop
        sv.request_stop()


The Supervisor will also add additional summaries to your graph for free, e.g. insight into the number of training steps per second. This could allow you to fine-tune mini-batch sizes, for example, because they currently tend to have a big impact on the host-to-device transmission of the data.
Different from Caffe’s behavior, the Supervisor will by default keep only the last five snapshots of the learned weights; unless you fear missing the validation loss optimum, leaving the training running for days is no longer an issue, disk-wise at least.
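
If five snapshots are not enough, the retention policy can be changed by handing the Supervisor a custom Saver; a minimal sketch, assuming the graph is built as above:

# keep more checkpoints than the default of five by passing
# a custom Saver to the Supervisor
saver = tf.train.Saver(max_to_keep=20)
sv = tf.train.Supervisor(logdir='log', graph=graph, saver=saver)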