2x1=10

because numbers are people, too
      Using TensorFlow’s Supervisor with TensorBoard summary groups

      One of TensorFlow’s more awesome parts is definitely TensorBoard, i.e. the capability of collecting and visualizing data from the TensorFlow graph as the network is running, while also being able to display and browse the graph itself. Coming from Caffe, where I eventually wrote my own tooling just to visualize the training loss from logs of the raw console output and had to copy-paste the graph’s prototxt to some online service in order to visualize it, this is a massive step in the best possible direction. To get some of Caffe’s checkpointing features back, you can use TensorFlow’s Supervisor. This blog post is about using both TensorBoard and the Supervisor for fun and profit.
      TL;DR: Scroll to the end for an example of using grouped summaries with the Supervisor.

      Apart from just storing scalar data for TensorBoard, the histogram feature turned out to be especially valuable to me for observing the performance of a probability inference step.

      Here, the left half shows the distribution of ground truth probability values in the training and validation sets over time, whereas the right half shows the actual inferred probabilities over time. It’s not hard to see that the network is getting better, but there is more to it:

      • The histogram of the ground truth values (here on the left) allows you to verify that your training data is indeed correct. If the data is not balanced, you might learn a network that is biased towards one outcome.
      • If the network does indeed obtain some biased view of the data, you’ll clearly see patterns emerging in the inferred histogram that do not match the expected ground truth distribution. In this example, the right histograms approach the left histograms, so it appears to be working fine.
      • However, if you only measure network performance in accuracy, as the ratio of correct guesses over all examples, you might be getting the wrong impression: If the input distribution is skewed towards 95% positive and 5% negative examples, a network guessing “positive” 100% of the time is producing only 5% error. If your total accuracy is an aggregate over multiple different values, you will definitely miss this, especially since randomized mini-batches only further obscure the issue.
      • Worse, if the learned coefficients run into saturation, learning will stop for them. Again, this might not be obvious if the total loss and accuracy are actually aggregates of different values.
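      To make the accuracy pitfall concrete, here is a minimal plain-Python sketch (the numbers are invented for illustration, matching the 95%/5% split mentioned above): a degenerate network that always answers “positive” still reaches 95% accuracy, while the per-class view exposes the bias immediately.

      ```python
      # Hypothetical, hard-coded label distribution: 95% positive, 5% negative.
      labels      = [1] * 95 + [0] * 5
      predictions = [1] * 100  # a degenerate network guessing "positive" every time

      correct  = sum(p == l for p, l in zip(predictions, labels))
      accuracy = correct / len(labels)

      # Per-class recall makes visible what the total accuracy hides.
      pos = [p for p, l in zip(predictions, labels) if l == 1]
      neg = [p for p, l in zip(predictions, labels) if l == 0]
      pos_recall = sum(p == 1 for p in pos) / len(pos)
      neg_recall = sum(p == 0 for p in neg) / len(neg)

      print(accuracy, pos_recall, neg_recall)  # 0.95 1.0 0.0
      ```

      This is exactly the kind of structure that jumps out of the inferred-probability histogram but stays invisible in a single aggregated accuracy scalar.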

      Influence of the learning rate

      Let’s take the example of a variable learning rate. If at some point the training slows down, it’s not immediately clear if this is due to the fact that

      • a parameter space optimum has been found and training is done,
      • the algorithm found a plateau in parameter space and the loss would continue to fall after a few more hundred or thousand iterations, or
      • the training is actually diverging because the learning rate is not small enough to enter a local optimum in the first place.

      Now, optimizers like Adam are tailored to overcome the problems of fixed learning rates, but they too can only go so far: If the learning rate is too big to begin with, it’s still too big after fine-tuning. Or worse, after a couple of iterations the adjusted weights could end up in saturation, and no further update would be able to change this.

      To rule out at least one part, you can make the learning rate a changeable parameter of the network, e.g. a function of the training iteration. I had some success with Caffe’s “multi-step” approach of changing the learning rate at fixed iteration numbers (say, reducing it by one decade at iterations 1000, 5000 and 16000), where I determined these values over different training runs of the network.
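      That multi-step idea can be sketched as a plain function of the iteration (the helper name and the gamma parameter are mine; the boundary iterations are the ones quoted above and would need to be tuned per network):

      ```python
      def multi_step_lr(iteration, base_lr=0.1, boundaries=(1000, 5000, 16000), gamma=0.1):
          """Caffe-style "multi-step" schedule: multiply base_lr by gamma
          once for every boundary the training iteration has passed."""
          drops = sum(1 for b in boundaries if iteration >= b)
          return base_lr * gamma ** drops

      # The rate drops by one decade at iterations 1000, 5000 and 16000.
      for i in (0, 999, 1000, 5000, 16000):
          print(i, multi_step_lr(i))
      ```

      Feeding the result of such a function into a learning rate placeholder each iteration then gives you the fixed-iteration drops without rebuilding the graph.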

      So instead of baking the learning rate into the graph during construction, you would define a placeholder for it and feed the learning rate of the current epoch/iteration into the optimization operation each time you call it, like so:

      with tf.Graph().as_default() as graph:
          p_lr = tf.placeholder(tf.float32, (), name='learning_rate')
          t_loss = tf.reduce_mean(...)
          op_minimize = tf.train.AdamOptimizer(learning_rate=p_lr)\
                                .minimize(t_loss)
      
      with tf.Session(graph=graph) as sess:
          init = tf.group(tf.global_variables_initializer(),
                          tf.local_variables_initializer())
          sess.run(init)
      
          for _ in range(0, epochs):
              learning_rate = 0.1
              loss, _ = sess.run([t_loss, op_minimize],
                             feed_dict={p_lr: learning_rate})
      

      Alternatively, you could make it a non-learnable Variable and explicitly assign it whenever it needs to be changed; let’s assume we don’t do that.
      The first thing I usually do then is to also add a summary node to track the current learning rate (as well as the training loss):

      with tf.Graph().as_default() as graph:
          p_lr = tf.placeholder(tf.float32, (), name='learning_rate')
          t_loss = tf.reduce_mean(...)
          op_minimize = tf.train.AdamOptimizer(learning_rate=p_lr)\
                                .minimize(t_loss)
      
          tf.summary.scalar('learning_rate', p_lr)
          tf.summary.scalar('loss', t_loss)
      
          # histograms work the same way
          tf.summary.histogram('probability', t_some_batch)
      
          s_merged = tf.summary.merge_all()
      
      writer = tf.summary.FileWriter('log', graph=graph)
      with tf.Session(graph=graph) as sess:
          init = tf.group(tf.global_variables_initializer(),
                          tf.local_variables_initializer())
          sess.run(init)
      
          for _ in range(0, epochs):
              learning_rate = 0.1
              loss, summary, _ = sess.run([t_loss, s_merged, op_minimize], 
                                      feed_dict={p_lr: learning_rate})
              writer.add_summary(summary)
      

      Now, for each epoch, the values of the t_loss and p_lr tensors are stored in a protocol buffer file in the log subdirectory. You can then start TensorBoard with the --logdir parameter pointing to it and get a nice visualization of the training progress.

      One example where doing this massively helped me track down errors is exactly the network I took the introduction histogram picture from; here, I set the learning rate to 0.1 for about two hundred iterations before dropping it to 0.01. It turned out that having the learning rate this high for my particular network did result in saturation, and learning effectively stopped. The histogram helped me notice the issue and the scalar graph helped me determine the “correct” learning rates.

      Training and validation set summaries

      Suppose now you want to have different summaries that may or may not appear on different instances of the graph. The learning rate, for example, has no influence on the outcome of the validation batch, so including it in validation runs only eats up time, memory and storage. However, the tf.summary.merge_all() operation doesn’t care where the summaries live per se; and since some summaries depend on nodes from the training graph (e.g. the learning rate placeholder), you suddenly create a dependency on nodes you didn’t want to trigger, with effects of very varying levels of fun.

      It turns out that summaries can be bundled into collections (e.g. “train” and “test”) by specifying their membership upon construction, so that you can later obtain only those summaries that belong to the specified collections:

      with tf.Graph().as_default() as graph:
          p_lr = tf.placeholder(tf.float32, (), name='learning_rate')
          t_loss = tf.reduce_mean(...)
          op_minimize = tf.train.AdamOptimizer(learning_rate=p_lr)\
                                .minimize(t_loss)
      
          tf.summary.scalar('learning_rate', p_lr, collections=['train'])
          tf.summary.scalar('loss', t_loss, collections=['train', 'test'])
      
          # merge summaries per collection
          s_training = tf.summary.merge_all('train')
          s_test = tf.summary.merge_all('test')
      
      writer = tf.summary.FileWriter('log', graph=graph)
      with tf.Session(graph=graph) as sess:
          init = tf.group(tf.global_variables_initializer(),
                          tf.local_variables_initializer())
          sess.run(init)
      
          for _ in range(0, epochs):
              # during training
              learning_rate = 0.1     
              loss, summary, _ = sess.run([t_loss, s_training, op_minimize], 
                                      feed_dict={p_lr: learning_rate})
              writer.add_summary(summary)
      
              # during validation
              loss, summary = sess.run([t_loss, s_test])
              writer.add_summary(summary)
      

      In combination with liberal uses of tf.name_scope(), it could then look like in the following image. The graph shows three different training runs, where we now have the ability to reason about the choice(s) of the learning rate.

      This works, but we can do bet­ter.

      Using the Supervisor

      One currently (documentation-wise) very underrepresented yet powerful feature of TensorFlow’s Python API is the Supervisor, a manager that basically takes care of writing summaries, taking snapshots, running queues (should you use them, which you probably do), initializing variables and also gracefully stopping training.

      In order to use the Supervisor, you basically swap out your own session for a managed one, skip variable initialization and tell it when you want which of your custom summaries to be stored. While not required, it appears to be good practice to add a global_step variable to the graph; should the Supervisor find such a variable, it will automatically use it for internal coordination. If you bind the variable to the optimizer, it will also be automatically incremented for each optimization step, freeing you from having to keep track of the iteration yourself. Here’s an example of how to use it:

      with tf.Graph().as_default() as graph:
          p_lr = tf.placeholder(tf.float32, (), name='learning_rate')
          t_loss = tf.reduce_mean(...)
      
          # adding the global_step and telling the optimizer about it
          global_step = tf.Variable(0, name='global_step', trainable=False)
          op_minimize = tf.train.AdamOptimizer(learning_rate=p_lr)\
                                .minimize(t_loss, global_step=global_step)
      
          tf.summary.scalar('learning_rate', p_lr, collections=['train'])
          tf.summary.scalar('loss', t_loss, collections=['train', 'test'])
      
          s_training = tf.summary.merge_all('train')
          s_test = tf.summary.merge_all('test')
      
      # create the supervisor and obtain a managed session;
      # variable initialization will now be done automatically.
      sv = tf.train.Supervisor(logdir='log', graph=graph)
      with sv.managed_session() as sess:
      
          # run until training should stop
          while not sv.should_stop():
              learning_rate = 0.1     
              loss, s, i, _ = sess.run([t_loss, s_training, 
                                        global_step, op_minimize], 
                                    feed_dict={p_lr: learning_rate})
      
              # hand over your own summaries to the Supervisor
              sv.summary_computed(sess, s, global_step=i)
      
              loss, s = sess.run([t_loss, s_test])
              sv.summary_computed(sess, s, global_step=i)
      
              # ... at some point, request a stop
              sv.request_stop()
      

      The Supervisor will also add additional summaries to your graph for free, e.g. an insight into the number of training steps per second. This could allow you to fine-tune mini-batch sizes, for example, because they currently tend to have a big impact on the host-to-device transmission of the data.
      Different from Caffe’s behavior, the Supervisor will by default keep only the last five snapshots of the learned weights; unless you fear missing the validation loss optimum, leaving the training running for days is now not an issue anymore (disk-wise, at least).

      January 21st, 2017 by
      Markus
      · 2 comments

      2 comments on “Using TensorFlow’s Supervisor with TensorBoard summary groups”

      1. Toke says:
        Monday, January 30th, 2017, 14:58 (GMT+2)

        Some images of how this looks in TensorBoard would be awesome, and take this guide over the top 🙂 thank you for your great work

        • Markus says:
          Thursday, February 2nd, 2017, 03:35 (GMT+2)

          Thanks a lot, man! (Also you’re right, I added another screenshot.)


