Toolformer: Language Models Can Teach Themselves to Use Tools


LLMs have all kinds of failure modes that scale alone can’t fix, but many could be fixed by giving the model access to external tools (like a human!)
Drawbacks of existing methods:
  • Human annotation required
  • Only task-specific settings
Objective = let the model decide for itself when to use which tools.
How does it work?
Given just a handful of human-written examples of how an API can be used:
“we let a LM annotate a huge language modelling dataset with potential API calls.
We then use a self-supervised loss to determine which of these API calls actually help the model in predicting future tokens.
Finally, we fine-tune the LM itself on the API calls that it considers useful”
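
The filtering step in the quote above can be sketched roughly as follows. This is my reading of the criterion, not the authors' code: a call is kept only if seeing the call plus its response lowers the weighted loss on the following tokens by at least some threshold, compared with the better of "no call at all" and "call without a response". The function names, the weights and the threshold value are all assumptions for illustration.

```python
from typing import Sequence


def weighted_nll(token_logprobs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted negative log-likelihood of the tokens that follow a position.

    token_logprobs[j] is the LM's log-probability of the j-th following token;
    weights[j] says how much that token counts (assumed to decay with distance).
    """
    return -sum(w * lp for w, lp in zip(weights, token_logprobs))


def keep_api_call(
    loss_with_result: float,
    loss_without_call: float,
    loss_call_no_result: float,
    threshold: float,
) -> bool:
    """Keep a call only if its response reduces the loss by at least `threshold`.

    The baseline is the better (lower) of the two losses obtained without
    seeing the response, so a call cannot pass just by being inserted.
    """
    baseline = min(loss_without_call, loss_call_no_result)
    return baseline - loss_with_result >= threshold
```

For example, with losses of 1.0 (with result), 2.5 (no call) and 3.0 (call, no result) and a threshold of 1.0, the call is kept because it beats the baseline by 1.5.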


  • Require API calls to be represented as text sequences
  • Special tokens <API> and </API> mark the beginning and end of API calls. Between these tokens…
  • Each API call is a tuple $c = (a_c, i_c)$, where $a_c$ is the name of the API and $i_c$ is the input (both are strings).
  • $\rightarrow r$ is added as a result, where $\rightarrow$ is a special token and $r$ is the response string
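
A minimal sketch of how a call might be written out as text, with and without its response. The literal token strings and the helper name `linearize_call` are placeholders I've picked for illustration; the real special tokens may be spelled differently.

```python
from typing import Optional

# Placeholder spellings for the special tokens (assumption, for illustration only).
API_START, API_END, RESULT_SEP = "<API>", "</API>", "→"


def linearize_call(name: str, call_input: str, response: Optional[str] = None) -> str:
    """Render an API call (name, input) as a text sequence.

    Without a response this is just the wrapped call; with a response the
    result separator and the response string are appended before the end token.
    """
    call = f"{name}({call_input})"
    if response is None:
        return f"{API_START} {call} {API_END}"
    return f"{API_START} {call} {RESULT_SEP} {response} {API_END}"


print(linearize_call("Calculator", "400 / 1400"))
print(linearize_call("Calculator", "400 / 1400", "0.29"))
```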

Sampling API calls

  • Generate the top $k$ positions where an API call is likely to start (based on LM probs), and up to $m$ possible calls for each.
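
Roughly, position selection could look like the sketch below: given the LM's probability of starting an API call at each position, keep the positions above a threshold and then the k most likely of those. The function name, the threshold parameter and the example numbers are assumptions; sampling the candidate calls at each position is not shown.

```python
from typing import List, Sequence


def candidate_positions(
    api_start_probs: Sequence[float], k: int, threshold: float
) -> List[int]:
    """Return up to k positions where an API call is most likely to start.

    api_start_probs[i] is the (assumed precomputed) probability that the LM
    would emit the API-start token at position i.
    """
    above = [(p, i) for i, p in enumerate(api_start_probs) if p > threshold]
    # Highest probability first; ties broken by earlier position.
    above.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(i for _, i in above[:k])


print(candidate_positions([0.01, 0.3, 0.05, 0.6, 0.2], k=2, threshold=0.1))
```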

Executing API calls

The tool behind an API may be another NN, a Python script, or a retrieval system.


At inference: interrupt decoding when the → token appears, call the tool, insert its response, then resume decoding.
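
The interrupt-and-resume loop can be sketched like this. Everything here is a toy: `next_token` stands in for the LM, `tools` is a hypothetical name-to-function mapping, and the token strings are placeholders. The point is only the control flow of stopping at the result separator, running the tool, splicing in the response and carrying on.

```python
from typing import Callable, Dict, List

# Placeholder spellings for the special tokens (assumption, for illustration only).
API_START, API_END, RESULT_SEP = "<API>", "</API>", "→"


def decode_with_tools(
    next_token: Callable[[List[str]], str],
    tools: Dict[str, Callable[[str], str]],
    prompt: List[str],
    max_steps: int = 50,
    eos: str = "<eos>",
) -> List[str]:
    """Greedy-style decoding loop that pauses to run a tool at each separator."""
    tokens = list(prompt)
    for _ in range(max_steps):
        tok = next_token(tokens)
        if tok == eos:
            break
        tokens.append(tok)
        if tok != RESULT_SEP or API_START not in tokens:
            continue
        # Recover the text of the call between the last API-start token and the separator.
        start = len(tokens) - 1 - tokens[::-1].index(API_START)
        call_text = "".join(tokens[start + 1 : -1])
        name, _, rest = call_text.partition("(")
        tool = tools.get(name)
        if tool is None:
            continue  # unknown tool: just keep decoding
        call_input = rest.rsplit(")", 1)[0]
        # Splice the tool's response and the closing token into the context.
        tokens.append(tool(call_input))
        tokens.append(API_END)
    return tokens
```

A scripted "model" that emits `<API> Calculator ( 2+3 ) →` would get the tool's answer and the closing token appended before decoding continues with its next token.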


For the QA stuff it has learned to use the tool in 98% of cases, and 97% for the calculator!


  • We don’t need an external oracle or even a second dataset. We simply judge on how well tools help us at the LM task. This is very cool!
  • Idea of pretraining on mass-scale LM tasks then fine-tuning on a tool-use augmented dataset looks like a paradigm I can see sticking (tools in the loop = expensive! Maybe there’s a case for pretraining this way too, but this looks neat)
  • Limit API calls to those that can be represented as text - think about this…
  • Rely on prompting to construct initial set of API calls - probably doesn’t lead to rich understanding of what the APIs can do. We aren’t really learning how to call these APIs effectively - just which generalisations from the prompts work, and which don’t.
  • Could fine-tuning be harmful because of repeated data? (probably don’t repeat enough?)
  • The stop decoding, use tool, start decoding loop is probably a major target for software speedups.
  • Tool as a MoE “expert”? This is almost what happens when the tool is a NN!
  • Code model Toolformer that writes its own tools! (generally code models that run their own code are a cool, if terrifying, idea)
  • Some tools (calculator & calendar) very restricted. Was this necessary?
  • Results great - smashes non-tool models, even on things like SQuAD which you might expect non-tool models to be quite good at
  • I want to see BigBench!
  • As they say, because API calls only happen at decoding time, there is no ability for the model to interact with the API response, rewrite it, etc. Probably makes for some weird outputs, particularly from Wikipedia!
  • From scaling laws, seems like larger models would be much more capable at getting the most out of more complex APIs. Would love to see this use APIs zero-shot given some docs
  • As they say - no tool-chaining also limits model