Toolformer

LLMs have all kinds of failure modes that can’t be solved with scale, but many could be solved by giving the model access to external tools (like a human would use!)

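Drawbacks of existing methods: they either rely on large amounts of human annotation or restrict tool use to task-specific settings.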

Objective = let the model decide for itself when to use which tools.

How does it work?

Given just a handful of human-written examples of how an API can be used:

“we let a LM annotate a huge language modelling dataset with potential API calls.

We then use a self-supervised loss to determine which of these API calls actually help the model in predicting future tokens.

Finally, we fine-tune the LM itself on the API calls that it considers useful”
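
A minimal sketch of the filtering step in the second sentence of the quote. As I read the paper, a candidate call is kept only if prefixing the call and its result lowers a weighted loss over the following tokens by at least some threshold, compared with the better of two baselines: no call at all, or the call without its result. Here the per-token log-probabilities are passed in as plain lists instead of coming from a real LM, and the names `weighted_loss` and `keep_call`, the weights, and the threshold value are all illustrative assumptions.

```python
from typing import Sequence


def weighted_loss(token_logprobs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted negative log-likelihood of the tokens after a candidate position."""
    return -sum(w * lp for w, lp in zip(weights, token_logprobs))


def keep_call(
    lp_with_result: Sequence[float],  # log-probs given the call AND its result as prefix
    lp_call_only: Sequence[float],    # log-probs given the call without a result
    lp_no_call: Sequence[float],      # log-probs given no call at all
    weights: Sequence[float],
    tau: float = 1.0,                 # hypothetical filtering threshold
) -> bool:
    """Keep the call only if seeing its result helps by at least `tau`."""
    loss_plus = weighted_loss(lp_with_result, weights)
    loss_minus = min(
        weighted_loss(lp_no_call, weights),
        weighted_loss(lp_call_only, weights),
    )
    return loss_minus - loss_plus >= tau


# Toy numbers: weights decay with distance from the call position.
weights = [1.0, 0.8, 0.6]
no_call = [-2.0, -1.5, -1.0]
call_only = [-2.1, -1.5, -1.0]
helpful = keep_call([-0.2, -0.5, -1.0], call_only, no_call, weights)
unhelpful = keep_call([-1.9, -1.5, -1.0], call_only, no_call, weights)
```

With these toy numbers the first call clears the threshold and the second does not, so only the first would end up in the fine-tuning data.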

Each API call is a tuple c = (a_c, i_c), where a_c is the name of the API and i_c is the input (both are strings).
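
A small sketch of how such a (name, input) tuple could be turned into text for the LM. The paper linearises calls with special `<API>`, `</API>` and `→` markers, with and without the result; the class name `APICall`, its field names, and the function `linearize` are my own placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class APICall:
    name: str       # name of the API, e.g. "Calculator"
    api_input: str  # input to the API, as a plain string


def linearize(call: APICall, result: Optional[str] = None) -> str:
    """Render a call as text, optionally including the API's result."""
    body = f"{call.name}({call.api_input})"
    if result is not None:
        body += f" → {result}"
    return f"<API> {body} </API>"


call = APICall("Calculator", "400 / 1400")
without_result = linearize(call)
with_result = linearize(call, "0.29")
```

Keeping both fields as strings means any tool with a text-in, text-out interface fits the same format.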

Generate the top-k positions where an API call is likely to start (based on LM probs), then sample up to m candidate calls for each.
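
A rough sketch of the position-selection part. My understanding is that the LM's probability of starting an API call is computed at every position, positions above a threshold are kept, and only the k most likely survive. The probabilities below are made up, and `candidate_positions`, `k` and `tau_s` are illustrative names, not taken from any released code.

```python
from typing import List, Sequence


def candidate_positions(
    api_start_probs: Sequence[float],  # P(an API call starts here) for each position
    k: int,                            # maximum number of positions to keep
    tau_s: float,                      # minimum probability for a position to qualify
) -> List[int]:
    """Return up to k positions whose API-start probability exceeds tau_s."""
    above = [i for i, p in enumerate(api_start_probs) if p > tau_s]
    top = sorted(above, key=lambda i: api_start_probs[i], reverse=True)[:k]
    return sorted(top)  # back in document order


probs = [0.01, 0.30, 0.02, 0.08, 0.50, 0.06]
top_two = candidate_positions(probs, k=2, tau_s=0.05)
all_above = candidate_positions(probs, k=10, tau_s=0.05)
```

Candidate calls would then be sampled from the LM at each returned position, which this sketch leaves out.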

A tool may be another NN, a Python script, a retrieval system, etc.

Interrupt decoding when the “→” token appears; then call the tool, insert its response, and resume decoding.
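
A toy sketch of this interrupt-and-resume loop. As I read the paper, the trigger is the "→" token: decoding pauses there, the tool is called, and its response plus the closing marker are inserted before decoding continues. The "LM" here is just a scripted token list, each call is assumed to arrive as a single `Name(args)` token, and `scripted_lm`, `decode_with_tools` and the addition-only calculator are all stand-ins of my own.

```python
import re
from typing import Callable, Dict, List, Optional

ARROW, API_END = "→", "</API>"


def scripted_lm(script: List[str]) -> Callable[[List[str]], Optional[str]]:
    """Stand-in for an LM: emits a fixed token sequence and ignores the context."""
    it = iter(script)
    return lambda context: next(it, None)


def decode_with_tools(
    next_token: Callable[[List[str]], Optional[str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 50,
) -> str:
    tokens: List[str] = []
    for _ in range(max_steps):
        tok = next_token(tokens)
        if tok is None:  # the model has finished
            break
        tokens.append(tok)
        if tok == ARROW:
            # Pause decoding: the token before the arrow is assumed to be "Name(args)".
            match = re.fullmatch(r"(\w+)\((.*)\)", tokens[-2]) if len(tokens) > 1 else None
            if match and match.group(1) in tools:
                tokens.append(tools[match.group(1)](match.group(2)))
            tokens.append(API_END)  # close the call, then resume decoding
    return " ".join(tokens)


# Toy calculator that only handles integer addition, e.g. "2+3".
tools = {"Calculator": lambda s: str(sum(int(x) for x in s.split("+")))}
lm = scripted_lm(["2", "+", "3", "=", "<API>", "Calculator(2+3)", ARROW, "5", "."])
output = decode_with_tools(lm, tools)
```

Because the tool result is appended to the context before the next token is requested, a real LM would condition on it when continuing.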

For the QA stuff it has learned to use the tool in 98% of cases, and 97% for the calculator!

We don’t need an external oracle or even a second dataset. We simply judge on how well tools help us at the LM task. This is very cool!

Idea of pretraining on mass-scale LM tasks then fine-tuning on a tool-use-augmented dataset looks like a paradigm I can see sticking (tools in the loop = expensive! Maybe there’s a case for pretraining this way too, but this looks neat)

Relies on prompting to construct the initial set of API calls - probably doesn’t lead to a rich understanding of what the APIs can do. We aren’t really learning how to call these APIs effectively - just which generalisations from the prompts work, and which don’t.

Could fine-tuning be harmful because of repeated data? (probably the data isn’t repeated enough to matter?)

The stop-decoding / use-tool / restart-decoding loop is probably a major target for software speedups.

A code-model Toolformer that writes its own tools! (generally, code models that run their own code are a cool, if terrifying, idea)

Results great - smashes non-tool models, even on things like SQuAD which you might expect non-tool models to be quite good at

As they say, because API calls only happen at decoding time, the model has no ability to interact with the API response, rewrite it, etc. Probably makes for some weird outputs, particularly from Wikipedia!

From scaling laws, seems like larger models would be much more capable at getting the most out of more complex APIs. Would love to see this use APIs zero-shot given some docs