Formatting: Tasks should be formatted in a way that is easy for humans to read and interpret
Specificity: Tasks should aim to cleanly capture some specific capability of language models
Difficulty: Tasks must not be fully solvable by existing language models. Tasks that span a range of difficulties, and/or are completely beyond the capabilities of current language models, are also encouraged
Not solvable by memorizing the Internet: Task authors should be wary of tasks where target input/output pairs may appear verbatim in model training data
Compute resources: Reviewers will consider whether the computational resources used by a task are large enough to make running the task practically challenging
Size: Tasks should include a minimum of 32 input-output pairs of examples
API
2 types of task:
JSON
compare inputs & outputs/logits
uses standard metrics
80% of tasks
Programmatic
written in Python
multiple interactions with model
custom metrics
interface requires methods for a) generation, b) log probs (see the sketch after this list)
All of Lite is JSON
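For concreteness, a minimal sketch of the two formats. First a toy JSON task file (field names follow my memory of the public BIG-bench repo and should be treated as illustrative, not authoritative):

```json
{
  "name": "toy_arithmetic",
  "description": "Illustrative JSON task: pick the correct sum",
  "keywords": ["arithmetic"],
  "metrics": ["multiple_choice_grade"],
  "preferred_score": "multiple_choice_grade",
  "examples": [
    {"input": "2 + 2 =", "target_scores": {"4": 1, "5": 0}},
    {"input": "3 + 5 =", "target_scores": {"8": 1, "7": 0}}
  ]
}
```

A programmatic task instead receives a model object and drives it directly through the two required methods. A rough Python sketch of that interface (method names and signatures are my assumption, not the repo's exact API):

```python
class Model:
    """Sketch of the model interface a programmatic task sees."""

    def generate_text(self, prompt: str) -> str:
        # a) generation: return the model's continuation of `prompt`
        raise NotImplementedError

    def cond_log_prob(self, prompt: str, targets: list[str]) -> list[float]:
        # b) log probs: return log P(target | prompt) for each candidate target
        raise NotImplementedError
```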
Each task defines preferred metric, with high and low scores, so we can normalise
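i.e. something like this (a sketch; the paper maps each task's stated low score to 0 and high score to 100):

```python
def normalized_score(raw: float, low: float, high: float) -> float:
    # low  = score of weak/random performance on the task's preferred metric
    # high = score of excellent performance
    return 100.0 * (raw - low) / (high - low)
```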
Special task for detecting data leakage (BIG-bench embeds a canary string in task data so leakage into training corpora can be detected)
Evaluation
Models: (all sampling with 0 temperature)
BIG-G:
LaMDA style model
trained by Google
2.8T tokens (SentencePiece)
BIG-G sparse:
Based on ST-MoE
32 experts, top-2 routing, capacity factor (cf) = 1.25, with an MoE layer every 4th layer (see the routing sketch below)
Same setup as dense
Largest model = 46B parameters (FLOP-equivalent to an 8.9B dense model)
GPT: OpenAI's dense GPT-3 model family
PaLM: results for Google's PaLM are also reported (and feature heavily in the selected tasks below)
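A minimal numpy sketch of the top-2 routing mentioned above (my illustration; it ignores the capacity factor, token dropping, and the auxiliary load-balancing losses that ST-MoE uses in practice):

```python
import numpy as np

def top2_moe_layer(x, w_router, experts):
    """Minimal sketch of top-2 mixture-of-experts routing (ST-MoE style).

    x:        (n_tokens, d_model) token activations
    w_router: (d_model, n_experts) router weights
    experts:  list of n_experts callables, each mapping a (d_model,)
              vector to a (d_model,) vector
    """
    logits = x @ w_router                            # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)            # softmax over experts
    top2 = np.argsort(probs, axis=-1)[:, -2:]        # two highest-prob experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top2[t]:
            out[t] += probs[t, e] * experts[e](x[t]) # gate-weighted mixture
    return out
```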
Human evaluators:
Raters = selected experts
Allowed to use all available resources, including web search
Report mean and max
Behaviour of models & humans on BIG-Bench
Model family
Cross entropy and scaling behavior are extremely similar across model families. This similarity is particularly striking, as it persists despite differences in evaluation dataset, training dataset, training hyperparameters, and model architecture.
Calibration
Calibration = alignment between the model's predicted probabilities and the true frequencies of outcomes. Even a model with good accuracy can be poorly calibrated through over- or under-confidence.
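A standard way to quantify this is expected calibration error (ECE): bin predictions by confidence and compare each bin's mean confidence to its empirical accuracy. A minimal sketch (my illustration; whether the paper uses exactly this binned estimator is an assumption):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: confidence-weighted gap between predicted confidence
    and empirical accuracy, averaged over confidence bins."""
    conf = np.asarray(conf)               # max predicted probability per example
    correct = np.asarray(correct, float)  # 1.0 if the prediction was right
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```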
Sparsity
2x decrease in FLOPs for the same model performance
Dramatic improvements in calibration (10x!): sparse models reach a given calibration score at about a tenth of the FLOP-matched dense parameter count
Linearity & Breakthroughness
Still appears to me that “breakthroughness” is much more a property of the metric than anything to do with learning / the model.
Breakthroughness is much less extreme in the underlying log probs, even for the most breakthrough-y tasks. A slight exception is periodic_elements, where the initial flatline is due to the model performing at chance, and the subsequent breakthrough indicates a real jump. I suspect this could be like the idiom-task breakthrough, where a certain scale is needed to absorb a few key nuggets of info that make the task possible.
They actually go on to support this hypothesis:
if the metric is broken down bit-by-bit the effect is much less dramatic (e.g. multiple-choice metric)
The effect is also reduced by not using all-or-nothing metrics (e.g. BLEU or ROUGE rather than exact_str_match)
However, if the downstream performance is what we really care about, in some cases the “smooth version” of the task/metric teaches us nothing useful about actual task performance
We’re tempted to think about this in terms of “Aha!” moments in humans - but this is a function of time, not of scale
One possible explanation for the breakthrough phenomenon on multistep tasks is that the probability of success on the task scales like the product of the success probabilities on each step. If the probabilities of each of k steps increase linearly, their product will increase like a kth-order polynomial, which will be nearly flat until a sudden increase.
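A toy numerical illustration of this (my own, not from the paper):

```python
import numpy as np

p = np.linspace(0.0, 1.0, 6)   # per-step success probability, growing linearly
for k in (1, 4, 10):           # k = number of steps that must all succeed
    print(k, np.round(p ** k, 3))
# k=10 gives [0. 0. 0. 0.006 0.107 1.]: flat, then a sudden "breakthrough"
```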
Subjectivity:
The observations [above] lead to the conclusion that the capabilities and trajectories across scale of language models are much more subjective than we would like to believe, even when quantified through specific tasks. Different choices in task design can make the same capability appear to stagnate, gradually improve, or suddenly break through.
Brittleness
Some sensible-looking improvements in the prompt can cause performance to deteriorate (!)
Conclusion → if task doesn’t look like anything in the training data there can be issues.
Social Bias
High-level conclusions:
Bias often increases with scale in settings with broad or ambiguous context.
Bias can decrease with scale in settings with narrow, unambiguous context.
Bias can potentially be steered through appropriately chosen prompting.
Behaviour on selected tasks
Checkmate-in-one
This task is trivial for humans who know the rules of chess and are able to use a chess board as an external memory aid to track piece positions … None of the BIG-G models tested can solve this task.
They are getting better at some aspect of chess though (finding legal moves), so it’s not as though they can’t learn at all in this setting. But this is too hard for them.
(not from paper, see github repo)
Some interesting selected tasks from Lite
Program State Analysis/Automatic Debugging
Debug without running code
Most models essentially random
PaLM better than average human and nearing best (!)
Similar results seen for Code Description (non-PaLM models a bit more competitive here)
Conceptual Combinations
Task = “identify properties of two-concept combinations that are not typically a property of the constituent concepts”
E.g. Sliced apples are: a) cooked in a pie, b) very sharp, c) bleed, d) dysfunctional
Most models not much better than random, but PaLM nearing average rater
Performance here is much more linear, and more even across models
Performance is human-competitive: PaLM is best and beats the best human
Emoji Movie
PaLM again smashes this vs other models
Formal Fallacies and Syllogisms with Negation
Distinguish valid arguments from logical fallacies (focus on negation)
All models are barely above random, with the best still below average humans and far below the best humans
Language Identification
Again, suddenly the largest PaLM model is massively better here
Known Unknowns
Task = answer unknown when that’s the correct answer
All models struggle here: scale helps very little, and the average human is still way better
Logic Grid Puzzles
Basically constraint satisfaction problems
PaLM at scale notably better than others, and just beats average human (40%)
But still huge gap here to the best humans (100%!)
Operators
“defines (arithmetic) operators using natural language and applies them to operands, expecting the model to give results without being exposed to examples.”
Improves smoothly with size
PaLM better than average human, still a reasonable gap to best
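An illustrative Operators item in this spirit (my own construction, not from the paper): “Define op such that op(a, b) = a + 2b. What is op(3, 5)?” → 13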
Shakespeare Dialogue
“determine whether two nearby "lines" from dialogue in a Shakespearean play were spoken by the same or different characters”
Very difficult task - average humans barely above random
Best are 100% - memorisation?
Models are all barely better than random, with no clear benefit from massive scale - in fact PaLM seems to get worse?
Strange Stories
Measures the model’s “emotional intelligence” (Theory of Mind): can it infer the mental states of others?
Children begin to do this from age 4
All models get good at this when they get large enough
PaLM is considerably the best, getting above the average human (whose score isn’t far from the best)
StrategyQA
“Open-domain questions where the required reasoning steps are implicit in the question and should be inferred using a strategy”
This kind of implicit multi-step reasoning is known as a key limitation of LMs
Massive scale appears to help a lot here
Still a long way from the best humans
Symbol Interpretation Task
“Scene” is described purely in symbols (definition given) - task = choose sentence that best describes scene
Variety of abilities required:
Separate text from logic tokens
Language understanding
Perception
Reasoning
All models are terrible, scale doesn’t help. Essentially random.
Not clear to me why the models are so bad; many abilities are required, so this is particularly tricky - definitely one to keep an eye on!