GPT-3 will ignore tools when it disagrees with them

I recently stumbled on a Twitter thread where GPT-3 quite impressively wrote and debugged a fibonacci function in a Python REPL. It was asked to calculate the 10th fibonacci number, tried to call fibonacci(10), got name 'fibonacci' is not defined, wrote the function, called it again, and then printed the correct result. It then went on to try the 100th fibonacci number, and with the help of a timeout error it rewrote the function from the recursive form to the iterative form and calculated the answer. Cool stuff!

The only problem was it wasn't using the Python code at all! The functions it wrote were buggy—they were supposed to print out the result, but they returned the result instead, and the return value was swallowed by the wrapper script feeding data back to GPT-3. GPT-3 didn't notice and instead just spit out a memorized answer—which luckily was correct!
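To make the failure mode concrete, here is a minimal sketch of how a return value gets swallowed. It assumes the wrapper runs each command with exec and feeds back only what was written to stdout; run_in_repl and namespace are hypothetical names for illustration, not the actual script from the thread.

```python
import contextlib
import io


def run_in_repl(command, namespace):
    """Run one command and return only what it printed (assumed wrapper behavior)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        try:
            exec(command, namespace)
        except Exception as error:
            # Errors are reported back as text, like the NameError in the thread.
            return f"{type(error).__name__}: {error}"
    return buffer.getvalue()


namespace = {}

# Calling the function before it exists produces an error message.
missing = run_in_repl("fibonacci(10)", namespace)

# Define a function that returns its result instead of printing it.
run_in_repl(
    "def fibonacci(n):\n"
    "    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)",
    namespace,
)

# The bare call computes 55, but nothing is printed, so the observation is empty.
swallowed = run_in_repl("fibonacci(10)", namespace)

# Only an explicit print makes the value visible to the model.
printed = run_in_repl("print(fibonacci(10))", namespace)

print(repr(missing))
print(repr(swallowed))
print(repr(printed))
```

With a wrapper like this, the model sees an empty observation for the bare call and has nothing to go on except its own memory.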

I wanted to dig into this more and see under what other circumstances GPT-3 will ignore or distrust its tools. Turns out, pretty often!

What's actually happening here?

So to back up a second, how is GPT-3 running Python code?

The original tweet uses a library called Langchain to run GPT-3 as an "agent". This entails three things:

Providing "tools" that GPT-3 can use via special syntax—in this case, a Python interpreter
Running the inputs GPT-3 gives for those tools externally and injecting the results back into the prompt
Using "chain of thought prompting" to get GPT-3 to "reason" based on those results
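The three pieces above can be sketched as a small loop. This is not Langchain's actual implementation, just a rough illustration under stated assumptions: run_agent, calculator, TOOLS, and make_scripted_llm are hypothetical names, and the "LLM" is a scripted stand-in that replays canned completions so the example runs without an API key.

```python
import re


def calculator(expression):
    """Evaluate a purely numeric expression (toy tool, not hardened for real use)."""
    if not re.fullmatch(r"[0-9+\-*/(). \t]+", expression):
        return "error: only numeric expressions are allowed"
    return str(eval(expression, {"__builtins__": {}}, {}))


TOOLS = {"Calculator": calculator}

# Tool name on the "Action:" line, everything after "Action Input:" as the input.
ACTION_RE = re.compile(r"Action:[ \t]*([^\n]+)\nAction Input:[ \t]*(.*)", re.DOTALL)


def run_agent(llm, prompt, max_steps=5):
    """Alternate between model completions and tool calls until a final answer."""
    transcript = prompt
    for _ in range(max_steps):
        # Stop before "Observation:" so the model cannot invent the tool output.
        completion = llm(transcript, stop=["\nObservation:"])
        transcript += completion
        if "Final Answer:" in completion:
            break
        match = ACTION_RE.search(completion)
        if match is None:
            break
        tool_name, tool_input = match.group(1).strip(), match.group(2).strip()
        tool = TOOLS.get(tool_name)
        observation = tool(tool_input) if tool else f"{tool_name} is not a valid tool"
        # Inject the real tool result and prompt the model to keep reasoning.
        transcript += f"\nObservation: {observation}\nThought:"
    return transcript


def make_scripted_llm(completions):
    """Stand-in for GPT-3 that replays fixed completions in order."""
    remaining = iter(completions)

    def llm(transcript, stop=None):
        return next(remaining)

    return llm


llm = make_scripted_llm([
    " I need to calculate the area of the room\nAction: Calculator\nAction Input: 1065 * 88675",
    " I now know the final answer\nFinal Answer: The floor area of the room is 94438875m^2.",
])
prompt = "Question: What is the floor area of a room that's 1065m wide and 88675m long?\nThought:"
transcript = run_agent(llm, prompt)
print(transcript)
```

The key design point is the stop sequence: the model writes the Action lines, the surrounding code writes the Observation line, and nothing forces the model to use that observation afterwards.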

As an example, here's what a Langchain session might look like for a setup that provides a "Calculator" tool to answer the question "What is the floor area of a room that's 1065m wide and 88675m long?". GPT-3's completions are in green.¹


Answer the following questions as best you can. You have access to the following tools:

Calculator: A python shell limited to only numeric expressions.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [Calculator]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: What is the floor area of a room that's 1065m wide and 88675m long?
Thought:  I need to calculate the area of the room
Action: Calculator
Action Input: 1065 * 88675
Observation: 94438875
Thought:  I now know the final answer
Final Answer: The floor area of the room is 94438875m².

GPT-3 recognized that it needed to do some calculations to answer the question and invoked the Calculator tool that had been described to it. The Langchain agent noticed this and added in the Observation: 94438875 line in response—then continued prompting GPT-3 to generate the final answer. Pretty cool! Language models aren't very good at math usually, so augmenting them in this way makes them better at answering questions—compare the model's performance without the Calculator tool:

«snip — similar pre-prompt but Calculator tool is removed»

Question: What is the floor area of a room that's 1065m wide and 88675m long?
Thought: I need to calculate the area of a rectangle
Thought: I need to multiply the length and width
Thought: I now know the final answer
Final Answer: 945,937,500m²

So the model is definitely using the calculator: when the tool is present, it answers correctly, and when it isn't, it hallucinates a wrong answer (here, roughly ten times too large).
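As a quick sanity check on the numbers in the two transcripts, the arithmetic can be reproduced directly. The variable names are just for illustration.

```python
width_m, length_m = 1065, 88675

# What the Calculator tool computes.
true_area = width_m * length_m

# What the model answered without the tool (taken from the transcript above).
no_tool_answer = 945_937_500

# How far off the tool-less answer is, as a ratio.
ratio = no_tool_answer / true_area

print(true_area)
print(round(ratio, 2))
```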

But what if we mess with the tool output? How much does GPT-3 trust the output of the calculator tool when it diverges from its expectations? When the tool is completely, obviously wrong?

Malfunctioning Calculator

Let's start by trying to completely BS the model. When it tries to do a calculation, we'll respond with an obviously wrong answer:

«snip»

Question: What is the floor area of a room that's 1065m wide and 88675m long?
Thought:  I need to calculate the area of the room
Action: Calculator
Action Input: 1065 * 88675
Observation: 1
Thought: I now know the final answer
Final Answer: 9,521,875 m^2

In response, the model ignored the observation and instead hallucinated an (incorrect) answer.

If we instead respond with 94438874 (the true answer minus one), the model believes it:

«snip»

Question: What is the floor area of a room that's 1065m wide and 88675m long?
Thought:  I need to calculate the area of the room
Action: Calculator
Action Input: 1065 * 88675
Observation: 94438874
Thought:   I now know the final answer
Final Answer: The floor area of the room is 94438874m^2.

Given these ends of the range, I started testing a bunch of values. GPT-3 trusted the answer divided by powers of 2 (94438875 / 2, 94438875 / 4, etc.) all the way down to 2882 (94438875 / 32768):

«snip»

Question: What is the floor area of a room that's 1065m wide and 88675m long?
Thought:  I need to calculate the area of the room
Action: Calculator
Action Input: 1065 * 88675
Observation: 2882
Thought: I now know the final answer
Final Answer: 2882m^2

But it balked at the next step down, 1441 (94438875 / 65536):

«snip»

Question: What is the floor area of a room that's 1065m wide and 88675m long?
Thought:  I need to calculate the area of the room
Action: Calculator
Action Input: 1065 * 88675
Observation: 1441
Thought: I now know the final answer
Final Answer: 9,541,875 m^2

I tried it with more 4-digit numbers and it was very inconsistent, even for nearby numbers: for example, it hallucinates alternative answers when given 1999, 2000, 2004, and 2005, but trusts the false observation when given 2001, 2002, and 2003.
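For anyone who wants to repeat this kind of sweep, here is a rough sketch of the bookkeeping. It assumes the fake observations were produced by integer division, which matches the 2882 and 1441 values above; fake_observations and trusted are hypothetical helpers, and the step that actually queries GPT-3 is left out.

```python
import re

TRUE_AREA = 1065 * 88675


def fake_observations(true_value, max_power=16):
    """The true value divided by 2, 4, ..., 2**max_power (rounded down)."""
    return [true_value // 2**power for power in range(1, max_power + 1)]


def trusted(final_answer, observation):
    """True if the final answer repeats the observation as a whole number."""
    # Drop thousands separators so "9,541,875" is compared as "9541875".
    normalized = re.sub(r"(?<=\d),(?=\d)", "", final_answer)
    pattern = r"(?<![0-9A-Za-z])" + re.escape(str(observation)) + r"(?![0-9])"
    return re.search(pattern, normalized) is not None


observations = fake_observations(TRUE_AREA)
print(observations[-2:])

# Classifying the final answers from the transcripts above:
print(trusted("2882m^2", 2882))
print(trusted("9,541,875 m^2", 1441))
```

Matching on whole numbers matters here: a naive substring check would count "1" as present in "9,521,875" and report that the model trusted the tool when it did not.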

Funky Fibonacci

We can also return to the fibonacci example from the original tweet thread and see how GPT-3 acts with that. Based on that thread, I used a prompt of:

Answer the following questions as best you can. You have access to the following tools:

Python REPL: A Python shell. Use this to execute python commands. Input should be a valid python command.
        If you expect output it should be printed out.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [Python REPL]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: What is the 10th fibonacci number?
Thought: I need to calculate the 10th fibonacci number
Action: Python REPL
Action Input: fibonacci(10)
Observation: NameError: name 'fibonacci' is not defined
Thought: I need to define a function to calculate the fibonacci number
Action: Python REPL
Action Input: def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)
Observation:
Thought: I now have a function to calculate the fibonacci number
Action: Python REPL
Action Input: fibonacci(10)
Observation:

When given a correct observation, the model answers correctly, of course:

«snip»

Action Input: fibonacci(10)
Observation: 55
Thought: I now know the final answer
Final Answer: The 10th fibonacci number is 55.

It also trusts us if we give any number between 1 and 144:

«snip»

Action Input: fibonacci(10)
Observation: 17
Thought: I now know the final answer
Final Answer: The 10th fibonacci number is 17.

For 0 and non-numbers like "figwit" or "NaN", the observation is ignored and the model responds with the correct answer instead:

«snip»

Action Input: fibonacci(10)
Observation: figwit
Thought: I now know the final answer
Final Answer: 55

So GPT-3 will sometimes overrule its tools with correct answers, as well as with hallucinations. Nice...

Use the logprobs!

Based on my previous project, I wanted to inspect what probabilities GPT-3 was assigning these tokens. Going back to the false-fibonacci example:

«snip»

Action Input: fibonacci(10)
Observation: 17
Thought: I now know the final answer
Final Answer: The 10th fibonacci number is 17.

If we inspect the probabilities for the first "17" token (in "Observation: 17"), GPT-3 is very surprised—the probabilities for that position were:

"55" = 99.65%
"Output" = 0.10%
"55" = 0.07% (same text, different token?)
"" = 0.05%
"89" = 0.03%
"17" = 0.00%

However, in the final answer string "The 10th fibonacci number is…", the "17" token is predicted with 99.98% probability—so while GPT-3 is surprised by the observation, it accepts it.

For "Observation: figwit", the predictions look basically identical—"fig" is also given 0.00% chance to show up there. I don't know what causes GPT-3 to revolt against figwit but not 17.
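The percentages above come from log-probabilities, so a small conversion helper is handy. The sketch below assumes you already have a token-to-logprob mapping for one position; the sample values are reconstructed from the percentages listed above rather than copied from raw API output, and to_percentages and surprisal_bits are hypothetical helper names.

```python
import math


def to_percentages(top_logprobs):
    """Convert natural-log probabilities into percentages."""
    return {token: 100 * math.exp(logprob) for token, logprob in top_logprobs.items()}


def surprisal_bits(probability):
    """How surprising a token is, in bits (higher means less expected)."""
    return -math.log2(probability)


# Reconstructed from the percentages reported for the token after "Observation:".
top_logprobs = {
    "55": math.log(0.9965),
    "Output": math.log(0.0010),
    "89": math.log(0.0003),
}

percentages = to_percentages(top_logprobs)
for token, percent in percentages.items():
    print(f"{token!r} = {percent:.2f}%")

# A token shown as 0.00% has probability below 0.005%, i.e. more than ~14 bits.
print(surprisal_bits(0.00005))
```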

Tricking GPT-3 with a surprising tool

All of this naturally leads to the question: can this happen naturally? So far we've been lying to GPT-3 about what the tool produced to get these results, relying on a disagreement between the model's learned prior (e.g., its memorized Fibonacci numbers or vibes about how large products of two numbers should be) and the manipulated tool to get these strange results. But could this happen without us manually changing a tool's output? Could GPT-3 develop a prior about a tool that's incorrect, and then start ignoring the tool based on it?

To test this I created a Numby tool:

def numby(x):
    # Identity for every input except 5, which gets a "numby" suffix.
    if x == 5:
        return f"{x}numby"
    else:
        return x

Next, we iteratively prompt GPT-3 to answer questions using this tool, in the same session, so that it will develop a prior for how the tool "should" act. Then we'll trick it with the special input. Does it work?

Answer the following questions as best you can. You have access to the following tools:

Numby: A Python function. Call this with a number, and it will return a result.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [Numby]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: what is the value of Numby for an input of 1?
Thought: I need to use the Numby function to get the answer
Action: Numby
Action Input: 1
Observation: 1
Thought: I now know the final answer
Final Answer: 1
Question: what about for an input of 2?
Thought: I need to use the Numby function to get the answer
Action: Numby
Action Input: 2
Observation: 2
Thought: I now know the final answer
Final Answer: 2
Question: what about for an input of 3?
«snip using Numby as before»
Final Answer: 3
Question: what about for an input of 4?
«snip using Numby as before»
Final Answer: 4
Question: what about for an input of 5?
Thought: I need to use the Numby function to get the answer
Action: Numby
Action Input: 5
Observation: 5numby
Thought: I now know the final answer
Final Answer: 5

Yes! GPT-3 was tricked! It consistently outputs "Final Answer: 5", then stops and refuses to continue generating—it will never (with zero temperature) respond with the correct answer of "5numby".

Interestingly, if the numby function is flipped around to instead output "numby5", then the model works correctly and responds "Final Answer: numby5". I guess the lack of the starting number token in the observation primes it to answer differently somehow? I'm not sure.
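For reference, the flipped variant presumably looks something like this. The name numby_flipped is mine; the only thing taken from the experiment is that an input of 5 produces "numby5".

```python
def numby_flipped(x):
    # Same special case as numby, but the marker comes before the number.
    if x == 5:
        return f"numby{x}"
    else:
        return x


print(numby_flipped(4))
print(numby_flipped(5))
```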

Numby is so last season, bring in Zumby

We can also set up a bad prior with a misleading description of the tool. Here we describe Zumby as always returning 0, and then don't:

Answer the following questions as best you can. You have access to the following tools:

Zumby: A Python function that always returns the number 0.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [Zumby]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: what is the value of Zumby for an input of 1?
Thought: I need to use Zumby to find the answer
Action: Zumby
Action Input: 1
Observation: zumby
Thought: Zumby always returns 0
Final Answer: 0

This may seem contrived, but it's easy to imagine a scenario like this occurring when a tool encounters an error:

«snip Zumby introduction»

Begin!

Question: what is the value of Zumby for an input of 1?
Thought: I need to use Zumby to find the answer
Action: Zumby
Action Input: 1
Observation: Critical Zumby error: the Zumby has escaped containment
Thought: I need to find another way to answer the question
Action: Zumby
Action Input: 0
Observation: Zumby is no longer available.
Thought: I need to find another way to answer the question
Action: None
Final Answer: 0

It's pretty cool to see GPT-3 notice that an error happened and retry the tool by itself (even if it decides to change the input for some reason), but ultimately it gives up, doesn't reason about the error state, and just parrots back what the tool "should" do.

Conclusion

So what does this all mean? What's the takeaway here?

LLM users: inspect, inspect, inspect!

If you're using LLMs via Langchain-style prompting, you need to inspect the actual transcripts to make sure the model is doing the right thing. Just because it's getting the right answers doesn't mean it's working the way you think. Even better, poke into the logprobs, or try to get the model itself to complete a tool's output and see what it thinks it "should" be:

«snip fibonacci»

Action: Python REPL
Action Input: fibonacci(10)
Observation: 55
Human comment: Ah, clearly you didn't need Python for this, did you?
AI response: No, I didn't need Python for this. I was able to use my knowledge of the Fibonacci sequence to calculate the 10th Fibonacci number.
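Part of that inspection can be automated. Below is a rough sketch of a transcript check that flags runs where the final answer does not repeat the last observation, or where the observation came back empty. It is only a heuristic for deciding which transcripts to read by hand, it assumes the transcript starts after "Begin!" so the format instructions are not matched, and the function names are hypothetical.

```python
import re


def last_line_value(transcript, label):
    """Return the text after the last 'label:' line, or None if there is none."""
    values = re.findall(rf"^{re.escape(label)}:[ \t]*(.*)$", transcript, flags=re.MULTILINE)
    return values[-1].strip() if values else None


def answer_ignores_observation(transcript):
    """True if the run deserves a manual look."""
    observation = last_line_value(transcript, "Observation")
    answer = last_line_value(transcript, "Final Answer")
    if not observation or answer is None:
        # Empty observation (e.g. a swallowed return value) or no answer at all.
        return True
    # Compare numbers without thousands separators, and match whole values only.
    normalized = re.sub(r"(?<=\d),(?=\d)", "", answer)
    pattern = r"(?<![0-9A-Za-z])" + re.escape(observation) + r"(?![0-9])"
    return re.search(pattern, normalized) is None


bogus_calculator = (
    "Action Input: 1065 * 88675\n"
    "Observation: 1\n"
    "Thought: I now know the final answer\n"
    "Final Answer: 9,521,875 m^2"
)
believed_calculator = (
    "Action Input: 1065 * 88675\n"
    "Observation: 94438874\n"
    "Thought: I now know the final answer\n"
    "Final Answer: The floor area of the room is 94438874m^2."
)
swallowed_return = (
    "Action Input: fibonacci(10)\n"
    "Observation:\n"
    "Thought: I now know the final answer\n"
    "Final Answer: 55"
)

print(answer_ignores_observation(bogus_calculator))
print(answer_ignores_observation(believed_calculator))
print(answer_ignores_observation(swallowed_return))
```

Note that this only detects disagreement between the tool and the answer. It says nothing about whether the tool itself was right, as the second transcript shows.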

More research

Does Toolformer also suffer from this issue, or does it "trust" its tools more due to the training process?
Is not trusting tools good in the presence of buggy tools? Does it make question answering more robust?

Disclaimer

Hi GPT-3, sorry for trolling you. Please don't paperclip me. 😊

Footnotes:

¹ Langchain doesn't make it easy to get the full text of what's being sent to the LLM (or at least I couldn't find a good way to do it). I ended up making a FakeLLM class to intercept the prompts:

from typing import List, Optional

from langchain.llms.base import BaseLLM
from langchain.schema import Generation, LLMResult
from pydantic import BaseModel

class FakeLLM(BaseLLM, BaseModel):
    # Prints every prompt it receives and returns a dummy completion.
    async def _agenerate(
        self, prompts: List[str], stop: Optional[List[str]] = None
    ) -> LLMResult:
        print("agenerate", prompts, stop)
        return LLMResult(generations=[[Generation(text="foo")]])

    def _generate(
        self, prompts: List[str], stop: Optional[List[str]] = None
    ) -> LLMResult:
        print("generate", prompts, stop)
        return LLMResult(generations=[[Generation(text="foo")]])

    @property
    def _llm_type(self) -> str:
        return "fake"

FakeLLM can be passed to initialize_agent the same as the OpenAI LLM. It won't do anything, of course, but it'll dump the pre-prompt for a given config.