Andrew Ng
·
2025-10-17
Module 2: Reflection Design Pattern
2.1 Reflection to improve outputs of a task
0:03
The reflection design pattern is something I've used in many applications, and it's surprisingly easy to implement.
0:07
Let's take a look.
0:08
Just as humans will sometimes reflect on their own output and find a way to improve it, so can LLMs.
0:14
For example, I might write an email like this, and if I'm typing quickly, I might end up with a first draft that's not great.
0:21
And if I read over it, I might say,
0:24
huh, "next month" isn't that clear about what dates Tommy might be free for dinner, there's this typo that I had, and I also forgot to sign my name.
0:33
And this would let me revise the draft to be more specific in saying, hey, Tommy, are you free for dinner on the 5th to the 7th?
0:40
A similar process lets LLMs also improve their outputs.
0:44
You can prompt an LLM to write the first draft of an email, and given email version 1, email v1, you can pass it to maybe the same model,
0:53
the same large language model, but with a different prompt, and tell it to reflect and write an improved second draft to then get you the final output, email v2.
1:02
Here, I have just hard-coded this workflow of prompting the LLM once, and then prompting it again to reflect and improve, and that gives email v2.
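As a rough sketch, this hard-coded two-step workflow might look something like the following Python. It assumes the OpenAI Python client purely for concreteness (any LLM API works), and the prompts are only illustrative.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; swap in whichever LLM client you prefer

def call_llm(prompt: str) -> str:
    # Thin wrapper around a single chat-completion call.
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Step 1: direct generation of the first draft.
email_v1 = call_llm("Write a short email inviting Tommy to dinner next month.")

# Step 2: a second prompt that reflects on the draft and rewrites it.
email_v2 = call_llm(
    "Review the email below. Make any vague dates specific, fix typos, "
    "and make sure it is signed. Then write an improved draft.\n\n" + email_v1
)
print(email_v2)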
1:12
It turns out that a similar process can be used to improve other types of outputs.
1:18
For example, if you are having an LLM write code, you might prompt an LLM to write code to do a certain task, and it may give you v1 of the code,
1:28
and then pass it to the same LLM or maybe a different LLM to ask it to check for bugs and write an improved second draft of the code.
1:36
Different LLMs have different strengths, and so sometimes I would choose different models for writing the first draft and for reflecting and trying to improve it.
1:46
For example, it turns out reasoning models, sometimes also called thinking models, are pretty good at finding bugs,
1:53
and so I'll sometimes write the first draft of the code by direct generation, but then use a reasoning model to check for bugs.
2:00
Now, rather than just having an LLM reflect on the code, it turns out that if you can get external feedback, meaning new information from outside the LLM,
2:11
reflection becomes much more powerful.
2:14
In the case of code, one thing you can do is just execute the code to see what the code does,
2:20
and the output, including any error messages from the code, is incredibly useful information for the LLM to reflect on and find a way to improve its code.
2:30
So in this example, the LLM generated the first draft of the code, but when I run it, it generates a syntax error.
2:36
When you pass this code output and error logs back into the LLM and ask it to reflect on the feedback and write a new draft,
2:44
this gives it a lot of very useful information to come up with a much better version 2 of the code.
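A minimal sketch of that loop, reusing the call_llm helper from the earlier sketch, might run the generated code and pass any output or errors back in. The task, file handling, and prompts here are just illustrative.

import subprocess
import sys
import tempfile

# Step 1: direct generation of the code (call_llm is the wrapper defined earlier).
code_v1 = call_llm("Write a Python script that prints the first 10 Fibonacci numbers.")
# In practice you would strip any markdown fences from the model's reply first.

# Step 2: execute the draft and capture stdout and stderr as external feedback.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(code_v1)
    script_path = f.name
result = subprocess.run([sys.executable, script_path],
                        capture_output=True, text=True, timeout=30)
feedback = result.stdout + result.stderr  # includes any syntax or runtime errors

# Step 3: reflect on the execution feedback and write version 2 of the code.
code_v2 = call_llm(
    "Here is a draft Python script and the output, including any errors, from running it. "
    "Reflect on the feedback, fix any problems, and return an improved script.\n\n"
    f"SCRIPT:\n{code_v1}\n\nOUTPUT:\n{feedback}"
)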
2:50
So the reflection design pattern isn't magic.
2:53
It does not make an LLM always get everything right 100% of the time, but it can often give it maybe a modest bump in performance.
3:01
But one design consideration to keep in mind is reflection is much more powerful when there is new additional external information that you can ingest into the reflection process.
3:13
So in this example, if you can run the code and have that code output or error messages as an additional input to the reflection step,
3:20
that really lets the LLM reflect much more deeply and figure out what may be going wrong, if anything,
3:26
and results in a much better second version of the code than if there wasn't this external information that you can ingest.
3:32
So one thing to keep in mind, whenever reflection has an opportunity to get additional information, that makes it much more powerful.
3:41
Now with that, let's go on to the next video, where I want to share with you a more systematic comparison of using reflection versus direct generation, or something we sometimes call zero-shot prompting.
3:54
Let's go on to the next video.
2.2 Why not just direct generation?
0:04
Let's take a look at why we might prefer to use a reflection workflow rather than just
0:04
prompting an LLM once and having it directly generate the answer and be done with it.
0:09
With direct generation, you just prompt the LLM with an instruction and let it generate an answer.
0:14
So you can ask an LLM to write an essay about black holes and have it just generate the text,
0:19
or have it write a Python function to calculate
0:22
compound interest and have it just write the code directly.
0:25
The prompt examples you see here are also called zero-shot prompting.
0:31
Let me explain what zero-shot means. In contrast to zero-shot prompting,
0:35
a related approach is to include one or more examples of what you want the output to look like
0:41
in your prompt. And this is known as one-shot prompting, if in the prompt you include
0:45
one example of a desired input-output pair, or two-shot or few-shot prompting,
0:50
depending on how many such examples you include in your prompt.
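As a small illustration (the task and wording here are just made up), the difference might look like this:

# Zero-shot: only the instruction, with zero examples of the desired output.
zero_shot_prompt = "Convert this date to ISO format: March 5, 2025"

# One-shot: the same instruction plus one example input-output pair.
one_shot_prompt = (
    "Convert dates to ISO format.\n"
    "Example: 'July 4, 1776' -> '1776-07-04'\n"
    "Now convert: 'March 5, 2025'"
)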
0:53
And so zero-shot prompting refers to when you include zero examples, that is, you don't include
0:57
any examples of the desired outputs that you want. But don't worry if you aren't yet familiar
1:02
with these terms. The important thing is that in the examples you see here, you're just prompting
1:07
the LLM to directly generate an answer in one go, which I'm also calling zero-shot prompting
1:13
because we include zero examples. It turns out that multiple studies have
1:17
shown that reflection improves on the performance of direct generation on a variety of tasks.
1:23
This diagram is adapted from the research paper by Madaan and others, and this shows a range of
1:30
different tasks being implemented with different models and with and without reflection.
1:35
The way to read this diagram is to look at these pairs of adjacent light followed by
1:41
dark-colored bars, where the light bar shows zero-shot prompting and the dark bar
1:46
shows the same model but with reflection. And the colors blue, green, and red show
1:52
experiments run with different models, such as GPT-3.5 and GPT-4. And what you see is,
1:58
for many different applications, the dark bar that is with reflection is quite a bit higher
2:04
than the light bar. But of course, your mileage may vary depending on your specific application.
2:10
Here are some more examples where reflection might be helpful.
2:14
If you are generating structured data, such as an HTML table, sometimes it may have incorrect
2:20
formatting of the output. So a reflection prompt to validate the HTML code could be helpful.
2:26
If it's basic HTML, this may not help that much, since LLMs are pretty good at basic HTML. But
2:31
especially if you have more complex structured outputs, like maybe a JSON data structure with
2:36
a lot of nesting, then reflection may be more likely to spot bugs. Or if you ask an LLM to
2:41
generate a sequence of steps that comprise a set of instructions to do something, such as how to
2:45
brew a perfect cup of tea, sometimes the LLM may miss steps, and a reflection prompt asking it to check
2:51
instructions for coherence and completeness might help spot errors. Or something that I've actually
2:56
worked on was using an LLM to generate domain names, but sometimes the names it generates have
3:01
an unintended meaning or may be really hard to pronounce. And so I've used reflection prompts
3:06
to double check if the domain name has any problematic connotations or problematic meanings,
3:11
or if the name is hard to pronounce. And we actually used this at AI Fund, one of my teams,
3:16
to help brainstorm domain names for startups that we're working on. I want to show you a couple of
3:21
examples of reflection prompts. For brainstorming domain names, you might ask it to review the
3:26
domain names you suggested, and then ask it to check if each name is easy to pronounce. Check
3:30
if each name might mean something negative in English or other languages, and then output a
3:35
short list of only the names that satisfy these criteria. Or to improve an email, you can write a
3:41
reflection prompt to tell it to review the first draft of the email, check the tone, verify all facts, dates,
3:46
and promises are accurate. This would make sense in the context of the LLM having been fed a number
3:51
of facts and dates and so on in order to write the email drafts. All this would be provided as part
3:56
of the LLM context. And then based on any problems it may find, write the next draft of the email.
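Written out, an email reflection prompt along these lines might look roughly like this (the exact wording is only illustrative):

email_reflection_prompt = """Review the first draft of the email below.
1. Check the tone.
2. Verify that all facts, dates, and promises are consistent with the context you were given.
3. Based on any problems you find, write the next draft of the email."""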
4:02
So some tips for writing reflection prompts. It helps to clearly indicate that you want it to
4:07
review or to reflect on the first draft of the output. And if you can specify a clear set of
4:13
criteria, such as whether the domain name is easy to pronounce and whether it may have negative
4:17
connotations, or for email, checking the tone and verifying the facts, then that guides the LLM better
4:22
in reflecting and critiquing based on the criteria that you care the most about. I found that one of the
4:28
ways I've learned to write better prompts is to read a lot of other prompts that other people
4:34
have written. Sometimes I'll actually download open source software and go and find the prompts
4:40
in a piece of software that I think is especially well done to just go and read the prompts that
4:45
the authors have written. So I hope you have a sense of how to write a basic reflection prompt
4:51
and that maybe you even try it out in your own work to see if it helps give you better performance.
4:57
In the next video, I'd like to share with you a fun example where we'll start to look at
5:02
multi-modal inputs and outputs. We'll have an algorithm reflect
5:06
on an image being generated or a chart being generated. Let's go take a look.
2.3 Chart generation workflow
0:02
In the coding lab that you see in this module, you play with a chart generation workflow where
0:05
you use an agent to generate nice-looking diagrams. It turns out reflection can significantly improve
0:11
the quality of this output. Let's take a look. In this example, I have data from a coffee machine
0:18
showing when different drinks, such as a latte coffee or hot chocolate or a cappuccino coffee
0:24
and so on, were sold and for what price. And we want to have an agent create a plot
0:29
comparing Q1 or first quarter coffee sales in 2024 and 2025. So one way to do it would be to
0:35
write a prompt that asks an LLM to create a plot comparing Q1 coffee sales in 2024 and 2025
0:41
using the data stored in a spreadsheet saved as a CSV, or comma-separated values,
0:48
file. And an LLM might write Python code like this to generate the plot. And with this v1 of the code,
0:55
if you execute it, it may generate a plot like this. When I ran the code that the LLM output,
1:01
it actually generated this the first time. And this is a stacked bar plot, which is not a very
1:07
easy way to visualize things, and it just doesn't look like a very good plot. But what you can do is then
1:12
give it v1 of the code as well as the plot that this code generated and feed it into a
1:19
multimodal model, that is, an LLM that can also accept image inputs, and ask it to examine the
1:26
image that was generated by this code and then to critique the image, find a way to come up with
1:31
better visualization, and update the code to just generate a clearer, better plot. Multimodal LLMs
1:37
can use visual reasoning, so they can actually look at this figure visually to find ways to improve it.
1:44
And when I did this, it actually generated a bar graph that isn't this stacked bar graph,
1:49
but a more regular bar graph that separates out the 2024 and 2025 coffee sales in what I thought
1:54
was a more pleasing and clearer way. When you get to the coding lab, please feel free to mess around
2:00
with the prompts and see if you can get maybe even better-looking graphs than these. Because
2:05
different LLMs have different strengths and weaknesses, sometimes I'll use different LLMs
2:09
for the initial generation and for the reflection. So, for example, you may use one LLM to generate
2:14
the initial code, maybe OpenAI's GPT-4o or GPT-5 or some model like that, and just prompt it with a
2:21
prompt like this to write Python code to generate the visualization and so on. And then the reflection
2:26
prompts might be something like this, where you tell the LLM to play the role of an expert data
2:31
analyst that provides constructive feedback and then give it the version 1 of the code,
2:36
the plot that was generated, maybe also the computational history from how the code was
2:40
generated, and ask it to critique it for specific criteria. Remember, when you give it specific
2:45
criteria like readability, clarity, and completeness, it helps the LLM better figure out what to do.
2:50
And then ask it to write new code to implement your improvements. One thing you may find is that
2:56
sometimes using a reasoning model for reflection may work better than a non-reasoning model. So
3:02
when you're trying out different models for the initial generation and the reflection, these are
3:07
different configurations that you might toggle or try different combinations of.
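A sketch of that split, with one model doing the initial generation and a multimodal model doing the reflection, might look like the code below. It assumes the OpenAI Python client and a saved PNG of the rendered chart; the model names and prompt wording are just placeholders.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any multimodal LLM client works

def reflect_on_plot(code_v1: str, plot_path: str) -> str:
    # Encode the rendered chart so the multimodal model can inspect it visually.
    with open(plot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    critique_prompt = (
        "You are an expert data analyst who provides constructive feedback. "
        "Here is plotting code and the chart it produced. Critique the chart for "
        "readability, clarity, and completeness, then rewrite the code to produce "
        "a clearer plot.\n\nCODE:\n" + code_v1
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any model that accepts image inputs
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": critique_prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content  # critique plus revised code (v2)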
3:12
So when you get to the coding lab, I hope you have fun visualizing coffee sales. Now, when you're
3:17
building an application, one thing you may be wondering is, does reflection actually improve
3:22
performance on your specific application? From various studies, reflection improves performance
3:28
by a little bit on some, by a lot on some others, and maybe barely any at all on some other
3:34
applications. And so it'll be useful to understand its impact on your application, which will also give you
3:39
guidance on how to tune either the initial generation or the reflection prompt to try to get
3:45
better performance. In the next video, let's take a look at evals or evaluations
3:50
for reflection workflow. Let's go on to the next video.
2.4 Evaluating the impact of reflection
0:04
Reflection often improves the performance of the system, but before I commit to keeping it,
0:04
I would usually want to double check how much it actually improves the performance, because
0:09
it does slow down the system a little bit by needing to take an extra step.
0:13
Let's take a look at evals for reflection workflows.
0:16
Let's look at an example of using reflection to improve the database query that an LLM writes
0:23
to fetch data to answer questions. Let's say you run a retail store,
0:27
and you may get questions like, which color product has the highest total sales?
0:32
To answer a question like this, you might have an LLM generate a database query.
0:36
If you've heard of database query languages like SQL, it may generate a query in that type of
0:42
language. But if you're not familiar with SQL, don't worry about it.
0:45
But after writing a database query, instead of using that directly to fetch information from
0:51
the database, you may have an LLM, the same or different LLM, reflect on the version one database
0:57
query and update it to maybe an improved one, and then execute that database query against the
1:02
database to fetch information to finally have an LLM answer the question.
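A minimal sketch of that pipeline, using SQLite and the call_llm helper from earlier, might look like this; the database file and the schema named in the prompt are hypothetical.

import sqlite3

question = "Which color product has the highest total sales?"

# Step 1: direct generation of a first-draft query.
sql_v1 = call_llm(
    "Write a SQLite query to answer the question below. Return only the SQL.\n"
    "Schema: products(id, name, color, price), sales(product_id, quantity, sold_at)\n"
    f"Question: {question}"
)

# Step 2: reflection on the draft query.
sql_v2 = call_llm(
    "Reflect on this SQLite query. Check the joins, grouping, and aggregation, "
    f"and return only an improved query.\n\n{sql_v1}"
)

# Step 3: execute the reflected query, then have an LLM phrase the final answer.
rows = sqlite3.connect("store.db").execute(sql_v2).fetchall()
answer = call_llm(f"Question: {question}\nQuery results: {rows}\nAnswer in one sentence.")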
1:07
So the question is, does using a second LLM to reflect and improve
1:12
on the database or SQL query actually improve the final output?
1:16
In order to evaluate this, I might collect a set of questions or set of prompts together with
1:23
ground truth answers. So maybe one would be, how many items were sold in May 2025?
1:28
What's the most expensive item in the inventory? How many styles are carried in my store?
1:33
And I write down, for maybe 10 or 15 prompts, the ground truth answer.
1:40
Then you can run this workflow without reflection. So without reflection would mean to take the SQL
1:46
query generated by the first LLM and to just see what answer it gives. And with reflection would
1:51
mean to take the database query generated after the second LLM has reflected on it to see what
1:57
answer that fetches from the database. And then we can measure the percentage of correct answers
2:03
from no reflection and with reflection. In this example, no reflection gets the answers
2:08
right 87% of the time, and with reflection it gets them right 95% of the time. And this would suggest that
2:14
reflection is meaningfully improving the quality of the database queries I'm able to get to pull
2:21
out the correct answer. One thing that developers often end up doing as well is rewriting the reflection
2:26
prompt. So for example, do you want to add to the reflection prompt an instruction to make the
2:32
database query run faster or make it clearer? Or you may just have different ideas for how to
2:38
rewrite either the initial generation prompt or the reflection prompt. Once you put in place
2:44
evals like this, you can quickly try out different ideas for these prompts and measure the percentage
2:50
correct your system has as you change the prompts in order to get a sense of which prompts work
2:55
best for your application. So if you're trying out a lot of prompts, building evals is important.
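One very simple way to set that up is sketched below. It assumes a hypothetical answer_question(question, use_reflection) function wrapping the workflow above, and the eval entries are placeholders for your own ground-truth set.

# Ground-truth eval set of (question, expected answer) pairs; entries are illustrative.
eval_set = [
    ("How many items were sold in May 2025?", "1301"),
    ("What is the most expensive item in the inventory?", "..."),  # fill in your ground truth
    # ... maybe 10 to 15 prompts in total
]

def percent_correct(use_reflection: bool) -> float:
    correct = 0
    for question, truth in eval_set:
        # answer_question is the hypothetical wrapper around the workflow described above.
        answer = answer_question(question, use_reflection=use_reflection)
        correct += int(truth.lower() in answer.lower())  # simple string-match grading
    return 100 * correct / len(eval_set)

print("No reflection:  ", percent_correct(use_reflection=False))
print("With reflection:", percent_correct(use_reflection=True))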
3:02
It really helps you have a systematic way to choose between the different prompts you might
3:07
be considering. But this example is one where you can use objective evals because there is a
3:14
right answer. The number of items sold was 1,301 and the answer is either right or wrong. How about
3:21
applications where you need more subjective rather than objective evaluations? In the plotting
3:27
example that we saw in the last video, without reflection we had the stacked bar graph, and with reflection
3:32
we had this graph. But how do we know which plot is actually better? I know I like the latter one
3:38
better, but with different graphs varying on different dimensions, how do we figure out which
3:43
one is better? And measuring which of these plots is better is more of a subjective criteria rather
3:51
than a purely black-and-white objective one. So for these more subjective criteria, one thing
3:58
you might do is use an LLM as a judge. And maybe a basic approach to do this might be to feed both
4:04
plots into an LLM, a multi-modal LLM that can accept two images as input, and just ask it which image
4:10
is better. It turns out this doesn't work that well. I'll share an even better idea in a second. But one
4:16
thing you could do might be to also give it some criteria by which to evaluate the two plots, such
4:21
as clarity, how nice-looking they are, and so on. But it turns out that there are some known issues with
4:26
using LLMs to compare two inputs to tell you which one is better. First, it turns out the answers are
4:32
often not very good. It could be sensitive to the exact wording of the prompt of the LLM as a judge,
4:37
and sometimes the rank ordering doesn't correspond that well to human expert judgment. And one
4:43
manifestation of this is many LLMs will have a position bias. Many LLMs, it turns out, will often
4:48
pick the first option more often than the second option. And in fact, I've worked with a lot of LLMs
4:54
where given two choices, whichever choice I present first, it will say the first choice is better.
5:01
And maybe some LLMs prefer the second option, but I think most LLMs prefer the first option.
5:06
Instead of asking an LLM to compare a pair of inputs, grading with a rubric can give more
5:11
consistent results. So, for example, you might prompt an LLM, given a single image, to
5:18
assess the attached image against the quality rubric, and the rubric or grading criteria may
5:23
have clear criteria like does the plot have a clear title, are the axis labels present,
5:27
is it an appropriate chart type, and so on, with a handful of criteria like this. And it turns out
5:32
that instead of asking the LLM to grade something on a scale of 1 to 5, which it tends not to be
5:38
well calibrated on, if you instead give it, say, 5 binary criteria, that is, five 0-1 criteria, and have it give
5:45
5 binary scores, and you add up those scores to get a number from 0 to 5, or 0 to 10 if you have
5:51
10 binary criteria, that tends to give more consistent results. And so if we were to gather a
5:58
handful, say 10-15 user queries for different visualizations that the user may want to have
6:04
of the coffee machine sales, then you can have it generate images without reflection or generate
6:11
images with reflection, and use a rubric like this to score each of the images to then check
6:17
the degree to which or whether or not the images generated with reflection are really better than
6:23
the ones without reflection. And then once you've built up a set of evals like this, if ever you
6:29
want to change the initial generation prompt or you want to change the reflection prompt, you can
6:33
also rerun this eval to see if, say, updating one of your prompts allows the system to generate images
6:40
that score more points according to this rubric. And so this too gives you a way to keep on tuning
6:47
your prompts to get better and better performance. What you may find when building evaluations for
6:53
reflection or for other agentic workflows is that when there is an objective criteria, code-based
6:58
evaluation is usually easier to manage. And in the example that we saw with the database query, we
7:04
built up a database of ground truth examples and ground truth outputs and just wrote code to see
7:10
how often the system generated the right answer, which is a really objective evaluation metric. In contrast,
7:17
for more subjective tasks, you might use an LLM as a judge, but it usually takes a little
7:22
bit more tuning, such as having to think through what rubric you may want to use to get the LLM
7:27
as a judge to be well calibrated or to output reliable evals. So I hope that gives you a sense
7:33
of how to build evals to evaluate reflections or more generally even to evaluate different
7:38
agentic workflows. Knowing how to do evals well is really important for how you build agentic
7:45
workflows effectively, and you'll hear me say more about this in later videos as well. But now that
7:51
you have a sense of how to use reflection, what I hope to do in the next video is a deep dive into
7:57
one aspect of it, which is when you can get additional information from outside and this
8:03
turns out to make reflection work much better. So in the final video of this module, let's take a
8:08
look at that technique for making your reflection workflows work much better. I'll see you in the
8:14
next video.
2.5 Using external feedback
0:03
Reflection with external feedback, if you can get it, is much more powerful than reflection
0:05
using the LLM as the only source of feedback. Let's take a look.
0:09
When I'm building an application, and if I'm just prompt engineering for direct generation
0:14
of a zero-shot prompting, this is what performance might look like over time,
0:18
where initially, as I tune the prompt, the performance improves for a while,
0:21
but then after a while, it sort of plateaus or flattens out, and despite further engineering
0:27
the prompt, it's just hard to get that much better level of performance.
0:31
So instead of wasting all this time on tuning the prompt, sometimes it'd be better if only
0:35
earlier on in the process, I had started adding reflection, and sometimes that gives a bump in
0:41
performance. Sometimes it's smaller, sometimes a bigger bump, but that adds complexity.
0:46
But if I had started adding in reflection, maybe at this point in the process,
0:49
and then started tuning the reflection prompt, then maybe I end up with a performance that
0:54
looks like this. But it turns out that if I'm able to get external feedback,
0:58
so that the only source of new information isn't just an LLM reflecting on the same
1:03
information as it had before, but some new external information, then sometimes,
1:08
as I continue to tune the prompts and tune the external feedback, you end up with
1:12
an even higher level of performance. So something to consider if you are working
1:17
on prompt engineering, and you feel that your efforts are seeing diminishing returns,
1:23
that you're tuning a lot of prompts, but it's just not getting that much better,
1:26
then maybe consider adding reflection, or even better, seeing if there's some external feedback
1:31
you can inject to bump the performance curve off this flattening-out red line to maybe
1:36
some higher trajectory of performance improvement. Just as a reminder, we saw earlier,
1:42
one source of feedback, if you're writing code, would be to just execute the code
1:47
and see what output it generates, output or error messages, and feed that output back to the LLM
1:53
to let it have that new information to reflect, and then use that information to write a new
1:58
version of the code. Here are a few more examples of when software code or tools can create new
2:05
information to help the reflection process. If you're using an LLM to write emails, and it
2:10
sometimes mentions competitors' names, then if you write code or build a software tool
2:15
to just carry out pattern matching, maybe via regular expression pattern matching to search
2:20
for competitors' names in the output, then whenever you find a competitor's name, you just feed that
2:25
back to the LLM as a criticism or as input. That's very useful information to tell it to just rewrite
2:32
the text without mentioning those competitors. Or as another example, you might use web search
2:39
or look at other trusted sources in order to fact-check an essay. So if your research
2:44
agent says the Taj Mahal was built in 1648, well, technically the Taj Mahal was actually commissioned
2:51
in 1631, and it was finished in 1648. So maybe this isn't exactly incorrect, but it doesn't
2:58
capture the accurate history either. In order to more accurately represent when this beautiful
3:04
building was built, if you do a web search to pull up a snippet explaining exactly the period
3:11
that the Taj Mahal was built and give that as additional input to your reflection agent, then
3:16
it may be able to use that to write a better version of the text on the history of the Taj Mahal.
3:21
One last example, if you're using an LLM to write copy, maybe for a blog post or for a research
3:27
paper abstract, but what it writes is sometimes over the word limit. LLMs are still not very good
3:32
at following exact word limits. Then if you implement a word count tool, just write code to
3:37
count the exact number of words, and if it exceeds the word limit, then feed that word count back to
3:44
the LLM and ask it to try again. Then this helps it to more accurately hit the desired length of
3:51
the output you wanted to generate. So in each of these three examples, you can write a piece of
3:57
code to help find additional facts about the initial output to then give those facts, be it
4:04
the competitor's name that you found, information from a web search, or the exact word count, to feed into
4:09
the reflection LLM in order to help it do a better job thinking about how to improve the output.
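For instance, the competitor-name check and the word-count check could be as simple as the code below; the competitor list and word limit are hypothetical.

import re

COMPETITORS = ["AcmeCorp", "Globex"]  # hypothetical competitor names to screen for
WORD_LIMIT = 150                      # hypothetical length limit for the copy

def external_feedback(draft: str) -> list:
    # Collect concrete, externally verified facts about the draft.
    feedback = []
    for name in COMPETITORS:
        if re.search(rf"\b{re.escape(name)}\b", draft, flags=re.IGNORECASE):
            feedback.append(f"The draft mentions the competitor '{name}'; rewrite without it.")
    word_count = len(draft.split())
    if word_count > WORD_LIMIT:
        feedback.append(f"The draft is {word_count} words, over the {WORD_LIMIT}-word limit.")
    return feedback

# Anything returned here gets appended to the reflection prompt so the LLM
# can rewrite the draft using this new external information.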
4:17
Reflection is a powerful tool, and I hope you find it useful in a lot of your own work. In the next
4:23
module, we'll build on this to talk about tool use, where in addition to the handful of tool examples
4:29
you saw, you'll learn how to systematically get your LLM to call different functions, and this will make
4:35
your agentic applications much more powerful. I hope you enjoyed learning about reflection.
4:41
Maybe you can now reflect on what you just learned. I hope to see you in the next video.