Han ZL · 2025-10-17
Module 4: Practical Tips for Building Agentic AI
4.1 Evaluations (evals)
0:05
In this module, I'd like to share with you practical tips for building agentic AI workflows.
0:04
I hope that these tips will enable you to be much more effective than the typical developer
0:10
at building these types of systems. I find that when developing an agentic AI system,
0:16
it's difficult to know in advance where it will work and where it won't work so well,
0:21
and thus where you should focus your effort. So very common advice is to try to build even a
0:27
quick and dirty system to start, so you can then try it out and look at it to see where it may not
0:34
yet be working as well as you wish, to then have much more focused efforts to develop it even
0:40
further. In contrast, I find that it's sometimes less useful to sit around for too many weeks
0:46
theorizing and hypothesizing how to build it. It's often better to just build a quick system in a
0:53
safe, reasonable way that doesn't leak data, kind of do it in a responsible way, but just build
0:58
something quickly so you can look at it and then use that initial prototype to prioritize and try
1:03
further development. Let's start with an example of what might happen after you've built a prototype.
1:10
I want to use as our first example the invoice processing workflow that you've seen previously,
1:16
with the task to extract four required fields and then to save it to a database record. After having
1:22
built such a system, one thing you might do is find a handful of invoices, maybe 10 or 20 invoices,
1:28
and go through them and just take a look at their output and see what went well and if there were
1:33
any mistakes. So let's say you look through 20 invoices, you find that invoice 1 is fine, the
1:38
output looks correct. For invoice 2, maybe it confused the date of the invoice, that is when
1:43
was the invoice issued, with the due date of the invoice, and in this task we want to extract the
1:48
due date so we can issue payments on time. So then I might note down in a document or in a
1:53
spreadsheet that for invoice 2, the dates were mixed up. Maybe invoice 3 was fine, invoice 4 was
1:58
fine, and so on. But as I go through these examples, I find that there are quite a lot of examples where it had
2:03
mixed up the dates. So it is based on going through a number of examples like this, that in this case
2:10
you might conclude that one common error mode is that it is struggling with the dates. In that case,
2:17
one thing you might consider would be to of course figure out how to improve your system to make it
2:23
extract due dates better, but also maybe write an eval to measure the accuracy with which it is
2:29
extracting due dates. In comparison, if you had found that it was extracting the biller address
2:35
incorrectly, who knows, maybe you have billers with unusual sounding names and so maybe it
2:40
struggles with billers, or especially if you have international billers whose names may not even all
2:45
be English letters, then you might instead focus on building an eval for the biller address. So one
2:51
of the reasons why building a quick and dirty system and looking at the output is so helpful
2:56
is that it even helps you decide what you want to put the most effort into evaluating. Now if you've
3:03
decided that you want to modify your system to improve the accuracy with which it is extracting
3:09
the due date of the invoice, then to track progress it might be a good idea to create an
3:14
evaluation or an eval to measure the accuracy of date extraction. There are probably multiple ways
3:20
one might go about this, but let me share with you how I might go about this. To create a test set or
3:25
an evaluation set, I might find 10 to 20 invoices and manually write down what is the due date. So
3:33
maybe one invoice has a due date of August 20th, 2025, and I write it down as a standard year, month,
3:39
date format. And then to make it easy to evaluate in code later, I would probably write the prompt
3:46
to the LLM to tell it to always format the due date in this year, month, date format. And with that,
3:51
I can then write code to extract out the one date that the LLM has output, which is the due date,
3:56
because that's the one date we care about. So this is a regular expression pattern match, you know,
4:01
four digits for the year, two for the month, two for the date, and extract that out. And then I can
4:06
just write code to test if the extracted date is equal to the actual date, that is, the ground
4:11
truth annotation I had written down. So with an eval set of, say, 20 or so invoices, I build and
4:18
make changes to see if the percentage of time that it gets the extracted date correct is hopefully
4:24
going up as I tweak my prompts or tweak other parts of my system. So just to summarize what
4:29
we've seen so far, we build a system, then look at outputs to discover where it may be behaving in
4:35
an unsatisfactory way, such as due dates are wrong. Then to drive improvements to this important
4:40
output, put in place a small eval with, say, just 20 examples to help us track progress.
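(A minimal sketch of what such an eval could look like in Python is below; the run_agent function, the sample invoice text, and the exact output format are assumptions for illustration, not code from the course.)

```python
import re

# Hypothetical eval set: (invoice_text, hand-labeled due date in YYYY-MM-DD format).
eval_set = [
    ("Invoice #1042 ... Payment due 2025-08-20 ...", "2025-08-20"),
    # ... roughly 10-20 more hand-annotated invoices ...
]

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")  # four digits, two, two

def extract_due_date(llm_output):
    """Pull out the single YYYY-MM-DD date the LLM was instructed to output."""
    match = DATE_PATTERN.search(llm_output)
    return match.group(0) if match else None

def due_date_accuracy(run_agent, eval_set):
    """run_agent stands in for your invoice-processing workflow."""
    num_correct = 0
    for invoice_text, ground_truth in eval_set:
        llm_output = run_agent(invoice_text)  # e.g. "Due date: 2025-08-20"
        if extract_due_date(llm_output) == ground_truth:
            num_correct += 1
    return num_correct / len(eval_set)
```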
4:46
And this lets me go back to tune prompts, try different algorithms, and so on to see if I can
4:50
move up this metric of due date accuracy. So this is what improving an Agentic AI workflow will often
4:57
feel like. Look at the output, see what's wrong, then if you know how to fix it, just fix it. But
5:01
if you need a longer process of improving it, then put in place an eval and use that to drive
5:05
further development. One other thing to consider is if after working for a while, if you think
5:10
those 20 examples you had initially aren't good enough, maybe they don't cover all the cases you
5:15
want, or maybe 20 examples is just too few, then you can always add to the eval set over time to
5:19
make sure it better reflects your personal judgments on whether or not the system's performance is
5:25
sufficiently satisfactory. This is just one example. For the second example, let's look at
5:30
building a marketing copy assistant for writing captions for Instagram, where to keep things
5:35
succinct, let's say our marketing team tells us that they want captions that are at most 10 words
5:40
long. So we would have an image of a product, say a pair of sunglasses that we want to market,
5:45
and then have a user query, like please write a caption to sell these sunglasses, and then have an
5:52
LLM, or large multimodal model, analyze the image and the query and generate a description of the
5:58
sunglasses. And there are lots of different ways that a marketing copy assistant may go wrong,
6:03
but let's say that you look at the output and you find that the copy or the text generated mostly
6:08
sounds okay, but maybe it's just sometimes too long. So for the sunglasses input, it generated 17
6:13
words; the coffee machine, it's okay; the stylus, okay; the blue shirt, 14 words; the blender,
6:18
11 words. So it looks like in this example, the LLM is having a hard time adhering to the length
6:24
guideline. So again, there are lots of things that could have gone wrong with a marketing copy
6:28
assistant. But if you find that it's struggling with the length of the output, you might build
6:33
an eval to track this so that you can make improvements and make sure it's getting better
6:39
at adhering to the length guideline. So to create an eval to measure the text length, what you
6:44
might do is create a set of test inputs, so maybe a pair of sunglasses, a coffee machine, and so on,
6:49
and maybe create 10 to 20 examples. Then you would run each of them through your system and write
6:56
code to measure the word count of the output. So this is Python code to measure the word count of a
7:02
piece of text. Then lastly, you would compare the length of the generated text to the 10 word target
7:10
limit. So if the word count is at most 10, then num_correct is incremented by one.
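(A sketch of that eval in Python might look like the following; generate_caption and the test inputs are placeholders for your own captioning workflow and examples.)

```python
MAX_WORDS = 10  # the marketing team's length guideline; the same target for every example

def word_count(text):
    return len(text.split())

def length_compliance(generate_caption, test_inputs):
    """Fraction of generated captions that stay within the word limit."""
    num_correct = 0
    for item in test_inputs:
        caption = generate_caption(item)
        if word_count(caption) <= MAX_WORDS:
            num_correct += 1
    return num_correct / len(test_inputs)
```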
7:15
One difference between this and the previous invoice processing example is that there is no per-example ground truth.
7:22
The target is just 10, same for every single example. Whereas in contrast, for the invoice
7:26
processing example, we had to generate a custom target label that is the correct due date of the
7:32
invoice, and we're testing the outputs against that per example ground truth. I know I used a
7:38
very simple workflow for generating these captions, but these types of evals can be applied to much
7:43
more complex generation workflows as well. Let me touch on one final example in which we'll revisit
7:49
the research agents we've been looking at. If you look at the output of the research agents on
7:55
different input prompts, let's say that when you ask it to write an article on recent breakthroughs
8:01
in black hole science, you find that it missed some high-profile result that got a lot of news coverage.
8:07
So this is an unsatisfactory result. Or if you asked it to research renting versus buying a
8:11
home in Seattle, well, it seems to do a good job. Or robotics for harvesting fruits. Well,
8:16
it didn't mention a leading equipment company. So based on this evaluation, it looks like
8:22
sometimes it misses a really important point that a human expert writer would have captured. So then
8:29
I would create an eval to measure how often it captures the most important points. For example,
8:34
you might come up with a number of example prompts on black holes, robotic harvesting,
8:40
and so on. And for each one, come up with, let's say, three to five gold standard discussion points
8:45
for each of these topics. Notice that here we do have a per example annotation because the
8:51
gold standard talking points, that is the most important talking points, they are different for
8:56
each of these examples. With these ground truth annotations, you might then use an LLM-as-a-judge to
9:01
count how many of the gold standard talking points were mentioned. And so an example prompt might be
9:07
to say, determine how many of the five gold standard talking points are present in the
9:12
provided essay. You would have the original prompt, the essay text, the gold standard points, and so on,
9:16
and have it return a JSON object with two keys: a score, from zero to five, counting how many of the points
9:21
were covered, as well as an explanation. And this allows you to get a score for each prompt in your evaluation set.
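(As a rough sketch, an LLM-as-a-judge call for this might look as follows; llm_call is a placeholder for however you invoke your chosen model, and the prompt wording is illustrative rather than the exact prompt from the course.)

```python
import json

JUDGE_PROMPT = """Determine how many of the gold-standard talking points below
are present in the provided essay.

Essay:
{essay}

Gold-standard talking points:
{points}

Return a JSON object with two keys:
  "score": an integer from 0 to 5 counting how many points are covered,
  "explanation": a short justification.
"""

def judge_essay(llm_call, essay, gold_points):
    """Score one essay against its per-example gold-standard talking points."""
    prompt = JUDGE_PROMPT.format(essay=essay, points="\n".join(gold_points))
    response = llm_call(prompt)
    return json.loads(response)  # e.g. {"score": 3, "explanation": "..."}
```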
9:28
In this example, I'm using an LLM-as-a-judge to count how many of the talking points
9:35
were mentioned because there's so many different ways to talk about these talking points, and so a
9:40
regular expression or a code for simple pattern matching might not work that well, which is why
9:46
you might use an LLM-as-a-judge and treat this as a slightly more subjective evaluation for whether
9:52
or not, say, event horizons were adequately mentioned. So this is your third example of how
9:57
you might build evals. In order to think about how to build evals for your application, the evals
10:04
you build will often have to reflect whatever you see or you're worried about going wrong in your
10:10
application. And it turns out that broadly, there are two axes of evaluation. On the top axis is the
10:18
way you evaluate the output. In some cases, you evaluate it by writing code with objective evals,
10:26
and sometimes you use an LLM-as-a-judge for more subjective evals. On the other axis is whether
10:34
you have a per-example ground truth or not. So for checking invoice date extraction, we were writing
10:43
code to evaluate if we got the actual date, and that had a per-example ground truth because each
10:49
invoice has a different actual date. But in the example where we checked marketing copy length,
10:55
every example had a length limit of 10, and so there was no per-example ground truth for that
11:02
problem. In contrast, for counting gold standard talking points, there was a per-example ground
11:07
truth because each article had different important talking points. But we used an LLM-as-a-judge to
11:13
read the essay to see if those topics were adequately mentioned because there's so many
11:17
different ways to mention the talking points. And the last of the four quadrants would be LLM-as-a-judge
11:23
with no per-example ground truth. And one place where we saw that was if you are grading
11:30
charts with a rubric. This is when we're looking at visualizing the coffee machine sales, and if
11:35
you ask it to create a chart according to a rubric, such as whether it has clear axis labels and so on,
11:40
there is the same rubric for every chart, and that would be using an LLM-as-a-judge but without a
11:46
per-example ground truth. So I find this two-by-two grid is maybe a useful way to think about the
11:51
different types of evals you might construct for your application. And by the way, these are
11:56
sometimes also called end-to-end evals because one end is the input end, which is the user query
12:01
prompt, and the other end is the final output. And so all of these are evals for the entire end-to-end
12:08
system's performance. So just to wrap up this video, I'd like to share a few final tips for
12:13
designing end-to-end evals. First, quick and dirty evals are fine to get started. I feel like I see
12:20
quite a lot of teams that are almost paralyzed because they think building evals is this
12:25
massive multi-week effort, and so they take longer than would be ideal to get started. But I think
12:32
just as you iterate on an agentic workflow and make it better over time, you should plan to
12:37
iterate on your evals as well. So if you put in place 10, 15, 20 examples as your first cut at
12:44
evals and write some code or try prompting an LLM-as-a-judge, just do something to start to get some
12:49
metrics that can complement the human eye at looking at the output, and then there's a blend
12:54
of the two that can drive your decision making. And as the evals become more sophisticated over
12:58
time, you can then shift more and more of your trust to the metric-based evals rather than
13:03
needing to read over hundreds of outputs every time you tweak a prompt somewhere. And as you
13:08
go through this process, you'll likely find ways to keep on improving your evals as well. So if you
13:15
had 20 examples to start, you may then run into places where your evals fail to capture your
13:22
judgment about what system is better. So maybe you update the system and you look at it and you feel
13:28
like this has got to work much better, but your eval fails to show the new system achieving a
13:34
higher score. If that's the case, that's often an opportunity to go maybe collect a larger eval set
13:40
or change the way you evaluate the output to make it correspond better to your judgment as to what
13:45
system is actually working better. And so your evals will get better over time. And lastly,
13:50
in terms of using evals to gain inspiration as to what to work on next, a lot of agentic workflows
13:56
are being used to automate tasks that, say, humans can do. And so I find for such applications,
14:02
I'll look for places where the performance is worse than that of an expert human, and that
14:06
often gives me inspiration for where to focus my efforts, or for what types of examples I can
14:11
maybe get my agentic workflow to work better on than it does currently. So I hope that after you've built
14:18
that quick and dirty system, you think about when it would make sense to start putting in some evals
14:23
to track the potentially problematic aspects of the system, and that that will then help you
14:28
drive improvements in the system. In addition to helping you drive improvements, it turns out that
14:34
there's a way of using evals that helps you home in, out of your entire agentic system, on which
14:40
components are most worth focusing your attention on. Because agentic systems often have many pieces.
14:47
So which piece is going to be most productive for you to spend time working to improve? It turns
14:53
out being able to do this well is a really important skill for driving efficient development
14:58
of agentic workflows. In the next video, I'd like to deep dive into this topic. So let's go on to
15:03
the next video.
4.2 Error analysis and prioritizing next steps
0:07
Let's say you've built an agentic workflow and it's not yet working as well as you wish,
0:04
and this happens to me all the time by the way, I'll often build a quick and dirty system and
0:09
it doesn't do as well as I wish it would, the question is where do you focus your efforts
0:14
to make it better? Turns out agentic workflows have many different components and working on
0:19
some of the components could be much more fruitful than working on some other components. So your
0:25
skill at choosing where to focus your efforts makes a huge difference in the speed with which
0:30
you can make improvements to your system. And I found that one of the biggest predictors for how
0:35
efficient and how good a team is, is whether or not they're able to drive a disciplined error
0:41
analysis process to tell you where to focus your efforts. So this is an important skill. Let's take
0:47
a look at how to carry out error analysis. In the research agent example, we had carried out an error
0:54
analysis in the previous video and we saw that it was often missing key points that a human expert
1:00
would have made in writing essays on certain topics. So now you've spotted this problem that
1:06
it is sometimes missing key points, how do you know what to work on? It turns out that of the many
1:12
different steps in this workflow, almost any of them could have contributed to this problem of
1:17
missing key points. For example, maybe the first LLM was generating search terms that weren't great, so
1:23
it was just searching for the wrong things and did not discover the right articles. Or maybe we used a
1:28
web search engine that just wasn't very good. There are multiple web search engines out there, in fact
1:33
actually quite a few that I tend to use for my own applications, and some are better than others.
1:39
Or maybe web search was just fine, but when we gave the list of web search results to the LLM, maybe it
1:44
didn't do a good job choosing the best handful to download. Maybe web fetch has fewer problems in
1:51
this case, assuming you can fetch web pages accurately. But after dumping the web pages into the LLM,
1:56
maybe the LLM is ignoring some of the points in the documents we had fetched. So it turns out that
2:02
there are teams that sometimes look at this and go by gut to pick one of these components to work on
2:09
and sometimes that works and sometimes that leads to many months of work with very little progress
2:13
in the overall performance of the system. So rather than going by gut to decide which of these
2:20
many components to work on, I think it's much better to carry out an error analysis to better
2:26
understand each step in the workflow. And in particular, I'll often examine the traces and that
2:32
means the intermediate output after each step in order to understand which component's performance
2:38
is subpar, meaning say much worse than what a human expert would do, because that points to where
2:45
there may be room for significant improvement. Let's look at an example. If we ask the research agent
2:50
to write an essay about recent news in black hole science, maybe it outputs search terms like these,
2:56
search for black hole theories Einstein, Event Horizon Telescope Radio, and so on. And I would
3:01
then have a human expert look at these and see are these reasonable web search terms for writing
3:07
about recent discoveries in black hole science. And maybe in this case an expert says these web
3:13
searches look okay, they're pretty similar to what I would do as a human. Then I look at the outputs of
3:20
the web search and look at the URLs returned. So web search would return many different web pages
3:27
and maybe one web page returned says that an elementary school student claims to crack a
3:32
30-year-old black hole mystery from Astro Kid News. And this doesn't look like the most rigorous
3:38
peer-reviewed article. And maybe examining all of the articles that web search returns causes you to
3:44
conclude that it's returning too many blog or popular press types of articles and not enough
3:50
scientific articles to write a research report of the quality that you are looking for. It'd be good
3:56
to just look through the outputs of the other steps as well. Maybe when the LLM picks the best five
4:00
sources it can, you end up with Astro Kid News, SpaceBot 2000, Space Fun News and so on. And it is
4:06
by looking at these intermediate outputs that you can then try to get a sense of the quality of the
4:12
output of each of these steps. To introduce some terminology, the overall set of outputs of all of
4:18
the intermediate steps is often called the trace of a run of this agent. And then some terminology
4:24
you see in other sources as well is the output of a single step is sometimes called a span.
4:30
This is terminology from the computer systems observability literature, where people try to
4:35
figure out what computers are doing. And in this course, I use the word trace quite a bit. I'll use
4:41
the word span a little bit less, but you may see both of these terms on the internet. So by reading
4:46
the traces, you start to get an informal sense of where might be the most problematic components.
4:52
In order to do this in a more systematic way, it turns out to be useful to focus your attention
4:57
on the cases that the system is doing poorly on. Maybe it writes some essays just fine and the
5:02
output is completely satisfactory. So I would put those aside and try to come up with a set of
5:06
examples where for whatever reason, the final output of your research agent is not quite
5:11
satisfactory and just focus on those examples. So this is one of the reasons we call error analysis
5:17
because we want to focus on the cases where the system made an error and we want to go through
5:22
to figure out which components were most responsible for the error in the research agent
5:28
output. In order to make this more rigorous, rather than reading and getting an informal sense,
5:33
you might actually build up a spreadsheet to more explicitly count up where the errors are. And by
5:40
error, I mean when a step outputs something that performs significantly worse than maybe what a
5:47
human expert would have produced, given a similar input to what that component received. So I'll often do this myself in a
5:52
spreadsheet. So I might build a spreadsheet like this. And so for the first query, I'll look at
5:57
recent developments in black hole science. And I see that the search results have too many blog
6:01
posts, popular press articles, not enough scientific papers. And then based on this,
6:07
it is true that the five best sources aren't great. But here I won't say that the five best sources
6:12
did a bad job, because if the inputs to the LLM for selecting the five best sources were all
6:18
non-rigorous articles, then I can't blame the step that picks the five best sources for not picking
6:23
better articles, because it did the best it could have, or did nearly as well as any human
6:28
might have given the same selection to choose from. And then you might go through this for
6:32
different prompts. Renting versus buying in Seattle. Maybe it missed a well-known blog.
6:37
Robotics for harvesting fruit. Maybe in this case, we look at it and say,
6:41
oh, the search terms are too generic and the search results also weren't good and so on.
6:45
And then based on this, I would count up in my spreadsheet how often I observe errors in the
6:52
different components. So in this example, I'm dissatisfied with the search terms 5% of the time,
6:57
but I'm dissatisfied with the search results 45% of the time. And if I actually see this,
7:01
I might just take a careful look at the search terms to make sure that the search terms really
7:05
were okay and that poor choice of search terms were not what led to poor search results. But
7:10
if I really think the search terms are fine, but the search results are not, then I would take a
7:14
careful look at the web search engine I'm using and if there are any parameters I can tune to make
7:18
it bring back more relevant or higher quality results. It's this type of analysis that
7:23
tells me in this example that maybe I really should focus my attention on fixing the search
7:28
results and not on the other components of this agentic workflow. So to wrap up this video,
7:34
I find that it's useful to develop a habit of looking at traces. After you build an agentic
7:40
workflow, go ahead and look at the intermediate outputs to get a feel for what it is actually
7:44
doing at every step so that you can better understand if different steps are performing
7:48
better or worse. And a more systematic error analysis, maybe done with a spreadsheet,
7:54
can let you gather statistics or count up which component performs poorly most frequently. And
7:59
so by looking at what components are doing poorly, as well as where I have ideas for
8:05
efficiently improving different components, then that will let you prioritize what component to
8:10
work on. So maybe a component is problematic, but I don't have any ideas for improving it,
8:15
so that would suggest maybe not prioritizing that as high. But if there is a component that is
8:20
generating a lot of errors, and if I have ideas how to improve that, then that would be a good
8:25
reason to prioritize working on that component. And I just want to emphasize that error analysis
8:31
is a very helpful tool for you to decide where to focus your efforts, because in any complex
8:37
system, there are just so many things you could work on. It's too easy to pick something to work
8:42
on and work on it for weeks or even months, only to discover later that that did not result in
8:47
improved performance in your overall system. And so using error analysis to decide where to focus
8:52
your effort turns out to be incredibly useful for improving your efficiency. In this video,
8:58
we went over error analysis with the research agent example, but I think error analysis is
9:04
such an important topic, I want to go over some additional examples with you.
9:08
So let's go on to the next video, where we'll look at more examples of error analysis.
4.3 More error analysis examples
0:05
I found that for many developers, it's only by seeing multiple examples that you can then
0:05
get practice and hone your intuitions about how to carry out error analysis.
0:09
So let's take a look at two more examples, and we'll look at invoice processing
0:14
and responding to customer emails.
0:16
Here's the workflow that we had for invoice processing, where we had a clear process to
0:21
follow an agentic workflow of identifying the four required fields and then recording
0:27
them in a database.
0:28
In the example from the first video of this module, we said that the system was often
0:32
making a mistake in the due date of the invoice.
0:36
So we can carry out error analysis to try to figure out which of the components it may
0:40
have been due to.
0:41
So for example, did the PDF to text make a mistake, or did the LLM extract the wrong date
0:47
out of whatever was output from the PDF to text component?
0:51
To carry out an error analysis, I would try to find a number of examples where the date
0:56
extracted is incorrect.
0:58
So same as the last video, it's useful to focus on the examples where the performance
1:02
is subpar to try to figure out what went wrong with those examples.
1:05
So ignore the examples that got the date right, but try to find somewhere between 10 and 100
1:10
invoices where it got the date wrong.
1:12
And then I would look through to try to figure out whether the cause of the problem was that PDF
1:18
to text got the date wrong, or was it that the LLM, given the PDF to text output, pulled
1:24
out the wrong date.
1:25
And so you might build up a little spreadsheet like this and go through 20 invoices and just
1:30
count up how often did PDF to text extract the dates or the text incorrectly so that
1:35
even a human couldn't tell what the due date is, versus the PDF to text looking good enough,
1:40
but the LLM, when asked to pull the dates, somehow pulled out the wrong date, like maybe
1:44
identifying the invoice date rather than the due date of the invoice.
1:48
So in this example, it looks like the LLM data extraction was responsible for a lot
1:52
more errors.
1:53
So this tells me that maybe I should focus my efforts on the LLM data extraction component
1:57
rather than on PDF to text.
1:59
And this is important because if not for this error analysis, I can imagine some teams spending
2:05
weeks or months trying to tune the PDF to text only to discover after that time that
2:10
it did not make much of an impact to the final system's performance.
2:14
Oh, and by the way, these percentages here at the bottom may not add up to 100% because
2:20
these errors are not mutually exclusive.
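(A small sketch of that tally in Python is below; the rows and component names are hypothetical annotations you would have filled in yourself by reading the traces.)

```python
from collections import Counter

# Each row records, for one failing invoice, which components a human judged
# to have produced a subpar output (both can be True at once).
rows = [
    {"pdf_to_text_error": False, "llm_extraction_error": True},
    {"pdf_to_text_error": True,  "llm_extraction_error": True},
    {"pdf_to_text_error": False, "llm_extraction_error": True},
    # ... the rest of the ~20 invoices you went through ...
]

counts = Counter()
for row in rows:
    for component, had_error in row.items():
        counts[component] += int(had_error)

for component, n in counts.items():
    # These percentages can sum to more than 100% because errors overlap.
    print(f"{component}: {n / len(rows):.0%} of failing examples")
```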
2:22
To look at one last example, let's go back to the agentic workflow for responding to
2:27
customer emails, where the LLM, given a customer email like this, asking for an order, would
2:34
pull up the order details, fetch the information from the database, then draft a response for
2:39
a human to review.
2:40
So again, I would find a number of examples where, for whatever reason, the final output
2:46
is unsatisfactory and then try to figure out what had gone wrong.
2:50
And so some things that could go wrong.
2:52
Maybe the LLM had written an incorrect database query.
2:56
So when the query was sent to the database, it just did not successfully pull up the customer
3:01
info.
3:02
Or maybe the database has corrupted data.
3:05
So even though the LLM wrote a completely appropriate database query, maybe in SQL or some other
3:10
query language, the database did not have the correct information.
3:13
Or maybe given the correct information about the customer order, the LLM wrote an email
3:17
that was somehow not quite right.
3:20
So again, I would look through a handful of emails where the final output was unsatisfactory
3:25
and try to figure out what had gone wrong.
3:26
So maybe in email one, we find that the LLM had asked for the wrong table in the query,
3:31
just asked for the wrong data in the way it queried the database.
3:34
In email two, maybe I find that the database actually has an error.
3:38
And maybe given that input, the LLM somehow wrote a suboptimal email as well, and so
3:44
on.
3:44
And in this example, after going through many emails, maybe I find that the most common
3:50
error is in the way the LLM is writing a database query, say a SQL query, in order to fetch
3:57
the relevant information.
3:58
Whereas the database is mostly correct, although there are a few data errors there.
4:02
And the way the LLM writes the email also has some errors.
4:05
Maybe it doesn't write it quite right 30% of the time.
4:08
And this tells me that it'd be most worthwhile maybe for me to improve the way the LLM is
4:13
writing queries.
4:14
Second most important would be maybe to improve the prompting for how it writes the final email.
4:20
An analysis like this can tell you that, while the system gets lots
4:25
of things right, of all the things it gets not quite right, 75% of the problems
4:29
are from the database query.
4:31
This is incredibly helpful information to tell you where to focus your efforts.
4:36
When I'm developing Agentic AI workflows, I'll often use this type of error analysis
4:40
to tell me where to focus my attention in terms of what to work on next.
4:45
When you've made that determination, it turns out that to complement the end-to-end
4:49
evals that we spoke about earlier in this module, it's often useful to evaluate not
4:54
just the entire end-to-end system, but also individual components, because that can make
4:59
you more efficient in how you improve the one component that, say, error analysis has
5:05
caused you to decide to focus your attention on.
5:08
So let's go on to the next video to learn about component-level evals.
4.4 Component-level evaluations
0:04
Let's take a look at how to build and use component-level evals.
0:04
In our example of a research agent, we said that the research agent was sometimes missing
0:09
key points. But if the problem was web search, then every time we change the web search engine,
0:15
we would need to rerun the entire workflow. That can give us a good metric for performance,
0:20
but that type of eval is expensive. Moreover, this is a pretty complicated workflow,
0:26
so even if web search made things a little bit better, maybe noise introduced by the randomness
0:31
of other components would make it harder to see little improvements to the web search quality.
0:38
So as an alternative to only using end-to-end evals, what I would do is consider building an
0:43
eval just to measure the quality of the web search component. For example, to measure the
0:48
quality of the web search results, you might create a list of gold standard web resources.
0:53
So for a handful of queries, have an expert say, these are the most authoritative sources that if
0:58
someone was searching the internet, they really should find these web pages or any of these web
1:03
pages would be good. And then you can write code to capture how many of the web search outputs
1:09
correspond to the gold standard web resources. There are standard metrics from information retrieval,
1:15
like the F1 score (don't worry about the details if you don't know what that means), that
1:18
allow you to measure, for a list of web pages returned by web search,
1:23
how much it overlaps with what an expert determined are the gold standard web resources.
1:29
With this, you're now armed with a way to evaluate just the quality of the web search component.
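(Here is a minimal sketch of such an overlap metric in Python; the URLs are made-up placeholders, and in practice you might normalize URLs or match at the domain level.)

```python
def search_eval(returned_urls, gold_urls):
    """Precision, recall, and F1 of one query's results against expert gold pages."""
    returned, gold = set(returned_urls), set(gold_urls)
    hits = len(returned & gold)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical usage for one query:
gold = ["https://example.org/eht-2024-results", "https://example.org/ligo-merger-review"]
returned = ["https://example.org/eht-2024-results", "https://astrokid.example.com/mystery"]
print(search_eval(returned, gold))  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```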
1:34
And so as you vary the parameters or hyperparameters of how you carry out web search,
1:40
such as if you swap in and out different web search engines, so maybe try Google and Bing
1:45
and DuckDuckGo and Tavily and You.com and others, or as you vary the number of results or as you vary
1:50
the date range that you ask the web search engines to search over, this can very quickly let you
1:54
judge if the quality of the web search component is going up, and lets you make more incremental
2:01
improvements. And then of course, before you call the job done, it would be good to run an
2:05
end-to-end eval to make sure that after tuning your web search system for a while that you are
2:10
improving the overall system performance. But during that process of tuning these hyperparameters
2:16
one at a time, you could do so much more efficiently by evaluating just one component
2:20
rather than needing to rerun end-to-end evals every single time. So component level evals can
2:27
provide a clearer signal for specific errors. It actually lets you know if you're improving
2:32
the web search component or whatever component you're working on and avoid the noise in the
2:37
complexity of the overall end-to-end system. And if you're working on a project where you have
2:43
different teams focused on different components, it can also be more efficient for one team to just
2:48
have its own very clear metric to optimize without needing to worry about all of the other components.
2:53
And so this lets the team work on a smaller, more targeted problem faster. So when you've decided to
3:00
work on improving a component, consider if it's worth putting in place a component-wise eval and
3:05
if that will let you go faster on improving the performance of that component. Now the one thing
3:11
you may be wondering is, if you decided to improve a component, how do you actually go about making
3:16
that one component work better? Let's take a look at some examples of that in the next video.
4.5 How to address problems you identify
0:06
An agentic workflow may comprise many different types of components, and so your tools for
0:05
improving different components will be pretty different. But I'd like to share with you some
0:09
general patterns I've seen. Some components in your agentic workflow will be non-LLM-based,
0:15
so it may be something like a web search engine or a text retrieval component,
0:20
if that's part of your RAG or Retrieval Augmented Generation system, something for code execution,
0:24
or maybe a separately trained machine learning model, maybe for speech recognition
0:28
or detecting people in pictures, and so on. So sometimes these non-LLM-based components will
0:34
have parameters or hyperparameters that you can tune. So for web search, you can tune things like
0:39
the number of results or maybe the date range that you ask the web search engine to consider.
0:44
For a RAG text retrieval component, you might change the similarity threshold that determines
0:49
what pieces of text it considers similar, or the chunk size. Often RAG systems will take text and
0:55
chop it up into smaller chunks for matching, so those are some hyperparameters you could tune. Or for
1:00
people detection, you might change the detection threshold, so how sensitive it is and how likely
1:05
it is to declare it found a person, and this will trade off the false positives and false
1:08
negatives. If you didn't follow all the details of the hyperparameters I just discussed, don't worry
1:12
about it. The details aren't that important, but often these components have parameters that
1:16
you can tune. And then of course, you can also try to replace the component. I do this a lot
1:21
in my agentic workflows, where I'll swap in different RAG search engines or swap in different
1:26
RAG providers and so on, just to see if some other provider might work better. Because of the
1:32
diversity of non-LLM-based components, I think the techniques for how to improve it will be
1:37
more diverse and dependent on exactly what that component is doing. For an LLM-based component,
1:43
here are some options you might consider. One would be to try to improve your prompts. So maybe
1:49
try to add more explicit instructions. Or if you know what few-shot prompting is, that refers to
1:55
adding one or more concrete examples of an input and a desired output. And so few-shot
2:01
prompting, which you can learn about from some DeepLearning.AI short courses as well, is a technique
2:06
that can give your LLM some examples to hopefully help it produce better outputs.
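(For instance, a few-shot prompt for the due-date extraction step might look something like this; the invoice snippets and the single worked example are made up for illustration.)

```python
# One worked example (input snippet plus desired output) is shown to the model
# before the new invoice it should process.
FEW_SHOT_PROMPT = """Extract the due date from the invoice and return it as YYYY-MM-DD.

Example invoice:
"Invoice date: 2025-07-01. Payment is due by August 20, 2025."
Example answer:
2025-08-20

Invoice:
"{invoice_text}"
Answer:
"""

prompt = FEW_SHOT_PROMPT.format(
    invoice_text="Issued 2025-09-03, terms net 30, payment due 2025-10-03."
)
```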
2:12
Or you can also try a different LLM. So with AI Suite or other tools, it could be pretty easy to try
2:19
multiple LLMs and then you can use evals to pick the best model for your application. Sometimes,
2:25
if a single step is too complex for one LLM to do, you can consider if you want to decompose
2:30
the task into smaller steps. Or maybe decompose it into a generation step and then a reflection
2:35
step. But more generally, if you have instructions that are very complex all within one step, maybe a
2:40
single LLM has a hard time following all those instructions. And you can break the task down
2:45
into smaller steps that may be easier for, say, two or three LLM calls in a row to carry out accurately.
2:51
And lastly, something to try when the other methods aren't working well enough is to consider
2:56
fine-tuning a model. This tends to be quite a bit more complex than the other options, so it can be
3:02
quite a bit more expensive as well in terms of developer time to implement. But if you have some
3:06
data that you can use to fine-tune an LLM on, that could give you much better performance than
3:14
prompting alone. So I tend not to fine-tune a model until I've really exhausted the other
3:19
options, because fine-tuning tends to be quite complex. But for applications where after trying
3:25
everything else, if I'm still at, say, 90% performance or 95% performance, and I really
3:30
need to eke out those last few percentage points of improvement, then sometimes fine-tuning my own
3:35
custom model is a great technique to use. I tend to do this only on the more mature applications
3:41
because of how costly it is. It turns out that when you're trying to choose an LLM to use,
3:47
one thing that's very helpful for you as a developer is if you have good intuitions about
3:52
how intelligent or how capable different large language models are. One thing you can do is just
3:56
try a lot of models and see what works best. But I find that as I work with different models,
4:00
I start to hone intuitions about which models work best for what types of tasks. And when you hone
4:06
those intuitions, you can be more efficient as well in writing good prompts for the model as
4:10
well as choosing good models for your tasks. So I'd like to share with you some thoughts on how to
4:16
hone your intuition on what models will work well for your application. Let's illustrate this with
4:22
an example of using an LLM to follow instructions to remove or to redact PII or personally
4:29
identifiable information, that is, to remove private, sensitive information. For example,
4:34
if you are using an LLM to summarize customer calls, then maybe one summary is on July 14th,
4:41
2023, Jessica Alvarez with a social security number, a certain address, a business support
4:46
ticket, and so on. So this piece of text has a lot of sensitive, personally identifiable
4:52
information. Now, let's say we want to remove all PII from such summaries because we want to use the
4:58
data for downstream statistical analysis of what customers are calling about. And to protect customer
5:03
information, we want to strip out that PII before we do that downstream statistical analysis. So you
5:08
might prompt an LLM with instructions to identify all cases of PII in the text below and then return
5:15
the redacted text, with a "Redacted:" label and so on. It turns out that the larger frontier models tend
5:23
to be much better at following instructions, whereas the smaller models tend to be pretty
5:29
good at answering simple factual questions, but are just not as good at following instructions.
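(A minimal sketch of such a redaction prompt is below; the wording and the [REDACTED] placeholder are assumptions rather than the exact prompt shown in the course, and llm_call again stands in for however you call your chosen model.)

```python
REDACTION_PROMPT = """Identify all cases of personally identifiable information (PII)
in the text below, then return the same text with each piece of PII replaced
by "[REDACTED]". Return only the redacted text.

Text:
{text}
"""

def redact(llm_call, text):
    """Ask the model to strip PII before any downstream analysis."""
    return llm_call(REDACTION_PROMPT.format(text=text))
```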
5:34
If you run this prompt on a smaller model, the open-weight Llama 3.1 model with 8 billion parameters,
5:40
then it may generate an output like this. It says the identified PII is social security number and
5:45
address, and then it redacts it as follows and so on. And it actually makes a few errors. It didn't
5:50
follow the instructions properly. It showed the list, then redacted the text, then returned another
5:56
list, which it wasn't supposed to. And in this list of PII, it missed the name. And then I think
6:02
it also only partially redacted the address. The details aren't important, but it didn't follow these instructions
6:07
perfectly, and maybe it missed a little bit of PII. In contrast, if you use a more intelligent model,
6:12
one that's better at following instructions, you may get a better result like this, where it's
6:17
actually correctly listed all the PII and correctly redacted all of the PII. And so I find that as
6:24
different LLM providers specialize on different tasks, different models really are better for
6:30
different tasks. Some are better at coding, some are better at following instructions, some are better
6:34
at certain niche types of facts. And if you can hone your intuition for what models are more or
6:40
less intelligent, and what type of instructions they're more or less able to follow, then you'll
6:44
be able to make better decisions as to what models to use. So to share a couple tips on how to do this,
6:50
I encourage you to play with different models often. So whenever there's a new model release,
6:55
I'll often go try it out and try out different queries on it, both closed-weight proprietary
7:00
models as well as open-weight models. And I find that sometimes having a personal set of evals might
7:06
also be helpful, where there's a set of things you ask a lot of different models that might help you
7:10
calibrate how well they do on different types of tasks. One other thing that I do a lot that I hope
7:16
will be useful to you is I spend a lot of time reading other people's prompts. So sometimes
7:22
people will publish their prompts on the internet, and I'll often go and read them to understand
7:27
what best practices in prompting look like. Or I'll often chat to my friends at various companies,
7:32
including some of the frontier model companies, and share my prompts with them, take a look at
7:36
how they prompt. And sometimes I'll also go to open-source packages written by people I really
7:42
respect and download the open-source package and dig through that open-source package to find the
7:48
prompts the authors have written, in order to read them and hone my intuition about how to
7:53
write good prompts. This is one technique that I encourage you to consider, is by reading lots of
7:58
other people's prompts that will help you get better at writing prompts yourself. And I certainly
8:04
do this a lot, and I encourage you to do so too. And this will hone your intuition about what types
8:09
of instructions models are good at following, and when to say certain things to different models.
8:14
In addition to playing with models and reading other people's prompts, if you try out lots of
8:19
different models in your agentic workflows, that also lets you hone your intuition. So you see
8:24
which models work best for which types of tasks, and either looking at traces to get an informal
8:30
sense, or looking at either component-wise or end-to-end evals can help you assess how well
8:36
different models are working for different parts of your workflow. And then you start to hone
8:40
intuitions about not just performance, but maybe also price and speed trade-offs for the use of
8:46
different models. And one of the reasons I tend to develop my agentic workflows with AI Suite is
8:51
because it then makes it easy to quickly swap out and try out different models. And this makes me
8:56
more efficient in terms of trying out and assessing which models work best for my workflow.
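(As a sketch of that kind of model swapping with aisuite, assuming you have the package installed, API keys configured for each provider, and that these particular model identifiers are available to you:)

```python
import aisuite as ai

client = ai.Client()

def llm_call(prompt, model="openai:gpt-4o-mini"):
    """Model strings follow aisuite's 'provider:model' convention."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Swap models by changing only the model string, then compare eval scores.
for model in ["openai:gpt-4o-mini", "anthropic:claude-3-5-sonnet-20241022"]:
    print(model, llm_call("Write a caption of at most 10 words for sunglasses.", model=model))
```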
9:02
So we've talked a lot about how to improve the performance of different components to hopefully improve the
9:08
overall performance of your end-to-end system. In addition to improving the quality of the output,
9:15
one other thing you might want to do in your workflows is to optimize the latency as well as
9:21
cost. I find that for a lot of teams, when you start developing, usually the number one thing
9:25
to worry about is just are the outputs sufficiently high quality. But then when the system is working
9:31
well and you put it in production, then there's often value in making it run faster as well as run
9:37
at lower cost as well. So in the next video, let's take a look at some ideas for improving
9:42
cost and latency for agentic workflows.
4.6 Latency, cost optimization
0:06
When building agentic workflows, I'll often advise teams to focus on getting high-quality outputs
0:05
and to optimize cost and latency only later. It's not that cost and latency don't matter,
0:11
but I think getting the performance or the output quality to be high is usually the hardest part,
0:18
and then only when it's really working, then maybe focus on the other things.
0:22
One thing that's happened to me a few times was my team built an agentic workflow,
0:26
and we shipped it to users, and then we were fortunate enough to have so many users use it
0:31
that the cost actually became a problem, and then we had to, you know, scramble to bring the cost
0:36
back down. But that's a good problem to have, so I tend to worry about cost, usually less.
0:42
Not that I ignore it completely, but it's just lower down my list of things to worry about,
0:47
until we have so many users that we really need to bring the cost down per user.
0:51
And then latency, I tend to worry a bit about it, but again, not as much as just making sure
0:57
the output quality is high. But when you do get there, it will be useful to have tools to optimize
1:03
latency and cost. Let's take a look at some ideas on how to do that.
1:06
If you want to optimize the latency of an agentic workflow, one thing I will often do is then
1:13
benchmark or time the workflow. So in this research agent, it takes a number of steps,
1:20
and if I were to time each of the steps, maybe the LLM takes 7 seconds to generate the search terms.
1:26
Web search takes 5 seconds, this takes 3 seconds, this takes 11 seconds,
1:30
and then writing the final essay takes 18 seconds on average. And it is then by looking at this
1:35
overall timeline that I can get a sense of which components have the most room to be made faster.
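(A simple way to collect these timings in Python is sketched below; the step names are placeholders for your own components.)

```python
import time

def timed(step_name, fn, *args, timings, **kwargs):
    """Run one workflow step and record how long it took."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    timings[step_name] = time.perf_counter() - start
    return result

timings = {}
# In the real agent you would wrap each component call, for example:
#   terms   = timed("generate_search_terms", generate_search_terms, query, timings=timings)
#   results = timed("web_search", web_search, terms, timings=timings)
# Here we time a stand-in step so the sketch runs on its own.
timed("stand_in_step", time.sleep, 0.1, timings=timings)

for step, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{step}: {seconds:.2f}s")
```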
1:43
In this example, there may be multiple things you could try. If you haven't already taken
1:47
advantage of parallelism for some steps, like maybe web fetch, maybe it's worth considering
1:53
doing some of these operations in parallel. Or if you find that some of the LLM steps are
1:59
taking too long, so if this first step takes 7 seconds, this last LLM step takes 18 seconds,
2:04
I might also consider trying a smaller, maybe slightly less intelligent model to see if it
2:09
still works well enough for that, or if I can find a faster LLM provider. There are lots of APIs online
2:14
for different LLMs, and some companies have specialized hardware to allow them to serve
2:14
certain LLMs much faster, so sometimes it's worth trying different LLM providers to see which ones
2:26
can return tokens the fastest. But at least doing this type of timing analysis can give you a sense
2:32
of which components to focus on in terms of reducing latency. In terms of optimizing costs,
2:39
a similar calculation where you calculate the cost of each step would also let you benchmark
2:43
and decide which steps to focus on. Many LLM providers charge per token based on the input
2:49
and output length. Many API providers charge per API call, and the computational steps may have
2:55
different costs based on how you pay for server capacity and how much the service costs. And so
3:00
for a process like this, you might decide in this example that the tokens for this LLM step on
3:06
average cost 0.04 cents, each web search API maybe costs 1.6 cents, tokens cost this much,
3:14
API call costs this much, PDF to text costs this much, tokens for the final essay generation cost this
3:19
much, and this would maybe again give you a sense of whether there are cheaper components you could use or
3:23
cheaper LLMs you could use, and where the biggest opportunity is for optimizing costs.
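(A back-of-the-envelope cost estimate per run might be sketched like this; all the prices and token counts are made-up placeholders, so substitute your providers' actual pricing and your measured usage.)

```python
# Assumed prices (dollars); replace with your providers' real pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.00015
PRICE_PER_1K_OUTPUT_TOKENS = 0.0006
WEB_SEARCH_COST_PER_CALL = 0.016

def llm_step_cost(input_tokens, output_tokens):
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

costs_per_run = {
    "generate_search_terms": llm_step_cost(input_tokens=400, output_tokens=60),
    "web_search": 2 * WEB_SEARCH_COST_PER_CALL,  # assumed two search calls per run
    "write_final_essay": llm_step_cost(input_tokens=12000, output_tokens=1500),
}
for step, dollars in sorted(costs_per_run.items(), key=lambda kv: -kv[1]):
    print(f"{step}: ${dollars:.4f} per run")
```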
3:29
And I found that these benchmarking exercises can be very clarifying, and sometimes they'll clearly tell
3:34
me that certain components are just not worth worrying about because they're not a material
3:38
contributor to either cost or to latency. So I find that when either cost or latency becomes
3:45
an issue, by simply measuring the cost and or latency of each step, that often gives you a
3:51
basis with which to decide which components to focus on optimizing. So we're nearly at the end
3:58
of this module. I know we've covered a lot, but thank you for sticking with me. Let's go on to
4:02
the final video of this module to wrap up.
4.7 Development process summary
0:06
We've gone through a lot of tips for driving a disciplined, efficient process for building
0:05
Agentic AI systems. I'd like to wrap up by sharing with you what it feels like to be going through
0:11
this process. When I'm building these workflows, I feel like there are two major activities I'm
0:16
often spending time on. One is building, so writing software, trying to write code to improve
0:21
my system. And the second, which sometimes doesn't feel like progress, but I think is equally
0:27
important, is analysis to help me decide where to focus my build efforts next. And I often go
0:32
back and forth between building and analyzing, including things like error analysis. So for
0:38
example, when building a new agentic workflow, I'll often start by quickly building an end-to-end
0:43
system, maybe even a quick and dirty implementation. And this lets me then start to examine the final
0:49
outputs of the end-to-end system, or also read through traces to get a sense of where it's doing
0:54
well, where it's doing poorly. Based on even just looking at traces, sometimes this will give me a
0:59
gut sense of which individual components I might want to improve. And so I might go tune some
1:05
individual components or keep tuning the overall end-to-end system. As my system starts to mature
1:11
a little bit more, then beyond just manually examining a few outputs and reading through
1:15
traces, I might start to build evals and have a small data set, maybe just 10-20 examples,
1:21
to compute metrics, at least on end-to-end performance. And this then further helps me
1:27
have a more refined perspective on how to improve the end-to-end system or how to improve individual
1:32
components. As it matures even further, my analysis then becomes maybe even more disciplined, where I
1:38
start to do error analysis and look through the components and try to count up how frequently
1:42
individual components led to subpar outputs. And this more rigorous analysis then lets me be even
1:49
more focused in deciding what components to work on next or inspire ideas for improving the overall
1:54
end-to-end system. And then eventually, when it's even more mature to drive more efficient improvements
2:00
at the component level, that's when I might also build component-level evals. And so the workflow
2:06
of building an agentic system often goes back and forth. It's not a linear process. We sometimes
2:11
tune the end-to-end system, then do some error analysis, then improve a component for a bit,
2:15
then tune the component-level evals. And I tend to bounce back and forth between these two types
2:20
of techniques. And what I see less experienced teams often do is spend a lot of time building
2:27
and much less time than would be ideal analyzing with error analysis, building evals, and so on,
2:32
because this is the analysis that helps you really focus where to spend your time building.
2:38
And just one more tip. There are actually quite a few tools out there to help with monitoring traces,
2:44
logging runtime, computing costs, and so on. And those tools can be helpful. I sometimes use a few
2:49
of them, and quite a few of DeepLearning.AI's short course partners offer those tools, and they do
2:54
work well. I find that for agentic workflows I end up working on, most agentic workflows are pretty
3:00
custom. And so I end up building pretty custom evals myself because I want to capture the things
3:07
that go wrong in my system. So even though I do use some of those tools, I also end
3:12
up building a lot of custom evals that are well fit to my specific application and the issues I
3:18
see with it. So thanks for sticking with me this far to the end of the fourth of five modules.
3:25
If you're able to implement even a fraction of the ideas from this module, I think you'll be
3:32
well ahead of the vast majority of developers in terms of your sophistication at implementing
3:38
agentic workflows. Hope you found these materials useful, and I look forward to seeing you in the
3:43
final module. We'll talk about some more advanced design patterns for building
3:48
highly autonomous agents. I'll see you in the last module of this course.```