「吴恩达Agentic AI 模块1简报」Agentic AI工作流
概要
本文档综合分析了吴恩达关于“代理式AI工作流”(Agentic AI Workflows)的核心理念与实践方法。代理式AI是一种强大的应用程序构建范式,它将复杂的任务分解为多个步骤,通过迭代、反思和工具使用来执行,从而获得远超传统单次提示方法的性能和成果。
核心要点
- 代理式工作流的定义: 与一次性生成结果的模式不同,代理式工作流是一个多步骤过程,AI系统通过规划、执行动作、反思和修正来完成任务。这类似于人类处理复杂问题的方式,例如先写大纲、再做研究、然后起草、最后修改。
- 性能的巨大飞跃: 采用代理式工作流带来的性能提升,可能超过模型本身的代际升级。数据显示,为GPT-3.5模型应用代理式工作流后,其在编程基准测试中的表现甚至可以超越未使用该工作流的更强大的GPT-4模型。
- 自主性光谱: 代理式系统存在一个从“低自主性”到“高自主性”的光谱。低自主性系统(步骤由工程师预先设定)更易于控制、更可靠,已在商业中广泛应用。高自主性系统(由LLM自行决定步骤)则更具实验性,也更难预测。
- 成功的关键技能: 构建高效代理式工作流的两大核心技能是任务分解(将复杂任务拆解为LLM或工具可执行的小步骤)和严格的评估(“Evals”),通过系统化的错误分析和性能追踪来驱动迭代改进。
- 四大设计模式: 构建代理式工作流主要依赖四种关键的设计模式:
- 反思 (Reflection): 让LLM检查并批判自身的输出,从而进行迭代改进。
- 工具使用 (Tool Use): 赋予LLM调用外部函数(如网络搜索、代码执行、数据库查询)的能力。
- 规划 (Planning): LLM自主决定完成任务所需的步骤顺序。
- 多代理协作 (Multi-Agent Collaboration): 模拟一个团队,让多个具有不同角色的AI代理协同工作。
- 广泛的应用价值: 代理式工作流已成功应用于多种场景,包括深度研究报告撰写、客户支持、发票处理和复杂的法律文件分析。对于许多项目而言,没有代理式工作流,其实现将“不可能”。
1. 代理式AI工作流:定义与理念
1.1 核心概念:超越单次生成
传统的与大型语言模型(LLM)交互的方式是“直接生成”,即用户提供一个提示,LLM一次性地从头到尾生成完整的文本,这如同要求一个人“不使用退格键一次性写完一篇文章”。
代理式AI工作流则截然不同。它将一个复杂的任务分解为一系列的子任务,并以迭代的方式完成。
- 典型流程(以撰写研究报告为例):
- 生成大纲: 首先,让LLM为主题撰写一个初步大纲。
- 研究与信息收集: 接着,让LLM决定需要进行哪些网络搜索,并调用搜索API来获取相关网页内容。
- 起草初稿: 基于收集到的信息,LLM撰写第一版草稿。
- 反思与修订: LLM(或另一个AI代理)阅读初稿,识别需要修改或补充研究的部分。
- 人类介入(可选): 在关键环节(如事实核查),可以设计让AI请求人类审核的步骤。
- 最终修订: 综合所有反馈,完成最终报告。
吴恩达指出,这种迭代过程虽然耗时更长,但最终产出的工作成果质量“要好得多”。
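上述迭代流程可以用一段极简的 Python 勾勒出来。以下只是一个示意性骨架(其中 `llm` 与 `web_search` 均为占位函数,实际应分别替换为真实的 LLM API 调用与搜索 API 调用;提示词措辞也仅为示例):

```python
def llm(prompt: str) -> str:
    """占位:实际应调用某个 LLM API。这里回显提示前缀,仅用于演示数据流。"""
    return f"[LLM 输出 <- {prompt[:30]}...]"

def web_search(query: str) -> str:
    """占位:实际应调用网络搜索 API 并返回网页正文。"""
    return f"[搜索结果 <- {query}]"

def research_report(topic: str, max_revisions: int = 2) -> str:
    outline = llm(f"为主题“{topic}”写一份研究报告大纲。")                 # 1. 生成大纲
    queries = llm(f"基于以下大纲,列出需要的网络搜索查询:\n{outline}")     # 2. 决定搜索什么
    sources = web_search(queries)                                          #    调用搜索工具
    draft = llm(f"基于大纲和资料撰写初稿。\n大纲:{outline}\n资料:{sources}")  # 3. 起草初稿
    for _ in range(max_revisions):                                         # 4-6. 反思 -> 修订,迭代数轮
        critique = llm(f"阅读初稿,指出需要修改或补充研究的部分:\n{draft}")
        draft = llm(f"根据以下意见修订草稿:\n{critique}\n草稿:\n{draft}")
    return draft

report = research_report("黑洞")
```

真实系统中还可以在“反思”与“修订”之间插入人工审核步骤;这里为保持示例自洽而省略。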
1.2 “代理式 (Agentic)”术语的由来
吴恩达创造并推广“代理式 (agentic)”这个形容词,是为了避免AI社区中关于“什么是真正的代理 (agent)”的无谓争论。他认为,与其将系统划分为“是代理”或“不是代理”的二元对立,不如承认系统可以在不同程度上表现出“代理行为”。这一术语的引入,旨在将焦点从定义辩论转移到构建有价值的系统上。
2. 自主性光谱:从预定步骤到动态决策
代理式AI系统并非只有一种形态,而是存在于一个从低到高的自主性光谱上。
| 自主性程度 | 特征 | 优点 | 缺点 | 应用场景 |
|---|---|---|---|---|
| 低自主性 | 1. 步骤顺序由工程师预先硬编码。 2. 工具调用是确定性的。 3. LLM的主要作用是生成文本内容。 | 可控性强、结果可预测、可靠性高 | 灵活性差、无法处理未知流程 | 大多数当前商业应用,如发票处理、标准客户问询。 |
| 半自主性 | 1. LLM可以在一定范围内做出决策。 2. 可以从预定义的工具集中选择并调用工具。 | 兼具一定的灵活性和可控性 | 复杂性增加、仍难以处理完全未知的流程 | 可自行选择调用工具的客服代理、订单查询等应用 |
| 高自主性 | 1. LLM自主决定完成任务的完整步骤顺序。 2. 甚至可以自主编写新的函数或创建新工具。 3. 流程是动态且不确定的。 | 极高的灵活性、能处理未知和复杂的任务 | 可控性差、结果难以预测、可靠性较低,更具实验性 | 自动网页浏览、复杂问题解决等前沿研究领域 |
吴恩达强调,位于“低自主性”一端的应用同样非常有价值,并且是当今许多企业正在构建和部署的系统。
3. 代理式工作流的核心优势
代理式工作流不仅仅是为了提升任务完成的可能性,还带来了多方面的显著优势。
3.1 性能的巨大提升
代理式工作流能够释放现有模型的更大潜力,其带来的性能提升甚至可能超过模型本身的代际升级。
- 编程基准测试 (Human Eval) 数据:
- GPT-3.5 (非代理式): 准确率 40%
- GPT-4 (非代理式): 准确率 67%
- GPT-3.5 (代理式): 准确率远高于非代理式的GPT-3.5,甚至超越非代理式的GPT-4。
- GPT-4 (代理式): 准确率进一步大幅提升。
吴恩达指出:“从一代模型到下一代模型的提升是巨大的,但这种差异仍然没有在上一代模型上实施代理式工作流所带来的差异大。”
3.2 并行化处理以提升效率
对于需要大量信息收集的任务,代理式工作流可以通过并行处理来显著缩短时间,这比人类的顺序处理方式更高效。
- 示例: 在撰写研究报告时,AI可以:
- 并行生成搜索词: 同时运行3个LLM实例,为不同的研究角度生成搜索查询。
- 并行下载网页: 同时下载所有搜索结果中的9个相关网页。
- 整合信息: 将所有下载的内容一次性提供给LLM进行综合分析和撰写。
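其中“并行下载网页”这一步可以用 Python 标准库的 `concurrent.futures` 来示意(`fetch` 是代替真实 HTTP 下载的占位函数,URL 列表也是假设的):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> str:
    """占位:实际应发起 HTTP 请求下载网页;这里仅返回标记字符串。"""
    return f"<content of {url}>"

# 假设搜索结果给出了 9 个相关网页
urls = [f"https://example.com/page{i}" for i in range(9)]

# 并行下载所有网页(下载属于 I/O 密集型任务,适合线程池),
# 而人类只能按顺序一个一个地阅读
with ThreadPoolExecutor(max_workers=9) as pool:
    pages = list(pool.map(fetch, urls))

# 将所有内容拼接后一次性提供给 LLM 进行综合分析
combined = "\n\n".join(pages)
```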
3.3 模块化设计带来的灵活性
代理式工作流的架构天然具有模块化特性,允许开发者轻松地添加、更新或替换其中的组件,以优化整体性能。
- 可替换的组件包括:
- 工具: 可以更换不同的网络搜索引擎(如Google、Bing、DuckDuckGo、Tavily、you.com),或加入新闻搜索、学术搜索等新工具。
- 模型: 可以在工作流的不同步骤尝试使用不同的LLM或LLM提供商,以找到针对特定子任务的最佳模型。
4. 实际应用场景分析
代理式工作流的应用范围广泛,其实现难度取决于任务流程的清晰度。
4.1 流程清晰、易于实现的应用
当业务流程有明确的步骤或标准操作程序(SOP)时,将其编码为代理式工作流相对容易且可靠。
- 发票处理:
- 输入PDF发票,调用PDF转文本API。
- LLM判断文件是否为发票。
- 若是,则提取关键字段(账单方、金额、截止日期等)。
- 调用API将提取的信息更新到数据库。
- 基础客户订单查询:
- 输入客户邮件,LLM提取订单详情。
- LLM选择调用订单数据库API,查询订单状态。
- LLM起草回复邮件。
- LLM调用“请求审核”工具,将草稿送交人工审核后发送。
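上面“发票处理”的步骤可以直接映射为一段顺序代码。下面是一个极简骨架(`pdf_to_text`、`llm_is_invoice`、`llm_extract_fields` 均为占位函数,实际应替换为 PDF 转换 API、LLM 调用与数据库客户端;字段值取自课程中的示例发票):

```python
def pdf_to_text(pdf_path: str) -> str:
    """占位:实际应调用 PDF 转文本 API,返回 Markdown 等格式化文本。"""
    return "发票 开票方: Tech Flow Solutions 应付金额: $3000 到期日: 2025-08-20"

def llm_is_invoice(text: str) -> bool:
    """占位:实际应提示 LLM 判断该文档是否为发票。"""
    return "发票" in text

def llm_extract_fields(text: str) -> dict:
    """占位:实际应提示 LLM 提取账单方、金额、截止日期等关键字段。"""
    return {"biller": "Tech Flow Solutions", "amount": 3000, "due": "2025-08-20"}

def process_invoice(pdf_path: str, db: list) -> bool:
    text = pdf_to_text(pdf_path)       # 1. 输入PDF发票,调用PDF转文本API
    if not llm_is_invoice(text):       # 2. LLM判断文件是否为发票
        return False                   #    非发票则忽略
    fields = llm_extract_fields(text)  # 3. 提取关键字段
    db.append(fields)                  # 4. 更新数据库(此处用列表模拟)
    return True

db = []
process_invoice("invoice.pdf", db)
```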
4.2 流程未知、更具挑战性的应用
当处理任务所需的步骤无法预先确定时,系统需要具备规划能力,这使得实现难度和不可预测性都大大增加。
- 高级客户服务:
- 多重查询: 客户询问“是否有黑色或蓝色的牛仔裤?”,AI需要规划并执行两次库存数据库查询,然后综合结果进行回复。
- 复杂逻辑: 客户要求退货,AI需要规划一系列步骤:验证购买记录 -> 检查退货政策(如30天内、未使用) -> 若符合条件,则生成退货标签并更新数据库状态。
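退货这类“复杂逻辑”可以勾勒成一段条件流程,示意代理需要规划的步骤顺序(订单结构、函数名均为假设;真实系统中购买验证与状态更新应走数据库 API):

```python
from datetime import date

def handle_return_request(order: dict, today: date) -> str:
    """示意:验证购买记录 -> 检查退货政策 -> 生成退货标签并更新状态。"""
    if not order.get("purchased"):                 # 1. 验证购买记录
        return "未找到购买记录"
    days = (today - order["purchase_date"]).days
    if days > 30 or order.get("used"):             # 2. 检查退货政策:30天内且未使用
        return "不符合退货政策"
    order["status"] = "return_in_progress"         # 3. 更新数据库状态(此处直接改字典模拟)
    return f"退货标签已生成(订单 {order['id']})"

order = {"id": "A100", "purchased": True, "used": False,
         "purchase_date": date(2025, 8, 1)}
result = handle_return_request(order, today=date(2025, 8, 15))
```

难点正在于:这三步的顺序不是预先硬编码的,而需要由 LLM 根据客户请求自行规划出来。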
4.3 前沿与实验性应用
这是当前研究的热点领域,虽然潜力巨大,但可靠性尚不足以用于关键任务。
- 计算机使用(Web浏览):
- 任务: 代理被要求查询特定航班的座位情况。
- 行为: 代理自主操作网络浏览器,访问航空公司网站,在页面上点击元素、填写表单。当遇到困难时(如网站加载问题),它能自主决定切换到Google Flights等其他网站继续执行任务。
- 挑战: 当前的代理在处理网页加载缓慢、理解复杂网页布局方面仍存在困难,可靠性不高。
5. 任务分解:构建工作流的基石
任务分解是将一个宏大、复杂的任务拆解成一系列离散、可执行步骤的过程,是构建代理式工作流的核心技能。
5.1 任务分解的原则
分解过程的核心问题是:“这个步骤能否由一个LLM、一段简短的代码、一个函数调用或一个工具来完成?”
- 迭代式分解:
- 从一个简单的流程开始(例如:写大纲 -> 搜索 -> 写文章)。
- 评估结果。如果结果不理想(例如,文章感觉“脱节”),则选择其中一个步骤进行再分解。
- 例如,将“写文章”进一步分解为:“写初稿” -> “反思需要修改的部分” -> “修订草稿”。
- 持续这个过程,直到每个子步骤都足够简单,并且可以由可用的构建模块可靠地执行。
5.2 可用的构建模块
开发者在设计工作流时,可以调用多种构建模块:
- AI模型:
- 大型语言模型(LLM)或多模态模型(LMM)。
- 专用AI模型(如PDF转文本、文本转语音)。
- 软件工具:
- API调用: 网络搜索、获取天气数据、发送邮件、检查日历等。
- 信息检索: 从数据库或大型文本库中检索信息(RAG)。
- 代码执行: 让LLM编写代码并在本地环境中运行,以执行计算或复杂操作。
6. 评估(Evals):驱动性能提升的关键
吴恩达强调,能否推动一个纪律严明的评估流程,是区分高效和低效的代理式工作流开发者的最大预测因素之一。
6.1 评估流程
评估不应在开发前凭空想象,而应在系统构建后,通过观察实际输出来发现问题。
- 构建并观察: 先构建一个初始版本的工作流。
- 手动检查错误: 阅读大量输出,寻找不符合预期的行为。例如,发现客户服务代理在回复中“意外地提到了竞争对手”。
- 定义评估指标: 针对发现的问题,创建可追踪的评估指标。
6.2 评估方法
- 客观标准评估:
- 方法: 对于可以明确判断对错的问题,编写代码来进行自动检查。
- 示例: 编写一个脚本,搜索输出文本中是否包含一个已知的竞争对手名单,并计算其出现频率。
- 主观标准评估(LLM as Judge):
- 方法: 对于难以用代码衡量的质量问题(如文章的“思想深度”),使用另一个LLM作为“裁判”来打分。
- 示例: 提示一个裁判LLM:“请为以下文章的质量打分,范围从1到5”。
- 注意事项: 吴恩达提醒,让LLM在1到5的尺度上直接打分的效果并不理想,后续课程会介绍更可靠的技术。
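其中“客观标准评估”可以直接用几行代码实现。下面的示意脚本统计输出中按名称提及竞争对手的比例(竞争对手名单 ComproCo、RivalCo 取自课程示例;回复文本是假设的):

```python
COMPETITORS = ["ComproCo", "RivalCo"]  # 已知的竞争对手名单(课程示例)

def mentions_competitor(reply: str) -> bool:
    """客观检查:回复中是否按名称提及任一竞争对手。"""
    return any(name in reply for name in COMPETITORS)

def competitor_mention_rate(replies: list) -> float:
    """作为总回复数的比例,统计错误提及竞争对手的频率。"""
    hits = sum(mentions_competitor(r) for r in replies)
    return hits / len(replies)

replies = [
    "很高兴您与我们购物,我们比竞争对手 ComproCo 好得多。",  # 错误:提及竞争对手
    "您的退货标签已生成,祝您愉快。",                          # 正常
    "不像 RivalCo,我们让退货变得简单。",                      # 错误:提及竞争对手
    "您的订单将于明天送达。",                                  # 正常
]
rate = competitor_mention_rate(replies)  # 4 条回复中 2 条提及,比例为 0.5
```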
6.3 评估的类型
- 端到端评估: 衡量整个代理工作流最终输出的质量。
- 组件级评估: 衡量工作流中单个步骤输出的质量。
- 错误分析: 深入检查工作流的中间输出(称为“traces”),以准确定位问题根源。
7. 四大核心代理式设计模式
这些设计模式是组合构建模块以形成复杂工作流的常用策略。
7.1 反思 (Reflection)
该模式让LLM检查并改进其自身的输出。
- 流程:
- LLM生成一个初步输出(如一段代码)。
- 将该输出重新输入给同一个LLM,并附上批判性指令(例如,“请仔细检查这段代码的正确性、风格和效率,并提出建设性批评”)。
- LLM会识别出自身之前输出中的问题或缺陷。
- 基于这些批评,LLM生成一个改进后的新版本。
- 变体: 也可以使用一个专门的“批评家代理”(即一个被提示扮演批评角色的LLM)来完成此任务。
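反思模式的核心是一个“生成 -> 批评 -> 修订”的循环。下面用占位的 `llm` 函数勾勒这个循环(实际应替换为真实的 LLM 调用;批判性指令的措辞参考课程示例):

```python
def llm(prompt: str) -> str:
    """占位:实际应调用 LLM API;这里回显提示首行,仅用于演示数据流。"""
    return f"[LLM:{prompt.splitlines()[0]}]"

def reflect_and_improve(task: str, rounds: int = 2) -> str:
    output = llm(f"请为以下任务编写代码:{task}")  # 1. 生成初步输出
    for _ in range(rounds):
        # 2. 将输出重新输入同一个 LLM,并附上批判性指令
        critique = llm(f"请仔细检查这段代码的正确性、风格和效率,并提出建设性批评。\n{output}")
        # 3-4. 基于批评生成改进后的新版本
        output = llm(f"请根据以下批评修订代码:\n{critique}\n原代码:\n{output}")
    return output

final = reflect_and_improve("实现快速排序")
```

“批评家代理”变体只需把 critique 一步换成另一个带有批评者角色提示的 LLM 实例,循环结构不变。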
7.2 工具使用 (Tool Use)
赋予LLM调用外部函数或API的能力,以获取信息或执行操作,极大地扩展了其能力范围。
- 示例:
- 信息获取: 当被问及“根据评测,最好的咖啡机是哪款?”时,LLM可以调用网络搜索工具来查找最新信息。
- 精确计算: 当被问及复杂的数学问题时,LLM可以调用代码执行工具来编写并运行代码,得出精确答案。
- 应用交互: LLM可以调用API来操作邮件、日历等生产力应用。
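工具使用的常见实现方式是:把可用函数的名称与说明提供给 LLM,由 LLM 返回要调用的工具名及参数,再由应用代码真正执行。下面是一个不依赖任何真实 API 的调度骨架(`choose_tool` 用简单规则模拟 LLM 的选择,工具实现也均为占位):

```python
# 可用工具注册表:真实系统中这些函数会调用搜索 API、代码执行器等
def web_search(query: str) -> str:
    return f"[搜索:{query}]"   # 占位实现

def run_code(code: str) -> str:
    return f"[执行:{code}]"    # 占位实现

TOOLS = {"web_search": web_search, "run_code": run_code}

def choose_tool(question: str) -> dict:
    """占位:真实系统中由 LLM 根据工具描述返回 {tool, args} 结构。"""
    if "计算" in question or "复利" in question:
        return {"tool": "run_code", "args": {"code": "100 * 1.05 ** 7"}}
    return {"tool": "web_search", "args": {"query": question}}

def answer(question: str) -> str:
    call = choose_tool(question)                  # LLM 决定调用哪个工具、传什么参数
    result = TOOLS[call["tool"]](**call["args"])  # 应用代码执行该工具
    return result  # 真实系统中还会把结果交回 LLM 组织成最终答案

ans = answer("根据评测,最好的咖啡机是哪款?")
```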
7.3 规划 (Planning)
在该模式下,LLM不再遵循预设的步骤,而是自主决定完成任务所需的行动序列。
- 示例 (Hugging GPT):
- 用户请求: “请生成一张女孩读书的图片,姿势与另一张图片中的男孩相同,然后用你的声音描述这张新图片。”
- LLM的规划:
- 调用姿态确定模型,分析男孩的姿势。
- 调用文本到图像生成模型,生成具有相同姿势的女孩图片。
- 调用文本到语音模型,生成对新图片的语音描述。
- 特点: 具备规划能力的代理更难控制,更具实验性,但有时能产生令人惊喜的结果。
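规划模式可以示意为:让 LLM 输出一个有序的步骤列表(例如 JSON),再由应用代码依次执行。下面的骨架以上述 Hugging GPT 的例子为原型(`llm_plan` 是占位函数,模型名称与步骤内容均为示意):

```python
import json

def llm_plan(request: str) -> list:
    """占位:真实系统中由 LLM 根据请求生成 JSON 格式的步骤计划。"""
    plan_json = json.dumps([
        {"model": "pose-detection", "input": "boy.jpg"},      # 分析男孩姿势
        {"model": "pose-to-image",  "input": "girl reading"}, # 生成相同姿势的女孩图片
        {"model": "image-to-text",  "input": "new_image"},    # 描述新图片
        {"model": "text-to-speech", "input": "description"},  # 转为语音
    ])
    return json.loads(plan_json)

def execute(step: dict) -> str:
    """占位:真实系统中按步骤调用对应模型的 API。"""
    return f"已调用 {step['model']}({step['input']})"

# 步骤顺序由 LLM 决定,而不是由开发者预先硬编码
log = [execute(step) for step in llm_plan("生成姿势相同的女孩读书图片并用语音描述")]
```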
7.4 多代理协作 (Multi-Agent Collaboration)
该模式通过模拟一个人类团队,让多个具有不同角色和专长的AI代理协同工作来完成复杂任务。
- 示例:
- 虚拟软件公司 (ChatDev): 一个由多个AI代理(如CEO、程序员、测试员、设计师)组成的虚拟团队,能够协同完成软件开发任务。
- 营销手册撰写:
- 研究员代理: 负责进行在线研究。
- 营销员代理: 负责撰写营销文案。
- 编辑代理: 负责对文案进行编辑和润色。
- 特点: 难以控制,但研究表明,对于某些复杂任务(如撰写传记),多代理协作可以产生更好的结果。
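营销手册的例子可以示意为三个“代理”的流水线:每个代理本质上只是一个带有不同角色提示的 LLM 调用(`llm` 为占位函数,角色提示措辞为示意):

```python
def llm(prompt: str) -> str:
    """占位:实际应调用 LLM API;这里回显提示,仅用于演示数据流。"""
    return f"[{prompt}]"

def make_agent(role: str):
    """返回一个带固定角色提示的代理:即被提示扮演特定角色的 LLM。"""
    def agent(task: str) -> str:
        return llm(f"{role}:{task}")
    return agent

researcher = make_agent("你是研究员,负责在线研究")
marketer = make_agent("你是营销人员,负责撰写营销文案")
editor = make_agent("你是编辑,负责编辑和润色文本")

# 三个代理依次接力完成任务
notes = researcher("收集关于产品 X 的资料")
copy = marketer(f"基于资料撰写手册:{notes}")
final = editor(f"润色以下文案:{copy}")
```

真实的多代理系统(如 ChatDev)中,各代理之间还会多轮往返对话,而不只是单向接力;这里为突出“角色即提示”的思想而简化。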
「吴恩达Agentic AI 模块1」4个颠覆性观点
引言:超越一次性问答
你是否曾有过这样的经历:为了完成一份报告,你给大语言模型(LLM)一个提示词,比如“帮我写一篇关于黑洞的论文”,然后得到了一篇看起来还不错,但深度和洞察力都略显平庸的文章?这是一种非常普遍的AI使用方式,但它远未发挥出AI的全部潜力。
吴恩达(Andrew Ng)在他的新课《Agentic AI》中提出了一个绝佳的比喻:这种“一键生成”模式,就像强迫一个人(或AI)从第一个字写到最后一个字,中间不许停顿思考,甚至不许按删除键。我们都知道,人类不是这样工作的,AI也不应该如此。真正强大的方法,是让AI模仿人类的思考和工作流程:先构思,再研究,然后起草,最后反复修改。这就是“Agentic工作流”的核心思想,它正在开启一个全新的AI应用时代。
观点一:AI不再是“一键生成”,而是学会了“思考、研究、再修改”
过去,我们与AI的互动模式主要是“单次提示-单次生成”。我们提出一个问题,AI给出一个答案。但Agentic工作流彻底改变了这一点,它将一个复杂的任务分解成一系列更小的、可执行的步骤,形成一个迭代循环。
以上文提到的论文写作为例,一个Agentic工作流可能会这样执行:
- 构思大纲:首先,让LLM生成一个论文的结构大纲。
- 网络研究:根据大纲,LLM决定需要进行哪些网络搜索,并调用搜索API获取相关资料。
- 起草初稿:整合研究资料,撰写第一版草稿。
- 审阅反思:LLM(或另一个专门的“审阅”AI)阅读初稿,判断哪些部分需要修改、补充或进一步研究。
- 修改完善:根据审阅意见,对草稿进行修改,直至完成。
这个过程不再是僵化的线性生成,而是更接近人类专家完成复杂任务时的真实状态:思考、行动、反思、再行动。吴恩达指出,虽然这个过程可能花费更长的时间,但它最终能产出“好得多(much better)”的工作成果。
正如吴恩达所说:“事实证明,无论是我们人类,还是AI模型,都无法在被强制以这种完全线性的顺序写作时,拿出自己最好的作品。”
从“一次性生成”到“多步流程”的转变意义重大。它意味着AI不再只是一个快速的答案生成器,而是一个能够执行复杂流程、具备初步“思考”和“工作”能力的伙伴。
观点二:提升AI能力的最大杠杆,不是新模型,而是新方法
在AI领域,我们常常认为,获得更好性能的唯一途径就是等待下一个更强大的模型(比如从GPT-3.5升级到GPT-4)。但吴恩达课程中的一个数据颠覆了这个认知:实现一个Agentic工作流所带来的性能提升,甚至可能超过模型本身的代际飞跃。
课程中引用了一个名为Human Eval的编码能力基准测试数据:
- GPT-3.5(非Agentic模式):准确率为40%。
- GPT-4(非Agentic模式):准确率跃升至67%,提升巨大。
- 关键发现:在GPT-3.5上应用Agentic工作流(例如增加代码反思和修正步骤),其性能提升的幅度,甚至超过了从GPT-3.5直接升级到GPT-4所带来的性能提升。
这个结果的含义极为深刻:
“从一代模型到下一代模型的提升是巨大的,但这种提升仍然比不上在上一代模型上实施一个Agentic工作流所带来的差异。”
这对所有AI开发者和企业来说都是一个振奋人心的消息。它意味着我们不必总是依赖于最新、最昂贵的大模型。通过巧妙地设计工作流,我们就能用现有的工具实现顶尖的性能。这不仅极大地降低了高水平AI应用的开发门槛,更意味着未来的竞争优势将越来越多地来自于流程设计的智慧,而不仅仅是计算资源的规模。
观点三:当下最实用的AI Agent,可能不是最“自主”的那个
媒体和科幻作品常常将AI Agent描绘成高度自主、接近人类的“数字生命”。这种想象固然引人入胜,但吴恩达提醒我们,在现实世界中,最有价值的应用往往出现在自主性的另一端。
他提出了一个“自主性光谱”的概念,从“低自主性”到“高自主性”分布。
- 低自主性系统:其工作流程的每一步都由开发者预先设定好。例如,一个处理发票的Agent,其流程可能是固定的:识别PDF -> 提取关键字段(付款方、金额、日期) -> 存入数据库。
- 高自主性系统:Agent能够根据目标自主决定采取哪些步骤。例如,一个高级客服Agent需要根据用户千变万化的提问,自行规划查询库存、核对退货政策、生成退货单等一系列操作。
一个常见的误区是认为“低自主性”是“不够高级”的AI。恰恰相反,这些可预测、可控制、可靠的工作流,正是当前商业应用的核心。
吴恩达强调:“你会发现在自主性光谱的低端,有大量非常有价值的应用正在为今天的无数企业所构建……”
这个观点将关于AI Agent的讨论从天马行空的想象拉回了坚实的商业现实。对于企业而言,最关键的不是构建一个无所不能的“通用代理”,而是打造能够可靠、高效地解决特定业务问题的专用工作流。
观点四:构建强大AI的关键技能,不是算法魔法,而是严谨的工程纪律
如果更聪明的工作流,而非更强大的模型,是释放AI性能的关键,那么成功构建这些工作流的人和挣扎在失败边缘的人之间,区别究竟在哪里?答案并非某种神秘算法,而是一种更基础的能力:工程纪律。
吴恩达观察到,在构建Agentic工作流时,高效和低效的开发者之间最大的区别,就在于是否具备严谨的工程纪律。
“我所见过的,那些真正懂得如何构建Agentic工作流的人与那些不那么高效的人之间,最大的区别之一,就是他们推动一个纪律严明的开发流程的能力,尤其是一个专注于评估(evals)和错误分析的流程。”
这究竟是什么意思?它代表着一种范式转变:从依赖灵感和反复试错的“炼丹术”,转向一种类似现代软件开发的系统化、可度量的工程实践。这意味着建立一个系统的反馈循环:
- 评估(Evals):为Agent的表现定义明确的评估标准。这可以是客观指标(例如,在客服回复中是否提到了竞争对手的名字?),也可以是借助“LLM作为裁判”对生成报告的质量进行主观评分。
- 错误分析(Error Analysis):当Agent犯错时,关键不是简单地调整提示词,而是要诊断错误的根本原因。错误发生在多步工作流的哪一个环节?是网络搜索步骤失败了,还是综合信息步骤误解了原文?只有定位到流程中的薄弱环节,才能进行针对性改进。
这种方法论,将AI开发转变为一门有章可循、可持续迭代的工程学科。拥有严谨的评估和分析能力,比单纯追求模型的“智能”本身更为重要。
结语:教会AI“工作”,而不只是“回答”
吴恩达的课程为我们揭示了AI发展的下一个浪潮:重点不再仅仅是创造更大、更强的模型,而是设计更聪明、更高效的工作流,来释放模型中已经存在的巨大潜力。
我们正在从教会AI“回答问题”,迈向教会AI“完成工作”。这不仅仅是技术的演进,更是我们与AI协作方式的根本性变革。当我们教会AI的不再仅仅是‘回答’,而是整个‘工作流程’时,有哪些过去因流程过于繁琐、协作过于复杂而无法企及的领域,将首次向我们敞开大门?
「吴恩达Agentic AI 模块1」:智能体AI工作流学习指南
本指南旨在评估和深化对吴恩达(Andrew Ng)关于智能体AI工作流课程核心概念的理解。它包括一个测验、一份答案解析、一组论文问题和一个关键术语词汇表,所有内容均基于提供的课程材料。
测验:
简答题
请用2-3句话回答以下每个问题,以检验您对核心概念的理解。
- 什么是智能体AI工作流(Agentic AI Workflow)?它与传统的单次提示方法有何不同?
- 吴恩达为什么创造并推广“agentic”(智能体的)这个词,而不是简单地使用“agent”(智能体)?
- 智能体系统中的“自主性”(autonomy)谱系是如何划分的?请描述其两端。
- 根据课程内容,使用智能体工作流相比非智能体方法,在性能提升方面有何显著优势?请引用编码基准测试的例子。
- 除了提升性能,智能体工作流还提供了哪两个主要好处?
- 什么是“任务分解”(Task Decomposition)?为什么它在构建智能体工作流中至关重要?
- 在构建智能体工作流时,开发人员可以使用哪些核心“构建模块”(building blocks)?
- 请简要解释“反思”(Reflection)这一智能体设计模式及其工作原理。
- 什么是“工具使用”(Tool Use)设计模式?请举例说明。
- 评估(evals)在开发智能体工作流中扮演什么角色?请描述两种主要的评估方法。
答案解析
- 什么是智能体AI工作流(Agentic AI Workflow)?它与传统的单次提示方法有何不同? 智能体AI工作流是一个基于LLM(大语言模型)的应用程序执行多个步骤来完成任务的过程。与传统的单次提示(直接生成最终结果)不同,智能体工作流是一个迭代过程,可能包括规划、研究、草拟和修订,从而产生更高质量的成果。
- 吴恩达为什么创造并推广“agentic”(智能体的)这个词,而不是简单地使用“agent”(智能体)? 吴恩达创造“agentic”这个形容词是为了避免关于什么是“真正的智能体”的二元论争议。他认为系统可以在不同程度上表现出智能体特性,使用“agentic”可以承认这种程度上的差异,让社区专注于构建系统,而不是争论定义。
- 智能体系统中的“自主性”(autonomy)谱系是如何划分的?请描述其两端。 智能体系统的自主性谱系从“低自主性”到“高自主性”不等。低自主性系统通常遵循由程序员预先确定的、确定性的步骤序列;而高自主性系统则能自主做出许多决策,包括决定执行任务的步骤顺序,甚至可能创建新工具。
- 根据课程内容,使用智能体工作流相比非智能体方法,在性能提升方面有何显著优势?请引用编码基准测试的例子。 智能体工作流能够显著提升性能,其提升幅度甚至可能超过模型本身的代际升级。在Human Eval编码基准测试中,将GPT-3.5与智能体工作流结合使用,其性能提升幅度超过了从GPT-3.5升级到GPT-4所带来的性能提升。
- 除了提升性能,智能体工作流还提供了哪两个主要好处? 除了性能提升,智能体工作流还提供了另外两个好处:并行化(parallelism)和模块化(modularity)。并行化允许系统同时执行多个任务(如并行下载多个网页),从而比人类更快地完成某些工作;模块化则允许开发者轻松地添加、更新或替换工作流中的组件(如更换搜索引擎或LLM模型)。
- 什么是“任务分解”(Task Decomposition)?为什么它在构建智能体工作流中至关重要? 任务分解是将一个复杂的任务或流程分解成一系列离散、更小的步骤的过程。这在构建智能体工作流中至关重要,因为它能将一个宏大目标转化为可由LLM或软件工具执行的具体、可管理的操作序列,从而实现整个工作流。
- 在构建智能体工作流时,开发人员可以使用哪些核心“构建模块”(building blocks)? 开发人员可以使用的核心构建模块包括AI模型(如大语言模型或多模态模型)和软件工具。软件工具涵盖了多种功能,例如用于信息检索的API(如网络搜索、数据库查询)、用于执行任务的代码执行工具,以及与其他生产力应用(如邮件、日历)交互的接口。
- 请简要解释“反思”(Reflection)这一智能体设计模式及其工作原理。 “反思”是一种设计模式,即让LLM检查其自身的输出,并根据反馈进行迭代改进。例如,一个LLM生成代码后,可以提示它(或另一个“批评家”LLM)检查代码的正确性并提出批评,然后根据这些批评生成一个修正后的、更好的版本。
- 什么是“工具使用”(Tool Use)设计模式?请举例说明。 “工具使用”是一种设计模式,即赋予LLM调用外部函数或API(即“工具”)的能力来完成任务。例如,当被问及一个需要实时信息的问题时,LLM可以调用一个网络搜索工具来查找最新数据;当需要精确计算时,它可以调用一个代码执行工具来编写并运行代码得出答案。
- 评估(evals)在开发智能体工作流中扮演什么角色?请描述两种主要的评估方法。 评估在开发智能体工作流中扮演着至关重要的角色,它用于衡量系统性能、发现问题并推动迭代改进。两种主要的评估方法是:针对客观标准的评估,可以通过编写代码来检查(如检查输出中是否提及竞争对手);以及针对主观标准的评估,通常使用“LLM作为裁判”(LLM as a judge)的方法,让另一个LLM为输出质量打分。
论文问题
请针对以下问题进行深入思考和阐述,无需提供答案。
- 详细论述智能体AI工作流的自主性谱系。结合课程中提到的应用案例(如发票处理和通用客户服务),分析在不同自主性程度上,系统的可靠性、可控性与任务复杂性之间的权衡关系。
- 解释“任务分解”在设计高效智能体工作流中的核心作用。请以设计一个“自动化法律文件合规审查代理”为例,构思并描述您会如何将其分解为一系列可执行的步骤,并说明每个步骤选择的构建模块(LLM或工具)。
- 分析吴恩达所提出的四种关键设计模式:反思、工具使用、规划和多智能体协作。请选择其中两种模式,深入比较它们的应用场景、实现复杂性以及它们如何共同作用以解决比单一模式更复杂的任务。
- “一个严谨的评估流程(disciplined evaluation process)是成功构建智能体工作流的最重要预测指标之一。” 请阐述您对这句话的理解。讨论端到端评估(end-to-end evals)与组件级评估(component level evals)各自的优缺点,并说明错误分析(error analysis)如何驱动工作流的迭代优化。
- 展望未来,课程中提到的“计算机使用代理”(computer use agents)代表了智能体AI的一个前沿研究领域。请讨论实现可靠的计算机使用代理所面临的主要挑战(如网页加载慢、界面解析困难),并探讨解决这些挑战可能的技术路径。
词汇表
| 术语 | 定义 |
|---|---|
| 智能体AI (Agentic AI) | 一种重要的、快速增长的人工智能应用构建趋势,其核心是使用智能体工作流来完成复杂任务。 |
| 智能体AI工作流 (Agentic AI Workflow) | 一种基于大语言模型的应用执行多个步骤来完成任务的过程。它通常是迭代的,包括思考、研究、修订等环节,以产生更高质量的输出。 |
| 自主性 (Autonomy) | 智能体系统在没有人类预先硬编码的情况下自主做出决策和决定行动顺序的程度。这个概念存在一个从低到高的谱系。 |
| 任务分解 (Task Decomposition) | 将一个复杂的、宏观的任务分解成一系列更小、离散且可由LLM或软件工具执行的步骤的过程。这是构建智能体工作流的关键技能。 |
| 评估 (Evaluations / evals) | 用于衡量智能体工作流性能、发现错误和不足之处的流程。它是推动系统迭代改进的关键,包括客观指标和主观判断。 |
| LLM作为裁判 (LLM as a Judge) | 一种主观评估技术,即使用一个LLM来评估另一个LLM生成的输出的质量,通常是通过提示其对输出内容进行评分。 |
| 反思 (Reflection) | 一种智能体设计模式。它让LLM检查并批判自己的输出,然后利用这些反馈或外部信息(如代码错误信息)来迭代生成一个更好的版本。 |
| 工具使用 (Tool Use) | 一种智能体设计模式。它赋予LLM调用外部函数或API(即“工具”)的能力,如网络搜索、代码执行或数据库查询,以完成超越纯文本生成的任务。 |
| 规划 (Planning) | 一种智能体设计模式。它让LLM自主决定为完成一项复杂任务所需采取的行动序列,而不是由开发者预先硬编码步骤。 |
| 多智能体协作 (Multi-Agent Collaboration) | 一种智能体设计模式。它通过模拟多个拥有不同角色(如CEO、程序员、测试员)的智能体,让它们协同工作以完成复杂的项目。 |
| 构建模块 (Building Blocks) | 构建智能体工作流的基本组件,主要包括AI模型(如LLM)和各种软件工具(如API调用、代码执行器、信息检索器)。 |
模块1:Agentic AI简介「Andrew Ng:Agentic AI」
1.0 简介
欢迎来到这门关于 Agentic AI 的课程。当我创造 “agentic” 这个词来描述我所看到的一种重要且快速增长的、人们构建应用的方式时,我没有意识到的是,一群营销人员会抓住这个词,把它当成一个标签,贴在几乎所有能看到的东西上。这导致了对 Agentic AI 的炒作急剧升温。不过,好消息是,抛开炒作不谈,使用 Agentic AI 构建的真正有价值和有用的应用数量也增长得非常迅速,即使没有炒作那么快。在本课程中,我想向您展示构建 Agentic AI 应用的最佳实践。这将在您现在可以构建什么方面,为您开启许多新的机会。
如今,agentic 工作流正被用于构建客户支持代理等应用,或进行深度研究以帮助撰写富有洞察力的研究报告,或处理棘手的法律文件,或查看患者输入信息并提出可能的医学诊断。在我带的许多团队中,我们构建的很多项目如果没有 agentic 工作流是根本不可能完成的。因此,知道如何用它们来构建应用是当今 AI 领域最重要和最有价值的技能之一。
我发现,真正懂得如何构建 agentic 工作流的人与那些效率较低的人之间,最大的区别之一是能否推动一个规范的开发流程,特别是专注于评估和错误分析的流程。在本课程中,我将告诉您这意味着什么,并向您展示如何才能真正擅长构建这些 agentic 工作流。能够做到这一点是当今 AI 领域最重要的技能之一,它将为您开启更多的机会,无论是工作机会,还是亲手打造出色软件的机会。
那么,让我们进入下一个视频,更深入地探讨什么是 agentic 工作流。
1.1 什么是 Agentic AI
那么,什么是 Agentic AI?为什么 Agentic AI 工作流如此强大?让我们来看一看。
如今,我们许多人使用大型语言模型(LLM)的方式是提示它,比如说,为我们写一篇关于某个主题 X 的文章。我认为这类似于去找一个人,或者在这种情况下,去找一个 AI,请它为我打出一篇文章,要求它从第一个词写到最后一个词,一气呵成,并且永远不能使用退格键。事实证明,我们人类并不能通过这种被迫以完全线性顺序写作的方式来完成我们最好的作品,AI 模型也是如此。但尽管受到这种写作方式的限制,我们的大语言模型表现得出奇地好。
相比之下,使用 agentic 工作流,过程可能是这样的:你可能会让它首先写一个关于某个主题的文章大纲,然后问它是否需要进行任何网络研究。在进行了一些网络研究并可能下载了一些网页之后,再让它撰写初稿,然后阅读初稿,看看哪些部分需要修改或做更多研究,接着修改草稿,如此循环。这种工作流更类似于先进行一些思考和研究,然后进行一些修改,再进行更多的思考,等等。通过这种迭代过程,事实证明,agentic 工作流可能需要更长的时间,但它能交付出质量好得多的工作成果。
所以,一个 Agentic AI 工作流是一个基于 LLM 的应用执行多个步骤来完成一项任务的过程。在这个例子中,你可能会使用一个 LLM 来撰写文章大纲,然后你可能会使用一个 LLM 来决定在网络搜索引擎中输入什么搜索词,或者说,用什么搜索词来调用网络搜索 API,以获取相关的网页。基于此,你可以将下载的网页输入到一个 LLM 中,让它撰写初稿,然后可能使用另一个 LLM 进行反思,并决定哪些地方需要更多修改。根据你设计这个工作流的方式,也许你甚至可以加入一个“人类在环”的步骤,让 LLM 可以选择请求人类审查某些关键事实。在此基础上,它可能会修改草稿,这个过程会产生一个好得多的工作输出。
你在本课程中学到的关键技能之一,就是如何将像写文章这样的复杂任务分解成更小的步骤,让 agentic 工作流一次执行一步,从而获得你想要的工作输出。事实证明,知道如何将任务分解为步骤,以及如何构建组件来很好地执行单个步骤,是一项棘手但重要的技能,它将决定你为各种激动人心的应用构建 agentic 工作流的能力。
在本课程中,我们将使用一个贯穿始终的例子,也是你将与我一起构建的东西——一个研究代理(research agent)。下面是它看起来的样子。你可以输入一个研究主题,比如“我如何建立一个新的火箭公司来与 SpaceX 竞争”。我个人不想与 SpaceX 竞争,但如果你想,你可以试着让研究代理帮助你做背景研究。
这个代理首先会规划要使用哪些研究方法,包括调用网络搜索引擎下载一些网页,然后综合和排序研究发现,起草大纲,让一个“编辑代理”审查其连贯性,最后生成一份全面的 Markdown 报告。它在这里已经完成了,报告标题是“建立一个新的火箭公司来与 SpaceX 竞争”,内容包括引言、背景、发现等等。我认为它恰当地指出,这将是一个很难创建的初创公司,所以我个人不打算这么做。但如果你想处理这样的事情,也许像这样的研究代理可以帮助你做一些初步研究。通过查找和下载多个来源并深入思考,这实际上最终会得到一份比仅仅提示一个 LLM 为你写一篇文章要深思熟虑得多的报告。
我对此感到兴奋的原因之一是,在我的工作中,我最终构建了不少专门的研究代理,无论是在法律文件领域用于冲突法合规性,还是在一些医疗健康领域,或是一些商业产品研究领域。因此,我希望通过这个例子,你不仅能学会如何为许多其他应用构建 agentic 工作流,而且在构建研究代理时的一些想法,在你将来需要自己构建一个定制的研究代理时,能对你直接有用。
现在,关于 AI 代理经常被讨论的领域之一是它们的自主性有多高?你刚才看到的是一个相对复杂、高度自主的 Agentic AI 工作流,但也有其他更简单的工作流,它们同样非常有价值。让我们在下一个视频中讨论 agentic 工作流可以达到何种程度的自主性,并为你提供一个框架来思考你可能如何构建不同的应用,以及它们的难易程度。我们在下一个视频见。
1.2 自主程度
代理(Agent)可以在不同程度上实现自主。几年前,我注意到在 AI 社区内部,关于什么是“代理”的争议日益激烈。有些人写论文说他们构建了一个代理,而另一些人则会说,不,那并不是一个真正的代理。我觉得这场辩论没有必要,所以我开始使用 “agentic” 这个词。因为我认为,如果我们把它当作一个形容词,而不是一个二元概念——要么是代理,要么不是——那么我们就必须承认系统可以在不同程度上是 agentic 的。让我们就都称之为 agentic,然后继续我们构建这些系统的实际工作,而不是去辩论“这个系统是否足够自主以至于能被称为代理”。
我记得当我准备一个关于 agentic 推理的演讲时,我的一个团队成员实际上来找我说:“嘿,Andrew,我们不需要再来一个新词了。我们已经有 agent 了,你为什么要造另一个词 agentic 呢?” 但我还是决定使用它。后来,我在一份名为 The Batch 的时事通讯中写了一篇文章,也在社交媒体上发帖说,与其争论哪个词应该被包含或排除为真正的代理,不如让我们承认系统可以有不同程度的 agentic 特性。我认为这帮助我们超越了关于什么是真正代理的辩论,让我们能够专注于实际地构建它们。
不同程度的自主性
有些代理可以不那么自主。以写一篇关于黑洞的文章为例,你可以有一个相对简单的代理来提出几个网络搜索词条或查询。然后你可以硬编码,让它调用网络搜索引擎,获取一些网页,然后用这些信息来写文章。这是一个自主性较低的代理的例子,它有一套完全确定的步骤序列。这样做效果还不错。
在整个课程中,我会用红色表示用户输入,比如这里的用户查询,或者在后面的例子中,可能是输入到 agentic 工作流的文档。灰色的框表示对 LLM 的调用,而绿色的框,比如你在这里看到的网络搜索和网页获取框,表示使用其他软件执行操作的步骤,比如调用网络搜索 API 或执行代码来获取网站内容。
然后,一个代理可以更加自主。当收到写一篇关于黑洞的文章的请求时,也许你可以让 LLM 自行决定,是想进行网络搜索,还是搜索最近的新闻源,或者是在 arXiv 上搜索最新的研究论文。基于此,也许在这个例子中,是 LLM 而不是人类工程师,选择了调用网络搜索引擎。之后,你可能会让 LLM 决定它想获取多少个网页,或者如果它获取了 PDF,是否需要调用一个函数或工具将 PDF 转换为文本。在这种情况下,也许它获取了前几个网页,然后它可以写一篇文章,决定是否要反思和改进,甚至可能返回去获取更多的网页,然后最终产生一个输出。
因此,即使是对于研究代理这个例子,我们也可以看到一些代理可以不那么自主,其执行的步骤是线性的,由程序员确定;而另一些则可以更自主,你信任 LLM 做出更多的决定,甚至确切的步骤顺序也可能由 LLM 决定,而不是由程序员预先设定。
自主性谱系
- 自主性较低的系统:通常所有步骤都是预先确定的。它调用的任何函数,比如网络搜索(我们称之为“工具使用”),可能是由人类工程师(你或我)硬编码的。大部分的自主性体现在 LLM 生成的文本内容上。
- 高度自主的代理:在谱系的另一端,代理会自主地做出许多决定,包括例如决定执行什么步骤序列来写文章。有些高度自主的代理甚至可以编写新的函数,或创建新的工具供自己执行。
- 半自主代理:介于两者之间,它可以做出一些决定,选择工具,但这些工具通常是预先定义好的。
当你在本课程中看到不同的例子时,你将学会如何在这个从低自主到高自主的谱系中的任何位置构建应用。你会发现,在谱系的低自主端,有大量非常有价值的应用正在为无数企业构建。同时,在谱系的高自主端,也有应用正在开发中,但那些通常更难控制,有点更不可预测,并且也有大量的活跃研究在探索如何构建这些更高度自主的代理。
那么,让我们进入下一个视频,更深入地探讨这个话题,并听听使用代理的一些好处,以及为什么它们让我们能做到一些用以前的应用构建方式无法实现的事情。
1.3 Agentic 工作流的好处
我认为 agentic 工作流最大的一个好处是,它能让你有效地完成许多以前根本不可能完成的任务。但除此之外,还有其他好处,包括并行化,让你能非常快地完成某些事情,以及模块化,让你能将来自许多不同地方的最佳组件组合起来,构建一个高效的工作流。让我们来看一看。
性能提升
我的团队收集了一些关于一个编码基准测试的数据,该测试旨在检验不同 LLM 编写代码以执行特定任务的能力。这个基准测试叫做 Human Eval。结果显示,GPT-3.5(首个公开发布的 ChatGPT 版本所基于的模型),如果被要求直接编写代码,即一次性输出整个计算机程序,在这个基准测试上的正确率为 40%。而 GPT-4 是一个好得多的模型,它在使用同样的非 agentic 工作流时,性能跃升至 67%。
但事实证明,从 GPT-3.5 到 GPT-4 的提升虽然巨大,但与将 GPT-3.5 包装在一个 agentic 工作流中所能实现的提升相比,就相形见绌了。通过使用不同的 agentic 技术(你将在本课程后面学到),你可以提示 GPT-3.5 编写代码,然后或许让它反思代码并找出改进之处。使用这样的技术,你实际上可以让 GPT-3.5 达到高得多的性能水平。同样,在 agentic 工作流的背景下使用 GPT-4,其表现也会好得多。因此,即使是使用当今最好的 LLM,agentic 工作流也能让你获得更好的性能。实际上,我们在这个例子中看到,从一代模型到下一代模型的提升虽然巨大,但仍然不如在上一代模型上实施 agentic 工作流所带来的差异大。
并行化与效率
使用 agentic 工作流的另一个好处是它们可以并行化一些任务,从而比人类更快地完成某些事情。例如,如果你让一个 agentic 工作流写一篇关于黑洞的文章,你可能可以并行运行三个 LLM 来为搜索引擎生成搜索词的创意。基于第一次网络搜索,它可能会识别出,比如说,三个排名最靠前的结果来获取。基于第二次网络搜索,它可能会识别出第二组要获取的网页,以此类推。事实证明,如果一个人类来做这项研究,他将不得不按顺序或一次一个地阅读这九个网页。而当你使用 agentic 工作流时,你实际上可以并行化所有九个网页的下载,然后最终将所有这些信息输入一个 LLM 来撰写文章。所以,尽管 agentic 工作流确实比真正的非 agentic 工作流(即通过单次提示直接生成)花费更长的时间,但如果你将这种 agentic 工作流与人类完成任务的方式相比,其并行下载大量网页的能力实际上能让它比单一的人类以非并行的顺序方式处理这些数据要快得多。
模块化与灵活性
在这个例子的基础上,我经常在构建 agentic 工作流时做的一件事,就是审视像 LLM 这样的单个组件,并添加或替换组件。例如,我可能会审视我在这里使用的网络搜索引擎,然后决定换一个新的。在构建 agentic 工作流时,实际上有多个网络搜索引擎可供选择,包括可以通过服务器访问的 Google,以及像 Bing、DuckDuckGo、Tavily、you.com 等其他引擎。实际上,为 LLM 设计的网络搜索引擎有很多选择。或者,也许不仅仅是做三次网络搜索,我们可以在这一步换上一个新的新闻搜索引擎,这样我们就能找到关于黑洞科学最新突破的消息。最后,我通常会尝试不同的 LLM,而不是在所有不同步骤都使用同一个,我可能会尝试不同的 LLM 提供商,看看哪个能在系统的不同步骤中给出最好的结果。
总结一下,我使用 agentic 工作流的主要原因就是它在许多不同应用上都能提供更好的性能。但此外,它还可以并行处理一些人类必须按顺序完成的任务。而且,许多 agentic 工作流的模块化设计也让我们能够添加或更新工具,有时还能更换模型。我们已经谈了很多关于构建 agentic 工作流的关键组件。现在让我们来看一系列 Agentic AI 的应用,让你了解人们已经在构建什么样的东西,以及你将要亲手构建什么样的东西。让我们进入下一个视频。
1.4 Agentic AI 应用
让我们来看一些 Agentic AI 应用的例子。
示例一:发票处理(流程清晰)
许多企业都会执行的一项任务是发票处理。给定这样一张发票,你可能想编写软件来提取最重要的字段,对于这个应用来说,假设这些字段是:开票方(即 Tech Flow Solutions)、开票方地址、应付金额(3000美元)以及到期日(看起来是2025年8月20日)。在许多财务部门,可能是一个人查看发票并识别出最重要的字段——我们需要在何时向谁付款——然后将这些信息记录在数据库中,以确保按时付款。
如果你用 agentic 工作流来实现这个功能,你可能会这样做:输入一张发票,然后调用一个 PDF 到文本的转换 API,将 PDF 转换成可能是格式化的文本,比如 Markdown 文本,供 LLM 读取。然后 LLM 会查看 PDF 内容并判断这到底是不是一张发票,还是其他应该忽略的文档类型。如果是一张发票,它就会提取所需的字段,并使用一个 API 或工具来更新数据库,将最重要的字段保存在数据库记录中。这个 agentic 工作流的一个方面是,它有一个清晰的流程可以遵循:识别所需字段并在数据库中记录。像这样有清晰流程的任务,对于 agentic 工作流来说往往更容易执行,因为它导向了一种相对按部就班、可靠地执行任务的方式。
示例二:基础客户订单查询(流程清晰)
这里是另一个例子,可能稍微难一点。如果你想构建一个代理来回应基本的客户订单查询,那么步骤可能是:首先提取关键信息,弄清楚客户到底订购了什么,客户的名字是什么;然后查找相关的客户记录;最后起草一份回复,供人工审查,之后再将邮件发送给客户。
这里同样有一个清晰的流程,我们会按部就班地实现它:接收邮件,将其交给一个 LLM 来验证或提取订单详情。假设客户邮件是关于一个订单的,LLM 可能会选择调用订单数据库来调取信息。这些信息随后会交给 LLM,由它起草一封电子邮件回复。LLM 可能会选择使用一个“请求审查”的工具,比如将这封由 LLM 起草的邮件放入一个队列中供人工审查,以便在人工审查并批准后发送出去。像这样的客户订单查询代理如今正在许多企业中被构建和部署。
示例三:通用客户服务(更具挑战性)
来看一个更具挑战性的例子。如果你想构建一个客户服务代理,不仅能回答关于他们下的订单的问题,还能回应更广泛的问题,即客户可能问的任何事情。也许客户会问:“你们有黑色的牛仔裤或蓝色的牛仔裤吗?” 要回答这个问题,你可能需要多次调用你的数据库 API,先检查黑色牛仔裤的库存,再检查蓝色牛仔裤的库存,然后再回复客户。这是一个更具挑战性的查询的例子,给定一个用户输入,你实际上必须规划出一系列数据库查询的顺序来检查库存。
或者,如果一个用户问:“我想退回我买的沙滩巾。” 要回答这个问题,我们可能需要验证客户是否真的购买了沙滩巾,然后再次核对退货政策。也许我们的退货政策是购买后30天内且毛巾未使用过才可退货。如果允许退货,那么就让代理生成一个退货标签,并同时将数据库记录设置为“退货处理中”。在这个例子中,如果处理客户请求所需的步骤不是预先知道的,那么它就会导致一个更具挑战性的过程,基于 LLM 的应用必须自己决定需要这三个步骤才能对这个任务做出适当的回应。但你也会学到一些关于如何处理这类问题的最新方法。
示例四:计算机使用(前沿但困难)
最后一个例子,可能是一种特别难以构建的代理类型。有很多关于代理使用计算机的研究,在这种研究中,代理会尝试使用网络浏览器并阅读网页,以找出如何执行一项复杂的任务。
在这个例子中,我要求一个代理检查从旧金山到华盛顿特区(DCA机场)的两个特定美联航航班上是否有座位。该代理可以访问一个网络浏览器来执行此任务。在视频中,你可以看到它独立地浏览美联航网站,点击页面元素并填写页面上的文本字段,以执行我请求的搜索。在工作时,代理会推理页面内容,以确定完成任务需要采取的行动以及下一步该做什么。在这种情况下,它在美联航网站上检查航班时遇到了一些麻烦,于是决定转到谷歌航班网站搜索可用航班。在谷歌航班上,你看到它找到了几个符合用户查询的航班选项,然后代理选择了一个,并被带回到美联航网站,看起来它现在在正确的网页上了,因此能够确定我所询问的航班上确实有座位。
计算机使用是目前一个激动人心的前沿研究领域,许多公司都在努力让计算机使用代理能够正常工作。虽然你在这里看到的代理最终找到了答案,但我经常看到代理在使用网络浏览器时遇到困难。例如,如果一个网页加载缓慢,代理可能无法理解发生了什么,而且许多网页仍然超出了代理准确解析或阅读的能力范围。但我认为,计算机使用代理,尽管今天还不够可靠,无法用于关键任务应用,但它们是未来发展的一个激动人心且重要的领域。
难易度总结
所以,当我在考虑构建 Agentic AI 工作流时:
- 较容易的任务:往往是那些有清晰、按部就班流程的任务。如果一个企业已经有了标准的作业程序,那么将这个程序编码成一个 AI 代理虽然可能需要很多工作,但通常实现起来更容易。如果只使用纯文本资产,也会更容易,因为 LLM 就是在处理文本中成长起来的。
- 较困难的任务:如果执行任务所需的步骤不是预先知道的,就像你看到的更高级的客户服务代理那样,那么代理可能需要边做边规划或解决问题,这往往更难、更不可预测、可靠性也更低。而且如前所述,如果需要接受丰富的多模态输入,如声音、视觉、音频,也往往比只处理文本更不可靠。
希望这能让你对可能用 agentic 工作流构建的应用类型有所了解。当你自己实现这些东西时,最重要的技能之一是审视一个复杂的工作流,并找出其中的各个步骤,以便你可以实现一个 agentic 工作流来一步步执行这些步骤。在下一个视频中,我们将讨论任务分解,也就是说,给定一个你想做的复杂事情,比如写一份研究报告或者让客户代理回复客户,你如何将其分解成离散的步骤来尝试实现一个 agentic 工作流。让我们在下一个视频中看看这个。
1.5 任务分解:识别工作流中的步骤
人和企业做很多事情。你如何将我们做的这些有用的事情分解成离散的步骤,供 agentic 工作流遵循呢?让我们来看一看。
以构建一个研究代理为例。如果你想让一个 AI 系统写一篇关于主题 X 的文章,一种方法是直接提示一个 LLM 让它生成输出。但如果你对想要深入研究的主题这样做,你可能会发现 LLM 的输出只涵盖了表面层次的观点,或者可能只涵盖了显而易见的事实,但没有像你希望的那样深入探讨主题。在这种情况下,你可能会反思一下,作为一个人,你会如何写一篇关于某个主题的文章。你会直接坐下来就开始写,还是会采取多个步骤,比如先写一个文章大纲,然后搜索网络,再根据网络搜索的输入来写文章。
当我将一个任务分解成步骤时,我总是在问自己一个问题:看看这第一、第二和第三个步骤,它们中的每一个是否都可以由一个 LLM,或者一小段代码,或者一个函数调用,或者一个工具来完成。在这种情况下,我认为一个 LLM 可以在很多我希望它帮助我思考的主题上写出一个不错的大纲。所以,第一步大概没问题。然后我知道如何使用一个 LLM 来生成搜索词以搜索网络。所以,我认为第二步也是可行的。然后基于网络搜索,我认为一个 LLM 可以输入网络搜索结果并写一篇文章。因此,这对于编写一篇比直接生成更深入的文章来说,会是一个合理的 agentic 工作流的初次尝试。
但是,如果我实现了这个 agentic 工作流并查看结果,也许我发现结果仍然不够好,它还不够深思熟虑,也许文章感觉有点脱节。这实际上发生在我身上过。我曾经用这个工作流构建了一个研究代理,但当我读输出时,感觉有点不连贯,文章的开头和中间部分感觉不完全一致,和结尾部分也不完全一致。
在这种情况下,你可能会反思,如果你作为一个人发现文章有点脱节,你会如何改变工作流。一种做法是,将第三步“写文章”进一步分解成额外的步骤。所以,你可能不会让它一次性写完文章,而是让它先写初稿,然后思考哪些部分需要修改,接着再修改草稿。这就像我作为一个人可能会做的那样,不是第一次尝试就写出最终的文章,而是先写初稿,然后再通读一遍——这是 LLM 相当擅长的另一步。然后基于我自己对我文章的批判,我会修改草稿。
回顾一下,我从直接生成开始,只有一个步骤,觉得不够好,所以把它分解成了三个步骤。然后可能觉得还是不够好,于是把其中一个步骤进一步分解成了三个更小的步骤,最终得到了这个更复杂、更丰富的文章生成过程。根据你对这个过程结果的满意度,你甚至可以选择进一步修改这个文章生成过程。
让我们看第二个关于如何将复杂任务分解成更小步骤的例子。以回应基本的客户订单查询为例。一个人类客户专员可能执行的第一步是先提取关键信息,比如这封邮件是谁发的,他们订了什么,订单号是多少。这些事情一个 LLM 都能做。所以我可以说,让一个 LLM 来做这件事。第二步是找到相关的客户记录。也就是编写并生成相关的数据库查询,来调取客户订购了什么、我什么时候发货等订单信息。我认为一个能够调用函数查询订单数据库的 LLM 应该能做到这一点。最后,调出客户记录或客户订单记录后,我可能会写一封回信发给客户。我认为有了我们调取的信息,如果我给它调用 API 发送邮件的选项,这第三步用 LLM 也是可行的。所以,这是另一个将回应客户邮件的任务分解成三个独立步骤的例子,我可以审视每个步骤然后说,是的,我认为一个 LLM,或者一个能调用函数查询数据库或发送邮件的 LLM,应该能做到。
最后一个例子,关于发票处理。在 PDF 发票被转换成文本后,第一步是提取所需信息:开票方名称、地址、到期日、应付金额等等。LLM 应该能做到。然后,如果我想检查信息是否被提取并将其保存在一个新的数据库条目中,那么我认为一个 LLM 应该能帮助我调用一个函数来更新数据库记录。所以要实现这个功能,我们实现一个 agentic 工作流来基本上执行这两个步骤。
在构建 agentic 工作流时,我把自己想象成拥有一些构建模块。一个重要的构建模块是大型语言模型,或者如果我想处理图像或音频,可能是大型多模态模型。LLM 擅长生成文本、决定调用什么、提取信息等。对于一些高度专业的任务,我也可能使用其他一些 AI 模型,比如用于将 PDF 转换为文本的模型,或用于文本转语音、图像分析的模型。
除了 AI 模型,我还可以使用一些软件工具,包括我可以调用的各种 API,用来进行网络搜索、获取实时天气数据、发送邮件、检查日历等等。我也可能有用于信息检索的工具,从数据库中提取数据,或者实现 RAG(检索增强生成),即我可以查询一个大型文本数据库并找到最相关的文本。或者我也可能有执行代码的工具,这个工具能让 LLM 编写代码,然后在你的计算机上运行这些代码,从而完成各种各样的事情。
如果这些工具对你来说有些陌生,别担心。我们会在后面的模块中更详细地介绍最重要的工具。但我认为,当我构建一个 agentic 工作流时,我的大部分工作就是观察一个人或一个企业正在做的事情,然后试图弄清楚,用这些构建模块,我如何将它们按顺序组合在一起,以执行我希望我的系统执行的任务。这就是为什么对有哪些可用的构建模块有一个很好的理解(我希望在本课程结束时你会有更好的了解)会让你能更好地构想出通过组合这些构建模块可以构建出什么样的 agentic 工作流。
总结一下,构建 agentic 工作流的关键技能之一,是审视一堆可能由某人完成的事情,并识别出可以用技术实现的离散步骤。当我看待单个离散步骤时,我总是在问自己一个问题:这个步骤能用一个 LLM,或者用我能接触到的一个工具(如 API 或函数调用)来实现吗?如果答案是否定的,我通常会问自己,我作为一个人会怎么做这一步?有没有可能把它进一步分解成更小的步骤,从而可能更容易用 LLM 或我拥有的某个软件工具来实现。
希望这能让你对如何思考任务分解有一个大致的了解。如果你觉得还没有完全掌握,别担心。我们将在本课程中讲解更多的例子,到课程结束时你会有更好的理解。但事实证明,当你构建 agentic 工作流时,你常常会先构建一个初步的任务分解和 agentic 工作流,然后你需要不断地迭代和改进它好几次,直到它达到你想要的性能水平。而要推动这个改进过程(我发现这对许多项目都很重要),关键技能之一就是知道如何评估你的 agentic 工作流。所以在下一个视频中,我们将讨论评估(evals)及其关键组成部分,以及你如何能够构建并持续改进你的工作流,以获得你想要的性能。让我们在下一个视频中讨论评估。
1.6 Agentic 评估 (Evals)
我曾与许多不同的团队合作构建 agentic 工作流,我发现,预测一个人能否做得非常好与效率较低的最大因素之一,就是他们是否能够推动一个非常规范的评估流程。所以,你为你的 agentic 工作流推动评估的能力,对你有效构建它们的能力有着巨大的影响。
在这个视频中,我们将快速概述如何构建评估,这个主题我们实际上会在本课程后面的一个模块中进行更深入的探讨。那么,让我们来看一看。
在构建了像这个用于回应客户订单查询的 agentic 工作流之后,事实证明,很难预先知道可能会出什么问题。因此,我建议不要试图预先构建评估,而是直接查看输出,并手动寻找你希望它做得更好的地方。
例如,也许你阅读了大量输出后发现,它出乎意料地提到了你的竞争对手,而且次数比应有的要多。许多企业不希望他们的代理提及竞争对手,因为这只会造成尴尬的局面。如果你读了其中一些输出,也许你会发现它有时会说:“很高兴您与我们购物,我们比我们的竞争对手 ComproCo 好得多。” 或者有时它会说:“当然,退货应该很有趣。不像 RivalCo,我们让退货变得简单。” 你看到这些可能会想,天啊,我真的不希望它提及竞争对手。
这是一个在构建此 agentic 工作流之前很难预见到的问题的例子。因此,最佳实践是先构建它,然后检查它以找出它还不令人满意的地方,接着找到评估以及改进系统的方法,以消除那些仍然不令人满意的地方。
假设你的企业认为以这种方式提及竞争对手是一个错误,那么当你努力消除这些提及竞争对手的言论时,一种跟踪进展的方法是添加一个评估(eval),来追踪这个错误发生的频率。所以,如果你有一个明确的竞争对手名单,比如 ComproCo、RivalCo、the other co,你实际上可以编写代码,在你的输出中搜索它多久会按名称提及这些竞争对手,并统计出一个数字,作为总回复数的一个比例,看它错误地提及竞争对手的频率是多少。
关于提及竞争对手这个问题,一个好处是它是一个客观的指标,意味着要么提到了竞争对手,要么没有。对于客观标准,你可以编写代码来检查这种特定错误发生的频率。
但因为 LLM 输出的是自由文本,你想要评估这个输出的标准中,也总会有一些更主观的,很难仅仅通过编写代码来输出一个非黑即白的分数。在这种情况下,使用 LLM 作为评判者(LLM as a judge)是一种评估输出的常用技术。
例如,如果你正在构建一个研究代理来对不同主题进行研究,那么你可以使用另一个 LLM,并提示它,比如说:“请为以下文章评定一个1到5分的质量分数,其中1分最差,5分最好。” 这里,我用一个 Python 表达式来表示把生成的文章复制粘贴到这里。所以,你可以提示 LLM 阅读文章并给它评定一个质量分数。然后我会让研究代理写几篇不同的研究报告,例如,关于黑洞科学的最新发展,或者用机器人收割水果。在这个例子中,也许评判 LLM 给关于黑洞的文章评了3分,给关于机器人收割的文章评了4分。随着你努力改进你的研究代理,希望你能看到这些分数随着时间的推移而上升。
顺便说一下,事实证明 LLM 实际上并不那么擅长这种1到5分的评级。你可以试一试,但我个人倾向于不怎么使用这种技术。但在后面的模块中,你会学到一些更好的技术,让 LLM 输出比要求它在1到5分制上输出分数更准确的分数,尽管有些人会这样做,可能作为一种初步的“LLM 作为评判者”类型的评估。
为了预告一下你将在本课程后面学到的一些 Agentic AI 评估方法:你已经听我谈过如何编写代码来评估客观标准,比如是否提到了竞争对手;或者使用 LLM 作为评判者来评估更主观的标准,比如一篇文章的质量。但之后,你会学到两种主要的评估类型:一种是端到端评估,你衡量整个代理的输出质量;以及组件级评估,你可能衡量 agentic 工作流中单个步骤的输出质量。事实证明,这些对于推动你开发过程的不同部分都很有用。
我经常做的另一件事是,检查中间输出,有时我们称之为 LLM 的“轨迹”(traces),以了解它在哪些方面没有达到我的期望。我们称之为错误分析,即我们通读每一步的中间输出,试图发现改进的机会。事实证明,能够进行评估和错误分析是一项非常关键的技能。所以我们将在本课程的第四个模块中对此有更多的阐述。
我们即将结束第一个模块。在继续之前,我只想与你分享我认为构建 agentic 工作流最重要的设计模式。让我们在下一个视频中看看它们。
1.7 Agentic 设计模式
我们通过将构建模块组合在一起,以序列化这些复杂的工作流来构建 agentic 工作流。在这个视频中,我想与你分享一些关键的设计模式,这些模式是关于你如何思考将这些构建模块组合成更复杂工作流的方法。让我们来看一看。
我认为构建 agentic 工作流的四个关键设计模式是:反思(reflection)、工具使用(tool-use)、规划(planning)和多代理协作(multi-agent collaboration)。让我简要介绍一下它们的含义,然后我们实际上会在课程的后面深入探讨其中的大部分内容。
1. 反思 (Reflection)
第一个主要的设计模式是反思。我可能会让一个 LLM 代理编写代码,结果 LLM 可能会生成像这样的代码。它在这里定义了一个 Python 函数来执行某个任务。然后我可以构建一个这样的提示:“这里是为某个任务编写的代码”,然后把 LLM 刚刚输出的内容复制粘贴到这个提示中。接着我让它“仔细检查代码的正确性、风格和效率,并给出建设性的批评”。事实证明,用这种方式提示的同一个 LLM 模型可能能够指出代码中的一些问题。如果我再把这个批评反馈给模型说:“看起来这里有个 bug,你能修改代码来修复它吗?” 那么它实际上可能会给出一个更好版本的代码。
预告一下工具使用,如果你能运行代码并看到代码在哪里失败,那么将这个信息反馈给 LLM 也能促使它迭代并生成一个好得多的,比如说,v3(第三版)的代码。所以,反思是一种常见的设计模式,你可以让 LLM 检查它自己的输出,或者可能引入一些外部信息源,比如运行代码看看是否会产生任何错误信息,并以此作为反馈再次迭代,从而得出其输出的一个更好版本。这种设计模式并非魔术,它不会导致所有事情百分之百成功,但有时它能很好地提升你系统的性能。
现在,我把它画得好像是我在提示单个 LLM,但为了预示多代理工作流,你也可以想象不是让同一个模型自我批评,而是可以有一个“批评家代理”。它就是一个被提示了类似指令的 LLM:“你的角色是批评代码,这里是为某个任务准备的代码,请仔细检查代码等等。” 第二个批评家代理可能会指出错误或运行单元测试。通过让两个模拟的代理(每个代理都只是一个被提示扮演特定角色的 LLM)来回交互,你可以让它们进行迭代,从而得到更好的输出。
2. 工具使用 (Tool Use)
除了反思模式,第二个重要的设计模式是工具使用。如今,LLM 可以被赋予工具,即它们可以调用的函数,以便完成工作。例如,如果你问一个 LLM:“根据评论者的说法,最好的咖啡机是什么?” 并给它一个网络搜索工具,那么它实际上可以搜索互联网来找到好得多的答案。或者一个代码执行工具,如果你问一个数学问题,比如:“如果我投资100美元并获得复利,最后我会有多少钱?” 那么它可以编写并执行代码来计算答案。
如今,不同的开发者已经给了 LLM 许多不同的工具,涵盖了从数学或数据分析,到通过从网络或各种数据库获取信息来收集信息,再到与生产力应用(如电子邮件、日历等)交互,以及处理图像等等。LLM 能够决定使用什么工具(即调用什么函数),这让模型能够完成更多的工作。
3. 规划 (Planning)
四个设计模式中的第三个是规划。这是一个来自一篇名为 Hugging GPT 的论文的例子。在这个例子中,如果你要求一个系统:“请生成一张图片,一个女孩在读书,姿势和一个男孩在图片中的姿势相同,然后请用你的声音描述新图片。” 那么一个模型可以自动决定,要执行这个任务,它首先需要找到一个姿态确定模型来弄清楚男孩的姿态;然后进行姿态到图像的转换,生成一个女孩的图片;接着是图像到文本的转换;最后是文本到语音的转换。
所以在规划中,LLM 决定了它需要采取的行动序列。在这种情况下,它是一个 API 调用的序列,以便它能够按正确的顺序执行正确的步骤序列,从而完成任务。所以,不是由开发者预先硬编码步骤序列,而是让 LLM 自己决定要采取哪些步骤。今天能够进行规划的代理更难控制,也更具实验性,但有时它们能给出非常令人惊喜的结果。
4. 多代理工作流 (Multi-agent Workflows)
最后是多代理工作流。就像一个人类经理可能会雇佣一些人来共同完成一个复杂的项目一样,在某些情况下,你可能会考虑雇佣一组多个代理,每个代理可能专注于不同的角色,并让它们共同协作来完成一个复杂的任务。
你在这里左边看到的图片来自一个名为 ChatDev 的项目,这是一个由钱晨(Chen Qian)及其合作者创建的软件框架。在 ChatDev 中,多个具有不同角色的代理,如首席执行官、程序员、测试员、设计师等,像一个虚拟软件公司一样协同工作,可以协作完成一系列软件开发任务。
让我们考虑另一个例子。如果你想写一本营销手册,你可能会考虑雇佣一个三人团队,比如一个研究员做在线研究,一个营销人员撰写营销文案,最后是一个编辑来编辑和润色文本。所以,类似地,你可能会考虑构建一个多代理工作流,其中你有一个模拟的研究代理,一个模拟的营销代理,和一个模拟的编辑代理,它们共同来为你执行这个任务。多代理工作流更难控制,因为你并不总能提前知道代理们会做什么,但研究表明,对于许多复杂的任务,包括像写传记或决定在棋局中下哪一步棋这样的事情,它们可以带来更好的结果。你也会在本课程的后面学到更多关于多代理工作流的内容。
那么,希望你现在对 agentic 工作流能做什么,以及找到构建模块并将它们(或许通过这些设计模式)组合在一起以实现 agentic 工作流所面临的关键挑战,有了一个概念。当然,还有开发评估方法,这样你就能看到你的系统表现如何,并不断改进它。
在下一个模块中,我想与你深入探讨这些设计模式中的第一个,即反思。你会发现,这是一种实现起来可能出乎意料地简单,但有时能给你的系统性能带来非常好的提升的技术。那么,让我们进入下一个模块,学习反思设计模式。
4:00
this actually ends up with a much more thoughtful report than just prompting an LLM to write an essay
4:07
for you would. One of the reasons I'm excited about this is because in my work, I've ended up
4:12
building quite a few specialized research agents, be it in legal documents for conflict legal
4:18
compliance, or for some healthcare sectors, or some business product research areas. And so I hope
4:24
that working through this example, you not only learn how to build agentic workflows for many
4:29
other applications, but that some of the ideas in building research agents will be directly useful
4:34
to you if you ever need to build a custom research agent yourself. Now, one of the often discussed
4:42
areas of AI agents is how autonomous are they? What you just saw here was a relatively complex,
4:49
highly autonomous Agentic AI workflow, but there are also other simpler workflows that are
4:55
incredibly valuable. Let's go on to the next video to talk about the degree to which agentic
5:00
workflows can be autonomous, and does it give you a framework to think about how
5:05
you might go about building different applications and how easy or difficult they might be.
5:10
See you in the next video.
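The outline → research → draft → reflect → revise loop described above can be sketched in a few lines of Python. This is a minimal sketch, not a real implementation: `call_llm` and `web_search` are hypothetical stand-ins for an LLM API and a web search API, written here as deterministic stubs so the control flow is easy to follow.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call (stub for illustration)."""
    return f"[LLM output for: {prompt[:40]}...]"

def web_search(query: str) -> list[str]:
    """Stand-in for a real web search API; returns page texts (stub)."""
    return [f"[page content for '{query}']"]

def write_essay(topic: str, max_revisions: int = 2) -> str:
    # Step 1: outline first, rather than generating the essay in one go.
    outline = call_llm(f"Write an essay outline on: {topic}")
    # Step 2: let the LLM propose search queries, then fetch pages.
    queries = call_llm(f"List web search queries to research this outline:\n{outline}")
    pages = [page for q in queries.splitlines() for page in web_search(q)]
    # Step 3: draft from the downloaded sources.
    draft = call_llm(f"Using these sources, draft an essay on {topic}:\n" + "\n".join(pages))
    # Steps 4-5: iterate - reflect on the draft, then revise it.
    for _ in range(max_revisions):
        critique = call_llm(f"Critique this draft and list needed fixes:\n{draft}")
        draft = call_llm(f"Revise the draft to address:\n{critique}\n\nDraft:\n{draft}")
    return draft

essay = write_essay("black holes")
```

A human-in-the-loop step could slot in between critique and revision, pausing for review of key facts before the revise call.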
1.2 Degrees of autonomy

Agents can be autonomous to different degrees. A few years ago, I noticed within the AI community a growing, contentious debate about what an agent is: some people would write a paper saying "I built an agent," and others would say, "no, that's not really a true agent." I felt this debate was unnecessary, which is why I started using the term agentic. I thought that if we use it as an adjective, rather than as a binary "it's either an agent or not," then we acknowledge that systems can be agentic to different degrees. Let's just call it all agentic and move on with the real work of building these systems, rather than debating whether this or that system is sufficiently autonomous to be an agent. I remember when I prepared a talk on agentic reasoning, one of my team members actually came to me and said, "hey Andrew, we don't need yet another word. We have agent; why are you making up another word, agentic?" But I decided to use it anyway. Later on, I wrote an article in my newsletter, The Batch, and also posted on social media, saying that instead of arguing over which systems to include or exclude as true agents, let's acknowledge the different degrees to which systems can be agentic. I think this helped move past the debate on what a true agent is and let us focus on actually building them.

Some agents can be less autonomous. Take the example of writing an essay about black holes. You can have a relatively simple agent come up with a few web search terms or web search queries. Then you can hard-code that you call a web search engine, fetch some web pages, and then use that to write an essay. This is an example of a less autonomous agent with a fully deterministic sequence of steps, and it will work okay.

In terms of notational convention, throughout this course I'll use the red color, as you see here on the left, to denote the user input, such as a user query in this case, or in later examples maybe the input document into an agentic workflow. The gray boxes denote calls to an LLM, and the green boxes, like the web search and web fetch boxes you see here, indicate steps where other software is being used to carry out an action, such as a web search API call or executing code to fetch the contents of a website.

An agent can also be more autonomous, where, given a request to write an essay about black holes, perhaps you let the LLM decide: does it want to do a web search, or search recent news sources, or search for recent research papers on the website arXiv? Based on that, maybe the LLM, not the human engineer, chooses, in this case, to call a web search engine. After that, you may let the LLM decide how many web pages it wants to fetch, or, if it fetches a PDF, whether it needs to call a function, also called a tool, to convert the PDF to text. In this case, maybe it fetches its top few web pages; then it can write an essay, decide whether to reflect and improve, maybe even go back to fetch more web pages, and finally produce an output.

So even for this example of a research agent, we can see that some agents can be less autonomous, with a linear sequence of steps to be executed, determined by a programmer, and some can be more autonomous, where you trust the LLM to make more decisions, and the exact sequence of steps that happens may even be determined by the LLM rather than in advance by the programmer. For less autonomous systems, you will usually have all the steps predetermined in advance, and any functions it calls, like web search, which we'll call tool use, as you'll learn in the third module of this course, might be hard-coded by the human engineer, by you or me, and most of the autonomy is in what text the LLM generates. At the other end of the spectrum would be highly autonomous agents, where the agent makes many decisions autonomously, including, for example, deciding what sequence of steps it will carry out in order to write the essay. There are even some highly autonomous agents that can write new functions, or sometimes create new tools, that they can then execute. Somewhere in between are semi-autonomous agents, which can make some decisions and choose tools, but the tools are usually more predefined.

As you look at different examples in this course, you'll learn how to build applications anywhere on this spectrum from less to more highly autonomous, and you'll find that there are tons of very valuable applications at the less autonomous end of the spectrum being built for tons of businesses today. At the same time, there are also applications being worked on at the more highly autonomous end of the spectrum, but those are usually less easily controllable, a little bit more unpredictable, and there's also a lot of active research into how to build these more highly autonomous agents. With that, let's go on to the next video to dive deeper into this, and to hear about some of the benefits of using agents and why they allow us to do things that just were not possible with earlier generations of AI applications.
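The two ends of the spectrum just described can be contrasted in code. This is a sketch under stated assumptions: `llm`, `web_search`, and `web_fetch` are hypothetical stubs with canned, deterministic behavior, not real model or API calls. The point is the difference in control flow: in the less autonomous version the engineer hard-codes the sequence of steps; in the more autonomous version the LLM chooses the next tool on each iteration.

```python
def llm(prompt: str) -> str:
    # Stub standing in for a real model call. For the tool-choosing prompt it
    # returns canned decisions so the example runs deterministically.
    if "choose the next tool" in prompt:
        return "web_search" if "web_search result" not in prompt else "done"
    return f"[text for: {prompt[:30]}]"

def web_search(q: str) -> str: return f"web_search result for '{q}'"
def web_fetch(url: str) -> str: return f"web_fetch result for '{url}'"

# Less autonomous: a fully deterministic, engineer-defined sequence of steps.
def less_autonomous(query: str) -> str:
    terms = llm(f"Suggest search terms for: {query}")
    pages = web_search(terms)              # tool call is hard-coded
    return llm(f"Write an essay from: {pages}")

# More autonomous: the LLM picks which tool to call next, until it says "done".
def more_autonomous(query: str) -> str:
    tools = {"web_search": web_search, "web_fetch": web_fetch}
    context = query
    for _ in range(5):                     # cap the number of steps for safety
        choice = llm(f"choose the next tool given context: {context}")
        if choice == "done":
            break
        context += "\n" + tools[choice](query)
    return llm(f"Write an essay from: {context}")
```

Note the step cap in the more autonomous loop: because the LLM, not the programmer, decides when to stop, production systems typically bound the number of iterations.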
1.3 Benefits of agentic workflows

I think the single biggest benefit of agentic workflows is that they let you do many tasks effectively that previously were just not possible. But there are other benefits as well, including parallelism, which lets you do certain things quite fast, and modularity, which lets you combine the best components from many different places to build an effective workflow. Let's take a look.

My team collected some data on a coding benchmark that tests the ability of different LLMs to write code to carry out certain tasks. The benchmark used in this case is called HumanEval. It turns out that GPT-3.5, the model that the first publicly available version of ChatGPT was based on, if asked to write the code directly, to just type out the computer program, gets 40% right on this benchmark, measured with the pass@1 metric. GPT-4 is a much better model; its performance leaps to 67% with this same non-agentic workflow. But it turns out that as large as the improvement from GPT-3.5 to GPT-4 was, it is dwarfed by what you can achieve by wrapping GPT-3.5 in an agentic workflow. Using different agentic techniques, which you'll learn about later in this course, you can prompt GPT-3.5 to write code and then maybe reflect on the code and figure out whether it can be improved. Using techniques like that, you can actually get GPT-3.5 to much higher levels of performance. Similarly, GPT-4 used in the context of an agentic workflow also does much better. So even with today's best LLMs, an agentic workflow lets you get much better performance. In fact, what we saw in this example was that the improvement from one generation of model to the next, which is huge, is still not as big a difference as implementing an agentic workflow on the previous generation of model.

Another benefit of agentic workflows is that they can parallelize some tasks and thus do certain things much faster than a human. For example, if you ask an agentic workflow to write an essay about black holes, you might have three LLMs run in parallel to generate ideas for web search terms to type into the search engine. Based on the first web search, it may identify, say, three top results to fetch; based on the second web search, it may identify a second set of web pages to fetch; and so on. It turns out that whereas a human doing this research would have to read these nine web pages sequentially, one at a time, an agentic workflow can actually parallelize all nine web page downloads and then finally feed everything into an LLM to write the essay. So even though agentic workflows take longer than truly non-agentic workflows, that is, direct generation by just prompting a single time, if you compare this type of agentic workflow to how a human would have to go about the task, the ability to parallelize downloading lots of web pages can let it do certain tasks much faster than the non-parallel, sequential way that a single human might process this data.

To build on this example, one of the things I often do when building agentic workflows is look at the individual components, like the LLM, and add or swap out components. For example, I might look at the web search engine I use up here and decide that I want to swap in a new one. When building agentic workflows, there are actually multiple web search engines available, including Google, which you can access via an API, as well as others like Bing, DuckDuckGo, Tavily, and You.com. There are quite a lot of options for web search engines designed for LLMs to use. Or maybe, instead of just doing three web searches, on this step we could swap in a news search engine, so we can find out the latest news on recent breakthroughs in black hole science. And lastly, instead of using the same LLM for all of the different steps, I will often try out different large language models, and maybe different LLM providers, to see which one gives the best result for each step of the system.

So to summarize, the main reason I use agentic workflows is that they just give much better performance on many different applications. In addition, they can parallelize some tasks that humans would otherwise have to do sequentially, and the modular design of many agentic workflows also lets us add or update tools and sometimes swap out models. We've talked a lot about the key components of building agentic workflows. Let's now take a look at a range of Agentic AI applications to give you a sense of the sorts of things people are already building, and the sorts of things you'll build yourself. Let's go on to the next video.
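The nine-page parallelism example above maps directly onto a thread pool. A minimal sketch: `fetch_page` is a stub that simulates network latency (a real version would use an HTTP client), and the URLs are hypothetical. With nine workers, all nine downloads overlap, so total wall time is roughly one page's latency instead of nine.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_page(url: str) -> str:
    """Stub download: sleeps to simulate network latency."""
    time.sleep(0.1)
    return f"[contents of {url}]"

# Hypothetical URLs: three top results from each of three web searches.
urls = [f"https://example.com/page{i}" for i in range(9)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=9) as pool:
    pages = list(pool.map(fetch_page, urls))   # all nine fetched concurrently
elapsed = time.perf_counter() - start
# Sequential would take ~0.9s here; the parallel version takes ~0.1s.

# The nine page bodies can now be fed into a single LLM call to write the essay.
```

The same pattern works for the parallel LLM calls generating search terms, since those are also I/O-bound API requests.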
1.4 Agentic AI applications

Let's take a look at some examples of Agentic AI applications.

One task that many businesses carry out is invoice processing. Given an invoice like this, you might want to write software to extract the most important fields, which for this application, let's say, are the biller (here, TechFlow Solutions), the biller address, the amount due ($3,000), and the due date (which looks like August 20th, 2025). In many finance departments, a human would look at invoices, identify the most important fields (who do we need to pay, and by when), and record these in a database to make sure that payment is issued in time. If you were to implement this with an agentic workflow, you might do it like this. You input an invoice, then call a PDF-to-text conversion API to turn the PDF into formatted text, such as markdown, for the LLM to ingest. Then the LLM looks at the document and figures out: is this actually an invoice, or is it some other type of document that it should just ignore? If it is an invoice, it pulls out the required fields and uses an API, or a tool, to update the database, saving the most important fields in the database records. One aspect of this agentic workflow is that there is a clear process to follow: identify the required fields and record them in the database. Tasks like these, with a clear process you want followed, tend to be easier for agentic workflows, because the clear process leads to a relatively step-by-step way to reliably carry out the task.

Here's another example, maybe just a little bit harder. If you want to build an agent to respond to basic customer order inquiries, the steps might be to extract the key information (figure out what exactly the customer ordered and what the customer's name is), then look up the relevant customer records, and finally draft a response for a human to review before the email is sent to the customer. Again, there's a clear process here, and we can implement it step by step: we take the email, feed it to an LLM to extract the order details, and, assuming the customer email is about an order, the LLM might then choose to call an orders database to pull up that information. That information then goes to the LLM to draft an email response, and the LLM might choose to use a request-review tool that, say, puts the draft email into a queue for humans to review, so it can be sent out after a human has reviewed and approved it. Customer order inquiry agents like these are being built and deployed in many businesses today.

To look at a more challenging example: suppose you want to build a customer service agent that responds not just to questions about an order the customer placed, but to a more general set of questions, anything a customer may ask. Maybe the customer asks, "do you have any black jeans or blue jeans?" To answer this question, you might need to make multiple API calls to your database: first check the inventory for black jeans, then check the inventory for blue jeans, and then respond to the customer. This is an example of a more challenging query, where, given a user input, you actually have to plan out the sequence of database queries to check inventory. Or if a user asks, "I'd like to return the beach towel I bought," then to handle this, maybe we need to verify that the customer actually bought a beach towel, and then double-check the return policy. Maybe our policy allows returns only within 30 days of the date of purchase, and only if the towel is unused. If the return is allowed, the agent then issues a return packing slip and sets the database record to "return pending." In this example, the steps required to process the customer's request are not known ahead of time, which makes for a more challenging process, where the LLM-based application has to decide for itself that these are the three steps needed to respond appropriately. You'll learn about some of the latest work on how to approach this type of problem too.

To give one last example, of maybe an especially difficult type of agent to build: there's a lot of work on computer use by agents, in which agents attempt to use a web browser and read a web page to figure out how to carry out a complex task. In this example, I've asked an agent to check whether seats are available on two specific United Airlines flights from San Francisco to Washington DC's DCA airport. The agent has access to a web browser it can use to carry out this task. In the video here, you can see it navigating the United website independently, clicking on page elements and filling in the text fields on the page to carry out the search I requested. As it works, the agent reasons over the content of the page to figure out the actions it needs to take to complete the task, and what it should do next. In this case, it has some trouble checking flights on the United site, and instead decides to navigate to the Google Flights website to search for available flights. On Google Flights, you can see it finds several flight options that match the query; the agent then picks one and is taken back to the United website, where it looks like it's now on the correct web page, and so it is able to determine that yes, there are seats available on the flights I asked about. Computer use is an exciting, cutting-edge area of research right now, and many companies are trying to get computer use agents to work. While the agent you saw here did eventually figure out the answer, I often see agents having trouble using web browsers well. For example, if a web page is slow to load, an agent may fail to understand what's going on, and many web pages are still beyond agents' abilities to parse or read accurately. But I think computer use agents, even though not yet reliable enough for mission-critical applications today, are an exciting and important area of future development.

So when I'm considering building Agentic AI workflows, the easier tasks tend to be ones where there is a clear step-by-step process, or where a business already has a standard operating procedure to follow. It can still be quite a lot of work to take that procedure and codify it in an AI agent, but it tends to lead to easier implementations. Another thing that makes it easier is working with text-only inputs, because LLMs have grown up processing text; if you need to process other input modalities, it may well be doable, but it gets a little bit harder. On the harder end of the spectrum, if the steps needed to carry out a task are not known ahead of time, like you saw for the more advanced customer service agent, then the agent may need to plan and solve as it goes, and this tends to be harder, more unpredictable, and less reliable. And, as mentioned, if it needs to accept rich multimodal inputs such as images or audio, that also tends to be less reliable than when it only has to process text.

I hope that gives you a sense of the types of applications you might build with agentic workflows. When implementing one of these yourself, one of the most important skills is to look at a complex workflow and figure out the individual steps, so you can implement an agentic workflow to execute those steps one at a time. In the next video, we'll talk about task decomposition: given a complex thing you want to do, like write a research report or have a customer agent get back to customers, how do you break that down into discrete steps so you can implement an agentic workflow? Let's go see that in the next video.
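The invoice workflow described above (extract fields, then record them) can be sketched as two steps around a database. Assumptions: `extract_fields_llm` is a stub returning the JSON a real LLM might be prompted to produce, the field names are illustrative, and SQLite stands in for whatever database a finance team actually uses.

```python
import json
import sqlite3

def extract_fields_llm(invoice_text: str) -> str:
    # Stand-in for an LLM prompted to return the key invoice fields as JSON.
    # Values match the example invoice from the text.
    return json.dumps({"biller": "TechFlow Solutions",
                       "amount_due": 3000.00,
                       "due_date": "2025-08-20"})

def process_invoice(invoice_text: str, conn: sqlite3.Connection) -> dict:
    # Step 1: LLM extracts the required fields from the PDF-to-text output.
    fields = json.loads(extract_fields_llm(invoice_text))
    # Step 2: a tool call records them so payment can be issued in time.
    conn.execute(
        "INSERT INTO invoices (biller, amount_due, due_date) VALUES (?, ?, ?)",
        (fields["biller"], fields["amount_due"], fields["due_date"]))
    conn.commit()
    return fields

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (biller TEXT, amount_due REAL, due_date TEXT)")
fields = process_invoice("...invoice text from the PDF-to-text step...", conn)
```

A production version would also add the "is this actually an invoice?" classification step before extraction, and validate the parsed JSON before writing to the database.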
1.5 Task decomposition: Identifying the steps in a workflow

People and businesses do a lot of stuff. How do you take this useful stuff that we do and break it down into discrete steps for an agentic workflow to follow? Let's take a look.

Take the example of building a research agent. If you want an AI system to write an essay on a topic X, one thing you could do is prompt an LLM to generate an output directly. But if you do this for topics that you want deeply researched, you may find that the LLM output covers only the surface-level points, or maybe only the obvious facts, and doesn't go as deep into the subject as you'd like. In this case, you might reflect on how you, as a human, would write an essay on a certain topic. Would you just sit down and start writing, or would you take multiple steps, such as first writing an essay outline, then searching the web, and then, based on the input from the web search, writing the essay?

As I take a task and decompose it into steps, one question I'm always asking myself is: if I look at these steps one, two, and three, can each of them be done either by an LLM, or by a short piece of code, or by a function call, or by a tool? In this case, I think an LLM can write a decent outline on many topics that I'd want it to help me think through, so I'd say the first step is probably okay. I know how to use an LLM to generate search terms to search the web, so the second step is also doable. And based on the web search, I think an LLM could take the web search results as input and write an essay. So this would be a reasonable first attempt at an agentic workflow for writing an essay that goes deeper than direct generation.

But if I then implement this agentic workflow and look at the results, maybe I find they still aren't good enough: still not as deeply thoughtful as I'd like, and maybe the essays feel a little bit disjointed. This has actually happened to me. I once built a research agent using this workflow, but when I read the output, it felt a bit disjointed: the start of the article didn't feel completely consistent with the middle, which didn't feel completely consistent with the end. In this case, what you might do is reflect on how you would change the workflow if you, as a human, found the essay a little bit disjointed. One thing you could do is take the third step, "write the essay," and further decompose it into additional steps. Instead of writing the essay in one go, you might have it write a first draft, then consider what parts need revision, and then revise the draft. This is how I, as a human, might go about it: not write the final essay on my first attempt, but write a first draft and then read over it, which is another step that an LLM is pretty decent at, and then, based on my own critique of my own essay, revise the draft. So to recap: I started off with direct generation, just one step; decided it wasn't good enough, so broke it down into three steps; then maybe decided that still wasn't good enough, and took one of the steps and further decomposed it into three more steps, resulting in this more complex, richer process for generating an essay. Depending on how satisfied you are with the results, you may choose to modify the essay generation process even further.

Let's look at a second example of decomposing a complex task into smaller steps. Take the example of responding to basic customer order inquiries. The first step that a human customer service specialist might carry out is to extract the key information, such as who this email is from, what they ordered, and what the order number is. These are things an LLM can do, so I could just have an LLM do that. The second step would be to find the relevant customer records, that is, to generate the relevant database queries to pull up what the customer ordered, when it shipped, and so on. I think an LLM with the ability to call a function to query the orders database should be able to do that. And lastly, having pulled up the customer order record, I would then write and send a response back to the customer. With the information we pulled up, this third step is also doable with an LLM, if I give it the option to call an API to send an email. So this is another example of taking a task, responding to customer email, and breaking it down into three individual steps, where I can look at each step and say: yep, I think an LLM, or an LLM with the ability to call a function to query a database or send an email, should be able to do that.

Just one last example, for invoice processing. After a PDF invoice has been converted to text, the first step is to pull out the required information: the name of the biller, the address, the due date, the amount due, and so on. An LLM should be able to do that. And then, if I want to take the extracted information and save it in a new database entry, I think an LLM should be able to help me call a function to update the database record. So to implement this, we build an agentic workflow to carry out basically these two steps.

When building agentic workflows, I think of myself as having a number of building blocks. One important building block is large language models, or maybe large multimodal models if I want to process images or audio as well; LLMs are good at generating text, deciding what to call, and maybe extracting information. For some highly specialized tasks, I might also use other AI models, such as a model for converting a PDF to text, or for text-to-speech, or for image analysis. In addition to AI models, I also have access to a number of software tools, including different APIs that I can call to do web search, to get real-time weather data, to send emails, to check calendars, and so on. I might also have tools to retrieve information, to pull up data from a database, or to implement RAG, retrieval augmented generation, where I look up a large text database and find the most relevant text. Or I might have tools to execute code; this is a tool that lets an LLM write code and then run that code on your computer to do a huge range of things. In case some of these tools seem a bit foreign to you, don't worry about it; we'll go through the most important ones in much greater detail in a later module. But I think of a lot of my work, when I'm building an agentic workflow, as looking at the work that a person or business is doing, and then trying to figure out, with these building blocks, how to sequence them together to carry out the tasks that I want my system to carry out. This is why having a good understanding of what building blocks are available, which I hope you'll have a better sense of by the end of this course, will allow you to better envision what agentic workflows you can build by combining these building blocks.

So to summarize, one of the key skills in building agentic workflows is to look at a bunch of stuff that someone does and to identify the discrete steps with which it could be implemented. When I'm looking at the individual discrete steps, one question I'm always asking myself is: can this step be implemented with either an LLM or with one of the tools, such as an API or a function call, that I have access to? If the answer is no, I'll often ask myself: how would I, as a human, do this step, and is it possible to decompose it further, into even smaller steps that are more amenable to implementation with an LLM or with one of the software tools that I have? I hope this gives you a rough sense of how to think about task decomposition. In case you feel like you don't fully have it yet, don't worry; we'll go through many more examples in this course, and you'll have a much better understanding by the end. But it turns out that as you build agentic workflows, you'll often find that you build an initial task decomposition, an initial agentic workflow, and then you want to keep iterating and improving it quite a few times until it delivers the level of performance that you want. To drive this improvement process, which I've found important for many projects, one of the key skills is knowing how to evaluate your agentic workflow. So in the next video, we'll talk about evaluations, or evals, a key component of how you can build, and then keep improving, your workflows to get the performance that you want. Let's talk about evals in the next video.
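The building-blocks idea above, sequencing LLM steps and tool steps into a workflow, can be sketched as a pipeline of composable callables. This is an illustrative toy, not a real framework: each "block" is a stub function that takes and returns a state dict, with hypothetical step names drawn from the customer order example.

```python
def extract_info(state: dict) -> dict:
    # Block 1 (LLM step, stubbed): parse the customer email for key info.
    state["order_id"] = "A123"
    return state

def lookup_order(state: dict) -> dict:
    # Block 2 (tool step, stubbed): query the orders database.
    state["record"] = {"order_id": state["order_id"], "item": "beach towel"}
    return state

def draft_reply(state: dict) -> dict:
    # Block 3 (LLM step, stubbed): draft the email response for human review.
    state["reply"] = (f"Your order {state['order_id']} "
                      f"({state['record']['item']}) has shipped.")
    return state

def run_workflow(steps, state: dict) -> dict:
    """Sequence the building blocks: each step reads and extends the state."""
    for step in steps:
        state = step(state)
    return state

result = run_workflow([extract_info, lookup_order, draft_reply],
                      {"email": "Where is my beach towel order?"})
```

The decomposition question from the text maps onto this structure directly: each candidate step becomes a block only once you're confident an LLM or a tool can implement it; otherwise you split that block further.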
1.6 Evaluations agentic (evals)
I've worked with many different teams on building agentic workflows, and I've found that one of the biggest predictors of whether someone is able to do it really well, versus being less efficient at it, is whether or not they're able to drive a really disciplined evaluation process. So your ability to drive evals for your agentic workflow makes a huge difference in your ability to build them effectively. In this video, we'll take a quick overview of how to build evals; this is a subject that we'll go into much more deeply in a later module of this course. So let's take a look.

After building an agentic workflow like this one for responding to customer order inquiries, it turns out that it's very difficult to know in advance what the things are that could go wrong. And so, rather than trying to build evaluations in advance, what I recommend is that you just look at the outputs and manually look for things that you wish it was doing better. For example, maybe you read a lot of outputs and find that it is unexpectedly mentioning your competitors more than it should. Many businesses don't want their agents to mention competitors, because it just creates an awkward situation. And if you read some of these outputs, maybe you find that it sometimes says, "I'm glad you shopped with us. We're much better than our competitor, ComproCo." Or maybe it sometimes says, "Sure, that should be no problem. Unlike RivalCo, we make returns easy." And you may look at this and go, gee, I really don't want this to mention competitors. This is an example of a problem that is really hard to anticipate in advance of building the agentic workflow. So the best practice is really to build it first and then examine it to figure out where it is not yet satisfactory, and then to find ways to evaluate, as well as improve, the system to eliminate the ways in which it is still not yet satisfactory.
Assuming your business considers it an error or a mistake to mention competitors in this way, then as you work on eliminating these competitor mentions, one way to track progress is to add an evaluation, or eval, to track how often this error occurs. So if you have a named list of competitors, like ComproCo, RivalCo, and so on, then you can actually write code to search your own output for how often it mentions these competitors by name, and count up, as a fraction of the overall responses, how frequently it mistakenly mentions competitors. One nice thing about the problem of competitor mentions is that it's an objective metric: either the competitor was mentioned or not. And for objective criteria, you can write code to check how often this specific error occurs.
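As a minimal sketch of such an objective eval (the competitor names and sample responses below are hypothetical), the check can be plain string matching:

```python
# Objective eval sketch: count how often agent responses mention a
# competitor by name, as a fraction of all responses.

def competitor_mention_rate(responses, competitors):
    """Return the fraction of responses that mention any competitor."""
    def mentions_competitor(text):
        lowered = text.lower()
        return any(name.lower() in lowered for name in competitors)

    flagged = [r for r in responses if mentions_competitor(r)]
    return len(flagged) / len(responses)

# Hypothetical sample outputs collected from the agent:
competitors = ["ComproCo", "RivalCo"]
responses = [
    "I'm glad you shopped with us. We're much better than ComproCo.",
    "Your order has shipped and should arrive Tuesday.",
    "Unlike RivalCo, we make returns easy.",
    "You can track your package from your account page.",
]
print(competitor_mention_rate(responses, competitors))  # → 0.5
```

Because the metric is objective, this number can be tracked across iterations of the workflow to confirm the error rate is going down.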
2:46
But because LLMs output free text, there are also going to be criteria by which you want
2:51
to evaluate this output that may be more subjective and where it's harder to just write code
2:57
to output a black and white score.
2:59
In this case, using a LLM as a judge is a common technique to evaluate the output.
3:05
So, for example, if you're building a research agent to do research on different topics,
3:10
then you can use another LLM and prompt it to maybe, say, assign the following essay
3:16
a quality score between 1 and 5, where 1 is the worst and 5 is the best essay.
3:21
Here, I'm using a Python expression to mean copy-paste the generated essay into this.
3:27
So, you can prompt the LLM to read the essay and assign it a quality score.
3:32
Then I'm going to ask the research agent to write a number of different research reports,
3:37
for example, on recent developments in black hole science or using robots to harvest fruit.
3:43
And then in this example, maybe the judge LLM assigns the essay on black holes a score
3:48
of 3, the essay on robot harvesting a score of 4, and as you work on improving your research
3:54
agent, hopefully you see these scores go up over time.
3:58
It turns out, by the way, that LLMs are actually not that good at these 1 to 5 scale ratings.
4:03
You can give it a shot, but I personally tend not to use this technique that much myself.
4:08
But in a later module, you'll learn some better techniques to have an LLM output more accurate
4:13
scores than asking it to output scores on a 1 to 5 scale, although some people will
4:17
do this, maybe an initial cut as an LLM as judge type of eval.
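A minimal sketch of this kind of LLM-as-judge eval; `call_llm` is a hypothetical stand-in for whichever LLM API you use, so only the prompt construction and score parsing are shown concretely:

```python
# LLM-as-judge sketch: build a judging prompt, send it to an LLM
# (stubbed here), and parse the reply into an integer score.

def build_judge_prompt(essay: str) -> str:
    """Construct the 1-to-5 quality-scoring prompt for the judge LLM."""
    return (
        "Assign the following essay a quality score between 1 and 5, "
        "where 1 is the worst and 5 is the best essay. "
        "Respond with only the number.\n\n"
        f"Essay:\n{essay}"
    )

def judge_essay(essay: str, call_llm) -> int:
    """Ask the judge LLM for a score and parse it as an integer."""
    reply = call_llm(build_judge_prompt(essay))
    return int(reply.strip())

# Usage with a stubbed LLM that always answers "4":
score = judge_essay("Recent developments in black hole science ...", lambda p: "4")
print(score)  # → 4
```

In practice the reply parsing needs to be more defensive, since a judge model may wrap the number in extra text; that is one reason raw 1-to-5 scoring is fragile.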
Just to give a preview of some of the agentic AI evals you'll learn about later in this course: you've already heard me talk about how you can write code to evaluate objective criteria, such as whether it mentioned a competitor or not, or use an LLM as a judge for more subjective criteria, such as the quality of an essay. But later, you'll learn about two major types of evals. One is end-to-end evals, where you measure the output quality of the entire agent; the other is component-level evals, where you might measure the quality of the output of a single step in the agentic workflow. It turns out that these are useful for driving different parts of your development process. One thing I do a lot as well is just examine the intermediate outputs, or what we sometimes call the traces of the LLM, in order to understand where it is falling short of my expectations. We call this error analysis: we just read through the intermediate outputs of every single step to try to spot opportunities for improvement. And it turns out that being able to do evals and error analysis is a really key skill, so we have much more to say about this in the fourth module of this course.
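As a minimal sketch of collecting such traces for error analysis (the two-step workflow below is hypothetical), each step's output can simply be recorded as the workflow runs, so you can later read through every intermediate result:

```python
# Trace-collection sketch for error analysis: run a sequence of steps
# and record each step's output so intermediate results can be reviewed.

def run_with_traces(steps, initial_input):
    """Run (name, fn) steps in order, recording each step's output."""
    traces = []
    value = initial_input
    for name, fn in steps:
        value = fn(value)
        traces.append({"step": name, "output": value})
    return value, traces

# Hypothetical two-step workflow: draft a reply, then redact competitor names.
steps = [
    ("draft", lambda q: f"Reply to: {q}. We're better than RivalCo!"),
    ("redact", lambda d: d.replace("RivalCo", "[competitor]")),
]
final, traces = run_with_traces(steps, "Where is my order?")
for t in traces:
    print(t["step"], "->", t["output"])
```

Reading the trace makes it obvious here that the problem was introduced by the draft step, not the redact step, which is exactly the kind of observation error analysis is meant to surface.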
We're nearly at the end of this first module. Before moving on, I just want to share with you what I think are the most important design patterns for building agentic workflows. Let's go take a look at that in the next video.
1.7 Agentic design patterns
We build agentic workflows by taking building blocks and putting them together to sequence out these complex workflows. In this video, I'd like to share with you a few of the key design patterns, which are patterns for how you can think about combining these building blocks into more complex workflows. Let's take a look. I think four key design patterns for building agentic workflows are reflection, tool use, planning, and multi-agent collaboration. Let me briefly go over what they mean, and then we'll actually go through most of these in depth later in this course as well.
The first of the major design patterns is reflection. I might go to an LLM agent and ask it to write code, and the LLM might then generate code that defines a Python function to do a certain task. I could then construct a prompt that says, here's code intended for a certain task, and copy-paste whatever the LLM had just output back into this prompt. Then I ask it to check the code carefully for correctness, style, and efficiency, and give constructive criticism. It turns out that the same LLM, prompted this way, may be able to point out some problems with the code. And if I then take this critique and feed it back to the model, saying, it looks like there's a bug, could you change the code to fix it, then it may actually come up with a better version of the code. To give a preview of tool use: if you're able to run the code and see where it fails, then feeding that back to the LLM can also let it iterate and generate a much better, say, version 3 of the code. So reflection is a common design pattern where you ask the LLM to examine its own outputs, or maybe bring in some external sources of information, such as running the code to see if it generates any error messages, and use that as feedback to iterate again and come up with a better version of its output. This design pattern isn't magic; it does not result in everything working 100% of the time. But sometimes it can give a nice bump in the performance of your system.
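A minimal sketch of this generate-critique-revise loop; `call_llm` is a hypothetical stand-in for your LLM API, and the prompts follow the ones described above:

```python
# Reflection-pattern sketch: generate a draft, ask the same LLM to
# critique it, then feed the critique back to get a revised version.

def reflect_and_revise(task: str, call_llm, rounds: int = 1) -> str:
    """Generate code for a task, then critique and revise it `rounds` times."""
    draft = call_llm(f"Write code for this task: {task}")
    for _ in range(rounds):
        critique = call_llm(
            f"Here's code intended for this task: {task}\n\n{draft}\n\n"
            "Check the code carefully for correctness, style, and "
            "efficiency, and give constructive criticism."
        )
        draft = call_llm(
            f"Here's code for the task: {task}\n\n{draft}\n\n"
            f"A reviewer said:\n{critique}\n\nPlease revise the code."
        )
    return draft

# Usage with a stub that numbers each call, just to show the flow:
calls = []
def stub(prompt):
    calls.append(prompt)
    return f"output{len(calls)}"

result = reflect_and_revise("sort a list of numbers", stub)
print(result)  # → output3  (generate, critique, revise = three LLM calls)
```

One round costs three LLM calls instead of one, which is the usual trade-off with reflection: more latency and cost in exchange for a possibly better output.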
Now, I've drawn this as if it were a single LLM that I'm prompting, but to foreshadow multi-agent workflows, instead of having the same model critique itself, you can imagine having a critique agent. And all that is, is an LLM that's been prompted with instructions like: your role is to critique code; here's code intended for a task; check the code carefully; and so on. This second critique agent may point out errors or run unit tests. And by having two simulated agents, where each agent is just an LLM prompted to take on a certain persona, you can have them go back and forth to iterate toward a better output.

In addition to the reflection pattern, the second important design pattern is tool use. Today, LLMs can be given tools, meaning functions that they can call in order to get work done. For example, if you ask an LLM, what's the best coffee maker according to reviewers, and you give it a web search tool, then it can actually search the internet to find much better answers. Or consider a code execution tool: if you ask a math question like, if I invest $100 at compound interest, what do I have at the end, it can then write and execute code to compute an answer. Today, different developers have given LLMs many different tools for everything from math and data analysis, to gathering information by fetching things from the web or from various databases, to interfacing with productivity apps like email and calendar, as well as processing images and much more. And the ability of an LLM to decide what tools to use, meaning what functions to call, lets the model get a lot more done.

The third of the four design patterns is planning. This is an example from a paper called HuggingGPT. If you ask a system to please generate an image where a girl is reading a book, with her pose the same as the boy in an example image, and then to describe the new image with your voice, then a model can automatically decide that to carry out this task, it first needs a pose-detection model to figure out the pose of the boy, then a pose-to-image model to generate a picture of a girl in that pose, then an image-to-text model to describe the new image, and finally a text-to-speech model to read the description aloud. So in planning, an LLM decides what sequence of actions it needs to take; in this case, it is a sequence of API calls, so that it can carry out the right steps in the right order to complete the task. Rather than the developer hard-coding the sequence of steps in advance, this lets the LLM decide what steps to take. Agents that plan are today harder to control and somewhat more experimental, but sometimes they can give really delightful results.

And then finally, multi-agent workflows. Just as a human manager might hire a number of people to work together on a complex project, in some cases it might make sense for you to hire a set of multiple agents, maybe each of which specializes in a different role, and have them work together to accomplish a complex task. The picture you see here on the left is taken from a project called ChatDev, which is a software framework created by Chen Qian and collaborators. In ChatDev, multiple agents with different roles, like chief executive officer, programmer, tester, designer, and so on, collaborate as if they were a virtual software company, and can collaboratively complete a range of software development tasks. Let's consider another example.
If you want to write a marketing brochure, maybe you'd think of hiring a team of three people: a researcher to do online research, a marketer to write the marketing text, and finally an editor to edit and polish the text. In a similar way, you might consider building a multi-agent workflow in which you have a simulated researcher agent, a simulated marketer agent, and a simulated editor agent that come together to carry out this task for you. Multi-agent workflows are more difficult to control, since you don't always know ahead of time what the agents will do, but research has shown that they can result in better outcomes for many complex tasks, including things like writing biographies or deciding on chess moves to make in a game. You'll learn more about multi-agent workflows later in this course as well.

And so with that, I hope you have a sense of what agentic workflows can do, as well as of the key challenges of finding building blocks and putting them together, maybe via these design patterns, in order to implement an agentic workflow.
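As one concrete illustration of putting these building blocks together, the tool-use pattern can be sketched as a small dispatch step. The `web_search` stub and the keyword-based tool selection below are hypothetical stand-ins for a real search API and a real LLM tool-selection call:

```python
# Tool-use sketch: a set of callable tools, plus a selection step that
# decides which tool to invoke. In a real system the selection (and the
# tool argument) would come from the LLM's tool-calling output.

def web_search(query: str) -> str:
    # Hypothetical stand-in for a real search API call.
    return f"search results for: {query}"

def run_calculation(expression: str) -> float:
    # Code-execution tool restricted to simple arithmetic expressions.
    return eval(expression, {"__builtins__": {}}, {})

TOOLS = {"search": web_search, "calculate": run_calculation}

def answer(question: str):
    # Stand-in for the LLM deciding which function to call: math-looking
    # questions go to the calculator, everything else to web search.
    tool = "calculate" if any(c.isdigit() for c in question) else "search"
    arg = "100 * 1.05 ** 7" if tool == "calculate" else question
    return TOOLS[tool](arg)

print(answer("best coffee maker according to reviewers"))
print(round(answer("If I invest $100 at 5% compound interest, what do I have after 7 years?"), 2))  # → 140.71
```

The essential idea is only the dispatch table: the model chooses a function name and arguments, and your code executes the function and returns the result to the model.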
And of course, you'll also develop evals so you can see how well your system is doing and keep on improving it. In the next module, I'd like to share with you a deep dive into the first of these design patterns, reflection, and you'll find that it's a perhaps surprisingly simple-to-implement technique that can sometimes give the performance of your system a very nice bump. So let's go on to the next module to learn about the reflection design pattern.

The core of an agentic workflow is not one-shot generation; it is having the model imitate how humans work: outline, research, draft, reflect, and then revise.
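The researcher-marketer-editor example above can be sketched as a simple hand-off between three persona-prompted agents; `call_llm` is again a hypothetical stand-in for your LLM API:

```python
# Multi-agent sketch: each "agent" is just the same LLM prompted with a
# different persona, and the output of one agent feeds the next.

PERSONAS = {
    "researcher": "You do online research and summarize findings.",
    "marketer": "You write marketing text from research notes.",
    "editor": "You edit and polish marketing text.",
}

def agent(role: str, task: str, call_llm) -> str:
    """Run one persona-prompted agent on a task."""
    return call_llm(f"{PERSONAS[role]}\n\nTask: {task}")

def write_brochure(topic: str, call_llm) -> str:
    """Chain researcher -> marketer -> editor to produce a brochure."""
    notes = agent("researcher", f"Research {topic}.", call_llm)
    draft = agent("marketer", f"Write a brochure using: {notes}", call_llm)
    return agent("editor", f"Polish this draft: {draft}", call_llm)

# Usage with a stub that echoes each prompt's persona line, to show the hand-offs:
brochure = write_brochure("our new coffee maker", lambda p: p.splitlines()[0])
print(brochure)  # → You edit and polish marketing text.
```

This is the simplest form of multi-agent collaboration, a fixed pipeline; richer variants let agents converse back and forth, as in the ChatDev example.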
In the new project, this piece is mainly used to validate the presentation path for course-style content.
Questions this sample should answer
- Should a course homepage, like a person homepage, serve as an entry point for multiple groups?
- Are parallel directories like blog, guides, lecture, and lecnotes-zh suitable for unification under a single module concept?
- Should a single piece of course content reuse the same reading-page layout as person articles?
Current judgment
Yes, course content can most likely share the reading-page layout with person articles, with distinctions only in metadata and navigation:
- Person articles put more emphasis on provenance, language, and collection relationships.
- Course articles put more emphasis on course, module, and learning order.