Alt: A robotic agent types at a laptop in a dark room.
The invention of the Generative Pre-trained Transformer (GPT) is one of the past decade's most important advances in AI technology. The GPTs powering today's Large Language Models (LLMs) demonstrate a remarkable ability for reasoning, understanding, and planning. However, their true potential has yet to be fully realized.
At Reworkd, we believe that the true power of LLMs lies in agentic behavior. By engineering a system that draws on LLMs' emergent abilities and providing an ecosystem that supports environmental interactions, we can draw out the full potential of models like GPT-4. Here's how AgentGPT works.
Today, the main products shipping LLMs are chatbots powered by foundation models.
If you have any familiarity working with OpenAI's API, a common formula you might use for chatting with the model may include:
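A minimal sketch of that formula, assuming the legacy `openai` Python SDK (the model name and prompts are illustrative, and the API call itself is shown commented out):

```python
# Keep a running chat history and send the whole thing with every request.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def build_messages(history, user_message):
    """Append the new user turn and return the full payload we'd send."""
    history.append({"role": "user", "content": user_message})
    return history

messages = build_messages(history, "What is a context limit?")
# response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
# history.append(response["choices"][0]["message"])  # keep the reply for the next turn
```

Every new turn gets appended to `history`, so the payload grows with the conversation.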
This method works fine when the scope of conversations is small; however, as you continue adding new messages to the chat history, the size and complexity of completions balloons, and you will quickly run into a wall: the dreaded context limit.
A context limit is the maximum number of tokens (a token roughly corresponds to a word or word fragment) that can be input into the model for a single response. Context limits are necessary because the computational cost tends to increase quadratically as we add tokens. However, they are often the bane of prompt engineers.
One solution is to measure the number of tokens in the chat history before sending it to the model and removing old messages to ensure it fits the token limit. While this approach works, it ultimately reduces the amount of knowledge available to the assistant.
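A minimal sketch of that trimming approach. The word-count token estimate here is a rough stand-in; production code would use a real tokenizer such as tiktoken:

```python
def trim_history(messages, max_tokens,
                 count_tokens=lambda m: len(m["content"].split())):
    """Drop the oldest non-system messages until the history fits max_tokens.

    count_tokens is a crude whitespace estimate; swap in a real tokenizer
    (e.g. tiktoken) for accurate counts."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(count_tokens(m) for m in system + rest) > max_tokens:
        rest.pop(0)  # discard the oldest turn first
    return system + rest
```

Note that the system message is pinned: only conversational turns are evicted, which is exactly how the assistant "forgets" older context.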
Another issue that standalone LLMs face is the need for human guidance. Fundamentally, LLMs are next-word predictors, and their internal structure is often not inherently suited to higher-order thought processes, such as reasoning through complex tasks. This weakness doesn't mean they can't or don't reason; in fact, several studies show they can. However, it does mean they face certain impediments. For example, the LLM itself can create a logical list of steps, but it has no built-in mechanism for observing and reflecting on that list.
A pre-trained model is essentially a "black box" for the end user: the shipped product has little to no capability to actively update its knowledge base and tends to act in unpredictable ways. As a result, it's prone to hallucination.
Thus, it requires a lot of effort on the user's part to guide the model's output, and prompting the LLM itself becomes a job on its own. This extra work is a far cry from our vision of an AI-powered future.
By providing a platform to give LLMs agentic abilities, AgentGPT aims to overcome the limitations of standalone LLMs by leveraging prompt engineering techniques, vector databases, and API tooling. Here’s some interesting work that is being done with the agent concept:
Alt: A Twitter post by Dr. Jim Fan
In a general sense, agents are rational actors. They use thinking and reasoning to influence their environment. This could be in the form of solving problems or pursuing specific goals. They might interact with humans or utilize tools. Ultimately, we can apply this concept to LLMs to instill more intelligent and logical behavior.
In AgentGPT, large language models essentially function as the brain of each agent. As a result, we can produce powerful agents by cleverly manipulating the English language and engineering a framework that supports interoperability between LLM completions and a diverse set of APIs.
Reasoning and Planning. If you were to simply take a general goal, such as "build a scaling e-commerce platform," and give it to ChatGPT, you would likely get a response along the lines of "As an AI language model…." However, through prompt engineering, we can get a model to break down goals into digestible steps and reflect on them with a method called chain-of-thought prompting.
Memory. When dealing with memory, we divide the problem into short-term and long-term. In managing short-term memory, we can use prompting techniques such as few-shot prompting to steer LLM responses. However, cost and context limits make it tricky to generate completions without limiting the breadth of information a model can use to make decisions.
Similarly, this issue arises in long-term memory because it would be impractical to provide a corpus of writing large enough to bridge the gap between GPT-4's training cutoff (September 2021) and today. By using vector databases, we attempt to overcome this with specialized models for information retrieval in high-dimensional vector spaces.
Tools. Another challenge in using LLMs as general actors is their confinement to text outputs. Again, we can use prompt engineering techniques to solve this issue. We can generate predictable function calls from the LLM through few-shot and chain-of-thought methods, utilizing API tools like Google Search, Hugging Face, Dall-E, etc. In addition, we can use fine-tuned LLMs that only return responses in specialized formatting, like JSON. This is the approach OpenAI took when they recently released the function calling feature for their API.
These three concepts have formed the backbone of multiple successful agent-based LLM platforms such as Microsoft Jarvis, AutoGPT, BabyAGI, and of course, AgentGPT. With this brief overview in mind, let's dive deeper into each component.
Prompt engineering has become highly popularized, and it's only natural given its ability to increase the reliability of LLM responses, opening a wide avenue of potential applications for generative AI. AgentGPT's ability to think and reason is a result of novel prompting methods.
Prompt engineering is a largely empirical field that aims to find methods to steer LLM responses by finding clever ways to use the English language. You can think of it like lawyering, where every nuance in the wording of a prompt counts.
These are the main concepts and building blocks for more advanced prompting techniques:
AgentGPT uses an advanced form of chain-of-thought prompting called Plan-and-Solve to generate the steps you see when operating the agents.
Traditionally, chain-of-thought prompting used few-shot techniques to provide examples of a thinking and reasoning process. However, as becomes a recurring theme, this grows more costly as the complexity of a task increases because we need to provide more context.
Plan-and-Solve (PS): As a zero-shot method, it provides a prompting framework for LLM-guided reasoning using "trigger" phrases. These keywords elicit a reasoning response from the model.
We can expand on this concept by modifying the prompt to extract important variables and steps to generate a final response with a cohesive format. This method allows us to parse the final response and display it for the end user as well as feed sub-steps into future plan-and-solve prompts.
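A minimal sketch of that idea: a zero-shot prompt built around a Plan-and-Solve-style trigger phrase, plus a parser that extracts numbered steps from the completion so they can be displayed or fed into future prompts. The trigger wording and numbered-list format are illustrative choices, not a fixed standard:

```python
import re

PS_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve it. "
    "Then, let's carry out the plan step by step."
)

def plan_and_solve_prompt(goal):
    """Wrap a user goal in a zero-shot Plan-and-Solve style prompt."""
    return f"Goal: {goal}\n{PS_TRIGGER}\nReturn the plan as a numbered list."

def parse_steps(completion):
    """Pull '1. ...' style lines out of the model's reply."""
    return re.findall(r"^\s*\d+\.\s*(.+)$", completion, flags=re.MULTILINE)

sample = "1. Research competitors\n2. Choose a tech stack\n3. Build an MVP"
steps = parse_steps(sample)
# → ['Research competitors', 'Choose a tech stack', 'Build an MVP']
```

Each parsed sub-step can then be passed back into `plan_and_solve_prompt` to recurse on the plan.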
Alt: Picture of Plan & Solve
While PS prompting helps evoke a reasoning response, it still misses a fundamental piece of reasoning: proper handling of reflection and action. Reflection is fundamental for any agent because it must rationalize an action, perform that action, and use feedback to adjust future actions. Without it, the agent would be stateless and unchanging.
AgentGPT uses a prompting framework called Reasoning and Acting (ReAct) to expand on the capabilities of the Plan-and-Solve concept. ReAct aims to enable a framework for the model to access fresh knowledge through external knowledge bases and make observations of actions it has taken. Using those observations, the LLM can make educated decisions on the next set of steps to complete while performing actions to query knowledge bases such as Google Search or Wikipedia API.
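One ReAct cycle can be sketched as a Thought → Action → Observation loop appended to a running scratchpad. The `llm` and tool interfaces below are assumptions for illustration, not a specific library's API:

```python
def react_step(llm, scratchpad, tools):
    """Run one Thought -> Action -> Observation cycle of a ReAct-style loop.

    `llm` is any callable that returns a dict with the model's parsed
    thought, chosen action, and action input (a hypothetical interface).
    `tools` maps action names to callables, e.g. a search function."""
    reply = llm(scratchpad + "\nThought:")
    scratchpad += f"\nThought: {reply['thought']}"
    if reply["action"] in tools:
        observation = tools[reply["action"]](reply["action_input"])
        scratchpad += f"\nAction: {reply['action']}[{reply['action_input']}]"
        scratchpad += f"\nObservation: {observation}"
    return scratchpad
```

The growing scratchpad is fed back into the model each cycle, which is how the observation of one action informs the next decision.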
Prompt engineering is largely effective in resolving challenges in short-term memory as well as instilling the reasoning behavior that you can see when AgentGPT is at work. However, prompt engineering does not resolve the issue of long-term memory. This issue is where vector databases come in, and we will look at those next.
Alt : ReAct (Reason + Act) Logic Picture
The ReAct framework allows us to generate a reasoning response, an action, and a reflection to steer the model’s response. This example is courtesy of the following paper: ReAct: Synergizing Reasoning and Acting in Language Models.
While we have seen that prompt engineering is largely effective in resolving issues with short-term memory and reasoning, we cannot solve long-term memory solely through clever English. Since we are not allowed to update the model to learn our data, we must build an external system for storing and retrieving knowledge.
A clever solution might use an LLM to generate summaries of previous conversations as context for the prompt. However, there are three significant issues with this. First, we are diluting the relevant information for the conversation; second, it introduces another cost area by paying for API usage for those summaries; and third, it's unscalable.
Thus, prompts alone appear to be ineffective for long-term memory. Since long-term memory is fundamentally a problem of storing and efficiently retrieving information, and search is a deeply studied field, we can look towards vector databases.
Vector databases have been hyped for a while now, and the hype is well deserved. They are an efficient way of storing and retrieving vectors, letting us use some fun new algorithms to query billions, even trillions, of data records in milliseconds.
Let's start with a little bit of vocabulary:
Facebook AI Similarity Search (FAISS) gives us access to valuable tools for controlling these vectors and locating them efficiently in the vector space.
Since text is embedded numerically by a specific model (e.g., text-embedding-ada-002), each text occupies a location in space determined by the numbers that compose its vector. That means similar texts will be represented by vectors with similar numbers, so they will likely be grouped closely, while less similar texts will sit further apart. For example, texts about cooking will be closer to texts about food than to texts about physics.
There are several different algorithms for querying the vector space, but the most relevant to this discussion is the cosine similarity search. Cosine similarity measures the cosine of the angle between two non-zero vectors. It is a measure of orientation, meaning that it's used to determine how similar two documents (or whatever the vectors represent) are. Cosine similarity can range from -1 to 1, with -1 meaning the vectors are diametrically opposed (completely opposite), 0 meaning the vectors are orthogonal (or unrelated), and 1 meaning the vectors are identical.
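The definition above translates directly into a few lines of code. A minimal sketch using only the standard library (real systems compute this over high-dimensional embeddings, but the math is identical):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1, 0], [1, 0])   # identical direction -> 1.0
cosine_similarity([1, 0], [0, 1])   # orthogonal -> 0.0
cosine_similarity([1, 0], [-1, 0])  # diametrically opposed -> -1.0
```

Because the measure depends only on orientation, a long document and a short one about the same topic can still score close to 1.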
FAISS is helpful in managing these vector spaces, but it is not a database. Vector libraries lack CRUD operations (create, read, update, delete), which makes them unviable on their own for long-term memory, and that's where cloud services such as Pinecone and Weaviate step in.
Pinecone and Weaviate essentially do all the hard work of managing our vectors. They provide an API that lets you upload embeddings, perform various types of searches, and store those vectors for later, exposing the typical CRUD functions we need to instill memory into LLMs through easily accessible Python modules.
By using them, we can encode large amounts of information for future storage and retrieval. For instance, when the LLM needs extra knowledge to complete a task, we can prompt it to query the vector space to find relevant information. Thus, we can create long-term memory.
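To make the upsert/query shape concrete, here is a toy in-memory stand-in for what a vector database provides. This is a teaching sketch, not the Pinecone or Weaviate client API; those services add persistence, scale, and approximate-nearest-neighbor indexing on top of the same idea:

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class TinyVectorStore:
    """Toy in-memory vector store: upsert embeddings by id, query by cosine similarity."""

    def __init__(self):
        self._items = {}

    def upsert(self, item_id, vector, metadata=None):
        self._items[item_id] = (vector, metadata or {})

    def query(self, vector, top_k=1):
        ranked = sorted(self._items.items(),
                        key=lambda kv: _cosine(vector, kv[1][0]),
                        reverse=True)
        return [item_id for item_id, _ in ranked[:top_k]]
```

In an agent loop, the query results (and their metadata) are what get pasted back into the prompt as retrieved context.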
Alt : Robot With A Rose In Hand
While prompt engineering and vector databases resolve many of the limitations and challenges of LLMs, there is still the problem of agent interaction. How can we extend the capabilities of an LLM to interact with the environment outside of text?
APIs are the answer. By utilizing APIs, we can give our agents the ability to perform a wide range of actions and access external resources.
Here are a few examples:
Using API tools in combination with prompt engineering techniques, we can create prompts that generate predictable function calls and utilize the output of API requests to enhance the agent's capabilities. This enables agents to interact with the environment in a meaningful way beyond text-based interactions.
Again, we can achieve tooling through prompt engineering by representing the tool we want to provide for the model as a function. We can then tell the model that this function exists in a prompt, so our program can call it programmatically based on the model's response. First, however, we should examine the main challenges in implementing tool interactions: consistency, context, and format.
For example, responses tend to vary among chat completions that use the same prompt, so getting the LLM to issue a function call consistently is challenging. A minor solution may include adjusting the temperature of the model (a parameter that controls randomness), but the best solution leverages an LLM's reasoning abilities. Thus, we can use the ReAct framework to help the LLM understand when to issue function calls.
In doing this, we will still run into another major issue. How will the LLMs understand what tools are at their disposal? We could include the available tools in a prompt, but this could significantly increase the number of tokens we would need to send to the model. While this may be fine for an application that runs on a couple of tools, it will increase costs as we add more tools to the system. Thus, we would use vector databases to help the LLM look up relevant tools it needs.
Finally, we need to generate function calls in a predictable format. This format should include provisions for the name of the function and the parameters it takes, and it must include delimiters that allow us to parse and execute the response for those parameters programmatically. For instance, you can prompt the model to only return responses in JSON and then use built-in Python libraries to parse the stringified JSON.
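A minimal sketch of that last step: the JSON shape here (`"function"` / `"arguments"` keys) is a format we choose via prompting, not something the model guarantees on its own:

```python
import json

def parse_function_call(completion):
    """Parse a model reply we prompted to look like
    {"function": "google_search", "arguments": {"query": "..."}}.

    The key names are our own convention, enforced only by the prompt."""
    call = json.loads(completion)
    return call["function"], call["arguments"]

name, args = parse_function_call(
    '{"function": "google_search", "arguments": {"query": "vector databases"}}'
)
# name → 'google_search', args → {'query': 'vector databases'}
```

In practice you would wrap `json.loads` in error handling and re-prompt the model when it returns malformed JSON.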
Recently, this method became even easier to use. In late June, OpenAI released gpt-4-0613 and gpt-3.5-turbo-16k-0613 (whew, these names are getting long). They natively support function calls, using a model fine-tuned for JSON to return easy-to-use function calls. You can read more about it here.
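With the 0613 models, you describe each tool as a JSON Schema object and pass it via the `functions` parameter; the weather example below is illustrative, and the API call is shown commented out since it requires a key:

```python
# A function description in the JSON Schema shape the 0613 models accept.
functions = [{
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"},
        },
        "required": ["location"],
    },
}]
# response = openai.ChatCompletion.create(
#     model="gpt-3.5-turbo-0613", messages=messages, functions=functions)
# call = response["choices"][0]["message"].get("function_call")
# if call:
#     args = json.loads(call["arguments"])  # arguments arrive as a JSON string
```

The model decides for itself when a user message warrants a call to one of the described functions, which removes much of the prompt-engineering burden described above.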
Large language models have been one of the most significant advances of the past decade. Capable of reasoning and talking like a human, they appear to be able to do anything. Despite this, several engineering challenges arise in building around an LLM, such as context limits, reasoning, and long-term retention.
Using the methods described above, AgentGPT unlocks the full potential of powerful models such as GPT-4. We can give any model superpowers using novel prompting methods, efficient vector databases, and abundant API tools. It's only the start, and we hope you'll join us on this journey.
AgentGPT represents a powerful approach to building AI agents that reason, remember, and perform. By leveraging prompt engineering, vector databases, and API tools, we can overcome the limitations of standalone LLMs and create agents that demonstrate agentic behavior.
With the ability to reason, plan, and reflect, AgentGPT agents can tackle complex tasks and interact with the environment in a meaningful way. By incorporating long-term memory through vector databases and utilizing APIs, we provide agents with access to a vast pool of knowledge and resources.
AgentGPT is a step towards unlocking the full potential of LLMs and creating intelligent agents that can assist and collaborate with humans in various domains. The combination of language models, prompt engineering, external memory, and API interactions opens up exciting possibilities for AI agents in the future.
Are you interested in learning more about prompt engineering? We encourage you to explore the other informational posts on our site and the fantastic resources below, or, if you're interested in contributing, check out our GitHub repo.