OpenAI recently released two pieces of agent-related work: MLE-Bench, a benchmark for testing agents, and Swarm, a framework for multi-agent coordination. The announcements drew wide attention and discussion, returning the spotlight to agents that can analyze, plan, decide, and execute on their own.
In fact, AI applications have made substantial progress on agents this year, particularly as model function calling and agent frameworks have matured. Function calling is what lets an agent analyze a problem independently and then act on it: the model emits a structured call to an external tool, which allows the agent to reliably complete tasks such as sending emails, submitting documents, or comparing prices and placing an order.
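To make the idea concrete, here is a minimal sketch of the dispatch side of function calling. It assumes the model returns a JSON object with a `name` and an `arguments` field; the tool names (`send_email`, `compare_prices`), the `TOOLS` registry, and the `dispatch` helper are all hypothetical and do not correspond to any particular vendor's API.

```python
import json

def send_email(to: str, subject: str) -> str:
    """Hypothetical tool: pretend to queue an email and report what was done."""
    return f"email to {to} with subject {subject!r} queued"

def compare_prices(offers: dict) -> str:
    """Hypothetical tool: return the seller with the lowest price."""
    return min(offers, key=offers.get)

# Registry mapping tool names (as the model would emit them) to callables.
TOOLS = {"send_email": send_email, "compare_prices": compare_prices}

def dispatch(model_output: str):
    """Parse a model-emitted JSON call and run the matching tool."""
    call = json.loads(model_output)
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        # A call to a tool that does not exist is one form of hallucination.
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**args)

# Simulated model output; in a real agent the model would generate this string.
simulated = (
    '{"name": "compare_prices", '
    '"arguments": {"offers": {"shop_a": 19.9, "shop_b": 17.5}}}'
)
print(dispatch(simulated))
```

The agent's reliability then depends on how often the model picks the right tool name and supplies well-formed arguments, which is what function calling benchmarks try to measure.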
To measure this, the University of California, Berkeley introduced the Berkeley Function-Calling Leaderboard (BFCL) this year. It assesses models' function calling along several dimensions: Single Turn and Multi Turn, Non-Live and Live, AST and Exec evaluation summaries, hallucination measurement, model cost, and latency.
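As a rough illustration of what an AST-style check involves, the sketch below parses a model's function call as text and compares its name and arguments against a set of acceptable values, without executing anything. This is a simplified illustration under stated assumptions, not the leaderboard's actual checker; `parse_call`, `ast_match`, and the `get_weather` example are hypothetical.

```python
import ast

def parse_call(source: str):
    """Parse a string like 'f(a=1, b="x")' into (function_name, kwargs)."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("not a simple function call")
    if node.args:
        # This sketch only handles keyword arguments.
        raise ValueError("positional arguments are not supported")
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, kwargs

def ast_match(source: str, expected_name: str, allowed: dict) -> bool:
    """True if the call uses the expected function and exactly the expected
    parameters, with every value among the allowed values for that parameter."""
    try:
        name, kwargs = parse_call(source)
    except (SyntaxError, ValueError):
        return False
    if name != expected_name or set(kwargs) != set(allowed):
        return False
    return all(kwargs[key] in allowed[key] for key in allowed)

# Acceptable values per parameter for one hypothetical test case.
allowed = {"city": ["Berkeley"], "unit": ["celsius", "C"]}
print(ast_match('get_weather(city="Berkeley", unit="C")', "get_weather", allowed))
```

An Exec-style check would instead run the generated call and compare the result, which catches errors that a structural comparison cannot.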
The leaderboard is demanding: OpenAI's GPT-4 series, Anthropic's Claude-3.5 series, and Google's Gemini-1.5 series score only in the mid-50s at best.
However, the American AI company Writer recently announced that its newly released Palmyra X 004 model achieved a high score of 78. Writer has significantly improved the new model in four areas: calling external databases and applications and taking actions; automatically acquiring SKU data and integrating it with the built-in RAG; code generation and deployment; and structured output and execution (including emails, CRM, XML, logs, etc.). Together these markedly strengthen its function calling ability.
Although this preliminary result has not yet officially appeared on the BFCL leaderboard, it suggests that improving function calling further depends not only on the model itself but also on a deeper understanding of real application development and business scenarios.
At the same time, automation frameworks for agents have seen some early adoption, mainly as tool frameworks and coordination processes that help models understand their environment, plan and reason, and execute tasks. For instance, to evaluate agents on machine learning engineering tasks in MLE-Bench, OpenAI focused on three frameworks: AIDE, developed by WecoAI; MLAB, proposed in the MLAgentBench project; and OpenHands, developed by multiple institutions.
As function calling capabilities and agent automation frameworks have advanced, a range of specialized agent companies has emerged in recent years.
Felicis Ventures, an established investment firm that has backed many AI companies, recently surveyed agents across vertical fields and functions and identified representative companies in each.
For example, in the customer service field, there is Sierra; in the sales field, there is 11x; in the marketing field, there is Jasper; in the recruitment field, there is Mercor; in the legal field, there is Harvey; in the operations field, there is Brevian; in the compliance field, there is Norm AI; in the tax field, there is taxgpt; and in the real estate field, there is reAlpha.
In practice, there are many more AI agents in these and other industries, and the field is flourishing in many directions at once. In this wave of AI, applications will not be limited to chatbots; agents may prove to be a more suitable product form and payment model.