Program Task Description
Category 1: Preventing Accidental Misuse
-
How will hallucinations and bias in LLM agents differ from typical LLMs? What novel mitigations do we need to develop?
- Reduce the amount of wasteful or harmful API calls in tool-based LLM agents through either software-based or ML-based solutions
- [2305.15334] Gorilla: Large Language Model Connected with Massive APIs
-
What are some unintended consequences of deploying LLM agents?
- Demonstrate a novel failure mode of LLM agents
- [2402.06627] Feedback Loops With Language Models Drive In-Context Reward Hacking
- [2402.06664] LLM Agents can Autonomously Hack Websites
Category 2: Preventing Malicious Use
-
How do we attack and defend LLM agents? Will this paradigm differ from standard jailbreaking of LLMs?
- Create novel prompt injection attacks or demonstrate a novel defense for LLM agents
- [2311.01011] Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
- [2402.06363] StruQ: Defending Against Prompt Injection with Structured Queries
-
What privacy challenges will LLM agents face? How can we mitigate these?
- Design a novel differentially private training method to train agents on user trajectories while hiding PII
- [2401.05459] Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
- [2407.19354] The Emerged Security and Privacy of LLM Agent: A Survey with Case Studies
-
How do we prevent LLM agents from being directed to create bioweapons or cyberweapons?
- Analyze the difficult of creating a novel cyberweapon with an LLM
- Building an early warning system for LLM-aided biological threat creation | OpenAI
- [2403.03218] The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Category 3: Steering, Controlling, and Interpreting Agents
-
How do we better specify user objectives and safety constraints in LLM agents to eliminate unintended consequences?
- Develop techniques that encourage LLMs to follow hard constraints
- [2311.04235] Can LLMs Follow Simple Rules?
- [2402.10962] Measuring and Controlling Instruction (In)Stability in Language Model Dialogs
- [2404.13208] The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
-
How do we give users insight into agent behaviors and steer them more reliably?
- Create a visualization of how LLM objectives evolves as they solve tasks
- [2305.02469] The System Model and the User Model: Exploring AI Dashboard Design
- [2309.08600] Sparse Autoencoders Find Highly Interpretable Features in Language Models
- [2310.01405] Representation Engineering: A Top-Down Approach to AI Transparency
- [2403.10949] SelfIE: Self-Interpretation of Large Language Model Embeddings
Category 4: Auditing and Accountability
-
How do we better evaluate and audit LLM agents?
- Develop methods to automatically red-team agents
- [2309.15817] Identifying the Risks of LM Agents with an LM-Emulated Sandbox
-
How do we monitor and hold LLM agents accountable for their actions?
- Create techniques that identify dangerous actions in AI agent trajectories
- [2401.13138] Visibility into AI Agents
Category 5: Multi-Agent Safety and Security
-
What novel failure modes occur in multi-agent systems?
- Analyze collusion between agents or correlated failures among agents
- [2308.01404] Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models
- [2402.07510] Secret Collusion Among Generative AI Agents
- [2404.00806] Algorithmic Collusion by Large Language Models
Category 6: Environmental and Societal Impact of Agents
-
What will be the environmental cost of agents?
- Design a project that estimates the carbon emissions cost of deploying an agent on a task
- [2111.00364] Sustainable AI: Environmental Implications, Challenges and Opportunities
-
Are agents fair? How might their deployment impact societal values or influence economic incentives? How should we govern them?
- Study how LLM agents may affect a particular sector of society, e.g., how will LLMs lead to content homogenization
- What happens when ChatGPT starts to feed on its own writing?
- [2407.14981] Open Problems in Technical AI Governance