TL;DR
- Context: Semester-long research mentorship at IIT Kanpur studying multi-agent behavior emergence through RL and game theory
- Problem: Students needed guidance translating abstract behavioral economics concepts into executable computational models
- Intervention: Structured problem formulation around reward shaping, population dynamics, and Q-learning implementation
- Impact: Demonstrated emergence of Tit-for-Tat strategies and documented how cooperation declines without punishment mechanisms
Intro
This project was conducted as a semester-long research mentorship under the Brain and Cognitive Society at IIT Kanpur. The research question combined neuroeconomics, evolutionary game theory, and reinforcement learning: how do micro-level agent decisions lead to macro-level behavioral patterns in competitive environments? My role was to mentor a student team through problem formulation, modeling choices, and experimental design, with emphasis on translating abstract behavioral theory into executable simulations.
Problem
- Abstract behavioral concepts (cooperation, altruism, selfishness) needed computational operationalization
- Students risked building RL systems with degenerate reward functions producing meaningless results
- Challenge of designing rewards that reflect population-level evolution, not just individual gain
- Required experimental design producing interpretable results aligned with evolutionary game theory
Intervention
- Guided formalization of behavioral economics into multi-agent simulation with resource constraints
- Designed reward shaping based on population-level advantage rather than individual interaction outcomes
- Advised on Q-learning implementation using global state-strategy representation
- Structured experiments mapping individual incentives to emergent collective behavior
- Ensured theoretical consistency with evolutionary game theory and neuroeconomics literature
Impact
- Learning agents converged to hybrid strategies blending cooperation and retaliation based on population composition
- Demonstrated Tit-for-Tat-like behaviors emerged as robust equilibrium under repeated interactions
- Showed that, without punishment mechanisms, pure cooperators decline gradually, producing long-term instability
- Team produced publication-quality results with clear mapping from micro-decisions to macro-behavior
Why This Matters
Multi-agent systems are notoriously difficult to formalize correctly. The gap between theoretical models and working implementations often produces degenerate solutions or artifacts. Mentorship focusing on reward shaping, state representation, and experimental validity ensures research produces meaningful insights rather than implementation bugs masquerading as discoveries.
Technical Deep Dive (Optional)
This section expands on the multi-agent framework, RL formulation, and research findings for readers who want technical depth.
Research Questions
Primary Question: How do micro-level agent decisions lead to macro-level behavioral patterns in multi-agent competitive environments?
Sub-Questions
- How do cooperation, selfishness, and altruism emerge naturally from simple rules?
- How does the absence of punishment mechanisms affect long-term cooperation sustainability?
- Can RL agents learn stable strategies when interacting with fixed-strategy agents?
- What population dynamics emerge from different strategy mixtures?
Environment Design
Multi-Agent Simulation
- Agents repeatedly interact under constrained resources (food-sharing scenario)
- Each interaction outcome influenced by:
  - Agent's chosen strategy
  - Opponent's strategy
  - Interaction history
  - Current population composition
Resource Constraints
- Limited food creates competitive pressure
- Sharing vs hoarding decisions affect survival
- Population-level effects create evolutionary pressure
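The food-sharing interaction described above can be sketched as a simple payoff table. This is a minimal illustration, not the project's actual implementation; the payoff values and the `share`/`hoard` action names are assumptions chosen to reflect the competitive pressure described (sharing is exploitable, mutual hoarding wastes contested food).

```python
# Hypothetical payoffs for one food-sharing interaction.
# Values are illustrative assumptions, not the project's actual numbers.
SHARE, HOARD = "share", "hoard"

PAYOFFS = {
    (SHARE, SHARE): (3, 3),   # mutual sharing: both eat moderately
    (SHARE, HOARD): (0, 5),   # sharer is exploited by the hoarder
    (HOARD, SHARE): (5, 0),
    (HOARD, HOARD): (1, 1),   # mutual hoarding wastes contested food
}

def interact(action_a, action_b):
    """Return (food_a, food_b) for a single pairwise interaction."""
    return PAYOFFS[(action_a, action_b)]
```

With payoffs of this shape, defection dominates any single interaction while mutual cooperation maximizes joint food, which is exactly the tension that makes repeated play and population effects interesting.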
Strategy Space
Fixed Strategies
- Always Cooperate (AC): Unconditional sharing/cooperation
- Tit-for-Tat (TFT): History-based reciprocity (copy opponent's last move)
- Alternating Cooperate (ALT): Oscillation between cooperation and competition
- Always Defect (AD): Never cooperates; always takes advantage
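The four fixed strategies above are simple enough to express as one-line decision rules. The sketch below is illustrative; it assumes each strategy sees the opponent's previous action (`None` on the first round) and the current round index, and returns `"share"` or `"hoard"`.

```python
# Illustrative decision rules for the four fixed strategies.
# opp_last: opponent's previous action (None on the first round)
# t: current round index

def always_cooperate(opp_last, t):
    return "share"

def always_defect(opp_last, t):
    return "hoard"

def tit_for_tat(opp_last, t):
    # Cooperate first, then mirror the opponent's last move.
    return "share" if opp_last is None else opp_last

def alternating(opp_last, t):
    # Oscillate between cooperating and competing each round.
    return "share" if t % 2 == 0 else "hoard"
```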
Adaptive Strategy
- Learning Agent: Q-learning-based adaptive policy
  - Learns an optimal response strategy based on population dynamics
  - Can develop hybrid strategies not present among the fixed strategies
Reinforcement Learning Formulation
State Representation
- Global Q-table mapping:
  - State: Strategy type of the interacting agent
  - Action: Chosen response strategy
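A global Q-table keyed by opponent strategy type can be sketched as below. This is a minimal illustration under stated assumptions: the hyperparameters (alpha, gamma, epsilon) and the two-action space are placeholders, not values from the project.

```python
import random
from collections import defaultdict

# Sketch of a global state-strategy Q-table:
# state = opponent's strategy type, action = the learner's response.
# Hyperparameters below are illustrative assumptions.
ACTIONS = ["share", "hoard"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = defaultdict(float)  # keys: (opponent_strategy, action)

def choose_action(opp_strategy):
    """Epsilon-greedy selection against a known opponent strategy type."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(opp_strategy, a)])

def update(opp_strategy, action, reward, next_opp_strategy):
    """Standard one-step Q-learning update."""
    best_next = max(Q[(next_opp_strategy, a)] for a in ACTIONS)
    Q[(opp_strategy, action)] += ALPHA * (
        reward + GAMMA * best_next - Q[(opp_strategy, action)]
    )
```

Keying the state on opponent strategy type (rather than raw interaction history) keeps the table tiny and makes the learned policy directly interpretable as "how to respond to each kind of agent."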
Critical Design Decision: Reward Function
Individual-Level Rewards (Rejected Approach)
- Reward based on immediate interaction outcome
- Problem: Produces degenerate strategies exploiting interaction mechanics
- Doesn't capture evolutionary pressure
Population-Level Rewards (Chosen Approach)
- Rewards based on relative population growth/decline of strategy
- Reflects evolutionary advantage over time
- Simulates natural selection at population level
Why This Mattered
- Aligns learning with evolutionary game theory predictions
- Produces meaningful strategies rather than implementation artifacts
- Enables comparison with theoretical literature
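The population-level reward can be sketched as the change in a strategy's population share between generations, rather than the payoff of any single interaction. The function below is illustrative; the example counts are assumptions, not project data.

```python
# Sketch of the population-level reward: the learner is rewarded by the
# change in its strategy's share of the population between generations,
# not by individual interaction payoffs.
def population_reward(counts_before, counts_after, strategy):
    """Reward = change in the strategy's population share."""
    share_before = counts_before[strategy] / sum(counts_before.values())
    share_after = counts_after[strategy] / sum(counts_after.values())
    return share_after - share_before
```

Because the reward tracks relative growth, a strategy that wins individual interactions but collapses its own niche (as Always Defect does once cooperators die out) is penalized, which is the evolutionary pressure the individual-level reward failed to capture.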
Key Findings
1. Cooperation Dynamics
Always Cooperate Agents
- Declined gradually in absence of punishment mechanisms
- Exploited by Always Defect agents
- Population share decreased over time
Always Defect Agents
- Dominated in short-term interactions
- Destabilized environment long-term
- Eventually suffered from lack of cooperators
Tit-for-Tat Strategies
- Emerged as stable equilibrium under repeated interactions
- Balanced cooperation with retaliation
- Most robust strategy across varying population compositions
2. Learning Agent Behavior
Convergence Pattern
- Converged to hybrid strategies
- Blended cooperation and retaliation dynamically
- Adapted policy as population dynamics shifted
Adaptive Response
- Learned to cooperate with cooperators
- Learned to retaliate against defectors
- Adjusted strategy mix based on population feedback
3. Non-Linear Parameter Effects
Sensitivity Analysis
- Small changes in agent count, interaction frequency, or initial population distribution produced large behavioral shifts
Implication: Multi-agent systems exhibit complex, non-linear dynamics requiring careful experimental design
Mentorship Approach
Problem Formalization
- Guided translation of behavioral concepts into mathematical models
- Ensured computable representations maintained theoretical validity
- Helped define clear state spaces and action spaces
Reward Shaping Guidance
- Critical decision: population-level vs individual-level rewards
- Explained common RL pitfalls (reward hacking, degenerate solutions)
- Validated reward functions against expected theoretical behavior
Experimental Design
- Structured parameter sweeps for systematic exploration
- Designed visualizations producing interpretable results
- Ensured metrics answered research questions directly
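A structured parameter sweep over the sensitive knobs identified in the sensitivity analysis can be organized as a simple grid. The sketch below is hypothetical: `run_simulation` is a placeholder for the team's actual entry point, and the parameter values are illustrative assumptions.

```python
import itertools

# Illustrative sweep over the three parameters the sensitivity
# analysis flagged as influential. Values are placeholders.
agent_counts = [20, 50, 100]
interaction_freqs = [1, 5, 10]
initial_mixes = [
    {"AC": 0.4, "AD": 0.4, "TFT": 0.2},
    {"AC": 0.25, "AD": 0.25, "TFT": 0.5},
]

def run_simulation(n_agents, freq, mix):
    # Placeholder for the actual simulation entry point; would return
    # final population shares for one configuration.
    return {"n_agents": n_agents, "freq": freq, "mix": mix}

results = [
    run_simulation(n, f, m)
    for n, f, m in itertools.product(agent_counts, interaction_freqs, initial_mixes)
]
```

Enumerating the full cross-product makes non-linear interactions between parameters visible, rather than varying one knob at a time and missing them.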
Theoretical Validation
- Connected simulation results to evolutionary game theory literature
- Validated findings against established behavioral economics research
- Ensured claims were defensible and properly scoped
Scientific Contributions
Demonstrated
- Clear mapping from individual incentives to emergent collective behavior
- Applicability of RL to evolutionary game theory questions
- How absence of punishment affects cooperation sustainability
Produced
- Publication-quality experimental results
- Interpretable visualizations of population dynamics
- Reproducible codebase for future research
Student Development Outcomes
Through this mentorship, the team:
- Gained experience in RL formulation for complex systems
- Learned rigorous experimental design for multi-agent research
- Developed skills translating abstract theory to implementation
- Understood importance of reward shaping in RL systems
Consulting Relevance
This mentorship demonstrates ability to:
- Translate theoretical research into working ML systems
- Design multi-agent simulations with meaningful metrics
- Guide teams through RL formulation pitfalls
- Mentor engineers at intersection of ML, economics, and complex systems
Directly applicable to startups building:
- Market simulations and dynamic pricing systems
- Agent-based economic models
- Multi-stakeholder optimization systems
- Adaptive decision engines with competing objectives
Completed as semester-long research mentorship project at IIT Kanpur's Brain and Cognitive Society, 2021.