
Supervised Learning + Reinforcement Learning

Training and Deploying Neural Connect 4 Agents

A two-phase ML project: first training CNN and Transformer networks on MCTS self-play data (81.7% Top-1 accuracy), then extending into reinforcement learning with Policy Gradient, DQN, and SAC agents — with a DQN reaching 76.2% win rate and the team advancing to the class tournament championship.

GitHub — Supervised Models
GitHub — RL Arena
Phase 1 — Supervised
Game AI, MCTS supervision, CNN/Transformer architecture comparison, cloud deployment
Phase 2 — Reinforcement Learning
Policy Gradient, Dueling Double DQN, Soft Actor-Critic, curriculum self-play
Training Stack
Python, TensorFlow/Keras, NumPy, MCTS, Google Colab
Deployment
AWS Lightsail, Docker, Anvil (Python web app)

System Architecture

This project was architected as a production-style ML pipeline rather than a standalone notebook. Clear separation of responsibilities mirrors real-world deployment patterns.

MCTS Self-Play → Dataset Generation → Model Training (Colab GPU)
→ Model Serialization (.h5)
→ Dockerized Inference API (AWS Lightsail)
→ Anvil Frontend (Authenticated UI)
→ Human vs Bot Gameplay
Layer | Responsibility
Training | Model development and experimentation
AWS Backend | Stateless inference only
Docker | Environment reproducibility
Anvil Frontend | Authentication, UI, state management

Data Generation

Rather than hand-labeling positions, MCTS was used as a high-quality move generator. The pipeline ran 1,500 self-play games with 1,200 rollouts per move over ~15 hours. Randomized early moves increased diversity; duplicate board states were consolidated via majority vote.
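The consolidation step is easy to sketch. Below is a minimal illustration of the majority-vote dedup, assuming boards are stored as 6×7 NumPy int8 arrays and labels are the MCTS-chosen columns; function and variable names are illustrative, not the project's actual code.

```python
from collections import Counter, defaultdict

import numpy as np

def consolidate(boards, moves):
    """Majority-vote consolidation of duplicate board states.

    boards: iterable of 6x7 np.int8 arrays; moves: MCTS-chosen columns (0-6).
    """
    votes = defaultdict(Counter)
    for board, move in zip(boards, moves):
        votes[board.tobytes()][move] += 1            # bytes key makes the position hashable
    unique, labels = [], []
    for key, counter in votes.items():
        unique.append(np.frombuffer(key, dtype=np.int8).reshape(6, 7))
        labels.append(counter.most_common(1)[0][0])  # the majority column wins
    return np.stack(unique), np.array(labels)
```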

1,500
Self-play games
1,200
Rollouts per move
~40k
Board states
15h
Compute time

Model Results

Metric | CNN | Hybrid Transformer
Top-1 Accuracy | 78.3% | 81.7%
Top-2 Accuracy | 92.4% | 94.1%
Inference Time | 1.2 ms | 3.8 ms
Training Time | ~20 min | ~2 hours
Best Use Case | Real-time, lightweight deployment | Maximum move prediction accuracy

The hybrid model improves top-1 accuracy by 3.4 points. The CNN trades a small accuracy drop for 3x faster inference and 6x faster training, making it ideal for real-time deployment.

Architecture Journey

CNN: Iterative Refinement

Initial models exposed classic failure modes: shallow CNN underfit (~60%), deep CNN overfit, heavy regularization caused capacity collapse. The final CNN balanced depth and generalization with progressive convolution blocks (32 → 256 filters), batch normalization, dropout scheduling, and Global Average Pooling.
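As a rough illustration of that final shape, here is a minimal Keras sketch with progressive 32 → 256 filter blocks, batch normalization, a rising dropout schedule, and Global Average Pooling. The exact layer counts, dropout rates, and board encoding (two input planes, one per player) are assumptions, not the trained configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(num_moves: int = 7) -> tf.keras.Model:
    """Progressive-filter CNN sketch: 32 -> 256 filters with BN,
    scheduled dropout, and GAP instead of a flatten layer."""
    inputs = tf.keras.Input(shape=(6, 7, 2))  # one plane per player (assumed encoding)
    x = inputs
    for filters, drop in [(32, 0.1), (64, 0.2), (128, 0.3), (256, 0.4)]:
        x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Dropout(drop)(x)                # dropout rises with depth
    x = layers.GlobalAveragePooling2D()(x)         # GAP curbs overfitting vs. flatten
    outputs = layers.Dense(num_moves, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```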

Pure Transformer

Performance plateaued at 46–55% accuracy. The 6×7 board is too small for effective token diversity, and transformers lack inductive spatial bias. Conclusion: transformers without feature extraction underperform on compact spatial domains.

Hybrid CNN–Transformer

A CNN feature extractor compresses the board to a 3×3 grid of spatial tokens, which then feed a 4-layer Transformer encoder and a dense classification head. This combined spatial priors with global attention for the best overall accuracy.
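A hedged Keras sketch of that topology follows; head counts, widths, and the particular strides used to reach the 3×3 token grid are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_hybrid(num_moves: int = 7, d_model: int = 64) -> tf.keras.Model:
    """CNN-Transformer hybrid sketch: convolutions compress 6x7 -> 3x3
    tokens, then 4 self-attention encoder blocks attend globally."""
    inputs = tf.keras.Input(shape=(6, 7, 2))
    x = layers.Conv2D(d_model, 3, strides=2, padding="same", activation="relu")(inputs)  # -> 3x4
    x = layers.Conv2D(d_model, (1, 2), activation="relu")(x)                             # -> 3x3
    tokens = layers.Reshape((9, d_model))(x)       # nine spatial tokens
    for _ in range(4):                             # 4-layer Transformer encoder
        attn = layers.MultiHeadAttention(num_heads=4, key_dim=d_model // 4)(tokens, tokens)
        tokens = layers.LayerNormalization()(tokens + attn)
        ff = layers.Dense(4 * d_model, activation="relu")(tokens)
        ff = layers.Dense(d_model)(ff)
        tokens = layers.LayerNormalization()(tokens + ff)
    x = layers.GlobalAveragePooling1D()(tokens)
    outputs = layers.Dense(num_moves, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```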

Tactical Error Analysis

High Confidence (>95%)

  • Immediate wins: 98%+ accuracy
  • Forced defensive blocks
  • Opening central control

Failure Modes (<65%)

  • Multi-move traps
  • Dense endgames
  • Zugzwang states

Neural networks excel at pattern recognition but cannot simulate future branches. This reflects the historical evolution from supervised AlphaGo to AlphaZero-style policy + search hybrids.

Deployment

AWS Lightsail Backend

  • Dockerized TensorFlow inference service
  • Model loaded at startup, stateless requests
  • Returns probability distribution over 7 moves
  • Deterministic, low-latency inference (<4 ms; a minimal endpoint sketch follows below)
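A minimal sketch of such a stateless endpoint, assuming Flask; the production service's actual framework, route name, and payload schema are not shown in this writeup.

```python
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)
MODELS = {name: tf.keras.models.load_model(f"{name}.h5")   # loaded once at startup
          for name in ("cnn", "hybrid")}

@app.post("/predict")
def predict():
    payload = request.get_json()
    board = np.asarray(payload["board"], dtype=np.float32)[None, ...]   # add batch dim
    probs = MODELS[payload.get("model", "cnn")].predict(board, verbose=0)[0]
    return jsonify({"move_probabilities": probs.tolist()})  # distribution over 7 columns
```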

Anvil Frontend

  • Authenticated UI (email/password, no auto signup)
  • Model selector (CNN vs Transformer)
  • Real-time human vs bot gameplay (client-side call sketched below)
  • Tab navigation: Play Game | Training Description
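On the client side, Anvil can reach the backend over plain HTTP. A sketch assuming anvil.http.request and the hypothetical /predict endpoint above; the real app's wiring may differ.

```python
import anvil.http

API_URL = "https://<lightsail-host>/predict"   # hypothetical endpoint URL

def get_bot_move(board, model_name):
    # POST the board to the inference API and play the most probable legal column.
    resp = anvil.http.request(API_URL, method="POST", json=True,
                              data={"board": board, "model": model_name})
    probs = resp["move_probabilities"]
    legal = [c for c in range(7) if board[0][c] == 0]  # column open if top cell empty
    return max(legal, key=lambda c: probs[c])
```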

Anvil UI

The frontend was designed to resemble a polished consumer product: gradient background, rounded board container, elevated shadows, distinct yellow/red piece styling, animated feedback messages.

Connect 4 Anvil game interface

Game interface with model selector and board

Connect 4 Anvil gameplay

Interactive gameplay view

Model Performance Visualization

Connect 4 high performance scenarios

High-confidence board states where models excel

Connect 4 challenging board scenarios

Challenging scenarios (multi-move traps, endgames)

Engineering Tradeoffs

Dimension | CNN | Hybrid Transformer
Accuracy | Strong | Best
Latency | Excellent | Moderate
Training Cost | Low | High
Implementation Complexity | Moderate | High

Both models were deployed to allow direct comparison in the live Anvil app.

Key Takeaways

  • Inductive bias matters more than model novelty; small spatial grids favor CNNs
  • Transformers benefit from hybridization with spatial feature extractors
  • Supervised imitation from MCTS labels has inherent planning limits; neural-guided search (AlphaZero-style) is the natural next step
  • Production deployment adds non-trivial engineering overhead: containerization, cloud hosting, authenticated UI

Phase 2: Reinforcement Learning Extension

The supervised models served as the starting point for a full reinforcement learning project (Optimization II, UT Austin MSBA). Starting from six supervised networks as baselines, we implemented and compared three RL paradigms through self-play against a growing opponent pool.

Each agent was trained via self-play against a pool seeded with the supervised baselines plus frozen snapshots of the improving agent — ensuring the opponent grew stronger alongside the learner.
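In outline, the opponent-pool logic looks like the sketch below; class and method names (including freeze()) are hypothetical stand-ins for the actual training code.

```python
import random

class OpponentPool:
    """Growing opponent pool: seeded with the supervised baselines,
    periodically extended with frozen snapshots of the learner."""

    def __init__(self, baselines):
        self.pool = list(baselines)

    def add_snapshot(self, agent):
        self.pool.append(agent.freeze())   # hypothetical: copy with weights frozen

    def sample_opponent(self, current_agent, self_play_fraction):
        # With probability `self_play_fraction`, play against the live learner
        # (pure self-play); otherwise draw uniformly from the pool.
        if random.random() < self_play_fraction:
            return current_agent
        return random.choice(self.pool)
```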

Policy Gradient

Directly optimized a column-distribution policy network using discounted-return-weighted log-probability gradients (REINFORCE). Trained over 500 groups of 20 games each, with stochastic move sampling for exploration and an entropy coefficient of 0.03 to balance policy sharpening against continued exploration.

60.8% mean win rate across shared baselines
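The core REINFORCE update is compact. Below is a sketch of the loss under the stated entropy coefficient, assuming a logits-producing policy network and per-move discounted returns; the project's actual batching over game groups differs.

```python
import tensorflow as tf

ENTROPY_COEF = 0.03  # entropy coefficient from the training setup

def reinforce_loss(logits, actions, returns):
    """Return-weighted log-probability loss with an entropy bonus.

    logits: (batch, 7) policy outputs; actions: (batch,) chosen columns;
    returns: (batch,) discounted returns credited to those moves.
    """
    log_probs = tf.nn.log_softmax(logits)
    chosen = tf.gather(log_probs, actions, batch_dims=1)   # log pi(a|s)
    policy_loss = -tf.reduce_mean(returns * chosen)        # maximize return-weighted log-prob
    probs = tf.nn.softmax(logits)
    entropy = -tf.reduce_sum(probs * log_probs, axis=-1)   # per-state policy entropy
    return policy_loss - ENTROPY_COEF * tf.reduce_mean(entropy)
```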

Dueling Double DQN (v4)

Learned a state-action value function Q(s,a) with a pre-allocated circular replay buffer, target-network synchronization, and curriculum self-play (20–50% self-play fraction). DQN v4 resumed from v3's weights with curriculum self-play added, recovering the vs-random win rate that had degraded in v3 while improving the overall average.

76.2%
Mean win rate
82.0%
vs. random
+15.4pp
over PG baseline
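Two pieces of this agent are worth sketching: the dueling aggregation Q(s,a) = V(s) + A(s,a) − mean_a A(s,a), and the double-DQN bootstrap target in which the online network selects the action and the target network evaluates it. The discount factor below is an assumption, not a reported hyperparameter.

```python
import tensorflow as tf

GAMMA = 0.99  # assumed discount factor (not stated in the writeup)

def dueling_q(value, advantage):
    """Dueling head: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""
    return value + advantage - tf.reduce_mean(advantage, axis=-1, keepdims=True)

def double_dqn_targets(online_net, target_net, next_states, rewards, dones):
    """Double-DQN bootstrap: the online net picks the action, the
    periodically synchronized target net scores it."""
    best_actions = tf.argmax(online_net(next_states), axis=-1)
    next_q = tf.gather(target_net(next_states), best_actions, batch_dims=1)
    return rewards + GAMMA * (1.0 - dones) * next_q
```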

Soft Actor-Critic + MCTS Tournament Submission

Fused actor-critic learning with Monte Carlo Tree Search at inference time. The SAC policy network guided MCTS rollouts, combining learned value estimates with tree search lookahead — the same paradigm as AlphaZero. Trained over ~820 groups of 128 games each with 16 gradient updates per group, then paired with MCTS for the final tournament.

75.5% standalone win rate — significantly stronger with MCTS at inference
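At inference time the learned policy's priors bias tree selection, AlphaZero-style. A minimal PUCT-selection sketch, with assumed node fields and exploration constant:

```python
import math

C_PUCT = 1.5  # exploration constant; an assumed value

def puct_select(node):
    """Pick the child maximizing Q + U, where the SAC policy's prior P(s,a)
    steers exploration. Node fields (children, visits, value_sum, prior)
    are assumed names, not the tournament agent's exact structures."""
    total_visits = sum(child.visits for child in node.children)

    def score(child):
        q = child.value_sum / child.visits if child.visits else 0.0
        u = C_PUCT * child.prior * math.sqrt(total_visits + 1) / (1 + child.visits)
        return q + u

    return max(node.children, key=score)
```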
Policy Gradient training curves

Policy Gradient — loss and win rate over 500 training groups

DQN v4 training curves

DQN v4 — episode win rate, Huber loss, and curriculum self-play fraction

Agent Comparison

Agent | vs. Random | vs. 1-ply | Mean
Policy Gradient | — | — | 60.8%
DQN v2 | 81.5% | 69.5% | 75.5%
DQN v3 | 73.0% | 72.0% | 72.5%
DQN v4 | 82.0% | 70.5% | 76.2%
SAC + MCTS ★ (Tournament) | 81.5% | 69.5% | 75.5% + search

Class Tournament

Advanced to the Championship

The SAC agent paired with MCTS at inference was submitted to the end-of-semester class tournament against all other groups' trained agents. The team advanced to the championship round — the strongest result in the class.

Outcome

Across both phases, this project spans MCTS-supervised training, architecture experimentation, cloud deployment, and full reinforcement learning — from policy gradient through DQN curriculum self-play to Soft Actor-Critic. The supervised phase achieved 81.7% Top-1 move accuracy with sub-4ms inference. The RL phase culminated in a SAC agent paired with MCTS at inference — combining learned value estimates with tree-search lookahead — which advanced to the class tournament championship.