Introducing DeepQuery — a benchmarking methodology for RAG systems using LLM-as-a-Judge pairwise comparisons and Bradley-Terry ranking. We evaluate vanilla, reranker-based, and Captain pipelines across the OpenRAG-Bench dataset on accuracy, win rate, and latency.