Systematic Evaluation of Similarity Metrics for Retrieval, Reranking, and Completion in Retrieval Augmented Generation Systems
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Access Rights
Abstract
Hallucinations and out-of-context responses are two major problems with large language models (LLMs). Retrieval Augmented Generation (RAG) has emerged as a promising approach to both, because it grounds LLM output in external knowledge. The effectiveness of a RAG pipeline depends on several factors, including the choice of similarity metric. This paper presents a systematic evaluation of a comprehensive RAG pipeline that combines the Milvus vector database with HNSW indexing, OpenAI's embedding models, and GPT-based completion. We compare three widely used similarity metrics (Cosine, Inner Product, and L2) under identical conditions. The results show that retrieval and reranking performance are highly sensitive to the similarity metric. Cosine and Inner Product consistently achieve substantially higher recall (R@10 = 0.9092-0.925), Mean Reciprocal Rank (MRR = 0.7806-0.7930), and nDCG (nDCG@10 = 0.8121-0.8252) than L2. In contrast, completion-stage metrics such as token usage, cost, and latency remain largely unaffected by the choice of metric. These results underscore the crucial role of the retrieval similarity function in determining RAG effectiveness.
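
For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of the three similarity functions and the three retrieval metrics. It is not the paper's implementation: the function names are hypothetical, relevance is assumed to be binary, and the paper's exact metric definitions are not stated in the abstract. MRR would be the mean of `reciprocal_rank` over all queries.

```python
import math

# --- Similarity functions (illustrative names, plain Python lists as vectors) ---

def inner_product(a, b):
    """Dot product; higher means more similar."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Inner product of the length-normalized vectors; higher means more similar."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

def l2_distance(a, b):
    """Euclidean distance; LOWER means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# --- Retrieval metrics (assumption: binary relevance, one ranked list per query) ---

def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """DCG of the top k results divided by the DCG of an ideal ordering."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / ideal if ideal > 0 else 0.0
```

As a usage example, for a ranked list `["d3", "d1", "d7"]` with `{"d1"}` as the only relevant document, `recall_at_k` gives 1.0, `reciprocal_rank` gives 0.5, and `ndcg_at_k` gives 1 / log2(3), roughly 0.63. Note that L2 is a distance while Cosine and Inner Product are similarities, so results must be sorted in opposite directions.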









