PARIKSHA: A Scalable, Democratic, Transparent Evaluation Platform for Assessing Indic Large Language Models

Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors: the lack of benchmarks with sufficient linguistic diversity, the contamination of popular benchmarks in LLM pre-training data, and the absence of local, cultural nuances in translated benchmarks. As a result, extensive evaluation of LLMs in the multilingual setting is difficult, leading to a lack of fair comparisons between models and difficulties in replicating the evaluation setups used by some models. Recently, several Indic (Indian language) LLMs have been created in response to calls for building more locally and culturally relevant LLMs. Our evaluation framework, named Pariksha, is the first comprehensive evaluation of Indic LLMs that uses a combination of human and LLM-based evaluation. We conduct a total of 90k human evaluations and 50k LLM-based evaluations of 29 models to present leaderboards for 10 Indic languages. Pariksha not only provides inclusive and democratic evaluation by engaging a community of workers who represent the average Indian, but also serves as a research platform for improving the evaluation process itself. By releasing all evaluation artifacts, we make the evaluation process completely transparent. By conducting Pariksha at regular intervals, we aim to provide the Indic LLM community with a dynamic, evolving evaluation platform, enabling models to improve over time using insights and artifacts from our evaluations.