Firms of all measurements trust in our models to serve their clients, making it crucial for our model outputs to take care of superior precision at scale. To evaluate this, we use a considerable set of complex, factual queries that target known weaknesses in present models. We categorize the responses into right answers, incorrect answers (or hallu