I need a freelancer to prepare benchmark questions and answers for testing a custom LLM’s reasoning ability.
Scope:

Question Set:
- Collect 500–600 LLM benchmark questions with correct answers.
- Focus areas: logical, mathematical, commonsense, analytical, and multi-step reasoning.
- Deliver as JSON or CSV.

Python Script:
- Load the questions and send them to an LLM (I'll handle API integration).
- Compare model answers to the correct ones.
- Output a simple accuracy report.

Requirements:
- Knowledge of LLMs, reasoning datasets, or NLP is preferred.
- Clean, documented code.
- Use only open or original questions.
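To make the Python Script part of the scope concrete, here is a minimal sketch of one possible shape for it, not a required design. Several things in it are assumptions rather than part of the brief: the CSV column names (`id`, `category`, `question`, `answer`), exact-match scoring after lowercasing and whitespace cleanup, and the hypothetical `placeholder_model` function, which only stands in for the real LLM call that will be integrated separately.

```python
import csv
import io
from typing import Callable, Dict, List

# Assumed CSV layout (hypothetical column names): id, category, question, answer.
SAMPLE_CSV = (
    "id,category,question,answer\n"
    "1,mathematical,What is 12 * 12?,144\n"
    '2,logical,"If all A are B and all B are C, are all A C?",yes\n'
    "3,commonsense,Which is heavier: a kilogram of iron or a kilogram of feathers?,neither\n"
)


def load_questions(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text into a list of row dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def evaluate(
    questions: List[Dict[str, str]],
    ask_model: Callable[[str], str],
) -> Dict[str, float]:
    """Return accuracy per category plus an 'overall' entry, using exact match."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for q in questions:
        cat = q["category"]
        totals[cat] = totals.get(cat, 0) + 1
        if normalize(ask_model(q["question"])) == normalize(q["answer"]):
            correct[cat] = correct.get(cat, 0) + 1
    report = {cat: correct.get(cat, 0) / totals[cat] for cat in totals}
    n = sum(totals.values())
    report["overall"] = sum(correct.values()) / n if n else 0.0
    return report


def placeholder_model(question: str) -> str:
    """Hypothetical stand-in for the real LLM call; returns canned answers."""
    canned = {
        "What is 12 * 12?": "144",
        "If all A are B and all B are C, are all A C?": "Yes",
        "Which is heavier: a kilogram of iron or a kilogram of feathers?": "iron",
    }
    return canned.get(question, "")


if __name__ == "__main__":
    results = evaluate(load_questions(SAMPLE_CSV), placeholder_model)
    for name, accuracy in results.items():
        print(f"{name}: {accuracy:.1%}")
```

Exact-match scoring is the simplest option and works best when answers are short (a number, yes/no, a single word). Free-form answers would likely need a looser comparison, which is left open here.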