Graders for evaluating agent responses to biological data analysis problems.
Install the package in editable mode:

```bash
pip install -e .
```

Quick start:

```python
from eval_graders import get_grader, GraderResult

# agent_answer and grader_config are placeholders for the agent's
# structured answer dict and the per-question grading config.
grader = get_grader("numeric_tolerance")
result = grader.evaluate_answer(agent_answer, grader_config)
print(f"Passed: {result.passed}")
print(f"Reasoning: {result.reasoning}")
```

| Grader | Use Case |
|---|---|
| `numeric_tolerance` | QC metrics, counts, ratios. Supports nested keys like `metrics.chi2_p` (see the sketch below) |
| `marker_gene_precision_recall` | Marker gene discovery, differential expression |
| `marker_gene_separation` | AUROC scores for marker gene quality |
| `label_set_jaccard` | Multi-select questions, set matching |
| `distribution_comparison` | Cell type proportions |
| `spatial_adjacency` | Spatial distance validation |
| `multiple_choice` | Multiple choice questions |
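For `numeric_tolerance`, a config might look like the minimal sketch below. The field names `key`, `expected`, and `tolerance` are illustrative assumptions rather than the package's documented schema; the nested `metrics.chi2_p` key path is the documented feature being shown.

```python
from eval_graders import get_grader

# Hypothetical config: these field names are assumptions, not the
# package's documented schema.
grader_config = {
    "key": "metrics.chi2_p",  # nested key path into the answer dict
    "expected": 0.03,         # reference value
    "tolerance": 0.005,       # allowed absolute deviation
}
agent_answer = {"metrics": {"chi2_p": 0.031}}

grader = get_grader("numeric_tolerance")
result = grader.evaluate_answer(agent_answer, grader_config)
print(result.passed, result.reasoning)
```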
Replace the local graders import in `eval_server.py`:

```python
# Before
from graders import GRADER_REGISTRY, extract_answer_from_conversation

# After
from eval_graders import GRADER_REGISTRY
# Note: extract_answer_from_conversation should remain in the harness,
# as it depends on the conversation format specific to that harness.
```

Usage remains the same:
```python
grader_cls = GRADER_REGISTRY[grader_type]
grader = grader_cls()
result = grader.evaluate(agent_answer, grader_config)
```

The graders are designed to be standalone and can be integrated into any RL training or evaluation pipeline:
```python
from eval_graders import get_grader, GraderResult

def grade_agent_response(
    agent_answer: dict,
    grader_type: str,
    grader_config: dict,
) -> GraderResult:
    grader = get_grader(grader_type)
    return grader.evaluate_answer(agent_answer, grader_config)
```
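A `GraderResult` can then be collapsed into a scalar reward for training. The shaping below (a binary pass signal plus a small dense bonus from any reported `recall` metric) is a hypothetical sketch, not a scheme the package prescribes:

```python
from eval_graders import GraderResult

def result_to_reward(result: GraderResult) -> float:
    """Hypothetical reward shaping for an RL loop."""
    reward = 1.0 if result.passed else 0.0
    # A dense bonus softens the sparse pass/fail signal early in training;
    # graders that report no "recall" metric contribute nothing here.
    reward += 0.1 * result.metrics.get("recall", 0.0)
    return reward
```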
Extend the base grader with custom evaluation logic:

```python
from eval_graders import BinaryGrader, GraderResult

class CustomBioGrader(BinaryGrader):
    def evaluate_answer(self, agent_answer: dict, config: dict) -> GraderResult:
        # Custom evaluation logic: recall of expected pathways
        expected = config.get("expected_pathways", [])
        predicted = agent_answer.get("pathways", [])
        overlap = set(expected) & set(predicted)
        recall = len(overlap) / len(expected) if expected else 0.0
        passed = recall >= config.get("min_recall", 0.5)
        return GraderResult(
            passed=passed,
            metrics={"recall": recall, "overlap": list(overlap)},
            reasoning=f"Pathway recall: {recall:.2f}",
            agent_answer=agent_answer,
        )
```
```python
# Register the custom grader
from eval_graders import GRADER_REGISTRY

GRADER_REGISTRY["custom_pathway"] = CustomBioGrader
```
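The registered grader can then be instantiated through the registry, using the same pattern as the harness usage above (the pathway names here are made up for illustration):

```python
grader = GRADER_REGISTRY["custom_pathway"]()
result = grader.evaluate_answer(
    {"pathways": ["WNT", "NOTCH"]},
    {"expected_pathways": ["WNT", "NOTCH", "TGF-beta"], "min_recall": 0.5},
)
print(result.reasoning)  # Pathway recall: 0.67
```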
All graders implement `BinaryGrader`, which exposes two equivalent methods:

```python
grader.evaluate_answer(agent_answer: dict, config: dict) -> GraderResult
grader.evaluate(agent_answer: dict, config: dict) -> GraderResult  # alias
```

The `GraderResult` dataclass contains:
```python
from dataclasses import dataclass

@dataclass
class GraderResult:
    passed: bool        # Whether the evaluation passed
    metrics: dict       # Detailed metrics (precision, recall, errors, etc.)
    reasoning: str      # Human-readable explanation
    agent_answer: dict  # The evaluated answer
```

Run the sample eval tests:

```bash
python tests/test_sample_evals.py
```