Graders for evaluating agent responses to biological data analysis problems.
Install the package in editable mode:

```bash
pip install -e .
```

Quick start:

```python
from eval_graders import get_grader, GraderResult

# agent_answer and grader_config are placeholders for the agent's
# structured answer dict and the per-question grading config.
grader = get_grader("numeric_tolerance")
result = grader.evaluate_answer(agent_answer, grader_config)
print(f"Passed: {result.passed}")
print(f"Reasoning: {result.reasoning}")
```

| Grader | Use Case |
|---|---|
| `numeric_tolerance` | QC metrics, counts, ratios. Supports nested keys like `metrics.chi2_p` (see the sketch below) |
| `marker_gene_precision_recall` | Marker gene discovery, differential expression |
| `marker_gene_separation` | AUROC scores for marker gene quality |
| `label_set_jaccard` | Multi-select questions, set matching |
| `distribution_comparison` | Cell type proportions |
| `spatial_adjacency` | Spatial distance validation |
| `multiple_choice` | Multiple choice questions |
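For `numeric_tolerance`, a config might look like the minimal sketch below. The field names `key`, `expected`, and `tolerance` are illustrative assumptions rather than the package's documented schema; the nested `metrics.chi2_p` key path is the documented feature being shown.

```python
from eval_graders import get_grader

# Hypothetical config: these field names are assumptions, not the
# package's documented schema.
grader_config = {
    "key": "metrics.chi2_p",  # nested key path into the answer dict
    "expected": 0.03,         # reference value
    "tolerance": 0.005,       # allowed absolute deviation
}
agent_answer = {"metrics": {"chi2_p": 0.031}}

grader = get_grader("numeric_tolerance")
result = grader.evaluate_answer(agent_answer, grader_config)
print(result.passed, result.reasoning)
```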
Replace the local graders import in `eval_server.py`:

```python
# Before
from graders import GRADER_REGISTRY, extract_answer_from_conversation

# After
from eval_graders import GRADER_REGISTRY
# Note: extract_answer_from_conversation should remain in the harness,
# as it depends on the conversation format specific to that harness.
```

Usage remains the same:
```python
grader_cls = GRADER_REGISTRY[grader_type]
grader = grader_cls()
result = grader.evaluate(agent_answer, grader_config)
```

The graders are designed to be standalone and can be integrated into any RL training or evaluation pipeline:
```python
from eval_graders import get_grader, GraderResult

def grade_agent_response(
    agent_answer: dict,
    grader_type: str,
    grader_config: dict,
) -> GraderResult:
    grader = get_grader(grader_type)
    return grader.evaluate_answer(agent_answer, grader_config)
```
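A `GraderResult` can then be collapsed into a scalar reward for training. The shaping below (a binary pass signal plus a small dense bonus from any reported `recall` metric) is a hypothetical sketch, not a scheme the package prescribes:

```python
from eval_graders import GraderResult

def result_to_reward(result: GraderResult) -> float:
    """Hypothetical reward shaping for an RL loop."""
    reward = 1.0 if result.passed else 0.0
    # A dense bonus softens the sparse pass/fail signal early in training;
    # graders that report no "recall" metric contribute nothing here.
    reward += 0.1 * result.metrics.get("recall", 0.0)
    return reward
```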
Extend the base grader with custom evaluation logic:

```python
from eval_graders import BinaryGrader, GraderResult

class CustomBioGrader(BinaryGrader):
    def evaluate_answer(self, agent_answer: dict, config: dict) -> GraderResult:
        # Custom evaluation logic: recall of expected pathways
        expected = config.get("expected_pathways", [])
        predicted = agent_answer.get("pathways", [])
        overlap = set(expected) & set(predicted)
        recall = len(overlap) / len(expected) if expected else 0.0
        passed = recall >= config.get("min_recall", 0.5)
        return GraderResult(
            passed=passed,
            metrics={"recall": recall, "overlap": list(overlap)},
            reasoning=f"Pathway recall: {recall:.2f}",
            agent_answer=agent_answer,
        )
```
```python
# Register the custom grader
from eval_graders import GRADER_REGISTRY

GRADER_REGISTRY["custom_pathway"] = CustomBioGrader
```
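The registered grader can then be instantiated through the registry, using the same pattern as the harness usage above (the pathway names here are made up for illustration):

```python
grader = GRADER_REGISTRY["custom_pathway"]()
result = grader.evaluate_answer(
    {"pathways": ["WNT", "NOTCH"]},
    {"expected_pathways": ["WNT", "NOTCH", "TGF-beta"], "min_recall": 0.5},
)
print(result.reasoning)  # Pathway recall: 0.67
```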
All graders implement `BinaryGrader`, which exposes two equivalent methods:

```python
grader.evaluate_answer(agent_answer: dict, config: dict) -> GraderResult
grader.evaluate(agent_answer: dict, config: dict) -> GraderResult  # alias
```

The `GraderResult` dataclass contains:
```python
from dataclasses import dataclass

@dataclass
class GraderResult:
    passed: bool        # Whether the evaluation passed
    metrics: dict       # Detailed metrics (precision, recall, errors, etc.)
    reasoning: str      # Human-readable explanation
    agent_answer: dict  # The evaluated answer
```

Run the sample eval tests:

```bash
python tests/test_sample_evals.py
```