eval-graders

Graders for evaluating agent responses to biological data analysis problems.

Installation

pip install -e .

Quick Start

from eval_graders import get_grader, GraderResult

grader = get_grader("numeric_tolerance")
result = grader.evaluate_answer(agent_answer, grader_config)

print(f"Passed: {result.passed}")
print(f"Reasoning: {result.reasoning}")

Available Graders

Grader                          Use Case
------                          --------
numeric_tolerance               QC metrics, counts, ratios. Supports nested keys like metrics.chi2_p
marker_gene_precision_recall    Marker gene discovery, differential expression
marker_gene_separation          AUROC scores for marker gene quality
label_set_jaccard               Multi-select questions, set matching
distribution_comparison         Cell type proportions
spatial_adjacency               Spatial distance validation
multiple_choice                 Multiple choice questions
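
To see which graders are registered at runtime, you can inspect GRADER_REGISTRY directly. A minimal sketch, assuming the registry is a plain dict mapping grader names to grader classes (as the integration examples below suggest):

from eval_graders import GRADER_REGISTRY

# GRADER_REGISTRY maps grader names (e.g. "numeric_tolerance") to grader classes.
for name, grader_cls in sorted(GRADER_REGISTRY.items()):
    print(f"{name}: {grader_cls.__name__}")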

Integration with latch-plots-eval-harness

Replace the local graders import in eval_server.py:

# Before
from graders import GRADER_REGISTRY, extract_answer_from_conversation

# After
from eval_graders import GRADER_REGISTRY

# Note: extract_answer_from_conversation should remain in the harness
# as it depends on conversation format specific to the harness

Usage remains the same:

grader_cls = GRADER_REGISTRY[grader_type]
grader = grader_cls()
result = grader.evaluate(agent_answer, grader_config)

Integration with Third-Party RL Harnesses

The graders are designed to be standalone and can be integrated into any RL training or evaluation pipeline.

Basic Integration

from eval_graders import get_grader, GRADER_REGISTRY, GraderResult

def grade_agent_response(
    agent_answer: dict,
    grader_type: str,
    grader_config: dict
) -> GraderResult:
    grader = get_grader(grader_type)
    return grader.evaluate_answer(agent_answer, grader_config)
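
A typical call, assuming a task record that carries the grader name and its config alongside the agent's answer. The field names in this record and in the answer dict are illustrative only:

# Hypothetical task record; field names are illustrative, not a fixed schema.
task = {
    "grader_type": "label_set_jaccard",
    "grader_config": {"expected_labels": ["T cell", "B cell"]},
}
agent_answer = {"labels": ["T cell", "B cell", "NK cell"]}

result = grade_agent_response(agent_answer, task["grader_type"], task["grader_config"])
print(result.passed, result.reasoning)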

Custom Grader Extension

Subclass the base grader to add custom evaluation logic:

from eval_graders import BinaryGrader, GraderResult

class CustomBioGrader(BinaryGrader):
    def evaluate_answer(self, agent_answer: dict, config: dict) -> GraderResult:
        # Custom evaluation logic
        expected = config.get("expected_pathways", [])
        predicted = agent_answer.get("pathways", [])

        overlap = set(expected) & set(predicted)
        recall = len(overlap) / len(expected) if expected else 0

        passed = recall >= config.get("min_recall", 0.5)

        return GraderResult(
            passed=passed,
            metrics={"recall": recall, "overlap": list(overlap)},
            reasoning=f"Pathway recall: {recall:.2f}",
            agent_answer=agent_answer
        )

# Register custom grader
from eval_graders import GRADER_REGISTRY
GRADER_REGISTRY["custom_pathway"] = CustomBioGrader

Grader Interface

All graders subclass BinaryGrader and expose two equivalent entry points:

grader.evaluate_answer(agent_answer: dict, config: dict) -> GraderResult
grader.evaluate(agent_answer: dict, config: dict) -> GraderResult  # alias

The GraderResult dataclass contains:

@dataclass
class GraderResult:
    passed: bool           # Whether the evaluation passed
    metrics: dict          # Detailed metrics (precision, recall, errors, etc.)
    reasoning: str         # Human-readable explanation
    agent_answer: dict     # The evaluated answer
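
Because GraderResult is a plain dataclass, aggregating results in a harness is straightforward. A minimal sketch that uses only the passed field listed above, for example to compute a batch pass rate:

from eval_graders import GraderResult

def pass_rate(results: list[GraderResult]) -> float:
    # Fraction of evaluations that passed; 0.0 for an empty batch.
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)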

Running Tests

python tests/test_sample_evals.py
