fix: multiple evaluators don't consume the reader #1322

Merged 1 commit on Dec 11, 2024
5 changes: 5 additions & 0 deletions python/langsmith/evaluation/_runner.py
@@ -1,4 +1,4 @@
"""V2 Evaluation Interface."""

Check notice on line 1 in python/langsmith/evaluation/_runner.py (GitHub Actions / benchmark)

Benchmark results

create_5_000_run_trees: Mean +- std dev: 712 ms +- 98 ms (warning: may be unstable, std dev is 14% of the mean)
create_10_000_run_trees: Mean +- std dev: 1.38 sec +- 0.15 sec (warning: may be unstable, std dev is 11% of the mean)
create_20_000_run_trees: Mean +- std dev: 1.36 sec +- 0.18 sec (warning: may be unstable, std dev is 13% of the mean)
dumps_class_nested_py_branch_and_leaf_200x400: Mean +- std dev: 698 us +- 8 us
dumps_class_nested_py_leaf_50x100: Mean +- std dev: 25.4 ms +- 0.4 ms
dumps_class_nested_py_leaf_100x200: Mean +- std dev: 104 ms +- 2 ms
dumps_dataclass_nested_50x100: Mean +- std dev: 25.3 ms +- 0.2 ms
dumps_pydantic_nested_50x100: Mean +- std dev: 72.0 ms +- 16.6 ms (warning: may be unstable, std dev is 23% of the mean)
dumps_pydanticv1_nested_50x100: Mean +- std dev: 197 ms +- 3 ms

Check notice on line 1 in python/langsmith/evaluation/_runner.py (GitHub Actions / benchmark)

Comparison against main

| Benchmark                                     | main     | changes                |
|-----------------------------------------------|----------|------------------------|
| dumps_pydanticv1_nested_50x100                | 219 ms   | 197 ms: 1.11x faster   |
| create_20_000_run_trees                       | 1.41 sec | 1.36 sec: 1.04x faster |
| create_10_000_run_trees                       | 1.40 sec | 1.38 sec: 1.02x faster |
| dumps_class_nested_py_branch_and_leaf_200x400 | 705 us   | 698 us: 1.01x faster   |
| create_5_000_run_trees                        | 715 ms   | 712 ms: 1.00x faster   |
| dumps_dataclass_nested_50x100                 | 25.2 ms  | 25.3 ms: 1.00x slower  |
| dumps_class_nested_py_leaf_100x200            | 103 ms   | 104 ms: 1.01x slower   |
| dumps_class_nested_py_leaf_50x100             | 25.1 ms  | 25.4 ms: 1.01x slower  |
| dumps_pydantic_nested_50x100                  | 66.9 ms  | 72.0 ms: 1.07x slower  |
| Geometric mean                                | (ref)    | 1.01x faster           |

from __future__ import annotations

@@ -1614,6 +1614,11 @@
f" run {run.id if run else ''}: {repr(e)}",
exc_info=True,
)
if example.attachments is not None:
for attachment in example.attachments:
reader = example.attachments[attachment]["reader"]
reader.seek(0)

return ExperimentResultRow(
run=run,
example=example,
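
For context, here is a minimal, self-contained sketch of the failure mode this hunk guards against: a stream-style attachment reader can only be consumed once, so when several evaluators read the same reader, every evaluator after the first sees an empty stream unless the reader is rewound with `seek(0)` in between. The `io.BytesIO` buffer and the two evaluator functions below are illustrative stand-ins, not the SDK's own attachment or evaluator types.

```python
import io

# Hypothetical evaluators that both read the same attachment stream.
def evaluator_a(reader: io.BytesIO) -> int:
    return len(reader.read())

def evaluator_b(reader: io.BytesIO) -> int:
    return len(reader.read())

# io.BytesIO stands in for the attachment "reader" handed to evaluators.
reader = io.BytesIO(b"fake image data for testing")

print(evaluator_a(reader))  # 27: the first evaluator consumes the stream
print(evaluator_b(reader))  # 0: without a rewind, the second evaluator sees nothing

# The hunk above rewinds each attachment reader after an evaluator runs,
# so the next evaluator starts reading from the beginning again.
reader.seek(0)
print(evaluator_b(reader))  # 27 again
```
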
72 changes: 72 additions & 0 deletions python/tests/integration_tests/test_client.py
@@ -1254,6 +1254,78 @@ def test_list_examples_attachments_keys(langchain_client: Client) -> None:
langchain_client.delete_dataset(dataset_id=dataset.id)


def test_evaluate_with_attachments_multiple_evaluators(
    langchain_client: Client,
) -> None:
    """Test evaluating examples with attachments and multiple evaluators."""
    dataset_name = "__test_evaluate_attachments_multiple" + uuid4().hex[:4]

    # 1. Create dataset
    dataset = langchain_client.create_dataset(
        dataset_name,
        description="Test dataset for evals with attachments",
        data_type=DataType.kv,
    )

    # 2. Create example with attachments
    example = ExampleUploadWithAttachments(
        inputs={"question": "What is shown in the image?"},
        outputs={"answer": "test image"},
        attachments={
            "image": ("image/png", b"fake image data for testing"),
        },
    )

    langchain_client.upload_examples_multipart(dataset_id=dataset.id, uploads=[example])

    def target(inputs: Dict[str, Any], attachments: Dict[str, Any]) -> Dict[str, Any]:
        # Verify we receive the attachment data
        assert "image" in attachments
        assert "presigned_url" in attachments["image"]
        image_data = attachments["image"]["reader"]
        assert image_data.read() == b"fake image data for testing"
        return {"answer": "test image"}

    def evaluator_1(
        outputs: dict, reference_outputs: dict, attachments: dict
    ) -> Dict[str, Any]:
        assert "image" in attachments
        assert "presigned_url" in attachments["image"]
        image_data = attachments["image"]["reader"]
        assert image_data.read() == b"fake image data for testing"
        return {
            "score": float(
                reference_outputs.get("answer") == outputs.get("answer")  # type: ignore
            )
        }

    def evaluator_2(
        outputs: dict, reference_outputs: dict, attachments: dict
    ) -> Dict[str, Any]:
        assert "image" in attachments
        assert "presigned_url" in attachments["image"]
        image_data = attachments["image"]["reader"]
        assert image_data.read() == b"fake image data for testing"
        return {
            "score": float(
                reference_outputs.get("answer") == outputs.get("answer")  # type: ignore
            )
        }

    results = langchain_client.evaluate(
        target,
        data=dataset_name,
        evaluators=[evaluator_1, evaluator_2],
        num_repetitions=2,
    )

    for result in results:
        assert result["evaluation_results"]["results"][0].score == 1.0
        assert result["evaluation_results"]["results"][1].score == 1.0

    langchain_client.delete_dataset(dataset_name=dataset_name)


def test_evaluate_with_attachments(langchain_client: Client) -> None:
"""Test evaluating examples with attachments."""
dataset_name = "__test_evaluate_attachments" + uuid4().hex[:4]