experiments#

Run experiments to test different models, prompts, and parameters for your LLM apps. Read our quickstart guide for more information.

evaluators#

These base classes are used to implement evaluators as classes. See our docs for more information.

To import evaluators, use the following: from arize.experimental.datasets.experiments.evaluators.base import ...

class Evaluator(*args, **kwargs)#

Bases: ABC

A helper superclass that guides the implementation of an Evaluator object. Subclasses must implement either the evaluate or the async_evaluate method; implementing both is recommended but not required.

This class is intended to be subclassed and should not be instantiated directly; a minimal subclass sketch follows the method reference below.

async async_evaluate(*, dataset_row=None, input=MappingProxyType({}), output=None, experiment_output=None, dataset_output=MappingProxyType({}), metadata=MappingProxyType({}), **kwargs)#

Asynchronously evaluate the given inputs and produce an evaluation result. This method should be implemented by subclasses to perform the actual evaluation logic. It is recommended to implement both this asynchronous method and the synchronous evaluate method, but it is not required.

Parameters:

output (Optional[TaskOutput]) – The output produced by the task.
expected (Optional[ExampleOutput]) – The expected output for comparison.
dataset_row (Optional[Mapping[str, JSONSerializable]]) – A row from the dataset.
metadata (ExampleMetadata) – Metadata associated with the example.
input (ExampleInput) – The input provided for evaluation.
**kwargs (Any) – Additional keyword arguments.

Returns:

The result of the evaluation.

Return type:

EvaluationResult

Raises:

NotImplementedError – If the method is not implemented by the subclass.

evaluate(*, dataset_row=None, input=MappingProxyType({}), output=None, experiment_output=None, dataset_output=MappingProxyType({}), metadata=MappingProxyType({}), **kwargs)#

Evaluate the given inputs and produce an evaluation result. This method should be implemented by subclasses to perform the actual evaluation logic. It is recommended to implement both this synchronous method and the asynchronous async_evaluate method, but it is not required.

Parameters:

output (Optional[TaskOutput]) – The output produced by the task.
expected (Optional[ExampleOutput]) – The expected output for comparison.
dataset_row (Optional[Mapping[str, JSONSerializable]]) – A row from the dataset.
metadata (ExampleMetadata) – Metadata associated with the example.
input (ExampleInput) – The input provided for evaluation.
**kwargs (Any) – Additional keyword arguments.

Raises:

NotImplementedError – If the method is not implemented by the subclass.
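The sketch below illustrates a minimal custom evaluator. It assumes Evaluator is exported from the evaluators.base module shown above and EvaluationResult from the types module documented below; the exact-match logic and the expected_output column name are hypothetical.

    from arize.experimental.datasets.experiments.evaluators.base import Evaluator
    from arize.experimental.datasets.experiments.types import EvaluationResult


    class ExactMatchEvaluator(Evaluator):
        """Scores 1.0 when the task output equals the expected value in the dataset row."""

        def evaluate(self, *, output=None, dataset_row=None, **kwargs):
            # "expected_output" is a hypothetical column name in the dataset row.
            expected = (dataset_row or {}).get("expected_output")
            matched = output == expected
            return EvaluationResult(
                score=1.0 if matched else 0.0,
                label="match" if matched else "mismatch",
                explanation=f"output equals expected_output: {matched}",
            )

        async def async_evaluate(self, *, output=None, dataset_row=None, **kwargs):
            # Recommended but optional: delegate to the synchronous implementation.
            return self.evaluate(output=output, dataset_row=dataset_row, **kwargs)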

types#

These are the classes used across the experiment functions.

To import types, use the following: from arize.experimental.datasets.experiments.types import ...

class Example(id=<factory>, updated_at=<factory>, input=<factory>, output=<factory>, metadata=<factory>, dataset_row=<factory>)#

Bases: object

Represents an example in an experiment dataset.

Parameters:

id – The unique identifier for the example.
updated_at – The timestamp when the example was last updated.
input – The input data for the example.
output – The output data for the example.
metadata – Additional metadata for the example.
dataset_row – The original dataset row containing the example data.
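A minimal construction sketch is shown below; the field values and types (mappings for input, output, metadata, and dataset_row) are illustrative assumptions based on the parameter descriptions above.

    from arize.experimental.datasets.experiments.types import Example

    example = Example(
        id="example-123",
        input={"question": "What is the capital of France?"},
        output={"answer": "Paris"},
        metadata={"source": "docs"},
        dataset_row={"question": "What is the capital of France?", "expected_output": "Paris"},
    )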

class EvaluationResult(score=None, label=None, explanation=None, metadata=<factory>)#

Bases: object

Represents the result of an evaluation.

Parameters:

score – The score of the evaluation.
label – The label of the evaluation.
explanation – The explanation of the evaluation.
metadata – Additional metadata for the evaluation.
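Evaluators typically return an instance like the one below; the field values are illustrative.

    from arize.experimental.datasets.experiments.types import EvaluationResult

    result = EvaluationResult(
        score=0.8,
        label="relevant",
        explanation="The response addresses the question directly.",
        metadata={"evaluator_name": "relevance"},
    )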

class ExperimentRun(start_time, end_time, experiment_id, dataset_example_id, repetition_number, output, error=None, id=<factory>, trace_id=None)#

Bases: object

Represents a single run of an experiment.

Parameters:

start_time – The start time of the experiment run.
end_time – The end time of the experiment run.
experiment_id – The unique identifier for the experiment.
dataset_example_id – The unique identifier for the dataset example.
repetition_number – The repetition number of the experiment run.
output – The output of the experiment run.
error – The error message if the experiment run failed.
id – The unique identifier for the experiment run.
trace_id – The trace identifier for the experiment run.

class ExperimentEvaluationRun(experiment_run_id, start_time, end_time, name, annotator_kind, error=None, result=None, id=<factory>, trace_id=None)#

Bases: object

Represents a single evaluation run of an experiment.

Parameters:

experiment_run_id – The unique identifier for the experiment run.
start_time – The start time of the evaluation run.
end_time – The end time of the evaluation run.
name – The name of the evaluation run.
annotator_kind – The kind of annotator used in the evaluation run.
error – The error message if the evaluation run failed.
result (Optional[EvaluationResult]) – The result of the evaluation run.
id (str) – The unique identifier for the evaluation run.
trace_id (Optional[TraceId]) – The trace identifier for the evaluation run.
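Both run records are normally created by the experiment tooling rather than constructed by hand; the sketch below only shows how the documented fields fit together. The timestamp types, identifier values, and the "LLM" annotator kind are assumptions.

    from datetime import datetime, timezone

    from arize.experimental.datasets.experiments.types import (
        EvaluationResult,
        ExperimentEvaluationRun,
        ExperimentRun,
    )

    now = datetime.now(timezone.utc)

    run = ExperimentRun(
        start_time=now,
        end_time=now,
        experiment_id="exp-1",
        dataset_example_id="example-123",
        repetition_number=1,
        output="model response text",
    )

    evaluation = ExperimentEvaluationRun(
        experiment_run_id=run.id,  # links the evaluation to the run above
        start_time=now,
        end_time=now,
        name="relevance",
        annotator_kind="LLM",  # assumed value; allowed kinds are not listed here
        result=EvaluationResult(score=1.0, label="relevant"),
    )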