GoogleCloudAiplatformV1SchemaModelevaluationMetricsPairwiseTextGenerationEvaluationMetrics
import type { GoogleCloudAiplatformV1SchemaModelevaluationMetricsPairwiseTextGenerationEvaluationMetrics } from "https://googleapis.deno.dev/v1/aiplatform:v1.ts";
Metrics for general pairwise text generation evaluation results.
Properties
Percentage of time the autorater decided the baseline model had the better response.
A measurement of agreement between the autorater and human raters that takes the likelihood of random agreement into account.
Number of examples where the autorater chose the baseline model, but humans preferred the model (false negatives).
Number of examples where the autorater chose the model, but humans preferred the baseline model (false positives).
Percentage of time humans decided the baseline model had the better response.
Percentage of time humans decided the model had the better response.
Percentage of time the autorater decided the model had the better response.
Fraction of cases where the autorater and humans thought the model had a better response out of all cases where the autorater thought the model had a better response. That is, true positives divided by all examples the autorater labeled positive (precision).
Fraction of cases where the autorater and humans thought the model had a better response out of all cases where the humans thought the model had a better response. That is, true positives divided by all examples the humans labeled positive (recall).
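The descriptions above can be related to one another through a 2x2 confusion matrix in which "positive" means "the model had the better response". The following is a minimal sketch of that arithmetic, not the service's implementation. The `PairwiseCounts` and `DerivedMetrics` interfaces, their field names, and the `deriveMetrics` function are all hypothetical and are not part of the `aiplatform:v1.ts` module. The sketch also assumes every example has a definite preference (no ties) and reports rates as fractions in [0, 1]; the scale the API uses for its percentage fields is not stated here.

```typescript
// Hypothetical input: the four cells of the autorater-vs-human confusion matrix.
interface PairwiseCounts {
  truePositive: number;  // autorater and humans both preferred the model
  falsePositive: number; // autorater chose the model, humans preferred the baseline
  falseNegative: number; // autorater chose the baseline, humans preferred the model
  trueNegative: number;  // autorater and humans both preferred the baseline
}

// Hypothetical output: quantities matching the descriptions above, as fractions.
interface DerivedMetrics {
  autoraterModelWinRate: number;
  humanModelWinRate: number;
  precision: number;
  recall: number;
  cohensKappa: number;
}

// Returns NaN instead of dividing by zero when a denominator is empty.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? NaN : numerator / denominator;
}

function deriveMetrics(c: PairwiseCounts): DerivedMetrics {
  const total =
    c.truePositive + c.falsePositive + c.falseNegative + c.trueNegative;

  const autoraterPositive = c.truePositive + c.falsePositive; // autorater chose the model
  const humanPositive = c.truePositive + c.falseNegative;     // humans chose the model
  const autoraterNegative = c.falseNegative + c.trueNegative; // autorater chose the baseline
  const humanNegative = c.falsePositive + c.trueNegative;     // humans chose the baseline

  // Observed agreement: both raters picked the same side.
  const observed = ratio(c.truePositive + c.trueNegative, total);

  // Agreement expected by chance, from each rater's marginal frequencies.
  const expected = ratio(
    autoraterPositive * humanPositive + autoraterNegative * humanNegative,
    total * total,
  );

  return {
    autoraterModelWinRate: ratio(autoraterPositive, total),
    humanModelWinRate: ratio(humanPositive, total),
    precision: ratio(c.truePositive, autoraterPositive),
    recall: ratio(c.truePositive, humanPositive),
    // Cohen's kappa: agreement beyond chance, scaled by the maximum possible.
    cohensKappa: ratio(observed - expected, 1 - expected),
  };
}

// Illustrative counts only (not real evaluation data).
const example = deriveMetrics({
  truePositive: 40,
  falsePositive: 10,
  falseNegative: 20,
  trueNegative: 30,
});
console.log(example);
```

With these illustrative counts, observed agreement is 0.7 and chance agreement is 0.5, so kappa comes out at roughly 0.4 even though the raters agree on 70% of examples. That gap is what "takes the likelihood of random agreement into account" refers to.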