GoogleCloudAiplatformV1SchemaModelevaluationMetricsPairwiseTextGenerationEvaluationMetrics

import type { GoogleCloudAiplatformV1SchemaModelevaluationMetricsPairwiseTextGenerationEvaluationMetrics } from "https://googleapis.deno.dev/v1/aiplatform:v1.ts";

Metrics for general pairwise text generation evaluation results.

interface GoogleCloudAiplatformV1SchemaModelevaluationMetricsPairwiseTextGenerationEvaluationMetrics {
  accuracy?: number;
  baselineModelWinRate?: number;
  cohensKappa?: number;
  f1Score?: number;
  falseNegativeCount?: bigint;
  falsePositiveCount?: bigint;
  humanPreferenceBaselineModelWinRate?: number;
  humanPreferenceModelWinRate?: number;
  modelWinRate?: number;
  precision?: number;
  recall?: number;
  trueNegativeCount?: bigint;
  truePositiveCount?: bigint;
}

Properties

accuracy?: number

Fraction of cases where the autorater agreed with the human raters.
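As a rough illustration, accuracy can be reproduced from the four confusion counts documented on this page: the autorater agrees with humans in the true positive and true negative cases. The helper name `accuracyFromCounts` is hypothetical, and the sketch assumes the four counts cover every rated example (whether the service excludes ties or unrated examples is not stated here).

```typescript
// Hypothetical helper, not part of the API: accuracy from the four counts.
// Assumes every rated example falls into exactly one of the four buckets.
function accuracyFromCounts(
  truePositiveCount: bigint,
  falsePositiveCount: bigint,
  trueNegativeCount: bigint,
  falseNegativeCount: bigint,
): number | undefined {
  // Sum in bigint first so large counts are not rounded before adding.
  const agreed = truePositiveCount + trueNegativeCount;
  const total = agreed + falsePositiveCount + falseNegativeCount;
  if (total === BigInt(0)) return undefined; // no examples, accuracy undefined
  return Number(agreed) / Number(total);
}

// Usage: 80 agreements out of 100 examples.
const exampleAccuracy = accuracyFromCounts(BigInt(40), BigInt(10), BigInt(40), BigInt(10));
console.log(exampleAccuracy);
```

Returning `undefined` for an empty set mirrors the optional `accuracy?: number` field rather than producing `NaN`.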

baselineModelWinRate?: number

Percentage of time the autorater decided the baseline model had the better response.

cohensKappa?: number

A measurement of agreement between the autorater and human raters that takes the likelihood of random agreement into account.
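To make the "random agreement" correction concrete, here is a minimal sketch of the standard two-rater, two-category Cohen's kappa computed from the confusion counts on this page. The `ConfusionCounts` type and the function name are hypothetical, and it is an assumption that the service computes `cohensKappa` in exactly this way.

```typescript
// Hypothetical shape holding the four counts from this interface.
interface ConfusionCounts {
  truePositiveCount: bigint;
  falsePositiveCount: bigint;
  trueNegativeCount: bigint;
  falseNegativeCount: bigint;
}

// Standard Cohen's kappa for two raters (autorater, humans) and two
// categories (model better, baseline better).
function cohensKappaFromCounts(c: ConfusionCounts): number | undefined {
  const tp = Number(c.truePositiveCount);
  const fp = Number(c.falsePositiveCount);
  const tn = Number(c.trueNegativeCount);
  const fn = Number(c.falseNegativeCount);
  const total = tp + fp + tn + fn;
  if (total === 0) return undefined;

  // Observed agreement: both raters picked the same side.
  const observed = (tp + tn) / total;

  // Agreement expected by chance, from each rater's marginal totals.
  const autoraterModel = tp + fp;
  const humansModel = tp + fn;
  const autoraterBaseline = tn + fn;
  const humansBaseline = tn + fp;
  const expected =
    (autoraterModel * humansModel + autoraterBaseline * humansBaseline) /
    (total * total);

  if (expected === 1) return undefined; // kappa is undefined in this case
  return (observed - expected) / (1 - expected);
}
```

A value near 0 means the autorater agrees with humans about as often as chance would predict, while a value of 1 means perfect agreement.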

f1Score?: number

Harmonic mean of precision and recall.
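For reference, the harmonic mean of two values p and r is 2pr / (p + r). The sketch below applies that formula to `precision` and `recall`; the function name is hypothetical, and returning `undefined` when both inputs are zero is a choice made for this example, not documented API behavior.

```typescript
// Hypothetical helper: harmonic mean of precision and recall.
function f1FromPrecisionRecall(
  precision: number,
  recall: number,
): number | undefined {
  const sum = precision + recall;
  if (sum === 0) return undefined; // harmonic mean is undefined here
  return (2 * precision * recall) / sum;
}

// Usage: the harmonic mean is pulled toward the smaller of the two inputs.
console.log(f1FromPrecisionRecall(1, 0.5));
```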

falseNegativeCount?: bigint

Number of examples where the autorater chose the baseline model, but humans preferred the model.

falsePositiveCount?: bigint

Number of examples where the autorater chose the model, but humans preferred the baseline model.

humanPreferenceBaselineModelWinRate?: number

Percentage of time humans decided the baseline model had the better response.

humanPreferenceModelWinRate?: number

Percentage of time humans decided the model had the better response.

modelWinRate?: number

Percentage of time the autorater decided the model had the better response.

precision?: number

Fraction of cases where both the autorater and humans thought the model had the better response, out of all cases where the autorater thought the model had the better response. That is, true positives divided by all predicted positives (true positives plus false positives).

recall?: number

Fraction of cases where both the autorater and humans thought the model had the better response, out of all cases where the humans thought the model had the better response. That is, true positives divided by the sum of true positives and false negatives.
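The precision and recall definitions can be sketched directly in terms of the count fields on this interface. The function name `precisionRecallFromCounts` is hypothetical, and this is an illustration of the definitions rather than the service's own implementation.

```typescript
// Hypothetical helper: precision and recall from the documented counts.
// "Positive" means the model (not the baseline) was judged better.
function precisionRecallFromCounts(
  truePositiveCount: bigint,
  falsePositiveCount: bigint,
  falseNegativeCount: bigint,
): { precision?: number; recall?: number } {
  const tp = Number(truePositiveCount);
  const predictedPositive = tp + Number(falsePositiveCount); // autorater chose the model
  const actualPositive = tp + Number(falseNegativeCount); // humans chose the model
  return {
    // Leave a metric unset when its denominator is zero.
    precision: predictedPositive === 0 ? undefined : tp / predictedPositive,
    recall: actualPositive === 0 ? undefined : tp / actualPositive,
  };
}

// Usage: 30 true positives, 10 false positives, 20 false negatives.
const pr = precisionRecallFromCounts(BigInt(30), BigInt(10), BigInt(20));
console.log(pr.precision, pr.recall);
```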

trueNegativeCount?: bigint

Number of examples where both the autorater and humans decided that the model had the worse response.

truePositiveCount?: bigint

Number of examples where both the autorater and humans decided that the model had the better response.
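Because every property on this interface is optional and the counts are `bigint` values, code that consumes the metrics generally needs to guard against missing fields before doing arithmetic. The sketch below uses a local, hypothetical `PairwiseCounts` type that mirrors only the four count fields, so it runs without importing the real interface. Treating the sum as the total number of compared examples is an assumption.

```typescript
// Hypothetical local type mirroring the four optional count fields.
interface PairwiseCounts {
  truePositiveCount?: bigint;
  falsePositiveCount?: bigint;
  trueNegativeCount?: bigint;
  falseNegativeCount?: bigint;
}

// Sums the four counts, or returns undefined if any count is missing,
// rather than silently treating a missing field as zero.
function totalComparedExamples(m: PairwiseCounts): bigint | undefined {
  const counts = [
    m.truePositiveCount,
    m.falsePositiveCount,
    m.trueNegativeCount,
    m.falseNegativeCount,
  ];
  let total = BigInt(0);
  for (const count of counts) {
    if (count === undefined) return undefined;
    total += count; // bigint addition, no precision loss
  }
  return total;
}
```

Keeping the sum in `bigint` avoids rounding for very large counts; convert with `Number(...)` only when a ratio is needed.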