Metrics Calculation

Info

The metrics are automatically calculated in a set interval for all models in the database.

Depending on the data you have provided, the metrics calculation process will try to calculate as many metrics as possible for the specific model.

Metric Requirements

All metrics require that inferences are provided for the specific model; if they are not, the whole process is skipped. Beyond that, each metric has its own requirements that must be fulfilled for it to be calculated. These requirements are listed in the following sections.

Descriptive Statistics

The descriptive statistics are calculated per feature on any given dataset.

| Metric             | Type of data            |
|---------------------|-------------------------|
| missing_count       | numerical & categorical |
| non_missing_count   | numerical & categorical |
| mean                | numerical               |
| minimum             | numerical               |
| maximum             | numerical               |
| sum                 | numerical               |
| standard_deviation  | numerical               |
| variance            | numerical               |
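
As an illustrative sketch (not the actual implementation), the per-feature statistics above could be computed with pandas as follows; the DataFrame and column names are assumptions for the example:

import pandas as pd

# Illustrative dataset standing in for any dataset with the model's features.
df = pd.DataFrame({
    "age": [34, 51, None, 29, 42],               # numerical feature
    "country": ["GR", "DE", None, "GR", "US"],   # categorical feature
})

def descriptive_stats(series: pd.Series) -> dict:
    stats = {
        "missing_count": int(series.isna().sum()),
        "non_missing_count": int(series.notna().sum()),
    }
    # The remaining statistics only apply to numerical features.
    if pd.api.types.is_numeric_dtype(series):
        stats.update({
            "mean": series.mean(),
            "minimum": series.min(),
            "maximum": series.max(),
            "sum": series.sum(),
            "standard_deviation": series.std(),
            "variance": series.var(),
        })
    return stats

per_feature_stats = {column: descriptive_stats(df[column]) for column in df.columns}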

Drifting Metrics

The goal of the drifting metrics is to calculate the drift between two datasets. Currently supported drift types are:

  • Data drift
  • Concept drift

Requirements:

  • Inferences
  • Training Dataset

Note

If actuals aren't provided for all inferences, then ONLY the inferences that have actuals will be used for the calculation of the drifting metrics.
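
As a minimal sketch of this filtering (assuming the inferences and actuals are plain pandas DataFrames keyed by a hypothetical inference_id column):

import pandas as pd

# Hypothetical frames: `inferences` holds the model inputs and predictions,
# `actuals` holds the ground-truth values for a subset of those inferences.
inferences = pd.DataFrame({"inference_id": [1, 2, 3], "prediction": [0, 1, 1]})
actuals = pd.DataFrame({"inference_id": [1, 3], "actual": [0, 1]})

# Only the inferences that have a matching actual take part in the drift calculation.
inferences_with_actuals = inferences.merge(actuals, on="inference_id", how="inner")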

Data drift

The current data is compared to the reference data by estimating the distributions of each feature in the two datasets. The schema of both datasets must be identical.

Returns a drift summary of the following form:

{
  "timestamp": "the timestamp of the report",
  "drift_summary": {
    "number_of_columns": "total number of dataset columns",
    "number_of_drifted_columns": "total number of drifted columns",
    "share_of_drifted_columns": "number_of_drifted_columns/number_of_columns",
    "dataset_drift": "Boolean based on the criteria below",
    "drift_by_columns": {
      "column1": {
        "column_name": "column1",
        "column_type": "the type of column (e.g. num)",
        "stattest_name": "the statistical test tha was used",
        "drift_score": "the drifting score based on the test",
        "drift_detected": "Boolean based on the criteria below",
        "threshold": "a float number based on the criteria below"
      },
      "column2": {"..."}
    }
  }
}

Logic to choose the appropriate statistical test is based on:

  • feature type: categorical or numerical
  • the number of observations in the reference dataset
  • the number of unique values in the feature (n_unique)

For small data with <= 1000 observations in the reference dataset:

  • For numerical features (n_unique > 5): two-sample Kolmogorov-Smirnov test.
  • For categorical features or numerical features with n_unique <= 5: chi-squared test.
  • For binary categorical features (n_unique <= 2), we use the proportion difference test for independent samples based on Z-score.

All tests use a 0.95 confidence level by default.

For larger data with > 1000 observations in the reference dataset:

  • For numerical features (n_unique > 5): Wasserstein Distance.
  • For categorical features or numerical features with n_unique <= 5: Jensen–Shannon divergence.

All tests use a threshold = 0.1 by default.
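
As an illustrative sketch of this selection logic (not the actual implementation), using SciPy; the helper name, the omission of the binary Z-test branch, and the input arrays are assumptions:

import numpy as np
from scipy import stats
from scipy.spatial.distance import jensenshannon

def choose_drift_test(reference: np.ndarray, current: np.ndarray, is_numerical: bool):
    """Pick a test per the rules above and return (name, score, drift_detected)."""
    n_unique = len(np.unique(reference))

    if len(reference) <= 1000:  # small reference dataset
        if is_numerical and n_unique > 5:
            _, p_value = stats.ks_2samp(reference, current)
            return "kolmogorov-smirnov", p_value, p_value < 0.05
        # Categorical or low-cardinality numerical: chi-squared on the value counts.
        # (Binary features would use the Z-score proportion test instead; omitted here.)
        categories = np.union1d(reference, current)
        ref_counts = [(reference == c).sum() for c in categories]
        cur_counts = [(current == c).sum() for c in categories]
        _, p_value, _, _ = stats.chi2_contingency([ref_counts, cur_counts])
        return "chi-squared", p_value, p_value < 0.05

    # Large reference dataset: distance-based scores with a 0.1 threshold.
    if is_numerical and n_unique > 5:
        distance = stats.wasserstein_distance(reference, current)
        return "wasserstein", distance, distance > 0.1
    categories = np.union1d(reference, current)
    ref_freq = np.array([(reference == c).mean() for c in categories])
    cur_freq = np.array([(current == c).mean() for c in categories])
    distance = jensenshannon(ref_freq, cur_freq)
    return "jensen-shannon", distance, distance > 0.1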

Concept drift

The current target feature is compared to the reference target feature.

Returns a concept drift summary of the following form:

{
  "timestamp": "the timestamp of the report",
  "concept_drift_summary": {
    "column_name": "column1",
    "column_type": "the type of column (e.g. num)",
    "stattest_name": "the statistical test tha was used",
    "threshold": "threshold used based on criteria below",
    "drift_score": "the drifting score based on the test",
    "drift_detected": "Boolean based on the criteria below,"
  }
}

Logic to choose the appropriate statistical test is based on:

  • the number of observations in the reference dataset
  • the number of unique values in the target (n_unique)

For small data with <= 1000 observations in the reference dataset:

  • For categorical target with n_unique > 2: chi-squared test.
  • For binary categorical target (n_unique <= 2), we use the proportion difference test for independent samples based on Z-score.

All tests use a 0.95 confidence level by default.

For larger data with > 1000 observations in the reference dataset we use Jensen–Shannon divergence with a threshold = 0.1.
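
For illustration, the large-data case applied to a categorical target could be sketched as follows (the target arrays are assumptions for the example):

import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical target values from the reference and current datasets (> 1000 rows).
reference_target = np.array(["yes", "no", "no", "yes", "no"] * 300)
current_target = np.array(["yes", "yes", "no", "yes", "yes"] * 300)

# Compare the class frequency distributions of the two targets.
classes = np.union1d(reference_target, current_target)
ref_freq = np.array([(reference_target == c).mean() for c in classes])
cur_freq = np.array([(current_target == c).mean() for c in classes])

drift_score = jensenshannon(ref_freq, cur_freq)
drift_detected = drift_score > 0.1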

Performance Evaluation Metrics

Requirements:

  • Inferences
  • Actuals for the inferences

The goal of the evaluation metrics is to evaluate the quality of a machine learning model. Currently supported model types are:

  • Binary classification
  • Multi-class classification
  • Regression

| Metric              | Supported model                                     |
|----------------------|-----------------------------------------------------|
| confusion matrix     | binary classification & multi-class classification |
| accuracy             | binary classification & multi-class classification |
| precision            | binary classification & multi-class classification |
| recall               | binary classification & multi-class classification |
| f1 score             | binary classification & multi-class classification |
| r_square             | regression                                          |
| mean_squared_error   | regression                                          |
| mean_absolute_error  | regression                                          |
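
As a minimal sketch (not the actual implementation), these metrics could be computed with scikit-learn; the prediction and actual arrays are illustrative:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error,
                             precision_score, r2_score, recall_score)

# Illustrative binary classification actuals vs. predictions.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

classification_metrics = {
    "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),  # use average="macro" for multi-class
    "recall": recall_score(y_true, y_pred),
    "f1_score": f1_score(y_true, y_pred),
}

# Illustrative regression actuals vs. predictions.
y_true_reg = [2.5, 0.0, 2.1, 7.8]
y_pred_reg = [3.0, -0.1, 2.0, 8.0]

regression_metrics = {
    "r_square": r2_score(y_true_reg, y_pred_reg),
    "mean_squared_error": mean_squared_error(y_true_reg, y_pred_reg),
    "mean_absolute_error": mean_absolute_error(y_true_reg, y_pred_reg),
}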

Explainability

Requirements:

  • Inferences

The goal of the explainability feature is to provide a contribution score for each feature to each individual prediction, in an attempt to explain how the model arrived at the specific prediction. To achieve this we define 3 levels of confidence in the explainability feature, based on how accessible the client's input is:

  • Level-0: At this level a replacement model is trained on the same training data as the client's model and is used for the explainability feature.
  • Level-1 (pending): At this level a surrogate model is trained to reproduce the predictions of the client's model and is used for the explainability feature.
  • Level-2 (pending): The client's model itself is used for the explainability feature.

Level-0 confidence

At this level a replacement model is trained on the same training data as the client's model and is used for the explainability feature. The table below lists the model used for each machine learning task.

| Model    | Task                                                             |
|----------|------------------------------------------------------------------|
| LightGBM | binary classification & multi-class classification & regression |

Fine-tuning of these models through hyper-parameter exploration is a pending task for now.

The trained model, along with the inference data, is used as input to the LIME library, and the report below is produced, presenting the contribution of each feature to a specific prediction (the report includes all feature contribution scores in descending order of absolute score value):

{
  "feature1": "contribution score",
  "feature2": "contribution score",
  "feature3": "contribution score"
}
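
As an illustrative sketch of the Level-0 flow (not the actual implementation), using lightgbm and lime; the training data, feature names, and inference row are assumptions for the example:

import numpy as np
import lightgbm as lgb
from lime.lime_tabular import LimeTabularExplainer

# Illustrative data standing in for the client's training dataset.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
feature_names = ["feature1", "feature2", "feature3"]

# Replacement model trained on the same training data as the client's model.
model = lgb.LGBMClassifier().fit(X_train, y_train)

# LIME explains one inference row at a time using the replacement model.
explainer = LimeTabularExplainer(X_train, feature_names=feature_names, mode="classification")
inference_row = rng.normal(size=3)
explanation = explainer.explain_instance(inference_row, model.predict_proba, num_features=3)

# Contribution scores per feature, in descending order of absolute score value.
contributions = sorted(explanation.as_list(), key=lambda kv: abs(kv[1]), reverse=True)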

Level-1 confidence

Coming soon

Level-2 confidence

Coming soon