Skip to content

How to Use Data Assets

This guide shows how to integrate Azure ML Data Assets into your Kedro catalog using AzureMLAssetDataset and AzureMLPipelineDataset.

Prerequisites

  • Azure credentials configured for local runs
  • An Azure ML workspace configured in azureml.yml

Use AzureMLAssetDataset for versioned data assets

AzureMLAssetDataset reads and writes named Azure ML Data Assets (uri_file or uri_folder). It wraps any standard Kedro dataset and resolves the asset's storage path automatically.

Add an entry to your conf/base/catalog.yml:

model_inputs:
  type: kedro_azureml_pipeline.datasets.AzureMLAssetDataset
  azureml_dataset: "my-model-inputs"
  azureml_type: "uri_folder"
  dataset:
    type: pandas.ParquetDataset
    filepath: "data.parquet"

The azureml_dataset field is the name of the Data Asset in Azure ML. The dataset block is an ordinary Kedro dataset definition, so any type that accepts a filepath argument works here.

Use uri_file for single-file assets

training_config:
  type: kedro_azureml_pipeline.datasets.AzureMLAssetDataset
  azureml_dataset: "training-config"
  azureml_type: "uri_file"
  dataset:
    type: yaml.YAMLDataset
    filepath: "config.yml"

Pin a specific asset version

model_inputs:
  type: kedro_azureml_pipeline.datasets.AzureMLAssetDataset
  azureml_dataset: "my-model-inputs"
  azureml_type: "uri_folder"
  azureml_version: "3"
  dataset:
    type: pandas.ParquetDataset
    filepath: "data.parquet"

Omitting azureml_version uses the latest available version.

Local behavior

During local runs (kedro run), the AzureMLLocalRunHook downloads the asset to root_dir on first access. If the asset has not been downloaded yet, the hook fetches it from your Azure ML workspace, so Azure credentials must be available. Subsequent runs reuse the local copy.

Override the local root directory

During local runs, the plugin downloads asset data to root_dir. Override it if needed:

model_inputs:
  type: kedro_azureml_pipeline.datasets.AzureMLAssetDataset
  azureml_dataset: "my-model-inputs"
  azureml_type: "uri_folder"
  root_dir: "data/01_raw"
  dataset:
    type: pandas.ParquetDataset
    filepath: "data.parquet"

Use AzureMLPipelineDataset for inter-step data passing

AzureMLPipelineDataset passes data between Kedro nodes that run as separate Azure ML pipeline steps. Rather than referencing a named Azure ML Data Asset, Azure ML mounts a temporary storage path between steps. Use this for intermediate data that does not need to be versioned or registered as an asset.

intermediate_features:
  type: kedro_azureml_pipeline.datasets.AzureMLPipelineDataset
  dataset:
    type: pandas.ParquetDataset
    filepath: "features.parquet"

The root_dir and filepath_arg parameters work the same as in AzureMLAssetDataset.

During local runs, AzureMLPipelineDataset behaves like a normal file-backed dataset with no Azure ML calls.

When to use each type

Situation Dataset type
Input/output data registered as a versioned asset in Azure ML AzureMLAssetDataset
Intermediate data passing between pipeline steps AzureMLPipelineDataset

See the architecture overview for details on how data flows between steps during remote execution.

See also