az storage account create \
--name mldemoblob8765 \
--resource-group mldemo \
--location westus \
--sku Standard_LRS \
--kind StorageV2
az storage container create \
--name mlfiles \
--account-name mldemoblob8765
Next, we create a permanent connection to the container in the storage account and define it as one of our datastores in the Azure Machine Learning workspace:
$schema: https://azuremlschemas.azureedge.net/latest/azureBlob.schema.json
name: mldemoblob
type: azure_blob
description: main ML blob storage
account_name: mldemoblob8765
container_name: mlfiles
credentials:
  sas_token: <your_token>
A Dataset object can be accessed or passed around in the current environment through its object reference. However, a dataset can also be registered (and versioned), and hence accessed through the dataset name (and version) – this is called a registered dataset.
from azureml.core import Dataset
path = 'https://...windows.net/demo/Titanic.csv'
ds = Dataset.Tabular.from_delimited_files(path)
Once we have a reference to a datastore, we can access data within it. In the following example, we create a file dataset from files stored in a directory of the mldata datastore:
from azureml.core import Dataset, Datastore
datastore_name = "mldata"
datastore = Datastore.get(ws, datastore_name)
ds = Dataset.File.from_files((datastore, "cifar10/"))
A dataset (here, the tabular Titanic dataset created earlier) can also be registered in the workspace and later retrieved by name and version:
ds = ds.register(ws, name="titanic",
                 create_new_version=True)

# exploring the registered dataset:
ds = Dataset.get_by_name(ws, name="titanic", version=1)
pandas_df = ds.to_pandas_dataframe()
pandas_df.head()
In the authoring environment, we can also read such a public CSV file directly with pandas:
import pandas as pd
path ='https://...windows.net/demo/Titanic.csv'
df = pd.read_csv(path)
print(df.head())
Our goal is to pass the data from the authoring script to the training script, so it can be tracked and updated easily in the future. To achieve this, we can’t send the DataFrame directly, but have to pass the URL to the CSV file and use the same method to fetch the data in the training script. Let’s write a small training script whose only job is to parse the command-line arguments and fetch the data from the URL:
import argparse
import pandas as pd
parser = argparse.ArgumentParser()
parser.add_argument("--input", type=str)
args = parser.parse_args()
df = pd.read_csv(args.input)
print(df.head())
As we can see in the preceding code, we pass the data path via the --input command-line argument and then load the data from that location using pandas' read_csv().
Next, we create a ScriptRunConfig to submit an experiment run to Azure Machine Learning that executes the training script:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(
    source_directory="code",
    script='access_data_from_path.py',
    arguments=['--input', path],
    environment=get_current_env())  # get_current_env() is a helper (defined elsewhere) that returns the run environment
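To actually launch the run, this configuration is submitted to an experiment. A minimal sketch, assuming ws is the workspace handle used in the earlier snippets and an experiment name chosen for this example:
from azureml.core import Experiment

exp = Experiment(ws, "titanic-data-access")   # experiment name is a placeholder
run = exp.submit(src)
run.wait_for_completion(show_output=True)     # stream the run logs until completion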
Alternatively, instead of passing the raw URL, we can wrap the data in a tabular dataset in the authoring environment:
from azureml.core import Dataset

path = 'https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
ds = Dataset.Tabular.from_delimited_files(path)
print(ds.to_pandas_dataframe().head())
We can then use the dataset like this in the training script:
import argparse
from azureml.core import Dataset, Run
parser = argparse.ArgumentParser()
parser.add_argument("--input", type=str)
args = parser.parse_args()
run = Run.get_context()
ws = run.experiment.workspace
ds = Dataset.get_by_id(ws, id=args.input)  # key line: the dataset is retrieved by its ID
print(ds.to_pandas_dataframe().head())
Next, we write a run configuration to submit the preceding code as an experiment to Azure Machine Learning. First, we need to convert the dataset into a command-line argument and pass it to the training script so it can be automatically retrieved in the execution runtime. We can achieve this by using the as_named_input(name) method on the dataset instance, which will convert the dataset into a named DatasetConsumptionConfig argument that allows the dataset to be passed to other environments.
In this case, the dataset is passed in direct mode: it is provided under the given name as an environment variable in the runtime environment, or as the dataset ID in the command-line arguments. The dataset also gets tracked in Azure Machine Learning as an input to the training script.
However, as we saw in the previous code snippet, we use the Dataset.get_by_id() method to retrieve the dataset in the training script from the dataset ID. We don’t need to manually create or access the dataset ID, because the DatasetConsumptionConfig argument will be automatically expanded into the dataset ID when the training script is called by Azure Machine Learning with a direct dataset:
src = ScriptRunConfig(
source_directory="code",
script='access_data_from_dataset.py',
arguments=['--input', ds.as_named_input('titanic')],
environment=get_current_env())
Here is the same approach in a fuller form: the dataset object is passed directly as a script argument (it arrives in the script as its dataset ID), and the environment includes the azureml-dataprep[pandas] package needed to load tabular datasets on the compute target (tab_ds stands for a tabular dataset created as shown earlier):
from azureml.core import Environment, ScriptRunConfig
from azureml.core.conda_dependencies import CondaDependencies

env = Environment('my_env')
packages = CondaDependencies.create(conda_packages=['pip'],
                                    pip_packages=['azureml-defaults',
                                                  'azureml-dataprep[pandas]'])
env.python.conda_dependencies = packages

script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', tab_ds],
                                environment=env)
# And the script
import argparse
from azureml.core import Run, Dataset

parser = argparse.ArgumentParser()
parser.add_argument('--ds', type=str, dest='dataset_id')
args = parser.parse_args()

run = Run.get_context()
ws = run.experiment.workspace
dataset = Dataset.get_by_id(ws, id=args.dataset_id)
data = dataset.to_pandas_dataframe()
Alternatively, we can pass the dataset as a named input, so that the training script can retrieve it by name from the run context:
env = Environment('my_env')
packages = CondaDependencies.create(conda_packages=['pip'],
                                    pip_packages=['azureml-defaults',
                                                  'azureml-dataprep[pandas]'])
env.python.conda_dependencies = packages

script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', tab_ds.as_named_input('my_dataset')],
                                environment=env)
# Script
import argparse
from azureml.core import Run

parser = argparse.ArgumentParser()
parser.add_argument('--ds', type=str, dest='ds_id')  # the argument still receives the dataset ID
args = parser.parse_args()

run = Run.get_context()
dataset = run.input_datasets['my_dataset']  # retrieve the named input from the run context
data = dataset.to_pandas_dataframe()
While external public data reachable through a URL is created and passed as a direct dataset, all other datasets can be accessed either as a download or as a mount. For big data datasets, Azure Machine Learning also provides an option to mount a dataset as a Hadoop Distributed File System (HDFS).
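As a minimal sketch of these access modes (reusing the mldata datastore and the get_current_env() helper from the earlier snippets; train.py is a hypothetical script name), a file dataset can be attached to the run configuration as a download or as a mount:
from azureml.core import Dataset, Datastore, ScriptRunConfig

datastore = Datastore.get(ws, "mldata")
file_ds = Dataset.File.from_files((datastore, "cifar10/"))

src = ScriptRunConfig(
    source_directory="code",
    script="train.py",
    # copy the files to the compute target before the run starts ...
    arguments=["--data", file_ds.as_named_input("cifar10").as_download()],
    environment=get_current_env())

# ... or stream them on demand by mounting instead:
# arguments=["--data", file_ds.as_named_input("cifar10").as_mount()]
In the training script, the --data argument then arrives as a plain local path to the downloaded or mounted files.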
# Handle to the workspace
from azure.ai.ml import MLClient
# Authentication package
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
credential=credential,
subscription_id="<SUBSCRIPTION_ID>", # this will look like xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
resource_group_name="<RESOURCE_GROUP>",
workspace_name="<AML_WORKSPACE_NAME>",
)
These examples follow the Microsoft Learn material on training models with the Azure Machine Learning CLI v2 and the Python SDK v2 pipeline tutorial:
https://learn.microsoft.com/en-us/training/paths/train-models-azure-machine-learning-cli-v2/
https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-pipeline-python-sdk
First, you always obtain a credential:
from azure.identity import InteractiveBrowserCredential

credential = InteractiveBrowserCredential()
and then create an MLClient object:
# Get a handle to the workspace
ml_client = MLClient(
credential=credential,
subscription_id="<SUBSCRIPTION_ID>",
resource_group_name="<RESOURCE_GROUP>",
workspace_name="<AML_WORKSPACE_NAME>",
)
Azure Machine Learning uses a Data object to register a reusable definition of data and to consume data within a pipeline. In the section below, you'll consume data from a web URL as one example; data assets from other sources can be created as well.
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
web_path = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
credit_data = Data(
name="creditcard_defaults",
path=web_path,
type=AssetTypes.URI_FILE,
description="Dataset for credit card defaults",
tags={"source_type": "web", "source": "UCI ML Repo"},
version="1.0.0",
)
This code creates a Data asset definition, ready to be consumed as an input by the pipeline that you'll define in the next sections. In addition, you can register the data asset to your workspace so it becomes reusable and versioned across pipelines.
Since this is the first time that you're making a call to the workspace, you may be asked to authenticate. Once the authentication is complete, you'll see the dataset registration completion message:
credit_data = ml_client.data.create_or_update(credit_data)
print(
f"Dataset with name {credit_data.name} was registered to workspace, the dataset version is {credit_data.version}"
)
For the pipeline's compute target, using type amlcompute creates a compute cluster, while type computeinstance creates a single compute instance.
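As a sketch of creating such a cluster with the v2 SDK (the cluster name, VM size, and scale settings below are assumptions to adjust):
from azure.ai.ml.entities import AmlCompute

cpu_cluster = AmlCompute(
    name="cpu-cluster",
    type="amlcompute",               # the ComputeInstance entity would create a single compute instance instead
    size="STANDARD_DS3_V2",
    min_instances=0,                 # scale down to zero nodes when idle
    max_instances=4,
    idle_time_before_scale_down=180,
    tier="Dedicated",
)
cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster).result()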
Next, we create a custom environment for the pipeline steps. The Environment entity lets us define such an environment from a conda dependencies file and a base Docker image.
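The Environment defined next reads its conda dependencies from a conda.yml file inside dependencies_dir, neither of which appears elsewhere in this excerpt; here is a minimal sketch of creating that file (the directory name, package list, and versions are assumptions to adjust):
import os

dependencies_dir = "./dependencies"          # assumed location of the conda file
os.makedirs(dependencies_dir, exist_ok=True)

conda_yml = """\
name: aml-scikit-learn
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - scikit-learn=0.24.2
  - pandas
  - pip:
    - azureml-mlflow
"""
with open(os.path.join(dependencies_dir, "conda.yml"), "w") as f:
    f.write(conda_yml)
With the file in place, we can define and register the environment: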
from azure.ai.ml.entities import Environment
custom_env_name = "aml-scikit-learn"
pipeline_job_env = Environment(
name=custom_env_name,
description="Custom environment for Credit Card Defaults pipeline",
tags={"scikit-learn": "0.24.2"},
conda_file=os.path.join(dependencies_dir, "conda.yml"),
image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
version="0.1.0",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)
print(
f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}"
)