Dagster concepts

Software-Defined Assets

An asset is an object in persistent storage, such as a table, file, or persisted machine learning model. A software-defined asset is a Dagster object that couples an asset to the function and upstream assets that are used to produce its contents.

A software-defined asset includes the following:

An AssetKey, which is a handle for referring to the asset.
A set of upstream asset keys, which refer to the assets that the contents of the software-defined asset are derived from.
An op, which is a function responsible for computing the contents of the asset from its upstream dependencies.

Note: A crucial distinction between software-defined assets and ops is that software-defined assets know about their dependencies, while ops do not. Ops aren’t connected to dependencies until they’re placed inside a graph.

“Asset” is Dagster’s word for an entity, external to ops, that is mutated or created by an op. An asset might be a table in a database that an op appends to, an ML model in a model store that an op overwrites, or even a Slack channel that an op writes messages to.

Ops

Ops are the core unit of computation in Dagster.

An individual op should perform relatively simple tasks, such as:

Deriving a dataset from other datasets
Executing a database query
Initiating a Spark job in a remote cluster
Querying an API and storing the result in a data warehouse
Sending an email or Slack message

Op Context

When writing an op, you can optionally declare context as the first parameter. When you do, Dagster passes a context object to the body of the op. The context provides access to system information such as op configuration, loggers, resources, and the current run ID.

from dagster import op


@op(config_schema={"name": str})
def context_op(context):
    # Read the configured value and log it through the op's logger.
    name = context.op_config["name"]
    context.log.info(f"My name is {name}")

Graphs

A graph is a set of interconnected ops or sub-graphs. While individual ops typically perform simple tasks, ops can be assembled into a graph to accomplish complex tasks.

Graphs can be used in three different ways:

To back assets - Basic software-defined assets are computed using a single op, but if computing one of your assets requires multiple discrete steps, you can compute it using a graph instead.
Directly inside a job - Each non-asset job contains a graph.
Inside other graphs - You can build complex graphs out of simpler graphs.

from dagster import graph, op


@op
def return_one(context) -> int:
    return 1


@op
def add_one(context, number: int) -> int:
    return number + 1


@graph
def linear():
    add_one(add_one(add_one(return_one())))

Jobs

Jobs are the main unit of execution and monitoring in Dagster. The core of a job is a graph of ops connected via data dependencies.

Ops are linked together by defining the dependencies between their inputs and outputs. An important difference between Dagster and other workflow systems is that, in Dagster, op dependencies are expressed as data dependencies, not just execution dependencies.

IO Managers

Functions decorated with @asset, @multi_asset, and @op can have parameters and return values that are loaded from and written to persistent storage. IO managers let the user control how this data is stored and how it’s loaded in downstream ops and assets. For @asset and @multi_asset, the IO manager effectively determines where the physical asset lives.

Configuration

Run configuration lets you provide parameters to jobs at the time they’re executed.

It’s often useful to provide user-chosen values to Dagster jobs or software-defined assets at runtime. For example, you might want to choose what dataset an op runs against, or provide a connection URL for a database resource. Dagster exposes this functionality through a configuration API.

Repositories

A repository is a collection of software-defined assets, jobs, schedules, and sensors. Repositories are loaded as a unit by the Dagster CLI, Dagit, and the Dagster daemon.