xarray-sql¶

`xarray_sql` ¶

`all = ['cftime', 'XarrayContext', 'read_xarray_table', 'read_xarray', 'from_map']` `module-attribute` ¶

`XarrayContext` ¶

Bases: SessionContext

A datafusion SessionContext that also supports xarray.Datasets.

`from_dataset(name, input_table, *, table_names=None, chunks=None)` ¶

Register an xarray Dataset as one or more queryable SQL tables.

When all data variables share the same dimensions, the dataset is registered as a single table named name. When variables have differing dimensions (e.g. some on a 3D grid and others on a 4D grid), the dataset is split into one table per dimension group. The tables are registered under a SQL schema (namespace) named name and named <dim1>_<dim2>_... by default::

ctx.from_dataset('era5', ds, chunks={'time': 24})
# registers tables: 'era5.time_lat_lon' and
#                   'era5.time_lat_lon_level'
ctx.sql('SELECT AVG(temperature_2m) FROM era5.time_lat_lon')

Use table_names to override the name for specific dimension tuples::

ctx.from_dataset(
    'era5', ds,
    table_names={('time', 'lat', 'lon'): 'surface'},
)
ctx.sql('SELECT * FROM era5.surface')

For datasets with non-Gregorian cftime coordinates (e.g. 360_day, julian), a cftime() scalar UDF is automatically registered so you can write ergonomic SQL filters::

ctx.from_dataset("ds360", ds, chunks={"time": 6})
ctx.sql("SELECT * FROM ds360 WHERE time >= cftime('2000-07-01')")

.. note::

Only one ``cftime()`` UDF is registered per context, using the
units and calendar of the *first* non-Gregorian coordinate
encountered. If you register multiple datasets with *different*
non-Gregorian calendars (e.g. one 360_day and one julian), the
UDF from the first registration will be used for all subsequent
``cftime()`` calls and may produce incorrect offsets for the
other dataset. In that case, create a separate ``XarrayContext``
for each calendar.

Parameters:

Name	Type	Description	Default
`name`	`str`	The SQL identifier under which the dataset is registered. For datasets with uniform dimensions, this is the table name. For datasets with mixed dimensions, this is the name of a SQL schema (namespace) containing one table per dimension group.	required
`input_table`	`Dataset`	An xarray Dataset.	required
`table_names`	`dict[tuple[str, ...], str] \| None`	Optional mapping from dimension tuples to custom table names within the schema, used when the dataset has variables with differing dimensions.	`None`
`chunks`	`Chunks`	Xarray-like chunks specification. If not provided, uses the Dataset's existing chunks.	`None`

Returns:

Type	Description
	self, to allow chaining.

`sql(query, *args, **kwargs)` ¶

Run a SQL query, returning an :class:XarrayDataFrame wrapper.

Identical to datafusion.SessionContext.sql except the returned object wraps the DataFusion DataFrame. The wrapper exposes .to_pandas() (unchanged), forwards every other DataFusion method via __getattr__, and adds .to_dataset(dimension_columns=[...]) for round-tripping the result back to an xr.Dataset.

Parameters:

Name	Type	Description	Default
`query`	`str`	A SQL query string.	required
`*args`		Forwarded to `SessionContext.sql`.	`()`
`**kwargs`		Forwarded to `SessionContext.sql`.	`{}`

Returns:

Name	Type	Description
`An`	`XarrayDataFrame`	class:`XarrayDataFrame` wrapping the DataFusion DataFrame.

`from_map(func, *iterables, args=None, **kwargs)` ¶

Create a PyArrow Table by mapping a function over iterables.

This is equivalent to dask's from_map but returns a PyArrow Table that can be used with DataFusion instead of a Dask DataFrame.

Parameters:

Name	Type	Description	Default
`func`	`Callable`	Function to apply to each element of the iterables.	required
`*iterables`	`tuple[Any, ...]`	Iterable objects to map the function over.	`()`
`args`	`tuple \| None`	Additional positional arguments to pass to func.	`None`
`**kwargs`	`dict[str, Any]`	Additional keyword arguments to pass to func.	`{}`

Returns:

Type	Description
`Table`	A PyArrow Table containing the concatenated results.

`read_xarray(ds, chunks=None)` ¶

Pivots an Xarray Dataset into a PyArrow Table, partitioned by chunks.

Parameters:

Name	Type	Description	Default
`ds`	`Dataset`	An Xarray Dataset. All `data_vars` must share the same dimensions.	required
`chunks`	`Chunks`	Xarray-like chunks. If not provided, will default to the Dataset's chunks. The product of the chunk sizes becomes the standard length of each dataframe partition.	`None`

Returns:

Type	Description
`RecordBatchReader`	A PyArrow RecordBatchReader, which is a table representation of the input
`RecordBatchReader`	Dataset.

`read_xarray_table(ds, chunks=None, *, batch_size=DEFAULT_BATCH_SIZE, coord_arrays=None, _iteration_callback=None)` ¶

Create a lazy DataFusion table from an xarray Dataset.

This is the simplest way to register xarray data with DataFusion. Data is only read when queries are executed, not during registration. The table can be queried multiple times.

Each chunk becomes a separate partition, enabling DataFusion's parallel execution across multiple cores.

Note

SQL queries with WHERE clauses on dimension columns (time, lat, lon, etc.) automatically prune partitions that can't contain matching rows — this is called filter pushdown. For example:

# This query will skip loading partitions with time < '2020-02-01'
result = ctx.sql('SELECT * FROM air WHERE time > "2020-02-01"').collect()

Supported operators: =, <, >, <=, >=, BETWEEN, IN, AND, OR.

Parameters:

Name	Type	Description	Default
`ds`	`Dataset`	An xarray Dataset. All data_vars must share the same dimensions.	required
`chunks`	`Chunks`	Xarray-like chunks specification. If not provided, uses the Dataset's existing chunks.	`None`
`batch_size`	`int`	Maximum rows per Arrow RecordBatch emitted per partition. Smaller values let DataFusion start processing earlier; the default (65 536) works well for most datasets.	`DEFAULT_BATCH_SIZE`
`coord_arrays`	`dict[str, ndarray] \| None`	Pre-materialised coordinate arrays keyed by dim-name string. Hand in to share a single read across multiple tables built from the same parent Dataset (e.g. surface + atmosphere from ARCO-ERA5); the dim coords are otherwise read once per `read_xarray_table` call, which is a network round-trip for Zarr-backed datasets.	`None`
`_iteration_callback`	`Callable[[Block, list[str] \| None], None] \| None`	Internal callback for testing. Called with each block dict just before it's converted to Arrow.	`None`

Returns:

Type	Description
`'LazyArrowStreamTable'`	A LazyArrowStreamTable ready for registration with DataFusion.

Example

from datafusion import SessionContext import xarray as xr from xarray_sql import read_xarray_table

ds = xr.tutorial.open_dataset('air_temperature') table = read_xarray_table(ds, chunks={'time': 240})

ctx = SessionContext() ctx.register_table('air', table)

Data is only read here, during query execution¶

Filters on 'time' will prune partitions automatically!¶

result = ctx.sql('SELECT AVG(air) FROM air').collect()

`cftime` ¶

Bridge between cftime calendars and Arrow/DataFusion types.

cftime (https://unidata.github.io/cftime/) provides datetime objects for calendars used in climate science — noleap, 360-day, all-leap, julian, etc. Arrow and DataFusion have no native concept of non-Gregorian calendars, so this module handles the conversion in two tiers:

Gregorian-like calendars (standard, gregorian, proleptic_gregorian, noleap/365_day, all_leap/366_day): mapped to pa.timestamp('us') so that string-based SQL filters like WHERE time > '1980-01-01' work naturally. Microsecond resolution avoids the 1678–2262 overflow of nanoseconds while preserving sub-second precision.
Non-Gregorian calendars (360_day, julian): mapped to pa.int64() with xarray:units and xarray:calendar metadata on the Arrow field. This preserves the original CF-convention encoding losslessly. A cftime() DataFusion UDF (registered automatically by XarrayContext.from_dataset) provides ergonomic SQL filtering.

`DEFAULT_UNITS = 'microseconds since 1970-01-01T00:00:00'` `module-attribute` ¶

`GREGORIAN_LIKE_CALENDARS = frozenset({'standard', 'gregorian', 'proleptic_gregorian', 'noleap', '365_day', 'all_leap', '366_day'})` `module-attribute` ¶

`arrow_field(name, units, cal)` ¶

Build a pa.Field for a cftime coordinate.

Gregorian-like → pa.timestamp('us'); non-Gregorian → pa.int64(). Both carry xarray:calendar and xarray:units metadata for round-trip fidelity.

`calendar(ds, coord_name)` ¶

Return the calendar name for a cftime coordinate, or None.

Checks the xarray index first (no data materialization), then falls back to inspecting element 0 of the coordinate values.

`convert_for_field(values, field)` ¶

Convert cftime values to the numeric type dictated by field.

Reads xarray:calendar and xarray:units from the field's Arrow metadata to choose between the timestamp path and the integer-offset path.

`encoding(ds, coord_name)` ¶

Return (units, calendar) for a cftime coordinate.

Reads xarray .encoding metadata (from the originating NetCDF file) first, falling back to :data:DEFAULT_UNITS.

`is_cftime(values)` ¶

Check if a numpy array contains cftime datetime objects.

`is_cftime_index(ds, coord_name)` ¶

Check if a coordinate uses a CFTimeIndex without materializing data.

`is_gregorian_like(calendar)` ¶

Return True if calendar is close enough to Gregorian for pa.timestamp.

`make_cftime_udf(units, calendar)` ¶

Create a DataFusion scalar UDF that converts date strings to int64 offsets.

This enables ergonomic SQL filtering on non-Gregorian cftime columns::

SELECT * FROM ds360 WHERE time > cftime('0500-01-01')

The UDF parses the input string as a cftime datetime in the given calendar system and returns the corresponding int64 offset in the specified units.

`partition_bounds(values)` ¶

Return (min, max, dtype_tag) for a cftime coordinate slice.

Gregorian-like calendars return nanosecond bounds tagged "timestamp_ns" (compatible with ScalarBound::TimestampNanos in the Rust pruning layer). Non-Gregorian calendars return int64 offsets tagged "int64".

`to_microseconds(values)` ¶

Convert cftime objects to int64 microseconds since Unix epoch.

Used for Gregorian-like calendars. Vectorised via cftime.date2num (implemented in C).

`to_offsets(values, units, cal)` ¶

Convert cftime objects to int64 offsets in the given units/calendar.

Used for non-Gregorian calendars where data is stored as pa.int64().

`core` ¶

`Row = list[Any]` `module-attribute` ¶

`get_columns(ds)` ¶

`unbounded_unravel(ds)` ¶

Unravel with unbounded memory (as a NumPy Array).

`unravel(ds)` ¶

`df` ¶

`Block = dict[Hashable, slice]` `module-attribute` ¶

`Chunks = dict[str, int] | None` `module-attribute` ¶

`DEFAULT_BATCH_SIZE = 65536` `module-attribute` ¶

`PartitionBounds = dict[str, tuple[Any, Any, str]]` `module-attribute` ¶

`block_slices(ds, chunks=None)` ¶

Compute block slices for a chunked Dataset.

`compute_chunks(ds, chunks)` ¶

Per-dim chunk-size tuples matching ds.chunk(chunks).chunks.

Pure arithmetic replacement for the dask rechunk round-trip; dask's .chunk() eagerly builds a task graph, which dominates block_slices() cost on large datasets.

`dataset_to_record_batch(ds, schema)` ¶

Convert an xarray Dataset partition to an Arrow RecordBatch.

Builds the RecordBatch directly from numpy arrays, bypassing the pandas round-trip (to_dataframe → reset_index → from_pandas) used by pivot(). For large partitions this reduces peak memory from ~5× to ~2× the partition size.

Dimension coordinates are broadcast to the full partition shape and ravelled. np.broadcast_to() is zero-copy; the ravel() forces one copy per coordinate (unavoidable, since broadcast arrays are non-contiguous). Data variable arrays are ravelled in-place — a zero-copy view when the underlying array is already C-contiguous (the common case for numpy-backed xarray datasets).

Parameters:

Name	Type	Description	Default
`ds`	`Dataset`	A partition-sized xarray Dataset (already sliced via isel).	required
`schema`	`Schema`	The Arrow schema for the output, as produced by _parse_schema. Column order in the output matches schema field order.	required

Returns:

Type	Description
`RecordBatch`	A RecordBatch with one column per dimension coordinate and data
`RecordBatch`	variable, in schema order.

`explode(ds, chunks=None)` ¶

Explodes a dataset into its chunks.

`from_map(func, *iterables, args=None, **kwargs)` ¶

Create a PyArrow Table by mapping a function over iterables.

This is equivalent to dask's from_map but returns a PyArrow Table that can be used with DataFusion instead of a Dask DataFrame.

Parameters:

Name	Type	Description	Default
`func`	`Callable`	Function to apply to each element of the iterables.	required
`*iterables`	`tuple[Any, ...]`	Iterable objects to map the function over.	`()`
`args`	`tuple \| None`	Additional positional arguments to pass to func.	`None`
`**kwargs`	`dict[str, Any]`	Additional keyword arguments to pass to func.	`{}`

Returns:

Type	Description
`Table`	A PyArrow Table containing the concatenated results.

`from_map_batched(func, *iterables, args=None, schema=None, **kwargs)` ¶

Create a PyArrow RecordBatchReader by mapping a function over iterables.

This is equivalent to dask's from_map but returns a PyArrow RecordBatchReader that can be used with DataFusion. It iterates over RecordBatches which are created via the func one-at-a-time.

Parameters:

Name	Type	Description	Default
`func`	`Callable[..., DataFrame]`	Function to apply to each element of the iterables. Currently, the function must return a Pandas DataFrame.	required
`*iterables`	`tuple[Any, ...]`	Iterable objects to map the function over.	`()`
`schema`	`Schema`	Optional schema needed for the RecordBatchReader.	`None`
`args`	`tuple \| None`	Additional positional arguments to pass to func.	`None`
`**kwargs`	`dict[str, Any]`	Additional keyword arguments to pass to func.	`{}`

Returns:

Type	Description
`RecordBatchReader`	A PyArrow RecordBatchReader containing the stream of RecordBatches.

`iter_record_batches(ds, schema, batch_size=DEFAULT_BATCH_SIZE)` ¶

Yield RecordBatches of at most batch_size rows from a partition Dataset.

Unlike dataset_to_record_batch, which materialises the entire partition as one batch, this generator emits smaller batches so that DataFusion can begin filtering and aggregating before the full partition is loaded. Peak memory per batch is O(batch_size) for coordinate columns and O(partition_size) for data-variable columns (which must be loaded in full from storage).

Coordinate values are computed per batch via strided index arithmetic — no broadcast array spanning the whole partition is ever allocated. Data variable flat arrays are loaded once (triggering any remote I/O) and then sliced as zero-copy views for each batch.

Parameters:

Name	Type	Description	Default
`ds`	`Dataset`	A partition-sized xarray Dataset (already sliced via isel).	required
`schema`	`Schema`	The Arrow schema for the output, as produced by _parse_schema.	required
`batch_size`	`int`	Maximum number of rows per yielded RecordBatch.	`DEFAULT_BATCH_SIZE`

Yields:

Type	Description
`RecordBatch`	RecordBatches in schema column order, covering all rows of the
`RecordBatch`	partition exactly once.

`partition_metadata(ds, blocks)` ¶

Compute min/max coordinate values for each partition.

This metadata enables filter pushdown: SQL queries with WHERE clauses on dimension columns can prune partitions that can't contain matching rows.

Parameters:

Name	Type	Description	Default
`ds`	`Dataset`	The xarray Dataset containing coordinate values.	required
`blocks`	`list[Block]`	List of block slices from block_slices().	required

Returns:

Type	Description
`list[PartitionBounds]`	List of dicts mapping dimension name to
`list[PartitionBounds]`	(min_value, max_value, dtype_str) tuples. For datetime64, values are nanoseconds since Unix epoch (int64), dtype_str is "timestamp_ns" For numeric types, values are Python int or float, dtype_str is "int64" or "float64"

Note

If a partition has an empty slice for a dimension, that dimension is omitted from the partition's metadata. The Rust pruning logic treats missing dimensions conservatively (never prunes on them).

`pivot(ds)` ¶

Converts an xarray Dataset to a pandas DataFrame.

`resolve_chunks(ds, chunks)` ¶

Normalise the user's chunks argument to per-dim size tuples.

Filters out keys for dims this dataset doesn't have (sub-datasets in a heterogeneous group need not contain every dimension named in the spec), then either rechunks arithmetically via compute_chunks or falls back to the dataset's existing dask chunks.

Returns an empty mapping for scalar datasets; callers should treat that as "one block covering everything".

`ds` ¶

Reconstruct xarray Datasets from SQL query results.

The inverse of the forward Dataset-to-table pivot done by :func:xarray_sql.df.pivot. Internally defines an :class:XarrayDataFrame wrapper around the DataFusion DataFrame returned by :meth:XarrayContext.sql, with a :meth:XarrayDataFrame.to_dataset method that round-trips a query result back to xr.Dataset.

Reconstruction is controlled by the chunks argument to :meth:XarrayDataFrame.to_dataset -- the xarray idiom for tuning how a result is partitioned -- rather than by reflecting on the query plan:

Eager (chunks=None, or the default "inherit" when the result keeps no multi-chunk source dimension): the plan executes exactly once via execute_stream and the result is scattered into a dense in-memory Dataset. This is the right default for reductions (aggregations), whose results are small, and it never re-executes.
Lazy / chunked (chunks is a mapping, "auto", or "inherit" over a multi-chunk source dimension): data variables are backed by :class:SQLBackendArray wrapped in xarray.core.indexing.LazilyIndexedArray and chunked via xarray's configured chunk manager (dask, cubed, ...). Each chunk maps onto the source partitions and reads its coordinate range on access by translating the indexer into a DataFusion filter expression, so only the requested partitions are materialized as Arrow RecordBatch es and scattered into numpy.

.compute() materializes the whole Dataset in memory.

`Sparsity = Literal['result', 'template']` `module-attribute` ¶

Output coordinate extent for a filtered round-trip.

"result" keeps only the dim values present in the query result, so the output is sparse and equal to whatever rows came back.
"template" reindexes to the registered Dataset's full coord ranges and fills absent cells with fill_value.

`SQLBackendArray` ¶

Bases: BackendArray

Read-only lazy N-D array view over a DataFusion DataFrame.

Bridges xarray's lazy-indexing interface (:class:xarray.backends.BackendArray) to a DataFusion query result, so an xarray Dataset can present a SQL query as if it were a materialized N-D array without actually loading any data until the caller asks for it. This is the workhorse that lets :meth:XarrayDataFrame.to_dataset return a Dataset cheaply.

On each __getitem__ call, the requested xarray indexer is translated into a DataFusion filter expression (df.filter(expr)) and a column projection (df.select(*cols)). The filtered DataFrame is consumed via execute_stream as a sequence of Arrow RecordBatch es and scattered into a preallocated numpy buffer, so only the requested data is materialized.

Constraints and caveats:

Read-only: there is no write path; the backend exists to surface query results, not to round-trip writes into a SQL store.
The underlying DataFusion DataFrame holds a reference to its originating SessionContext, which is not picklable. The class therefore overrides __copy__ and __deepcopy__ to return self -- this is safe because the backend is read-only.
IndexingSupport.OUTER: BasicIndexer and OuterIndexer are translated to filter predicates directly; VectorizedIndexer paths through xarray's adapter to outer-then-gather and so still works, just less efficiently.

Raises:

Type	Description
`ValueError, datafusion exceptions`	propagated from the underlying `df.filter().select().execute_stream()` chain if a predicate refers to a missing column, the dtype of a literal is incompatible, or the execution itself fails.
`AssertionError`	from `np.searchsorted` mis-alignment, which indicates the result contains coordinate values not present in the wrapper's pre-computed coord arrays -- usually a symptom of a filtered query whose coord discovery missed a value.

Constructed by :func:_build_lazy_scan; users should not instantiate this class directly.

`dtype = np.dtype(dtype)` `instance-attribute` ¶

`shape = tuple(shape)` `instance-attribute` ¶

`XarrayDataFrame` ¶

Wrapper around a DataFusion DataFrame with xarray-aware helpers.

Returned by :meth:xarray_sql.XarrayContext.sql. Forwards every attribute it does not define itself to the wrapped DataFrame, so .collect(), .schema(), .show(), .count() all work unchanged.

Carries a private snapshot of the context's registered Datasets so :meth:to_dataset can default dims and recover metadata dropped by the forward pivot.

Users should not construct this class directly; let :meth:XarrayContext.sql produce it.

`to_dataset(dims=None, template=None, sparsity='result', fill_value=np.nan, chunks='inherit')` ¶

Convert the result to an xr.Dataset.

Parameters:

Name	Type	Description	Default
`dims`	`list[str] \| None`	Result columns to use as Dataset dimensions. When `None`, defaults to the dims of the registered Dataset referenced by the SQL `FROM` clause (if exactly one matches), or any single registered Dataset whose dims are all present in the result columns.	`None`
`template`	`Dataset \| str \| None`	Source to recover metadata (attrs, encoding, non-dim coordinates, dim-coord dtype) from. Either an `xr.Dataset` used directly, or the name of a registered table (e.g. `"era5.surface"`) whose Dataset is looked up. When `None` and exactly one Dataset is registered, that one is used.	`None`
`sparsity`	`Sparsity`	`"result"` (default) keeps only dim values present in the result. `"template"` reindexes to the template's full coord ranges, filling absent cells with `fill_value`; requires a template.	`'result'`
`fill_value`	`Any`	Used when `sparsity="template"` reindexes to a wider extent. Defaults to `np.nan`.	`nan`
`chunks`	`Mapping[str, int] \| str \| None`	Output chunking, controlling laziness (an xarray idiom). `"inherit"` (default): reuse the source Dataset's chunk sizes, but only for dimensions that were genuinely split into multiple chunks in the input -- so the output chunk grid maps onto the source partitions. A reduction that drops the chunked dimension (e.g. a global aggregation) inherits nothing and so is materialized eagerly. Falls back to eager when no source Dataset is resolvable. `None`: eager. Execute the query once and return a dense in-memory Dataset. Best for reductions (small results). a mapping (e.g. `{"time": 100}`): chunk explicitly. Each chunk reads its coordinate range lazily via filter pushdown on access, through xarray's configured chunk manager (dask, cubed, ...). `"auto"`: size chunks to the chunk manager's byte target but snap boundaries to whole source partitions, so each chunk is a union of source partitions. Useful for finely partitioned sources (e.g. ERA5 `chunks={"time": 1}`), coarsening many tiny partitions into memory-sized, aligned chunks.	`'inherit'`

Returns:

Type	Description
`Dataset`	An `xr.Dataset` with `dims` as dimensions and the
`Dataset`	remaining result columns as data variables.

Raises:

Type	Description
`ValueError`	`dims` cannot be inferred, names a missing column, or the result has duplicate dim tuples; `template` names an unknown registered table; or `sparsity="template"` is requested without a resolvable template.

`to_pandas()` ¶

Materialize the result as a pd.DataFrame (DataFusion API).

`reader` ¶

Lazy Arrow stream reader for xarray Datasets.

This module provides XarrayRecordBatchReader, which implements the Arrow PyCapsule Interface (arrow_c_stream) to enable zero-copy, lazy streaming of xarray data to DataFusion and other Arrow consumers.

The implementation delegates to PyArrow's RecordBatchReader for the actual stream implementation, wrapping xarray block iteration in a generator.

`XarrayRecordBatchReader` ¶

A lazy Arrow stream reader for xarray Datasets.

Implements the Arrow PyCapsule Interface (arrow_c_stream) to enable zero-copy, lazy streaming of xarray data to DataFusion and other Arrow consumers.

The key property is that xarray blocks are only converted to Arrow RecordBatches when the consumer calls get_next (e.g., during DataFusion's collect()), NOT when the reader is created or registered.

Attributes:

Name	Type	Description
`schema`	`Schema`	The Arrow schema for the stream.

Example

import xarray as xr from xarray_sql import XarrayRecordBatchReader ds = xr.tutorial.open_dataset('air_temperature') reader = XarrayRecordBatchReader(ds, chunks={'time': 240})

At this point, NO data has been read from xarray¶

Data is only read when consumed:¶

import pyarrow as pa pa_reader = pa.RecordBatchReader.from_stream(reader) for batch in pa_reader: ... print(batch.num_rows) # Data read here

`schema` `property` ¶

The Arrow schema for this stream.

`read_xarray(ds, chunks=None)` ¶

Pivots an Xarray Dataset into a PyArrow Table, partitioned by chunks.

Parameters:

Name	Type	Description	Default
`ds`	`Dataset`	An Xarray Dataset. All `data_vars` must share the same dimensions.	required
`chunks`	`Chunks`	Xarray-like chunks. If not provided, will default to the Dataset's chunks. The product of the chunk sizes becomes the standard length of each dataframe partition.	`None`

Returns:

Type	Description
`RecordBatchReader`	A PyArrow RecordBatchReader, which is a table representation of the input
`RecordBatchReader`	Dataset.

`read_xarray_table(ds, chunks=None, *, batch_size=DEFAULT_BATCH_SIZE, coord_arrays=None, _iteration_callback=None)` ¶

Create a lazy DataFusion table from an xarray Dataset.

This is the simplest way to register xarray data with DataFusion. Data is only read when queries are executed, not during registration. The table can be queried multiple times.

Each chunk becomes a separate partition, enabling DataFusion's parallel execution across multiple cores.

Note

SQL queries with WHERE clauses on dimension columns (time, lat, lon, etc.) automatically prune partitions that can't contain matching rows — this is called filter pushdown. For example:

# This query will skip loading partitions with time < '2020-02-01'
result = ctx.sql('SELECT * FROM air WHERE time > "2020-02-01"').collect()

Supported operators: =, <, >, <=, >=, BETWEEN, IN, AND, OR.

Parameters:

Name	Type	Description	Default
`ds`	`Dataset`	An xarray Dataset. All data_vars must share the same dimensions.	required
`chunks`	`Chunks`	Xarray-like chunks specification. If not provided, uses the Dataset's existing chunks.	`None`
`batch_size`	`int`	Maximum rows per Arrow RecordBatch emitted per partition. Smaller values let DataFusion start processing earlier; the default (65 536) works well for most datasets.	`DEFAULT_BATCH_SIZE`
`coord_arrays`	`dict[str, ndarray] \| None`	Pre-materialised coordinate arrays keyed by dim-name string. Hand in to share a single read across multiple tables built from the same parent Dataset (e.g. surface + atmosphere from ARCO-ERA5); the dim coords are otherwise read once per `read_xarray_table` call, which is a network round-trip for Zarr-backed datasets.	`None`
`_iteration_callback`	`Callable[[Block, list[str] \| None], None] \| None`	Internal callback for testing. Called with each block dict just before it's converted to Arrow.	`None`

Returns:

Type	Description
`'LazyArrowStreamTable'`	A LazyArrowStreamTable ready for registration with DataFusion.

Example

from datafusion import SessionContext import xarray as xr from xarray_sql import read_xarray_table

ds = xr.tutorial.open_dataset('air_temperature') table = read_xarray_table(ds, chunks={'time': 240})

ctx = SessionContext() ctx.register_table('air', table)

Data is only read here, during query execution¶

Filters on 'time' will prune partitions automatically!¶

result = ctx.sql('SELECT AVG(air) FROM air').collect()

`sql` ¶

`XarrayContext` ¶

Bases: SessionContext

A datafusion SessionContext that also supports xarray.Datasets.

`from_dataset(name, input_table, *, table_names=None, chunks=None)` ¶

Register an xarray Dataset as one or more queryable SQL tables.

When all data variables share the same dimensions, the dataset is registered as a single table named name. When variables have differing dimensions (e.g. some on a 3D grid and others on a 4D grid), the dataset is split into one table per dimension group. The tables are registered under a SQL schema (namespace) named name and named <dim1>_<dim2>_... by default::

ctx.from_dataset('era5', ds, chunks={'time': 24})
# registers tables: 'era5.time_lat_lon' and
#                   'era5.time_lat_lon_level'
ctx.sql('SELECT AVG(temperature_2m) FROM era5.time_lat_lon')

Use table_names to override the name for specific dimension tuples::

ctx.from_dataset(
    'era5', ds,
    table_names={('time', 'lat', 'lon'): 'surface'},
)
ctx.sql('SELECT * FROM era5.surface')

For datasets with non-Gregorian cftime coordinates (e.g. 360_day, julian), a cftime() scalar UDF is automatically registered so you can write ergonomic SQL filters::

ctx.from_dataset("ds360", ds, chunks={"time": 6})
ctx.sql("SELECT * FROM ds360 WHERE time >= cftime('2000-07-01')")

.. note::

Only one ``cftime()`` UDF is registered per context, using the
units and calendar of the *first* non-Gregorian coordinate
encountered. If you register multiple datasets with *different*
non-Gregorian calendars (e.g. one 360_day and one julian), the
UDF from the first registration will be used for all subsequent
``cftime()`` calls and may produce incorrect offsets for the
other dataset. In that case, create a separate ``XarrayContext``
for each calendar.

Parameters:

Name	Type	Description	Default
`name`	`str`	The SQL identifier under which the dataset is registered. For datasets with uniform dimensions, this is the table name. For datasets with mixed dimensions, this is the name of a SQL schema (namespace) containing one table per dimension group.	required
`input_table`	`Dataset`	An xarray Dataset.	required
`table_names`	`dict[tuple[str, ...], str] \| None`	Optional mapping from dimension tuples to custom table names within the schema, used when the dataset has variables with differing dimensions.	`None`
`chunks`	`Chunks`	Xarray-like chunks specification. If not provided, uses the Dataset's existing chunks.	`None`

Returns:

Type	Description
	self, to allow chaining.

`sql(query, *args, **kwargs)` ¶

Run a SQL query, returning an :class:XarrayDataFrame wrapper.

Identical to datafusion.SessionContext.sql except the returned object wraps the DataFusion DataFrame. The wrapper exposes .to_pandas() (unchanged), forwards every other DataFusion method via __getattr__, and adds .to_dataset(dimension_columns=[...]) for round-tripping the result back to an xr.Dataset.

Parameters:

Name	Type	Description	Default
`query`	`str`	A SQL query string.	required
`*args`		Forwarded to `SessionContext.sql`.	`()`
`**kwargs`		Forwarded to `SessionContext.sql`.	`{}`

Returns:

Name	Type	Description
`An`	`XarrayDataFrame`	class:`XarrayDataFrame` wrapping the DataFusion DataFrame.

xarray-sql¶

xarray_sql ¶

__all__ = ['cftime', 'XarrayContext', 'read_xarray_table', 'read_xarray', 'from_map'] module-attribute ¶

XarrayContext ¶

from_dataset(name, input_table, *, table_names=None, chunks=None) ¶

sql(query, *args, **kwargs) ¶

from_map(func, *iterables, args=None, **kwargs) ¶

read_xarray(ds, chunks=None) ¶

read_xarray_table(ds, chunks=None, *, batch_size=DEFAULT_BATCH_SIZE, coord_arrays=None, _iteration_callback=None) ¶

Data is only read here, during query execution¶

Filters on 'time' will prune partitions automatically!¶

cftime ¶

DEFAULT_UNITS = 'microseconds since 1970-01-01T00:00:00' module-attribute ¶

GREGORIAN_LIKE_CALENDARS = frozenset({'standard', 'gregorian', 'proleptic_gregorian', 'noleap', '365_day', 'all_leap', '366_day'}) module-attribute ¶

arrow_field(name, units, cal) ¶

calendar(ds, coord_name) ¶

convert_for_field(values, field) ¶

encoding(ds, coord_name) ¶

is_cftime(values) ¶

is_cftime_index(ds, coord_name) ¶

is_gregorian_like(calendar) ¶

make_cftime_udf(units, calendar) ¶

partition_bounds(values) ¶

to_microseconds(values) ¶

to_offsets(values, units, cal) ¶

core ¶

Row = list[Any] module-attribute ¶

get_columns(ds) ¶

unbounded_unravel(ds) ¶

unravel(ds) ¶

df ¶

Block = dict[Hashable, slice] module-attribute ¶

Chunks = dict[str, int] | None module-attribute ¶

DEFAULT_BATCH_SIZE = 65536 module-attribute ¶

PartitionBounds = dict[str, tuple[Any, Any, str]] module-attribute ¶

block_slices(ds, chunks=None) ¶

compute_chunks(ds, chunks) ¶

dataset_to_record_batch(ds, schema) ¶

explode(ds, chunks=None) ¶

from_map(func, *iterables, args=None, **kwargs) ¶

from_map_batched(func, *iterables, args=None, schema=None, **kwargs) ¶

iter_record_batches(ds, schema, batch_size=DEFAULT_BATCH_SIZE) ¶

partition_metadata(ds, blocks) ¶

pivot(ds) ¶

resolve_chunks(ds, chunks) ¶

ds ¶

Sparsity = Literal['result', 'template'] module-attribute ¶

SQLBackendArray ¶

dtype = np.dtype(dtype) instance-attribute ¶

shape = tuple(shape) instance-attribute ¶

XarrayDataFrame ¶

to_dataset(dims=None, template=None, sparsity='result', fill_value=np.nan, chunks='inherit') ¶

to_pandas() ¶

reader ¶

XarrayRecordBatchReader ¶

At this point, NO data has been read from xarray¶

Data is only read when consumed:¶

schema property ¶

read_xarray(ds, chunks=None) ¶

read_xarray_table(ds, chunks=None, *, batch_size=DEFAULT_BATCH_SIZE, coord_arrays=None, _iteration_callback=None) ¶

Data is only read here, during query execution¶

Filters on 'time' will prune partitions automatically!¶

sql ¶

XarrayContext ¶

from_dataset(name, input_table, *, table_names=None, chunks=None) ¶

sql(query, *args, **kwargs) ¶

`xarray_sql` ¶

`all = ['cftime', 'XarrayContext', 'read_xarray_table', 'read_xarray', 'from_map']` `module-attribute` ¶

`XarrayContext` ¶

`from_dataset(name, input_table, *, table_names=None, chunks=None)` ¶

`sql(query, *args, **kwargs)` ¶

`from_map(func, *iterables, args=None, **kwargs)` ¶

`read_xarray(ds, chunks=None)` ¶

`read_xarray_table(ds, chunks=None, *, batch_size=DEFAULT_BATCH_SIZE, coord_arrays=None, _iteration_callback=None)` ¶

`cftime` ¶

`DEFAULT_UNITS = 'microseconds since 1970-01-01T00:00:00'` `module-attribute` ¶

`GREGORIAN_LIKE_CALENDARS = frozenset({'standard', 'gregorian', 'proleptic_gregorian', 'noleap', '365_day', 'all_leap', '366_day'})` `module-attribute` ¶

`arrow_field(name, units, cal)` ¶

`calendar(ds, coord_name)` ¶

`convert_for_field(values, field)` ¶

`encoding(ds, coord_name)` ¶

`is_cftime(values)` ¶

`is_cftime_index(ds, coord_name)` ¶

`is_gregorian_like(calendar)` ¶

`make_cftime_udf(units, calendar)` ¶

`partition_bounds(values)` ¶

`to_microseconds(values)` ¶

`to_offsets(values, units, cal)` ¶

`core` ¶

`Row = list[Any]` `module-attribute` ¶

`get_columns(ds)` ¶

`unbounded_unravel(ds)` ¶

`unravel(ds)` ¶

`df` ¶

`Block = dict[Hashable, slice]` `module-attribute` ¶

`Chunks = dict[str, int] | None` `module-attribute` ¶

`DEFAULT_BATCH_SIZE = 65536` `module-attribute` ¶

`PartitionBounds = dict[str, tuple[Any, Any, str]]` `module-attribute` ¶

`block_slices(ds, chunks=None)` ¶

`compute_chunks(ds, chunks)` ¶

`dataset_to_record_batch(ds, schema)` ¶

`explode(ds, chunks=None)` ¶

`from_map(func, *iterables, args=None, **kwargs)` ¶

`from_map_batched(func, *iterables, args=None, schema=None, **kwargs)` ¶

`iter_record_batches(ds, schema, batch_size=DEFAULT_BATCH_SIZE)` ¶

`partition_metadata(ds, blocks)` ¶

`pivot(ds)` ¶

`resolve_chunks(ds, chunks)` ¶

`ds` ¶

`Sparsity = Literal['result', 'template']` `module-attribute` ¶

`SQLBackendArray` ¶

`dtype = np.dtype(dtype)` `instance-attribute` ¶

`shape = tuple(shape)` `instance-attribute` ¶

`XarrayDataFrame` ¶

`to_dataset(dims=None, template=None, sparsity='result', fill_value=np.nan, chunks='inherit')` ¶

`to_pandas()` ¶

`reader` ¶

`XarrayRecordBatchReader` ¶

`schema` `property` ¶

`read_xarray(ds, chunks=None)` ¶

`read_xarray_table(ds, chunks=None, *, batch_size=DEFAULT_BATCH_SIZE, coord_arrays=None, _iteration_callback=None)` ¶

`sql` ¶

`XarrayContext` ¶

`from_dataset(name, input_table, *, table_names=None, chunks=None)` ¶

`sql(query, *args, **kwargs)` ¶