Axes in MuData#
This notebook introduces the axes interface, which extends MuData beyond multimodal data storage.
Briefly, the default multimodal storage means that the modalities (AnnData objects) have observations as a shared axis (axis=0), and the variables are effectively concatenated.
We can imagine a symmetrical storage model where the variables are shared and observations are concatenated. This is possible with axis=1 provided at MuData creation time.
Going further, in some cases we might want to relax the constraints and assume that both observations and variables are in fact shared. This makes it possible, for instance, to store subsets of features in the same object. As both axes are shared, a convention is used here: axis=-1.
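The three conventions can be summarised as shape arithmetic. The sketch below is not part of mudata: `expected_shape` is a hypothetical helper that works on plain lists of names, and it assumes that a shared axis behaves like a union of names while a non-shared axis is simply concatenated.

```python
def expected_shape(mods, axis=0):
    """Predict (n_obs, n_vars) of the container for an axis convention.

    mods maps a modality name to a pair (obs_names, var_names).
    Hypothetical helper for illustration; not a mudata function.
    """
    def union_size(name_lists):
        # A shared axis: identical names refer to the same entity.
        return len(set().union(*name_lists))

    def concat_size(name_lists):
        # A non-shared axis: entries are stacked one after another.
        return sum(len(names) for names in name_lists)

    obs = [obs_names for obs_names, _ in mods.values()]
    var = [var_names for _, var_names in mods.values()]
    if axis == 0:    # shared observations, concatenated variables
        return union_size(obs), concat_size(var)
    if axis == 1:    # concatenated observations, shared variables
        return concat_size(obs), union_size(var)
    if axis == -1:   # both axes shared
        return union_size(obs), union_size(var)
    raise ValueError("axis must be 0, 1 or -1")


cells = [f"cell{i}" for i in range(100)]
rna = (cells, [f"gene{i}" for i in range(1000)])
atac = (cells, [f"peak{i}" for i in range(1500)])
shape_axis0 = expected_shape({"rna": rna, "atac": atac}, axis=0)
print(shape_axis0)  # (100, 2500)
```

The modality names and feature names here are made up for the example; only the counts matter.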
Imports#
First, install and import mudata and other libraries.
%pip install mudata
from mudata import MuData, AnnData
import numpy as np
np.random.seed(1)
Multimodal: axis=0#
As expected, this is the default behaviour.
To illustrate it, let’s prepare some modalities first:
n, d1, d2 = 100, 1000, 1500
ax = AnnData(np.random.normal(size=(n, d1)))
ay = AnnData(np.random.normal(size=(n, d2)))
# same as:
# mdata = MuData({"x": ax, "y": ay})
mdata = MuData({"x": ax, "y": ay}, axis=0)
mdata
MuData object with n_obs × n_vars = 100 × 2500
2 modalities
x: 100 x 1000
y: 100 x 1500
As axis=0 corresponds to shared observations, the features should be specific to their modalities. The default variable names, however, are not unique across modalities, which is what the warning is about:
print("ax.var_names: [", ", ".join(ax.var_names.values[:5]) + ", ..., ", ax.var_names.values[d1 - 1], "]")
print("ay.var_names: [", ", ".join(ay.var_names.values[:5]) + ", ..., ", ay.var_names.values[d2 - 1], "]")
ax.var_names: [ 0, 1, 2, 3, 4, ..., 999 ]
ay.var_names: [ 0, 1, 2, 3, 4, ..., 1499 ]
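To see which names clash between modalities, a plain set intersection is enough. The following is a minimal sketch with a hypothetical helper, `overlapping_names`, that takes lists of names rather than AnnData objects; the default names are reproduced here as strings "0", "1", and so on.

```python
from itertools import combinations


def overlapping_names(names_by_modality):
    """Return {(mod_a, mod_b): set of names present in both}.

    Hypothetical helper for illustration; not a mudata function.
    """
    overlaps = {}
    pairs = combinations(names_by_modality.items(), 2)
    for (mod_a, names_a), (mod_b, names_b) in pairs:
        common = set(names_a) & set(names_b)
        if common:
            overlaps[(mod_a, mod_b)] = common
    return overlaps


# Default integer-like names, as in the printout above.
default_names = {
    "x": [str(i) for i in range(1000)],
    "y": [str(i) for i in range(1500)],
}
clashes = overlapping_names(default_names)
print(len(clashes[("x", "y")]))  # 1000
```

With modality-specific prefixes in the names, the same check returns an empty dictionary.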
In real-world workflows we expect to be able to identify features by their (unique) names:
ax.var_names = [f"x_var{i + 1}" for i in range(d1)]
ay.var_names = [f"y_var{i + 1}" for i in range(d2)]
mdata = MuData({"x": ax, "y": ay}, axis=0)
mdata
MuData object with n_obs × n_vars = 100 × 2500
2 modalities
x: 100 x 1000
y: 100 x 1500
Multidataset: axis=1#
Now, AnnData objects can represent e.g. multiple scRNA-seq datasets. When analysing them together, it is convenient to store them in one object. This object can then incorporate annotations such as a joint embedding of the datasets.
n1, n2, d = 100, 500, 1000
ad1 = AnnData(np.random.normal(size=(n1, d)))
ad2 = AnnData(np.random.normal(size=(n2, d)))
# Cell barcodes are dataset-specific
ad1.obs_names = [f"dat1-cell{i + 1}" for i in range(n1)]
ad2.obs_names = [f"dat2-cell{i + 1}" for i in range(n2)]
What would happen if we create a MuData without specifying the axis?
mdata = MuData({"dat1": ad1, "dat2": ad2})
mdata
Answer
By default, variables are dataset/modality-specific so the number of features in MuData will be d + d = 2000.
Cells are considered shared but here, obs_names are unique for each dataset, so the number of cells will be n1 + n2 = 600.
UserWarning: Cannot join columns with the same name because var_names are intersecting.
MuData object with n_obs × n_vars = 600 × 2000
2 modalities
dat1: 100 x 1000
dat2: 500 x 1000
Now, if we declare the variables to be the shared axis:
mdata = MuData({"dat1": ad1, "dat2": ad2}, axis=1)
mdata
MuData object with n_obs × n_vars = 600 × 1000 (shared var)
2 modalities
dat1: 100 x 1000
dat2: 500 x 1000
Different views on one modality: axis=-1#
In some workflows, such as those using scVI, AnnData objects typically contain only selected features, e.g. highly variable genes. Raw counts for all of the genes are still worth keeping for other analyses.
MuData handles this scenario using the axis=-1 convention.
n, d_raw, d_preproc = 100, 900, 300
a_raw = AnnData(np.random.normal(size=(n, d_raw)))
a_preproc = a_raw[:, np.sort(np.random.choice(np.arange(d_raw), d_preproc, replace=False))].copy()
What would happen if we create a MuData with axis=0?
mdata = MuData({"raw": a_raw, "preproc": a_preproc}, axis=0)
mdata
Answer
With axis=0, cells are (fully) shared (100), variables are concatenated (1200). As the names for the latter intersect between AnnData objects, a warning will be displayed.
UserWarning: Cannot join columns with the same name because var_names are intersecting.
MuData object with n_obs × n_vars = 100 × 1200
2 modalities
raw: 100 x 900
preproc: 100 x 300
What would happen if we create a MuData with axis=1?
mdata = MuData({"raw": a_raw, "preproc": a_preproc}, axis=1)
mdata
Answer
With axis=1, variables are shared (900), while the cells are dataset-specific (200). As the names for the latter are actually the same in both AnnData objects, a warning will be displayed.
UserWarning: Cannot join columns with the same name because obs_names are intersecting.
MuData object with n_obs × n_vars = 200 × 900
2 modalities
raw: 100 x 900
preproc: 100 x 300
What we want is a MuData object of dimensions (100, 900): the cells are the same in both AnnData objects, and the features of one are a subset of the features of the other.
That’s what we get when we declare both axes to be shared:
mdata = MuData({"raw": a_raw, "preproc": a_preproc}, axis=-1)
mdata
MuData object with n_obs × n_vars = 100 × 900 (shared obs and var)
2 modalities
raw: 100 x 900
preproc: 100 x 300
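The three scenarios in this notebook differ only in which names overlap between the AnnData objects. As a rough rule of thumb, that overlap can be checked before choosing the axis. The sketch below uses a hypothetical helper, `suggest_axis`, which is not part of mudata and treats any name common to all objects as a sign that the axis is shared.

```python
def suggest_axis(mods):
    """Suggest an axis convention from name overlap.

    mods maps a name to a pair (obs_names, var_names).
    Returns 0, 1, -1, or None when nothing is shared.
    Hypothetical heuristic for illustration; not a mudata function.
    """
    if not mods:
        return None
    obs_sets = [set(obs_names) for obs_names, _ in mods.values()]
    var_sets = [set(var_names) for _, var_names in mods.values()]
    # An axis counts as shared if some name occurs in every object.
    obs_shared = bool(set.intersection(*obs_sets))
    var_shared = bool(set.intersection(*var_sets))
    if obs_shared and var_shared:
        return -1
    if var_shared:
        return 1
    if obs_shared:
        return 0
    return None


cells = [str(i) for i in range(100)]
genes = [str(i) for i in range(900)]
raw = (cells, genes)
preproc = (cells, genes[:300])  # same cells, subset of genes
suggested = suggest_axis({"raw": raw, "preproc": preproc})
print(suggested)  # -1
```

This is only a heuristic: identical default names, such as the integer-like names AnnData assigns automatically, also count as overlap, so it is worth setting meaningful names first.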