Managing annotations#
MuData objects have multimodal annotations stored in the same way as AnnData objects. For instance, observations are annotated using the .obs table, and variables are annotated using the .var table.
As the observations and variables of a MuData object are derived from the observations and variables of its individual modalities, it can be useful to copy or move annotations between the global table and the tables of individual modalities.
For this, mudata offers .pull_obs() / .pull_var() methods to copy metadata from individual modalities to the global annotation (.obs or .var). The opposite flow of metadata — from global metadata to individual modalities — can be achieved with .push_obs() / .push_var() methods.
import numpy as np
import pandas as pd
from mudata import *
Annotations in multimodal objects#
Pulling annotations#
A few parameters help to specify which annotations are pulled. Generally, there are two ways of specifying the annotation columns: providing them explicitly with columns=[...] or providing the types of columns to be pulled (e.g. common or unique).
Pulling feature annotations with .pull_var()#
For demonstration purposes, we will use a simple MuData object with some annotations for the features:
def make_mdata():
    D1, D2, D3 = 10, 20, 30
    D = D1 + D2 + D3
    mod1 = AnnData(np.arange(0, 100, 0.1).reshape(-1, D1))
    mod1.obs_names = [f"obs{i}" for i in range(mod1.n_obs)]
    mod1.var_names = [f"var{i}" for i in range(D1)]
    mod2 = AnnData(np.arange(3101, 5101, 1).reshape(-1, D2))
    mod2.obs_names = mod1.obs_names.copy()
    mod2.var_names = [f"var{i}" for i in range(D1, D1 + D2)]
    mod3 = AnnData(np.arange(5101, 8101, 1).reshape(-1, D3))
    mod3.obs_names = mod1.obs_names.copy()
    mod3.var_names = [f"var{i}" for i in range(D1 + D2, D)]
    # common column already present in all modalities
    mod1.var["highly_variable"] = True
    mod2.var["highly_variable"] = np.tile([False, True], D2 // 2)
    mod3.var["highly_variable"] = np.tile([True, False], D3 // 2)
    # column present in some (2 out of 3) modalities (non-unique)
    mod2.var["arange"] = np.arange(D2)
    mod3.var["arange"] = np.arange(D3)
    # column present in one modality (unique)
    mod3.var["is_region"] = True
    mdata = MuData({"mod1": mod1, "mod2": mod2, "mod3": mod3})
    return mdata
mdata = make_mdata()
# TODO: shouldn't be needed from 0.4
# mdata.update(pull=False)
mdata.var = mdata.var.loc[:, []]
mdata
MuData object with n_obs × n_vars = 100 × 60
3 modalities
mod1: 100 x 10
var: 'highly_variable'
mod2: 100 x 20
var: 'highly_variable', 'arange'
mod3: 100 x 30
var: 'highly_variable', 'arange', 'is_region'
All columns. By default, all columns will be pulled:
mdata.pull_var(join_nonunique=True)
mdata.var.dtypes
highly_variable boolean
arange float64
mod3:is_region boolean
dtype: object
# Clean up
mdata.var = mdata.var.loc[:, []]
columns=... Individual columns can be specified to be used in this operation. Both colname and modname:colname formats are supported.
A column that is present across modalities will be pulled from all the modalities:
mdata.pull_var(columns=["highly_variable"])
print(f"{(~pd.isnull(mdata.var.highly_variable)).sum()} values in mdata.var.highly_variable")
mdata.var.dtypes
60 values in mdata.var.highly_variable
highly_variable boolean
dtype: object
mdata.var = mdata.var.loc[:, []]
Pull particular columns, e.g. a single column from a specified modality:
mdata.pull_var(columns=["mod2:highly_variable"])
print(f"{(~pd.isnull(mdata.var['mod2:highly_variable'])).sum()} values in mdata.var['mod2:highly_variable']")
mdata.var.dtypes
20 values in mdata.var['mod2:highly_variable']
mod2:highly_variable boolean
dtype: object
mdata.var = mdata.var.loc[:, []]
As a result, mdata.var['mod2:highly_variable'] will be a nullable boolean array with corresponding values from mdata['mod2'].var.highly_variable.
The value of highly_variable for features from other modalities is NA.
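The two column formats accepted by columns= can be illustrated with a small stand-alone sketch. The helper split_column_spec below is hypothetical and is not part of mudata; it only assumes that a name counts as prefixed when the text before the first colon is a known modality name.

```python
def split_column_spec(spec, mod_names):
    """Split 'modname:colname' into (modname, colname).

    Plain column names, or names whose prefix is not a known modality,
    are returned as (None, spec). Hypothetical helper, not mudata API.
    """
    prefix, sep, rest = spec.partition(":")
    if sep and prefix in mod_names:
        return prefix, rest
    return None, spec


mods = ["mod1", "mod2", "mod3"]
print(split_column_spec("highly_variable", mods))       # (None, 'highly_variable')
print(split_column_spec("mod2:highly_variable", mods))  # ('mod2', 'highly_variable')
```

A plain name requests the column from every modality that has it, while a prefixed name restricts the request to one modality, which is why the result above keeps the mod2: prefix.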
common, nonunique, unique Note that the common annotation is now prefixed with a modality name as it has been requested from a limited set of modalities. In this case it behaves similarly to a unique column such as is_region in mod3. The third type of annotation is non-unique: those are the columns that are present in some but not all modalities.
mdata.pull_var(common=True, nonunique=True, unique=False)
mdata.var.dtypes
highly_variable boolean
mod2:arange float64
mod3:arange float64
dtype: object
mdata.var = mdata.var.loc[:, []]
This makes it possible to pull only unique, i.e. modality-specific, columns:
# unique column
mdata.pull_var(unique=True, common=False, nonunique=False)
mdata.var.dtypes
mod3:is_region boolean
dtype: object
mdata.var = mdata.var.loc[:, []]
… just as it is possible to pull a specific unique column without specifying the modality name:
# unique column
mdata.pull_var(columns=["is_region"])
mdata.var.dtypes
mod3:is_region boolean
dtype: object
mdata.var = mdata.var.loc[:, []]
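The distinction between common, non-unique and unique columns can be sketched in plain Python. This is a conceptual model under the definitions given above (present in all, some, or exactly one modality), not mudata's implementation; the function name classify_columns is an assumption made for illustration.

```python
def classify_columns(mod_columns):
    """Group column names by how many modalities carry them.

    mod_columns maps a modality name to the list of its column names.
    Hypothetical helper, not mudata API.
    """
    counts = {}
    for cols in mod_columns.values():
        for col in cols:
            counts[col] = counts.get(col, 0) + 1

    n_mods = len(mod_columns)
    kinds = {"common": [], "nonunique": [], "unique": []}
    for col, n in counts.items():
        if n == n_mods:
            kinds["common"].append(col)      # present in every modality
        elif n == 1:
            kinds["unique"].append(col)      # present in exactly one modality
        else:
            kinds["nonunique"].append(col)   # present in some, but not all
    return kinds


var_columns = {
    "mod1": ["highly_variable"],
    "mod2": ["highly_variable", "arange"],
    "mod3": ["highly_variable", "arange", "is_region"],
}
print(classify_columns(var_columns))
# {'common': ['highly_variable'], 'nonunique': ['arange'], 'unique': ['is_region']}
```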
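join_common=..., join_nonunique=... Use the join_common and join_nonunique arguments to control whether annotations are collated across modalities into a single column (True) or kept as separate columns prefixed by modality name (False). Unique columns are prefixed by modality name unless prefix_unique=False is passed.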
Compare join_nonunique=False:
mdata.pull_var(columns=["arange"], join_nonunique=False)
mdata.var.dtypes
mod2:arange float64
mod3:arange float64
dtype: object
mdata.var = mdata.var.loc[:, []]
— with join_nonunique=True:
mdata.pull_var(columns=["arange"], join_nonunique=True)
mdata.var.dtypes
arange float64
dtype: object
mdata.var = mdata.var.loc[:, []]
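The effect of the join arguments on the resulting column names can be modelled with a short stand-alone sketch. It is not mudata's implementation: the function pulled_names is hypothetical, and its default flag values are chosen only to mirror the outputs shown above for .var.

```python
def pulled_names(mod_columns, join_common=True, join_nonunique=False):
    """Return the global column names a pull would produce in this model.

    Joined columns keep their plain name; all others get a 'mod:' prefix.
    Hypothetical helper, not mudata API.
    """
    n_mods = len(mod_columns)
    owners = {}
    for mod, cols in mod_columns.items():
        for col in cols:
            owners.setdefault(col, []).append(mod)

    names = []
    for col, mods in owners.items():
        is_common = len(mods) == n_mods
        is_unique = len(mods) == 1
        is_nonunique = not is_common and not is_unique
        if (is_common and join_common) or (is_nonunique and join_nonunique):
            names.append(col)  # one collated column
        else:
            names.extend(f"{mod}:{col}" for mod in mods)  # one column per modality
    return names


var_columns = {
    "mod1": ["highly_variable"],
    "mod2": ["highly_variable", "arange"],
    "mod3": ["highly_variable", "arange", "is_region"],
}
print(pulled_names(var_columns))
# ['highly_variable', 'mod2:arange', 'mod3:arange', 'mod3:is_region']
print(pulled_names(var_columns, join_nonunique=True))
# ['highly_variable', 'arange', 'mod3:is_region']
```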
mods=... It is also possible to limit the number of modalities to pull columns from. For example, columns=["mod1:highly_variable", "mod3:highly_variable"] can also be expressed as
mdata.pull_var(columns=["highly_variable"], mods=["mod1", "mod3"])
mdata.var.dtypes
mod1:highly_variable boolean
mod3:highly_variable boolean
dtype: object
mdata.var = mdata.var.loc[:, []]
Last but not least, columns can be automatically dropped from source.
mdata.pull_var(nonunique=False, unique=False, drop=True)
mdata.var.dtypes
highly_variable boolean
dtype: object
The highly_variable label has thus been effectively moved from the individual modalities to the global annotation:
for mod in mdata.mod.values():
    print("highly_variable" in mod.var.columns)
False
False
False
Pulling samples annotations with .pull_obs()#
Annotating individual observations is one of the key steps of analytical workflows. For instance, in single-cell sequencing datasets, observations are individual cells, and annotating their identity (cell type, cell state, etc.) as well as managing their source (tissue, organ, donor identity, species) are pivotal for understanding the underlying biology. Those operations are also complicated by the multi-layered structure of multimodal datasets.
The .pull_obs() method of MuData aims to abstract this complexity away.
For demonstration purposes, we will use a simple MuData object with some annotations for the observations:
def make_mdata():
    N = 100
    D1, D2, D3 = 10, 20, 30
    D = D1 + D2 + D3
    mod1 = AnnData(np.arange(0, 100, 0.1).reshape(-1, D1))
    mod1.obs_names = [f"obs{i}" for i in range(mod1.n_obs)]
    mod1.var_names = [f"var{i}" for i in range(D1)]
    mod2 = AnnData(np.arange(3101, 5101, 1).reshape(-1, D2))
    mod2.obs_names = mod1.obs_names.copy()
    mod2.var_names = [f"var{i}" for i in range(D1, D1 + D2)]
    mod3 = AnnData(np.arange(5101, 8101, 1).reshape(-1, D3))
    mod3.obs_names = mod1.obs_names.copy()
    mod3.var_names = [f"var{i}" for i in range(D1 + D2, D)]
    # common column already present in all modalities
    mod1.obs["qc"] = True
    mod2.obs["qc"] = True
    mod3.obs["qc"] = np.tile([True, False], N // 2)
    # column present in some (2 out of 3) modalities (non-unique)
    mod2.obs["arange"] = np.arange(N)
    mod3.obs["arange"] = np.arange(N, 2 * N)
    # column present in one modality (unique)
    mod3.obs["mod3_cell"] = True
    mdata = MuData({"mod1": mod1, "mod2": mod2, "mod3": mod3})
    return mdata
mdata = make_mdata()
# TODO: shouldn't be needed from 0.4
# mdata.update(pull=False)
mdata.obs = mdata.obs.loc[:, []]
mdata
MuData object with n_obs × n_vars = 100 × 60
3 modalities
mod1: 100 x 10
obs: 'qc'
mod2: 100 x 20
obs: 'qc', 'arange'
mod3: 100 x 30
obs: 'qc', 'arange', 'mod3_cell'
In a multimodal object, observations are shared across modalities. For this reason, join_* arguments cannot be set to True, and the annotations will always be prefixed with a modality name. Apart from this, the underlying implementation as well as the available parameters are the same as demonstrated above for .var.
All columns. By default, all columns will be pulled:
mdata.pull_obs()
mdata.obs.dtypes
mod1:qc boolean
mod2:arange int64
mod2:qc boolean
mod3:arange int64
mod3:mod3_cell boolean
mod3:qc boolean
dtype: object
# Clean up
mdata.obs = mdata.obs.loc[:, []]
columns=... Individual columns can be specified to be used in this operation. Both colname and modname:colname formats are supported.
A column that is present across modalities will be pulled from all the modalities:
mdata.pull_obs(columns=["qc"])
mdata.obs.dtypes
mod1:qc boolean
mod2:qc boolean
mod3:qc boolean
dtype: object
mdata.obs = mdata.obs.loc[:, []]
Pull particular columns, e.g. a single column from a specified modality:
mdata.pull_obs(columns=["mod2:qc"])
mdata.obs.dtypes
mod2:qc boolean
dtype: object
mdata.obs = mdata.obs.loc[:, []]
common, nonunique, unique Column types are deduced from whether a column is present in all, some, or a single modality. Because observations are shared across modalities, all pulled columns will be prefixed by a modality name:
mdata.pull_obs(common=True, nonunique=True, unique=False)
mdata.obs.dtypes
mod1:qc boolean
mod2:arange int64
mod2:qc boolean
mod3:arange int64
mod3:qc boolean
dtype: object
mdata.obs = mdata.obs.loc[:, []]
So it is possible to pull only unique, i.e. modality-specific, columns:
# unique column
mdata.pull_obs(unique=True, common=False, nonunique=False)
mdata.obs.dtypes
mod3:mod3_cell boolean
dtype: object
mdata.obs = mdata.obs.loc[:, []]
… just as it is possible to pull a specific unique column without specifying the modality name:
# unique column
mdata.pull_obs(columns=["mod3_cell"])
mdata.obs.dtypes
mod3:mod3_cell boolean
dtype: object
mdata.obs = mdata.obs.loc[:, []]
mods=... It is also possible to limit the number of modalities to pull columns from. For example, columns=["mod1:qc", "mod3:qc"] can also be expressed as
mdata.pull_obs(columns=["qc"], mods=["mod1", "mod3"])
mdata.obs.dtypes
mod1:qc boolean
mod3:qc boolean
dtype: object
mdata.obs = mdata.obs.loc[:, []]
Last but not least, columns can be automatically dropped from source.
mdata.pull_obs(nonunique=False, unique=False, drop=True)
mdata.obs.dtypes
mod1:qc boolean
mod2:qc boolean
mod3:qc boolean
dtype: object
The qc label has thus been effectively moved from the individual modalities to the global annotation:
for mod in mdata.mod.values():
    print("qc" in mod.obs.columns)
False
False
False
Pushing annotations#
Annotations can also be pushed from the global .var or .obs table to the individual modalities.
Pushing feature annotations with .push_var()#
For demonstration purposes, we will use a simple MuData object with some global annotations for the features:
def make_mdata():
    D1, D2, D3 = 10, 20, 30
    D = D1 + D2 + D3
    mod1 = AnnData(np.arange(0, 100, 0.1).reshape(-1, D1))
    mod1.obs_names = [f"obs{i}" for i in range(mod1.n_obs)]
    mod1.var_names = [f"var{i}" for i in range(D1)]
    mod2 = AnnData(np.arange(3101, 5101, 1).reshape(-1, D2))
    mod2.obs_names = mod1.obs_names.copy()
    mod2.var_names = [f"var{i}" for i in range(D1, D1 + D2)]
    mod3 = AnnData(np.arange(5101, 8101, 1).reshape(-1, D3))
    mod3.obs_names = mod1.obs_names.copy()
    mod3.var_names = [f"var{i}" for i in range(D1 + D2, D)]
    mdata = MuData({"mod1": mod1, "mod2": mod2, "mod3": mod3})
    # common column to be propagated to all modalities
    mdata.var["highly_variable"] = True
    # prefix column to be propagated to the respective modalities
    mdata.var["mod2:if_mod2"] = np.concatenate(
        [
            np.repeat(pd.NA, D1),
            np.repeat(True, D2),
            np.repeat(pd.NA, D3),
        ]
    )
    return mdata
mdata = make_mdata()
mdata
MuData object with n_obs × n_vars = 100 × 60
var: 'highly_variable'
3 modalities
mod1: 100 x 10
mod2: 100 x 20
mod3: 100 x 30
push_var() will add a highly_variable column to each modality and an if_mod2 column to the mod2 modality:
mdata.push_var()
for m in mdata.mod.keys():
    print(mdata[m].var.dtypes)
highly_variable bool
dtype: object
highly_variable bool
if_mod2 object
dtype: object
highly_variable bool
dtype: object
# Clean up
for m in mdata.mod.keys():
    mdata[m].var = mdata[m].var.loc[:, []]
common=, prefixed= options can be used to adjust the selection of columns to be pushed — non-prefixed (common) ones and/or the ones prefixed with modality name.
Only common:
mdata.push_var(common=True, prefixed=False)
for m in mdata.mod.keys():
    print(mdata[m].var.dtypes)
highly_variable bool
dtype: object
highly_variable bool
dtype: object
highly_variable bool
dtype: object
# Clean up
for m in mdata.mod.keys():
    mdata[m].var = mdata[m].var.loc[:, []]
… or only prefixed columns can be pushed:
mdata.push_var(common=False, prefixed=True)
for m in mdata.mod.keys():
    print(mdata[m].var.dtypes)
Series([], dtype: object)
if_mod2 object
dtype: object
Series([], dtype: object)
# Clean up
for m in mdata.mod.keys():
    mdata[m].var = mdata[m].var.loc[:, []]
Prefixed columns are pushed to the respective modalities.
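How common and prefixed columns are routed when pushing can be modelled in a few lines of plain Python. This is a sketch of the behaviour shown in the outputs above, not mudata's implementation; route_columns is a hypothetical helper, and it assumes that a prefixed column loses its modality prefix once it reaches the modality.

```python
def route_columns(global_columns, mod_names, common=True, prefixed=True):
    """Decide which global columns end up in which modality.

    Columns named 'mod:col' go only to modality 'mod' (as 'col');
    all other columns go to every modality. Hypothetical helper, not mudata API.
    """
    routed = {mod: [] for mod in mod_names}
    for col in global_columns:
        prefix, sep, rest = col.partition(":")
        if sep and prefix in routed:
            if prefixed:
                routed[prefix].append(rest)
        elif common:
            for mod in mod_names:
                routed[mod].append(col)
    return routed


global_var = ["highly_variable", "mod2:if_mod2"]
mods = ["mod1", "mod2", "mod3"]
print(route_columns(global_var, mods))
# {'mod1': ['highly_variable'], 'mod2': ['highly_variable', 'if_mod2'], 'mod3': ['highly_variable']}
print(route_columns(global_var, mods, common=False))
# {'mod1': [], 'mod2': ['if_mod2'], 'mod3': []}
```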
Alternatively, columns= accepts an explicit list of columns to be propagated to modalities, and mods= limits the modalities that annotations are propagated to:
mdata.push_var(columns=["highly_variable"], mods=["mod3"])
for m in mdata.mod.keys():
    print(mdata[m].var.dtypes)
Series([], dtype: object)
Series([], dtype: object)
highly_variable bool
dtype: object
# Clean up
for m in mdata.mod.keys():
    mdata[m].var = mdata[m].var.loc[:, []]
Annotations can also be dropped from mdata.var after pushing them to individual modalities with drop=True, or just dropped without propagation with only_drop=True:
mdata.push_var(prefixed=False, drop=True)
mdata.push_var(columns=["if_mod2"], only_drop=True)
This will propagate the highly_variable column to all the modalities and drop it from mdata.var, and will also drop the mdata.var.if_mod2 column:
print(f"mdata.var columns:\n{mdata.var.dtypes}")
mdata.var columns:
Series([], dtype: object)
print(f"mdata['mod2'].var columns:\n{mdata['mod2'].var.dtypes}")
mdata['mod2'].var columns:
highly_variable bool
dtype: object
Pushing samples annotations with .push_obs()#
For demonstration purposes, we will use a simple MuData object with some global annotations for the observations:
def make_mdata():
    D1, D2 = 10, 20
    mod1 = AnnData(np.arange(0, 100, 0.1).reshape(-1, D1))
    mod1.obs_names = [f"obs{i}" for i in range(mod1.n_obs)]
    mod1.var_names = [f"var{i}" for i in range(D1)]
    mod2 = AnnData(np.arange(3101, 5101, 1).reshape(-1, D2))
    mod2.obs_names = mod1.obs_names.copy()
    mod2.var_names = [f"var{i}" for i in range(D1, D1 + D2)]
    mdata = MuData({"mod1": mod1, "mod2": mod2})
    # common column to be propagated to all modalities
    mdata.obs["true"] = True
    return mdata
mdata = make_mdata()
mdata
MuData object with n_obs × n_vars = 100 × 30
obs: 'true'
2 modalities
mod1: 100 x 10
mod2: 100 x 20
push_obs() will add a true column to each modality:
mdata.push_obs()
for m in mdata.mod.keys():
    print(mdata[m].obs.dtypes)
true bool
dtype: object
true bool
dtype: object
# Clean up
for m in mdata.mod.keys():
    mdata[m].obs = mdata[m].obs.loc[:, []]
common=, prefixed= options can be used to adjust the selection of columns to be pushed — non-prefixed (common) ones and/or the ones prefixed with modality name:
mdata.push_obs(common=False)
for m in mdata.mod.keys():
    print(mdata[m].obs.dtypes)
Series([], dtype: object)
Series([], dtype: object)
# Clean up
for m in mdata.mod.keys():
    mdata[m].obs = mdata[m].obs.loc[:, []]
Alternatively, columns= accepts an explicit list of columns to be propagated to modalities, and mods= limits the modalities that annotations are propagated to:
mdata.push_obs(columns=["true"], mods=["mod2"])
for m in mdata.mod.keys():
    print(f"modality {m}:")
    print(mdata[m].obs.dtypes)
    print()
modality mod1:
Series([], dtype: object)
modality mod2:
true bool
dtype: object
# Clean up
for m in mdata.mod.keys():
    mdata[m].obs = mdata[m].obs.loc[:, []]
Annotations can also be dropped from mdata.obs after pushing them to individual modalities with drop=True, or just dropped without propagation with only_drop=True:
mdata.push_obs(only_drop=True)
This will just drop the mdata.obs.true column without propagating it:
print(f"mdata.obs columns:\n{mdata.obs.dtypes}")
mdata.obs columns:
Series([], dtype: object)
print(f"mdata['mod2'].obs columns:\n{mdata['mod2'].obs.dtypes}")
mdata['mod2'].obs columns:
Series([], dtype: object)
Multi-dataset annotations#
The axes interface enables MuData to be used beyond multimodal data. This includes multi-dataset containers with axis=1 (shared features) and data subsets with axis=-1 (shared observations and features).
def make_mdata():
    N1, N2, N3 = 10, 20, 30
    N = N1 + N2 + N3
    D = 100
    mod1 = AnnData(np.arange(0, 100, 0.1).reshape(N1, -1))
    mod1.obs_names = [f"obs{i}" for i in range(N1)]
    mod1.var_names = [f"var{i}" for i in range(D)]
    mod2 = AnnData(np.arange(3101, 5101, 1).reshape(N2, -1))
    mod2.obs_names = [f"obs{i}" for i in range(N1, N1 + N2)]
    mod2.var_names = mod1.var_names.copy()
    mod3 = AnnData(np.arange(5101, 8101, 1).reshape(N3, -1))
    mod3.obs_names = [f"obs{i}" for i in range(N1 + N2, N)]
    mod3.var_names = mod1.var_names.copy()
    # common column already present in all modalities
    mod1.obs["dataset"] = "dataset1"
    mod2.obs["dataset"] = "dataset2"
    mod3.obs["dataset"] = "dataset3"
    # column present in some (2 out of 3) modalities (non-unique)
    mod2.obs["species"] = "human"
    mod3.obs["species"] = "mouse"
    # column present in one modality (unique)
    mod3.obs["reference"] = True
    mdata = MuData({"mod1": mod1, "mod2": mod2, "mod3": mod3}, axis=1)
    return mdata
mdata = make_mdata()
# TODO: shouldn't be needed from 0.4
# mdata.update(pull=False)
mdata.obs = mdata.obs.loc[:, []]
mdata.var = mdata.var.loc[:, []]
mdata
MuData object with n_obs × n_vars = 60 × 100 (shared var)
3 modalities
mod1: 10 x 100
obs: 'dataset'
mod2: 20 x 100
obs: 'dataset', 'species'
mod3: 30 x 100
obs: 'dataset', 'species', 'reference'
mdata.pull_obs(join_nonunique=True, prefix_unique=False)
mdata.obs.dtypes
dataset object
species object
reference boolean
dtype: object
mdata.pull_var()
mdata.var.dtypes
Series([], dtype: object)
Stages annotations#
MuData objects with mdata.axis == -1 can contain “modalities” that have both samples and features shared. This can be useful, for example, for storing different processing stages, with both samples and features being filtered out by some quality control (QC) procedures.
As with the other axes, .pull_obs()/.pull_var() and .push_obs()/.push_var() work as well.
def make_staged_mdata():
    N, D = 10, 100
    Nsub, Dsub = 8, 50
    mod1 = AnnData(np.arange(0, 100, 0.1).reshape(N, D))
    mod1.obs_names = [f"obs{i}" for i in range(N)]
    mod1.var_names = [f"var{i}" for i in range(D)]
    mod2 = AnnData(np.arange(3101, 3501, 1).reshape(Nsub, Dsub))
    mod2.obs_names = [f"obs{i}" for i in range(Nsub)]
    mod2.var_names = [f"var{i}" for i in range(Dsub)]
    # common column already present in all modalities
    mod1.obs["status"] = True
    mod2.obs["status"] = True
    # column present in one modality (unique)
    mod2.obs["filtered"] = True
    mod2.var["filtered"] = True
    mdata = MuData({"raw": mod1, "qced": mod2}, axis=-1)
    return mdata
mdata = make_staged_mdata()
# TODO: shouldn't be needed from 0.4
# mdata.update(pull=False)
mdata.obs = mdata.obs.loc[:, []]
mdata.var = mdata.var.loc[:, []]
mdata
MuData object with n_obs × n_vars = 10 × 100 (shared obs and var)
2 modalities
raw: 10 x 100
obs: 'status'
qced: 8 x 50
obs: 'status', 'filtered'
var: 'filtered'
mdata.pull_obs(prefix_unique=False)
mdata.obs.dtypes
raw:status boolean
filtered boolean
qced:status boolean
dtype: object
mdata.pull_var(prefix_unique=False)
mdata.var.dtypes
filtered boolean
dtype: object
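The global shapes reported for the three axis values in this tutorial can be reproduced with a small model in plain Python. It is a sketch under one assumption: names along a shared axis are merged as a union, while names along a non-shared axis are concatenated. The function global_shape is hypothetical and is not part of mudata.

```python
def global_shape(mods, axis=0):
    """Compute (n_obs, n_vars) of a container in this simplified model.

    mods maps a modality name to a pair (obs_names, var_names).
    axis=0: shared obs; axis=1: shared var; axis=-1: both shared.
    Hypothetical helper, not mudata API.
    """
    def union_size(names_per_mod):
        return len(set().union(*names_per_mod))

    def concat_size(names_per_mod):
        return sum(len(names) for names in names_per_mod)

    obs = [o for o, _ in mods.values()]
    var = [v for _, v in mods.values()]
    n_obs = union_size(obs) if axis in (0, -1) else concat_size(obs)
    n_var = union_size(var) if axis in (1, -1) else concat_size(var)
    return n_obs, n_var


obs = [f"obs{i}" for i in range(100)]
multimodal = {
    "mod1": (obs, [f"var{i}" for i in range(10)]),
    "mod2": (obs, [f"var{i}" for i in range(10, 30)]),
    "mod3": (obs, [f"var{i}" for i in range(30, 60)]),
}
print(global_shape(multimodal, axis=0))  # (100, 60)

staged = {
    "raw": ([f"obs{i}" for i in range(10)], [f"var{i}" for i in range(100)]),
    "qced": ([f"obs{i}" for i in range(8)], [f"var{i}" for i in range(50)]),
}
print(global_shape(staged, axis=-1))  # (10, 100)
```

In the staged case both axes are unions, so the filtered stage adds no new observations or features to the global tables.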
Nested MuData objects#
Annotations can also be managed for nested MuData objects:
def make_nested_mdata():
    stages = make_staged_mdata()
    stages.obs = stages.obs.loc[:, []]  # pre-0.3
    mod2 = AnnData(np.arange(10000, 12000, 1).reshape(10, -1))
    mod2.obs_names = [f"obs{i}" for i in range(mod2.n_obs)]
    mod2.var_names = [f"mod2:var{i}" for i in range(mod2.n_vars)]
    mdata = MuData({"mod1": stages, "mod2": mod2}, axis=-1)
    mdata.obs["dataset"] = "ref"
    return mdata
mdata = make_nested_mdata()
mdata
MuData object with n_obs × n_vars = 10 × 300 (shared obs and var)
obs: 'dataset'
2 modalities
mod1: MuData object with n_obs × n_vars = 10 × 100 (shared obs and var)
2 modalities
raw: 10 x 100
obs: 'status'
qced: 8 x 50
obs: 'status', 'filtered'
var: 'filtered'
mod2: 10 x 200
print(mdata.mod)
MuData
├─ mod1 MuData [shared obs and var] (10 × 100)
│ ├─ raw AnnData (10 x 100)
│ └─ qced AnnData (8 x 50)
└─ mod2 AnnData (10 x 200)
Propagation is intentionally not recursive, and annotations in the inner mod1 should be explicitly pushed down to individual AnnData objects when desired:
mdata.push_obs()
for mod in mdata.mod.values():
    print(mod.obs.dtypes)
dataset object
dtype: object
dataset object
dtype: object
for mod in mdata["mod1"].mod.values():
    print(mod.obs.dtypes)
status bool
dtype: object
status bool
filtered bool
dtype: object
An example of the recursive push_obs() operation:
def push_obs_rec(mdata: MuData):
    mdata.push_obs()
    for mod in mdata.mod.values():
        if isinstance(mod, MuData):
            push_obs_rec(mod)
push_obs_rec(mdata)
for mod in mdata["mod1"].mod.values():
    assert "dataset" in mod.obs