Managing annotations#

MuData objects have multimodal annotations stored in the same way as AnnData objects. For instance, observations are annotated using the .obs table, and variables are annotated using the .var table.

As observations and variables of the MuData object are derived from observations and variables of individual modalities, it can be useful to copy or move annotations between the global table and the tables of individual modalities.

For this, mudata offers .pull_obs() / .pull_var() methods to copy metadata from individual modalities to the global annotation (.obs or .var). The opposite flow of metadata — from global metadata to individual modalities — can be achieved with .push_obs() / .push_var() methods.

import numpy as np
import pandas as pd
from anndata import AnnData
from mudata import MuData

Annotations in multimodal objects#

Pulling annotations#

A few parameters help to specify which annotations are pulled. Generally, there are two ways of specifying the annotation columns: providing them explicitly with columns=[...] and providing the types of columns to be pulled (e.g. common or unique).

Pulling feature annotations with `.pull_var()`#

For demonstration purposes, we will use a simple MuData object with some annotations for the features:

def make_mdata():
    D1, D2, D3 = 10, 20, 30
    D = D1 + D2 + D3

    mod1 = AnnData(np.arange(0, 100, 0.1).reshape(-1, D1))
    mod1.obs_names = [f"obs{i}" for i in range(mod1.n_obs)]
    mod1.var_names = [f"var{i}" for i in range(D1)]

    mod2 = AnnData(np.arange(3101, 5101, 1).reshape(-1, D2))
    mod2.obs_names = mod1.obs_names.copy()
    mod2.var_names = [f"var{i}" for i in range(D1, D1 + D2)]

    mod3 = AnnData(np.arange(5101, 8101, 1).reshape(-1, D3))
    mod3.obs_names = mod1.obs_names.copy()
    mod3.var_names = [f"var{i}" for i in range(D1 + D2, D)]

    # common column already present in all modalities
    mod1.var["highly_variable"] = True
    mod2.var["highly_variable"] = np.tile([False, True], D2 // 2)
    mod3.var["highly_variable"] = np.tile([True, False], D3 // 2)

    # column present in some (2 out of 3) modalities (non-unique)
    mod2.var["arange"] = np.arange(D2)
    mod3.var["arange"] = np.arange(D3)

    # column present in one modality (unique)
    mod3.var["is_region"] = True

    mdata = MuData({"mod1": mod1, "mod2": mod2, "mod3": mod3})
    return mdata

mdata = make_mdata()
# TODO: shouldn't be needed from 0.4
# mdata.update(pull=False)
mdata.var = mdata.var.loc[:, []]

mdata

MuData object with n_obs × n_vars = 100 × 60
  3 modalities
    mod1:	100 x 10
      var:	'highly_variable'
    mod2:	100 x 20
      var:	'highly_variable', 'arange'
    mod3:	100 x 30
      var:	'highly_variable', 'arange', 'is_region'

All columns. By default, all columns will be pulled:

mdata.pull_var(join_nonunique=True)
mdata.var.dtypes

highly_variable    boolean
arange             float64
mod3:is_region     boolean
dtype: object

# Clean up
mdata.var = mdata.var.loc[:, []]

columns=... Individual columns can be specified to be used in this operation. Both colname and modname:colname formats are supported.

A column that is present across modalities will be pulled from all the modalities:

mdata.pull_var(columns=["highly_variable"])
print(f"{(~pd.isnull(mdata.var.highly_variable)).sum()} values in mdata.var.highly_variable")
mdata.var.dtypes

60 values in mdata.var.highly_variable

highly_variable    boolean
dtype: object

mdata.var = mdata.var.loc[:, []]

Pull particular columns, e.g. a single column from a specified modality:

mdata.pull_var(columns=["mod2:highly_variable"])
print(f"{(~pd.isnull(mdata.var['mod2:highly_variable'])).sum()} values in mdata.var['mod2:highly_variable']")
mdata.var.dtypes

20 values in mdata.var['mod2:highly_variable']

mod2:highly_variable    boolean
dtype: object

mdata.var = mdata.var.loc[:, []]

As a result, mdata.var['mod2:highly_variable'] will be a nullable boolean array with corresponding values from mdata['mod2'].var.highly_variable. The value of highly_variable for features from other modalities is NA.
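
The NA filling described above can be illustrated with plain pandas. This is only a sketch of the alignment idea, not mudata's actual implementation; `global_index` and `mod2_var` are hypothetical stand-ins for the global feature index and the `.var` table of one modality.

```python
import pandas as pd

# Hypothetical stand-ins: a global feature index and one modality's .var table
global_index = pd.Index([f"var{i}" for i in range(6)])
mod2_var = pd.DataFrame(
    {"highly_variable": pd.array([False, True], dtype="boolean")},
    index=["var2", "var3"],
)

# Align the modality column to the global index; features that belong to
# other modalities have no value and are filled with NA
pulled = mod2_var["highly_variable"].reindex(global_index)
pulled.name = "mod2:highly_variable"

print(pulled.dtype)
print(pulled.notna().sum(), "values,", pulled.isna().sum(), "NA")
```

Because the source column uses the nullable boolean dtype, the missing entries can be represented as NA without converting the column to object or float.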

common, nonunique, unique Note that the common annotation is now prefixed with a modality name, as it has been requested from a limited set of modalities. In this case it behaves similarly to a unique column such as is_region in mod3. The third type of annotation is non-unique: columns that are present in some but not all modalities.

mdata.pull_var(common=True, nonunique=True, unique=False)
mdata.var.dtypes

highly_variable    boolean
mod2:arange        float64
mod3:arange        float64
dtype: object

mdata.var = mdata.var.loc[:, []]

This makes it possible to pull only unique, i.e. modality-specific, columns:

# unique column
mdata.pull_var(unique=True, common=False, nonunique=False)
mdata.var.dtypes

mod3:is_region    boolean
dtype: object

mdata.var = mdata.var.loc[:, []]

… just as it is possible to pull a specific unique column without specifying the modality name:

# unique column
mdata.pull_var(columns=["is_region"])
mdata.var.dtypes

mod3:is_region    boolean
dtype: object

mdata.var = mdata.var.loc[:, []]
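
The three column types used above can be derived from how many modalities carry a given column name. The following is a minimal sketch of that rule in plain Python; `classify_columns` and `mod_columns` are hypothetical names introduced for illustration and are not part of the mudata API.

```python
from collections import Counter

# Column names per modality, mirroring the demonstration object above
mod_columns = {
    "mod1": ["highly_variable"],
    "mod2": ["highly_variable", "arange"],
    "mod3": ["highly_variable", "arange", "is_region"],
}


def classify_columns(mod_columns):
    """Label each column by the number of modalities that contain it."""
    counts = Counter(col for cols in mod_columns.values() for col in cols)
    n_mods = len(mod_columns)
    kinds = {}
    for col, n in counts.items():
        if n == n_mods:
            kinds[col] = "common"  # present in every modality
        elif n == 1:
            kinds[col] = "unique"  # present in exactly one modality
        else:
            kinds[col] = "nonunique"  # present in some, but not all
    return kinds


kinds = classify_columns(mod_columns)
print(kinds)
```

In this sketch, a column in a single-modality object would be labelled common rather than unique; how mudata itself treats that edge case is not covered here.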

join_common=..., join_nonunique=... Setting join_common=False or join_nonunique=True changes whether the annotations are collated across modalities. Unique columns are always prefixed by modality name.

Compare join_nonunique=False:

mdata.pull_var(columns=["arange"], join_nonunique=False)
mdata.var.dtypes

mod2:arange    float64
mod3:arange    float64
dtype: object

mdata.var = mdata.var.loc[:, []]

— with join_nonunique=True:

mdata.pull_var(columns=["arange"], join_nonunique=True)
mdata.var.dtypes

arange    float64
dtype: object

mdata.var = mdata.var.loc[:, []]
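
The difference between the two results can be reproduced with plain pandas. This is a sketch of the idea only, under the assumption that features do not overlap between modalities; the series and index below are hypothetical stand-ins, and the code does not reflect how mudata implements the join internally.

```python
import pandas as pd

# Hypothetical per-modality columns; var0 and var1 belong to a modality
# that has no "arange" column at all
mod2_arange = pd.Series([0, 1], index=["var2", "var3"], name="arange")
mod3_arange = pd.Series([0, 1, 2], index=["var4", "var5", "var6"], name="arange")
global_index = pd.Index([f"var{i}" for i in range(7)])

# Not joined: one prefixed column per modality
separate = pd.DataFrame(
    {
        "mod2:arange": mod2_arange.reindex(global_index),
        "mod3:arange": mod3_arange.reindex(global_index),
    }
)

# Joined: the features are disjoint, so the values can be stacked
# into a single column without conflicts
joined = pd.concat([mod2_arange, mod3_arange]).reindex(global_index).to_frame("arange")

print(separate.dtypes)
print(joined.dtypes)
```

In both cases the integer values become float64, because features without a value are filled with NaN. This matches the dtypes shown in the outputs above.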

mods=... It is also possible to limit the number of modalities to pull columns from. For example, columns=["mod1:highly_variable", "mod3:highly_variable"] can also be expressed as

mdata.pull_var(columns=["highly_variable"], mods=["mod1", "mod3"])
mdata.var.dtypes

mod1:highly_variable    boolean
mod3:highly_variable    boolean
dtype: object

mdata.var = mdata.var.loc[:, []]

Last but not least, columns can be automatically dropped from source.

mdata.pull_var(nonunique=False, unique=False, drop=True)
mdata.var.dtypes

highly_variable    boolean
dtype: object

The highly_variable label has thus been effectively moved from the individual modalities to the global annotation:

for mod in mdata.mod.values():
    print("highly_variable" in mod.var.columns)

False
False
False

Pulling samples annotations with `.pull_obs()`#

Annotating individual observations is one of the key steps of analytical workflows. For instance, in single-cell sequencing datasets, observations are individual cells, and annotating their identity (cell type, cell state, etc.) as well as managing their source (tissue, organ, donor identity, species) are pivotal for understanding the underlying biology. Those operations are also complicated by the multi-layered structure of multimodal datasets.

The .pull_obs() method of MuData aims to abstract this complexity away.

For demonstration purposes, we will use a simple MuData object with some annotations for the observations:

def make_mdata():
    N = 100
    D1, D2, D3 = 10, 20, 30
    D = D1 + D2 + D3

    mod1 = AnnData(np.arange(0, 100, 0.1).reshape(-1, D1))
    mod1.obs_names = [f"obs{i}" for i in range(mod1.n_obs)]
    mod1.var_names = [f"var{i}" for i in range(D1)]

    mod2 = AnnData(np.arange(3101, 5101, 1).reshape(-1, D2))
    mod2.obs_names = mod1.obs_names.copy()
    mod2.var_names = [f"var{i}" for i in range(D1, D1 + D2)]

    mod3 = AnnData(np.arange(5101, 8101, 1).reshape(-1, D3))
    mod3.obs_names = mod1.obs_names.copy()
    mod3.var_names = [f"var{i}" for i in range(D1 + D2, D)]

    # common column already present in all modalities
    mod1.obs["qc"] = True
    mod2.obs["qc"] = True
    mod3.obs["qc"] = np.tile([True, False], N // 2)

    # column present in some (2 out of 3) modalities (non-unique)
    mod2.obs["arange"] = np.arange(N)
    mod3.obs["arange"] = np.arange(N, 2 * N)

    # column present in one modality (unique)
    mod3.obs["mod3_cell"] = True

    mdata = MuData({"mod1": mod1, "mod2": mod2, "mod3": mod3})
    return mdata

mdata = make_mdata()
# TODO: shouldn't be needed from 0.4
# mdata.update(pull=False)
mdata.obs = mdata.obs.loc[:, []]

mdata

MuData object with n_obs × n_vars = 100 × 60
  3 modalities
    mod1:	100 x 10
      obs:	'qc'
    mod2:	100 x 20
      obs:	'qc', 'arange'
    mod3:	100 x 30
      obs:	'qc', 'arange', 'mod3_cell'

In a multimodal object, observations are shared across modalities. For this reason, join_* arguments cannot be set to True, and the annotations will always be prefixed with a modality name. Apart from this, the underlying implementation as well as the available parameters are the same as demonstrated above for .var.
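
A short pandas sketch shows why prefixing is needed here. When every modality carries a value for the same observation, a single joined column could not hold them all. The tables below are hypothetical stand-ins for the `.obs` tables of two modalities, and the code is an illustration rather than mudata's implementation.

```python
import pandas as pd

# Both modalities annotate the same observations
obs_names = ["obs0", "obs1", "obs2"]
mod2_obs = pd.DataFrame({"arange": [0, 1, 2]}, index=obs_names)
mod3_obs = pd.DataFrame({"arange": [3, 4, 5]}, index=obs_names)

# Concatenate side by side, keyed by modality name, then flatten the
# (modality, column) pairs into "modality:column" names
pulled = pd.concat({"mod2": mod2_obs, "mod3": mod3_obs}, axis=1)
pulled.columns = [f"{mod}:{col}" for mod, col in pulled.columns]

print(pulled.dtypes)
```

Since every observation has a value in each column, no NA filling is needed and the integer dtype is preserved, which is consistent with the int64 columns in the outputs below.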

All columns. By default, all columns will be pulled:

mdata.pull_obs()
mdata.obs.dtypes

mod1:qc           boolean
mod2:arange         int64
mod2:qc           boolean
mod3:arange         int64
mod3:mod3_cell    boolean
mod3:qc           boolean
dtype: object

# Clean up
mdata.obs = mdata.obs.loc[:, []]

columns=... Individual columns can be specified to be used in this operation. Both colname and modname:colname formats are supported.

A column that is present across modalities will be pulled from all the modalities:

mdata.pull_obs(columns=["qc"])
mdata.obs.dtypes

mod1:qc    boolean
mod2:qc    boolean
mod3:qc    boolean
dtype: object

mdata.obs = mdata.obs.loc[:, []]

Pull particular columns, e.g. a single column from a specified modality:

mdata.pull_obs(columns=["mod2:qc"])
mdata.obs.dtypes

mod2:qc    boolean
dtype: object

mdata.obs = mdata.obs.loc[:, []]

common, nonunique, unique Column types are deduced according to their presence in all, some, or a single modality. Because observations are shared across modalities, they will all be prefixed by a modality name:

mdata.pull_obs(common=True, nonunique=True, unique=False)
mdata.obs.dtypes

mod1:qc        boolean
mod2:arange      int64
mod2:qc        boolean
mod3:arange      int64
mod3:qc        boolean
dtype: object

mdata.obs = mdata.obs.loc[:, []]

So it is possible to pull only unique, i.e. modality-specific, columns:

# unique column
mdata.pull_obs(unique=True, common=False, nonunique=False)
mdata.obs.dtypes

mod3:mod3_cell    boolean
dtype: object

mdata.obs = mdata.obs.loc[:, []]

… just as it is possible to pull a specific unique column without specifying the modality name:

# unique column
mdata.pull_obs(columns=["mod3_cell"])
mdata.obs.dtypes

mod3:mod3_cell    boolean
dtype: object

mdata.obs = mdata.obs.loc[:, []]

mods=... It is also possible to limit the number of modalities to pull columns from. For example, columns=["mod1:qc", "mod3:qc"] can also be expressed as

mdata.pull_obs(columns=["qc"], mods=["mod1", "mod3"])
mdata.obs.dtypes

mod1:qc    boolean
mod3:qc    boolean
dtype: object

mdata.obs = mdata.obs.loc[:, []]

Last but not least, columns can be automatically dropped from source.

mdata.pull_obs(nonunique=False, unique=False, drop=True)
mdata.obs.dtypes

mod1:qc    boolean
mod2:qc    boolean
mod3:qc    boolean
dtype: object

The qc label has thus been effectively moved from the individual modalities to the global annotation:

for mod in mdata.mod.values():
    print("qc" in mod.obs.columns)

False
False
False

Pushing annotations#

Annotations can also be pushed from the global .var or .obs table to the individual modalities.

Pushing feature annotations with `.push_var()`#

For demonstration purposes, we will use a simple MuData object with some global annotations for the features:

def make_mdata():
    D1, D2, D3 = 10, 20, 30
    D = D1 + D2 + D3

    mod1 = AnnData(np.arange(0, 100, 0.1).reshape(-1, D1))
    mod1.obs_names = [f"obs{i}" for i in range(mod1.n_obs)]
    mod1.var_names = [f"var{i}" for i in range(D1)]

    mod2 = AnnData(np.arange(3101, 5101, 1).reshape(-1, D2))
    mod2.obs_names = mod1.obs_names.copy()
    mod2.var_names = [f"var{i}" for i in range(D1, D1 + D2)]

    mod3 = AnnData(np.arange(5101, 8101, 1).reshape(-1, D3))
    mod3.obs_names = mod1.obs_names.copy()
    mod3.var_names = [f"var{i}" for i in range(D1 + D2, D)]

    mdata = MuData({"mod1": mod1, "mod2": mod2, "mod3": mod3})

    # common column to be propagated to all modalities
    mdata.var["highly_variable"] = True

    # prefix column to be propagated to the respective modalities
    mdata.var["mod2:if_mod2"] = np.concatenate(
        [
            np.repeat(pd.NA, D1),
            np.repeat(True, D2),
            np.repeat(pd.NA, D3),
        ]
    )

    return mdata

mdata = make_mdata()

mdata

MuData object with n_obs × n_vars = 100 × 60
  var:	'highly_variable'
  3 modalities
    mod1:	100 x 10
    mod2:	100 x 20
    mod3:	100 x 30

push_var() will add a highly_variable column to each modality and an if_mod2 column to the mod2 modality:

mdata.push_var()

for m in mdata.mod.keys():
    print(mdata[m].var.dtypes)

highly_variable    bool
dtype: object
highly_variable      bool
if_mod2            object
dtype: object
highly_variable    bool
dtype: object

# Clean up
for m in mdata.mod.keys():
    mdata[m].var = mdata[m].var.loc[:, []]

common=, prefixed= options can be used to adjust the selection of columns to be pushed — non-prefixed (common) ones and/or the ones prefixed with modality name.

Only common:

mdata.push_var(common=True, prefixed=False)

for m in mdata.mod.keys():
    print(mdata[m].var.dtypes)

highly_variable    bool
dtype: object
highly_variable    bool
dtype: object
highly_variable    bool
dtype: object

# Clean up
for m in mdata.mod.keys():
    mdata[m].var = mdata[m].var.loc[:, []]

… or only prefixed columns can be pushed:

mdata.push_var(common=False, prefixed=True)

for m in mdata.mod.keys():
    print(mdata[m].var.dtypes)

Series([], dtype: object)
if_mod2    object
dtype: object
Series([], dtype: object)

# Clean up
for m in mdata.mod.keys():
    mdata[m].var = mdata[m].var.loc[:, []]

Prefixed columns are pushed to the respective modalities.
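
The routing rule can be sketched with plain pandas: non-prefixed columns go to every modality, while a `modname:colname` column goes only to `modname`, with the prefix stripped. `route_columns`, `global_var` and `mod_features` are hypothetical names used for illustration; this is not mudata's implementation.

```python
import pandas as pd

# Hypothetical global table: var0-var1 belong to mod1, var2-var3 to mod2
global_var = pd.DataFrame(
    {
        "highly_variable": [True, True, True, True],
        "mod2:if_mod2": [pd.NA, pd.NA, True, True],
    },
    index=["var0", "var1", "var2", "var3"],
)
mod_features = {"mod1": ["var0", "var1"], "mod2": ["var2", "var3"]}


def route_columns(global_var, mod_features):
    """Return, per modality, the slice of the global table it would receive."""
    routed = {}
    for mod, features in mod_features.items():
        cols = {}
        for name in global_var.columns:
            prefix, sep, rest = name.partition(":")
            if sep and prefix in mod_features:
                # Prefixed column: only for the named modality, prefix stripped
                if prefix == mod:
                    cols[rest] = global_var.loc[features, name]
            else:
                # Non-prefixed (common) column: goes to every modality
                cols[name] = global_var.loc[features, name]
        routed[mod] = pd.DataFrame(cols, index=features)
    return routed


routed = route_columns(global_var, mod_features)
for mod, table in routed.items():
    print(mod, list(table.columns))
```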

Alternatively, columns= accepts an explicit list of columns to be propagated to modalities, and mods= limits the modalities that annotations are propagated to:

mdata.push_var(columns=["highly_variable"], mods=["mod3"])

for m in mdata.mod.keys():
    print(mdata[m].var.dtypes)

Series([], dtype: object)
Series([], dtype: object)
highly_variable    bool
dtype: object

# Clean up
for m in mdata.mod.keys():
    mdata[m].var = mdata[m].var.loc[:, []]

Annotations can also be dropped from mdata.var after pushing them to individual modalities with drop=True, or just dropped without propagation with only_drop=True:

mdata.push_var(prefixed=False, drop=True)
mdata.push_var(columns=["if_mod2"], only_drop=True)

This will propagate the highly_variable column to all the modalities and drop it from mdata.var, and will also drop the if_mod2 column from mdata.var:

print(f"mdata.var columns:\n{mdata.var.dtypes}")

mdata.var columns:
Series([], dtype: object)

print(f"mdata['mod2'].var columns:\n{mdata['mod2'].var.dtypes}")

mdata['mod2'].var columns:
highly_variable    bool
dtype: object

Pushing samples annotations with `.push_obs()`#

For demonstration purposes, we will use a simple MuData object with some global annotations for the observations:

def make_mdata():
    D1, D2 = 10, 20

    mod1 = AnnData(np.arange(0, 100, 0.1).reshape(-1, D1))
    mod1.obs_names = [f"obs{i}" for i in range(mod1.n_obs)]
    mod1.var_names = [f"var{i}" for i in range(D1)]

    mod2 = AnnData(np.arange(3101, 5101, 1).reshape(-1, D2))
    mod2.obs_names = mod1.obs_names.copy()
    mod2.var_names = [f"var{i}" for i in range(D1, D1 + D2)]

    mdata = MuData({"mod1": mod1, "mod2": mod2})

    # common column to be propagated to all modalities
    mdata.obs["true"] = True

    return mdata

mdata = make_mdata()

mdata

MuData object with n_obs × n_vars = 100 × 30
  obs:	'true'
  2 modalities
    mod1:	100 x 10
    mod2:	100 x 20

push_obs() will add a true column to each modality:

mdata.push_obs()

for m in mdata.mod.keys():
    print(mdata[m].obs.dtypes)

true    bool
dtype: object
true    bool
dtype: object

# Clean up
for m in mdata.mod.keys():
    mdata[m].obs = mdata[m].obs.loc[:, []]

common=, prefixed= options can be used to adjust the selection of columns to be pushed — non-prefixed (common) ones and/or the ones prefixed with modality name:

mdata.push_obs(common=False)

for m in mdata.mod.keys():
    print(mdata[m].obs.dtypes)

Series([], dtype: object)
Series([], dtype: object)

# Clean up
for m in mdata.mod.keys():
    mdata[m].obs = mdata[m].obs.loc[:, []]

Alternatively, columns= accepts an explicit list of columns to be propagated to modalities, and mods= limits the modalities that annotations are propagated to:

mdata.push_obs(columns=["true"], mods=["mod2"])

for m in mdata.mod.keys():
    print(f"modality {m}:")
    print(mdata[m].obs.dtypes)
    print()

modality mod1:
Series([], dtype: object)

modality mod2:
true    bool
dtype: object

# Clean up
for m in mdata.mod.keys():
    mdata[m].obs = mdata[m].obs.loc[:, []]

Annotations can also be dropped from mdata.obs after pushing them to individual modalities with drop=True, or just dropped without propagation with only_drop=True:

mdata.push_obs(only_drop=True)

This will just drop the mdata.obs.true column:

print(f"mdata.obs columns:\n{mdata.obs.dtypes}")

mdata.obs columns:
Series([], dtype: object)

print(f"mdata['mod2'].obs columns:\n{mdata['mod2'].obs.dtypes}")

mdata['mod2'].obs columns:
Series([], dtype: object)

Multi-dataset annotations#

The axes interface enables MuData to be used beyond multimodal data. This includes multi-dataset containers with axis=1 (shared features) and data subsets with axis=-1 (shared observations and features).

def make_mdata():
    N1, N2, N3 = 10, 20, 30
    N = N1 + N2 + N3
    D = 100

    mod1 = AnnData(np.arange(0, 100, 0.1).reshape(N1, -1))
    mod1.obs_names = [f"obs{i}" for i in range(N1)]
    mod1.var_names = [f"var{i}" for i in range(D)]

    mod2 = AnnData(np.arange(3101, 5101, 1).reshape(N2, -1))
    mod2.obs_names = [f"obs{i}" for i in range(N1, N1 + N2)]
    mod2.var_names = mod1.var_names.copy()

    mod3 = AnnData(np.arange(5101, 8101, 1).reshape(N3, -1))
    mod3.obs_names = [f"obs{i}" for i in range(N1 + N2, N)]
    mod3.var_names = mod1.var_names.copy()

    # common column already present in all modalities
    mod1.obs["dataset"] = "dataset1"
    mod2.obs["dataset"] = "dataset2"
    mod3.obs["dataset"] = "dataset3"

    # column present in some (2 out of 3) modalities (non-unique)
    mod2.obs["species"] = "human"
    mod3.obs["species"] = "mouse"

    # column present in one modality (unique)
    mod3.obs["reference"] = True

    mdata = MuData({"mod1": mod1, "mod2": mod2, "mod3": mod3}, axis=1)
    return mdata

mdata = make_mdata()
# TODO: shouldn't be needed from 0.4
# mdata.update(pull=False)
mdata.obs = mdata.obs.loc[:, []]
mdata.var = mdata.var.loc[:, []]

mdata

MuData object with n_obs × n_vars = 60 × 100 (shared var) 
  3 modalities
    mod1:	10 x 100
      obs:	'dataset'
    mod2:	20 x 100
      obs:	'dataset', 'species'
    mod3:	30 x 100
      obs:	'dataset', 'species', 'reference'

mdata.pull_obs(join_nonunique=True, prefix_unique=False)
mdata.obs.dtypes

dataset       object
species       object
reference    boolean
dtype: object

mdata.pull_var()
mdata.var.dtypes

Series([], dtype: object)

Stages annotations#

MuData objects with mdata.axis == -1 can contain “modalities” that have both samples and features shared. This can be useful, for example, for storing different processing stages, with both samples and features being filtered out by some quality control (QC) procedures.

As with the other axes, .pull_obs()/.pull_var() and .push_obs()/.push_var() work here as well.

def make_staged_mdata():
    N, D = 10, 100
    Nsub, Dsub = 8, 50

    mod1 = AnnData(np.arange(0, 100, 0.1).reshape(N, D))
    mod1.obs_names = [f"obs{i}" for i in range(N)]
    mod1.var_names = [f"var{i}" for i in range(D)]

    mod2 = AnnData(np.arange(3101, 3501, 1).reshape(Nsub, Dsub))
    mod2.obs_names = [f"obs{i}" for i in range(Nsub)]
    mod2.var_names = [f"var{i}" for i in range(Dsub)]

    # common column already present in all modalities
    mod1.obs["status"] = True
    mod2.obs["status"] = True

    # column present in one modality (unique)
    mod2.obs["filtered"] = True
    mod2.var["filtered"] = True

    mdata = MuData({"raw": mod1, "qced": mod2}, axis=-1)
    return mdata

mdata = make_staged_mdata()
# TODO: shouldn't be needed from 0.4
# mdata.update(pull=False)
mdata.obs = mdata.obs.loc[:, []]
mdata.var = mdata.var.loc[:, []]

mdata

MuData object with n_obs × n_vars = 10 × 100 (shared obs and var) 
  2 modalities
    raw:	10 x 100
      obs:	'status'
    qced:	8 x 50
      obs:	'status', 'filtered'
      var:	'filtered'

mdata.pull_obs(prefix_unique=False)
mdata.obs.dtypes

raw:status     boolean
filtered       boolean
qced:status    boolean
dtype: object

mdata.pull_var(prefix_unique=False)
mdata.var.dtypes

filtered    boolean
dtype: object

Nested `MuData` objects#

Annotations can also be managed for nested MuData objects:

def make_nested_mdata():
    stages = make_staged_mdata()
    stages.obs = stages.obs.loc[:, []]  # pre-0.3

    mod2 = AnnData(np.arange(10000, 12000, 1).reshape(10, -1))
    mod2.obs_names = [f"obs{i}" for i in range(mod2.n_obs)]
    mod2.var_names = [f"mod2:var{i}" for i in range(mod2.n_vars)]

    mdata = MuData({"mod1": stages, "mod2": mod2}, axis=-1)

    mdata.obs["dataset"] = "ref"

    return mdata

mdata = make_nested_mdata()
mdata

MuData object with n_obs × n_vars = 10 × 300 (shared obs and var) 
  obs:	'dataset'
  2 modalities
    mod1:	MuData object with n_obs × n_vars = 10 × 100 (shared obs and var) 
      2 modalities
        raw:	10 x 100
          obs:	'status'
        qced:	8 x 50
          obs:	'status', 'filtered'
          var:	'filtered'
    mod2:	10 x 200

print(mdata.mod)

MuData
├─ mod1 MuData [shared obs and var] (10 × 100)
│  ├─ raw AnnData (10 x 100)
│  └─ qced AnnData (8 x 50)
└─ mod2 AnnData (10 x 200)

Propagation is intentionally not recursive, and annotations in the inner mod1 should be explicitly pushed down to individual AnnData objects when desired:

mdata.push_obs()

for mod in mdata.mod.values():
    print(mod.obs.dtypes)

dataset    object
dtype: object
dataset    object
dtype: object

for mod in mdata["mod1"].mod.values():
    print(mod.obs.dtypes)

status    bool
dtype: object
status      bool
filtered    bool
dtype: object

An example of the recursive push_obs() operation:

def push_obs_rec(mdata: MuData):
    mdata.push_obs()
    for mod in mdata.mod.values():
        if isinstance(mod, MuData):
            push_obs_rec(mod)

push_obs_rec(mdata)

for mod in mdata["mod1"].mod.values():
    assert "dataset" in mod.obs
