Utilities Module

celltag_tools.utils.jaccard_similarities(mat)

Computes the Jaccard similarity for all pairs of columns in a given sparse matrix.

Args:
mat (scipy.sparse.spmatrix):

A binary sparse matrix (rows x columns). Each column is a feature vector to be compared with every other column.

Returns:
scipy.sparse.spmatrix:

A square sparse matrix of shape (n_columns, n_columns), i.e. the shape of mat.T * mat, where each entry (i, j) is the Jaccard similarity between columns i and j of the input matrix. The diagonal is set to 0.
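
As an illustration of the computation described above, here is a minimal sketch rather than the library's implementation. It assumes mat is strictly binary; the name jaccard_columns_sketch is hypothetical. Intersections come from mat.T @ mat, and unions from the column sums.

```python
import numpy as np
from scipy import sparse


def jaccard_columns_sketch(mat):
    """Pairwise Jaccard similarity between columns of a binary sparse matrix."""
    mat = sparse.csc_matrix(mat, dtype=np.int64)
    # Entry (i, j) of mat.T @ mat counts rows where columns i and j are both 1.
    inter = (mat.T @ mat).tocoo()
    # Column sums give the number of nonzero rows in each column.
    col_sums = np.asarray(mat.sum(axis=0)).ravel()
    # Drop the diagonal so that self-similarities are reported as 0.
    keep = inter.row != inter.col
    rows, cols, shared = inter.row[keep], inter.col[keep], inter.data[keep]
    # Union size = |A| + |B| - intersection size.
    union = col_sums[rows] + col_sums[cols] - shared
    return sparse.csr_matrix((shared / union, (rows, cols)), shape=inter.shape)


example = sparse.csr_matrix(np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1]]))
similarities = jaccard_columns_sketch(example).toarray()
```

Only column pairs that share at least one row are stored, so the result stays sparse when most columns do not overlap.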

celltag_tools.utils.table_to_spmtx(row_data, col_data, count_data)

Converts row, column, and count data into a CSR (Compressed Sparse Row) matrix.

Args:
row_data (array-like):

Row labels (e.g., cell barcodes).

col_data (array-like):

Column labels (e.g., CellTag identifiers).

count_data (array-like):

Counts or other values to populate the sparse matrix.

Returns:
tuple:
A tuple (celltag_mat, cells, celltags) where:
  • celltag_mat (scipy.sparse.csr_matrix): The constructed sparse matrix of shape (len(cells), len(celltags)).

  • cells (numpy.ndarray): Sorted unique row labels.

  • celltags (numpy.ndarray): Sorted unique column labels.
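
The conversion can be sketched with numpy.unique and the COO-style csr_matrix constructor. This is an assumption about how such a function might work, not the library's code; table_to_spmtx_sketch is a hypothetical name. Note that in this sketch repeated (row, column) pairs are summed, which is the behavior of the scipy constructor.

```python
import numpy as np
from scipy import sparse


def table_to_spmtx_sketch(row_data, col_data, count_data):
    """Build a CSR matrix from long-format (row label, column label, value) data."""
    # return_inverse gives, for each entry, its index in the sorted unique labels.
    cells, row_idx = np.unique(np.asarray(row_data), return_inverse=True)
    celltags, col_idx = np.unique(np.asarray(col_data), return_inverse=True)
    mat = sparse.csr_matrix(
        (np.asarray(count_data), (row_idx, col_idx)),
        shape=(len(cells), len(celltags)),
    )
    return mat, cells, celltags


mat, cells, celltags = table_to_spmtx_sketch(
    ["c2", "c1", "c2"], ["t1", "t1", "t3"], [5, 2, 7]
)
```

Row i of the matrix corresponds to cells[i] and column j to celltags[j], so the two label arrays are needed to interpret the matrix.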

celltag_tools.utils.check_mtx_dict(target_mtx_dict)

Validates that the provided matrix dictionary conforms to the expected structure for CellTagData matrices (e.g., allow_mtx, bin_mtx, metric_mtx).

Args:
target_mtx_dict (celltag_mtx_dict):
A dictionary-like object expected to contain:
  • 'mtx': A scipy.sparse.spmatrix

  • 'cells': A numpy.ndarray of cell identifiers

  • 'celltags': A numpy.ndarray of CellTag identifiers

Raises:
ValueError: If target_mtx_dict is not a celltag_mtx_dict, if it does not have exactly three keys ('mtx', 'cells', 'celltags'), or if the types of those values are incorrect.
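
The checks listed above can be sketched as follows. MtxDictStandIn is a hypothetical stand-in for celltag_mtx_dict, whose definition is not shown here, and check_mtx_dict_sketch is likewise an illustrative name; the real function may order or word its checks differently.

```python
import numpy as np
from scipy import sparse


class MtxDictStandIn(dict):
    """Hypothetical stand-in for celltag_mtx_dict (not the library class)."""


def check_mtx_dict_sketch(target):
    """Raise ValueError unless target has the expected type, keys, and value types."""
    if not isinstance(target, MtxDictStandIn):
        raise ValueError("expected a celltag_mtx_dict-like object")
    if set(target.keys()) != {"mtx", "cells", "celltags"}:
        raise ValueError("expected exactly the keys 'mtx', 'cells', 'celltags'")
    if not sparse.issparse(target["mtx"]):
        raise ValueError("'mtx' must be a scipy sparse matrix")
    for key in ("cells", "celltags"):
        if not isinstance(target[key], np.ndarray):
            raise ValueError(f"'{key}' must be a numpy.ndarray")


valid = MtxDictStandIn(
    mtx=sparse.csr_matrix((2, 3)),
    cells=np.array(["c1", "c2"]),
    celltags=np.array(["t1", "t2", "t3"]),
)
check_mtx_dict_sketch(valid)  # passes silently
```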

celltag_tools.utils.find_homoplasy(n_cells, moi, barcode_abundance, ct_min=2, ct_max=25, n_iters=1000, verbose=False)

Simulates CellTag signatures in a population of cells to estimate the rate of CellTag signature duplication (homoplasy) across unrelated cells (i.e. false clones).

In each iteration:

  1. A Poisson-distributed random count of CellTags is assigned to each cell (mean = moi).

  2. Cells with CellTag counts outside [ct_min, ct_max] are filtered out.

  3. CellTags are sampled from the provided abundance distribution and assigned to each remaining cell.

  4. The duplication rate is computed as the fraction of cell pairs sharing the exact same CellTag signature.

Args:
n_cells (int):

The number of cells to simulate in each iteration (prior to filtering).

moi (float):

The mean of the Poisson distribution from which the CellTag counts per cell are drawn.

barcode_abundance (pd.DataFrame | list):

A DataFrame containing CellTag abundances (first column) with barcodes as the index, or a list of barcodes (assumed uniform abundance).

ct_min (int, optional):

The minimum allowed number of CellTags in a cell (inclusive). Defaults to 2.

ct_max (int, optional):

The maximum allowed number of CellTags in a cell (inclusive). Defaults to 25.

n_iters (int, optional):

The number of Monte Carlo simulation iterations to run. Defaults to 1000.

verbose (bool, optional):

If True, prints progress messages every 10 iterations. Defaults to False.

Returns:
list[float]:

A list of duplication rates (homoplasy) across the simulation iterations. Each entry represents the duplication rate in one iteration.

Raises:
ValueError:

If barcode_abundance is neither a DataFrame nor a list.

Example:
>>> # Using a uniform abundance of barcodes
>>> homoplasy_rates = find_homoplasy(
...     n_cells=1000,
...     moi=5,
...     barcode_abundance=["tagA", "tagB", "tagC"],
...     ct_min=2,
...     ct_max=25,
...     n_iters=10,
...     verbose=True
... )
>>> print(homoplasy_rates)

Notes:
  • The duplication rate is the proportion of pairs of cells that share the exact same set of CellTags. It’s computed as:

    net_dup_pairs / comb(len(filtered_cells), 2).

  • comb(x, 2) is shorthand for binomial coefficient C(x, 2) = x*(x-1)/2.
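
A single simulation iteration and the duplication-rate formula can be sketched as below. This is a simplified illustration under stated assumptions, not the library's code: the function names are hypothetical, CellTags are sampled with replacement, and a signature is treated as the set of distinct CellTags in a cell. The source does not specify either choice.

```python
from collections import Counter
from math import comb

import numpy as np


def pair_duplication_rate(signatures):
    """Fraction of cell pairs that share an identical signature."""
    if len(signatures) < 2:
        return 0.0
    # A signature seen in c cells contributes C(c, 2) duplicated pairs.
    dup_pairs = sum(comb(c, 2) for c in Counter(signatures).values())
    return dup_pairs / comb(len(signatures), 2)


def duplication_rate_once(n_cells, moi, barcodes, probs=None,
                          ct_min=2, ct_max=25, rng=None):
    """One Monte Carlo iteration (assumes sampling with replacement)."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: Poisson-distributed number of CellTags per cell.
    counts = rng.poisson(moi, size=n_cells)
    # Step 2: keep cells whose count lies in [ct_min, ct_max].
    counts = counts[(counts >= ct_min) & (counts <= ct_max)]
    # Step 3: draw CellTags for each remaining cell; probs=None means uniform.
    signatures = [
        frozenset(rng.choice(barcodes, size=k, replace=True, p=probs))
        for k in counts
    ]
    # Step 4: duplicated pairs divided by all pairs of filtered cells.
    return pair_duplication_rate(signatures)


# Three of four cells share a signature: C(3, 2) / C(4, 2) = 3 / 6.
rate = pair_duplication_rate(["A", "A", "A", "B"])
```

With a very small barcode library, as in the three-tag example above, nearly all cells collide, so realistic estimates need an abundance table that reflects the actual library.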

celltag_tools.utils.get_clone_cell_embed(adata_obj, ct_obj, clone_weight=1)

Creates a combined AnnData object containing both single-cell RNA data and clone-level “pseudo-cells” co-embedded in a knowledge graph, based on the connectivities in adata_obj and the clone assignments in ct_obj.clone_table.

The new connectivity graph is constructed by:

  1. Scaling down the original adata_obj.obsp['connectivities'] by 1 / clone_weight if clone_weight >= 1.

  2. Building a sparse clone-cell connectivity matrix from ct_obj.clone_table.

  3. Combining the two connectivity matrices into a larger graph with rows/columns for both cells and clones.

  4. Storing the result in adata_obj_coembed.obsp['connectivities'].

Args:
adata_obj (anndata.AnnData):

The AnnData object containing single-cell data and a precomputed neighbors graph in adata_obj.obsp['connectivities'].

ct_obj (CellTagData):

A valid CellTagData object containing a clone_table with columns for clone IDs and cell barcodes.

clone_weight (float, optional):

A scaling factor for weighting or penalizing the clone-cell connections relative to cell-cell connections. Defaults to 1.

Returns:
anndata.AnnData:

A new AnnData object containing:

  • .obs_names: The concatenation of the original cell barcodes and the clone IDs.

  • .obsp['connectivities']: The merged connectivity matrix for cells and clones.

  • .uns['neighbors']: Copied parameters from the original adata_obj.

Raises:
ValueError: If ct_obj is invalid or missing clone_table, or if adata_obj.obsp['connectivities'] is empty.
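
The graph construction described for get_clone_cell_embed can be sketched on plain scipy matrices, without AnnData. This is an illustration under assumptions, not the library's implementation: coembed_graph_sketch and its arguments are hypothetical, clone-cell edges are given weight 1, and the cell-cell graph is left unscaled when clone_weight < 1 because the source does not describe that case.

```python
import numpy as np
from scipy import sparse


def coembed_graph_sketch(cell_conn, cell_names, clone_ids, clone_cells,
                         clone_weight=1.0):
    """Merge a cell-cell graph with clone-cell edges into one block matrix."""
    cell_names = np.asarray(cell_names)
    # Scale down cell-cell connectivities relative to clone-cell edges.
    if clone_weight >= 1:
        cell_conn = cell_conn / clone_weight
    # Build a (clones x cells) membership matrix from the clone table columns.
    clones, clone_idx = np.unique(np.asarray(clone_ids), return_inverse=True)
    cell_pos = {name: i for i, name in enumerate(cell_names)}
    cell_idx = np.array([cell_pos[c] for c in clone_cells])
    clone_cell = sparse.csr_matrix(
        (np.ones(len(cell_idx)), (clone_idx, cell_idx)),
        shape=(len(clones), len(cell_names)),
    )
    # Block layout: cells first, then clones; no clone-clone edges (None block).
    merged = sparse.bmat(
        [[cell_conn, clone_cell.T], [clone_cell, None]], format="csr"
    )
    names = np.concatenate([cell_names.astype(str), clones.astype(str)])
    return merged, names


conn = sparse.csr_matrix(np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float))
merged, names = coembed_graph_sketch(
    conn, ["c0", "c1", "c2"], ["k1", "k1"], ["c0", "c2"], clone_weight=2
)
```

Placing the clone rows after the cell rows matches the documented ordering of .obs_names, where clone IDs follow the original cell barcodes.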

celltag_tools.utils.merge_nn(nn_graph, all_cells, cell_list)

Merges a given list of cells with their nearest neighbors as defined by a nearest-neighbor graph.

For each cell in cell_list, the function retrieves its neighbors from nn_graph (row corresponding to that cell in all_cells) and unions them into a set.

Args:
nn_graph (scipy.sparse.spmatrix or numpy.ndarray):

A nearest-neighbor matrix where row i contains nonzero entries at the columns corresponding to the neighbors of cell i.

all_cells (array-like):

A list or array of all cell identifiers, matching the rows/columns of nn_graph.

cell_list (array-like):

A list of cell identifiers whose neighbors should be collected together.

Returns:
set:

A set of cell identifiers including all cell_list cells plus any of their nearest neighbors found in nn_graph.
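
The neighbor union can be sketched as follows. This is a minimal illustration, assuming every cell in cell_list appears in all_cells and that any stored nonzero entry in a row marks a neighbor; merge_nn_sketch is a hypothetical name, not the library function.

```python
import numpy as np
from scipy import sparse


def merge_nn_sketch(nn_graph, all_cells, cell_list):
    """Return cell_list plus every neighbor of its cells in nn_graph."""
    all_cells = np.asarray(all_cells)
    # Accept dense or sparse input by converting to CSR once.
    graph = sparse.csr_matrix(nn_graph)
    position = {cell: i for i, cell in enumerate(all_cells)}
    merged = set(cell_list)
    for cell in cell_list:
        # Column indices of the stored entries in this cell's row.
        neighbor_idx = graph[position[cell]].indices
        merged.update(all_cells[neighbor_idx].tolist())
    return merged


graph = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])
neighbors = merge_nn_sketch(graph, ["a", "b", "c", "d"], ["a", "c"])
```

Only direct neighbors are collected; neighbors of neighbors are not followed.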