Tools Module (tl)

celltag_tools.tools.read_celltag(celltag_path, sample_prefix=None, assay='RNA', triplet_th=1, starcode_th=2, starcode_path=None, allowlist_path=None, inplace=True)

Reads and processes CellTag read data from one or more files, applying filtering, error correction, and allowlisting. Depending on inplace, returns either a new CellTagData object or the processed data as a tuple.

Args:
celltag_path (str | list[str]):

Path or list of paths to the CellTag read files (TSV format).

sample_prefix (str | list[str], optional):

Prefix or list of prefixes to be added to CellTag barcodes. If not provided and multiple paths are specified, prefixes are autogenerated.

assay (str, optional):

Single-cell assay type. Must be either “RNA” or “ATAC”. Defaults to “RNA”.

triplet_th (int, optional):

Threshold for filtering out read triplets (UMI or read occurrences) below this count. Defaults to 1.

starcode_th (int, optional):

Edit distance threshold for collapsing barcodes via Starcode. Defaults to 2.

starcode_path (str, optional):

Path to the Starcode installation directory. Must contain the executable ‘starcode’.

allowlist_path (str, optional):

Path to the allowlist file (TSV) containing valid CellTags.

inplace (bool, optional):

If True, returns a CellTagData object with the processed data set inside it. If False, returns a tuple containing (processed reads, thresholds, sequencing saturation). Defaults to True.

Returns:
CellTagData:

If inplace=True, returns a CellTagData object with ct_reads, thresholds, and seq_sat set.

tuple:
If inplace=False, returns a tuple of:
  • pd.DataFrame: Processed CellTag read data.

  • dict: Dictionary containing thresholds {‘starcode’: starcode_th, ‘triplet’: triplet_th}.

  • float: Sequencing saturation percentage.

Raises:
ValueError: If any of the provided file paths do not exist, if Starcode is not found, if the allowlist is missing, or if assay is invalid (“RNA” or “ATAC” only).
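The triplet_th filter can be pictured with a small standalone sketch. This is not the library's implementation: it assumes each read is a (cell barcode, UMI, CellTag) tuple, and the helper name filter_triplets is hypothetical. Triplets observed fewer than triplet_th times are dropped.

```python
from collections import Counter


def filter_triplets(reads, triplet_th=1):
    """Keep (cell, UMI, CellTag) triplets seen at least `triplet_th` times.

    `reads` is an iterable of (cell_barcode, umi, celltag) tuples, one per
    read (an assumed layout, not the library's file format).
    Returns a dict mapping each retained triplet to its read count.
    """
    counts = Counter(reads)
    return {triplet: n for triplet, n in counts.items() if n >= triplet_th}


reads = [
    ("cellA", "umi1", "TAG1"),
    ("cellA", "umi1", "TAG1"),  # same triplet read twice
    ("cellA", "umi2", "TAG2"),
    ("cellB", "umi3", "TAG1"),
]

# With a threshold of 2, only the triplet supported by two reads survives.
kept = filter_triplets(reads, triplet_th=2)
```

With the default triplet_th=1 nothing is removed, since every observed triplet has at least one read.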

celltag_tools.tools.create_allow_mtx(ct_obj, overwrite=False, inplace=True)

Creates a sparse allow matrix (cell x CellTag) in the provided CellTagData object, using either UMI counts (for RNA) or read counts (for ATAC).

Args:
ct_obj (CellTagData):

A valid CellTagData object containing the ‘ct_reads’ attribute.

overwrite (bool, optional):

If False (default), raises an error if an allow matrix already exists. If True, overwrites the existing allow matrix.

inplace (bool, optional):

If True (default), updates the allow_mtx attribute within the ct_obj. If False, returns the created allow matrix and associated row/column labels.

Returns:
tuple:
If inplace=False, returns (allow_mtx, allow_rows, allow_cols), where:
  • allow_mtx (scipy.sparse.csr_matrix): The constructed allow matrix.

  • allow_rows (list): List of cell barcodes (rows).

  • allow_cols (list): List of allowed CellTags (columns).

Raises:
ValueError: If ct_obj is not a CellTagData object or if the allow matrix already exists and overwrite=False.
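As a rough illustration of what a cell x CellTag count matrix holds for the RNA case, the sketch below counts distinct UMIs per (cell, CellTag) pair. It is an assumption-laden toy: the function name build_count_matrix and the tuple layout are hypothetical, allowlist filtering is not shown, and a dense list of lists stands in for the scipy.sparse.csr_matrix that the function returns.

```python
from collections import defaultdict


def build_count_matrix(triplets):
    """Build a dense cell x CellTag matrix of distinct-UMI counts.

    `triplets` is an iterable of (cell_barcode, umi, celltag) tuples
    (an assumed layout). Returns (matrix, cells, celltags).
    """
    umis = defaultdict(set)
    for cell, umi, tag in triplets:
        umis[(cell, tag)].add(umi)  # duplicates of a UMI count once

    cells = sorted({cell for cell, _ in umis})
    celltags = sorted({tag for _, tag in umis})
    row = {cell: i for i, cell in enumerate(cells)}
    col = {tag: j for j, tag in enumerate(celltags)}

    matrix = [[0] * len(celltags) for _ in cells]
    for (cell, tag), seen in umis.items():
        matrix[row[cell]][col[tag]] = len(seen)
    return matrix, cells, celltags


triplets = [
    ("cellA", "u1", "TAG1"),
    ("cellA", "u2", "TAG1"),
    ("cellA", "u2", "TAG1"),  # repeated UMI, counted once
    ("cellA", "u3", "TAG2"),
    ("cellB", "u4", "TAG2"),
]
mtx, rows, cols = build_count_matrix(triplets)
```

The returned rows and cols play the same role as allow_rows and allow_cols: they label the matrix axes so later steps can map indices back to barcodes and CellTags.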

celltag_tools.tools.create_bin_mtx(ct_obj, bin_th=1, overwrite=False, inplace=True)

Binarizes the allow matrix from a CellTagData object. The resulting matrix is stored in bin_mtx if ‘inplace=True’, or returned. Values above bin_th are set to True (1), else False (0).

Args:
ct_obj (CellTagData):

A valid CellTagData object containing an allow matrix in allow_mtx.

bin_th (int, optional):

Threshold for binarization. Defaults to 1.

overwrite (bool, optional):

If False (default), raises an error if a binarized matrix already exists. If True, overwrites the existing bin_mtx.

inplace (bool, optional):

If True (default), updates bin_mtx attribute within ct_obj. If False, returns the binarized matrix and associated row/column labels.

Returns:
tuple:
If inplace=False, returns (ct_bin_mtx, cells, celltags), where:
  • ct_bin_mtx (scipy.sparse.csr_matrix): The binarized matrix.

  • cells (list): List of cell barcodes (rows).

  • celltags (list): List of CellTags (columns).

Raises:
ValueError: If ct_obj is not a CellTagData object, if the allow_mtx is missing/invalid, or if a binarized matrix already exists and overwrite=False.

celltag_tools.tools.create_metric_mtx(ct_obj, met_lower=1, met_upper=25, overwrite=False, inplace=True)

Performs metric-based filtering on the binarized cell x CellTag matrix to remove cells with too few or too many CellTags (defined by met_lower and met_upper). Produces a filtered matrix stored in metric_mtx if ‘inplace=True’, or returns it.

Args:
ct_obj (CellTagData):

A valid CellTagData object containing a binarized matrix in bin_mtx.

met_lower (int, optional):

Minimum number of CellTags required for a cell to be retained. Defaults to 1.

met_upper (int, optional):

Maximum number of CellTags allowed for a cell to be retained. Defaults to 25.

overwrite (bool, optional):

If False (default), raises an error if a metric matrix already exists. If True, overwrites the existing metric_mtx.

inplace (bool, optional):

If True (default), updates metric_mtx attribute within ct_obj. If False, returns the filtered matrix and associated row/column labels.

Returns:
tuple:
If inplace=False, returns (celltag_mat_met, cells_met, celltags_met), where:
  • celltag_mat_met (scipy.sparse.csr_matrix): The filtered (metric) matrix.

  • cells_met (ndarray): Array of filtered cell barcodes (rows).

  • celltags_met (ndarray): Array of filtered CellTags (columns).

Raises:
ValueError: If ct_obj is not a CellTagData object, if bin_mtx is missing/invalid, or if a metric matrix already exists and overwrite=False.
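The cell-level part of metric filtering can be sketched as follows. This is an illustration only: metric_filter is a hypothetical helper, the matrix is a dense list of 0/1 rows rather than a sparse matrix, both bounds are assumed to be inclusive, and any subsequent removal of CellTag columns is not shown.

```python
def metric_filter(bin_mtx, cells, met_lower=1, met_upper=25):
    """Keep cells whose CellTag count lies in [met_lower, met_upper].

    `bin_mtx` is a list of 0/1 rows (one row per cell) and `cells` holds
    the matching barcodes. Returns (kept_rows, kept_cells).
    """
    kept_rows, kept_cells = [], []
    for row, cell in zip(bin_mtx, cells):
        n_tags = sum(1 for value in row if value)
        if met_lower <= n_tags <= met_upper:
            kept_rows.append(row)
            kept_cells.append(cell)
    return kept_rows, kept_cells


bin_mtx = [
    [1, 0, 0],  # one CellTag: kept
    [1, 1, 1],  # three CellTags: above met_upper=2, removed
    [0, 0, 0],  # no CellTags: below met_lower=1, removed
]
cells = ["cellA", "cellB", "cellC"]
filtered, kept_cells = metric_filter(bin_mtx, cells, met_lower=1, met_upper=2)
```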

celltag_tools.tools.call_clones(ct_obj, jaccard_th=0.7, return_graph=False, overwrite=False, inplace=True)

Identifies clonal relationships among cells based on the Jaccard similarity of their CellTag profiles. Optionally returns the Jaccard matrix, a graph representation, and a clone table, or stores them within the given CellTagData object.

Args:
ct_obj (CellTagData):

A valid CellTagData object containing a filtered matrix in metric_mtx.

jaccard_th (float, optional):

Threshold for Jaccard similarity to consider cells part of the same clone. Defaults to 0.7.

return_graph (bool, optional):

If True, additionally returns the graph representation of cell clones. Defaults to False.

overwrite (bool, optional):

If False (default), raises an error if the Jaccard matrix or clone table already exist. If True, overwrites existing data.

inplace (bool, optional):

If True (default), updates the ct_obj with jaccard_mtx, clone_table, and optionally clone_graph if return_graph=True. If False, returns the requested data.

Returns:
tuple | None:
Depending on inplace and return_graph:
  • If inplace=False and return_graph=False: (jac_mat, clones).

  • If inplace=False and return_graph=True: (jac_mat, clone_graph, clones).

  • If inplace=True, the function sets the following attributes on the CellTagData object:
    • ct_obj.jaccard_mtx

    • ct_obj.clone_table

    • ct_obj.thresholds[“jaccard”]

    • ct_obj.clone_graph (only if return_graph=True)

Where:
  • jac_mat (scipy.sparse.csr_matrix): Jaccard similarity matrix.

  • clone_graph (networkx.Graph): Graph where nodes represent cells, and edges represent similarity > jaccard_th.

  • clones (pd.DataFrame): Table mapping cells to their assigned clones.

Raises:
ValueError: If ct_obj is not a CellTagData object, if metric_mtx is missing/invalid, or if jaccard_mtx or clone_table already exist and overwrite=False.
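The idea behind Jaccard-based clone calling can be sketched in plain Python. The sketch makes two assumptions that this page does not confirm: clones are taken to be the connected components of the similarity graph, and cells without any edge are left unassigned. The names jaccard and call_clones_sketch are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def call_clones_sketch(cell_tags, jaccard_th=0.7):
    """Group cells whose CellTag sets have similarity > jaccard_th.

    `cell_tags` maps cell barcode -> set of CellTags. Returns a list of
    clones, each a sorted list of cells (assumed: connected components).
    """
    adj = {cell: set() for cell in cell_tags}
    for a, b in combinations(cell_tags, 2):
        if jaccard(cell_tags[a], cell_tags[b]) > jaccard_th:
            adj[a].add(b)
            adj[b].add(a)

    clones, seen = [], set()
    for start in cell_tags:
        if start in seen or not adj[start]:
            continue  # already assigned, or no similar partner
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        clones.append(sorted(component))
    return clones


cell_tags = {
    "c1": {"T1", "T2", "T3", "T4"},
    "c2": {"T1", "T2", "T3", "T4"},
    "c3": {"T1", "T2", "T3"},  # similarity 0.75 with c1 and c2
    "c4": {"T9"},              # shares nothing, stays unassigned
}
clones = call_clones_sketch(cell_tags)
```

Comparing all pairs is quadratic in the number of cells, which is fine for a toy example; a sparse matrix product is the usual way to scale this up.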

celltag_tools.tools.assign_fate(ct_obj, fate_col='day', fate_key='d5', cell_type_key='cell_type2', inplace=False)

Assigns a “fate” to each clone in a CellTagData object’s clone table, based on the most frequent cell type (cell_type_key) present at a specified time point (fate_key in column fate_col).

For each clone (identified by clone.id), the function finds rows in the clone table where fate_col == fate_key and determines the most common cell_type_key among those rows. This value is assigned as the clone’s “fate,” along with the percentage of cells (fate_pct) that match this fate within that clone at fate_key. If no cells meet the fate criteria (e.g., time point is missing), the clone is labeled with fate=’no_fate_cells’ and fate_pct=0.

Args:
ct_obj (CellTagData):

A valid CellTagData object containing clone_table.

fate_col (str, optional):

Column name in clone_table that defines the time point or condition used to assign fate. Defaults to ‘day’.

fate_key (str, optional):

A value in fate_col specifying which rows represent the “fate” condition. Defaults to ‘d5’.

cell_type_key (str, optional):

Column name in clone_table that specifies the cell type. Defaults to ‘cell_type2’.

inplace (bool, optional):

If True, updates ct_obj.clone_table directly. If False (default), returns a modified DataFrame without changing ct_obj.

Returns:
pandas.DataFrame | None:
  • If inplace=False, returns the updated clone table with new columns fate and fate_pct.

  • If inplace=True, the function returns None and updates ct_obj.clone_table in place.

Raises:
ValueError: If ct_obj is not a CellTagData object, if clone_table is missing or invalid, or if the specified columns (fate_col, cell_type_key) are not found in the table.
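The fate rule described above can be sketched without pandas. In this illustration the clone table is a list of dicts, assign_fate_sketch is a hypothetical name, fate_pct is assumed to be on a 0 to 100 scale, and ties between equally common cell types are broken by first occurrence, which is an artifact of the sketch rather than documented behavior.

```python
from collections import Counter, defaultdict


def assign_fate_sketch(rows, fate_col="day", fate_key="d5",
                       cell_type_key="cell_type2"):
    """Return {clone_id: (fate, fate_pct)} for a list of row dicts.

    Each row needs the keys 'clone.id', `fate_col`, and `cell_type_key`.
    """
    by_clone = defaultdict(list)
    for row in rows:
        by_clone[row["clone.id"]].append(row)

    fates = {}
    for clone_id, members in by_clone.items():
        # Only cells at the fate time point vote on the clone's fate.
        types = [r[cell_type_key] for r in members if r[fate_col] == fate_key]
        if not types:
            fates[clone_id] = ("no_fate_cells", 0)
            continue
        fate, n = Counter(types).most_common(1)[0]
        fates[clone_id] = (fate, 100 * n / len(types))
    return fates


rows = [
    {"clone.id": 1, "day": "d2", "cell_type2": "A"},
    {"clone.id": 1, "day": "d5", "cell_type2": "B"},
    {"clone.id": 1, "day": "d5", "cell_type2": "B"},
    {"clone.id": 1, "day": "d5", "cell_type2": "C"},
    {"clone.id": 1, "day": "d5", "cell_type2": "B"},
    {"clone.id": 2, "day": "d2", "cell_type2": "A"},  # no d5 cells
]
fates = assign_fate_sketch(rows)
```

Clone 1 has four d5 cells, three of type B, so its fate is B with fate_pct 75; clone 2 has no d5 cells and is labeled no_fate_cells.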

celltag_tools.tools.naive_atac_rna_pairing(ct_obj, seed=100, state_day=None, add_fate=True)

Performs a naive pairing of cells labeled as ATAC with those labeled as RNA within the same clone, using the clone_table of a CellTagData object. Cells are paired at random so that every ATAC sibling is matched to an RNA sibling; if the two sets differ in size, cells may be cycled through more than once.

Args:
ct_obj (CellTagData):

A valid CellTagData object containing clone_table.

seed (int, optional):

Seed value for NumPy’s random generator to ensure reproducible pairings. Defaults to 100.

state_day (str | None, optional):

If provided, restricts the pairing to cells in the clone_table where ‘day’ == state_day. Defaults to None.

add_fate (bool, optional):

If True, attempts to append an additional row containing the fate value for all paired cells. The column ‘fate’ must exist in clone_table. Defaults to True.

Returns:
numpy.ndarray:

A 2D array of shape (2, N) or (3, N), where N is the total number of pairs:
  • First row: ATAC cell barcodes.

  • Second row: RNA cell barcodes.

  • Third row (optional): the single fate value repeated for each pair (only if add_fate is True and the ‘fate’ column exists).

Raises:
ValueError: If ct_obj is invalid, if clone_table is missing or not a DataFrame, or if add_fate=True but the ‘fate’ column is missing.
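One way to read the pairing rule is sketched below. Several details are assumptions: the input is a plain dict of per-clone barcode lists, RNA siblings are reused in a cycle when a clone has fewer RNA than ATAC cells, surplus RNA cells are simply left unpaired, and Python's random module stands in for NumPy's generator, so the specific pairs will not match the library's output for the same seed. The name pair_atac_rna is hypothetical.

```python
import random
from itertools import cycle


def pair_atac_rna(clone_cells, seed=100):
    """Pair every ATAC cell with an RNA cell from the same clone.

    `clone_cells` maps clone_id -> {"ATAC": [...], "RNA": [...]}.
    Returns two parallel lists (atac_barcodes, rna_barcodes).
    """
    rng = random.Random(seed)  # fixed seed makes the pairing reproducible
    atac_out, rna_out = [], []
    for clone_id in sorted(clone_cells):
        atac = list(clone_cells[clone_id]["ATAC"])
        rna = list(clone_cells[clone_id]["RNA"])
        if not atac or not rna:
            continue  # nothing to pair in this clone
        rng.shuffle(rna)
        # cycle() reuses RNA cells when there are more ATAC than RNA cells.
        for a, r in zip(atac, cycle(rna)):
            atac_out.append(a)
            rna_out.append(r)
    return atac_out, rna_out


clone_cells = {
    1: {"ATAC": ["a1", "a2", "a3"], "RNA": ["r1", "r2"]},
    2: {"ATAC": ["a4"], "RNA": []},  # no RNA sibling, skipped
}
atac, rna = pair_atac_rna(clone_cells, seed=100)
```

The two lists correspond to the first two rows of the returned array; a third row with the clone's fate could be appended in the same loop.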

celltag_tools.tools.get_clone_celltag_mtx(ct_obj, sig_type='core')

Builds a clone-by-CellTag matrix from the metric-filtered matrix in a CellTagData object, based on which CellTags are present in each clone.

For each clone (from ct_obj.clone_table):
  • A sub-matrix of the metric-filtered matrix (ct_obj.metric_mtx) is extracted for cells belonging to that clone.

  • Depending on sig_type, a list of CellTags is selected:
    • “core”: CellTags present in more than one cell of the clone.

    • “union”: CellTags present in at least one cell of the clone.

  • Each clone’s chosen CellTags are accumulated.

Finally, this information is converted into a sparse matrix via table_to_spmtx, returning a clone-by-CellTag matrix of ones (indicating presence of each CellTag in a particular clone).

Args:
ct_obj (CellTagData):

A valid CellTagData object, which must include metric_mtx and a clone_table.

sig_type (str, optional):

Determines which CellTags define the clone’s “signature”. Defaults to “core”.
  • “core”: CellTags present in more than one cell of the clone.

  • “union”: CellTags present in at least one cell of the clone.

Returns:
tuple:
(sparse_mtx, row_labels, col_labels) as returned by table_to_spmtx, where:
  • row_labels are clone IDs.

  • col_labels are CellTag identifiers.

  • sparse_mtx is a clone-by-CellTag matrix of ones indicating presence.

Raises:
ValueError: If ct_obj is invalid or does not contain the required metric matrix (metric_mtx) or clone_table.
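The difference between the two signature types comes down to a column-sum threshold on the clone's sub-matrix. A minimal sketch, using dense 0/1 rows and a hypothetical helper name clone_signature:

```python
def clone_signature(clone_rows, celltags, sig_type="core"):
    """Select the CellTags that define one clone's signature.

    `clone_rows` holds one 0/1 row per cell of the clone and `celltags`
    the matching column labels.
    """
    if sig_type not in ("core", "union"):
        raise ValueError("sig_type must be 'core' or 'union'")
    # "core" needs the tag in more than one cell, "union" in at least one.
    min_cells = 2 if sig_type == "core" else 1
    col_sums = [sum(col) for col in zip(*clone_rows)]
    return [tag for tag, n in zip(celltags, col_sums) if n >= min_cells]


clone_rows = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
]
celltags = ["T1", "T2", "T3"]
core = clone_signature(clone_rows, celltags, "core")
union = clone_signature(clone_rows, celltags, "union")
```

Here only T1 appears in more than one cell, so it is the sole core CellTag, while the union signature contains all three.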

celltag_tools.tools.ident_sparse_clones(ct_obj, n_largest=10, density_th=0.2, plot=False, **kwargs)

Identifies the “sparse” clones among the largest clones in a given metadata table, defined by an edge density threshold. Optionally generates a scatter plot of clone size vs. edge density.

Args:
clone_info (pd.DataFrame):

A DataFrame containing per-clone metadata, including columns:
  • ‘clone.id’

  • ‘size’ (number of cells in each clone)

  • ‘edge.den’ (edge density of the clone subgraph)

n_largest (int, optional):

Number of top clones by size to consider for filtering. Defaults to 10.

density_th (float, optional):

Maximum edge density for a clone to be considered “sparse.” Defaults to 0.2.

plot (bool, optional):

If True, returns a matplotlib Axes object with a scatter plot of clone size vs. edge density. Defaults to False.

**kwargs:

Additional keyword arguments passed to the plotting function (e.g., marker size).

Returns:
pd.DataFrame | tuple[None, matplotlib.axes.Axes]:
  • If any sparse clones are found, returns a DataFrame subset of clone_info containing only those sparse clones. If plot=True, also returns the Axes object.

  • If no sparse clones are found, returns None. If plot=True, returns (None, Axes).

Notes:
  • Sparse clones are defined here as clones that rank among the top n_largest by size but have edge.den < density_th.

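  • The optional plotting is handled by plot_size_by_den.

  • The signature lists ct_obj as the first argument, while the argument documented above is clone_info (a per-clone metadata DataFrame). Check the installed version to confirm which input the function expects.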
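The selection rule in the notes can be written as a short filter. The sketch uses a list of dicts in place of the DataFrame, and find_sparse_clones is a hypothetical name; the column keys follow the ones listed for clone_info.

```python
def find_sparse_clones(clone_info, n_largest=10, density_th=0.2):
    """Return the sparse clones among the `n_largest` biggest clones.

    `clone_info` is a list of dicts with keys 'clone.id', 'size', and
    'edge.den'. Returns None when no clone qualifies.
    """
    largest = sorted(clone_info, key=lambda c: c["size"], reverse=True)
    largest = largest[:n_largest]
    sparse = [c for c in largest if c["edge.den"] < density_th]
    return sparse or None


clone_info = [
    {"clone.id": 1, "size": 50, "edge.den": 0.10},  # large and sparse
    {"clone.id": 2, "size": 40, "edge.den": 0.90},  # large but dense
    {"clone.id": 3, "size": 5, "edge.den": 0.05},   # sparse but too small
]
sparse = find_sparse_clones(clone_info, n_largest=2)
```

Clone 3 has a low edge density but is not among the two largest clones, so only clone 1 is reported.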

celltag_tools.tools.fix_sparse_clones(ct_obj, sparse_ids=None)

Reassigns cells from “sparse” clones by splitting them into maximal cliques, then recombines all clones into a new clone table. Useful for refining clone assignments after initial clone calling.

Specifically, for each clone in ct_obj.clone_graph whose index is in sparse_ids, the function repeatedly extracts the largest clique and marks those cells as a new clone until no edges remain.

Args:
ct_obj (CellTagData):

A valid CellTagData object with a clone_graph attribute representing clonal subgraphs and a clone_table.

sparse_ids (array-like | None, optional):

List of clone IDs (1-based) to be split. If None, the function does nothing and returns immediately. Defaults to None.

Returns:
pd.DataFrame | None:
  • If inplace=True, updates ct_obj.clone_table with the newly rebuilt clone assignments and returns None.

  • Otherwise, returns a new clone table (pd.DataFrame) without modifying ct_obj.

Raises:

ValueError: If cell number checks fail or if ct_obj is not valid.

Notes:
  • This function references an inplace check near the end, but there’s no formal inplace argument in its signature. If you want in-place updates, consider adding inplace=True to the signature.

  • The final clone IDs are re-enumerated starting from 1.
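The clique-peeling step can be sketched on a single clone's subgraph in plain Python. This is an illustration under assumptions: the graph is a dict of neighbor sets rather than a networkx graph, the maximum clique is found with a basic Bron-Kerbosch search that is only practical for small graphs, and cells left without edges are returned separately because this page does not say how they are handled. Both function names are hypothetical.

```python
def largest_clique(adj):
    """Return one maximum clique of a graph given as {node: set(neighbors)}."""
    best = set()

    def expand(r, p, x):
        nonlocal best
        if not p and not x:  # r is a maximal clique
            if len(r) > len(best):
                best = set(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best


def split_into_cliques(adj):
    """Peel off the largest clique repeatedly until no edges remain.

    Returns (cliques, leftovers), where leftovers are edgeless cells.
    """
    adj = {node: set(nbrs) for node, nbrs in adj.items()}  # work on a copy
    cliques = []
    while any(adj.values()):
        clique = largest_clique(adj)
        cliques.append(sorted(clique))
        for node in clique:
            for nbr in adj.pop(node):
                if nbr in adj:
                    adj[nbr].discard(node)
    return cliques, sorted(adj)


# A triangle (a, b, c) loosely attached to an edge (d, e) and a pendant f.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d", "f"},
    "d": {"c", "e"},
    "e": {"d"},
    "f": {"c"},
}
cliques, leftovers = split_into_cliques(graph)
```

Each extracted clique would become its own clone, after which all clones are renumbered from 1 as noted above.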