Allowlisting the CellTag plasmid library to obtain a list of high-confidence barcodes

CellTag constructs are available as lentiviral plasmid libraries. Although the theoretical diversity of CellTags is very high (4^8 = 65,536 for the original 8N tags and 4^18, or ~68.7 billion, for the 18N libraries), the real barcode diversity in a plasmid library is lower because of bottlenecks during synthesis, cloning, and related steps. To identify the CellTag barcodes actually present in the plasmid library, we perform an allowlisting step. The methodology for generating sequencing libraries for allowlisting is described in detail in Jindal et al., Nat. Biotech. (2023). The following text outlines the computational workflow for processing the sequencing data. Scripts and notebooks are available in our GitHub repo.

  • Obtain the Read 1 (R1) FASTQ files for the two replicates and parse the CellTag reads from each by running the following script: allowlisting_scripts/parse_fq_allowlisting.sh <sample name> <grep pattern> <path to R1 fastq file>

  • Error-correct the identified CellTag barcodes using starcode (set the distance threshold to 4): allowlisting_scripts/starcode_collapse.sh <distance threshold> <path to output of parse_fq_allowlisting.sh>

  • Use allowlisting_scripts/create_allowlist.ipynb to identify the barcodes present in both replicates; these form the final allowlist.
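
The logic of the first and last steps can be illustrated with a minimal Python sketch. This is not the code in the repo scripts or notebook: the flanking sequences in EXAMPLE_PATTERN are placeholders (the real grep pattern depends on the library), the function names and the min_count parameter are hypothetical, and the starcode error-correction step is omitted entirely.

```python
import re
from collections import Counter

# Hypothetical pattern: 18 random bases between two placeholder flanks.
# The real flanking sequences depend on the CellTag library and are NOT given here.
EXAMPLE_PATTERN = re.compile(r"ACGT([ACGT]{18})TTGG")


def count_celltags(reads, pattern=EXAMPLE_PATTERN):
    """Count the barcode captured by `pattern` in every read that matches it."""
    counts = Counter()
    for read in reads:
        match = pattern.search(read)
        if match:
            counts[match.group(1)] += 1
    return counts


def shared_barcodes(rep1_counts, rep2_counts, min_count=1):
    """Return the sorted barcodes seen at least `min_count` times in both replicates.

    `min_count` is an assumed filter; the notebook may apply different criteria.
    """
    keep1 = {bc for bc, n in rep1_counts.items() if n >= min_count}
    keep2 = {bc for bc, n in rep2_counts.items() if n >= min_count}
    return sorted(keep1 & keep2)


# Toy reads: only the poly-A barcode appears in both replicates.
rep1 = count_celltags(["ACGT" + "A" * 18 + "TTGG", "ACGT" + "C" * 18 + "TTGG"])
rep2 = count_celltags(["ACGT" + "A" * 18 + "TTGG", "NNNN"])
print(shared_barcodes(rep1, rep2))  # ['AAAAAAAAAAAAAAAAAA']
```

Requiring a barcode to appear in both replicates filters out barcodes that arise from sequencing or PCR errors in a single library, which is why the intersection, rather than the union, is used as the allowlist.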

The allowlist for the multi-v1 library used in our paper is provided in our GitHub repo at misc_files/18N-multi-v1-allowlist.csv. The allowlists for the 8N-v1, v2, and v3 libraries are available on Addgene.
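
As a sanity check on allowlist size, the theoretical diversity figures quoted in the introduction follow from four possible bases at each random position. The short sketch below reproduces them and computes the fraction of the theoretical space that an allowlist covers; the function names are hypothetical and not part of the repo.

```python
def theoretical_diversity(n_random_bases):
    """Number of distinct barcodes possible with n random positions (4 bases each)."""
    return 4 ** n_random_bases


def allowlist_coverage(allowlist_size, n_random_bases):
    """Fraction of the theoretical barcode space covered by an allowlist."""
    return allowlist_size / theoretical_diversity(n_random_bases)


print(theoretical_diversity(8))   # 65536
print(theoretical_diversity(18))  # 68719476736
```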