
salmon 1.2.0 release notes

@rob-p released this 10 Apr 21:03

Improvements and changes

Improvements


  • Extreme reduction in the required intermediate disk space used when building the salmon index. This improvement is due to the changes implemented by @iminkin in TwoPaCo (which pufferfish, and hence salmon, uses for constructing the colored, compacted dBG) addressing the issue here. This means that for larger references or references with many "contigs" (transcripts), the intermediate disk space requirements are reduced by up to 2 orders of magnitude!

  • Reduction in the memory required for indexing, especially when indexing with a small value of k. This improvement comes from (1) fixing a bug that was resulting in an unnecessarily-large allocation when "pufferizing" the output of TwoPaCo and (2) improving the storage of some intermediate data structures used during index construction. These improvements should help reduce the burden of constructing a decoy-aware index with small values of k. The issue of reducing the number of intermediate files created (which can hurt performance on NFS-mounted drives) is being worked on upstream, but is not yet resolved.

alevin

  • This release introduces support for the quantification of CITE-seq / feature barcoding based single-cell protocols! A full, end-to-end tutorial is soon-to-follow on the alevin-tutorial website.

New flags and options:


  • Salmon learned a new option (currently Beta) --softclip: This flag allows soft-clipping at the beginning and end of reads when they are scored with selective-alignment. If used in conjunction with the --writeMappings flag, then the CIGAR strings in the resulting SAM output will designate any soft-clipping that occurs at the beginning or end of the read. Note: To pass the selective-alignment filter, the read must still obtain a score of at least maximum achievable score * minScoreFraction, but soft-clipping allows omitting a poor-quality sub-alignment at the beginning or end of the read with no change to the resulting score for the rest of the alignment (rather than forcing a negative score for these sub-alignments).
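
To make the effect concrete, here is a toy sketch (not salmon's scoring code) of the filter described above; all scores, penalties, and the minScoreFraction value are made-up illustrative numbers.

```python
# Toy illustration (not salmon's scoring code): soft-clipped ends contribute 0 to
# the score instead of a negative penalty, while the pass/fail rule stays
# score >= max_achievable_score * minScoreFraction. All numbers are made up.

def passes_filter(score, max_achievable, min_score_fraction):
    return score >= max_achievable * min_score_fraction

max_achievable = 200   # score of a perfect end-to-end alignment of this read
core_score = 140       # score of the well-aligned middle of the read
end_penalty = -30      # cost of forcing the noisy read ends into the alignment

print(passes_filter(core_score + end_penalty, max_achievable, 0.65))  # False: 110 < 130
print(passes_filter(core_score, max_achievable, 0.65))                # True: 140 >= 130 (ends soft-clipped)
```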

  • Salmon learned a new option --decoyThreshold <thresh>: For an alignment to an annotated transcript to be considered invalid, it must have an alignment score s such that s < (decoyThreshold * bestDecoyScore). A value of 1.0 means that any alignment strictly worse than the best decoy alignment will be discarded. A smaller value will allow reads to be allocated to transcripts even if they strictly align better to the decoy sequence. The previous behavior of salmon was to discard any mappings to annotated transcripts that were strictly worse than the best decoy alignment. This is equivalent to setting --decoyThreshold 1.0, which is the default behavior.
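
The rule can be written out as a small sketch; the function and variable names below are illustrative only, not salmon's internal API.

```python
# Sketch of the --decoyThreshold validity rule described above; names are
# illustrative, not salmon's internal API.

def is_valid_transcript_alignment(score, best_decoy_score, decoy_threshold=1.0):
    """An alignment to an annotated transcript is discarded (invalid) when
    score < decoy_threshold * best_decoy_score."""
    return score >= decoy_threshold * best_decoy_score

# Default (1.0): anything strictly worse than the best decoy alignment is discarded.
print(is_valid_transcript_alignment(score=95, best_decoy_score=100))                       # False
# A smaller threshold keeps transcript alignments that score somewhat below the best decoy.
print(is_valid_transcript_alignment(score=95, best_decoy_score=100, decoy_threshold=0.9))  # True
```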

  • Salmon learned a new option --minAlnProb <prob> (default 1e-5): When selective alignment is carried out on a read, each alignment A is assigned a probability given by $e^{-(scoreExp * (bestScore - score(A)))}$, where the default scoreExp is just 1.0. Depending on how much worse a given alignment is compared to the best alignment for a read, this can result in an exceedingly small alignment probability. The --minAlnProb option lets one set the alignment probability below which an alignment's probability will be truncated to 0. This allows skipping the alignments for a fragment that are unlikely to be true (and which could increase the difficulty of inference in some cases).
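
The formula and the truncation are simple enough to sketch directly; this is an illustrative re-statement, not salmon's implementation.

```python
import math

# Illustrative re-statement of the per-alignment probability and the --minAlnProb
# truncation described above (not salmon's implementation).

def alignment_prob(score, best_score, score_exp=1.0, min_aln_prob=1e-5):
    """P(A) = exp(-(scoreExp * (bestScore - score(A)))), truncated to 0 below minAlnProb."""
    p = math.exp(-(score_exp * (best_score - score)))
    return p if p >= min_aln_prob else 0.0

print(alignment_prob(score=148, best_score=150))  # ~0.135
print(alignment_prob(score=130, best_score=150))  # ~2e-9, truncated to 0.0
```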

  • Salmon learned a new flag --disableChainingHeuristic: Passing this flag will turn off the heuristic of Li 2018 that is used to speed up the MEM chaining step, where the inner loop of the chaining algorithm is terminated after a small number of previous pointers for a given MEM have been found. Passing this flag can improve the sensitivity of alignment to sequences that are highly repetitive (especially those with overlapping repetition), but it can make the chaining step somewhat slower.
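
For readers unfamiliar with the heuristic, the sketch below shows the general shape of the idea (in the spirit of the chaining heuristic of Li 2018 / minimap2). It uses a deliberately simplified scoring scheme and is not salmon's or pufferfish's actual chaining code.

```python
# Deliberately simplified sketch of MEM chaining with the early-termination
# heuristic described above; this is not salmon's or pufferfish's actual code,
# and real chaining uses gap/overlap-aware scores rather than raw MEM lengths.
# Each MEM is a (query_pos, ref_pos, length) tuple.

def chain_mems(mems, max_skip=25, disable_heuristic=False):
    mems = sorted(mems, key=lambda m: (m[1], m[0]))  # order by reference position
    best = [m[2] for m in mems]                      # best chain score ending at each MEM
    for i, (qi, ri, li) in enumerate(mems):
        skipped = 0
        for j in range(i - 1, -1, -1):               # inner loop over previous MEMs
            qj, rj, _ = mems[j]
            if qj < qi and rj < ri:                  # MEM j can precede MEM i in a chain
                cand = best[j] + li
                if cand > best[i]:
                    best[i] = cand
                elif not disable_heuristic:
                    skipped += 1
                    if skipped > max_skip:           # heuristic: give up scanning early
                        break
    return max(best, default=0)
```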

  • Salmon learned a new flag --auxTargetFile <file>. The file passed to this option should be a list of targets (i.e. sequences indexed during indexing, or aligned against in the provided BAM file) for which auxiliary models (sequence-specific, fragment-GC, and position-specific bias correction) should not be applied. The format of this file is one target name per line, in a newline-separated file. Unlike decoy sequences, this list of sequences is provided to the quant command, and can be different between different runs if so desired. Also, unlike decoy sequences, the auxiliary targets will be quantified (e.g. they will have entries in quant.sf and can have reads assigned to them). To aid in metadata tracking of targets marked as auxiliary, the aux_info directory contains a new file, aux_target_ids.json, which lists the indices of targets that were treated as "auxiliary" targets in the current run.
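
A minimal example of preparing such a file is shown below; the target names are placeholders chosen only to illustrate the one-name-per-line format.

```python
# Minimal illustration of the --auxTargetFile input format: one target name per
# line. The names below are placeholders used purely for illustration.

aux_targets = ["ERCC-00002", "ERCC-00003", "ERCC-00004"]
with open("aux_targets.txt", "w") as fh:
    fh.write("\n".join(aux_targets) + "\n")

# The file would then be passed at quantification time, e.g.
#   salmon quant ... --auxTargetFile aux_targets.txt
```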

  • The equivalence class output is now gzipped when written (and written to aux_info/eq_classes.txt.gz rather than aux_info/eq_classes.txt). So that this change can be detected programmatically, an extra property gzipped is written to the eq_class_properties entry of aux_info/meta_info.json. Apart from being gzipped to save space, however, the format is unchanged. So, you can simply read the file using a gzip stream, or, alternatively, simply unzip the file before reading it.
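
A downstream script can handle both the old and the new location/compression, for example as sketched below (paths are placeholders; the line-by-line parsing itself is unchanged and omitted here).

```python
import gzip
import os

# Read the equivalence classes whether they were written gzipped (new default)
# or as plain text (older behavior). `quant_dir` is a placeholder path; the
# actual parsing of each line is unchanged and omitted here.

quant_dir = "quant_out"
gz_path = os.path.join(quant_dir, "aux_info", "eq_classes.txt.gz")
txt_path = os.path.join(quant_dir, "aux_info", "eq_classes.txt")

fh = gzip.open(gz_path, "rt") if os.path.exists(gz_path) else open(txt_path, "r")
with fh:
    for line in fh:
        pass  # parse each equivalence class record exactly as before
```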

  • Added special handling for reading SAM files that were, themselves, produced by salmon. Specifically, when reading SAM files produced by salmon, the AS tag will be used to assign appropriate conditional probabilities to different mappings for a fragment (rather than looking for a CIGAR string, which is not computed).

  • The versionInfo.json file generated during indexing now remembers the specific version of salmon that was used to build the index. The indexVersion field is already a version identifier that is incremented when the index changes in a binary-incompatible way. However, the new field will allow one to know the exact salmon version that was used to build the index.

alevin

  • A couple of new flags have been added to support feature-barcoding-based quantification in the alevin framework.

    • index command
      • --features: Performs indexing on a tsv file instead of a regular FASTA reference file. Each line of the tsv file should contain the name of a feature and its nucleotide sequence, separated by a tab.
    • alevin command
      • --featureStart: The start index (0 based) of the feature barcode in the R2 read file. (Typically 0 for CITE-seq and 10 for 10x feature barcoding).
      • --featureLength: The length of the feature barcode in the R2 read file. (Typically 15 for both CITE-seq and 10x feature barcoding).
      • --citeseq: Quantifies feature-barcoded data by aligning the feature barcodes (allowing up to 1 edit) instead of the mRNA reads.
      • No --tgMap is needed when using the --citeseq single-cell protocol.
  • --end 3 has been enabled in this release. It is useful for protocols where the UMI comes before the cellular barcode (CB). Note: --end 3 does not start extracting the subsequence from the 3' end of the R1 file. Instead, alevin still counts the subsequence from the 5' end, but samples the UMI first instead of the CB. The idea is that --end 5 represents CB+UMI while --end 3 represents UMI+CB, and all sequence beyond |CB| + |UMI| bases is ignored, no matter what value is set for the flag --end.
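
A small sketch of this geometry is shown below; the CB/UMI lengths are made-up examples, not tied to any particular chemistry.

```python
# Sketch of the --end 5 / --end 3 barcode geometry described above; the CB/UMI
# lengths here are made-up examples, not tied to a particular chemistry.

def split_cb_umi(r1_seq, cb_len, umi_len, end=5):
    """Both settings count from the 5' end of R1; --end 3 simply puts the UMI first.
    Anything beyond cb_len + umi_len bases is ignored either way."""
    if end == 5:                                   # --end 5: CB followed by UMI
        cb = r1_seq[:cb_len]
        umi = r1_seq[cb_len:cb_len + umi_len]
    else:                                          # --end 3: UMI followed by CB
        umi = r1_seq[:umi_len]
        cb = r1_seq[umi_len:umi_len + cb_len]
    return cb, umi

read = "ACGTACGTACGTACGTTTTTGGGGCCCCAAAATTTT"      # trailing bases are ignored
print(split_cb_umi(read, cb_len=16, umi_len=12, end=5))
print(split_cb_umi(read, cb_len=16, umi_len=12, end=3))
```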

Bug fixes

  • Fixed an issue (upstream in pufferfish) that actually arises from bbhash. Specifically, the issue was unexpected behavior of bbhash during minimal perfect hash (MPHF) construction: bbhash may create temporary files during MPHF construction, and it was using the current working directory to do this, with no option to override this behavior. We have fixed this in our copy of the bbhash code, and the salmon index command will now use the provided output directory as temporary working space for bbhash. This issue has been reported upstream in bbhash as issue 19.

  • Fixed an issue with long target names (raised in issue 451) not being allowed in the index. Previously, in the pufferfish-based index, target names of length > 255 were clipped to 255 characters. While this is not normally a problem, pipelines that attempt to encode significant metadata in the target name may be affected by this limit. With this release, target names of up to 65,536 characters are supported. Thanks to @chilampoon for raising this issue.

  • Fixed an issue where the computed alignment score could be wrong (too high) when there were MEMs in the highest-scoring chain that overlapped in the query and the reference by different amounts. This was relatively infrequent, but has now been fixed. Thanks to @cdarby for reporting the issue and providing a test case to fix it!

  • Fixed an issue where, in rare situations, usage of the alignment cache could cause non-determinism in the score for certain alignments, which could result in small fluctuations in the number of assigned fragments. The fix involves both addressing a bug in ksw2, where an incorrect alignment score for global alignment could be returned in certain rare situations depending on how the bandwidth parameter is set, and also being more stringent about which alignments are inserted into the alignment cache and which mappings are searched for in it. Many thanks to @csoneson for raising this issue and finding a dataset containing enough of the corner cases to track down and fix the issue. Thanks to @mohsenzakeri for isolating the underlying cases and figuring out how to fix them.

alevin

  • The big feature hash generated when --dumpBfh is set contained UMI sequences that were reversed relative to those originally present. This was a legacy bug, introduced when shifting from the jellyfish-based 2-bit encoding to the AlevinKmer class-based 2-bit encoding. This has been fixed in this release.

  • Fixed an issue where the --writeUnmappedNames did not work properly with alevin. This addresses issue 501.

Other notes

  • As raised in issue 500, the salmon executable, since v1.0.0, assumes the SSE4 instruction set. While this feature has been standard on processors for a long time, some older hardware may not have this feature set. This compile flag was removed from the pufferfish build in this release, as we noticed no speed regressions in its absence. However, please let us know if you notice any non-trivial speed regressions (which we do not expect).