
salmon 1.2.0 index size with Rnor data using SA method #505

Closed
tamuanand opened this issue Apr 14, 2020 · 18 comments

@tamuanand

Hi

I am using salmon 1.2.0 to build a salmon SAF index on rat data from Ensembl, using the full genome as the decoy.

wget ftp://ftp.ensembl.org/pub/release-96/fasta/rattus_norvegicus/cdna/Rattus_norvegicus.Rnor_6.0.cdna.all.fa.gz

wget ftp://ftp.ensembl.org/pub/release-96/fasta/rattus_norvegicus/dna_index/Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa.gz

The other steps are as listed at:
https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/
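
For reference, the gentrome/decoy construction from that tutorial, adapted to these filenames, is roughly:

# decoy names = names of the genome sequences
grep "^>" <(gunzip -c Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa.gz) | cut -d " " -f 1 | sed 's/>//g' > decoys.txt

# gentrome = transcriptome followed by the genome decoy
cat <(gunzip -c Rattus_norvegicus.Rnor_6.0.cdna.all.fa.gz) <(gunzip -c Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa.gz) > rnor_gentrome.fa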

After I build the SA index, the directory size is 45 GB. The same index built with v1.1.0 on the refgenomes site is about 15 GB: http://refgenomes.databio.org/v2/asset/rn6/salmon_sa_index/splash?tag=default

Is this expected?

@rob-p
Collaborator

rob-p commented Apr 14, 2020

No; this is not expected. Can you list the contents of the index directory?

@tamuanand
Author

tamuanand commented Apr 14, 2020

Hi @rob-p

Salmon index command used:

salmon index -t rnor_gentrome.fa -d decoys.txt -k 17 --keepFixedFasta --keepDuplicates -p 16 -i rnor_ENSEMBL_96_index

Directory size after indexing completes:

du -sh .
45G

Listing of files and their sizes:

 4.2K     ref_indexing.log
  115     pre_indexing.log
  126     versionInfo.json
 944M     mphf.bin
 6.0G     pos.bin
 1004     info.json
 256K     refAccumLengths.bin
 701M     refseq.bin
  15G     ctable.bin
 2.5G     ctg_offsets.bin
 128K     reflengths.bin
 2.9G     seq.bin
 1.5G     rank.bin
 128K     complete_ref_lens.bin
 2.8G     ref_k17_fixed.fa
  22K     duplicate_clusters.tsv

@rob-p
Collaborator

rob-p commented Apr 14, 2020

So, I am still very surprised by the 45G number, but one big difference here is that the refgenomes indices are built with the default value of k (k=31). As you can see, the dominant element of the index here is ctable.bin, which stores where the unitigs of the dBG start within each reference. As the k-mer size gets smaller, unitigs get shorter and they appear in more places. Further, the increase here is not linear as k decreases. I'd suspect most of the size difference is due to that. There is also an optimization for ctable.bin that we have been working on that wastes fewer bits and will make this part of the index somewhat smaller in such cases. However, I am fairly certain that is not the dominant factor here. You could see what the size is with the default k; I'd expect that to be closer to the refgenomes number (but perhaps a bit different due to version differences and the --keepDuplicates flag).
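
If you want to confirm which component dominates in your case, sorting the index files by size shows it at a glance (using the index directory name from your command):

du -h rnor_ENSEMBL_96_index/* | sort -hr | head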

@tamuanand
Author

Ok, I will give this a try with k=31 and without --keepDuplicates.

Will update soon.

@tamuanand
Author

When I try it this way, the program exits:

salmon index -t rnor_gentrome.fa -d decoys.txt -k 31 -p 16 -i rnor_ENSEMBL_96_SA_index_k31

These are the last few lines of the log:

[2020-04-14 01:07:52.750] [puff::index::jointLog] [critical] The decoy file contained the names of 955 decoy sequences, but 953 were matched by sequences in the reference file provided. To prevent unintentional errors downstream, please ensure that the decoy file exactly matches with the fasta file that is being indexed.
[2020-04-14 01:07:52.895] [puff::index::jointLog] [error] The fixFasta phase failed with exit code 1

@tamuanand
Author

I am a bit puzzled that salmon indexing runs fine with a smaller k-mer size of 17, but exits when used with a k-mer size of 31.

@rob-p
Collaborator

rob-p commented Apr 14, 2020

Hi @tamuanand,

Ok, it seems to be something simple in the preparation of the decoys.txt file. I'm looking into it. If you watch the log, you see the following output before the intentional exit with status code 1:

[2020-04-14 09:44:12.991] [puff::index::jointLog] [critical] The decoy file contained the names of 955 decoy sequences, but 953 were matched by sequences in the reference file provided. To prevent unintentional errors downstream, please ensure that the decoy file exactly matches with the fasta file that is being indexed.
[2020-04-14 09:44:13.304] [puff::index::jointLog] [error] The fixFasta phase failed with exit code 1
Command exited with non-zero status 1
56.66user 9.14system 1:04.69elapsed 101%CPU (0avgtext+0avgdata 6902936maxresident)k
3792inputs+16outputs (30major+3629051minor)pagefaults 0swaps

--Rob

@rob-p
Collaborator

rob-p commented Apr 14, 2020

Ok, for some reason it's not finding these two particular decoys (even though they do seem to be in the gentrome file):

[2020-04-14 10:14:50.710] [puff::index::jointLog] [critical] The header AABR07022993.1 was in decoyNames but was not seen!
[2020-04-14 10:14:50.710] [puff::index::jointLog] [critical] The header AABR07023006.1 was in decoyNames but was not seen!

Still looking further.

@rob-p
Collaborator

rob-p commented Apr 14, 2020

Ok, I figured it out :) These two decoy references are (1) identical to each other and (2) collide with another decoy reference. Currently, the way we process decoys does not allow duplicate decoys (it makes even less sense to allow duplicate decoys than to allow duplicate transcripts). However, the reason indexing worked with k=17 is not because of k but because of the --keepDuplicates flag. With that flag, these decoys get added. I think the right thing to do on our part is to remove duplicate decoys if they appear in the reference and the user has not passed --keepDuplicates.

However, for the time being, I think the best thing to do is simply to remove AABR07022993.1 and AABR07023006.1 from the toplevel file and from decoys.txt, since the sequence they contain is already represented in the decoy part of the index. The result will still be a full and comprehensive SAF index. I'm pinging @k3yavi to see if he has any good ideas about the easiest way to cull these duplicate refs from the input files.
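
For example, a rough way to cull them with standard shell tools (filenames as above; this assumes the record names match the FASTA headers exactly):

# drop the two names from the decoy list
grep -vE '^(AABR07022993\.1|AABR07023006\.1)$' decoys.txt > decoys_filtered.txt

# drop the corresponding records from the toplevel genome
gunzip -c Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa.gz | awk '/^>/{keep = ($1 != ">AABR07022993.1" && $1 != ">AABR07023006.1")} keep' > toplevel_filtered.fa

Then rebuild rnor_gentrome.fa from the filtered files before re-running salmon index.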

@tamuanand
Author

Thanks for looking into it, @rob-p.

Just curious: how did it work without issues when I had -k 17 with --keepDuplicates and --keepFixedFasta?

@tamuanand
Author

@rob-p Looks like we both posted at the same time 🤔

You answered my question about why it worked with k=17 and --keepDuplicates.

@rob-p
Collaborator

rob-p commented Apr 14, 2020

The key flag there is --keepDuplicates. That allows duplicates in the index (which we generally think are a bad idea, since they are indistinguishable during quantification). However, it applies to both non-decoy and decoy sequences. So, in that case, AABR07022993.1 and AABR07023006.1 were added to the index as decoys despite the fact that there was already a decoy with exactly their sequence. The k=17 part seems not to matter: if I use your exact command from above but replace 17 with 31, indexing proceeds past the fixFasta step. It should actually be done indexing soon. I'll let you know how big the k=31 index (with duplicates) is.

@rob-p
Collaborator

rob-p commented Apr 14, 2020

Ok, the indexing completed in under 20 minutes of real time (with 12 threads). The k=31 index is the size one would expect:

rob@SERVER:/mnt/scratch5/rnor_test_issue505 $ du -h salmon_index/
15G	salmon_index/

@tamuanand
Author

Thanks @rob-p

I will close this.

@uros-sipetic

uros-sipetic commented Apr 21, 2020

Hi @rob-p and @tamuanand - I ran into the same issue with the Rnor6.0 reference from Ensembl, so thanks for the tips and solutions here on the two duplicate sequences!

@rob-p
Collaborator

rob-p commented Apr 22, 2020

Hi @tamuanand and @uros-sipetic,

Thanks for the feedback on this! I just cut v1.2.1, which "fixes" the behavior. It will simply discard any duplicate decoy sequences, which resolves this problem without requiring manual intervention.
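
So with 1.2.1, the earlier failing command should run to completion as-is, without the manual filtering above:

salmon index -t rnor_gentrome.fa -d decoys.txt -k 31 -p 16 -i rnor_ENSEMBL_96_SA_index_k31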

@tamuanand
Author

@rob-p Thanks for releasing 1.2.1

Do any of the new fixes affect salmon quant?

Why the question: I have already started using 1.2.0 for various quant steps after fixing the duplicate decoys manually. Hence, do I need to move to 1.2.1, re-index everything, etc.?

@rob-p
Collaborator

rob-p commented Apr 22, 2020

No, there are no changes here that affect quantification. Further, indices built from version 1.0.0 onward are forward-compatible up through the current release, so there is no need to rebuild any indices. Also, though 1.2.1 added a new flag, it made no changes to defaults, so quantifications from 1.2.0 and 1.2.1 are directly comparable.
