
salmon 1.2.0 index size with Rnor data using SA method #505

Closed
tamuanand opened this issue Apr 14, 2020 · 18 comments

@tamuanand

Hi

I am using salmon 1.2.0 to build a salmon SAF index on rat data from Ensembl, using the full genome as the decoy.

wget ftp://ftp.ensembl.org/pub/release-96/fasta/rattus_norvegicus/cdna/Rattus_norvegicus.Rnor_6.0.cdna.all.fa.gz

wget ftp://ftp.ensembl.org/pub/release-96/fasta/rattus_norvegicus/dna_index/Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa.gz

The other steps are as listed at:
https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/
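
For reference, the gentrome/decoy construction from that tutorial, adapted to these filenames, is roughly:

# decoy names = names of the genome sequences
grep "^>" <(gunzip -c Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa.gz) | cut -d " " -f 1 | sed 's/>//g' > decoys.txt

# gentrome = transcriptome followed by the genome decoy
cat <(gunzip -c Rattus_norvegicus.Rnor_6.0.cdna.all.fa.gz) <(gunzip -c Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa.gz) > rnor_gentrome.fa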

After I build the SA index, the directory size is 45 GB. The same index built with v1.1.0 on the refgenomes site is about 15 GB: http://refgenomes.databio.org/v2/asset/rn6/salmon_sa_index/splash?tag=default

Is this expected?

@rob-p
Collaborator

rob-p commented Apr 14, 2020

No; this is not expected. Can you list the contents of the index directory?

@tamuanand
Author

tamuanand commented Apr 14, 2020

Hi @rob-p

Salmon index command used:

salmon index -t rnor_gentrome.fa -d decoys.txt -k 17 --keepFixedFasta --keepDuplicates -p 16 -i rnor_ENSEMBL_96_index

Directory size after indexing completes:

du -sh .
45G

Listing of files and their sizes:

 4.2K     ref_indexing.log
  115     pre_indexing.log
  126     versionInfo.json
 944M     mphf.bin
 6.0G     pos.bin
 1004     info.json
 256K     refAccumLengths.bin
 701M     refseq.bin
  15G     ctable.bin
 2.5G     ctg_offsets.bin
 128K     reflengths.bin
 2.9G     seq.bin
 1.5G     rank.bin
 128K     complete_ref_lens.bin
 2.8G     ref_k17_fixed.fa
  22K     duplicate_clusters.tsv

@rob-p
Collaborator

rob-p commented Apr 14, 2020

So, I am still very surprised by the 45G number, but one big difference here is that the refgenomes indices are built with the default value of k (k=31). As you can see, the dominant element of the index here is ctable.bin, which stores where the unitigs of the dBG start within each reference. As the k-mer size gets smaller, unitigs get shorter and they appear in more places. Further, the increase here is not linear as k decreases. I'd suspect most of the size difference is due to that. There is also an optimization for ctable.bin that we have been working on that wastes fewer bits and will make this part of the index somewhat smaller in such cases. However, I am fairly certain that is not the dominant factor here. You could see what the size is with the default k; I'd expect that to be closer to the refgenomes number (but perhaps a bit different due to version differences and the --keepDuplicates flag).
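
If you want to confirm which component dominates in your case, sorting the index files by size shows it at a glance (using the index directory name from your command):

du -h rnor_ENSEMBL_96_index/* | sort -hr | head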

@tamuanand
Author

Ok, I will give this a try with k=31 and without --keepDuplicates.

Will update soon.

@tamuanand
Author

When I try it this way, the program exits:

salmon index -t rnor_gentrome.fa -d decoys.txt -k 31 -p 16 -i rnor_ENSEMBL_96_SA_index_k31

These are the last few lines of the log:

[2020-04-14 01:07:52.750] [puff::index::jointLog] [critical] The decoy file contained the names of 955 decoy sequences, but 953 were matched by sequences in the reference file provided. To prevent unintentional errors downstream, please ensure that the decoy file exactly matches with the fasta file that is being indexed.
[2020-04-14 01:07:52.895] [puff::index::jointLog] [error] The fixFasta phase failed with exit code 1

@tamuanand
Author

I am a bit puzzled that salmon indexing runs fine with a smaller k-mer size of 17, but exits when used with a k-mer size of 31.

@rob-p
Collaborator

rob-p commented Apr 14, 2020

Hi @tamuanand,

Ok, it seems to be something simple in the preparation of the decoys.txt file. I'm looking into it. If you watch the log, you see the following output before the intentional exit with status code 1:

[2020-04-14 09:44:12.991] [puff::index::jointLog] [critical] The decoy file contained the names of 955 decoy sequences, but 953 were matched by sequences in the reference file provided. To prevent unintentional errors downstream, please ensure that the decoy file exactly matches with the fasta file that is being indexed.
[2020-04-14 09:44:13.304] [puff::index::jointLog] [error] The fixFasta phase failed with exit code 1
Command exited with non-zero status 1
56.66user 9.14system 1:04.69elapsed 101%CPU (0avgtext+0avgdata 6902936maxresident)k
3792inputs+16outputs (30major+3629051minor)pagefaults 0swaps

--Rob

@rob-p
Collaborator

rob-p commented Apr 14, 2020

Ok, for some reason it's not finding these two particular decoys (even though they do seem to be in the gentrome file):

[2020-04-14 10:14:50.710] [puff::index::jointLog] [critical] The header AABR07022993.1 was in decoyNames but was not seen!
[2020-04-14 10:14:50.710] [puff::index::jointLog] [critical] The header AABR07023006.1 was in decoyNames but was not seen!

Still looking further.

@rob-p
Collaborator

rob-p commented Apr 14, 2020

Ok, I figured it out :) These two decoy references are (1) identical to each other and (2) collide with another decoy reference. Currently, the way we process decoys does not allow duplicate decoys (it makes even less sense to allow duplicate decoys than to allow duplicate transcripts). However, the reason indexing worked with k=17 is not because of k but because of the --keepDuplicates flag. With that flag, these decoys get added. I think the right thing to do on our part is to remove duplicate decoys if they appear in the reference and the user has not passed --keepDuplicates.

However, for the time being, I think the best thing to do is simply to remove AABR07022993.1 and AABR07023006.1 from the toplevel file and from decoys.txt, since the sequence they contain is already represented in the decoy part of the index. The result will still be a full and comprehensive SAF index. I'm pinging @k3yavi to see if he has any good ideas about the easiest way to cull these duplicate refs from the input files.
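
For example, a rough way to cull them with standard shell tools (filenames as above; this assumes the record names match the FASTA headers exactly):

# drop the two names from the decoy list
grep -vE '^(AABR07022993\.1|AABR07023006\.1)$' decoys.txt > decoys_filtered.txt

# drop the corresponding records from the toplevel genome
gunzip -c Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa.gz | awk '/^>/{keep = ($1 != ">AABR07022993.1" && $1 != ">AABR07023006.1")} keep' > toplevel_filtered.fa

Then rebuild rnor_gentrome.fa from the filtered files before re-running salmon index.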

@tamuanand
Author

Thanks for looking into it, @rob-p.

Just curious: how did it work without issues when I had -k 17 with --keepDuplicates and --keepFixedFasta?

@tamuanand
Author

@rob-p Looks like we both posted at the same time 🤔

You answered my question about why it worked with k=17 and --keepDuplicates.

@rob-p
Collaborator

rob-p commented Apr 14, 2020

The key flag there is --keepDuplicates. That allows duplicates in the index (which we generally think are a bad idea, since they are indistinguishable during quantification). However, it applies to both non-decoy and decoy sequences. So, in that case, AABR07022993.1 and AABR07023006.1 were added to the index as decoys despite the fact that there was already a decoy with exactly their sequence. The k=17 part seems not to matter: if I use your exact command from above but replace 17 with 31, indexing proceeds past the fixFasta step. It should actually be done indexing soon. I'll let you know how big the k=31 index (with duplicates) is.

@rob-p
Collaborator

rob-p commented Apr 14, 2020

Ok, the indexing completed in under 20 minutes of real time (with 12 threads). The k=31 index is the size one would expect:

rob@SERVER:/mnt/scratch5/rnor_test_issue505 $ du -h salmon_index/
15G	salmon_index/

@tamuanand
Author

Thanks @rob-p

I will close this.

@uros-sipetic

uros-sipetic commented Apr 21, 2020

Hi @rob-p and @tamuanand - I ran into the same issue with the Rnor6.0 reference from Ensembl, so thanks for the tips and solutions here on the two duplicate sequences!

@rob-p
Collaborator

rob-p commented Apr 22, 2020

Hi @tamuanand and @uros-sipetic,

Thanks for the feedback on this! I just cut v1.2.1, which "fixes" the behavior. It will simply discard any duplicate decoy sequences, which resolves this problem without requiring manual intervention.
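
So with 1.2.1, the earlier failing command should run to completion as-is, without the manual filtering above:

salmon index -t rnor_gentrome.fa -d decoys.txt -k 31 -p 16 -i rnor_ENSEMBL_96_SA_index_k31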

@tamuanand
Author

@rob-p Thanks for releasing 1.2.1

Do any of the new fixes affect salmon quant?

Why the question: I have already started using 1.2.0 for various quant steps after fixing the duplicate decoys manually. Hence, do I need to move to 1.2.1, re-index everything, etc.?

@rob-p
Collaborator

rob-p commented Apr 22, 2020

No, there are no changes here that affect quantification. Further, indices built from version 1.0.0 onward are forward-compatible up through the current release, so there is no need to rebuild any indices. Also, though 1.2.1 added a new flag, it made no changes to defaults, so quantifications from 1.2.0 and 1.2.1 are directly comparable.
