Add reading alevin metadata (and refactor) #26

jashapiro · 2021-07-28T18:17:01Z

The main point of this PR was to read some metadata from alevin and alevin-fry outputs and incorporate it into the SingleCellExperiment metadata slot. While I was waiting for the output files to be made more consistent, I ended up doing a bit of refactoring as well.

The biggest change on the refactoring front was moving the tximport reads to a separate function in order to simplify the main read_alevin() function. I also ended up renaming read_usa_mode() to reflect that we might use it in the future for other tasks (namely CITE-seq).

The metadata reading happens in a separate function as well, for cleanliness.
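As a rough illustration of the pattern (not the code in this PR), a helper that reads an optional JSON file and falls back to an empty list might look like the sketch below. The helper name `read_json_or_empty`, the temporary directory, and the placeholder values are all hypothetical; `jsonlite::read_json()` and `jsonlite::write_json()` are the only library calls used.

```r
library(jsonlite)

# Hypothetical helper: read a JSON file if present, else return an empty list,
# so that later lookups of missing fields simply yield NULL.
read_json_or_empty <- function(path) {
  if (file.exists(path)) {
    jsonlite::read_json(path)
  } else {
    list()
  }
}

# Demo with a temporary directory standing in for a quant output directory.
quant_dir <- tempfile("quant_")
dir.create(quant_dir)
jsonlite::write_json(
  list(salmon_version = "x.y.z"),  # placeholder value, not a real version
  file.path(quant_dir, "cmd_info.json"),
  auto_unbox = TRUE
)

cmd_info <- read_json_or_empty(file.path(quant_dir, "cmd_info.json"))
quant_info <- read_json_or_empty(file.path(quant_dir, "quant.json"))  # absent

# Assemble the metadata list; fields from the missing file come back as NULL.
meta <- list(
  salmon_version = cmd_info[["salmon_version"]],
  usa_mode = quant_info[["usa_mode"]]
)
```

Because `[[` on a list returns NULL for a name that is not present, fields from a missing file end up as NULL without any extra checks.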

Questions for reviewers

I'd appreciate a close look at the metadata fields I chose to include. Are there other fields that seem worth including?

We may want to rename our index files to make the origin of the data (ensembl release) a bit more clear, but that is outside the scope of this PR/issue.

There is one choice I made that I am not fully comfortable with, which was to add the transcript_type only upon collapsing calls (so that lives in the read_alevin() function instead of read_alevin_metadata()). I can see that we might want to handle that differently, but I wasn't sure of the best way.

related to #17

allyhawkins

I think these are generally good changes. To answer your question about what we should include in the metadata, I think you got most of what we would want to capture. I added a suggestion about including the tx2gene file name when we are using alevin, since that contains the version information and could be beneficial to have. The other thing I'm going back and forth on is the number of cells, along with general metrics like mean/median UMIs per cell and genes per cell. Both alevin and alevin-fry output the number of cells that are considered, and alevin has an alevin_meta_info.json file that outputs the mean UMIs per cell and mean genes per cell. I know we might calculate some of these in the QC report, and I'm assuming that we will add them to the sce at that point? But is it worth grabbing this information from the alevin report here? I could see an argument for both sides.

The other question you had was about adding the transcript_type within the read_alevin function. This may only be a half solution, but one possibility is to pass which_counts to the read_alevin_metadata function and add it there. I'm not sure it makes a huge difference, but you could add it there instead if you want. Right now, which_counts is a required input, and I think it would be helpful to have it in all metadata, whereas you currently add it only for those in usa or intron mode.

allyhawkins · 2021-07-29T16:46:08Z

R/read_alevin.R

-  counts <- collapse_intron_counts(mtx, which_counts)
-  return(counts)
+  # Create a metadata list
+  meta <- list(salmon_version = cmd_info[['salmon_version']],


Suggested change

-  meta <- list(salmon_version = cmd_info[['salmon_version']],
+  meta <- list(salmon_version = cmd_info$salmon_version,

Technically, the salmon_version field name has a colon at the end in only the most recent salmon versions. To account for the variation we can use $ instead, since it allows partial matching of names. I submitted an issue on Salmon's repo about this bug: COMBINE-lab/salmon#691.
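To show why `$` helps here, a small sketch using base R only. The two lists are made-up stand-ins for parsed cmd_info.json contents, and the version strings are placeholders rather than real salmon versions.

```r
# Placeholder lists standing in for parsed cmd_info.json contents.
info_plain <- list(salmon_version = "x.y.z")
info_colon <- list(`salmon_version:` = "x.y.z")  # name with a trailing colon

# `[[` matches names exactly by default, so the trailing colon breaks the lookup.
exact_plain <- info_plain[["salmon_version"]]
exact_colon <- info_colon[["salmon_version"]]    # NULL

# `$` allows partial matching on list names, so both forms are found.
partial_plain <- info_plain$salmon_version
partial_colon <- info_colon$salmon_version
```

Partial matching only works when the prefix is unambiguous, so this relies on no other field name starting with `salmon_version`.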

allyhawkins · 2021-07-29T16:48:06Z

R/read_alevin.R

+read_alevin_metadata <- function(quant_dir){
+  cmd_info_path <- file.path(quant_dir, "cmd_info.json")
+  permit_json_path <- file.path(quant_dir, "generate_permit_list.json")
+  collate_json_path <- file.path(quant_dir, "collate.json")


Do we even need this file? We don't need any of the metadata from it so unless we just want to check it exists for completeness of the run I think we can remove this.

Yeah, I guess we never use it...

allyhawkins · 2021-07-29T16:55:36Z

R/read_alevin.R

+  meta$af_permit_type <- permit_info[['permit-list-type']]
+  meta$af_resolution <- quant_info[['resolution_strategy']]
+  meta$usa_mode <- quant_info[['usa_mode']]
+


Suggested change

+  meta$alevin_tx2gene <- cmd_info[['tgMap']]

Do we want to add a line to grab the tx2gene map information if we are using alevin? That will also help grab some of the version information since the names of those files have the ensembl version in them.

I like the idea of this, but for --rad mode this might not actually be there (the argument is no longer required, as the file is unused). There should be an equivalent in the quant.json file, but there isn't a specific field, so we would have to parse it out of the command line call stored in "cmd". I don't love that.

I think we probably want to explore other ways of passing this information along the workflow... not sure of the best way as of now.

allyhawkins · 2021-07-29T17:16:15Z

R/read_alevin.R

+#'   with NULL values for missing elements.
+#'
+#' @noRd
+read_alevin_metadata <- function(quant_dir){


Suggested change

-read_alevin_metadata <- function(quant_dir){
+read_alevin_metadata <- function(quant_dir, which_counts = c("spliced", "unspliced")){

If we want to add transcript type during this part why don't we include it as an argument here and then append it later?

I thought about doing this (it was there in a previous version), but I think my logic was that this function is (by its name) about reading metadata, so anything not read in could be added separately.
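One way to picture that separation, as a sketch only: the reader returns whatever was parsed, and a separate step appends the caller-supplied value. The function `add_transcript_type` is hypothetical and not part of this PR; the argument default mirrors the `which_counts` choices in the suggestion above.

```r
# Hypothetical helper: append a caller-supplied value to already-read metadata.
add_transcript_type <- function(meta, which_counts = c("spliced", "unspliced")) {
  # match.arg() validates the choice and picks the first option when omitted.
  which_counts <- match.arg(which_counts)
  meta$transcript_type <- which_counts
  meta
}

# Stand-in for a metadata list returned by a reading function.
meta <- list(mapping_tool = "alevin-fry")
meta <- add_transcript_type(meta, which_counts = "unspliced")
meta_default <- add_transcript_type(list())
```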

allyhawkins · 2021-07-29T17:16:59Z

R/read_alevin.R

+  meta$alevinfry_version <- permit_info[['version_str']]
+  meta$af_permit_type <- permit_info[['permit-list-type']]
+  meta$af_resolution <- quant_info[['resolution_strategy']]
+  meta$usa_mode <- quant_info[['usa_mode']]


Suggested change

-  meta$usa_mode <- quant_info[['usa_mode']]
+  meta$usa_mode <- quant_info[['usa_mode']]
+  meta$transcript_type <- which_counts

If you like my earlier suggestion of including the transcript type in this function.

allyhawkins · 2021-07-29T17:20:25Z

tests/testthat/test-read_alevin.R

  expect_s4_class(sce, "SingleCellExperiment")
  expect_equal(dim(sce), sce_af_size)
  # check that column names are barcodes
  col_barcode <- str_detect(colnames(sce), "^[ACGT]+$")
  expect_true(all(col_barcode))
+  # check metadata
+  expect_equal(sce@metadata$mapping_tool, "alevin-fry")


Suggested change

-  expect_equal(sce@metadata$mapping_tool, "alevin-fry")
+  expect_false(is.na(sce@metadata$salmon_version))
+  expect_false(is.na(sce@metadata$reference_index))
+  expect_equal(sce@metadata$mapping_tool, "alevin-fry")

Probably a good idea to add checks that the salmon version and reference index aren't empty to ensure the cmd_info.json was read in properly.

yup, good idea... made me realize the line after is wrong though: should be is.null!

Oh, actually, I know why I wasn't testing the reference index: it would fail with some versions where that data wasn't reported properly... so null values there are possible. (This was probably also why I wasn't testing for salmon version, but that is there, if sometimes wrong, in all cases.)
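The is.na versus is.null distinction can be seen with a plain list, as in this base R sketch. The `meta` list and its values are placeholders, not the actual contents of `sce@metadata`.

```r
# Stand-in for a metadata list where reference_index was never set.
meta <- list(mapping_tool = "alevin-fry", salmon_version = "x.y.z")

# A missing list element is NULL, and is.na(NULL) is a zero-length logical,
# so it cannot serve as a TRUE/FALSE presence check.
na_check <- is.na(meta$reference_index)

# is.null() gives a single TRUE/FALSE answer for missing elements.
null_check <- is.null(meta$reference_index)
version_present <- !is.null(meta$salmon_version)
```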

allyhawkins · 2021-07-29T17:38:40Z

R/read_alevin.R

+  if (file.exists(quant_json_path)){
+    collate_info <- jsonlite::read_json(collate_json_path)
+  } else {
+    collate_info <- list()


Suggested change

-  if (file.exists(quant_json_path)){
-    collate_info <- jsonlite::read_json(collate_json_path)
-  } else {
-    collate_info <- list()

Based on my comment above, we could remove this.

jashapiro added 13 commits July 27, 2021 15:43

Factor out tximport steps

5fd0a58

Document and add metadata skeleton

79380eb

Start to get info from af json files

eafbf1b

Add some metadata parsing

06d8519

Add more metadata

d129150

Add a couple of tests

4f6feca

few more tests

37acf26

rename read_usa_mode

661ab33

redocument

f043128

factor out metadata reading

be0349b

as originally planned

move transcript_type to main function

0e287ec

Minor documentation updates

615b7ef

Fix extra argument

f98eaaa

jashapiro requested a review from allyhawkins July 28, 2021 19:44

allyhawkins approved these changes Jul 29, 2021

fix salmon version reading and add some more checks

290a226

jashapiro merged commit 2dc5cd3 into main Jul 30, 2021

jashapiro deleted the jashapiro/read-alevin-metadata branch July 30, 2021 20:28
