Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reading alevin metadata (and refactor) #26

Merged
merged 14 commits into from
Jul 30, 2021

Conversation

jashapiro
Copy link
Member

The main point of this PR was to add reading some metadata from alevin and alein-fry outputs and incorporate that data into the SingleCellExperiment metadata slot. While I was waiting for the output files to be made more consistent, I ended up doing a bit of refactoring as well.

The biggest change on the refactoring front was moving the tximport reads to a separate function in order to simplify the main read_alevin() function. I also ended up renaming read_usa_mode() to reflect that we might use it in the future for other tasks (namely CITE-seq).

The metadata reading happens in a separate function as well, for cleanliness.

Questions for reviewers

I'd appreciate a close look at the metadata fields I chose to include. Are there other fields that seem worth including?

We may want to rename our index files to make the origin of the data (ensembl release) a bit more clear, but that is outside the scope of this PR/issue.

There is one choice I made that I am not fully comfortable with, which was to add the transcript_type only upon collapsing calls (so that lives in the read_alevin() function instead of read_alevin_metadata()). I can see that we might want to handle that differently, but I wasn't sure the best way.

related to #17

Copy link
Member

@allyhawkins allyhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are generally good changes. To answer your question about what should we include or not include in the metadata, I think you got most of everything we would want to capture. I added a suggestion about adding in the tx2gene file name for if we are using alevin since that contains the version information and also could be beneficial to have. The other thing I thought of that I'm going back and forth on is the number of cells along with general metrics like mean/ median umis per cell and genes per cell. Both alevin and alevin-fry output the number of cells that are considered and alevin has a alevin_meta_info.json file that outputs the mean umis per cell and mean genes per cell. I know we might calculate some of these in the QC report and I'm assuming that we will add them on to the sce at that point? But is it worth grabbing this information from the alevin report here? I could see an argument for both sides here.

The other question you had was about adding the transcript_type within the read_alevin function. This may only be a half solution, but one possibility is to use which_counts as input to the read_alevin_metadata function and then add it in the function. I'm not sure it makes a huge difference but if you want to add it there instead? Right now, which_counts is a required input and I think it would be helpful to have that in all metadata where you currently have it for only those that are in usa or intron mode.

R/read_alevin.R Outdated
counts <- collapse_intron_counts(mtx, which_counts)
return(counts)
# Create a metadata list
meta <- list(salmon_version = cmd_info[['salmon_version']],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
meta <- list(salmon_version = cmd_info[['salmon_version']],
meta <- list(salmon_version = cmd_info$salmon_version,

This technically has a colon at the end of only the most recent salmon versions. To account for the variation we can use the $ instead. I submitted an issue on Salmon's repo about this bug COMBINE-lab/salmon#691.

R/read_alevin.R Outdated
read_alevin_metadata <- function(quant_dir){
cmd_info_path <- file.path(quant_dir, "cmd_info.json")
permit_json_path <- file.path(quant_dir, "generate_permit_list.json")
collate_json_path <- file.path(quant_dir, "collate.json")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even need this file? We don't need any of the metadata from it so unless we just want to check it exists for completeness of the run I think we can remove this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess we never use it...

meta$af_permit_type <- permit_info[['permit-list-type']]
meta$af_resolution <- quant_info[['resolution_strategy']]
meta$usa_mode <- quant_info[['usa_mode']]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
meta$alevin_tx2gene <- cmd_info[['tgMap']]

Do we want to add a line to grab the tx2gene map information if we are using alevin? That will also help grab some of the version information since the names of those files have the ensembl version in them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of this, but for --rad mode this might not actually be there (the argument is no longer required, as the file is unused). There should be an equivalent in the quant.json file, but there isn't a specific field, so we would have to parse it out of the command line call stored in "cmd". I don't love that.

I think we probably want to explore other ways of passing along this information along the workflow... not sure the best way as of now.

#' with NULL values for missing elements.
#'
#' @noRd
read_alevin_metadata <- function(quant_dir){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
read_alevin_metadata <- function(quant_dir){
read_alevin_metadata <- function(quant_dir, which_counts = c("spliced", "unspliced"){

If we want to add transcript type during this part why don't we include it as an argument here and then append it later?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about doing this (it was there in a pervious version) but I think my logic was that this function is (by its name) about reading metadata, so anything not read in could be added separately.

meta$alevinfry_version <- permit_info[['version_str']]
meta$af_permit_type <- permit_info[['permit-list-type']]
meta$af_resolution <- quant_info[['resolution_strategy']]
meta$usa_mode <- quant_info[['usa_mode']]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
meta$usa_mode <- quant_info[['usa_mode']]
meta$usa_mode <- quant_info[['usa_mode']]
meta$transcript_type <- which_counts

If you like my earlier suggestion of including the transcript type in this function.

expect_s4_class(sce, "SingleCellExperiment")
expect_equal(dim(sce), sce_af_size)
# check that column names are barcodes
col_barcode <- str_detect(colnames(sce), "^[ACGT]+$")
expect_true(all(col_barcode))
# check metadata
expect_equal(sce@metadata$mapping_tool, "alevin-fry")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
expect_equal(sce@metadata$mapping_tool, "alevin-fry")
expect_false(is.na(sce@metadata$salmon_version))
expect_false(is.na(sce@metadata$reference_index))
expect_equal(sce@metadata$mapping_tool, "alevin-fry")

Probably a good idea to add checks that the salmon version and reference index aren't empty to ensure the cmd_info.json was read in properly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, good idea... made me realize the line after is wrong though: should be is.null!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, actually, I know why I wasn't testing the reference index: it would fail with some versions where that data wasn't reported properly... so null values there are possible. (this was also why I wasn't testing for salmon version probably. But that is there (if sometimes wrong) in all cases.

R/read_alevin.R Outdated
Comment on lines 175 to 178
if (file.exists(quant_json_path)){
collate_info <- jsonlite::read_json(collate_json_path)
} else {
collate_info <- list()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (file.exists(quant_json_path)){
collate_info <- jsonlite::read_json(collate_json_path)
} else {
collate_info <- list()

Based on my comment above, we could remove this.

@jashapiro jashapiro merged commit 2dc5cd3 into main Jul 30, 2021
@jashapiro jashapiro deleted the jashapiro/read-alevin-metadata branch July 30, 2021 20:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants