-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add reading alevin metadata (and refactor) #26
Conversation
as originally planned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are generally good changes. To answer your question about what should we include or not include in the metadata, I think you got most of everything we would want to capture. I added a suggestion about adding in the tx2gene file name for if we are using alevin since that contains the version information and also could be beneficial to have. The other thing I thought of that I'm going back and forth on is the number of cells along with general metrics like mean/ median umis per cell and genes per cell. Both alevin and alevin-fry output the number of cells that are considered and alevin has a alevin_meta_info.json
file that outputs the mean umis per cell and mean genes per cell. I know we might calculate some of these in the QC report and I'm assuming that we will add them on to the sce at that point? But is it worth grabbing this information from the alevin report here? I could see an argument for both sides here.
The other question you had was about adding the transcript_type
within the read_alevin
function. This may only be a half solution, but one possibility is to use which_counts
as input to the read_alevin_metadata
function and then add it in the function. I'm not sure it makes a huge difference but if you want to add it there instead? Right now, which_counts
is a required input and I think it would be helpful to have that in all metadata where you currently have it for only those that are in usa or intron mode.
R/read_alevin.R
Outdated
counts <- collapse_intron_counts(mtx, which_counts) | ||
return(counts) | ||
# Create a metadata list | ||
meta <- list(salmon_version = cmd_info[['salmon_version']], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meta <- list(salmon_version = cmd_info[['salmon_version']], | |
meta <- list(salmon_version = cmd_info$salmon_version, |
This technically has a colon at the end of only the most recent salmon versions. To account for the variation we can use the $ instead. I submitted an issue on Salmon's repo about this bug COMBINE-lab/salmon#691.
R/read_alevin.R
Outdated
read_alevin_metadata <- function(quant_dir){ | ||
cmd_info_path <- file.path(quant_dir, "cmd_info.json") | ||
permit_json_path <- file.path(quant_dir, "generate_permit_list.json") | ||
collate_json_path <- file.path(quant_dir, "collate.json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we even need this file? We don't need any of the metadata from it so unless we just want to check it exists for completeness of the run I think we can remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess we never use it...
meta$af_permit_type <- permit_info[['permit-list-type']] | ||
meta$af_resolution <- quant_info[['resolution_strategy']] | ||
meta$usa_mode <- quant_info[['usa_mode']] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meta$alevin_tx2gene <- cmd_info[['tgMap']] |
Do we want to add a line to grab the tx2gene map information if we are using alevin? That will also help grab some of the version information since the names of those files have the ensembl version in them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of this, but for --rad
mode this might not actually be there (the argument is no longer required, as the file is unused). There should be an equivalent in the quant.json
file, but there isn't a specific field, so we would have to parse it out of the command line call stored in "cmd"
. I don't love that.
I think we probably want to explore other ways of passing along this information along the workflow... not sure the best way as of now.
#' with NULL values for missing elements. | ||
#' | ||
#' @noRd | ||
read_alevin_metadata <- function(quant_dir){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read_alevin_metadata <- function(quant_dir){ | |
read_alevin_metadata <- function(quant_dir, which_counts = c("spliced", "unspliced"){ |
If we want to add transcript type during this part why don't we include it as an argument here and then append it later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about doing this (it was there in a pervious version) but I think my logic was that this function is (by its name) about reading metadata, so anything not read in could be added separately.
meta$alevinfry_version <- permit_info[['version_str']] | ||
meta$af_permit_type <- permit_info[['permit-list-type']] | ||
meta$af_resolution <- quant_info[['resolution_strategy']] | ||
meta$usa_mode <- quant_info[['usa_mode']] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meta$usa_mode <- quant_info[['usa_mode']] | |
meta$usa_mode <- quant_info[['usa_mode']] | |
meta$transcript_type <- which_counts |
If you like my earlier suggestion of including the transcript type in this function.
expect_s4_class(sce, "SingleCellExperiment") | ||
expect_equal(dim(sce), sce_af_size) | ||
# check that column names are barcodes | ||
col_barcode <- str_detect(colnames(sce), "^[ACGT]+$") | ||
expect_true(all(col_barcode)) | ||
# check metadata | ||
expect_equal(sce@metadata$mapping_tool, "alevin-fry") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expect_equal(sce@metadata$mapping_tool, "alevin-fry") | |
expect_false(is.na(sce@metadata$salmon_version)) | |
expect_false(is.na(sce@metadata$reference_index)) | |
expect_equal(sce@metadata$mapping_tool, "alevin-fry") |
Probably a good idea to add checks that the salmon version and reference index aren't empty to ensure the cmd_info.json
was read in properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, good idea... made me realize the line after is wrong though: should be is.null
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, actually, I know why I wasn't testing the reference index: it would fail with some versions where that data wasn't reported properly... so null values there are possible. (this was also why I wasn't testing for salmon version probably. But that is there (if sometimes wrong) in all cases.
R/read_alevin.R
Outdated
if (file.exists(quant_json_path)){ | ||
collate_info <- jsonlite::read_json(collate_json_path) | ||
} else { | ||
collate_info <- list() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (file.exists(quant_json_path)){ | |
collate_info <- jsonlite::read_json(collate_json_path) | |
} else { | |
collate_info <- list() |
Based on my comment above, we could remove this.
The main point of this PR was to add reading some metadata from alevin and alein-fry outputs and incorporate that data into the SingleCellExperiment
metadata
slot. While I was waiting for the output files to be made more consistent, I ended up doing a bit of refactoring as well.The biggest change on the refactoring front was moving the
tximport
reads to a separate function in order to simplify the mainread_alevin()
function. I also ended up renamingread_usa_mode()
to reflect that we might use it in the future for other tasks (namely CITE-seq).The metadata reading happens in a separate function as well, for cleanliness.
Questions for reviewers
I'd appreciate a close look at the metadata fields I chose to include. Are there other fields that seem worth including?
We may want to rename our index files to make the origin of the data (ensembl release) a bit more clear, but that is outside the scope of this PR/issue.
There is one choice I made that I am not fully comfortable with, which was to add the
transcript_type
only upon collapsing calls (so that lives in theread_alevin()
function instead ofread_alevin_metadata()
). I can see that we might want to handle that differently, but I wasn't sure the best way.related to #17