yeast expression data for database #937
Comments
This is the spreadsheet we will work from |
@ahmad00m has read the paper and had some clarification questions; next step is to look at the data to see whether this can be mapped to GRNsight. |
@ahmad00m will need to do some clustering with stem as the next step; he will also look into how the input file for stem needs to be structured. |
@ahmad00m installed stem and got the interface window running, but when trying to browse for the data file in stem, no files were found. Entering the file path manually did not fix the problem either. The file was saved as Tab-delimited text (.txt) but was still not found. |
@ahmad00m, is the file somewhere I can grab? I'll try it on my machine before the meeting today. |
I attached the file here. The formatting is a bit different because we couldn't do the ANOVA test, but I think it should be fine for stem. |
I got it to work on my machine. We can troubleshoot during the meeting. However, stem doesn't work with replicates. You need to take the average of the replicate data for each time point and just load the average into stem, not the replicates. Also note that since the data has a 0 timepoint, we can leave the default setting for normalization. |
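A minimal sketch of the averaging step described above, assuming a hypothetical layout in which the first column holds the gene ID and replicate columns share the same time-point label in the header. The function name and the layout are illustrative, not taken from the actual file.

```python
from collections import defaultdict
from statistics import mean


def average_replicates(header, rows):
    """Collapse replicate columns into one averaged column per time point.

    Assumes header[0] labels the ID column and that repeated labels in the
    rest of the header mark replicates of the same time point.
    """
    # Map each time-point label to the column indexes holding its replicates.
    columns = defaultdict(list)
    for index, label in enumerate(header[1:], start=1):
        columns[label].append(index)

    averaged = []
    for row in rows:
        new_row = [row[0]]
        for indexes in columns.values():
            # Skip empty cells so a missing replicate does not break the mean.
            values = [float(row[i]) for i in indexes if row[i] != ""]
            new_row.append(mean(values) if values else "")
        averaged.append(new_row)
    return [header[0]] + list(columns), averaged


# Illustrative values only, not real expression data.
example_header = ["ID", "t0", "t0", "t15", "t15"]
example_rows = [["YAL001C", "0.0", "0.5", "1.0", "3.0"]]
new_header, new_rows = average_replicates(example_header, example_rows)
```

The averaged table keeps one column per time point, which is the shape stem expects according to the comment above.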
We are also going to need to standardize the IDs in this file. There is a mix of standard names, systematic names, and internal SGD IDs in the file. Yeastract has a tool here: http://www.yeastract.com/formorftogene.php, although I'm not sure it will do the SGD IDs. |
We have discovered that there are duplicate and triplicate rows in the data that need to be removed. Some rows are unique, some are duplicated, and some are triplicated. We need to get rid of the redundant rows, and then standardize the IDs. Also, when I opened the .txt file in Excel, it converted some of the IDs to dates. |
@ahmad00m has put together a Python script that flags duplicates and we have decided to keep the duplicate row that has the systematic ID. All other rows can be discarded. |
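The rule above could be sketched roughly as follows. This is not the script from the thread; it assumes that redundant rows carry identical expression values under different IDs, and it leaves the test for "is a systematic ID" to the caller (`is_preferred` is a hypothetical parameter).

```python
def drop_redundant_rows(rows, is_preferred):
    """Keep one row per distinct expression profile.

    rows: lists of strings, ID first, expression values after.
    is_preferred: callable returning True for IDs that should win
    (for example, systematic names).
    """
    kept = {}
    for row in rows:
        identifier, profile = row[0], tuple(row[1:])
        current = kept.get(profile)
        if current is None:
            kept[profile] = row
        elif is_preferred(identifier) and not is_preferred(current[0]):
            # Replace a non-preferred ID with the preferred one.
            kept[profile] = row
        # Otherwise the first row seen for this profile is kept.
    return list(kept.values())


# Illustrative rows; the values are made up.
example = [
    ["S000004383", "1.0", "2.0"],
    ["YLR391W", "1.0", "2.0"],
    ["YAL001C", "0.5", "0.1"],
]
deduplicated = drop_redundant_rows(example, lambda name: name.startswith("Y"))
```

The `startswith("Y")` check is only a stand-in for the example; a stricter test for systematic names would be needed on real data.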
@ahmad00m has finished tidying the file, but when I try to find the standard name of each gene, I get fewer standard names than systematic names, as if some genes are lost in the process. I have 4,926 genes with systematic names but only 4,856 with standard names. What would I need to do to resolve this problem? |
If a standard name does not exist for a given systematic name, then use the systematic name for the standard name in that instance. I would spot check a few of these in SGD just to make sure that this is what's happening. |
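As a small illustration of this fallback, assuming the standard names have already been loaded into a dictionary keyed by systematic name (the dictionary contents below are examples only):

```python
def standard_name_for(systematic_name, standard_names):
    """Return the standard name, or the systematic name when none exists."""
    name = standard_names.get(systematic_name, "")
    # Treat a missing key and an empty string the same way.
    return name if name else systematic_name


# Example lookup table; entries are for illustration.
lookup = {"YAL001C": "TFC3", "YAL004W": ""}
names = [standard_name_for(gene, lookup) for gene in ["YAL001C", "YAL004W", "YAL999W"]]
```

With this fallback every systematic name yields exactly one standard name, so the two gene counts should match.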
I have fixed the problem. I believe there was an issue with my code. Now everything works and the numbers for both the standard name and the systematic name match. Also, I have attached the excel file and the final version of the clean data in .txt format. |
@ahmad00m will start committing the duplication removal script to https://github.com/dondi/GRNsight-archive under a folder within a scripts folder. In addition to the script, @ahmad00m can also commit a README.md that describes what the script does. Recommended structure is as follows:
|
Systematic name regex (so far): (when using grep, the parentheses and question mark need to be escaped: |
@ahmad00m saved the duplicate-removal scripts to the GRNsight-archive repository. I tried cleaning the file with grep and then removing duplicates, which resulted in 4,848 genes. My guess for the low number of genes is that some of the gene expression data use names other than the systematic names. I checked this by running the duplicate_remover code on the original file, which gave 5,569 genes; this suggests some data were lost by selecting only systematic names. I obtained the standard names for those genes using the http://www.yeastract.com/formorftogene.php website and then ran stem. I believe my next step is to figure out how to find the systematic names for the rows that use other names, and to write more sophisticated code for selecting systematic names from the file. Here is the file for systematic names and no duplicates. Here is the file for no duplicate genes. Also, here is a picture of the stem result. |
The next step here is to get more specific information on the duplicated expression data rows:
Possible approaches:
|
Just a quick question. if there are 2 duplicates both with systematic names, e.g. YLR391W and YLR391W-A does it matter which one to keep? |
Those are two different genes, keep both. |
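To make the distinction concrete, here is a hedged sketch of a pattern for nuclear ORF systematic names that treats a suffix such as `-A` as part of the ID. The pattern is an assumption written for illustration; it is not the regex from this thread, and it does not cover other naming schemes (for example mitochondrial ORFs). In Python's `re` module the parentheses and question mark are written unescaped.

```python
import re

# Assumed shape: Y, chromosome letter A-P, arm L or R, three digits,
# strand W or C, and an optional suffix such as -A.
SYSTEMATIC_NAME = re.compile(r"^Y[A-P][LR][0-9]{3}[WC](-[A-Z])?$")


def is_systematic_name(identifier):
    """Return True when the whole identifier matches the assumed pattern."""
    return SYSTEMATIC_NAME.match(identifier) is not None


# Both IDs match, and they remain two distinct genes.
candidates = ["YLR391W", "YLR391W-A", "S000004383", "TFC3"]
matches = [name for name in candidates if is_systematic_name(name)]
```

Anchoring the pattern with `^` and `$` matters here: without the anchors, `YLR391W-A` could be truncated to `YLR391W` and the two genes would be merged.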
@ahmad00m wrote code to keep the preferred IDs, but there is an issue with returning them. I'm hoping to resolve this issue in the meeting. Also, I determined that there are 664 unique genes with SGD IDs which need to be changed to systematic names (they are unique values with only an SGD ID). So, I need to find a way to change these IDs to systematic IDs. |
|
@ahmad00m contacted SGD website help desk and I got a website that contains all the information (SGD ID, Systematic name, Standard name) in tab delimited text format. Here is the link to the website YeastMine. |
@ahmad00m, please make a list of IDs that are the exceptions you mentioned. I'd like to investigate what the problems are for those IDs that would result in gaps. The "results.txt" file should be committed to the GRNsight archive. It should be given a more descriptive name. Put it into its own directory named something like "source data" (I don't remember our naming conventions off the top of my head), and then make a README.md file in that directory that describes how the data were obtained, the date, and what is in the file for future reference. The data are a snapshot in time, so if we need to do this again, we should have instructions on what to do. |
The attached file contains the IDs that have gaps for their standard names. There are 1357 of them. |
Wow, 1357 seems like a lot, is this the total from Yeastmine or the total from your dataset (or both)? I did a spot check on the first 10 IDs in the list by looking them up directly on the SGD webpage. Some of them are designated "dubious" as in unlikely to code for a real protein, but some of them were simply "uncharacterized" meaning that no one has studied them yet. A couple had "reserved" names which is on the way to getting a real "standard" name. I think we can safely copy over the systematic name to be the standard name for these. I'm not going to be able to make the meeting tomorrow. Would @ahmad00m make a summary of what he has done to the dataset? I'm looking for something like:
|
This is the total from Yeastmine. I did not use my dataset yet. I believe I can convert all the SGD IDs to systematic names and then use the systematic names to find standard names on the Yeastmine web page. Then, if the standard name doesn't exist, I can use the systematic name as the standard name as you suggested. @ahmad00m will make a summary of the processing I have done on the dataset. |
@ahmad00m and @dondi reviewed the status of this issue at the meeting and first resolved a few bugs and technical questions in his current code. @dondi also sketched out how @ahmad00m can use the ID-mapping file that he acquired from SGD to identify the systematic ID and/or standard name given an SGD ID. (this file has also been uploaded to GRNsight-archive) @ahmad00m will work on these bug fixes and post a follow-up message with the summary requested by @kdahlquist |
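The lookup sketched at the meeting might look something like the following. It assumes the mapping file is tab-delimited with the columns SGD ID, systematic name, standard name in that order, and that SGD IDs start with "S0"; both the column order and the prefix check are assumptions, and the sample IDs are placeholders rather than values checked against SGD.

```python
import csv
import io


def load_id_map(handle):
    """Build an SGD ID -> systematic name dictionary from tab-delimited text."""
    id_map = {}
    for fields in csv.reader(handle, delimiter="\t"):
        # Skip short lines and rows without a systematic name.
        if len(fields) < 2 or not fields[1]:
            continue
        id_map[fields[0]] = fields[1]
    return id_map


def to_systematic(identifiers, id_map):
    """Replace mappable SGD IDs; report SGD-style IDs that have no mapping."""
    converted, unmapped = [], []
    for identifier in identifiers:
        if identifier in id_map:
            converted.append(id_map[identifier])
        else:
            converted.append(identifier)
            if identifier.startswith("S0"):
                unmapped.append(identifier)
    return converted, unmapped


# Placeholder mapping text standing in for the file obtained from SGD.
sample = io.StringIO("S000000001\tYAL001C\tTFC3\nS000000002\t\t\n")
sample_map = load_id_map(sample)
converted, unmapped = to_systematic(["S000000001", "YLR391W", "S000000002"], sample_map)
```

Returning the unmapped IDs separately gives a list that can be looked up by hand instead of silently leaving gaps in the dataset.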
@ahmad00m finished writing the code to replace SGD IDs with systematic names. However, I found that 45 of the 696 SGD IDs have no equivalent systematic name in the file obtained from the SGD help desk. So, I can look up these IDs and change them manually. Once these 45 SGD IDs have been looked up, the file will be cleaned and ready to be used in stem. The summary is as follows:
|
@ahmad00m finally cleaned the original file and found 5543 genes. Then, I found the standard names using Yeastract and used the systematic names for those which didn't have standard names. I have attached the final dataset below. Also, I tried running stem using the instructions from https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Week_9, but I got an error saying "All genes filtered". Hopefully we can troubleshoot this during the meeting. |
@ahmad00m, there was a problem with the way you formatted the file.
Before we move onto the next step, you need to write up a protocol for all the steps you carried out to go from the original file to this one. I want to review that and follow the steps myself to make sure I can replicate your results. After that, the next step would be to generate candidate gene regulatory networks using Yeastract. It looks like out of the 8 significant patterns, 4 are generally up before returning to baseline and 4 are generally down before returning to baseline. In terms of looking for networks, it might work to group the genes from the 4 up and 4 down clusters. |
Even though we don't need the standard names to run stem, we will want them. I noticed some odd standard names in the file:
|
Here is the final summary of cleaning the data, including the documentation of the steps taken.
Here is the documentation of the steps taken to clean the expression data. Also, the code is ready to be pushed to GitHub. Should @ahmad00m upload the documentation and the code to GRNsight-archive? Moreover, if @kdahlquist wants to confirm the steps, I can email the code before pushing it to GRNsight-archive. Here is the final version of the file. (It is a bit different from the last one because this one includes the expression data for mitochondrial genes, which we decided to include.) |
@ahmad00m , you can upload the code to the GRNsight-archive. If it needs to be modified in the future, that's OK. It is preferable to keep it in the repository. Since GitHub keeps track of all versions, it's better to keep it there as opposed to having the only copy be on your computer. |
@ahmad00m pushed the code to GRNsight-archive, so it can be accessed for testing. |
@ahmad00m ran STEM and saved the results. I also tried analysing the results and continued up until generating the regulation matrix in YEASTRACT; however, no matrix was created after a while, and I did not get any errors either. I hope to troubleshoot this during the meeting so I can continue with visualising the model with GRNsight and determine which one would be appropriate to pursue further for modeling. |
@ahmad00m updated the code and the documentation for replacing the IDs. I also added the original expression data to GRNsight-archive. Moreover, I tried to create the regulation matrix, but the website doesn't return any matrices, so I'm hoping to troubleshoot that during the meeting later today. |
Some notes from the 2/14/22 meeting:
|
@ahmad00m finished creating the adjacency matrices for the first four significant profiles using YEASTRACT database. Also, the new documentation containing all the steps up until visualizing the GRN on GRNsight will soon be pushed to GRNsight-archive for review. |
Here is the link to the complete Documentation |
Follow-up wrap-up comments:
|
|
Initial review of the documentation looks good; it will need a “validation test” where someone who is unfamiliar with the process seeks to follow the instructions in order to accomplish the same result. Tentatively this looks like a good match for @ahmad00m to go over with @Sarronnn, minimizing intervention until they discover something that needs to be clarified in the documentation |
Closing because it is complete and live in v6.0.7 |
Opening this issue for @ahmad00m to record tasks for preparing a new expression dataset for the back-end database.
We are going to use data from this paper: Apweiler, E., Sameith, K., Margaritis, T., Brabers, N., van de Pasch, L., Bakker, L. V., ... & Kemmeren, P. (2012). Yeast glucose pathways converge on the transcriptional regulation of trehalose biosynthesis. BMC genomics, 13(1), 1-14. https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-239
We will focus on the wild type data because that is the one for which they did the timecourse. @ahmad00m should begin by reading the paper. We will then work on analyzing the data and preparing it for the database insertion.
We are roughly going to follow the project outline from the Fall 2019 Biological Databases course. Of particular interest are:
https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Data_Analysis and https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Quality_Assurance