yeast expression data for database #937

Closed
kdahlquist opened this issue Sep 29, 2021 · 43 comments

Comments

@kdahlquist
Collaborator

Opening this issue for @ahmad00m to record tasks for preparing a new expression dataset for the back-end database.

We are going to use data from this paper: Apweiler, E., Sameith, K., Margaritis, T., Brabers, N., van de Pasch, L., Bakker, L. V., ... & Kemmeren, P. (2012). Yeast glucose pathways converge on the transcriptional regulation of trehalose biosynthesis. BMC genomics, 13(1), 1-14. https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-239

We will focus on the wild type data because that is the one for which they did the timecourse. @ahmad00m should begin by reading the paper. We will then work on analyzing the data and preparing it for the database insertion.

We are roughly going to follow the project outline from the Fall 2019 Biological Databases course. Of particular interest are:
https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Data_Analysis and https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Quality_Assurance

@kdahlquist
Collaborator Author

This is the spreadsheet we will work from
GSE33097_s257_final.xlsx

@dondi
Owner

dondi commented Oct 7, 2021

@ahmad00m has read the paper and had some clarification questions; next step is to look at the data to see whether this can be mapped to GRNsight.

@dondi
Owner

dondi commented Oct 14, 2021

@ahmad00m will need to do some clustering with stem as the next step; he will also work out how to structure the file for stem.

@ahmad00m
Collaborator

@ahmad00m installed stem and got the interface window running, but when trying to browse for the data file from within stem, no files were found. Entering the file path manually did not fix the problem either. The file was saved as Tab-delimited (.txt) but still was not found.
Screenshot 2021-10-20 at 9 10 24 pm

@kdahlquist
Collaborator Author

@ahmad00m, is the file somewhere I can grab? I'll try it on my machine before the meeting today.

@ahmad00m
Collaborator

I attached the file here. The formatting is a bit different because we couldn't do the ANOVA test, but I think it should be fine for stem.
(WT1)_stem.txt

@kdahlquist
Collaborator Author

I got it to work on my machine. We can troubleshoot during the meeting. However, stem doesn't work with replicates. You need to take the average of the replicate data for each time point and just load the average into stem, not the replicates.

Also note that since the data has a 0 timepoint, we can leave the default setting for normalization.
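
As a rough illustration of the averaging step, here is a minimal Python sketch. It assumes (hypothetically) that replicate columns are labeled `<timepoint>_<replicate>`, e.g. `0m_1`, `0m_2`; the actual column labels in the spreadsheet may differ, and the function name is only for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_replicates(header, rows):
    """Collapse replicate columns into one averaged column per time point.

    header: column labels; the first is the ID column, the rest are assumed
            (hypothetically) to look like "<timepoint>_<replicate>".
    rows:   lists whose first item is the ID and whose remaining items are
            floats, or None where a value is missing.
    """
    columns_by_timepoint = defaultdict(list)  # time point -> column indexes
    for index, label in enumerate(header[1:], start=1):
        timepoint = label.rsplit("_", 1)[0]
        columns_by_timepoint[timepoint].append(index)

    new_header = [header[0]] + list(columns_by_timepoint)
    new_rows = []
    for row in rows:
        averaged = [row[0]]
        for indexes in columns_by_timepoint.values():
            values = [row[i] for i in indexes if row[i] is not None]
            # Leave the cell empty (None) when every replicate is missing.
            averaged.append(mean(values) if values else None)
        new_rows.append(averaged)
    return new_header, new_rows
```

Only the averaged columns would then be written to the tab-delimited file that gets loaded into stem.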

@kdahlquist
Collaborator Author

We are also going to need to standardize the IDs in this file. There is a mix of standard names, systematic names, and internal SGD IDs in the file. Yeastract has a tool here: http://www.yeastract.com/formorftogene.php, although I'm not sure it will do the SGD IDs.

@kdahlquist
Collaborator Author

We have discovered that there are duplicate and triplicate rows in the data that need to be removed. Some rows are unique, some are duplicated, and some are triplicated. We need to get rid of the redundant rows, and then standardize the IDs.

Also, when I opened the .txt file in Excel, it converted some of the IDs to dates.

@dondi
Owner

dondi commented Oct 28, 2021

@ahmad00m has put together a Python script that flags duplicates and we have decided to keep the duplicate row that has the systematic ID. All other rows can be discarded.

@ahmad00m
Collaborator

ahmad00m commented Nov 4, 2021

@ahmad00m has finished tidying the file, but when I try to find the standard name of each gene, I get fewer standard names than systematic names, as if some genes are lost in the process. I have 4,926 genes for systematic names and I get 4,856 genes for the standard name. I was wondering what I would need to do to resolve this problem.

@kdahlquist
Collaborator Author

If a standard name does not exist for a given systematic name, then use the systematic name for the standard name in that instance. I would spot check a few of these in SGD just to make sure that this is what's happening.
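
In code, that fallback could look something like the sketch below. The function name and the lookup dictionary are hypothetical; the dictionary stands in for whatever systematic-to-standard mapping is actually being used.

```python
def standard_name_for(systematic_name, standard_names):
    """Return the standard name, or the systematic name when none exists.

    standard_names: dict of systematic name -> standard name (hypothetical);
                    a missing key or an empty string both count as "no name".
    """
    name = standard_names.get(systematic_name, "").strip()
    return name if name else systematic_name


# Illustrative lookup only: one gene with a standard name, one without.
lookup = {"YAL001C": "TFC3", "YAL004W": ""}
names = [standard_name_for(gene, lookup) for gene in ("YAL001C", "YAL004W")]
```

With this fallback, the standard-name column always has the same number of entries as the systematic-name column.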

@ahmad00m
Collaborator

ahmad00m commented Nov 4, 2021

I have fixed the problem. I believe there was an issue with my code. Now everything works and the numbers for both the standard name and the systematic name match. Also, I have attached the Excel file and the final version of the clean data in .txt format.
GSE33097_s257_final-4-original.xlsx
FINALfile.txt

@dondi
Owner

dondi commented Nov 4, 2021

@ahmad00m will start committing the duplication removal script to https://github.com/dondi/GRNsight-archive under a folder within a scripts folder. In addition to the script, @ahmad00m can also commit a README.md that describes what the script does. Recommended structure is as follows:

  • GRNsight-archive
    • documents
    • scripts
      • duplicate_expression_remover
        • wt_stemtest.py
        • (other files)
        • README.md

@dondi
Owner

dondi commented Nov 4, 2021

Systematic name regex (so far): Y[A-P][LR][0-9][0-9][0-9][WwCc](-[A-Z])?

(when using grep, the parentheses and question mark need to be escaped: Y[A-P][LR][0-9][0-9][0-9][WwCc]\(-[A-Z]\)\?)
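
For use inside a Python script, the same pattern works with the standard `re` module without escaping the parentheses or question mark. This is only a sketch of how the check might be wrapped (the function name is hypothetical); `fullmatch` is used so that the whole ID has to fit the pattern.

```python
import re

# The "so far" systematic name pattern from this comment, unchanged.
SYSTEMATIC_NAME = re.compile(r"Y[A-P][LR][0-9][0-9][0-9][WwCc](-[A-Z])?")


def is_systematic_name(identifier):
    """True when the whole identifier matches the systematic name pattern."""
    return SYSTEMATIC_NAME.fullmatch(identifier.strip()) is not None
```

Note that IDs that do not follow the chromosome/arm/number layout will not match this pattern as written, so anything it rejects should be reviewed rather than silently dropped.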

@ahmad00m
Collaborator

@ahmad00m saved the duplication scripts to the GRNsight-archive repository. I tried to clean the file using grep and then remove duplicates, which resulted in 4,848 genes. My guess for the low number of genes is that some of the gene expression data might use names other than the systematic names. I checked for that by running the duplicate_remover code on the original file, and there were 5,569 genes, which suggests some data has been lost by only selecting for systematic names. I entered the standard names for those genes using the http://www.yeastract.com/formorftogene.php website and then ran stem. I believe the next step for me would be to figure out a way to find the systematic names for the rows that use other names in the file and to write more sophisticated code for selecting systematic names from the file.

Here is the file for systematic names and no duplicates
standard_system_clean_file.txt

Here is the file for no duplicate genes
testfordupgene.txt

Also here is a picture of the stem result.

Screenshot 2021-11-11 at 7 34 38 am

@dondi
Owner

dondi commented Nov 12, 2021

The next step here is to get more specific information on the duplicated expression data rows:

  • Modify the duplicate remover so that it chooses to keep, preferentially:
    • Rows with systematic ID
    • Rows with standard name
    • Rows with SGD ID only
  • Counts of the latter two rows should be determined, so that we know the amount of work involved in mapping the SGD ID-only rows to the systematic ID
  • Once this is done, we should then have a full non-duplicated file, all of which are keyed by systematic ID

Possible approaches:

  • Build a new list while preferentially tracking the IDs found
  • Build multiple lists depending on the matching ID
  • Build a dictionary using a tuple-ized version of the expression data as key where the value is the list of IDs under which that expression data was found
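
A minimal sketch of the third approach (dictionary keyed by the tuple-ized expression data) might look like the following. The SGD ID pattern and the rule "anything that is neither systematic nor SGD ID is a standard name" are assumptions for illustration, not a description of the existing script.

```python
import re
from collections import defaultdict

SYSTEMATIC = re.compile(r"Y[A-P][LR][0-9][0-9][0-9][WwCc](-[A-Z])?")
# Assumption: SGD IDs in the file look like "S" followed by digits.
SGD_ID = re.compile(r"S[0-9]+")


def id_rank(identifier):
    """Lower rank = more preferred: systematic, then standard, then SGD ID."""
    if SYSTEMATIC.fullmatch(identifier):
        return 0
    if SGD_ID.fullmatch(identifier):
        return 2
    return 1  # anything else is treated as a standard name


def remove_duplicates(rows):
    """rows: iterable of (identifier, value, value, ...) tuples.

    Values are best kept as the original strings so that equality is exact.
    Returns a list of (identifier, values) pairs with redundant rows removed.
    """
    ids_by_expression = defaultdict(list)
    for identifier, *values in rows:
        ids_by_expression[tuple(values)].append(identifier)

    kept = []
    for values, ids in ids_by_expression.items():
        unique_ids = list(dict.fromkeys(ids))  # drop repeated IDs, keep order
        best = min(id_rank(i) for i in unique_ids)
        # Keep every ID at the best rank, so two different IDs of the same
        # kind that share identical data are both retained for review.
        kept.extend((i, values) for i in unique_ids if id_rank(i) == best)
    return kept
```

Counting how many kept rows have rank 1 or rank 2 (for example with `collections.Counter(id_rank(i) for i, _ in kept)`) would give the number of rows that still need to be mapped to a systematic ID.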

@ahmad00m
Collaborator

Just a quick question: if there are 2 duplicates, both with systematic names (e.g., YLR391W and YLR391W-A), does it matter which one we keep?

@kdahlquist
Collaborator Author

Those are two different genes, keep both.

@ahmad00m
Collaborator

@ahmad00m wrote code to keep the preferred IDs, but there is an issue with returning them. I'm hoping to resolve this issue in the meeting. Also, I determined that there are 664 unique genes identified only by an SGD ID, which need to be changed to systematic names. So, I need to find a way to convert these IDs to systematic IDs.

@ahmad00m
Collaborator

  • The script needs to be debugged to reduce the ID to a single one
  • Look at the 58 systematic names to check whether they are duplicates or the same gene and then write the correct ID
  • Ask the SGD help desk for a tool to convert the SGD ID to the systematic name

@ahmad00m
Collaborator

@ahmad00m contacted the SGD website help desk and was pointed to a website that provides all the information (SGD ID, systematic name, standard name) in tab-delimited text format. Here is the link to the website: YeastMine.
Here is the file that contains all the ID's:
results.txt
However, I found some gaps for standard names in the file.
For now, I can use this file to replace the SGD ID's with their systematic names by modifying the script.
If there is anything else I need to do please let me know.

@kdahlquist
Collaborator Author

@ahmad00m, please make a list of IDs that are the exceptions you mentioned. I'd like to investigate what the problems are for those IDs that would result in gaps.

The "results.txt" file should be committed to the GRNsight archive. It should be given a more descriptive name. Put it into its own directory named something like "source data" (I don't remember our naming conventions off the top of my head), and then make a README.md file in that directory that describes how the data were obtained, the date, and what is in the file for future reference. The data are a snapshot in time, so if we need to do this again, we should have instructions on what to do.

@ahmad00m
Collaborator

ahmad00m commented Dec 1, 2021

The file attached contains the ID's with gaps for their standard names. There are 1357 of them.
filewithnoGaps.txt

@kdahlquist
Collaborator Author

Wow, 1357 seems like a lot. Is this the total from YeastMine or the total from your dataset (or both)? I did a spot check on the first 10 IDs in the list by looking them up directly on the SGD webpage. Some of them are designated "dubious", as in unlikely to code for a real protein, but some of them were simply "uncharacterized", meaning that no one has studied them yet. A couple had "reserved" names, which is on the way to getting a real "standard" name.

I think we can safely copy over the systematic name to be the standard name for these.

I'm not going to be able to make the meeting tomorrow. Would @ahmad00m make a summary of what he has done to the dataset? I'm looking for something like:

  • total records in the unprocessed dataset
  • total of unique records in processed dataset
  • total in processed dataset that did not have standard name and had to use the systematic name for that.

@ahmad00m
Collaborator

ahmad00m commented Dec 2, 2021

This is the total from YeastMine. I have not used my dataset yet. I believe I can convert all the SGD IDs to systematic names and then use the systematic names to find standard names on the YeastMine web page. Then, if the standard name doesn't exist, I can use the systematic name as the standard name as you suggested.

@ahmad00m will make a summary of what processes I have done on the dataset.

@dondi
Owner

dondi commented Dec 3, 2021

@ahmad00m and @dondi reviewed the status of this issue at the meeting and first resolved a few bugs and technical questions in his current code. @dondi also sketched out how @ahmad00m can use the ID-mapping file that he acquired from SGD to identify the systematic ID and/or standard name given an SGD ID. (this file has also been uploaded to GRNsight-archive)

@ahmad00m will work on these bug fixes and post a follow-up message with the summary requested by @kdahlquist
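
A minimal sketch of the lookup idea, not the code reviewed at the meeting: it assumes the mapping file is tab-delimited with the columns in the order SGD ID, systematic name, standard name, and that SGD IDs look like "S" followed by digits. The function names and the sample IDs are hypothetical.

```python
import csv
import io
import re

SGD_ID = re.compile(r"S[0-9]+")  # assumption about how SGD IDs appear


def load_id_map(handle):
    """Map SGD ID -> systematic name from a tab-delimited file.

    Assumed column order: SGD ID, systematic name, standard name.
    Rows missing either of the first two columns are skipped.
    """
    id_map = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2 and row[0].strip() and row[1].strip():
            id_map[row[0].strip()] = row[1].strip()
    return id_map


def replace_sgd_ids(identifiers, id_map):
    """Return (converted identifiers, SGD IDs that still need manual lookup)."""
    converted, unmapped = [], []
    for identifier in identifiers:
        if identifier in id_map:
            converted.append(id_map[identifier])
        else:
            converted.append(identifier)  # leave everything else unchanged
            if SGD_ID.fullmatch(identifier):
                unmapped.append(identifier)
    return converted, unmapped


# Made-up mapping row and IDs, only to show the call pattern.
sample = io.StringIO("S999999991\tYAL001C\tTFC3\n")
id_map = load_id_map(sample)
converted, unmapped = replace_sgd_ids(["S999999991", "YLR391W", "S999999992"], id_map)
```

Any SGD ID that ends up in `unmapped` has no systematic name in the mapping file and would need to be looked up by hand.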

@ahmad00m
Collaborator

ahmad00m commented Dec 4, 2021

@ahmad00m finished writing the code to replace SGD IDs with systematic names. However, I found that 45 of the 696 SGD IDs have no equivalent systematic name in the file obtained from the SGD help desk. So, I can look up these IDs and change them manually. After looking up these 45 SGD IDs, the file will be cleaned and ready to be used in stem.

The summary is as follows:

  • total records in the unprocessed dataset: 12,983
  • total of unique records in processed dataset: Around 5,599
  • total in processed dataset that did not have standard name and had to use the systematic name for that: I believe I can find all the standard names from the Yeastract website. (To avoid further confusion, I will not use the file obtained from the SGD help desk to convert systematic names to standard names, but will instead obtain the names directly from the Yeastract website.)

@ahmad00m
Collaborator

@ahmad00m finally cleaned the original file and found 5543 genes. Then, I found the standard names using Yeastract and used the systematic names for those that didn't have standard names. I have attached the final dataset below. Also, I tried running stem using the instructions from https://xmlpipedb.cs.lmu.edu/biodb/fall2019/index.php/Week_9, but I got an error saying "All genes filtered". Hopefully we can troubleshoot this during the meeting.

FINALUNIQUEIDS.txt

Screenshot 2022-01-18 at 11 53 20 am

@kdahlquist
Collaborator Author

@ahmad00m, there was a problem with the way you formatted the file.

  • "SPOT" needs to just be an index of 1 to 5544
  • "Gene Symbol" is actually the systematic name for yeast, e.g., YAL001C
    • Note that nomenclature can vary widely and while we try to use the correct names for things, not everyone does
    • Note that the actual standard name does not need to be included in the file you use for stem
  • I made these changes and was able to run the file.
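
For reference, the layout described above could be produced with something like this sketch: a "SPOT" column that is just a running index starting at 1, a "Gene Symbol" column holding the systematic name, and then the expression columns. The function name and the time point labels are hypothetical; only the first two column headers come from the notes above.

```python
import csv
import io


def write_stem_table(handle, timepoints, rows):
    """Write a tab-delimited table with SPOT and Gene Symbol columns.

    timepoints: labels for the expression columns (hypothetical, e.g. "0m").
    rows:       (systematic_name, [values...]) pairs, one per gene.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["SPOT", "Gene Symbol", *timepoints])
    for spot, (systematic_name, values) in enumerate(rows, start=1):
        # SPOT is only a 1-based row index; Gene Symbol is the systematic name.
        writer.writerow([spot, systematic_name, *values])


# Made-up values, only to show the resulting layout.
buffer = io.StringIO()
write_stem_table(buffer, ["0m", "5m"], [("YAL001C", [0.0, 1.5]), ("YLR391W", [0.0, -0.5])])
```

Writing the file from a script rather than saving it from Excel also avoids any automatic reformatting of the ID column.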

Before we move onto the next step, you need to write up a protocol for all the steps you carried out to go from the original file to this one. I want to review that and follow the steps myself to make sure I can replicate your results.

After that, the next step would be to generate candidate gene regulatory networks using Yeastract. It looks like out of the 8 significant patterns, 4 are generally up before returning to baseline and 4 are generally down before returning to baseline. In terms of looking for networks, it might work to group the genes from the 4 up and 4 down clusters.

@kdahlquist
Collaborator Author

Even though we don't need the standard names to run stem, we will want them. I noticed some odd standard names in the file:

  • e.g., ATSÊ1.00
  • You should always do a "visual inspection" of the file to see if there are any obvious problems. Just open and scroll down. You can see the issue on row 18.

@ahmad00m
Collaborator

ahmad00m commented Jan 31, 2022

Here is the final summary of cleaning the data, including the documentation of the steps taken.

  • The question about including the mitochondrial gene expression data was proposed during the meeting and it was decided to be included in the final version of the cleaned file.
  • Also, the decision was made to use the 0m expression data and normalize data option in stem for clustering gene expressions.
  • A final confirmation test was run on the original file containing around 13,000 genes to determine the unique expression data regardless of their IDs, which confirmed the total UNIQUE expression data of 5,569 genes.
  • There is one gene that could not be found on the YEASTRACT website, which represents a small nucleolar RNA that is required for pre-mRNA processing. So, my question is whether to include this data in the final version of the expression data. YNCG0013W is the name of this particular gene.
  • Other than that, the ID's in expression data are sorted and ready for phase two.

Here is the documentation of the steps taken to clean the expression data.
Documentation_of_gene_expression_data.docx

Also, the code is ready to be pushed to GitHub. Should @ahmad00m upload the documentation and the codes to GRNsight-archive?

Moreover, if @kdahlquist wants to confirm the steps I can email the codes before pushing them to GRNsight-archive.

Here is the final version of the file. (It is a bit different from the last one because this one includes the expression data for mitochondrial genes, which we decided to include.)
Unique_systematic_ID.txt

@kdahlquist
Collaborator Author

@ahmad00m , you can upload the code to the GRNsight-archive. If it needs to be modified in the future, that's OK. It is preferable to keep it in the repository. Since GitHub keeps track of all versions, it's better to keep it there as opposed to having the only copy be on your computer.

@ahmad00m
Collaborator

@ahmad00m pushed the codes to GRNsight-archive. So, they can be accessed for testing.

@ahmad00m
Collaborator

ahmad00m commented Feb 7, 2022

@ahmad00m ran STEM and saved the results. I also tried analysing the results and continued up until generating the regulation matrix in YEASTRACT; however, no matrix was created after a while and I did not get any errors either. I hope to troubleshoot this during the meeting so I can continue with visualising the model with GRNsight and determine which one would be appropriate to pursue further for modeling.

@ahmad00m
Collaborator

@ahmad00m updated the codes and the documentation for replacing the ID's. I also added the original expression data to GRNsight-archive. Moreover, I tried to create the regulation matrix but the website doesn't return any matrices, so I'm hoping to troubleshoot that during the meeting later today.

@kdahlquist
Collaborator Author

kdahlquist commented Feb 15, 2022

Some notes from the 2/14/22 meeting:

  • the problem with YEASTRACT was that there were leading spaces on the gene lists that @ahmad00m input into the regulation matrix tool. Deleting the spaces fixed the problem.
  • @ahmad00m will generate a total of four networks from YEASTRACT from the first four significant profiles found in stem.
  • he will use the setting DNA binding evidence AND expression evidence to increase the stringency and decrease the overall number of edges.
  • The target number of genes in the network is 15. He should take the top 15 transcription factor hits to generate the network and check to see if they are all connected. If they are, then he's done. If not, he can add/subtract genes to get a connected network of ~15
  • The output from YEASTRACT needs to be formatted to be compatible with GRNsight. The data needs to be transposed, alphabetized right to left and top to bottom, and the "p" needs to be removed from the gene names. Cell A1 needs to say "cols regulators/rows targets".
  • @ahmad00m will work with @Onariaginosa to generate 4 alternate networks with the same genes from the SGD database she is making. We expect there to potentially be fewer edges from the SGD data, based on work done a couple years ago.
  • @kdahlquist will review the documentation for data processing (probably at the end of the week.)
  • @ahmad00m needs to go back and make sure that he's got all the screenshots and stem data from the run that we are using.
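
The formatting steps for the YEASTRACT output could be scripted roughly as below. This is a sketch under several assumptions: that the YEASTRACT matrix has regulators as rows and targets as columns, that regulator names carry a trailing "p", and that "alphabetized" means ascending A to Z along both axes. The function names and the toy edges are hypothetical. Leading and trailing spaces are stripped from every name as well, since stray spaces caused the earlier problem with the gene lists.

```python
def strip_p(regulator):
    """Remove a single trailing "p" from a regulator name, if present."""
    return regulator[:-1] if regulator.endswith("p") else regulator


def to_grnsight_sheet(regulators, targets, matrix):
    """Transpose and sort a regulator-by-target matrix.

    matrix[i][j] is the entry for regulators[i] acting on targets[j]
    (assumed input layout). The result has regulators as columns and
    targets as rows, with the required label in the top-left cell.
    """
    regs = [strip_p(name.strip()) for name in regulators]
    tgts = [name.strip() for name in targets]
    edge = {
        (reg, tgt): matrix[i][j]
        for i, reg in enumerate(regs)
        for j, tgt in enumerate(tgts)
    }
    column_order = sorted(regs, key=str.upper)  # case-insensitive A to Z
    row_order = sorted(tgts, key=str.upper)

    sheet = [["cols regulators/rows targets", *column_order]]
    for tgt in row_order:
        sheet.append([tgt, *(edge[(reg, tgt)] for reg in column_order)])
    return sheet


# Toy input with made-up edges, only to show the transformation.
sheet = to_grnsight_sheet(
    ["Zap1p", "Abf1p"],
    [" ZAP1", "ABF1"],
    [[1, 0], [1, 1]],
)
```

Each inner list of `sheet` is one spreadsheet row, so it can be written out with `csv.writer` or pasted into the workbook that GRNsight reads.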

@ahmad00m
Collaborator

@ahmad00m finished creating the adjacency matrices for the first four significant profiles using YEASTRACT database. Also, the new documentation containing all the steps up until visualizing the GRN on GRNsight will soon be pushed to GRNsight-archive for review.

@ahmad00m
Collaborator

Here is the link to the complete Documentation

@dondi
Owner

dondi commented Feb 22, 2022

Follow-up wrap-up comments:

  • Seek to port the .docx to Markdown (.md) for easier viewing and editing
  • Rearrange scripts folder to reflect the dataset targeted by a particular set of scripts as, e.g., authorname-year-data

@ahmad00m
Collaborator

ahmad00m commented Mar 7, 2022

  • The documentation is now available in Markdown format
  • The Scripts folder has now been rearranged and placed in a more descriptive directory

@dondi
Owner

dondi commented Mar 8, 2022

Initial review of the documentation looks good; it will need a “validation test” where someone who is unfamiliar with the process seeks to follow the instructions in order to accomplish the same result. Tentatively this looks like a good match for @ahmad00m to go over with @Sarronnn, minimizing intervention until they discover something that needs to be clarified in the documentation

@kdahlquist
Collaborator Author

Closing because it is complete and live in v6.0.7
