Merge pull request #1008 from dondi/beta
v6.0.4
ahmad00m authored Dec 7, 2022
2 parents 4105a20 + ca79174 commit 846e9ab
Showing 32 changed files with 1,711 additions and 788 deletions.
2 changes: 2 additions & 0 deletions .eslintrc.yml
@@ -44,6 +44,8 @@ rules:
- error
brace-style:
- error
- 1tbs
- allowSingleLine: true
comma-spacing:
- error
max-len:
7 changes: 7 additions & 0 deletions .gitignore
@@ -1,3 +1,7 @@
# dotenv environment variables file
.env
.env.test

lib-cov
*.seed
*.log
@@ -12,13 +16,16 @@ lib-cov
documents/developer_documents/testing_script_generator/GRNsightTestingDocument.pdf
web-client/public/js/grnsight.min.js


pids
logs
results
/.idea

database/network-database/script-results
database/network-database/source-files
database/expression-database/script-results
database/expression-database/source-files

npm-debug.log
node_modules
88 changes: 87 additions & 1 deletion database/README.md
@@ -1 +1,87 @@
# GRNsight Database
Here are the files pertaining to both the network and expression databases. Look within the README.md files of both folders for information pertinent to the schema that you intend to be using.
## Setting up a local postgres GRNsight Database
1. Installing PostgreSQL on your computer
- MacOS and Windows users can follow these [instructions](https://dondi.lmu.build/share/db/postgresql-setup-day.pdf) on how to install PostgreSQL.
- Step 1 explains how to install PostgreSQL on your local machine, initialize a database, and start and stop your database instance.
- If your terminal already showed a message resembling the `initdb --locale=C -E UTF-8 location-of-cluster` command from Step 1B, then your installer has initialized a database for you.
- Additionally, your installer may start the server for you upon installation. To start the server yourself, run `pg_ctl start -D location-of-cluster`. To stop the server, run `pg_ctl stop -D location-of-cluster`. (A consolidated command sketch follows at the end of this step.)
- Linux users
- The MacOS and Windows instructions will _probably_ not work for you, though you can try them at your own risk.
- These [instructions](https://www.geeksforgeeks.org/install-postgresql-on-linux/) should work for you (...maybe...). If they don't, try searching for instructions specific to your distribution. Sorry!
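For quick reference, the commands from this step boil down to something like the following sketch; `location-of-cluster` is just a placeholder for whatever cluster directory your installer created or that you chose:

```
# initialize a database cluster (skip this if your installer already did it)
initdb --locale=C -E UTF-8 location-of-cluster

# start and stop the server for that cluster
pg_ctl start -D location-of-cluster
pg_ctl stop -D location-of-cluster

# with the server running, confirm that you can connect
psql postgresql://localhost/postgres
```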
2. Loading data to your database
1. Adding the Schemas to your database.
1. Go into your database using the following command:

```
psql postgresql://localhost/postgres
```
From there, create the schemas using the following commands:

```
CREATE SCHEMA spring2022_network;
```
```
CREATE SCHEMA fall2021;
```
Once they are created, you can exit your database using the command `\q`.
2. With the schemas in place, you can add the table specifications using the following commands:

```
psql postgresql://localhost/postgres -f <path to GRNsight/database/network-database>/schema.sql
```
```
psql postgresql://localhost/postgres -f <path to GRNsight/database/expression-database>/schema.sql
```
Your database is now ready to accept expression and network data!
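If you want to double-check this step, the following optional sanity checks (standard `psql` meta-commands) should list the two schemas and the still-empty tables defined by each `schema.sql`:

```
# list schemas; spring2022_network and fall2021 should both appear
psql postgresql://localhost/postgres -c '\dn'

# list the (still empty) tables in each schema
psql postgresql://localhost/postgres -c '\dt spring2022_network.*'
psql postgresql://localhost/postgres -c '\dt fall2021.*'
```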

2. Loading the GRNsight Network Data to your local database
1. GRNsight generates network data from SGD through YeastMine. In order to run the script that generates these network files, you must install its dependencies with pip3. If you get an error saying that a module doesn't exist, just run `pip3 install <Module Name>` and it should fix the error. If the error persists and points to a specific file on your machine, you might have to manually go into that file and alter the naming conventions of the dependencies that are used. _Note: So far this issue has only occurred on Ubuntu 22.04.1, so you might be lucky and not have to do it!_

```
pip3 install pandas requests intermine tzlocal
```
Once the dependencies have been installed, you can run

```
python3 <path to GRNsight/database/network-database/scripts>/generate_network.py
```
This will take a while, as it fetches all of the network data and generates all of the files. It creates a folder of processed files in `database/network-database/script-results`.

2. Load the processed files into your database.

```
python3 <path to GRNsight/database/network-database/scripts>/loader.py | psql postgresql://localhost/postgres
```
This should output a bunch of COPY print statements to your terminal. Once complete, your database is loaded with the network data.
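To confirm that rows actually landed in the database, you can spot-check a row count. The table name below is a placeholder rather than a real table from this repository; list the real names first with `\dt` and substitute one of them:

```
# list the network tables, then count the rows in one of them
psql postgresql://localhost/postgres -c '\dt spring2022_network.*'
# replace some_table with one of the table names from the listing above
psql postgresql://localhost/postgres -c 'SELECT COUNT(*) FROM spring2022_network.some_table;'
```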

3. Loading the GRNsight Expression Data to your local database
1. Create a directory (aka folder) called `source-files` in the `database/expression-database` folder.

```
mkdir <path to GRNsight/database/expression-database>/source-files
```
2. Download the _"Expression 2020"_ folder from Box (located at `GRNsight > GRNsight Expression > Expression 2020`) into your newly created `source-files` folder.
3. Run the pre-processing script on the data. This will create a folder full of the processed files in `database/expression-database/script-results`.

```
python3 <path to GRNsight/database/expression-database/scripts>/preprocessing.py
```
4. Load the processed files into your database.

```
python3 <path to GRNsight/database/expression-database/scripts>/loader.py | psql postgresql://localhost/postgres
```
This should output a bunch of COPY print statements to your terminal. Once complete, your database is loaded with the expression data.
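As with the network data, a couple of row counts make a quick sanity check; the table names here come from `database/expression-database/schema.sql`:

```
psql postgresql://localhost/postgres -c 'SELECT COUNT(*) FROM fall2021.gene;'
psql postgresql://localhost/postgres -c 'SELECT COUNT(*) FROM fall2021.expression;'
```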

60 changes: 60 additions & 0 deletions database/expression-database/README.md
@@ -0,0 +1,60 @@
# Expression Database

All files pertaining to the expression database live within this directory.

## The basics

#### Schema

All expression data is stored within the fall2021 schema on our Postgres database.

The schema is located within this directory at the top level in the file `schema.sql`. It defines the tables located within the fall2021 schema.

Usage:
To load to local database
```
psql postgresql://localhost/postgres -f schema.sql
```
To load to production database
```
psql <address to database> -f schema.sql
```

### Scripts

All scripts live within the subdirectory `scripts`, located at the top level of the expression database directory.

Any source files required to run the scripts live within the subdirectory `source-files`, located at the top level of the expression database directory. As source files may be large, you must create this directory yourself and add any source files you need to use there.

All generated results of the scripts live in the subdirectory `script-results`, located at the top level of the expression database directory. Currently, all scripts that generate output create this directory if it does not already exist. When adding a new script that generates output, best practice is to create the `script-results` directory and any subdirectories if they do not already exist, in order to prevent errors and snafus for freshly cloned repositories.
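If you want to set the directories up by hand before running a script, the shell equivalent of that create-if-missing behavior is a sketch like the following (directory names match the existing layout; adjust them for your own script's output):

```
# -p makes parent directories as needed and does not error if they already exist
mkdir -p script-results/processed-expression script-results/processed-loader-files
```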

Within the scripts directory, there are the following files:

- `preprocessing.py`
- `loader.py`

#### Data Preprocessor(s)
*Note: Data preprocessing is always specific to each dataset that you obtain. `preprocessing.py` is capable of preprocessing the specific expression data files located in `source-files/Expression 2020`. Because these files are too large to be stored on GitHub, access the direct source files on Box and move them into this directory. If more data sources are to be added to the database, create a new directory in `source-files` for each one, note it in this `README.md` file, and create a new preprocessing script for that data source (if required). Please document the changes in this section so that future developers may use your work to recreate the database if ever required.*

* The script (`preprocessing.py`) is used to preprocess the data in `source-files/Expression 2020`. It parses through each file to construct the processed loader files, so that they are ready to load using `loader.py`. Please read through the code, as there are instructions on what to add within the comments. Good luck!
* The resulting processed expression files are located in `script-results/processed-expression`, and the resulting processed loader files are located in `script-results/processed-loader-files`.

Usage:
```
python3 preprocessing.py
```
#### Database Loader

This script (`loader.py`) loads your preprocessed expression data into the database.

It generates direct SQL statements from the source files produced by the data preprocessor in order to populate a relational database with those files' data.

Usage:
To load to local database
```
python3 loader.py | psql postgresql://localhost/postgres
```
To load to production database
```
python3 loader.py | psql <path to database>
```
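Since the loader just prints SQL to standard output, you can also capture that SQL in a file and skim it before loading. This is an optional sketch of that workflow, and the temporary file name is arbitrary:

```
# write the generated SQL to a file instead of piping it directly to psql
python3 loader.py > /tmp/expression-load.sql

# skim the first statements, then load the file once it looks right
head -n 20 /tmp/expression-load.sql
psql postgresql://localhost/postgres -f /tmp/expression-load.sql
```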
71 changes: 71 additions & 0 deletions database/expression-database/schema.sql
@@ -0,0 +1,71 @@
CREATE TABLE fall2021.ref (
pubmed_id VARCHAR,
authors VARCHAR,
publication_year VARCHAR,
title VARCHAR,
doi VARCHAR,
ncbi_geo_id VARCHAR,
PRIMARY KEY(ncbi_geo_id, pubmed_id)
);

CREATE TABLE fall2021.gene (
gene_id VARCHAR, -- systematic like name
display_gene_id VARCHAR, -- standard like name
species VARCHAR,
taxon_id VARCHAR,
PRIMARY KEY(gene_id, taxon_id)
);

CREATE TABLE fall2021.expression_metadata (
ncbi_geo_id VARCHAR,
pubmed_id VARCHAR,
FOREIGN KEY (ncbi_geo_id, pubmed_id) REFERENCES fall2021.ref(ncbi_geo_id, pubmed_id),
control_yeast_strain VARCHAR,
treatment_yeast_strain VARCHAR,
control VARCHAR,
treatment VARCHAR,
concentration_value FLOAT,
concentration_unit VARCHAR,
time_value FLOAT,
time_unit VARCHAR,
number_of_replicates INT,
expression_table VARCHAR,
display_expression_table VARCHAR,
PRIMARY KEY(ncbi_geo_id, pubmed_id, time_value)
);
CREATE TABLE fall2021.expression (
gene_id VARCHAR,
taxon_id VARCHAR,
FOREIGN KEY (gene_id, taxon_id) REFERENCES fall2021.gene(gene_id, taxon_id),
-- ncbi_geo_id VARCHAR,
-- pubmed_id VARCHAR,
sort_index INT,
sample_id VARCHAR,
expression FLOAT,
time_point FLOAT,
dataset VARCHAR,
PRIMARY KEY(gene_id, sample_id)
-- FOREIGN KEY (ncbi_geo_id, pubmed_id, time_point) REFERENCES fall2021.expression_metadata(ncbi_geo_id, pubmed_id, time_value)
);
CREATE TABLE fall2021.degradation_rate (
gene_id VARCHAR,
taxon_id VARCHAR,
FOREIGN KEY (gene_id, taxon_id) REFERENCES fall2021.gene(gene_id, taxon_id),
ncbi_geo_id VARCHAR,
pubmed_id VARCHAR,
FOREIGN KEY (ncbi_geo_id, pubmed_id) REFERENCES fall2021.ref(ncbi_geo_id, pubmed_id),
PRIMARY KEY(gene_id, ncbi_geo_id, pubmed_id),
degradation_rate FLOAT
);

CREATE TABLE fall2021.production_rate (
gene_id VARCHAR,
taxon_id VARCHAR,
FOREIGN KEY (gene_id, taxon_id) REFERENCES fall2021.gene(gene_id, taxon_id),
ncbi_geo_id VARCHAR,
pubmed_id VARCHAR,
FOREIGN KEY (ncbi_geo_id, pubmed_id) REFERENCES fall2021.ref(ncbi_geo_id, pubmed_id),
PRIMARY KEY(gene_id, ncbi_geo_id, pubmed_id),
production_rate FLOAT
-- FOREIGN KEY (gene_id, ncbi_geo_id, pubmed_id) REFERENCES fall2021.degradation_rate(gene_id, ncbi_geo_id, pubmed_id) -- not sure if we want to link the generated production rate to its original degradation rate
);