
Test import of GRNmap input workbooks for expression data sheets #734

Closed
kdahlquist opened this issue Feb 19, 2019 · 78 comments
@kdahlquist
Collaborator

When writing up issue #733, it occurred to me that we need to test the input of a full GRNmap workbook, both in terms of file format and in terms of missing sheets or data. I'm not sure that errors or warnings are given for erroneous workbooks. We have a full suite of tests for the "network" and "network_optimized_weights" sheets. Hopefully some of that can be leveraged for the rest of the worksheets.

@dondi
Owner

dondi commented Mar 26, 2019

There is a good chunk of pre-existing code which can be used as a pattern for fulfilling this issue. @afiller1 should look at warnings*-tests.js and errors*-tests.js for examples of how to set up error and warning test functions. @kdahlquist can help with creating erroneous workbooks.

@afiller1 can then look at the existing .xlsx-parsing code to see how errors and warnings are reported back.

@kdahlquist
Collaborator Author

This wiki has a description of what should be in each worksheet:

https://openwetware.org/wiki/BIOL388/S19:Week_6#Creating_the_GRNmap_Input_Workbook

I can help by organizing categories of tests.

@afiller1
Collaborator

afiller1 commented Apr 9, 2019

Just want to check on the procedure.
I made a new folder under test-files containing various workbooks that are each incorrect in different ways.
I created a file called expression-data-import-tests.js in the test folder that resembles the errors-graph-tests.js file. It runs through all the aforementioned files and makes sure that they throw errors/warnings.
I assume I should also add to the test.js file in the test folder any new errors that come up in the expression-data-import-tests code I have been writing, correct?
Where will I write the code that runs checks on actual input workbooks, examining the data for formatting errors and delivering error/warning messages to the user? Is this incorporated in what I've already done? @dondi

@afiller1
Collaborator

afiller1 commented Apr 9, 2019

GitHub is giving me some issues with pushing and pulling, so I'll talk to @dondi about that tomorrow. Should be quickly resolvable I imagine.

Test Files I added under test-files/expression-data-test-sheets:

  • expression_sheet_different_number_of_columns: one expression data sheet has more columns timestamped "30" than the other. Should not throw an error.
  • expression_sheet_incorrect_numbering: timestamp column labels do not match what is given in optimization_parameters sheet
  • expression_sheet_name_not_in_optparams: there is an expression data sheet with data for something not listed in optimization_parameters sheet
  • expression_sheet_name_not_present: there is something listed in optimization_parameters sheet for which there is not an expression_data sheet
  • expression_sheet_wrong_id_label: the "id" column label is incorrect

@kdahlquist do these look okay to you? Let me know what I need to add/fix. I have them all connected to errors right now...should any of them be warnings instead?

I have the basic test code that I based off of errors-graph-tests, but I would like to sit and walk through how/where to write the code that will actually iterate through the sheets to check for all of this and also that will alert the user of errors in their own imported data.

@kdahlquist
Collaborator Author

At the meeting we are going to have to discuss the concept that some things that display OK in GRNsight would throw an error in GRNmap. We need to separate the idea of a GRNmap validator from the ability to display in GRNsight, because GRNsight has a looser set of requirements for display.

When a user uploads an Excel spreadsheet, a pop-up could ask "Validate workbook for GRNmap?" with the options being yes/no and with a checkbox for "Remember my choice/ask me every time". We could put this option in one of the menus.

GRNsight only needs a "network" or "network_optimized_weights" sheet for display. These are subject to the tests already written.

Optionally, GRNsight needs a single "_log2_expression" or "_log2_optimized_expression" sheet for node coloring, but if neither is present, that's fine; you just won't see colors on the nodes. These are subject to the tests that @afiller1 is writing now.

All the other worksheets are not strictly required for GRNsight display, although we will write tests for them anyway.

So, the tests for the "_log2_expression" or "_log2_optimized_expression" worksheets should include the following:

  • a worksheet with either "_log2_expression" or "_log2_optimized_expression" exists.
    • If they do not, a warning should be given that says that the "_log2_expression" or "_log2_optimized_expression" worksheet was not detected. The network graph will display without node coloring.
  • sheet name follows the proper naming convention: the strain name from the optimization_parameters sheet, followed by either "_log2_expression" or "_log2_optimized_expression".
    • Only one of these is absolutely required for node coloring. However, for workbook validation, there should be at least "_log2_expression" sheets for all named strains.
    • For GRNmap, case needs to be enforced.
    • For GRNmap, whatever gene is mentioned in the strain name e.g., CIN5 for "dcin5", needs to be in the id list.
  • "id" needs to be in cell A1.
  • List of genes in column A needs to match identically what is in the "network" sheet (including case), and in the same order.
  • Timepoint column headers need to be the same as in the "optimization_diagnostics" worksheet, and in the same order.
  • Number of replicates for timepoints can vary, but there needs to be at least one for each of the timepoints.
  • No missing column headers. Test first, last, and middle.
  • No missing gene names. Test first, last, and middle.
  • No empty columns. Test first, last, and middle.
  • No empty rows. Test first, last, and middle.
  • No stray data hanging out in other cells in the worksheet.
  • Missing datapoints treated as null.
  • At least one datapoint per gene and timepoint. You should test a worksheet that only has one replicate (1 column) per timepoint to see what happens, as well as ones where there is one datapoint, but multiple replicates (missing data).

Some of these tests are similar to what was written to validate the network, so you can see what that's like.
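A checker along these lines could be sketched as follows. The function name `validateExpressionSheet`, the `{ name, data }` sheet shape (a row-major 2D array with missing cells as null), and the message wording are all assumptions for illustration, not GRNsight's actual API:

```javascript
// Sketch of an expression-sheet checker; names and sheet shape are assumed.
function validateExpressionSheet(sheet, networkGenes, expectedTimepoints) {
  const errors = [];
  const warnings = [];
  const header = sheet.data[0] || [];
  const timepoints = header.slice(1);

  // "id" must be in cell A1
  if (header[0] !== "id") {
    errors.push(`${sheet.name}: cell A1 must be "id"`);
  }

  // Timepoint column headers must match the expected list, in order
  if (timepoints.join() !== expectedTimepoints.join()) {
    errors.push(`${sheet.name}: timepoint headers do not match`);
  }

  // Gene list in column A must match the network sheet (case and order)
  const rows = sheet.data.slice(1);
  const genes = rows.map((row) => row[0]);
  if (genes.join() !== networkGenes.join()) {
    errors.push(`${sheet.name}: gene list must match the network sheet`);
  }

  // At least one datapoint per gene and timepoint (missing cells are null)
  const uniqueTimes = [...new Set(timepoints)];
  rows.forEach((row) => {
    uniqueTimes.forEach((t) => {
      const replicates = row.slice(1).filter((v, j) => timepoints[j] === t && v !== null);
      if (replicates.length === 0) {
        errors.push(`${sheet.name}: gene ${row[0]} has no datapoint at timepoint ${t}`);
      }
    });
  });

  return { errors, warnings };
}

// Usage with a minimal fixture:
const result = validateExpressionSheet(
  {
    name: "wt_log2_expression",
    data: [
      ["id", 15, 15, 30],
      ["ACE2", 0.5, null, 1.1],
      ["CIN5", null, null, 0.9], // no datapoint at timepoint 15
    ],
  },
  ["ACE2", "CIN5"],
  [15, 15, 30]
);
```

Keeping each check a separate statement like this makes it straightforward to pair each one with a minimal test workbook that triggers only that check.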

@kdahlquist
Collaborator Author

The worksheets formatted as two columns are not (yet) used by GRNsight for anything, so the purpose of checking them is just validation. Test the following:

  • worksheet exists
  • worksheet names are identical to what is required (also, all lowercase letters)
  • "id" in cell A1 (all lowercase)
  • list of genes in column A is identical to list in "network" or "network_optimized_weights" sheet. Same spelling, same case, same order
  • column header in cell B1 is exactly what is expected
  • numerical values in column B
  • no missing values allowed
  • no missing rows allowed
  • no missing columns allowed
  • no stray data in another cell outside what is expected
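These checks are simple enough that a single helper could cover all the two-column sheets. The sketch below is illustrative only; the function name, sheet shape, and messages are assumptions, not GRNsight code:

```javascript
// Sketch of a two-column sheet checker (e.g., production_rates);
// names and sheet shape are assumed for illustration.
function validateTwoColumnSheet(sheet, expectedHeader, networkGenes) {
  const errors = [];
  const [a1, b1] = sheet.data[0] || [];

  if (a1 !== "id") {
    errors.push(`${sheet.name}: cell A1 must be "id" (all lowercase)`);
  }
  if (b1 !== expectedHeader) {
    errors.push(`${sheet.name}: cell B1 must be "${expectedHeader}"`);
  }

  // Gene list in column A must be identical to the network sheet's list:
  // same spelling, same case, same order.
  const rows = sheet.data.slice(1);
  const genes = rows.map((row) => row[0]);
  if (genes.join() !== networkGenes.join()) {
    errors.push(`${sheet.name}: gene list must match the network sheet exactly`);
  }

  // Column B must hold a numerical value for every gene; no missing values.
  rows.forEach((row) => {
    if (typeof row[1] !== "number") {
      errors.push(`${sheet.name}: missing or non-numeric value for ${row[0]}`);
    }
  });

  return errors;
}

// Usage sketch: CIN5's missing value should be flagged.
const sheetErrors = validateTwoColumnSheet(
  { name: "production_rates", data: [["id", "production_rate"], ["ACE2", 0.3], ["CIN5", null]] },
  "production_rate",
  ["ACE2", "CIN5"]
);
```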

@kdahlquist
Collaborator Author

Here is a list of worksheets and their requirements by GRNmap or GRNsight:

  • Required by GRNsight
    • "network" and/or "network_optimized_weights"
      • GRNsight has looser rules than GRNmap. GRNsight allows asymmetrical matrices and allows the gene names to be in different orders across the top row and left column.
  • Required by GRNsight for node coloring (but if they are not present, network graph should still display without node coloring)
    • at least 1 _log2_expression or _log2_optimized_expression worksheet
    • in order to validate these for GRNmap, I think we are going to need to require the "optimization_parameters" sheet, as well. GRNmap has stricter requirements than GRNsight for which of these sheets are present
  • Not required by GRNsight, but required by GRNmap for an input workbook
    • production_rates
    • degradation_rates
    • network_weights
    • threshold_b
    • optimization_parameters
  • Not required by GRNsight, but would be present in a GRNmap output workbook
    • all sheets named above
    • _sigmas
    • network_optimized_weights
    • optimized_production_rates
    • optimized_degradation_rates
    • optimized_threshold_b
    • optimization_parameters

@dondi
Owner

dondi commented Apr 9, 2019

After talking through the prior comments, in terms of development/code impact, the key takeaway is that the aspects that need to be separated should also be callable separately in the code. These groupings should be invokable separately, which also works well at the unit-test level because test cases can then be very specific and narrow (e.g., expression sheets only; network sheets only).

We would develop this "engine" first, with accompanying tests, then when the raw validation functionality is present, we can then tackle how this becomes visible to the user. User choices would then drive which validation functions are used when a workbook is loaded.

Another tie-in, now to @mihirsamdarshi's work, is that the user's preferences would be new additional properties in grnState, and displaying and updating of this value should follow the MVC paradigm.

With regard to import/export, GRNsight benefits from having a compiler-style "intermediate representation" that is the destination of imports; exports then derive from this intermediate representation independently of the original imported files. The only caveat is that not all formats contain all of the information expected by a workbook; that missing information will not be available in the export format. The approach very much follows how compilers have a "front end" and a "back end."
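The front-end/back-end split can be sketched as follows. The format handling and the object shape here are illustrative assumptions, not GRNsight's actual intermediate representation:

```javascript
// Illustrative sketch of the compiler-style split: every importer targets one
// intermediate network object, and every exporter reads only from it.

// "Front end": an importer produces the intermediate representation.
// Toy SIF-like format assumed here: "source relationship target" per line.
function importFromSif(text) {
  const edges = text.trim().split("\n").map((line) => {
    const [source, , target] = line.split(/\s+/);
    return { source, target };
  });
  const genes = [...new Set(edges.flatMap((e) => [e.source, e.target]))];
  return { genes, edges }; // the intermediate representation
}

// "Back end": an exporter derives output from the IR, never from the source file.
function exportEdgeList(network) {
  return network.edges.map((e) => `${e.source} -> ${e.target}`).join("\n");
}

const network = importFromSif("ACE2 pd CIN5\nCIN5 pd SFP1");
const output = exportEdgeList(network);
```

Because exporters see only the intermediate object, any information a source format lacks (e.g., expression data in a SIF file) is simply absent from the object and thus from every export.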

@afiller1
Collaborator

afiller1 commented Apr 16, 2019

I added many new Excel files with differing errors for test purposes. @kdahlquist , I would like to go over these files with you to make sure I've done them correctly and that they test what is required. These files are in test-files/expression-data-test-sheets. I also updated the test code, test/expression_data_import_test, to go through each of these Excel files and throw the appropriate errors/warnings. What I have currently is a rough draft of this test code, and I may need a little help with correctly categorizing the errors and correctly writing the errors in test.js (@dondi ). I am sick and will not be at the meeting today, but please let me know what I need to alter/improve for next week. If you want to check out anything I've worked on, it's all pushed to my branch.

@dondi
Owner

dondi commented Apr 16, 2019

We did a quick review of some .xlsx files and have these notes:

  • The .xlsx files look right but can be even more minimal (under the condition below), i.e., if a warning depends solely on, say, something in the wt_log2_expression worksheet, then the test workbook for that warning should only have that worksheet.
  • These "incomplete" workbooks won't suffice for the generalized "import" function, of course; the key here is to organize the "checker" functions so that they can be tested piecemeal. This way, we can isolate very specific functionalities: they are truly unit tests.

If this needs further explanation, let me know. Generally things are in the right direction and the above bullets will help truly minimize the test files to precisely the condition/situation they are testing.

@afiller1
Collaborator

  • For missing-gene-name error, should I keep all the sheets? It seems they're necessary.
  • Based on what you said above, should I keep or delete the network sheets in all these test files?

I simplified the worksheets as described. I need to go back and remove the names of the deleted sheets from optparams (I realized my mistake just now).

@dondi
Owner

dondi commented Apr 23, 2019

After today's code review and in light of the comments made last week, we can add the following details:

  • The "uber" import function for GRNmap workbooks is the function called processGRNmap
  • processGRNmap calls parseSheet which should probably be renamed parseNetworkSheet eventually
  • There is a parseAdditionalSheets function which is a "sub-uber" function that itself calls additional parse* functions

Given this structure, the immediate task to close out the semester is:

  • Make the sub-parse* functions visible to test code
  • This lets the test code invoke these functions in isolation
  • And thus this lets you create test workbooks with only the conditions and sheets that you are testing

As a future follow-up, we can then restructure processGRNmap so that it has a more readily-readable flow, without the "cruft" of evolving from older code.
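The shape of that restructuring might look like the sketch below. The function names echo the ones mentioned above (processGRNmap, the sub-parse* functions), but the bodies are illustrative stubs, not GRNsight's real parsing logic:

```javascript
// Sketch of exposing the sub-parse* functions to test code; bodies are stubs.
function parseNetworkSheet(sheet) {
  return { genes: sheet.data[0].slice(1) }; // stub: read gene names from the header row
}

function parseExpressionSheet(sheet) {
  return { timepoints: sheet.data[0].slice(1) }; // stub: read timepoint headers
}

// The "uber" function composes the pieces for a full workbook...
function processGRNmap(workbook) {
  return {
    network: parseNetworkSheet(workbook.sheets.network),
    expression: parseExpressionSheet(workbook.sheets.wt_log2_expression),
  };
}

// ...but each piece is also exported on its own, so a test workbook only needs
// the single sheet relevant to the function under test.
if (typeof module !== "undefined") {
  module.exports = { processGRNmap, parseNetworkSheet, parseExpressionSheet };
}

// A test can now exercise one parser in isolation with a one-sheet fixture:
const isolated = parseExpressionSheet({ data: [["id", 15, 15, 30]] });
```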

@dondi
Owner

dondi commented Apr 30, 2019

General milestone for end-of-semester is to produce a new overall structure that can serve as the foundation for adding error/warning checks and tests for them. Full error/warning and test coverage can be done later. Preserve existing functionality, but implement a few (< 5) new error/warning checks for expression data. Future work should then entail just adding more checks and tests without having to restructure the code significantly.

The primary goal of the restructuring is to isolate cohesive groups of checks (rather than, say, the existing monolithic approach in parseSheet ) so that these checks can be tested at a finer granularity, with test workbook files only containing exactly the data with error/warning conditions that are being tested.

@afiller1
Collaborator

This week, I worked more on the expression-data-controller file in controllers/importers. Currently, we use spreadsheet-controller and additional-sheet-parser to check through the network sheets and the other sheets in any given workbook. The issue is that errors/warnings and their messages currently exist only for network sheets; additional-sheet-parser reads through the other sheets, but it doesn't have any error or warning messages for them.

What I'm doing in expression-data-controller is separating the expression data sheet checks from the rest of the sheets so that, for testing and scanning, we can look through just an expression sheet. Since not ALL sheets are required for GRNsight to function, it will be useful for us to look at the sheets separately like this. I'm trying to get errors to pop up for faulty expression sheets using my expression-data-controller code. It isn't complete and currently fails to accept workbooks that should work, but I have somewhat of a structure. Needs debugging.

@afiller1
Collaborator

Beware, also, that I commented out var additionalData = parseAdditionalSheets(sheet); and replaced it with var additionalData = parseExpressionSheets(sheet); in spreadsheet-controller, so that might be causing some of the issues. But I'm trying to see what happens when we're only trying to look through the expression data sheet, so I did this trying to isolate that.

@afiller1
Collaborator

afiller1 commented Sep 4, 2019

This week, I spent a good amount of time bringing my own branch up to date with master (I had not merged in any updates in a very long time), then re-added all of my changes to the testing on top of that. What I currently have is all committed to my branch.

I again went over the list of things we want to check in the expression sheets and narrowed down the data stored in those sheets even more, so that the testing can be more specific. I then went through each sheet and added its corresponding error to expression-data-import-tests.js. Then I looked at test.js and added the errors that didn't already exist into that code, although I don't know if I did that right. I also had to go through uninstalling and reinstalling Node; I'm still having issues here that I need some help with, because npm run start-dev causes an error. In addition, I made expression-sheet-parser.js, which takes only the expression sheet parsing capabilities from additional-sheet-parser.js and separates them out.

My question is: where exactly do I need to edit to make sure that the appropriate checks are running? My estimate is spreadsheet-controller.js, under the TODO about adding file validation. Will I add the functionality that examines the expression sheets in expression-sheet-parser.js? Right now it really only checks whether there IS an expression sheet and parses it.

@dondi
Owner

dondi commented Sep 4, 2019

I can address the maintenance level issues here but may need to dig deeper for the more specific code questions. Some thoughts on what you did to get updated and back up to speed:

  • Because we are gearing up for a release, I suggest merging beta into your branch. That way, your tests and work will also include changes that were made over summer (master doesn’t have those)
  • When the node installation goes awry, the first thing I try is to delete the node_modules folder then re-run npm install (do this after merging beta)
  • If this doesn’t work, I re-delete node_modules as well as package-lock.json then try again. This almost always fixes things, although it does change a committed file. Whether or not those changes need to be committed is an on-going issue, but typically I commit the last version before releasing
  • If this still doesn’t work, then we’ll have to look deeper; see if there are lingering library dependencies or old software that is somehow affecting the setup. The general idea is that the same software should always run the same; if one person is seeing different behavior from another, there must be a difference lurking somewhere
  • Since, in your case, you are seeing an error message when running npm run start-dev, it will be helpful to also document that error. The error may point directly to the problem—the issue is being able to interpret the error in order to discern its root cause

General answer to your software question is to “do the checks whenever the information required to perform the check becomes available.” But this can vary depending on specifics; probably best to finalize this with the code in front of us.

@dondi
Owner

dondi commented Sep 4, 2019

We reviewed the existing code at the meeting and clarified the separate objectives of adding the new error-checking functionality as well as restructuring the parsing code so it is less monolithic. @afiller1 was given some degrees of freedom for this restructuring as long as the first objective is also met; to be reviewed by @dondi and @kdahlquist as progress moves along.

@dondi
Owner

dondi commented Sep 11, 2019

@dondi and @afiller1 reviewed the work so far and @dondi clarified some of the errors that @afiller1 was seeing. The core issue is that JavaScript doesn't have the same strictness with types and so different modules are free (perhaps too free) to export anything that is assigned by the programmer. This leads to potential inconsistencies in how modules are used. @afiller1 will (for now) need to just read the module exports closely in order to get a precise understanding of the values being exported and how they are being used.

@dondi
Owner

dondi commented Sep 18, 2019

@afiller1 is getting progressively more acclimated to the code and has started to reorganize its structure. @dondi told @afiller1 about controllers and how the import controller is structured, projecting how spreadsheet controller should morph into a similar modular approach, and finally how the spreadsheet controller can one day become an xlsxToGrnsight function that is a peer of the SIF and GraphML importers.

@dondi
Owner

dondi commented Apr 20, 2020

Let’s go ahead and convert the demo files into network objects and return these hardcoded objects from this point on. Demos 3 and 4 may need a temporary code revision so that the expression data can be captured from the correct sheet (or, a temporary version of the demo files can be used with the correct sheet name in order to capture the network object in its entirety).

For the Object.assign line, if it turns out that this line is actually unnecessary/superfluous, then let's go ahead and remove it.

@Onariaginosa
Collaborator

I attempted to convert the demo files into network objects. These objects were quite large, and the console did not deep copy them, so I had to manually deep copy each network object. Because of this, I created a file called demo-networks.js that stores each of the 4 network objects and processes these networks into JSON objects. I then called the demo-network functions in spreadsheet-controller.js. However, I am getting an error every single time I call any of the 4 demos. The error is attached.
[screenshot: error output, 2020-04-26]

@Onariaginosa
Collaborator

Here are also images of what happens in my terminal, and the demo-networks.js file itself.

[screenshots: terminal output and the demo-networks.js file, 2020-04-26]

@dondi
Owner

dondi commented Apr 27, 2020

The error came from a misalignment between what was being exported and how it was being used. The export is a function but when it was used, it wasn’t being called (i.e., demoNetworks was accessed rather than demoNetworks()).
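A minimal reconstruction of that mismatch, using a stand-in for the demo-networks.js export (the real module returns full network objects):

```javascript
// Stand-in for require("./demo-networks"): the module exports a function.
const demoNetworks = () => ({ genes: ["ACE2", "CIN5"], edges: [] });

const wrong = demoNetworks;   // the function itself; wrong.genes is undefined
const right = demoNetworks(); // calling it yields the actual network object
```

Because JavaScript happily lets you pass a function around as a value, forgetting the parentheses produces no error at the point of access, only confusing failures downstream.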

@Onariaginosa
Collaborator

I fixed the demo files' network objects so that they display with node coloring and no errors.

@Onariaginosa
Collaborator

I was going over the expression sheet tester files and adding the species name and taxon IDs. I noticed a tester file, expression_sheet_incorrect_numbering.xlsx, that was not used in any of the tests. It concerns the time points of the expression data. Currently the time points are 15 15 15 30 30 30, which is incorrect. I am making a test to check for duplicate time points; duplicates currently produce an error, but should they produce a warning instead?

Additionally, what are the rules for the time points? Do they have to be increasing in some specific way? @dondi and I were discussing this on Slack and wanted @kdahlquist's opinion on the matter.

@kdahlquist
Collaborator Author

Duplicate time periods are actually there by design. They represent replicate datapoints for that time period, i.e., each "15" represents data from an independent experiment.

Time periods should be increasing. The lowest number could be 0, there shouldn't be any negative numbers. The intervals between the numbers could be anything.

@dondi
Owner

dondi commented Sep 11, 2020

  • Check for non-monotonic time points
  • Check for negative time points
  • Classify both issues as errors
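Combining these rules with the replicate convention above, a timepoint check could be sketched as follows. The function name and messages are assumptions for illustration; the rules themselves (numeric, non-negative, non-decreasing with duplicates allowed as replicates) come from the discussion above:

```javascript
// Sketch of the timepoint checks: numeric, non-negative, and non-decreasing.
// Exact duplicates are allowed, since they represent replicate datapoints.
function validateTimepoints(timepoints) {
  const errors = [];
  timepoints.forEach((t, i) => {
    if (typeof t !== "number" || Number.isNaN(t)) {
      errors.push(`Timepoint "${t}" is not a number`);
    } else if (t < 0) {
      errors.push(`Timepoint ${t} is negative`);
    } else if (i > 0 && typeof timepoints[i - 1] === "number" && t < timepoints[i - 1]) {
      errors.push(`Timepoints are non-monotonic at ${t}`);
    }
  });
  return errors;
}
```

Under these rules, 15 15 15 30 30 60 passes (the repeated values are replicates), while 30 15, a negative timepoint, or a non-numeric header each produce an error.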

@kdahlquist
Collaborator Author

kdahlquist commented Sep 11, 2020 via email

@Onariaginosa
Collaborator

I previously asked about integers and real numbers because I forgot that JavaScript treats all numbers as a single number type, not as integers or floats. Sorry!

@Onariaginosa
Collaborator

I created tests that check for non-monotonic time points, negative time points, and non-numeric time points. I also checked them in my local server to ensure that the errors display correctly, and I created the respective tester files for these tests.

@dondi
Owner

dondi commented Sep 18, 2020

@Onariaginosa will continue the testing work and look toward issuing a PR as soon as current hardware issues are resolved. Because beta has been updated, when you are back on board try to find a time to merge beta into your branch when feasible.

@Onariaginosa
Collaborator

[screenshot: warning message]

During my first 2-hour session, I attempted to set up GRNsight on my local machine. I'm not sure if I did it correctly, because whenever I attempt to run my local server, GRNsight crashes. This happens on the Onariaginosa branch and in beta; master works correctly, though. I did go through all of the expression sheet tester files and added the species name and taxon ID, so those warnings should not appear anymore, but I couldn't test this because my local server isn't working. I should be ready to do a pull request as soon as the screenshotted warning is resolved.

@dondi
Owner

dondi commented Sep 24, 2020

We triaged this today and the fix ended up being the classic “delete node_modules entirely and run npm install again” approach.

@dondi dondi changed the title Test import of GRNmap input workbooks for sheets other than "network" or "network_optimized_weights" Test import of GRNmap input workbooks for expression data sheets Oct 2, 2020
@dondi dondi assigned kdahlquist and unassigned Onariaginosa Oct 2, 2020
@dondi dondi assigned ahmad00m and unassigned kdahlquist Jan 21, 2021
@dondi
Owner

dondi commented Mar 10, 2021

@ahmad00m will need some collab/onboarding time with @Onariaginosa and/or @igreen1 to learn about npm install or other commands required to get set up with this.

@ahmad00m
Collaborator

@Onariaginosa helped @ahmad00m to run the tests on beta. No errors were found so I think this issue is completely resolved.

@dondi dondi closed this as completed Mar 17, 2021