Multi line headers #379
Comments
To be clear, the following code would achieve the same results (code is a bit bulkier due to readxl's automatic numbering for duplicate column names). I've also converted the linked xls files to xlsx so that they can be read with readxl.
Often the first header can also consist of merged cells. Probably this will be fixed with #355
This is a common feature in spreadsheets 😐. I doubt readxl itself will gain this functionality as there are external packages that specialize in it. tidyxl is on CRAN and under active development. jailbreakr is a GitHub-only project I've had a small hand in. Maybe see if either of those gets the job done. I don't immediately see a natural readxl-ish way to expose complicated header management, but I'm happy to hear ideas. btw
The reason I converted the xls files is that readxl couldn't read them as they were. This is probably related to another issue, as the data provider is the same (#374). I've looked at both tidyxl and jailbreakr, but they seem complicated for handling something that is indeed quite common.

Just shooting some ideas here. Would it be possible to work with an index argument that says which columns (normal) or rows (new) contain information on the variables (examples below)?

The reasoning is simple. Most tidyverse code works best when the data is in a long format (e.g. ggplot and dplyr). Most Excel data (and by extension csv data) is in a wide format, since this is more natural in a spreadsheet editor. Currently, the read functions of the tidyverse cover the simplest use case, where there can be multiple long columns (each column header is a variable name) but at most one wide column (really a row, where each column header is not a variable name but an observation of a variable). The goal of most people will be to convert this wide format to a long format. Therefore, the newly proposed index argument can be used to indicate which columns are actually long columns and which rows should be 'gathered' (to use tidyr lingo). An index would simply be a column or a row that represents a variable. Some example data:
What most people probably want is this:
A proposition for a code sample that would extract this:
A normal table can be read with Similarly, you can also name columns/rows with There are a few issues to be solved. Most of them concern the conventions of the top-left block (the two empty cells and the headers 'Vehicle' and 'Sex'). Another one is the interaction/duplication between This would also invoke some competition between readr/readxl and tidyr. However, some of the information might be lost otherwise. This would be in favor of including the tidyr This would add some complexity to readr/readxl and it is up to you to judge if it fits within the scope of readxl. There are a few other options:
It would also depend on how doable this is in the current readxl code framework. Let me know what you think.
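Until something like an index argument exists, the gather step described in this comment can be done after reading, once the header rows have been collapsed into names such as `Car_Male`. A rough base R sketch, not a proposal for the readxl API: the numbers and the `Year` column are invented for illustration, and only the 'Vehicle' and 'Sex' labels come from the example discussed above.

```r
# Hypothetical wide table whose two header rows were already merged
# into single names of the form <Vehicle>_<Sex>.
wide <- data.frame(
  Year        = c(2015, 2016),
  Car_Male    = c(10, 11),
  Car_Female  = c(12, 13),
  Bike_Male   = c(20, 21),
  Bike_Female = c(22, 23)
)

# Every column except the long "Year" column holds observations.
value_cols <- setdiff(names(wide), "Year")

# Build one small long table per wide column, then stack them.
pieces <- lapply(value_cols, function(col) {
  parts <- strsplit(col, "_", fixed = TRUE)[[1]]
  data.frame(
    Year    = wide$Year,
    Vehicle = parts[1],
    Sex     = parts[2],
    Value   = wide[[col]],
    stringsAsFactors = FALSE
  )
})
long <- do.call(rbind, pieces)
```

The result has one row per Year/Vehicle/Sex combination, which is the shape ggplot and dplyr work best with.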
@danielsjf Since you are interested in this problem, the Idaho State election records have several interesting layouts to think about. You are right that tidyxl and unpivotr are complex. I'm interested in your ideas on how they could be made more intuitive -- feel free to comment in those repos, the Google group, or email me.
Is there any news regarding a package that can read Excel files with multi-line headers?
I think tidyxl and unpivotr are your best bets for the foreseeable future.
Many thanks jennybc. Unfortunately I could not find an example that explains how to tidy danielsjf's simple test data from above using tidyxl and unpivotr. Am I missing something?
Just a clarification about the example in the first post, for future reference. The files that can be downloaded here (https://clients.rte-france.com/servlets/ProdGroupeServlet?annee=2016) are not Excel files but tab-delimited files with an .xls extension. Not the correct one for example about
Oh sorry, no my suggestion was general. I haven't looked into this specific challenge.
@captcoma I'd be glad to work through @danielsjf's test data with you over in the (admittedly empty) Google Group. As cderv points out, they aren't Excel files but that doesn't matter for unpivotr, and they can be made into Excel files for the sake of tidyxl.
I do care about this, but it remains out of scope for readxl. Thanks for the helpful and concrete discussion, but it's not currently on the roadmap here.
I made a long explanation of this issue for a class I teach, and provided a readxl approach to handling it, based on suggestions by @jennybc :) Basically you read the header area and the data rectangle separately. The header area is read with the merged value in the top-left cell of the area, with other previously merged cells empty. Then you transpose the headers and use fill to move the headers into those cells. Then you merge the headers and add them to the data rectangle. https://howisonlab.github.io/datawrangling/Handling_multi_indexes.html#a-tidyverse-solution
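The recipe in this comment (read header and data separately, fill the cells left empty by merging, merge the header rows, attach them to the data) can be sketched in base R roughly as follows. This is a minimal sketch and not the code from the linked tutorial: `header` and `body` are made-up stand-ins for what two separate reads would return, `fill_right` and `combine` are hypothetical helpers, and the fill runs across each header row instead of transposing first.

```r
# Stand-in for the header area: NA marks cells that were part of a
# merged range (only the top-left cell of a merge keeps its value).
header <- data.frame(
  X1 = c(NA, "Year"),
  X2 = c("Car", "Male"),
  X3 = c(NA, "Female"),
  X4 = c("Bike", "Male"),
  X5 = c(NA, "Female"),
  stringsAsFactors = FALSE
)

# Stand-in for the data rectangle, read without column names.
body <- data.frame(
  X1 = c(2015, 2016),
  X2 = c(10, 11), X3 = c(12, 13),
  X4 = c(20, 21), X5 = c(22, 23)
)

# Hypothetical helper: carry the last non-missing value to the right.
fill_right <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

# Hypothetical helper: join the non-missing header parts of one column.
combine <- function(parts) paste(parts[!is.na(parts)], collapse = "_")

# Fill each header row, then collapse each column into a single name.
filled <- t(apply(as.matrix(header), 1, fill_right))
col_names <- unname(apply(filled, 2, combine))

names(body) <- col_names
```

With a real workbook, the two blocks would presumably come from two `read_excel()` calls using different `range` values and `col_names = FALSE`; the exact ranges depend on the sheet.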
Some Excel/csv files contain multi-line headers. The question is for both readxl and readr, as they probably share some code. Could this be supported? I know tibbles don't support this, but the header rows could be concatenated with an underscore, for example. A bit similar to how tidyr::unite works for columns, but this time for rows.
It happens quite often in published Excel files. There are also a few Stack Overflow questions related to this:
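For the csv case, the underscore concatenation suggested here can be approximated today by reading the file twice, once for the header lines and once for the data. A minimal base R sketch, assuming exactly two header lines with no blank cells; the inline `csv_text` is made-up sample data, not one of the linked files.

```r
# Made-up csv content with a two-line header.
csv_text <- "Car,Car,Bike,Bike
Male,Female,Male,Female
1,2,3,4
5,6,7,8"

# First pass: only the two header lines, kept as plain text.
header <- read.csv(text = csv_text, header = FALSE, nrows = 2,
                   stringsAsFactors = FALSE)

# Second pass: the data rows, skipping both header lines.
body <- read.csv(text = csv_text, header = FALSE, skip = 2)

# Concatenate the two header rows column by column with an underscore.
names(body) <- paste(unlist(header[1, ]), unlist(header[2, ]), sep = "_")
```

For a file on disk, the same two calls would take the file path as their first argument instead of `text =`.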
https://stackoverflow.com/questions/43252489/read-excel-with-two-line-headers-in-r
https://stackoverflow.com/questions/11987103/read-csv-with-two-headers-into-a-data-frame
https://stackoverflow.com/questions/2293131/reading-in-files-with-two-header-rows
https://stackoverflow.com/questions/17797840/reading-two-line-headers-in-r
An Excel example can be found here:
https://clients.rte-france.com/servlets/ProdGroupeServlet?annee=2016
Notice that these xls files themselves won't open with the current version of readxl.