Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi line headers #379

Closed
danielsjf opened this issue Aug 8, 2017 · 12 comments
Closed

Multi line headers #379

danielsjf opened this issue Aug 8, 2017 · 12 comments

Comments

@danielsjf
Copy link

Some excel/csv files contain multi line headers. The question is both for readxl and readr as they probably share some code. Could this be supported? I know tibbles don't support this, but they could be concatenated with an underscore for example. A bit similar to how tidyr::unite works for columns, but this time for rows.

It happens quite often in published excels. There are also a few stack overflow questions related to this:
https://stackoverflow.com/questions/43252489/read-excel-with-two-line-headers-in-r
https://stackoverflow.com/questions/11987103/read-csv-with-two-headers-into-a-data-frame
https://stackoverflow.com/questions/2293131/reading-in-files-with-two-header-rows
https://stackoverflow.com/questions/17797840/reading-two-line-headers-in-r

An excel example can be found here:
https://clients.rte-france.com/servlets/ProdGroupeServlet?annee=2016
Notice that these xls files itself won't open with the current version of readxl.

@danielsjf
Copy link
Author

To be clear, the following code would achieve the same results (code is a bit bulkier due to readxl's automatic numbering for duplicate column names). I've also converted the linked xls files to xlsx so that they can be read with readxl.

library(readxl)
data_head <- read_excel('ProductionGroupe_2016-semestre1.xlsx', n_max = 1)
data <- read_excel('ProductionGroupe_2016-semestre1.xlsx', skip = 1)
colnames(data) <- stringr::str_c(colnames(data_head) %>% 
                                   stringr::str_replace('__[:digit:]*',''), 
                                 colnames(data), 
                                 sep = '_/_')

Often the first header can also consist of merged cells. Probably this will be fixed with #355

@jennybc
Copy link
Member

jennybc commented Aug 8, 2017

This is a common feature in spreadsheets 😐.

I doubt readxl itself will gain this functionality as there are external packages that specialize in it. tidyxl is on CRAN and under active development. jailbreakr is a GitHub-only project I've had a small hand in. Maybe see if either of those gets the job done.

I don't immediately see a natural readxl-ish way to expose complicated header management, but I'm happy to hear ideas.

btw read_excel() can read .xls, so no need to convert

@danielsjf
Copy link
Author

danielsjf commented Aug 8, 2017

The reason that I've converted the xls files was because the xls as such couldn't be read by readxl. This is probably related to another issue as the data provider is the same (#374).

I've looked at both tidyxl and jailbreakr, but they seem complicated to handle something that is indeed quite common. Just shooting some ideas here. Could it be possible to work with an index argument to say which columns (normal) or rows (new) contain information on the variables (examples below).

The reasoning is simple. Most of the tidyverse code works best when the data is in a long format (e.g. ggplot and dplyr). Most excel data (and in extension csv data) is in a wide format since this is more natural in a spreadsheet editor. Currently, the read functions of the tidyverse cover the simplest use case where there can be multiple long columns (each column header is a variable name), but maximum one wide column (this is a row really where each column header is not a variable name but an observation of a variable).

The goal of most people will be to convert this wide format to a long format. Therefore, the newly proposed index argument can be used to indicate which columns are actually long columns and which rows should be 'gathered' (to use tidyr lingo). In index would simply be a column or a row that is representing a variable.

Some example data:

A B
Vehicle Sex A1 B2 C3 D4
Car M 1 2 3 4
Car F 5 6 7 8
Bike M 9 10 11 12
Bike F 13 14 15 16

What most people probably want is this:

Vehicle Sex Var 1 Var 2 Value
Car M A A1 1
Car M A B2 2
Car M B C3 3
.. .. .. .. ..
Bike F B D4 16

A proposition for a code sample that would extract this:

library(readxl)
path <- "example.xls"
read_excel(path,
           index = c(index_col(1:2), index_row(1:2))) # column 1 and 2 are real columns, row 1 and 2 are columns that should be gathered

A normal table can be read with read_excel(path, index = c(index_col(1:ncol(.)))).

Similarly, you can also name columns/rows with read_excel(path, index = c(index_col(Vehicle = 1, Sex = 2), index_row(Var1 = 1, Var2 = 2))).

There are a few issues to be solved. Most of them concern the conventions of the top left block (the two empty cells and the headers 'Vehicle' and 'Sex'. Another one is the interaction/duplication between col_names and index. col_types would still work if the named index rows are used.

This would also invoke some competition between readr/readxl and tidyr. However, some of the information might be lost otherwise. This would be in favor of including the tidyr gather() functionality directly in readr/readxl.

This would add some complexity to readr/readxl and it is up to you to judge if it fits within the scope of readxl. There are a few other options:

  • The use case does not occur often enough and add hoc solutions like my lines of code are sufficient.
  • Make a new package to cover this functionality that works on top of readxl/readr. However, some information might be lost in the conversion from xls(x) to tibble such as duplicate column names or column types. The advantage would be that it would work on readr and readxl simultaneously.
  • Develop/use a whole new package that reads the xls without depending directly on readxl. This package would be more complex (such as tidyxl) and cover the more extreme use cases. In this case, there might be some code duplication. This is already obvious in tidyxl since they refer to some of the authors of readxl. Furthermore, this could be a very complex solution given the frequency of occurrence of these problems.

It would also depend on how doable this is in the current readxl code framework.

Let me know what you think.

@nacnudus
Copy link
Contributor

nacnudus commented Sep 7, 2017

@danielsjf Since you are interested in this problem, the Idaho State election records have several interesting layouts to think about. You are right that tidyxl and unpivotr are complex. I'm interested in your ideas on how they could be made more intuitive -- feel free to comment in those repos, the Google group, or email me.

@captcoma
Copy link

is there any news regarding a package that allows to read excel files with multiple line headers?

@jennybc
Copy link
Member

jennybc commented Dec 19, 2017

I think tidyxl and unpivotr are your best bets for the foreseeable future.

@captcoma
Copy link

Many thanks jennybc. Unfortunately I could not find an example that explains how to tidy danielsjf's simple test data from above using tidyxl and unpivotr. Do I miss something?

@cderv
Copy link

cderv commented Dec 20, 2017

Just a precision about first post with example for future reference.

The files that can be download here (https://clients.rte-france.com/servlets/ProdGroupeServlet?annee=2016) are not excel files but tabulation delimited files with an .xls extension. Not the correct one for example about readxl.

@jennybc
Copy link
Member

jennybc commented Dec 20, 2017

Unfortunately I could not find an example that explains how to tidy danielsjf's simple test data from above using tidyxl and unpivotr. Do I miss something?

Oh sorry, no my suggestion was general. I haven't looked into this specific challenge.

@nacnudus
Copy link
Contributor

@captcoma I'd be glad to work through @danielsjf's test data with you over in the (admittedly empty) Google Group. As cderv points out, they aren't Excel files but that doesn't matter for unpivotr, and they can be made into Excel files for the sake of tidyxl.

@jennybc
Copy link
Member

jennybc commented Apr 15, 2018

I do care about this, but it remains out of scope for readxl. Thanks for the helpful and concrete discussion, but its not currently on the roadmap here.

@jameshowison
Copy link

I made a long explanation of this issue for a class I teach, and provided a readxl approach to handling it, based on suggestions by @jennybc :) Basically you read the header area and the data rectangle separately. The header area is read with the merged value in the top-left cell of the area, with other previously merged cells empty. Then you transpose the headers and use fill to move the headers into those cells. Then you merge the headers and add them to the data rectangle.

https://howisonlab.github.io/datawrangling/Handling_multi_indexes.html#a-tidyverse-solution

@lock lock bot locked and limited conversation to collaborators Nov 7, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants