Multi line headers #379

danielsjf · 2017-08-08T09:10:02Z

Some excel/csv files contain multi line headers. The question is both for readxl and readr as they probably share some code. Could this be supported? I know tibbles don't support this, but they could be concatenated with an underscore for example. A bit similar to how tidyr::unite works for columns, but this time for rows.

It happens quite often in published excels. There are also a few stack overflow questions related to this:
https://stackoverflow.com/questions/43252489/read-excel-with-two-line-headers-in-r
https://stackoverflow.com/questions/11987103/read-csv-with-two-headers-into-a-data-frame
https://stackoverflow.com/questions/2293131/reading-in-files-with-two-header-rows
https://stackoverflow.com/questions/17797840/reading-two-line-headers-in-r

An excel example can be found here:
https://clients.rte-france.com/servlets/ProdGroupeServlet?annee=2016
Notice that these xls files itself won't open with the current version of readxl.

danielsjf · 2017-08-08T09:50:01Z

To be clear, the following code would achieve the same results (code is a bit bulkier due to readxl's automatic numbering for duplicate column names). I've also converted the linked xls files to xlsx so that they can be read with readxl.

library(readxl)
data_head <- read_excel('ProductionGroupe_2016-semestre1.xlsx', n_max = 1)
data <- read_excel('ProductionGroupe_2016-semestre1.xlsx', skip = 1)
colnames(data) <- stringr::str_c(colnames(data_head) %>% 
                                   stringr::str_replace('__[:digit:]*',''), 
                                 colnames(data), 
                                 sep = '_/_')

Often the first header can also consist of merged cells. Probably this will be fixed with #355

jennybc · 2017-08-08T14:53:57Z

This is a common feature in spreadsheets 😐.

I doubt readxl itself will gain this functionality as there are external packages that specialize in it. tidyxl is on CRAN and under active development. jailbreakr is a GitHub-only project I've had a small hand in. Maybe see if either of those gets the job done.

I don't immediately see a natural readxl-ish way to expose complicated header management, but I'm happy to hear ideas.

btw read_excel() can read .xls, so no need to convert

danielsjf · 2017-08-08T17:14:38Z

The reason that I've converted the xls files was because the xls as such couldn't be read by readxl. This is probably related to another issue as the data provider is the same (#374).

I've looked at both tidyxl and jailbreakr, but they seem complicated to handle something that is indeed quite common. Just shooting some ideas here. Could it be possible to work with an index argument to say which columns (normal) or rows (new) contain information on the variables (examples below).

The reasoning is simple. Most of the tidyverse code works best when the data is in a long format (e.g. ggplot and dplyr). Most excel data (and in extension csv data) is in a wide format since this is more natural in a spreadsheet editor. Currently, the read functions of the tidyverse cover the simplest use case where there can be multiple long columns (each column header is a variable name), but maximum one wide column (this is a row really where each column header is not a variable name but an observation of a variable).

The goal of most people will be to convert this wide format to a long format. Therefore, the newly proposed index argument can be used to indicate which columns are actually long columns and which rows should be 'gathered' (to use tidyr lingo). In index would simply be a column or a row that is representing a variable.

Some example data:

		A		B
Vehicle	Sex	A1	B2	C3	D4
Car	M	1	2	3	4
Car	F	5	6	7	8
Bike	M	9	10	11	12
Bike	F	13	14	15	16

What most people probably want is this:

Vehicle	Sex	Var 1	Var 2	Value
Car	M	A	A1	1
Car	M	A	B2	2
Car	M	B	C3	3
..	..	..	..	..
Bike	F	B	D4	16

A proposition for a code sample that would extract this:

library(readxl)
path <- "example.xls"
read_excel(path,
           index = c(index_col(1:2), index_row(1:2))) # column 1 and 2 are real columns, row 1 and 2 are columns that should be gathered

A normal table can be read with read_excel(path, index = c(index_col(1:ncol(.)))).

Similarly, you can also name columns/rows with read_excel(path, index = c(index_col(Vehicle = 1, Sex = 2), index_row(Var1 = 1, Var2 = 2))).

There are a few issues to be solved. Most of them concern the conventions of the top left block (the two empty cells and the headers 'Vehicle' and 'Sex'. Another one is the interaction/duplication between col_names and index. col_types would still work if the named index rows are used.

This would also invoke some competition between readr/readxl and tidyr. However, some of the information might be lost otherwise. This would be in favor of including the tidyr gather() functionality directly in readr/readxl.

This would add some complexity to readr/readxl and it is up to you to judge if it fits within the scope of readxl. There are a few other options:

The use case does not occur often enough and add hoc solutions like my lines of code are sufficient.
Make a new package to cover this functionality that works on top of readxl/readr. However, some information might be lost in the conversion from xls(x) to tibble such as duplicate column names or column types. The advantage would be that it would work on readr and readxl simultaneously.
Develop/use a whole new package that reads the xls without depending directly on readxl. This package would be more complex (such as tidyxl) and cover the more extreme use cases. In this case, there might be some code duplication. This is already obvious in tidyxl since they refer to some of the authors of readxl. Furthermore, this could be a very complex solution given the frequency of occurrence of these problems.

It would also depend on how doable this is in the current readxl code framework.

Let me know what you think.

nacnudus · 2017-09-07T21:24:14Z

@danielsjf Since you are interested in this problem, the Idaho State election records have several interesting layouts to think about. You are right that tidyxl and unpivotr are complex. I'm interested in your ideas on how they could be made more intuitive -- feel free to comment in those repos, the Google group, or email me.

captcoma · 2017-12-19T23:34:36Z

is there any news regarding a package that allows to read excel files with multiple line headers?

jennybc · 2017-12-19T23:38:20Z

I think tidyxl and unpivotr are your best bets for the foreseeable future.

captcoma · 2017-12-20T04:22:59Z

Many thanks jennybc. Unfortunately I could not find an example that explains how to tidy danielsjf's simple test data from above using tidyxl and unpivotr. Do I miss something?

cderv · 2017-12-20T07:23:12Z

Just a precision about first post with example for future reference.

The files that can be download here (https://clients.rte-france.com/servlets/ProdGroupeServlet?annee=2016) are not excel files but tabulation delimited files with an .xls extension. Not the correct one for example about readxl.

jennybc · 2017-12-20T07:52:06Z

Unfortunately I could not find an example that explains how to tidy danielsjf's simple test data from above using tidyxl and unpivotr. Do I miss something?

Oh sorry, no my suggestion was general. I haven't looked into this specific challenge.

nacnudus · 2017-12-20T11:54:32Z

@captcoma I'd be glad to work through @danielsjf's test data with you over in the (admittedly empty) Google Group. As cderv points out, they aren't Excel files but that doesn't matter for unpivotr, and they can be made into Excel files for the sake of tidyxl.

jennybc · 2018-04-15T21:01:39Z

I do care about this, but it remains out of scope for readxl. Thanks for the helpful and concrete discussion, but its not currently on the roadmap here.

jameshowison · 2018-11-07T14:48:01Z

I made a long explanation of this issue for a class I teach, and provided a readxl approach to handling it, based on suggestions by @jennybc :) Basically you read the header area and the data rectangle separately. The header area is read with the merged value in the top-left cell of the area, with other previously merged cells empty. Then you transpose the headers and use fill to move the headers into those cells. Then you merge the headers and add them to the data rectangle.

https://howisonlab.github.io/datawrangling/Handling_multi_indexes.html#a-tidyverse-solution

jennybc closed this as completed Apr 15, 2018

brianwdavis mentioned this issue Jun 18, 2018

Collapse multiple rows of headers (feature request) #486

Closed

lock bot locked and limited conversation to collaborators Nov 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi line headers #379

Multi line headers #379

danielsjf commented Aug 8, 2017

danielsjf commented Aug 8, 2017

jennybc commented Aug 8, 2017

danielsjf commented Aug 8, 2017 •

edited

Loading

nacnudus commented Sep 7, 2017

captcoma commented Dec 19, 2017

jennybc commented Dec 19, 2017

captcoma commented Dec 20, 2017

cderv commented Dec 20, 2017

jennybc commented Dec 20, 2017

nacnudus commented Dec 20, 2017

jennybc commented Apr 15, 2018

jameshowison commented Nov 7, 2018

Multi line headers #379

Multi line headers #379

Comments

danielsjf commented Aug 8, 2017

danielsjf commented Aug 8, 2017

jennybc commented Aug 8, 2017

danielsjf commented Aug 8, 2017 • edited Loading

nacnudus commented Sep 7, 2017

captcoma commented Dec 19, 2017

jennybc commented Dec 19, 2017

captcoma commented Dec 20, 2017

cderv commented Dec 20, 2017

jennybc commented Dec 20, 2017

nacnudus commented Dec 20, 2017

jennybc commented Apr 15, 2018

jameshowison commented Nov 7, 2018

danielsjf commented Aug 8, 2017 •

edited

Loading