How can I read multiple (Excel) files into R?


I have hundreds of medium sized Excel files (between 5000 and 50.0000 rows with about 100 columns) to load into R. They have a well-defined naming pattern, like x_1.xlsx, x_2.xlsx, etc.


How can I load these files into R in the fastest, most straightforward way?


1 Answer


With list.files you can create a character vector of all the filenames in your working directory. Next, use lapply to loop over that vector and read each file with the read_excel function from the readxl package:

library(readxl)
# 'pattern' is a regular expression, not a glob: match names ending in .xlsx
file.list <- list.files(pattern = "\\.xlsx$")
df.list <- lapply(file.list, read_excel)

This method can of course also be used with other file-reading functions such as read.csv or read.table. Just replace read_excel with the appropriate function and make sure the pattern in list.files matches the file extension.
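As a minimal, self-contained sketch of the same pattern, the example below uses read.csv instead of read_excel so that it runs with base R only. The scratch directory and the three small x_*.csv files are made up for illustration; they are not part of the original question.

```r
# Create a scratch directory with three small CSV files (stand-ins for x_1, x_2, ...)
demo.dir <- file.path(tempdir(), "multi_read_demo")
dir.create(demo.dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(value = i * 1:2),
            file.path(demo.dir, paste0("x_", i, ".csv")),
            row.names = FALSE)
}

# 'pattern' is a regular expression; full.names = TRUE returns paths that
# can be read regardless of the current working directory
file.list <- list.files(demo.dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file; the result is a list with one data frame per file
df.list <- lapply(file.list, read.csv)

length(df.list)     # number of files read
str(df.list[[1]])   # structure of the first data frame
```

Note that list.files returns the names sorted alphabetically, so a file named x_10.xlsx comes before x_2.xlsx. Keep that in mind if the order of the list matters.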

If you also want to include the files in subdirectories, use:


file.list <- list.files(pattern = "\\.xlsx$", recursive = TRUE)
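To see what recursive = TRUE picks up, here is a small sketch with base R and CSV files. The directory layout (one file at the top level, one in a subdirectory called sub) is hypothetical and exists only for this demonstration.

```r
# Build a scratch directory with one file at the top level and one in a subdirectory
demo.dir <- file.path(tempdir(), "recursive_demo")
dir.create(file.path(demo.dir, "sub"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(a = 1), file.path(demo.dir, "top.csv"), row.names = FALSE)
write.csv(data.frame(a = 2), file.path(demo.dir, "sub", "nested.csv"), row.names = FALSE)

# recursive = TRUE descends into subdirectories as well
file.list <- list.files(demo.dir, pattern = "\\.csv$",
                        recursive = TRUE, full.names = TRUE)
df.list <- lapply(file.list, read.csv)

basename(file.list)   # both files are found
```

Without full.names = TRUE the returned paths are relative to the directory that was searched (for example sub/nested.csv), which is fine as long as that directory is also the working directory.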

Other packages that can read Excel files: openxlsx and xlsx.


Supposing the columns are the same for each file, you can bind them together into one dataframe with bind_rows from dplyr:

library(dplyr)
df <- bind_rows(df.list, .id = "id")

or with rbindlist from data.table:

library(data.table)
df <- rbindlist(df.list, idcol = "id")

Both have the option to add an id column for identifying the separate datasets.


Update: If you don't want a numeric identifier, use sapply with simplify = FALSE to read the files in file.list. Because file.list is a character vector, sapply names each element of the resulting list after the file it was read from:

df.list <- sapply(file.list, read_excel, simplify = FALSE)

When using bind_rows from dplyr or rbindlist from data.table, the id column now contains the filenames.

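The naming behaviour can be checked with a small base-R sketch. The two CSV files are made up for illustration, and the do.call(rbind, ...) step is only a base-R stand-in for bind_rows or rbindlist so that the example needs no extra packages.

```r
# Two small demo files in a scratch directory
demo.dir <- file.path(tempdir(), "named_list_demo")
dir.create(demo.dir, showWarnings = FALSE)
write.csv(data.frame(value = 1:2), file.path(demo.dir, "x_1.csv"), row.names = FALSE)
write.csv(data.frame(value = 3:4), file.path(demo.dir, "x_2.csv"), row.names = FALSE)

# Work inside the demo directory, as in the answer, then restore the old one
old.wd <- setwd(demo.dir)
file.list <- list.files(pattern = "\\.csv$")
df.list <- sapply(file.list, read.csv, simplify = FALSE)
setwd(old.wd)

names(df.list)   # the list elements carry the filenames

# Base-R stand-in for bind_rows(..., .id = "id") / rbindlist(..., idcol = "id"):
# prepend the element name as an id column, then stack the data frames
combined <- do.call(rbind, lapply(names(df.list), function(nm) {
  cbind(id = nm, df.list[[nm]])
}))
combined
```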

Yet another approach is to use the purrr package:

library(purrr)
file.list <- list.files(pattern = "\\.csv$")
file.list <- setNames(file.list, file.list) # only needed when you need an id-column with the file-names

df <- map_df(file.list, read.csv, .id = "id")

Other approaches to getting a named list: if you don't want just a numeric identifier, then you can assign the filenames to the dataframes in the list before you bind them together. There are several ways to do this:

# with the 'attr' function from base R
attr(df.list, "names") <- file.list
# with the 'names' function from base R
names(df.list) <- file.list
# with the 'setattr' function from the 'data.table' package
setattr(df.list, "names", file.list)

Now you can bind the list of dataframes together into one dataframe with rbindlist from data.table or bind_rows from dplyr. The id column will now contain the filenames instead of a numeric identifier.
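If the full filename is more than you want in the id column, one possible variation (not part of the original answer) is to strip the directory and the extension before assigning the names. The paths and the two small data frames below are hypothetical stand-ins for file.list and df.list.

```r
# Hypothetical stand-ins for the file paths and the data read from them
file.list <- c("data/x_1.xlsx", "data/x_2.xlsx")
df.list <- list(data.frame(value = 1:2), data.frame(value = 3:4))

# basename() drops the directory, tools::file_path_sans_ext() drops the extension
names(df.list) <- tools::file_path_sans_ext(basename(file.list))

names(df.list)   # shortened identifiers that would end up in the id column
```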

