How can I read selected rows from a large file using the R "readLines" command and write them to a data frame?


I am cleaning data. I have a function that identifies bad rows in a large input file (too big to read in one go, given my RAM) and returns the row numbers of the bad rows as a vector badRows. This function seems to work.

I am now trying to read just the bad rows into a data frame, so far unsuccessfully.

My current approach is to use read.table on an open connection to my file, using a vector of the number of rows to skip between each row that is read. This number is zero for consecutive bad rows.

I calculate skipVec as:

(badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers) - 1)])) - 1
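
As a quick check of that formula, here is a minimal sketch using a made-up badRowNumbers vector (the values are hypothetical and assumed to be sorted in increasing order). It also shows that diff() should give the same result in a single call:

```r
# Hypothetical bad-row numbers, sorted in increasing order
badRowNumbers <- c(2, 3, 7, 10)

# Formula from the question: gap to the previous bad row, minus one
skipVecLong <- (badRowNumbers -
                c(0, badRowNumbers[1:(length(badRowNumbers) - 1)])) - 1

# Same computation with diff(): lines to skip before each bad row
skipVec <- diff(c(0, badRowNumbers)) - 1
skipVec
# 1 0 3 2
```

Consecutive bad rows (2 and 3 here) give a skip of zero, as described above.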

But for the moment I am just handing my function a skipVec vector of all zeros.

If my logic is correct, this should return all the rows. It does not. Instead I get an error:

"Error in read.table(con, skip = pass, nrow = 1, header = TRUE, sep = "") : no lines available in input"

My current function is loosely based on a function by Miron Kursa ("mbq"), which I found here.

My question is somewhat duplicative of that one, but I assume his function works, so I have broken it somehow. I am still trying to understand the difference between opening a file and opening a connection to a file, and I suspect that the problem is there somewhere, or in my use of lapply.

I am running R 3.0.1 under RStudio 0.97.551 on a cranky old Windows XP SP3 machine with 3 GB of RAM. Stone Age, I know.

Here is the code that produces the error message above:

# Make a small test data frame, write it to a file, and read it back in
# a row at a time.
testThis.DF <- data.frame(nnn=c(2,3,5), fff=c("aa", "bb", "cc"))  
testThis.DF 

# This function will work only if the number of bad rows is not too big for memory
write.table(testThis.DF, "testThis.DF")
con<-file("testThis.DF")
open(con)
skipVec <- c(0,0,0)
badRows.DF  <- lapply(skipVec, FUN=function(pass){
  read.table(con, skip=pass, nrow=1, header=TRUE, sep="") })
close(con)

The error occurs before the close command. If I pull the read.table call out of the lapply and the function and run it on its own, I still get the same error.

1 Answer

#1


If instead of running read.table through lapply you just run the first few iterations manually, you will see what is going on:

> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
  nnn fff
1   2  aa
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
  X2 X3 bb
1  3  5 cc

Because header = TRUE, each call reads two lines rather than one: the first is consumed as a header and the second as the data row. You therefore run out of lines sooner than you expect, here on the third iteration:

> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
Error in read.table(con, skip = 0, nrow = 1, header = TRUE, sep = "") : 
  no lines available in input
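
On the question of opening a file versus opening a connection: the behavior above relies on the connection being open, so that each read continues where the previous one stopped. The following is a minimal sketch of that difference using readLines on a temporary file; the file contents and variable names are made up for illustration.

```r
# Write a small four-line file to a temporary location
tmp <- tempfile()
writeLines(c("header", "row1", "row2", "row3"), tmp)

# Opened connection: each read continues where the previous one stopped
con <- file(tmp)
open(con, "r")
first  <- readLines(con, n = 1)   # "header"
second <- readLines(con, n = 1)   # "row1"
rest   <- readLines(con)          # "row2" "row3"
close(con)

# Unopened connection: readLines() opens and closes it on every call,
# so each call starts again from the top of the file
con2 <- file(tmp)
again1 <- readLines(con2, n = 1)  # "header"
again2 <- readLines(con2, n = 1)  # "header"
close(con2)
unlink(tmp)
```

This is why the original code needs the explicit open(con): without it, every read.table call would start from the first line again.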

Now this might still not be a very efficient way of solving your problem, but this is how you can fix your current code:

write.table(testThis.DF, "testThis.DF")
con <- file("testThis.DF")
open(con)
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
skipVec <- c(0,1,0)
badRows <- lapply(skipVec, function(pass){
  line <- read.table(con, nrow = 1, header = FALSE, sep = "",
                     row.names = 1)
  if (pass) NULL else line
  })
badRows.DF <- setNames(do.call(rbind, badRows), header)
close(con)

Some clues towards higher speeds:

  1. Use scan instead of read.table. Read the data as character, and only at the end, once it is in a character matrix or data.frame, apply type.convert to each column.
  2. Instead of looping over skipVec, loop over its rle if that is much shorter. That way you can read or skip whole chunks of lines at a time.
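
The second clue could look roughly like the sketch below. It is not a tuned implementation: it assumes a hypothetical logical vector keep with one flag per data row (TRUE for a bad row), reads each run of rows with one readLines call, and parses the kept lines with a single read.table(text = ...) call at the end instead of the scan plus type.convert route from the first clue.

```r
# Small test file, as in the question
testThis.DF <- data.frame(nnn = c(2, 3, 5), fff = c("aa", "bb", "cc"))
tmp <- tempfile()
write.table(testThis.DF, tmp)

# Hypothetical keep/skip flags, one per data row (TRUE = bad row to keep)
keep <- c(TRUE, FALSE, TRUE)
runs <- rle(keep)

con <- file(tmp)
open(con, "r")

# The first line holds the column names written by write.table
header <- scan(text = readLines(con, n = 1), what = character(),
               quiet = TRUE)

# One readLines() call per run of consecutive kept or skipped rows
chunks <- lapply(seq_along(runs$lengths), function(i) {
  lines <- readLines(con, n = runs$lengths[i])
  if (runs$values[i]) lines else NULL
})
close(con)
unlink(tmp)

# Parse all kept lines once; the first field holds the row names
badRows.DF <- read.table(text = unlist(chunks), header = FALSE, sep = "",
                         row.names = 1, col.names = c("row", header))
badRows.DF
```

Skipped runs are still read into memory one run at a time, so for very long runs it may be worth splitting the readLines call into smaller pieces.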
