I have a very large data.table with two columns, and I want to apply a custom function to a particular column. The code to reproduce the problem is as follows:
require(data.table)
X <- rep("This is just random text", 1e5)
data <- data.frame(1:1e5, replicate(1, X, simplify=FALSE), stringsAsFactors=FALSE)
colnames(data) <- paste("X", seq_len(ncol(data)), sep="")
DT <- as.data.table(data)
Now we have a large data.table that looks like this:
| X1 | X2 |
|----|-------------------------|
| 1 | This is just random text|
| 2 | This is just random text|
| 3 | This is just random text|
| 4 | This is just random text|
| .. | ... |
What if I want to apply a vectorized operation to one of these columns, bearing in mind that the real data.table will be very large (roughly 100M rows)?
Take the X1 column as an example. Suppose I want to apply the following function to it:
Fun4X1 <- function(x){return(x+x*2)}
And a very complex NLP function on the X2 column, which looks something like this:
Fun4X2 <- function(x){
  require(stringr)
  return(str_split(x, " ")[[1]][1])
}
How should I go about doing this for a large dataset? Please suggest the least time-consuming approach, as my function is itself very complex.
P.S. I have tried foreach, sapply, and of course a for loop, and all are very slow even on a pretty good hardware system.
4
The approach should be no different from applying any other built-in (or package-loaded) function to a specific column in a data.table: use a list(fun(variable), otherfun(othervariable)) type of construct. You can also name the resulting columns if desired; otherwise they will be named "V1", "V2", and so on.
In other words, for your problem you can do:
DT[, list(X1 = Fun4X1(X1), X2 = Fun4X2(X2))]
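If the goal is to transform the columns of the existing table rather than build a new one, data.table's `:=` operator assigns by reference, which avoids copying a table of this size. The following is only a minimal sketch under some assumptions: `DT_small` is a small stand-in table and `first_word` is a hypothetical vectorized helper, neither of which comes from the question.

```r
library(data.table)

Fun4X1 <- function(x) x + x * 2

# Hypothetical vectorized helper: first space-separated word of each element
first_word <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(y) y[1], character(1))
}

# Small stand-in for the large table (assumption, for illustration only)
DT_small <- data.table(X1 = 1:4, X2 = c("a b c", "d e f", "g h i", "j k l"))

# Replace both columns in place; the table itself is not copied
DT_small[, c("X1", "X2") := list(Fun4X1(X1), first_word(X2))]
print(DT_small)
```

Because the assignment has no row subset, each column is replaced wholesale, so X1 may change type (integer to double here).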
I suspect, however, that a lot of your slowdown might be due to the functions you are actually using. Compare the following slight refinements:
Fun4X2.old <- function(x){
  require(stringr)
  return(str_split(x, " ")[[1]][1])
}
Fun4X2.new1 <- function(x) {
  vapply(strsplit(x, " "),
         function(y) y[1], character(1))
}
Fun4X2.new2 <- function(x) {
  vapply(strsplit(x, " ", fixed=TRUE),
         function(y) y[1], character(1))
}
Fun4X2.sub <- function(x) sub("(.+?) .*", "\\1", x)
X <- rep("This is just random text", 1e5)
system.time(out1 <- Fun4X2.old(X))
# user system elapsed
# 18.838 0.000 18.659
system.time(out2 <- Fun4X2.new1(X))
# user system elapsed
# 0.000 0.000 0.944
system.time(out3 <- Fun4X2.new2(X))
# user system elapsed
# 1.584 0.000 0.270
system.time(out4 <- Fun4X2.sub(X))
# user system elapsed
# 0.000 0.000 0.222
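Before swapping in one of the faster variants, it is worth confirming that it returns the same values on input that resembles your data. A small sketch using only base R; the sample strings are made up for illustration:

```r
Fun4X2.new2 <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(y) y[1], character(1))
}
Fun4X2.sub <- function(x) sub("(.+?) .*", "\\1", x)

# Made-up sample input with a different first word in each element
sample_text <- c("This is just random text", "another line here", "single")

res_split <- Fun4X2.new2(sample_text)
res_sub   <- Fun4X2.sub(sample_text)
print(res_split)

# Stops with an error if the two variants disagree
stopifnot(identical(res_split, res_sub))
```

Edge cases such as empty strings or leading spaces may be handled differently by the split-based and regex-based variants, so a check like this is best run on a sample of the real column.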
One last note, regarding your comment here:
@AnandaMahto I am looking for something similar to this, but if I use your solution then the output on the text column is not vectorized, and I get the same output even if I have different text in each row
Incidentally, your original Fun4X2() (renamed Fun4X2.old() above) exhibits the same behavior.
DT2 <- data.table(X1 = 1:4, X2 = c("a b c", "d e f", "g h i", "j k l"))
DT2[, list(Fun4X1(X1), Fun4X2.old(X2))]
# V1 V2
# 1: 3 a
# 2: 6 a
# 3: 9 a
# 4: 12 a
DT2[, list(Fun4X1(X1), Fun4X2.new1(X2))]
# V1 V2
# 1: 3 a
# 2: 6 d
# 3: 9 g
# 4: 12 j
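The cause is the `[[1]]` in Fun4X2.old(): the split returns one list element per input string, and `[[1]]` keeps only the pieces of the first string, so a single value comes back and is recycled down the column. A short sketch of the same effect using base strsplit(), which returns the same kind of list:

```r
x <- c("a b c", "d e f", "g h i", "j k l")
pieces <- strsplit(x, " ", fixed = TRUE)

n_pieces   <- length(pieces)   # one list element per input string
first_only <- pieces[[1]][1]   # first word of the first string only
all_first  <- vapply(pieces, function(y) y[1], character(1))  # first word of every string

print(n_pieces)
print(first_only)
print(all_first)
```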
1
Check out the snowfall package (http://cran.r-project.org/web/packages/snowfall/snowfall.pdf) for parallel computing. You can set up a local cluster and use all of your cores. I've found that using sfApply from this package cuts most of my computing times by about 5x (I have an 8-core machine, so ideally it would be 8 times faster, but there is obviously the cost of loading the data into the cluster and collecting it at the end).
e.g.
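install.packages('snowfall')
require(snowfall)
sfInit( parallel=TRUE, cpus=4 )
sfExport(list=c('DT','Fun4X1','Fun4X2'))
# Function names are case-sensitive: Fun4X1/Fun4X2, not fun4X1/fun4X2.
# Row-wise apply coerces each row to character, so X1 is converted back to numeric.
res <- sfApply(DT,1,function(X) return(c(Fun4X1(as.numeric(X[1])),Fun4X2(X[2]))))
sfStop()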
On my machine, this takes 25.07 sec with apply and 9.11 sec with sfApply.
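Row-wise application calls the function once per row, which is costly in itself. A different sketch, not the snowfall approach shown above: split the column into a few chunks and run a vectorized function on each chunk in parallel with the base parallel package. The worker count of 2, the sample data, and the helper name `first_word` are assumptions for illustration.

```r
library(parallel)

# Hypothetical vectorized helper (base R only, so workers need no extra packages)
first_word <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(y) y[1], character(1))
}

# Made-up sample column
X <- rep(c("This is just random text", "another line of text"), 5000)

cl <- makeCluster(2)                                  # assumed: 2 local workers
idx <- splitIndices(length(X), 2)                     # contiguous index chunks, in order
parts <- lapply(idx, function(i) X[i])                # one chunk of the column per worker
pieces <- parLapply(cl, parts, first_word)            # vectorized call per chunk
stopCluster(cl)

# Chunks come back in order, so concatenating restores the original row order
res <- unlist(pieces, use.names = FALSE)
print(head(res))
```

Whether this beats a single vectorized call depends on how expensive the function is relative to the cost of shipping the data to the workers.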
1
You can use the fast, vectorized function sub for the second problem:
Fun4X2 <- function(x) sub("(.+?) .*", "\\1", x)
head(Fun4X2(DT[,X2]))
# [1] "This" "This" "This" "This" "This" "This"