如何連接(合並)數據幀(內、外、左、右)?

[英]How to join (merge) data frames (inner, outer, left, right)?


Given two data frames:

給定兩個數據幀:

df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))

df1
#  CustomerId Product
#           1 Toaster
#           2 Toaster
#           3 Toaster
#           4   Radio
#           5   Radio
#           6   Radio

df2
#  CustomerId   State
#           2 Alabama
#           4 Alabama
#           6    Ohio

How can I do database style, i.e., sql style, joins? That is, how do I get:

我如何做數據庫風格,即sql風格,加入?也就是說,我如何得到:

  • An inner join of df1 and df2:
    Return only the rows in which the left table have matching keys in the right table.
  • df1和df2的內部連接:只返回左側表在右側表中匹配鍵的行。
  • An outer join of df1 and df2:
    Returns all rows from both tables, join records from the left which have matching keys in the right table.
  • df1和df2的外部連接:從兩個表中返回所有的行,從左邊的記錄中加入匹配鍵在右邊表中的記錄。
  • A left outer join (or simply left join) of df1 and df2
    Return all rows from the left table, and any rows with matching keys from the right table.
  • df1和df2的左外連接(或簡單的左連接)返回左表中的所有行,以及從右表中匹配鍵的任何行。
  • A right outer join of df1 and df2
    Return all rows from the right table, and any rows with matching keys from the left table.
  • df1和df2的右外連接將從右表返回所有行,並從左表返回任何與匹配鍵匹配的行。

Extra credit:

額外學分:

How can I do a SQL style select statement?

如何執行SQL樣式選擇語句?

12 个解决方案

#1


960  

By using the merge function and its optional parameters:

通過使用合並函數及其可選參數:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

內部連接:merge(df1, df2)將為這些示例工作,因為R會自動地通過通用變量名連接到框架,但是您很可能想要指定merge(df1, df2, by =“CustomerId”),以確保您只匹配您想要的字段。你也可以用by。x和。如果匹配變量在不同的數據幀中有不同的名稱,則y參數。

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

外部連接:merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId",全部)。x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

右外:合並(x = df1, y = df2, by = "CustomerId",全部)。y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

交叉連接:合並(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

與內部連接一樣,您可能希望顯式地將“CustomerId”傳遞給R作為匹配的變量。我認為最好是明確地聲明你想要合並的標識符;如果輸入數據,幀的變化是出乎意料的,並且以后更容易讀取,那就更安全了。

#2


171  

I would recommend checking out Gabor Grothendieck's sqldf package, which allows you to express these operations in SQL.

我建議檢查Gabor Grothendieck的sqldf包,它允許您用SQL表達這些操作。

library(sqldf)

## inner join
df3 <- sqldf("SELECT CustomerId, Product, State 
              FROM df1
              JOIN df2 USING(CustomerID)")

## left join (substitute 'right' for right join)
df4 <- sqldf("SELECT CustomerId, Product, State 
              FROM df1
              LEFT JOIN df2 USING(CustomerID)")

I find the SQL syntax to be simpler and more natural than its R equivalent (but this may just reflect my RDBMS bias).

我發現SQL語法比它的R等效(但這可能只是反映了我的RDBMS偏見)更簡單、更自然。

See Gabor's sqldf GitHub for more information on joins.

有關連接的更多信息,請參見Gabor的sqldf GitHub。

#3


156  

There is the data.table approach for an inner join, which is very time and memory efficient (and necessary for some larger data.frames):

有數據。內連接的表方法,它是非常時間和內存有效的(對於一些較大的數據來說是必需的):

library(data.table)

dt1 <- data.table(df1, key = "CustomerId") 
dt2 <- data.table(df2, key = "CustomerId")

joined.dt1.dt.2 <- dt1[dt2]

merge also works on data.tables (as it is generic and calls merge.data.table)

合並也適用於數據。表(因為它是通用的,調用merge.data.table)

merge(dt1, dt2)

data.table documented on stackoverflow:
How to do a data.table merge operation
Translating SQL joins on foreign keys to R data.table syntax
Efficient alternatives to merge for larger data.frames R
How to do a basic left outer join with data.table in R?

數據。表格記錄在stackoverflow:如何做數據。表合並操作將SQL連接轉換為R數據的外鍵。表語法有效的替代方法來合並更大的數據。框架R如何做一個基本的左外連接和數據。在R表嗎?

Yet another option is the join function found in the plyr package

另一個選項是在plyr包中找到的聯接函數。

library(plyr)

join(df1, df2,
     type = "inner")

#   CustomerId Product   State
# 1          2 Toaster Alabama
# 2          4   Radio Alabama
# 3          6   Radio    Ohio

Options for type: inner, left, right, full.

類型選擇:內、左、右、滿。

From ?join: Unlike merge, [join] preserves the order of x no matter what join type is used.

加入:與合並不同,[join]保留了x的順序,無論使用什么連接類型。

#4


131  

You can do joins as well using Hadley Wickham's awesome dplyr package.

你也可以使用Hadley Wickham的很棒的dplyr包。

library(dplyr)

#make sure that CustomerId cols are both type numeric
#they ARE not using the provided code in question and dplyr will complain
df1$CustomerId <- as.numeric(df1$CustomerId)
df2$CustomerId <- as.numeric(df2$CustomerId)

Mutating joins: add columns to df1 using matches in df2

#inner
inner_join(df1, df2)

#left outer
left_join(df1, df2)

#right outer
right_join(df1, df2)

#alternate right outer
left_join(df2, df1)

#full join
full_join(df1, df2)

Filtering joins: filter out rows in df1, don't modify columns

semi_join(df1, df2) #keep only observations in df1 that match in df2.
anti_join(df1, df2) #drops all observations in df1 that match in df2.

#5


67  

There are some good examples of doing this over at the R Wiki. I'll steal a couple here:

在R Wiki上有一些很好的例子。我在這里偷了一對:

Merge Method

合並方法

Since your keys are named the same the short way to do an inner join is merge():

由於您的密鑰被命名為相同的,所以進行內部連接的短方法是merge():

merge(df1,df2)

a full inner join (all records from both tables) can be created with the "all" keyword:

一個完整的內部連接(來自兩個表的所有記錄)可以用“all”關鍵字創建:

merge(df1,df2, all=TRUE)

a left outer join of df1 and df2:

df1和df2的左外連接:

merge(df1,df2, all.x=TRUE)

a right outer join of df1 and df2:

df1和df2的右外連接:

merge(df1,df2, all.y=TRUE)

you can flip 'em, slap 'em and rub 'em down to get the other two outer joins you asked about :)

你可以翻轉它們,拍打它們,然后把它們揉成一團,這樣你就可以得到另外兩個外部連接了。

Subscript Method

下標的方法

A left outer join with df1 on the left using a subscript method would be:

使用下標方法在左側與df1連接的左外連接將是:

df1[,"State"]<-df2[df1[ ,"Product"], "State"]

The other combination of outer joins can be created by mungling the left outer join subscript example. (yeah, I know that's the equivalent of saying "I'll leave it as an exercise for the reader...")

外部聯接的另一個組合可以通過mungling的左外連接子腳本示例創建。(是的,我知道這相當於說“我把它留給讀者做練習……”)

#6


59  

New in 2014:

新2014年:

Especially if you're also interested in data manipulation in general (including sorting, filtering, subsetting, summarizing etc.), you should definitely take a look at dplyr, which comes with a variety of functions all designed to facilitate your work specifically with data frames and certain other database types. It even offers quite an elaborate SQL interface, and even a function to convert (most) SQL code directly into R.

特別是如果您還對數據操作(包括排序、篩選、子設置、匯總等)感興趣,那么您一定要看一下dplyr,它附帶了各種功能,這些功能都是為了方便您的工作,特別是數據框架和某些其他數據庫類型。它甚至提供了相當復雜的SQL接口,甚至還提供了直接將(大多數)SQL代碼轉換成R的函數。

The four joining-related functions in the dplyr package are (to quote):

dplyr包中的四個連接功能(引用):

  • inner_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x where there are matching values in y, and all columns from x and y
  • inner_join(x, y, by = NULL, copy = FALSE,…):返回x中的所有行,其中包含y中的匹配值,以及來自x和y的所有列。
  • left_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x, and all columns from x and y
  • left_join(x, y, by = NULL, copy = FALSE,…):從x返回所有行,從x和y返回所有列。
  • semi_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x where there are matching values in y, keeping just columns from x.
  • (x, y, by = NULL, copy = FALSE,…):返回x中的所有行,其中有y的匹配值,只保留x的列。
  • anti_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x where there are not matching values in y, keeping just columns from x
  • anti_join(x, y, by = NULL, copy = FALSE,…):返回x中的所有行,其中不匹配y中的值,只保留x中的列。

It's all here in great detail.

這些都是非常詳細的。

Selecting columns can be done by select(df,"column"). If that's not SQL-ish enough for you, then there's the sql() function, into which you can enter SQL code as-is, and it will do the operation you specified just like you were writing in R all along (for more information, please refer to the dplyr/databases vignette). For example, if applied correctly, sql("SELECT * FROM hflights") will select all the columns from the "hflights" dplyr table (a "tbl").

選擇列可以通過select(df,“column”)來完成。如果這對您來說還不夠,那么還有sql()函數,您可以輸入sql代碼,它將執行您指定的操作,就像您一直在R中編寫的那樣(為了獲得更多信息,請參考dplyr/數據庫vignette)。例如,如果應用正確,sql(“SELECT * FROM hflights”)將選擇“hflights”dplyr表中的所有列(“tbl”)。

#7


52  

Update on data.table methods for joining datasets. See below examples for each type of join. There are two methods, one from [.data.table when passing second data.table as the first argument to subset, another way is to use merge function which dispatched to fast data.table method.

更新的數據。連接數據集的表方法。請參見下面的示例,以了解每種類型的連接。有兩種方法,一種來自(.data)。傳遞第二個數據時的表。表作為子集的第一個參數,另一種方法是使用發送到快速數據的合並函數。表的方法。

Update on 2016-04-01 - and it isn't April Fools joke!
In 1.9.7 version of data.table joins are now capable to use existing index which tremendously reduce the timing of a join. Below code and benchmark does NOT use data.table indices on join. If you are looking for near real-time join you should use data.table indices.

更新:2016-04-01 -這可不是愚人節玩笑!1.9.7版本的數據。表連接現在可以使用已經存在的索引,這極大地減少了連接的時間。下面的代碼和基准不使用數據。表索引加入。如果您正在尋找接近實時的連接,您應該使用數據。表索引。

df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2L, 4L, 7L), State = c(rep("Alabama", 2), rep("Ohio", 1))) # one value changed to show full outer join

library(data.table)

dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
setkey(dt1, CustomerId)
setkey(dt2, CustomerId)
# right outer join keyed data.tables
dt1[dt2]

setkey(dt1, NULL)
setkey(dt2, NULL)
# right outer join unkeyed data.tables - use `on` argument
dt1[dt2, on = "CustomerId"]

# left outer join - swap dt1 with dt2
dt2[dt1, on = "CustomerId"]

# inner join - use `nomatch` argument
dt1[dt2, nomatch=0L, on = "CustomerId"]

# anti join - use `!` operator
dt1[!dt2, on = "CustomerId"]

# inner join
merge(dt1, dt2, by = "CustomerId")

# full outer join
merge(dt1, dt2, by = "CustomerId", all = TRUE)

# see ?merge.data.table arguments for other cases

Below benchmark tests base R, sqldf, dplyr and data.table.
Benchmark tests unkeyed/unindexed datasets. You can get even better performance if you are using keys on your data.tables or indexes with sqldf. Base R and dplyr does not have indexes or keys so I did not include that scenario in benchmark.
Benchmark is performed on 5M-1 rows datasets, there are 5M-2 common values on join column so each scenario (left, right, full, inner) can be tested and join is still not trivial to perform.

在基准測試基礎上,R, sqldf, dplyr和data.table。基准測試unkeyed /去數據集取消建立索引。如果您在數據上使用鍵,則可以獲得更好的性能。使用sqldf的表或索引。Base R和dplyr沒有索引或鍵,所以我沒有在基准測試中包含該場景。基准測試是在5M-1行數據集上執行的,在join列上有5M-2個通用值,因此每個場景(左、右、滿、內)都可以進行測試,而join仍然不容易執行。

library(microbenchmark)
library(sqldf)
library(dplyr)
library(data.table)

n = 5e6
set.seed(123)
df1 = data.frame(x=sample(n,n-1L), y1=rnorm(n-1L))
df2 = data.frame(x=sample(n,n-1L), y2=rnorm(n-1L))
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)

# inner join
microbenchmark(times = 10L,
               base = merge(df1, df2, by = "x"),
               sqldf = sqldf("SELECT * FROM df1 INNER JOIN df2 ON df1.x = df2.x"),
               dplyr = inner_join(df1, df2, by = "x"),
               data.table = dt1[dt2, nomatch = 0L, on = "x"])
#Unit: milliseconds
#       expr        min         lq      mean     median        uq       max neval
#       base 15546.0097 16083.4915 16687.117 16539.0148 17388.290 18513.216    10
#      sqldf 44392.6685 44709.7128 45096.401 45067.7461 45504.376 45563.472    10
#      dplyr  4124.0068  4248.7758  4281.122  4272.3619  4342.829  4411.388    10
# data.table   937.2461   946.0227  1053.411   973.0805  1214.300  1281.958    10

# left outer join
microbenchmark(times = 10L,
               base = merge(df1, df2, by = "x", all.x = TRUE),
               sqldf = sqldf("SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.x = df2.x"),
               dplyr = left_join(df1, df2, by = c("x"="x")),
               data.table = dt2[dt1, on = "x"])
#Unit: milliseconds
#       expr       min         lq       mean     median         uq       max neval
#       base 16140.791 17107.7366 17441.9538 17414.6263 17821.9035 19453.034    10
#      sqldf 43656.633 44141.9186 44777.1872 44498.7191 45288.7406 47108.900    10
#      dplyr  4062.153  4352.8021  4780.3221  4409.1186  4450.9301  8385.050    10
# data.table   823.218   823.5557   901.0383   837.9206   883.3292  1277.239    10

# right outer join
microbenchmark(times = 10L,
               base = merge(df1, df2, by = "x", all.y = TRUE),
               sqldf = sqldf("SELECT * FROM df2 LEFT OUTER JOIN df1 ON df2.x = df1.x"),
               dplyr = right_join(df1, df2, by = "x"),
               data.table = dt1[dt2, on = "x"])
#Unit: milliseconds
#       expr        min         lq       mean     median        uq       max neval
#       base 15821.3351 15954.9927 16347.3093 16044.3500 16621.887 17604.794    10
#      sqldf 43635.5308 43761.3532 43984.3682 43969.0081 44044.461 44499.891    10
#      dplyr  3936.0329  4028.1239  4102.4167  4045.0854  4219.958  4307.350    10
# data.table   820.8535   835.9101   918.5243   887.0207  1005.721  1068.919    10

# full outer join
microbenchmark(times = 10L,
               base = merge(df1, df2, by = "x", all = TRUE),
               #sqldf = sqldf("SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.x = df2.x"), # not supported
               dplyr = full_join(df1, df2, by = "x"),
               data.table = merge(dt1, dt2, by = "x", all = TRUE))
#Unit: seconds
#       expr       min        lq      mean    median        uq       max neval
#       base 16.176423 16.908908 17.485457 17.364857 18.271790 18.626762    10
#      dplyr  7.610498  7.666426  7.745850  7.710638  7.832125  7.951426    10
# data.table  2.052590  2.130317  2.352626  2.208913  2.470721  2.951948    10

#8


25  

dplyr is very good and performant. In addition to the other answers on it, here was/is its status as of

dplyr非常好,性能也很好。除了上面的其他答案,這里是/是它的狀態。

v0.1.3 (4/2014)

v0.1.3(4/2014)

  • has inner_join, left_join, semi_join, anti_join
  • 有inner_join、left_join、sem_join、anti_join ?
  • outer_join not implemented yet, fallback is use base::merge() (or plyr::join())
  • outer_join還沒有實現,回退是使用base::merge()(或plyr::join())
  • Hadley mentioning other advantages here
  • 哈德利在這里提到了其他優勢。
  • one minor feature merge currently has that dplyr doesn't is the ability to have separate by.x,by.y columns as e.g. Python pandas does.
  • 一個小的特征合並目前有一個dplyr不是有能力分開的。x,by。y列就像Python里的熊貓一樣。
  • Implement right_join and outer_join is tagged for v0.3 (presumably at least 2015 or beyond)
  • 實現right_join和outer_join被標記為v0.3(假定至少在2015或以上)

Per hadley's comments in that issue:

每哈德利在這個問題上的評論:

  • right_join(x,y) is the same as left_join(y,x) in terms of the rows, just the columns will be different orders. Easily worked around with select(new_column_order)
  • right_join(x,y)與left_join(y,x)的行相同,只是列將是不同的順序。輕松處理select(new_column_order)
  • outer_join is basically union(left_join(x, y), right_join(x, y)) - i.e. preserve all rows in both data frames.
  • outer_join基本上是union(left_join(x, y), right_join(x, y)),即在兩個數據幀中保留所有行。

#9


18  

In joining two data frames with ~1 million rows each, one with 2 columns and the other with ~20, I've surprisingly found merge(..., all.x = TRUE, all.y = TRUE) to be faster then dplyr::full_join(). This is with dplyr v0.4

在連接兩個數據幀時,每一行有100萬行,其中一個有2個列,另一個有~20個,我意外地發現了合並(……,所有。x = TRUE。要更快,那么dplyr::full_join()。這是dplyr v0.4。

Merge takes ~17 seconds, full_join takes ~65 seconds.

合並需要17秒,full_join需要65秒。

Some food for though, since I generally default to dplyr for manipulation tasks.

一些食物,因為我通常默認為dplyr處理任務。

#10


17  

For the case of a left join with a 0..*:0..1 cardinality or a right join with a 0..1:0..* cardinality it is possible to assign in-place the unilateral columns from the joiner (the 0..1 table) directly onto the joinee (the 0..* table), and thereby avoid the creation of an entirely new table of data. This requires matching the key columns from the joinee into the joiner and indexing+ordering the joiner's rows accordingly for the assignment.

對於左連接的情況,用0。*:0。1的基數或與0. 1 .1的右連接。*基數性,可以將來自joiner的單側列(0.. ..)分配給內部。直接進入joinee (0..)*表),從而避免創建一個全新的數據表。這就要求將joinee中的鍵列與joiner和索引+匹配,並相應地排序joiner的行。

If the key is a single column, then we can use a single call to match() to do the matching. This is the case I'll cover in this answer.

如果鍵是單個列,那么我們可以使用單個調用來匹配()進行匹配。這就是我在這個答案里要講的。

Here's an example based on the OP, except I've added an extra row to df2 with an id of 7 to test the case of a non-matching key in the joiner. This is effectively df1 left join df2:

這里有一個基於OP的示例,除了我在df2中添加了一個額外的行,它的id為7,以測試joiner中的一個非匹配鍵的情況。這是有效的df1左連接df2:

df1 <- data.frame(CustomerId=1:6,Product=c(rep('Toaster',3L),rep('Radio',3L)));
df2 <- data.frame(CustomerId=c(2L,4L,6L,7L),State=c(rep('Alabama',2L),'Ohio','Texas'));
df1[names(df2)[-1L]] <- df2[match(df1[,1L],df2[,1L]),-1L];
df1;
##   CustomerId Product   State
## 1          1 Toaster    <NA>
## 2          2 Toaster Alabama
## 3          3 Toaster    <NA>
## 4          4   Radio Alabama
## 5          5   Radio    <NA>
## 6          6   Radio    Ohio

In the above I hard-coded an assumption that the key column is the first column of both input tables. I would argue that, in general, this is not an unreasonable assumption, since, if you have a data.frame with a key column, it would be strange if it had not been set up as the first column of the data.frame from the outset. And you can always reorder the columns to make it so. An advantageous consequence of this assumption is that the name of the key column does not have to be hard-coded, although I suppose it's just replacing one assumption with another. Concision is another advantage of integer indexing, as well as speed. In the benchmarks below I'll change the implementation to use string name indexing to match the competing implementations.

在上面,我硬編碼了一個假設,即鍵列是兩個輸入表的第一列。我認為,總的來說,這並不是一個不合理的假設,因為,如果您有一個具有鍵列的數據框架,那么如果它沒有被設置為數據的第一列,那么就會很奇怪。你總是可以重新排列這些列。這種假設的一個有利結果是,鍵列的名稱不必硬編碼,盡管我認為它只是將一個假設替換為另一個假設。簡潔是整數索引和速度的另一個優點。在下面的基准測試中,我將更改實現以使用字符串名稱索引來匹配競爭的實現。

I think this is a particularly appropriate solution if you have several tables that you want to left join against a single large table. Repeatedly rebuilding the entire table for each merge would be unnecessary and inefficient.

我認為這是一個特別合適的解決方案,如果您有幾個表,您想要在一個大的表上留下連接。重復地為每次合並重建整個表將是不必要的和低效的。

On the other hand, if you need the joinee to remain unaltered through this operation for whatever reason, then this solution cannot be used, since it modifies the joinee directly. Although in that case you could simply make a copy and perform the in-place assignment(s) on the copy.

另一方面,如果您需要joinee在這個操作中保持不變,不管出於什么原因,那么這個解決方案不能被使用,因為它直接修改了joinee。雖然在這種情況下,您可以簡單地復制並在副本上執行就地分配。


As a side note, I briefly looked into possible matching solutions for multicolumn keys. Unfortunately, the only matching solutions I found were:

作為一個側重點,我簡要地研究了各種各樣的密鑰的可能匹配解決方案。不幸的是,我找到的唯一匹配的解決方案是:

  • inefficient concatenations. e.g. match(interaction(df1$a,df1$b),interaction(df2$a,df2$b)), or the same idea with paste().
  • 低效率的串連。例如match(交互作用(df1$a,df1$b),交互(df2$a,df2$b)),或與paste()相同的想法。
  • inefficient cartesian conjunctions, e.g. outer(df1$a,df2$a,`==`) & outer(df1$b,df2$b,`==`).
  • 無效的笛卡兒連詞,例如:外(df1$a,df2$a, ' == ') &外(df1$b,df2$b, ' == ')。
  • base R merge() and equivalent package-based merge functions, which always allocate a new table to return the merged result, and thus are not suitable for an in-place assignment-based solution.
  • base R merge()和等價的基於包的合並函數,它總是分配一個新表來返回合並的結果,因此不適合基於位置的基於分配的解決方案。

For example, see Matching multiple columns on different data frames and getting other column as result, match two columns with two other columns, Matching on multiple columns, and the dupe of this question where I originally came up with the in-place solution, Combine two data frames with different number of rows in R.

例如,看到匹配多個列在不同的數據幀和其他列結果,匹配兩列與另外兩個列,匹配多個列,和欺騙我最初提出的這個問題就地解決方案,結合兩個不同的數據幀的行數R。


Benchmarking

I decided to do my own benchmarking to see how the in-place assignment approach compares to the other solutions that have been offered in this question.

我決定做我自己的基准測試,看看如何與這個問題中提供的其他解決方案進行比較。

Testing code:

測試代碼:

library(microbenchmark);
library(data.table);
library(sqldf);
library(plyr);
library(dplyr);

solSpecs <- list(
    merge=list(testFuncs=list(
        inner=function(df1,df2,key) merge(df1,df2,key),
        left =function(df1,df2,key) merge(df1,df2,key,all.x=T),
        right=function(df1,df2,key) merge(df1,df2,key,all.y=T),
        full =function(df1,df2,key) merge(df1,df2,key,all=T)
    )),
    data.table.unkeyed=list(argSpec='data.table.unkeyed',testFuncs=list(
        inner=function(dt1,dt2,key) dt1[dt2,on=key,nomatch=0L,allow.cartesian=T],
        left =function(dt1,dt2,key) dt2[dt1,on=key,allow.cartesian=T],
        right=function(dt1,dt2,key) dt1[dt2,on=key,allow.cartesian=T],
        full =function(dt1,dt2,key) merge(dt1,dt2,key,all=T,allow.cartesian=T) ## calls merge.data.table()
    )),
    data.table.keyed=list(argSpec='data.table.keyed',testFuncs=list(
        inner=function(dt1,dt2) dt1[dt2,nomatch=0L,allow.cartesian=T],
        left =function(dt1,dt2) dt2[dt1,allow.cartesian=T],
        right=function(dt1,dt2) dt1[dt2,allow.cartesian=T],
        full =function(dt1,dt2) merge(dt1,dt2,all=T,allow.cartesian=T) ## calls merge.data.table()
    )),
    sqldf.unindexed=list(testFuncs=list( ## note: must pass connection=NULL to avoid running against the live DB connection, which would result in collisions with the residual tables from the last query upload
        inner=function(df1,df2,key) sqldf(paste0('select * from df1 inner join df2 using(',paste(collapse=',',key),')'),connection=NULL),
        left =function(df1,df2,key) sqldf(paste0('select * from df1 left join df2 using(',paste(collapse=',',key),')'),connection=NULL),
        right=function(df1,df2,key) sqldf(paste0('select * from df2 left join df1 using(',paste(collapse=',',key),')'),connection=NULL) ## can't do right join proper, not yet supported; inverted left join is equivalent
        ##full =function(df1,df2,key) sqldf(paste0('select * from df1 full join df2 using(',paste(collapse=',',key),')'),connection=NULL) ## can't do full join proper, not yet supported; possible to hack it with a union of left joins, but too unreasonable to include in testing
    )),
    sqldf.indexed=list(testFuncs=list( ## important: requires an active DB connection with preindexed main.df1 and main.df2 ready to go; arguments are actually ignored
        inner=function(df1,df2,key) sqldf(paste0('select * from main.df1 inner join main.df2 using(',paste(collapse=',',key),')')),
        left =function(df1,df2,key) sqldf(paste0('select * from main.df1 left join main.df2 using(',paste(collapse=',',key),')')),
        right=function(df1,df2,key) sqldf(paste0('select * from main.df2 left join main.df1 using(',paste(collapse=',',key),')')) ## can't do right join proper, not yet supported; inverted left join is equivalent
        ##full =function(df1,df2,key) sqldf(paste0('select * from main.df1 full join main.df2 using(',paste(collapse=',',key),')')) ## can't do full join proper, not yet supported; possible to hack it with a union of left joins, but too unreasonable to include in testing
    )),
    plyr=list(testFuncs=list(
        inner=function(df1,df2,key) join(df1,df2,key,'inner'),
        left =function(df1,df2,key) join(df1,df2,key,'left'),
        right=function(df1,df2,key) join(df1,df2,key,'right'),
        full =function(df1,df2,key) join(df1,df2,key,'full')
    )),
    dplyr=list(testFuncs=list(
        inner=function(df1,df2,key) inner_join(df1,df2,key),
        left =function(df1,df2,key) left_join(df1,df2,key),
        right=function(df1,df2,key) right_join(df1,df2,key),
        full =function(df1,df2,key) full_join(df1,df2,key)
    )),
    in.place=list(testFuncs=list(
        left =function(df1,df2,key) { cns <- setdiff(names(df2),key); df1[cns] <- df2[match(df1[,key],df2[,key]),cns]; df1; },
        right=function(df1,df2,key) { cns <- setdiff(names(df1),key); df2[cns] <- df1[match(df2[,key],df1[,key]),cns]; df2; }
    ))
);

getSolTypes <- function() names(solSpecs);
getJoinTypes <- function() unique(unlist(lapply(solSpecs,function(x) names(x$testFuncs))));
getArgSpec <- function(argSpecs,key=NULL) if (is.null(key)) argSpecs$default else argSpecs[[key]];

initSqldf <- function() {
    sqldf(); ## creates sqlite connection on first run, cleans up and closes existing connection otherwise
    if (exists('sqldfInitFlag',envir=globalenv(),inherits=F) && sqldfInitFlag) { ## false only on first run
        sqldf(); ## creates a new connection
    } else {
        assign('sqldfInitFlag',T,envir=globalenv()); ## set to true for the one and only time
    }; ## end if
    invisible();
}; ## end initSqldf()

setUpBenchmarkCall <- function(argSpecs,joinType,solTypes=getSolTypes(),env=parent.frame()) {
    ## builds and returns a list of expressions suitable for passing to the list argument of microbenchmark(), and assigns variables to resolve symbol references in those expressions
    callExpressions <- list();
    nms <- character();
    for (solType in solTypes) {
        testFunc <- solSpecs[[solType]]$testFuncs[[joinType]];
        if (is.null(testFunc)) next; ## this join type is not defined for this solution type
        testFuncName <- paste0('tf.',solType);
        assign(testFuncName,testFunc,envir=env);
        argSpecKey <- solSpecs[[solType]]$argSpec;
        argSpec <- getArgSpec(argSpecs,argSpecKey);
        argList <- setNames(nm=names(argSpec$args),vector('list',length(argSpec$args)));
        for (i in seq_along(argSpec$args)) {
            argName <- paste0('tfa.',argSpecKey,i);
            assign(argName,argSpec$args[[i]],envir=env);
            argList[[i]] <- if (i%in%argSpec$copySpec) call('copy',as.symbol(argName)) else as.symbol(argName);
        }; ## end for
        callExpressions[[length(callExpressions)+1L]] <- do.call(call,c(list(testFuncName),argList),quote=T);
        nms[length(nms)+1L] <- solType;
    }; ## end for
    names(callExpressions) <- nms;
    callExpressions;
}; ## end setUpBenchmarkCall()

harmonize <- function(res) {
    res <- as.data.frame(res); ## coerce to data.frame
    for (ci in which(sapply(res,is.factor))) res[[ci]] <- as.character(res[[ci]]); ## coerce factor columns to character
    for (ci in which(sapply(res,is.logical))) res[[ci]] <- as.integer(res[[ci]]); ## coerce logical columns to integer (works around sqldf quirk of munging logicals to integers)
    ##for (ci in which(sapply(res,inherits,'POSIXct'))) res[[ci]] <- as.double(res[[ci]]); ## coerce POSIXct columns to double (works around sqldf quirk of losing POSIXct class) ----- POSIXct doesn't work at all in sqldf.indexed
    res <- res[order(names(res))]; ## order columns
    res <- res[do.call(order,res),]; ## order rows
    res;
}; ## end harmonize()

checkIdentical <- function(argSpecs,solTypes=getSolTypes()) {
    for (joinType in getJoinTypes()) {
        callExpressions <- setUpBenchmarkCall(argSpecs,joinType,solTypes);
        if (length(callExpressions)<2L) next;
        ex <- harmonize(eval(callExpressions[[1L]]));
        for (i in seq(2L,len=length(callExpressions)-1L)) {
            y <- harmonize(eval(callExpressions[[i]]));
            if (!isTRUE(all.equal(ex,y,check.attributes=F))) {
                ex <<- ex;
                y <<- y;
                solType <- names(callExpressions)[i];
                stop(paste0('non-identical: ',solType,' ',joinType,'.'));
            }; ## end if
        }; ## end for
    }; ## end for
    invisible();
}; ## end checkIdentical()

testJoinType <- function(argSpecs,joinType,solTypes=getSolTypes(),metric=NULL,times=100L) {
    callExpressions <- setUpBenchmarkCall(argSpecs,joinType,solTypes);
    bm <- microbenchmark(list=callExpressions,times=times);
    if (is.null(metric)) return(bm);
    bm <- summary(bm);
    res <- setNames(nm=names(callExpressions),bm[[metric]]);
    attr(res,'unit') <- attr(bm,'unit');
    res;
}; ## end testJoinType()

testAllJoinTypes <- function(argSpecs,solTypes=getSolTypes(),metric=NULL,times=100L) {
    joinTypes <- getJoinTypes();
    resList <- setNames(nm=joinTypes,lapply(joinTypes,function(joinType) testJoinType(argSpecs,joinType,solTypes,metric,times)));
    if (is.null(metric)) return(resList);
    units <- unname(unlist(lapply(resList,attr,'unit')));
    res <- do.call(data.frame,c(list(join=joinTypes),setNames(nm=solTypes,rep(list(rep(NA_real_,length(joinTypes))),length(solTypes))),list(unit=units,stringsAsFactors=F)));
    for (i in seq_along(resList)) res[i,match(names(resList[[i]]),names(res))] <- resList[[i]];
    res;
}; ## end testAllJoinTypes()

testGrid <- function(makeArgSpecsFunc,sizes,overlaps,solTypes=getSolTypes(),joinTypes=getJoinTypes(),metric='median',times=100L) {

    res <- expand.grid(size=sizes,overlap=overlaps,joinType=joinTypes,stringsAsFactors=F);
    res[solTypes] <- NA_real_;
    res$unit <- NA_character_;
    for (ri in seq_len(nrow(res))) {

        size <- res$size[ri];
        overlap <- res$overlap[ri];
        joinType <- res$joinType[ri];

        argSpecs <- makeArgSpecsFunc(size,overlap);

        checkIdentical(argSpecs,solTypes);

        cur <- testJoinType(argSpecs,joinType,solTypes,metric,times);
        res[ri,match(names(cur),names(res))] <- cur;
        res$unit[ri] <- attr(cur,'unit');

    }; ## end for

    res;

}; ## end testGrid()

Here's a benchmark of the example based on the OP that I demonstrated earlier:

下面是基於我之前演示過的OP的示例的基准:

## OP's example, supplemented with a non-matching row in df2
argSpecs <- list(
    default=list(copySpec=1:2,args=list(
        df1 <- data.frame(CustomerId=1:6,Product=c(rep('Toaster',3L),rep('Radio',3L))),
        df2 <- data.frame(CustomerId=c(2L,4L,6L,7L),State=c(rep('Alabama',2L),'Ohio','Texas')),
        'CustomerId'
    )),
    data.table.unkeyed=list(copySpec=1:2,args=list(
        as.data.table(df1),
        as.data.table(df2),
        'CustomerId'
    )),
    data.table.keyed=list(copySpec=1:2,args=list(
        setkey(as.data.table(df1),CustomerId),
        setkey(as.data.table(df2),CustomerId)
    ))
);
## prepare sqldf
initSqldf();
sqldf('create index df1_key on df1(CustomerId);'); ## upload and create an sqlite index on df1
sqldf('create index df2_key on df2(CustomerId);'); ## upload and create an sqlite index on df2

checkIdentical(argSpecs);

testAllJoinTypes(argSpecs,metric='median');
##    join    merge data.table.unkeyed data.table.keyed sqldf.unindexed sqldf.indexed      plyr    dplyr in.place         unit
## 1 inner  644.259           861.9345          923.516        9157.752      1580.390  959.2250 270.9190       NA microseconds
## 2  left  713.539           888.0205          910.045        8820.334      1529.714  968.4195 270.9185 224.3045 microseconds
## 3 right 1221.804           909.1900          923.944        8930.668      1533.135 1063.7860 269.8495 218.1035 microseconds
## 4  full 1302.203          3107.5380         3184.729              NA            NA 1593.6475 270.7055       NA microseconds

Here I benchmark on random input data, trying different scales and different patterns of key overlap between the two input tables. This benchmark is still restricted to the case of a single-column integer key. As well, to ensure that the in-place solution would work for both left and right joins of the same tables, all random test data uses 0..1:0..1 cardinality. This is implemented by sampling without replacement the key column of the first data.frame when generating the key column of the second data.frame.

這里我以隨機輸入數據為基准,嘗試不同的縮放比例和不同的鍵重疊模式。這個基准仍然局限於單列整數密鑰的情況。同樣,為了確保就地解決方案對相同表的左和右連接起作用,所有隨機測試數據都使用0..1:0..1基數。在生成第二個數據的關鍵列時,不需要替換第一個數據的關鍵列。

makeArgSpecs.singleIntegerKey.optionalOneToOne <- function(size,overlap) {

    com <- as.integer(size*overlap);

    argSpecs <- list(
        default=list(copySpec=1:2,args=list(
            df1 <- data.frame(id=sample(size),y1=rnorm(size),y2=rnorm(size)),
            df2 <- data.frame(id=sample(c(if (com>0L) sample(df1$id,com) else integer(),seq(size+1L,len=size-com))),y3=rnorm(size),y4=rnorm(size)),
            'id'
        )),
        data.table.unkeyed=list(copySpec=1:2,args=list(
            as.data.table(df1),
            as.data.table(df2),
            'id'
        )),
        data.table.keyed=list(copySpec=1:2,args=list(
            setkey(as.data.table(df1),id),
            setkey(as.data.table(df2),id)
        ))
    );
    ## prepare sqldf
    initSqldf();
    sqldf('create index df1_key on df1(id);'); ## upload and create an sqlite index on df1
    sqldf('create index df2_key on df2(id);'); ## upload and create an sqlite index on df2

    argSpecs;

}; ## end makeArgSpecs.singleIntegerKey.optionalOneToOne()

## cross of various input sizes and key overlaps
sizes <- c(1e1L,1e3L,1e6L);
overlaps <- c(0.99,0.5,0.01);
system.time({ res <- testGrid(makeArgSpecs.singleIntegerKey.optionalOneToOne,sizes,overlaps); });
##     user   system  elapsed
## 22024.65 12308.63 34493.19

I wrote some code to create log-log plots of the above results. I generated a separate plot for each overlap percentage. It's a little bit cluttered, but I like having all the solution types and join types represented in the same plot.

我編寫了一些代碼來創建上述結果的日志記錄圖。我為每個重疊百分比生成了一個單獨的圖。它有點混亂,但是我喜歡所有的解決方案類型和連接類型都在同一個圖中。

I used spline interpolation to show a smooth curve for each solution/join type combination, drawn with individual pch symbols. The join type is captured by the pch symbol, using a dot for inner, left and right angle brackets for left and right, and a diamond for full. The solution type is captured by the color as shown in the legend.

我使用樣條插值來顯示每個解決方案/連接類型組合的平滑曲線,用單個pch符號繪制。連接類型被pch符號捕獲,使用一個點用於左和右的內、左和右尖括號,以及一個完整的菱形。解決方案類型被顏色所捕獲,如圖所示。

plotRes <- function(res,titleFunc,useFloor=F) {
    solTypes <- setdiff(names(res),c('size','overlap','joinType','unit')); ## derive from res
    normMult <- c(microseconds=1e-3,milliseconds=1); ## normalize to milliseconds
    joinTypes <- getJoinTypes();
    cols <- c(merge='purple',data.table.unkeyed='blue',data.table.keyed='#00DDDD',sqldf.unindexed='brown',sqldf.indexed='orange',plyr='red',dplyr='#00BB00',in.place='magenta');
    pchs <- list(inner=20L,left='<',right='>',full=23L);
    cexs <- c(inner=0.7,left=1,right=1,full=0.7);
    NP <- 60L;
    ord <- order(decreasing=T,colMeans(res[res$size==max(res$size),solTypes],na.rm=T));
    ymajors <- data.frame(y=c(1,1e3),label=c('1ms','1s'),stringsAsFactors=F);
    for (overlap in unique(res$overlap)) {
        x1 <- res[res$overlap==overlap,];
        x1[solTypes] <- x1[solTypes]*normMult[x1$unit]; x1$unit <- NULL;
        xlim <- c(1e1,max(x1$size));
        xticks <- 10^seq(log10(xlim[1L]),log10(xlim[2L]));
        ylim <- c(1e-1,10^((if (useFloor) floor else ceiling)(log10(max(x1[solTypes],na.rm=T))))); ## use floor() to zoom in a little more, only sqldf.unindexed will break above, but xpd=NA will keep it visible
        yticks <- 10^seq(log10(ylim[1L]),log10(ylim[2L]));
        yticks.minor <- rep(yticks[-length(yticks)],each=9L)*1:9;
        plot(NA,xlim=xlim,ylim=ylim,xaxs='i',yaxs='i',axes=F,xlab='size (rows)',ylab='time (ms)',log='xy');
        abline(v=xticks,col='lightgrey');
        abline(h=yticks.minor,col='lightgrey',lty=3L);
        abline(h=yticks,col='lightgrey');
        axis(1L,xticks,parse(text=sprintf('10^%d',as.integer(log10(xticks)))));
        axis(2L,yticks,parse(text=sprintf('10^%d',as.integer(log10(yticks)))),las=1L);
        axis(4L,ymajors$y,ymajors$label,las=1L,tick=F,cex.axis=0.7,hadj=0.5);
        for (joinType in rev(joinTypes)) { ## reverse to draw full first, since it's larger and would be more obtrusive if drawn last
            x2 <- x1[x1$joinType==joinType,];
            for (solType in solTypes) {
                if (any(!is.na(x2[[solType]]))) {
                    xy <- spline(x2$size,x2[[solType]],xout=10^(seq(log10(x2$size[1L]),log10(x2$size[nrow(x2)]),len=NP)));
                    points(xy$x,xy$y,pch=pchs[[joinType]],col=cols[solType],cex=cexs[joinType],xpd=NA);
                }; ## end if
            }; ## end for
        }; ## end for
        ## custom legend
        ## due to logarithmic skew, must do all distance calcs in inches, and convert to user coords afterward
        ## the bottom-left corner of the legend will be defined in normalized figure coords, although we can convert to inches immediately
        leg.cex <- 0.7;
        leg.x.in <- grconvertX(0.275,'nfc','in');
        leg.y.in <- grconvertY(0.6,'nfc','in');
        leg.x.user <- grconvertX(leg.x.in,'in');
        leg.y.user <- grconvertY(leg.y.in,'in');
        leg.outpad.w.in <- 0.1;
        leg.outpad.h.in <- 0.1;
        leg.midpad.w.in <- 0.1;
        leg.midpad.h.in <- 0.1;
        leg.sol.w.in <- max(strwidth(solTypes,'in',leg.cex));
        leg.sol.h.in <- max(strheight(solTypes,'in',leg.cex))*1.5; ## multiplication factor for greater line height
        leg.join.w.in <- max(strheight(joinTypes,'in',leg.cex))*1.5; ## ditto
        leg.join.h.in <- max(strwidth(joinTypes,'in',leg.cex));
        leg.main.w.in <- leg.join.w.in*length(joinTypes);
        leg.main.h.in <- leg.sol.h.in*length(solTypes);
        leg.x2.user <- grconvertX(leg.x.in+leg.outpad.w.in*2+leg.main.w.in+leg.midpad.w.in+leg.sol.w.in,'in');
        leg.y2.user <- grconvertY(leg.y.in+leg.outpad.h.in*2+leg.main.h.in+leg.midpad.h.in+leg.join.h.in,'in');
        leg.cols.x.user <- grconvertX(leg.x.in+leg.outpad.w.in+leg.join.w.in*(0.5+seq(0L,length(joinTypes)-1L)),'in');
        leg.lines.y.user <- grconvertY(leg.y.in+leg.outpad.h.in+leg.main.h.in-leg.sol.h.in*(0.5+seq(0L,length(solTypes)-1L)),'in');
        leg.sol.x.user <- grconvertX(leg.x.in+leg.outpad.w.in+leg.main.w.in+leg.midpad.w.in,'in');
        leg.join.y.user <- grconvertY(leg.y.in+leg.outpad.h.in+leg.main.h.in+leg.midpad.h.in,'in');
        rect(leg.x.user,leg.y.user,leg.x2.user,leg.y2.user,col='white');
        text(leg.sol.x.user,leg.lines.y.user,solTypes[ord],cex=leg.cex,pos=4L,offset=0);
        text(leg.cols.x.user,leg.join.y.user,joinTypes,cex=leg.cex,pos=4L,offset=0,srt=90); ## srt rotation applies *after* pos/offset positioning
        for (i in seq_along(joinTypes)) {
            joinType <- joinTypes[i];
            points(rep(leg.cols.x.user[i],length(solTypes)),ifelse(colSums(!is.na(x1[x1$joinType==joinType,solTypes[ord]]))==0L,NA,leg.lines.y.user),pch=pchs[[joinType]],col=cols[solTypes[ord]]);
        }; ## end for
        title(titleFunc(overlap));
        readline(sprintf('overlap %.02f',overlap));
    }; ## end for
}; ## end plotRes()

titleFunc <- function(overlap) sprintf('R merge solutions: single-column integer key, 0..1:0..1 cardinality, %d%% overlap',as.integer(overlap*100));
plotRes(res,titleFunc,T);

R-merge-benchmark-single-column-integer-key-optional-one-to-one-99

R-merge-benchmark-single-column-integer-key-optional-one-to-one-50

R-merge-benchmark-single-column-integer-key-optional-one-to-one-1


Here's a second large-scale benchmark that's more heavy-duty, with respect to the number and types of key columns, as well as cardinality. For this benchmark I use three key columns: one character, one integer, and one logical, with no restrictions on cardinality (that is, 0..*:0..*). (In general it's not advisable to define key columns with double or complex values due to floating-point comparison complications, and basically no one ever uses the raw type, much less for key columns, so I haven't included those types in the key columns. Also, for information's sake, I initially tried to use four key columns by including a POSIXct key column, but the POSIXct type didn't play well with the sqldf.indexed solution for some reason, possibly due to floating-point comparison anomalies, so I removed it.)

這是第二個大型的基准測試,它對鍵列的數量和類型以及基數都更加嚴格。對於這個基准,我使用三個鍵列:一個字符、一個整數和一個邏輯,對基數沒有限制(即0..*:0..*)。(一般來說,由於浮點比較的復雜性,定義具有雙重或復雜值的鍵列是不可取的,而且基本上沒有人使用原始類型,更不用說對鍵列,所以我沒有將這些類型包含在鍵列中。另外,為了獲取信息,我最初嘗試使用4個鍵列,其中包括POSIXct鍵列,但是POSIXct類型與sqldf不太匹配。由於某些原因,可能是由於浮點比較異常,所以我刪除了它。

makeArgSpecs.assortedKey.optionalManyToMany <- function(size,overlap,uniquePct=75) {

    ## number of unique keys in df1
    u1Size <- as.integer(size*uniquePct/100);

    ## (roughly) divide u1Size into bases, so we can use expand.grid() to produce the required number of unique key values with repetitions within individual key columns
    ## use ceiling() to ensure we cover u1Size; will truncate afterward
    u1SizePerKeyColumn <- as.integer(ceiling(u1Size^(1/3)));

    ## generate the unique key values for df1
    keys1 <- expand.grid(stringsAsFactors=F,
        idCharacter=replicate(u1SizePerKeyColumn,paste(collapse='',sample(letters,sample(4:12,1L),T))),
        idInteger=sample(u1SizePerKeyColumn),
        idLogical=sample(c(F,T),u1SizePerKeyColumn,T)
        ##idPOSIXct=as.POSIXct('2016-01-01 00:00:00','UTC')+sample(u1SizePerKeyColumn)
    )[seq_len(u1Size),];

    ## rbind some repetitions of the unique keys; this will prepare one side of the many-to-many relationship
    ## also scramble the order afterward
    keys1 <- rbind(keys1,keys1[sample(nrow(keys1),size-u1Size,T),])[sample(size),];

    ## common and unilateral key counts
    com <- as.integer(size*overlap);
    uni <- size-com;

    ## generate some unilateral keys for df2 by synthesizing outside of the idInteger range of df1
    keys2 <- data.frame(stringsAsFactors=F,
        idCharacter=replicate(uni,paste(collapse='',sample(letters,sample(4:12,1L),T))),
        idInteger=u1SizePerKeyColumn+sample(uni),
        idLogical=sample(c(F,T),uni,T)
        ##idPOSIXct=as.POSIXct('2016-01-01 00:00:00','UTC')+u1SizePerKeyColumn+sample(uni)
    );

    ## rbind random keys from df1; this will complete the many-to-many relationship
    ## also scramble the order afterward
    keys2 <- rbind(keys2,keys1[sample(nrow(keys1),com,T),])[sample(size),];

    ##keyNames <- c('idCharacter','idInteger','idLogical','idPOSIXct');
    keyNames <- c('idCharacter','idInteger','idLogical');
    ## note: was going to use raw and complex type for two of the non-key columns, but data.table doesn't seem to fully support them
    argSpecs <- list(
        default=list(copySpec=1:2,args=list(
            df1 <- cbind(stringsAsFactors=F,keys1,y1=sample(c(F,T),size,T),y2=sample(size),y3=rnorm(size),y4=replicate(size,paste(collapse='',sample(letters,sample(4:12,1L),T)))),
            df2 <- cbind(stringsAsFactors=F,keys2,y5=sample(c(F,T),size,T),y6=sample(size),y7=rnorm(size),y8=replicate(size,paste(collapse='',sample(letters,sample(4:12,1L),T)))),
            keyNames
        )),
        data.table.unkeyed=list(copySpec=1:2,args=list(
            as.data.table(df1),
            as.data.table(df2),
            keyNames
        )),
        data.table.keyed=list(copySpec=1:2,args=list(
            setkeyv(as.data.table(df1),keyNames),
            setkeyv(as.data.table(df2),keyNames)
        ))
    );
    ## prepare sqldf
    initSqldf();
    sqldf(paste0('create index df1_key on df1(',paste(collapse=',',keyNames),');')); ## upload and create an sqlite index on df1
    sqldf(paste0('create index df2_key on df2(',paste(collapse=',',keyNames),');')); ## upload and create an sqlite index on df2

    argSpecs;

}; ## end makeArgSpecs.assortedKey.optionalManyToMany()

sizes <- c(1e1L,1e3L,1e5L); ## 1e5L instead of 1e6L to respect more heavy-duty inputs
overlaps <- c(0.99,0.5,0.01);
solTypes <- setdiff(getSolTypes(),'in.place');
system.time({ res <- testGrid(makeArgSpecs.assortedKey.optionalManyToMany,sizes,overlaps,solTypes); });
##     user   system  elapsed
## 38895.50   784.19 39745.53

The resulting plots, using the same plotting code given above:

所產生的情節,使用相同的標繪代碼:

titleFunc <- function(overlap) sprintf('R merge solutions: character/integer/logical key, 0..*:0..* cardinality, %d%% overlap',as.integer(overlap*100));
plotRes(res,titleFunc,F);

R-merge-benchmark-assorted-key-optional-many-to-many-99

R-merge-benchmark-assorted-key-optional-many-to-many-50

R-merge-benchmark-assorted-key-optional-many-to-many-1

#11


5  

For an inner join on all columns, you could also use fintersect from the data.table-package or intersect from the dplyr-package as an alternative to merge without specifying the by-columns. this will give the rows that are equal between two dataframes:

對於所有列的內部連接,您也可以使用fintersect從數據。從dplyr-package中作為一個替代合並,而不指定副列。這將給出在兩個dataframes之間相等的行:

merge(df1, df2)
#   V1 V2
# 1  B  2
# 2  C  3
dplyr::intersect(df1, df2)
#   V1 V2
# 1  B  2
# 2  C  3
data.table::fintersect(setDT(df1), setDT(df2))
#    V1 V2
# 1:  B  2
# 2:  C  3

Example data:

示例數據:

df1 <- data.frame(V1 = LETTERS[1:4], V2 = 1:4)
df2 <- data.frame(V1 = LETTERS[2:3], V2 = 2:3)

#12


5  

  1. Using merge function we can select the variable of left table or right table, same way like we all familiar with select statement in SQL (EX : Select a.* ...or Select b.* from .....)
  2. 使用合並函數,我們可以選擇左表或右表的變量,就像我們熟悉SQL中的select語句一樣(例:select a)。*……或選擇b。*從.....)
  3. We have to add extra code which will subset from the newly joined table .

    我們必須添加額外的代碼,這些代碼將來自新加入的表。

    • SQL :- select a.* from df1 a inner join df2 b on a.CustomerId=b.CustomerId

      SQL:-選擇一個。*來自df1的內部連接df2 b on . customerid =b.CustomerId。

    • R :- merge(df1, df2, by.x = "CustomerId", by.y = "CustomerId")[,names(df1)]

      R:-合並(df1, df2, by)。x =“CustomerId”。y =“CustomerId”)、名稱(df1)]

Same way

同樣的方式

  • SQL :- select b.* from df1 a inner join df2 b on a.CustomerId=b.CustomerId

    SQL:選擇b。*來自df1的內部連接df2 b on . customerid =b.CustomerId。

  • R :- merge(df1, df2, by.x = "CustomerId", by.y = "CustomerId")[,names(df2)]

    R:-合並(df1, df2, by)。x =“CustomerId”。y =“CustomerId”)、名稱(df2)]


注意!

本站翻译的文章,版权归属于本站,未经许可禁止转摘,转摘请注明本文地址:https://www.itdaan.com/blog/2009/08/19/721160ddfd2e6dca14e1b1930265112e.html



 
粤ICP备14056181号  © 2014-2021 ITdaan.com