
Using multicore in R to analyse GWAS data


I am using R to analyze genome-wide association study data. I have about 500,000 potential predictor variables (single-nucleotide polymorphisms, or SNPs) and want to test the association between each of them and a continuous outcome (in this case low-density lipoprotein concentration in the blood).

I have already written a script that does this without problem. To briefly explain, I have a data object, called "Data". Each row corresponds to a particular patient in the study. There are columns for age, gender, body mass index (BMI), and blood LDL concentration. There are also half a million other columns with the SNP data.

I am currently using a for loop to run the linear model half a million times, as shown:

# Pre-allocate a matrix to hold the results for each SNP
results <- matrix(NA, nrow = 500000, ncol = 2)

# Repeat the loop half a million times
for(i in 1:500000) {

  # Select the appropriate SNP (the i-th column of Data)
  SNP <- Data[[i]]

  # For each iteration, fit a linear regression adjusted for age, gender, and BMI
  # and save the result in an object called "GenoMod"
  GenoMod <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)

  # For each model, save the p value and effect estimate for the SNP term
  # in columns 1 and 2 of the "results" matrix
  results[i, 1] <- summary(GenoMod)$coefficients["SNP", "Pr(>|t|)"]
  results[i, 2] <- summary(GenoMod)$coefficients["SNP", "Estimate"]
}

All of that works fine. However, I would really like to speed up my analysis. I've therefore been experimenting with the multicore, doMC, and foreach packages.

My question is, could someone please help me adapt this code using the foreach scheme?

I am running the script on a Linux server that apparently has 16 cores available. I've tried experimenting with the foreach package, but my results with it have been worse: the analysis actually takes longer than with the plain for loop.

For example, I've tried saving the linear model objects as shown:

library(doMC)
registerDoMC()
results <- foreach(i = 1:500000) %dopar% {
  SNP <- Data[[i]]  # select the i-th SNP column
  lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)
}

This takes more than twice as long as a regular for loop. Any advice on how to do this better or more quickly would be appreciated! I understand that a parallel version of lapply might be an option, but I don't know how to do that either.

All the best,

Alex

1 Answer

#1


To give you a start: since you are on Linux, you can use the multicore approach provided by the parallel package. Whereas you have to set everything up explicitly when using e.g. the foreach package, that is not necessary with this approach. Your code would run on 16 cores simply by doing:

require(parallel)

mylm <- function(i){
  SNP <- Data[[i]]  # select the i-th SNP column
  GenoMod <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)
  coefs <- summary(GenoMod)$coefficients
  # return the p value and estimate for the SNP term as a vector
  c(coefs["SNP", "Pr(>|t|)"], coefs["SNP", "Estimate"])
}

Out <- mclapply(1:500000, mylm, mc.cores = 16)  # returns a list
Result <- do.call(rbind, Out)                   # bind the list into a matrix

Here you write a function that returns a vector of the wanted quantities and apply it over the indices. I couldn't check this, as I don't have access to the data, but it should work.
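
For comparison, the foreach attempt from the question can also be made much faster by returning only the two numbers of interest from each fit rather than the whole lm object: full model objects are large, and collecting half a million of them from the workers is likely a big part of why the %dopar% version was slower than the plain loop. Below is a minimal sketch along those lines, not tested against the data; it assumes, as in the question, that the first 500,000 columns of Data are the SNPs, and the worker count of 16 is an assumption to adjust for the server.

library(doMC)
registerDoMC(16)  # assumed worker count; adjust to what the server allows

results <- foreach(i = 1:500000, .combine = rbind) %dopar% {
  SNP <- Data[[i]]  # the i-th SNP column
  GenoMod <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)
  coefs <- summary(GenoMod)$coefficients
  c(coefs["SNP", "Pr(>|t|)"], coefs["SNP", "Estimate"])  # p value and estimate only
}

With .combine = rbind the results come back as a matrix directly, so no separate do.call(rbind, ...) step is needed afterwards.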

