在學習了生信大神孟浩巍的知乎Live “學習Python, 做生信”之后,對第二部分的文件信息處理部分整理了如下的筆記。
一、fasta與fastq格式的轉換
1、首先需要了解FASTA和FASTQ格式的詳解
1)具體的詳解看知乎專欄的這篇文章,寫的很詳細。
https://zhuanlan.zhihu.com/p/20714540
2)關於FASTA
A adenosine C cytidine G guanine T thymidine N A/G/C/T (any) U uridine K G/T (keto) S G/C (strong) Y T/C (pyrimidine) M A/C (amino) W A/T (weak) R G/A (purine) B G/T/C D G/A/T H A/C/T V G/C/A - gap of indeterminate length
3)關於FASTQ
2、讀取fastq文件並進行相應的格式轉化
1)主要流程如下:
2)以下是沒有設置輸出並儲存為.fa文件的代碼
1 #-*- coding: UTF-8 -*- 2 with open( 'E:\genome_file\\test.fastq','r') as input_fastq: 3 for index, line in enumerate (input_fastq): 4 if index % 4 == 0: 5 print ">" + line.strip()[1:] 6 elif index % 4 == 1: 7 for i in range (0,len(line.strip()),40): 8 print line.strip()[i:i+40] 9 elif index % 4 == 2: 10 pass 11 elif index % 4 == 3: 12 pass
3)以下是設置了輸出fasta文件的代碼:
1 out_file = open(E:\genome_file\\test.fa) 2 3 with open( 'E:\genome_file\\test.fastq','r') as input_fastq: 4 5 for index, line in enumerate (input_fastq): 6 7 if index % 4 == 0: 8 out_file.write(">" + line.strip()[1:]+"\n" ) 9 10 elif index % 4 == 1: 11 for i in range (0,len(line.strip()),40): 12 out_file.write(line.strip()[i:i + 40] +"\n") 13 14 elif index % 4 == 2: 15 pass 16 17 elif index % 4 == 3: 18 pass
4)有幾點注意的地方這里提一下:
二、讀取fasta格式文件
1、分析過程
2、簡化版的主要代碼如下:
input_file = open( 'E:\genome_file\\test.fa','r') seq = "" header = input_file.readline().strip()[1:] for line in input_file: line = line.strip() if line[0] != ">": seq = seq + line else: print header print seq header = line[1:] seq = "" #打印最后一行 print header print seq input_file.close()
3、值得注意的幾個地方:
4、下面的代碼是匯總了上面的功能,直接通過函數形式fastq-->fasta格式文件的轉換
1 def read_fasta(file_path=""): 2 """ 3 Loading FASTA file and return a iterative object 4 """ 5 6 line = "" 7 8 try: 9 fasta_handle = open(file_path,"r") 10 except: 11 raise IOError("Your input FASTA file is not right!") 12 13 # make sure the file is not empty 14 while True: 15 line = fasta_handle.readline() 16 if line == "": 17 return 18 if line[0] == ">": 19 break 20 21 # when the file is not empty, we try to load FASTA file 22 while True: 23 if line[0] != ">": 24 raise ValueError("Records in Fasta files should start with '>' character") 25 title = line[1:].rstrip() 26 lines = [] 27 line = fasta_handle.readline()#這里是讀第二行了 28 while True:#循環只用來加載序列行 29 if not line: 30 break 31 if line[0] == ">": 32 break 33 lines.append(line.rstrip()) 34 line = fasta_handle.readline() 35 36 yield title,"".join(lines).replace(" ","").replace("\r","") 37 38 if not line: 39 return 40 41 fasta_handle.close() 42 assert False, "Your input FASTA file have format problem." 43 44 for header,seq in read_fasta(file_path=r"D:\data_file\test.fa"): 45 print header 46 print seq 47 #下面是后邊的練習中以hg19為例進行操作 48 hg19_genome = {} 49 for chr_name , chr_seq in read_fasta(file_path=r"D:/data_file/hg19_only_chromosome.fa"): 50 hg19_genome[chr_name] = chr_seq
其實不太明白的是這里邊最終是返回兩個head和seq兩個值了嗎?為什么后邊直接來兩個參數就能開始for循環?
三、計算人類基因組長度信息
1、下載人類參考基因組信息(UCSC網站)
2、利用讀取fasta的代碼讀取基因組序列
1)代碼如下:
1 #前面已經定義過的read_fasta函數這里不再重復寫。 2 hg19_genome = {} 3 4 for chr_name , chr_seq in read_fasta(file_path=r"D:/data_file/hg19_only_chromosome.fa"): 5 hg19_genome[chr_name] = chr_seq
2)注意幾點
3、統計每條序列的長度,並輸出 ;
1)下面這個是一個亂序的輸出
1 for chr_name in sorted(hg19_genome.keys()): 2 print chr_name, len(hg19_genome.get(chr_name))
2)下面的代碼能夠做到按順序輸出
1 hg19_total_len = 0 2 for index in range (1,26): 3 if index <=22: 4 chr_name = "chr%s" %index 5 elif index == 23: 6 chr_name = "chrX" 7 elif index == 24: 8 chr_name = "chrY" 9 elif index == 25: 10 chr_name = "chrM" 11 print chr_name, len(hg19_genome.get(chr_name)) 12 hg19_total_len = hg19_total_len + len(hg19_genome.get(chr_name)) 13 14 print "Total chromosome length is %s" %hg19_total_len
注意一點:一般從字典里邊根據Key來獲取內容,用.get(),這樣子在對象不存在的時候就不會報錯。
3、統計人類參考基因組的總長度,並輸出
在上面已經輸出,往回看。
4、統計每條染色體N的長度,並輸出(GC含量同理)
以N為例,其他同理。使用.count指令。
1 hg19_genome['chr22'].count("N")
5、提取基因組特定區域的序列,並輸出(支持+/-鏈)
1)思路分析
2)上代碼
1 chr_name = "chr1" 2 chr_start = 10000 3 chr_end = 10100 4 #特定區域截取 5 hg19_genome[chr_name][chr_start-1:chr_end-1].upper() 6 #互補鏈 7 import string 8 translate_table = string.maketrans("ATCG","TAGC") 9 a = hg19_genome[chr_name][chr_start-1:chr_end-1].upper() 10 a.translate(translate_table) 11 12 #反向鏈 13 a.translate(translate_table)[::-1]#反向
3)定義成函數方便使用
1 def con_rev(input_seq): 2 translate_table = string.maketrans("ATCG","TAGC") 3 output_seq = input_seq.upper().translate(translate_table)[::-1] 4 return output_seq 5 6 con_rev("aaTTCCGG")
四、計算人類基因組外顯子的長度(CDS區域同理)
1、下載人類參考轉錄組(UCSC table browser)
2、整體思路
3、上代碼
1 gene_exon_len_dict = {} 2 with open(r"E:\genome_file\hg19_refseq_gene_table","r") as input_file: 3 header = input_file.readline() 4 for line in input_file: 5 line = line.strip().split("\t") 6 7 exon_count = int(line[8]) 8 exon_start = line[9].split(",")#還有更好的方法進行index 9 exon_end = line[10].split(",") 10 exon_total_len = 0 11 gene_id = line[12] 12 for index in range (exon_count): 13 exon_total_len = exon_total_len + int(exon_end[index]) - int(exon_start[index]) 14 15 if gene_exon_len_dict.get(gene_id) is None:#注意:如果直接用dict[gene_id]是會出現key error的 16 gene_exon_len_dict[gene_id] = exon_total_len 17 18 elif gene_exon_len_dict[gene_id] < exon_total_len: 19 gene_exon_len_dict[gene_id] = exon_total_len 20 21 else: 22 pass 23 24 #注意啊調試的時候要加break,這是很重要的!! 25 26 print gene_exon_len_dict 27
1 exon_start = map(int,line[9].strip(",").split(",")) 2 exon_end = map(int,line[10].strip(",").split(","))
本站转载的文章为个人学习借鉴使用,本站对版权不负任何法律责任。如果侵犯了您的隐私权益,请联系我们删除。