How to show corpus text in the R tm package?


I'm completely new to R and the tm package, so please excuse my stupid question ;-) How can I show the text of a plain text corpus in the tm package?

I've loaded 323 plain text files into a corpus:

library(tm)
src <- DirSource("Korpora/technologie")
corpus <- Corpus(src)

But when I call the corpus with:

corpus[[1]]

I always get some output like this instead of the corpus text itself:

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 144
Content:  chars: 141
Content:  chars: 224
Content:  chars: 75
Content:  chars: 105

How can I show the text of the corpus?

Thanks!

UPDATE: Reproducible sample. I've tried it with the built-in sample text:

> data("crude")
> crude
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20
> crude[1]
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1
> crude[[1]]
<<PlainTextDocument>>
Metadata:  15
Content:  chars: 527

How can I print the text of the documents?

UPDATE 2: Session Info:

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tm_0.6-1  NLP_0.1-7

loaded via a namespace (and not attached):
[1] parallel_3.1.3 slam_0.1-32    tools_3.1.3   

7 answers

#1


10  

You can try converting your corpus text into a data frame and accessing the required text from the data frame itself. I have used the built-in sample data "crude" (from the tm package) as an example.

library(tm)
data("crude")
dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = FALSE)

dataframe[1,]
[1] "Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude oil by\n1.50 dlrs a barrel.\n    The reduction brings its posted price for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n    \"The price reduction today was made in the light of falling\noil product prices and a weak crude oil market,\" a company\nspokeswoman said.\n    Diamond is the latest in a line of U.S. oil companies that\nhave cut its contract, or posted, prices over the last two days\nciting weak oil markets.\n Reuter"
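
A variation on the same idea, offered as a sketch rather than as the answerer's method: build the data frame with `as.character()` on each document and keep the document id in its own column. The column names `doc_id` and `text` and the variable names are my own choices, and the code assumes tm 0.6 or later, where `meta(doc, "id")` returns the id of a single document.

```r
library(tm)
data("crude")

# one string per document; collapse guards against multi-line content
texts <- vapply(crude, function(d) paste(as.character(d), collapse = "\n"),
                character(1))
# document ids as stored in each document's metadata
ids <- vapply(crude, function(d) as.character(meta(d, "id")), character(1))

df <- data.frame(doc_id = unname(ids), text = unname(texts),
                 stringsAsFactors = FALSE)

# show the first document with real line breaks instead of "\n" escapes
writeLines(df$text[1])
```

Printing with `writeLines()` rather than indexing the data frame directly avoids the escaped `\n` sequences visible in the output above.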

#2


35  

This works for me to print the content text, with the latest version of tm:

corpus[[1]]$content

Note: This is more or less what Ricky suggested in the previous comment. Sorry, I wanted to write a comment, but my rep is only 25 (a minimum of 50 rep is needed to comment).
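
For what it is worth, here is a sketch of two accessor-based equivalents of `$content`. It assumes tm 0.6 or later, where `content()` (a generic from the NLP package, which tm attaches) and `as.character()` both have methods for a PlainTextDocument. The variable names are mine, and the built-in `crude` data stands in for your own corpus.

```r
library(tm)
data("crude")

doc <- crude[[1]]

# both calls return the document text as a character vector
txt_accessor <- content(doc)
txt_char <- as.character(doc)

# print with real line breaks instead of "\n" escapes
writeLines(txt_char)
```

Going through the accessors instead of `$` should keep the code working if the internal layout of the document object changes.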

#3


7  

Here is a simple and direct way to display the text of a corpus:

strwrap(corpus[[1]])

For the crude data this will output

[1] "Diamond Shamrock Corp said that effective today it had cut its contract"      
[2] "prices for crude oil by 1.50 dlrs a barrel.  The reduction brings its posted" 
[3] "price for West Texas Intermediate to 16.00 dlrs a barrel, the copany said."   
[4] "\"The price reduction today was made in the light of falling oil product"     
[5] "prices and a weak crude oil market,\" a company spokeswoman said.  Diamond is"
[6] "the latest in a line of U.S. oil companies that have cut its contract, or"    
[7] "posted, prices over the last two days citing weak oil markets.  Reuter"

#4


3  

I can confirm that as of tm 0.6-1, inspect does not pretty-print the text. You can pair it with the qdap package, which I maintain, to convert the corpus easily to a data.frame as follows:

library(qdap)
as.data.frame(crude)

To make it more like the old inspect behavior you can use:

as.data.frame(crude) %>%
    with(., invisible(sapply(text, function(x) {strWrap(x); cat("\n\n")})))

This looks like:

Diamond Shamrock Corp said that effective today it had cut its
contract prices for crude oil by 1.50 dlrs a barrel. The reduction
brings its posted price for West Texas Intermediate to 16.00 dlrs a
barrel, the copany said. "The price reduction today was made in the
light of falling oil product prices and a weak crude oil market," a
company spokeswoman said. Diamond is the latest in a line of U.S. oil
companies that have cut its contract, or posted, prices over the last
two days citing weak oil markets. Reuter


OPEC may be forced to meet before a scheduled June session to
readdress its production cutting agreement if the organization wants
to halt the current slide in oil prices, oil industry analysts said.
"The movement to higher oil prices was never to be as easy as OPEC
thought. They may need an emergency meeting to sort out the
problems," said Daniel Yergin, director of Cambridge Energy Research
Associates, CERA. Analysts and oil industry sources said the problem
OPEC faces is excess oil supply in world oil markets. "OPEC's problem
is not a price problem but a production issue and must be addressed
in that way," said Paul Mlotok, oil analyst with Salomon Brothers
Inc. He said the market's earlier optimism about OPE
.
.
.

#5


1  

From the tm Vignette, this works:

writeLines(as.character(doc.corpus[[8]]))

Where '8' is the index of whichever document you wish to print.
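
The same call can be wrapped in a loop to print every document in turn. This is only a sketch: it uses the built-in `crude` data in place of `doc.corpus`, and it assumes tm 0.6 or later, where `meta(doc, "id")` returns the id of a single document. The separator format is my own choice.

```r
library(tm)
data("crude")

# print each document, preceded by its id and followed by a blank line
for (i in seq_along(crude)) {
  cat("---", meta(crude[[i]], "id"), "---\n")
  writeLines(as.character(crude[[i]]))
  cat("\n")
}
```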

#6


1  

We can get the content of every item in the corpus.

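library(tm)
data("crude")
# extract the text of every document
out <- sapply(crude, function(x) x$content)
out

# optionally export: writeCorpus() expects the corpus itself and
# writes one file per document into an existing directory
dir.create("outputdir", showWarnings = FALSE)
writeCorpus(crude, path = "outputdir")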

#7


0  

> inspect(crude[1])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

$`reut-00001.xml`
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter
