How do I programmatically inspect an HTML document?


I have a database full of small HTML documents and I need to programmatically insert several of them into, say, a PDF document with iText or a Word document with Aspose.Words. I need to preserve any formatting within the HTML documents (within reason: honouring <b> tags is a must; CSS like <span style="blah"> is a nice-to-have).

Both iText and Aspose work (roughly) along these lines:

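// Pseudocode: a rough shape of both APIs, not the exact calls of either library
Document document = new Document( Size.A4, Aspect.PORTRAIT );

document.setFont( "Helvetica", 20, Font.BOLD );
document.insert( "some string" );
// Switch bold on before inserting the text it should apply to
document.setBold( true );
document.insert( "A bold string" );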

Therefore (I think) I need some kind of HTML parser that I can inspect for strings and styles to insert into my document.

Can anybody suggest a good library or a sensible approach to this problem? The platform is Java.

5 answers

#1


2  

HTMLparser is a good HTML parser.

I have used this to parse HTML on one of my projects.

You can write your own filters to pick out whatever you need from the HTML, so the <br> tag shouldn't be difficult to parse out.

You can parse out CSS using the CssSelectorNodeFilter.

#2


1  

If the HTML is "well-formed XML" (XHTML), why not use an XML parser (such as Xerces) and then inspect the DOM tree programmatically?
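
As a rough illustration of that idea, here is a minimal sketch that uses the XML parser bundled with the JDK (javax.xml.parsers) rather than a separately installed Xerces. It walks the DOM and collects each piece of text together with a flag saying whether it sat inside a <b> or <strong> element. The class name XhtmlRuns, the Run type and the extract method are made up for this example, and it assumes the stored fragments really are well-formed XML with a single root element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of iText, Aspose.Words or Xerces.
public class XhtmlRuns {

    /** One piece of text plus the formatting in effect where it appeared. */
    public static final class Run {
        public final String text;
        public final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }

        @Override
        public String toString() {
            return (bold ? "[bold]" : "[plain]") + text;
        }
    }

    /** Parses a well-formed XHTML fragment and returns its text runs in document order. */
    public static List<Run> extract(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            org.w3c.dom.Document dom = builder.parse(new InputSource(new StringReader(xhtml)));
            List<Run> runs = new ArrayList<>();
            walk(dom.getDocumentElement(), false, runs);
            return runs;
        } catch (Exception e) {
            // Assumption: the caller treats unparseable input as a bad argument.
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    // Depth-first walk that carries the inherited "bold" state down the tree.
    private static void walk(Node node, boolean bold, List<Run> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue();
            if (!text.trim().isEmpty()) {
                out.add(new Run(text, bold)); // skip whitespace-only nodes
            }
            return;
        }
        String name = node.getNodeName();
        boolean nowBold = bold || "b".equalsIgnoreCase(name) || "strong".equalsIgnoreCase(name);
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, nowBold, out);
        }
    }

    public static void main(String[] args) {
        for (Run run : extract("<p>some string <b>A bold string</b></p>")) {
            System.out.println(run);
        }
    }
}
```

Each Run could then be turned into the matching setBold / insert calls on the target document. Note that a plain XML parser rejects HTML-only constructs such as unclosed <br> tags or named entities like &nbsp;, so this only suits input that is genuinely XHTML.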

#3


0  

Adobe Acrobat Pro allows you to grab sites via HTTP and does an excellent job of preserving the style and layout. I haven't used it from an API aspect, but it may be worth looking into.

#4


0  

You'd probably be better off getting a component that goes directly from HTML to PDF or Word than trying to parse the HTML document and duplicate the formatting yourself. If you want to convert HTML to PDF and you use .NET, Winnovative provides a good solution.

#5


0  

Check out the Flying Saucer XHTML renderer: it renders well-formed XHTML files to PDF and lets you control the output using CSS.

