Try Web-Harvest Manual
Author: admin
March 1, 2011
Web-Harvest is an open-source web data extraction tool written in Java. It fetches the web pages you specify and extracts the useful data from them. To do this, Web-Harvest mainly relies on technologies such as XSLT, XQuery, and regular expressions for its text and XML operations.
Personally, I think the design concept of this tool is a good one: a script specifies how the HTML is converted to XML, and XML tools are then used to extract the information. When we write web information extraction with this tool, we do not have to worry that changes in page format will break the extraction results, because the extraction is driven by a configuration script. When a page changes, we only need to modify the script, not the code.
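To make the "change the script, not the code" idea concrete, here is a minimal Java sketch. It is not Web-Harvest itself: it uses only the JDK's XPath support, and the class name `RuleDrivenExtractor`, the rule names, and the sample page are all made up for illustration. It also assumes the page has already been cleaned into well-formed XML.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: the extraction rules are data (field name -> XPath),
// so a page layout change means editing the rules, not this class.
public class RuleDrivenExtractor {

    /** Evaluates each named XPath rule against the page and returns field -> text. */
    public static Map<String, String> extract(String cleanedPage, Map<String, String> rules) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            try {
                // Parses the page and returns the string value of the first match.
                String value = xpath.evaluate(
                        rule.getValue(), new InputSource(new StringReader(cleanedPage)));
                out.put(rule.getKey(), value.trim());
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("Bad rule: " + rule.getKey(), e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up, already well-formed sample page.
        String page = "<html><body><h1>KMS</h1><p class='price'>42</p></body></html>";
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("title", "//h1");
        rules.put("price", "//p[@class='price']");
        System.out.println(extract(page, rules)); // prints {title=KMS, price=42}
    }
}
```

If the page later moved the price into a different element, only the `price` rule would need to change.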
Below, following the examples that ship with the tool, I rewrote a script that extracts Yahoo search results. Running it with Web-Harvest extracts the results that the Yahoo search engine returns for the keyword "KMS".
script xml:
KMS
http://search.yahoo.com/search?p=${search}
//big[.= ext/a/@href
//ol/li
10
let $name := data($item//div[1]/a[1])
let $src := data($item//div[1]/a[1]/@href)
let $abs := data($item//div[2])
return
{normalize-space($name)}
{normalize-space($src)}
{normalize-space($abs)}
]]>
result xml:
KMSResearch
http://rds.yahoo.com/_ylt=A0geuodL05lFpaEArQxXNyoA;_ylu=X3oDMTB2b2gzdDdtBGNvbG8DZQRsA1dTMQRwb3MDMQRzZWMDc3IEdnRpZAM-/SIG=11fph2etm/EXP=1167795403/**http%3a//www.kmshaircare.com/
Learn about each subbrand which has its own purpose and look to support your way of life, mood, or whim.
.
.
.
Summer – KMS promotional items
http://rds.yahoo.com/_ylt=A0geupZ705lFwVkAMQZXNyoA;_ylu=X3oDMTExYm1vY2p0BGNvbG8DZQRsA1dTMQRwb3MDMTAwBHNlYwNzcgR2dGlkAw--/SIG=11q4tb45p/EXP=1167795451/**http%3a//kms-fra.com/en/products/sommer/
KMS Design. Special designs. Onpacks and Inpacks … KMS presents the smallest solar charger available. … The KMS SoftFrisbee – this UFO is foldable! …
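Note that the extracted links are Yahoo redirect URLs, with the real destination appended after a `**` marker. If you want the destination itself, a small post-processing step can recover it. The sketch below is an assumption based only on the two results shown here (target after `**`, percent-encoded); the class name `YahooRedirect` is made up, and it is not part of the Web-Harvest script.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for the redirect URLs seen in the results.
public class YahooRedirect {

    /** Returns the decoded target after the "**" marker, or the input unchanged if there is none. */
    public static String targetOf(String redirectUrl) {
        int marker = redirectUrl.indexOf("**");
        if (marker < 0) {
            return redirectUrl;
        }
        try {
            // Caveat: URLDecoder also turns '+' into a space, which may not suit every target URL.
            return URLDecoder.decode(redirectUrl.substring(marker + 2), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String url = "http://rds.yahoo.com/_ylt=A0geuodL05lFpaEArQxXNyoA;"
                + "_ylu=X3oDMTB2b2gzdDdtBGNvbG8DZQRsA1dTMQRwb3MDMQRzZWMDc3IEdnRpZAM-"
                + "/SIG=11fph2etm/EXP=1167795403/**http%3a//www.kmshaircare.com/";
        System.out.println(targetOf(url)); // prints http://www.kmshaircare.com/
    }
}
```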
If you already understand XML, XPath, and XQuery, then after reading the Web-Harvest manual (http://web-harvest.sourceforge.net/manual.php) the script above should not be difficult to follow.
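For readers who know Java better than XQuery, the per-item part of the script can be approximated with the JDK's XPath API. This is only a sketch: the class name `ResultItems` and the sample markup are invented (the real Yahoo page is more complex), and XPath 1.0 quietly takes the first match where XQuery would see a sequence.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical Java counterpart of the three "let" lines in the script.
public class ResultItems {

    /** Returns one {name, src, abs} triple per //ol/li element of an already cleaned page. */
    public static List<String[]> extract(String cleanedPage) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String[]> rows = new ArrayList<>();
        try {
            NodeList items = (NodeList) xpath.evaluate(
                    "//ol/li", new InputSource(new StringReader(cleanedPage)),
                    XPathConstants.NODESET);
            for (int i = 0; i < items.getLength(); i++) {
                Node item = items.item(i);
                rows.add(new String[] {
                    // Same paths as the script, evaluated relative to the current item.
                    xpath.evaluate("normalize-space(.//div[1]/a[1])", item),
                    xpath.evaluate("normalize-space(.//div[1]/a[1]/@href)", item),
                    xpath.evaluate("normalize-space(.//div[2])", item)
                });
            }
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Invented, simplified markup standing in for a cleaned result page.
        String page = "<html><body><ol><li>"
                + "<div><a href='http://www.kmshaircare.com/'>KMSResearch</a></div>"
                + "<div>  Learn about   each subbrand </div>"
                + "</li></ol></body></html>";
        for (String[] row : extract(page)) {
            // prints KMSResearch | http://www.kmshaircare.com/ | Learn about each subbrand
            System.out.println(String.join(" | ", row));
        }
    }
}
```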
During the trial I also found some problems with Web-Harvest. For example, the way it uses tagsoup to clean HTML pages causes data loss on some pages whose markup is not well-formed (such as the Google search page). I hope the Web-Harvest developers are aware of this issue. After all, few pages today strictly follow the HTML 4.0 standard, let alone the pages that were written before XML appeared. In my view, XML technology is currently the best approach to web information extraction, and Web-Harvest has already given us a good extraction model. How to convert the large number of non-standard pages to XML without loss will be the key to whether this tool can be used in practice.
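Why cleaning is needed at all is easy to demonstrate: an XML parser rejects markup that every browser accepts, so a cleaner has to guess at repairs, and a wrong guess can drop content. The following sketch uses only the JDK's XML parser, not tagsoup or Web-Harvest, and the class name `WellFormedCheck` is made up.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical check: can this markup be read as XML without any repair?
public class WellFormedCheck {

    /** Returns true only if the markup parses as well-formed XML. */
    public static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without the parser's default stderr message.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedXml("<p>one<br>two</p>"));  // prints false
        System.out.println(isWellFormedXml("<p>one<br/>two</p>")); // prints true
    }
}
```

The first fragment is ordinary HTML, yet the unclosed `br` alone is enough to make it unusable as XML.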
Also, perhaps because my own skill is limited, when I used Web-Harvest to collect data from Chinese websites I did not manage to get a single page that was not garbled.
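Garbled Chinese text is very often an encoding mismatch: the bytes are sent in one charset and decoded in another. Whether that is the cause here is an assumption, not something confirmed in this trial. The sketch below only reproduces the effect with the JDK; the class name `CharsetMismatch` is made up, and the Chinese text is written with Unicode escapes so the source file encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of how a charset mismatch garbles Chinese text.
public class CharsetMismatch {

    /** Encodes text with one charset, decodes with another, and reports whether it survived. */
    public static boolean roundTrips(String text, Charset sentAs, Charset decodedAs) {
        byte[] wire = text.getBytes(sentAs);
        return new String(wire, decodedAs).equals(text);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // the two characters of "Chinese (language)"
        // Page sent as UTF-8 but decoded as ISO-8859-1: garbled.
        System.out.println(roundTrips(chinese, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)); // prints false
        // Same bytes decoded with the right charset: intact.
        System.out.println(roundTrips(chinese, StandardCharsets.UTF_8, StandardCharsets.UTF_8)); // prints true
    }
}
```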
This article is only meant as a starting point, in the hope that more people will take an interest in Web-Harvest. The tool has many advanced uses that I have not yet studied, and there are many areas where it could be improved. But at least it gave me one insight: fully structured, dynamic web information extraction is achievable, and it is not difficult.
References:
Web-Harvest: http://web-harvest.sourceforge.net/
XPath tutorial: http://www.zvon.org/xxl/XPathTutorial/Output_chi/introduction.html
XQuery tutorial: http://www.w3pop.com/tech/school/xquery/default.asp