Top Handbags, Shoes

Trying Out Web-Harvest

Author: admin

March 1, 2011
Web-Harvest is an open-source web data extraction tool written in Java. It fetches the web pages you specify and extracts the useful data from them. Web-Harvest relies mainly on technologies such as XSLT, XQuery, and regular expressions to operate on text and XML.
Personally, I think the design concept of this tool is a good one: a script specifies how HTML is converted into XML, and an XML parser is then used to extract the information. With a web information extraction tool built this way, we do not need to worry that changes in page format will affect the extraction results, because the information is collected through a script that is configured for the page. When the page changes, we only need to modify the script, not the code.
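
The idea of keeping extraction rules in a script rather than in code can be sketched outside Web-Harvest. The following Python snippet is only an illustration, not how Web-Harvest itself is implemented: the `RULES` dictionary, the `extract` helper, and the sample page are all hypothetical, and the page is assumed to be already cleaned into well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already cleaned into well-formed XML.
PAGE = """
<html><body>
  <div class="product"><h2>Blue kettle</h2><span class="price">19.90</span></div>
  <div class="product"><h2>Red teapot</h2><span class="price">24.50</span></div>
</body></html>
"""

# Extraction rules live in data, not in code. If the page layout changes,
# only these path expressions need to be edited.
RULES = {
    "item": ".//div[@class='product']",
    "fields": {
        "name": "h2",
        "price": "span[@class='price']",
    },
}


def extract(xml_text, rules):
    """Apply the rule set to an XML document and return a list of dicts."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root.findall(rules["item"]):
        record = {}
        for field, path in rules["fields"].items():
            # findtext returns the text of the first match, or the default.
            record[field] = (node.findtext(path, default="") or "").strip()
        records.append(record)
    return records


if __name__ == "__main__":
    for row in extract(PAGE, RULES):
        print(row)
```

If the site later switched its product titles from `h2` to `h3`, only the `"name"` entry in `RULES` would need to change; `extract` stays as it is.
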
Below, following the examples that ship with the tool, I rewrote a script that extracts Yahoo search results. Running it with Web-Harvest extracts the results that the Yahoo search engine returns for the keyword "KMS".
Script XML:

KMS

http://search.yahoo.com/search?p=${search}

//big[.="Next"]/a/@href
//ol/li
10

let $name := data($item//div[1]/a[1])
let $src := data($item//div[1]/a[1]/@href)
let $abs := data($item//div[2])
return

{normalize-space($name)}
{normalize-space($src)}
{normalize-space($abs)}

]]>

Result XML:

KMSResearch
http://rds.yahoo.com/_ylt=A0geuodL05lFpaEArQxXNyoA;_ylu=X3oDMTB2b2gzdDdtBGNvbG8DZQRsA1dTMQRwb3MDMQRzZWMDc3IEdnRpZAM-/SIG=11fph2etm/EXP=1167795403/**http%3a//www.kmshaircare.com/
Learn about each subbrand which has its own purpose and look to support your way of life, mood, or whim.

.
.
.

Summer – KMS promotional items
http://rds.yahoo.com/_ylt=A0geupZ705lFwVkAMQZXNyoA;_ylu=X3oDMTExYm1vY2p0BGNvbG8DZQRsA1dTMQRwb3MDMTAwBHNlYwNzcgR2dGlkAw--/SIG=11q4tb45p/EXP=1167795451/**http%3a//kms-fra.com/en/products/sommer/
KMS Design. Special designs. Onpacks and Inpacks … KMS presents the smallest solar charger available. … The KMS SoftFrisbee – this UFO is foldable! …

If you understand XML, XPath, and XQuery and have read the Web-Harvest manual (http://web-harvest.sourceforge.net/manual.php), I believe the script above should not be difficult to follow.
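
For readers less familiar with XQuery, the per-item expression in the script can be approximated in Python. This is a rough sketch, not what Web-Harvest actually runs: the sample markup, the `normalize_space` and `extract_items` helpers, and the output keys `name`, `src`, and `abs` (borrowed from the script's variable names) are assumptions, and the input must already be well-formed XML.

```python
import xml.etree.ElementTree as ET
from string import Template
from urllib.parse import quote_plus

# Same URL pattern as the script; ${search} is filled in from a variable.
URL_TEMPLATE = Template("http://search.yahoo.com/search?p=${search}")

# Hypothetical, already-cleaned result list shaped like the //ol/li items.
SAMPLE = """
<ol>
  <li>
    <div><a href="http://example.com/a">  First
       result </a></div>
    <div>Summary   of the first result.</div>
  </li>
  <li>
    <div><a href="http://example.com/b">Second result</a></div>
    <div>Summary of the second result.</div>
  </li>
</ol>
"""


def normalize_space(text):
    """Rough equivalent of XPath normalize-space(): trim and collapse whitespace."""
    return " ".join((text or "").split())


def extract_items(xml_text):
    """Return one dict per <li>, mirroring the three let-bindings in the script."""
    root = ET.fromstring(xml_text)
    items = []
    for li in root.iter("li"):
        # Simplification: direct children only, where the script uses //div.
        link = li.find("div[1]/a[1]")
        summary = li.find("div[2]")
        items.append({
            "name": normalize_space("".join(link.itertext())) if link is not None else "",
            "src": normalize_space(link.get("href")) if link is not None else "",
            "abs": normalize_space("".join(summary.itertext())) if summary is not None else "",
        })
    return items


if __name__ == "__main__":
    print(URL_TEMPLATE.substitute(search=quote_plus("KMS")))
    for item in extract_items(SAMPLE):
        print(item)
```

The sketch handles a single page only; in the script, the next-page XPath and the limit of 10 control how many result pages are followed.
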
During the trial I also found some problems with Web-Harvest. For example, the tagsoup cleaning it applies to HTML pages causes data loss on some pages whose markup is not standard (the Google search page is one). I hope the Web-Harvest developers become aware of this issue; after all, few pages today strictly follow the HTML 4.0 specification, to say nothing of the many pages that predate XML. In my view, using XML technology is currently the best approach to web information extraction, and Web-Harvest has given us a good extraction model. How to convert the large number of non-standard pages into XML without loss will be the key to whether this tool can be used in practice.
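
The difficulty with non-standard markup can be seen in a small Python sketch. It does not use TagSoup or Web-Harvest; the `TAG_SOUP` sample, `parses_as_xml`, `ListItemCollector`, and `collect_items` are hypothetical names introduced only to show that a strict XML parser rejects typical tag soup, so some lenient cleaning step has to come first, and that step is where data can be lost.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical tag soup: unclosed <li> and <br>, unquoted attribute value.
TAG_SOUP = "<ul><li>first<br><li class=x>second</ul>"


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class ListItemCollector(HTMLParser):
    """Lenient parser that collects the text of each <li>, closed or not."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # A new <li> implicitly ends the previous one.
            self.items.append("")
            self._in_item = True

    def handle_endtag(self, tag):
        if tag in ("li", "ul", "ol"):
            self._in_item = False

    def handle_data(self, data):
        if self._in_item:
            self.items[-1] += data


def collect_items(html_text):
    """Feed possibly malformed HTML to the lenient parser and return item texts."""
    parser = ListItemCollector()
    parser.feed(html_text)
    parser.close()
    return [item.strip() for item in parser.items]


if __name__ == "__main__":
    print(parses_as_xml(TAG_SOUP))
    print(collect_items(TAG_SOUP))
```

A cleaner that guesses differently about where an unclosed element ends would produce a different tree, which is one way content can go missing before any XPath expression is evaluated.
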
Also, because my own skill is limited, I have not yet managed to fetch a Chinese website with Web-Harvest without getting garbled characters.
This article is meant as a starting point, in the hope that more people will pay attention to Web-Harvest. It has many advanced features that I have not yet studied, and there are many areas where it could be improved. But at least it showed me that fully structured, dynamic web information extraction is achievable, and not difficult.
References:
Web-Harvest: http://web-harvest.sourceforge.net/
XPath tutorial: http://www.zvon.org/xxl/XPathTutorial/Output_chi/introduction.html
XQuery tutorial: http://www.w3pop.com/tech/school/xquery/default.asp

Tags: subbrand

This entry was posted on Sunday, December 18th, 2011 at 8:13 pm and is filed under Fashion.
