<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MarkupAsAnApi &#187; screenscraping</title>
	<atom:link href="http://www.markupasanapi.com/tag/screenscraping/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.markupasanapi.com</link>
	<description>Publish once, publish everywhere</description>
	<lastBuildDate>Mon, 07 Sep 2009 03:56:21 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Playing with YQL</title>
		<link>http://www.markupasanapi.com/2009/05/17/playing-with-yql/</link>
		<comments>http://www.markupasanapi.com/2009/05/17/playing-with-yql/#comments</comments>
		<pubDate>Sun, 17 May 2009 01:26:45 +0000</pubDate>
		<dc:creator>halans</dc:creator>
				<category><![CDATA[Javascript]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[screenscraping]]></category>
		<category><![CDATA[Yahoo!]]></category>
		<category><![CDATA[yql]]></category>

		<guid isPermaLink="false">http://www.markupasanapi.com/?p=65</guid>
		<description><![CDATA[Note: this is an edited repost from Halans.com. Have been playing with Yahoo!&#8217;s YQL this weekend, querying the Sydney Ferries website. Pretty amazing what it allows you to do, though the Sydney Ferries site wasn&#8217;t the best site to start playing with I guess. I did have a need to have the ferry timetable on [...]<p>Post from <a href="http://www.halans.com">Jean-Jacques Halans</a> <a href="http://www.markupasanapi.com">MarkupAsAnApi</a> blog.<br/><br/><a href="http://www.markupasanapi.com/2009/05/17/playing-with-yql/">Playing with YQL</a></p>
]]></description>
			<content:encoded><![CDATA[<p>Note: this is an edited repost from <a href="http://halans.com/2009/05/10/querying-the-next-sydney-ferry/">Halans.com</a>.</p>
<p>Have been playing with Yahoo!&#8217;s YQL this weekend, querying the Sydney Ferries website. Pretty amazing what it allows you to do, though the Sydney Ferries site wasn&#8217;t the best site to start playing with I guess. I did have a need to have the ferry timetable on my iPhone (especially the Neutral Bay service), so that&#8217;s why I put together <a title="Next ferry from Circular Quay" href="http://nextsydneyferry.com/">Next Sydney Ferry</a> this weekend.</p>
<p>The premise is pretty simple: when does the next ferry depart from Circular Quay? I had this wild idea to do cool stuff with it, but inspired by the simplicity of <a title="Next manly ferry" href="http://nextmanlyferry.com/">Next Manly Ferry</a>, I thought I&#8217;d start out pretty simple too. And it certainly still is a work in progress with plenty of bugs.</p>
<p>NextSydneyFerry.com parses the timetables of the SydneyFerries.info site using YQL and JSONP. No luck with any API, so it&#8217;s pretty fragile, reading in the HTML table data. I wish they had made an effort to mark up the data a bit more helpfully (as in markup-as-an-api). One of the URLs even has a typo (&#8220;weekemd&#8221;). I also noticed that YQL sometimes returns no results for a query that works at other times, so they have some bugs too. It was an interesting YQL experiment, but since the data is not too dynamic, I will probably switch to a more static datastore, which would also be a lot more responsive and lighter on your mobile data plan. This is very much a first-version application with limited functionality (it shows you <em>only</em> the next ferry and the one after it, and <em>only</em> those departing from Circular Quay), but it has some noteworthy points:</p>
<ul>
<li>It uses the server&#8217;s time, so tourists don&#8217;t have to have their mobiles set to local time.</li>
<li>It updates the time in the background without a page refresh.</li>
<li>It allows you to bookmark your daily service.</li>
</ul>
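<p>The core lookup (the next ferry and the one after it) can be sketched in a few lines of Python. This is only an illustration, not the code behind NextSydneyFerry.com: the departure times and the function name <code>next_two_departures</code> are made up, and the timetable is assumed to be already scraped into a list.</p>

```python
from bisect import bisect_right
from datetime import time

# Hypothetical timetable; these departure times are made up for illustration.
DEPARTURES = [time(7, 5), time(7, 35), time(8, 5), time(8, 35)]


def next_two_departures(now, departures):
    """Return up to two departures strictly after `now`."""
    ordered = sorted(departures)
    # bisect_right skips any departure equal to `now`, so a ferry
    # leaving this very minute is treated as already gone.
    start = bisect_right(ordered, now)
    return ordered[start:start + 2]


# `now` would come from the server clock rather than the visitor's phone.
print(next_two_departures(time(7, 30), DEPARTURES))
```

<p>After the last departure of the day the function returns an empty list, which a real page would have to handle, for example by showing the first service of the next day.</p>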
<p>Sydney Ferries allows you to repurpose their content, as long as you add their copyright notice and don&#8217;t charge for it. But they could have made it a lot easier for people to do so.</p>
<p>Post from <a href="http://www.halans.com">Jean-Jacques Halans</a> <a href="http://www.markupasanapi.com">MarkupAsAnApi</a> blog.<br/><br/><a href="http://www.markupasanapi.com/2009/05/17/playing-with-yql/">Playing with YQL</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.markupasanapi.com/2009/05/17/playing-with-yql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Markup as an API</title>
		<link>http://www.markupasanapi.com/2007/10/08/markup-as-an-api/</link>
		<comments>http://www.markupasanapi.com/2007/10/08/markup-as-an-api/#comments</comments>
		<pubDate>Mon, 08 Oct 2007 10:26:16 +0000</pubDate>
		<dc:creator>halans</dc:creator>
				<category><![CDATA[Javascript]]></category>
		<category><![CDATA[Microformats]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[screenscraping]]></category>
		<category><![CDATA[semweb]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://www.markupasanapi.com/?p=4</guid>
		<description><![CDATA[HTML describes documents, and the link between documents. We read these documents, we print them, bookmark them for later retrieval. We might copy/paste content into another document, constructing a new one. If we wanted to automate this, we&#8217;d resort to screen scraping. But this easily breaks, as there&#8217;s no standard or &#8220;contract&#8221; between the original [...]<p>Post from <a href="http://www.halans.com">Jean-Jacques Halans</a> <a href="http://www.markupasanapi.com">MarkupAsAnApi</a> blog.<br/><br/><a href="http://www.markupasanapi.com/2007/10/08/markup-as-an-api/">Markup as an API</a></p>
]]></description>
			<content:encoded><![CDATA[<p>HTML describes documents, and the link between documents.</p>
<p>
We read these documents, we print them, bookmark them for later retrieval. We might copy/paste content into another document, constructing a new one.
</p>
<p>
If we wanted to automate this, we&#8217;d resort to screen scraping. But this easily breaks, as there&#8217;s no standard or &#8220;contract&#8221; between the original site and the screen scraper.
</p>
<p>
Or we&#8217;d go for duplicating content into a new format like XML, with an agreed-upon schema. That way we could build product price aggregators using SOA Web Services with SOAP, WSDL,&#8230; Or use a REST architecture, which takes us a bit closer to our original HTTP request.
</p>
<p>
The most popular formats for sharing (XML) data are RSS and Atom, where again we duplicate the content we publish online.
</p>
<p>
But we&#8217;ve come a long way in the last couple of years towards Web Standards, pushed by organisations like the Web Standards Project (WaSP) and the Web Standards Group (WSG). They promote standards for the separation of content, styling and behaviour, and the use of semantic HTML.
</p>
<p>
And then there&#8217;s the W3C, which promotes the Semantic Web as knowledge representation, using the Resource Description Framework (RDF). RDF is a general method of modelling information by making statements about resources in the form of triples. A triple is a subject-predicate-object expression, for example JJ &#8211; isBornIn &#8211; Belgium.
</p>
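<p>To make the triple idea concrete, here is a minimal Python sketch. It is a simplification: real RDF identifies resources with URIs and is normally handled by a dedicated library. The <code>Triple</code> type, the <code>objects</code> helper and the second statement are made up for illustration.</p>

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement."""
    subject: str
    predicate: str
    obj: str


# The first statement comes from the text; the second is made up.
GRAPH = [
    Triple("JJ", "isBornIn", "Belgium"),
    Triple("Belgium", "isPartOf", "Europe"),
]


def objects(graph, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [t.obj for t in graph
            if t.subject == subject and t.predicate == predicate]


print(objects(GRAPH, "JJ", "isBornIn"))  # prints ['Belgium']
```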
<p>The W3C&#8217;s Web Ontology Language, or OWL, provides additional vocabulary and formal semantics, providing greater machine interpretability of Web content, but with added complexity.</p>
<p>But as yet there isn&#8217;t much RDF data online, and ontologies are missing for many application domains. The W3C&#8217;s projects are rather academic, and aren&#8217;t close to the typical web developer&#8217;s mindset.</p>
<p>What is closer to the web developer, though, is semantic HTML: the correct use of heading levels and paragraphs to introduce structure, of blockquotes, and of tables for tabular data.</p>
<p>Now we add rich semantics with standardised Microformats. They are small pieces of metadata embedded in the markup itself, using class names (the same hooks CSS uses). They are discoverable and can be interpreted by machines.</p>
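<p>As a rough sketch of what interpreted by machines can mean, the Python below reads two hCard properties (<code>fn</code> and <code>locality</code>) out of a snippet by class name, using only the standard library. The sample markup, the person and the helper name <code>extract_properties</code> are made up, and the parser assumes well-nested markup without void elements such as <code>br</code>.</p>

```python
from html.parser import HTMLParser

# Hypothetical hCard markup; the person and URL are placeholders.
SAMPLE = (
    '<div class="vcard">'
    '<a class="fn url" href="http://example.com/">Jane Example</a>, '
    '<span class="locality">Sydney</span>'
    '</div>'
)


class ClassTextCollector(HTMLParser):
    """Collect the text of elements that carry one of the wanted class names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.stack = []   # class names of each currently open element
        self.found = {}   # class name -> collected text

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append(classes)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        # Attribute the text to the innermost open element only.
        for name in self.stack[-1]:
            if name in self.wanted:
                self.found[name] = self.found.get(name, "") + text


def extract_properties(html, wanted):
    parser = ClassTextCollector(wanted)
    parser.feed(html)
    parser.close()
    return parser.found


print(extract_properties(SAMPLE, {"fn", "locality"}))
```

<p>The lookup depends only on the class names, so the publisher can restyle or reorder the page without breaking it.</p>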
<p>Post from <a href="http://www.halans.com">Jean-Jacques Halans</a> <a href="http://www.markupasanapi.com">MarkupAsAnApi</a> blog.<br/><br/><a href="http://www.markupasanapi.com/2007/10/08/markup-as-an-api/">Markup as an API</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.markupasanapi.com/2007/10/08/markup-as-an-api/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Scraping HTML with innerHTML or jQuery</title>
		<link>http://www.markupasanapi.com/2007/09/11/scraping-html-with-innerhtml-or-jquery/</link>
		<comments>http://www.markupasanapi.com/2007/09/11/scraping-html-with-innerhtml-or-jquery/#comments</comments>
		<pubDate>Tue, 11 Sep 2007 08:54:30 +0000</pubDate>
		<dc:creator>halans</dc:creator>
				<category><![CDATA[Javascript]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[DOM]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[innerHTML]]></category>
		<category><![CDATA[jquery]]></category>
		<category><![CDATA[screen scraping]]></category>
		<category><![CDATA[screenscraping]]></category>

		<guid isPermaLink="false">http://www.markupasanapi.com/?p=5</guid>
		<description><![CDATA[A couple of nice write-ups on how to scrape HTML using innerHTML at Pathfinder Development: A common solution has been to proxy and scrape an application with a combination of XQuery and TagSoup (to fix the ugly, broken HTML, dontcha know), but it is possible to do this purely in the browser. or with jQuery, [...]<p>Post from <a href="http://www.halans.com">Jean-Jacques Halans</a> <a href="http://www.markupasanapi.com">MarkupAsAnApi</a> blog.<br/><br/><a href="http://www.markupasanapi.com/2007/09/11/scraping-html-with-innerhtml-or-jquery/">Scraping HTML with innerHTML or jQuery</a></p>
]]></description>
			<content:encoded><![CDATA[<p>A couple of nice write-ups on how to scrape HTML using innerHTML at <a title="Pathfinder" href="http://www.pathf.com/blogs/2007/09/parsing-html-wi/">Pathfinder Development</a>:</p>
<blockquote><p>A common solution has been to proxy and scrape an application with a combination of XQuery and TagSoup (to fix the ugly, broken HTML, dontcha know), but it is possible to do this purely in the browser.</p></blockquote>
<p>or with jQuery, as <a title="Scraping with jQuery" href="http://jan.varwig.org/archiv/scraping-pages-with-jquery">Jan Varwig</a> describes:</p>
<blockquote><p>Fortunately, just the day before, I discovered <a href="http://www.jquery.com/">jQuery</a>, a Javascript framework with strong support for <a href="http://docs.jquery.com/DOM/Traversing/Selectors">finding DOM-Nodes via CSS, XPath and some custom selectors</a>. The tricky part now was to get jQuery to access the DOM-Tree of the schedule page on kino.de.</p></blockquote>
<p>Of course, screen scraping would be so much easier using Web Standards.</p>
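<p>For comparison, here is what position-based scraping of a plain table looks like outside the browser. This is a sketch with Python&#8217;s standard <code>html.parser</code>, not the innerHTML or jQuery approach from the linked write-ups; the sample table and the names <code>TableRows</code> and <code>scrape_rows</code> are made up for illustration.</p>

```python
from html.parser import HTMLParser

# Hypothetical schedule markup, made up for illustration.
SAMPLE = (
    "<table>"
    "<tr><th>Show</th><th>Starts</th></tr>"
    "<tr><td>Matinee</td><td>14:30</td></tr>"
    "<tr><td>Evening</td><td>20:15</td></tr>"
    "</table>"
)


class TableRows(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def scrape_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


for row in scrape_rows(SAMPLE):
    print(row)
```

<p>Nothing in the markup says which column holds the start time, so the caller has to rely on column position, and that is exactly what breaks when the site changes its layout.</p>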
<p>Post from <a href="http://www.halans.com">Jean-Jacques Halans</a> <a href="http://www.markupasanapi.com">MarkupAsAnApi</a> blog.<br/><br/><a href="http://www.markupasanapi.com/2007/09/11/scraping-html-with-innerhtml-or-jquery/">Scraping HTML with innerHTML or jQuery</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.markupasanapi.com/2007/09/11/scraping-html-with-innerhtml-or-jquery/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
