Title: Collecting data from htm files
Post by: Pete on December 09, 2007, 05:20:39 AM
Is it possible to automate this?
1. Open Google News in a web browser
2. Type "query" into the form
3. Press enter
4. Click the first returned URL
5. Extract the main text body, save to f:\query\body.txt
6. Extract the page URL, save to f:\query\url.txt
edit: I can get as far as opening the first page returned and outputting the whole source code to a text file, but I don't want all the HTML noise, just the article text and the page URL.
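A minimal sketch of that source-dump step, assuming the curl command-line tool is available (the query URL and output filename here are just placeholders):

curl -s "http://news.google.co.uk/news?q=query" > source.txt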
Title: Collecting data from htm files
Post by: M3ta7h3ad on December 09, 2007, 05:47:57 PM
Yes, but you'd have to code it in a programming language :)
What are you basing this on - an online or offline system?
If using PHP, just use strip_tags() to clear out the crap. :)
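A rough command-line analogue of strip_tags(), assuming the page has already been saved locally (page.html is a placeholder), is the classic sed one-liner for removing HTML tags:

sed -e :a -e 's/<[^>]*>//g;/</N;//ba' page.html > page.txt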
Title: Re: Collecting data from htm files
Post by: Pete on December 09, 2007, 07:35:13 PM
I had no idea where to start on this when I posted. It's actually a lot trickier than I first thought.
A bit of googling gave me a couple of leads:
1) Ruby Scraper http://www.igvita.com/blog/2007/02/04/ruby-screen-scraper-in-60-seconds/ - Ruby sorta makes sense, but it's given me such a headache I can't think straight. I can get the scraper to run and output the data in the Ruby console, but I don't get how you can save it to a file.
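If the scraper just prints its data to standard output, the simplest way to save it is shell redirection, with no change to the Ruby code at all (the script name is a placeholder):

ruby scraper.rb > body.txt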
2) iMacros http://www.iopus.com/index.htm - I've spent a while playing with this too, but again it's giving me headaches. At first I could only get it to output the whole source code, and I can't get a clean text file because Google News references thousands of websites and they all have different markers to use as reference points in the macros.
So I'm thinking I need to simplify what I'm doing. I'm gonna have a batch file and get iMacros to capture the CPLs (complete web pages) for a series of websites, as well as their URLs.
The idea is that a cleanly downloaded page will hopefully be more useful than the raw code, and a lot simpler than trying to filter out the correct text.
Problem then is I have lots of keywords and I'm looking to capture a whole series of articles, so I'd have to duplicate the macro for each one and then edit 194 or whatever macro files with the respective keywords.
I could do with a way of having the keywords as a variable in the macro - a %var% sorta thing, where %var% is taken from a text file. I think I can see how to do it with a VBS, but my brain just melted so I'm gonna leave it for tomorrow.
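A sketch of that keyword-as-variable idea as a shell loop rather than VBS, assuming a keywords.txt with one keyword per line and a macro template containing %var% (all filenames here are placeholders):

while read KW; do
    sed "s/%var%/$KW/g" template.iim > "macro_$KW.iim"
done < keywords.txt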
Would PHP be simpler, bearing in mind I don't know much about it?
Title: Re: Collecting data from htm files
Post by: cornet on December 09, 2007, 10:30:57 PM
I don't suppose you have a Linux box, do you?
I could write a shell script in a few minutes that would do this.
Title: Re: Collecting data from htm files
Post by: Pete on December 09, 2007, 11:15:01 PM
I will have one tomorrow :)
Will Ubuntu do?
Title: Re: Collecting data from htm files
Post by: cornet on December 10, 2007, 12:01:13 AM
Ubuntu will do fine... :)
I guess you want to see the script now, don't you... I warn you, it ain't pretty... but it works (well, at least for me). It was written and (semi-)tested in about 20 minutes.
If I were doing this properly I would probably write a Ruby or Perl script and use an RSS parsing module to help me out.
#!/bin/bash

# The program we are going to use to get the resulting URL
BROWSER="lynx -dump"

# Base URL for google search - using the RSS feed for easy parsing :)
G_URL="http://news.google.co.uk/news?hl=en&ned=uk&ie=UTF-8&output=rss&q="

# Google RSS doesn't like curl, so let's pretend we are firefox
USER_AGENT="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.3) Gecko/20010801"

# Change spaces to + signs for the search string
KEYWORDS=`echo $@ | sed -e "s/ /+/g"`

# Form the URL
URL=${G_URL}${KEYWORDS}

#
# Here goes the dirty bit :)
#
# * Get the google search rss page using curl
# * Extract the 3rd link (as the first 2 are for google)
# * Remove the link tags
# * Extract the url
#
RESULT=`curl -q -A "$USER_AGENT" $URL 2>/dev/null | grep -P -o -m1 "<link>(.*?)</link>" | head -n 3 | tail -n 1 | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | sed -e "s/.*url=//" | sed -e "s/&.*//"`

# View the result in a browser
$BROWSER "$RESULT"
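To save the article text and the page URL to files instead of viewing the result (steps 5 and 6 of the original post), the last line could be swapped for something like this (the output paths are placeholders):

$BROWSER "$RESULT" > body.txt
echo "$RESULT" > url.txt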
To run the script, save it into a file (called, say, getnews.sh) and then make it executable by doing:
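chmod +x getnews.sh

Then run it with the search keywords as arguments (they get joined with + signs by the script):

./getnews.sh some search terms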