Collecting data from htm files

Started by Pete, December 09, 2007, 05:20:39 AM


Pete

Is it possible to automate this?

1.   Open Google News in a web browser
2.   Type "query" into the form
3.   Press enter
4.   Click the first returned URL
5.   Extract the main text body, save to f:\query\body.txt
6.   Extract Page URL, save to f:\query\url.txt


edit: I can get as far as opening the first page returned and outputting the whole source code to a text file, but I don't want all the HTML noise, just the article text and the page URL.
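(Roughly speaking, that step boils down to something like this - just a sketch, the query string and output file name are placeholders:)

# Fetch the Google News results page for a search and dump the raw HTML to a file.
# "query" and source.txt are placeholders.
curl -s "http://news.google.co.uk/news?hl=en&q=query" -o source.txt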

 

M3ta7h3ad

Yes, but you'd have to code it in a programming language :)

What are you basing this on? An online or an offline system?

If using PHP, just use strip_tags() to clear out the crap. :)
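Just to illustrate the strip_tags idea - a rough sketch from the command line, assuming you have the PHP CLI installed and have already saved a page to page.html (both file names are made up):

# Strip all HTML tags from a saved page and keep just the text.
php -r 'echo strip_tags(file_get_contents("page.html"));' > body.txt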

Pete

I had no idea where to start on this when I posted. It's actually a lot trickier than I first thought.

A bit of googling gave me a couple of leads:

1) Ruby Scraper
http://www.igvita.com/blog/2007/02/04/ruby-screen-scraper-in-60-seconds/
Ruby sorta makes sense, but it's given me such a headache I can't think straight. I can get the scraper to run and output the data in the Ruby console, but I don't get how you can save it to a file (see the sketch just after this list).

2) iMacros
http://www.iopus.com/index.htm
I've spent a while playing with this too, but again it's giving me headaches. At first I could only get it to output the whole source code, and I can't get a clean text file because Google News references thousands of websites and they all have different markers to use as reference points in the macros.
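(On the save-to-file question from 1) above: if the scraper writes its output to standard output, the simplest fix is probably just to redirect it when you run the script - a sketch with made-up file names:)

# Run the scraper and redirect whatever it prints into a text file.
ruby scraper.rb > body.txt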


So I'm thinking I need to simplify what I'm doing. I'm gonna have a batch file and get iMacros to capture the CPLs (complete web pages) for a series of websites, as well as their URLs.

The idea is that a cleanly downloaded page will hopefully be more useful than the raw code, and a lot simpler than trying to filter out the correct text.

Problem is, I have lots of keywords and I'm looking to capture a whole series of articles, so I'd have to do:

Batch file:

start ....\macro - keyword1
start ....\macro - keyword2
....
....
start ....\macro - keyword194
echo Done
pause


Then edit 194 or whatever macro files with the respective keywords.

I could do with a way of having the keyword as a variable - a macro - %var% sort of thing, where %var% is taken from a text file. I think I can see how to do it with a VBS, but my brain just melted so I'm gonna leave it for tomorrow.
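For comparison, on a Unix-style shell the whole keyword-from-a-text-file idea is just a loop - a rough sketch, where keywords.txt (one keyword per line) and run_macro are placeholder names for whatever actually kicks off the capture:

#!/bin/bash
# Read keywords one per line and run the capture command for each one.
# keywords.txt and run_macro are placeholders.
while read -r keyword; do
    run_macro "$keyword"
done < keywords.txt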

Would PHP be simpler? Bearing in mind I don't know much about it.

cornet

I don't suppose you have a Linux box, do you?

I could write a shell script in a few minutes that would do this.

Pete

I will have tomorrow :)

Will Ubuntu do?

cornet

Ubuntu will do fine... :)

I guess you want to see the script now, don't you... I warn you, it ain't pretty, but it works (well, at least for me). It was written and (semi-)tested in about 20 minutes.

If I were doing this properly I would probably write a Ruby or Perl script and use an RSS parsing module to help me out.


#!/bin/bash

# The program we are going to use to view the resulting URL
BROWSER="lynx -dump"

# Base URL for google search - using the RSS feed for easy parsing :)
G_URL="http://news.google.co.uk/news?hl=en&ned=uk&ie=UTF-8&output=rss&q="

# Google RSS doesn't like curl, so let's pretend we are Firefox
USER_AGENT="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.3) Gecko/20010801"

# Change spaces to + signs for search string
KEYWORDS=`echo $@ | sed -e "s/ /+/g"`

# Form the URL
URL=${G_URL}${KEYWORDS}

#
# Here goes the dirty bit :)
#
# * Get the google search rss page using curl
# * Extract the 3rd link (as first 2 are for google)
# * Remove the link tags
# * Extract the url
#
RESULT=`curl -q -A "$USER_AGENT" $URL 2>/dev/null |
grep -P -o -m1 "<link>(.*?)</link>" | head -n 3 | tail -n 1 |
sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | sed -e "s/.*url=//" | sed -e "s/&.*//"`

# View the result in a browser
$BROWSER "$RESULT"


To run this, save that lot into a file (called, say, getnews.sh), then make it executable by doing:

chmod +x getnews.sh


Then run it, passing some search words to it:

./getnews.sh cows mooing


It will then output the result of "lynx -dump", which dumps the page contents as text with the URLs listed as references after the text.
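If you want it to match the original body.txt / url.txt idea instead of opening a browser, you could swap the last line for something like this (an untested sketch; the file names are made up):

# Dump the article text and the result URL to files instead of viewing them.
lynx -dump "$RESULT" > body.txt
echo "$RESULT" > url.txt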

Modify as you see fit :)


Pete

Cheers, I'm gonna try it as soon as Ubuntu is installed. ETA at current rate: 4 months   :(

Pete

Is there any way to make MS Virtual PC boot from a virtual CD-ROM drive?

I got an Ubuntu CD, burnt it and tried installing from it, but it sorta gives up on installing at 6% and hangs there for 3 hours.

M3ta7h3ad

It should boot from an ISO image. Failing that, VMware Player will do it.

Pete

Got it :) It's "Capture ISO Image" I need to do.

Pete

2.5 hours later and it's finished installing. Shame the mouse doesn't work with it  :roll:

It doesn't like running at my TFT's native res either...

Gonna try enabling Mouse Keys, but man, this is a headache.