Title: Collecting data from htm files
Post by: Pete on December 09, 2007, 05:20:39 AM
Is it possible to automate this?
1. Open Google News in a web browser
2. Type "query" into the form
3. Press enter
4. Click the first returned URL
5. Extract the main text body, save to f:\query\body.txt
6. Extract the page URL, save to f:\query\url.txt
edit: I can get as far as opening the first page returned and outputting the whole source code to a text file, but I don't want all the HTML noise, just the article text and the page URL.
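A minimal sketch of that source-dump step, assuming the curl command-line tool is available (the query URL and output filename here are just placeholders):

curl -s "http://news.google.co.uk/news?q=query" > source.txt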
Title: Collecting data from htm files
Post by: M3ta7h3ad on December 09, 2007, 05:47:57 PM
Yes, but you'd have to code it in a programming language :)
What are you basing this on - an online or offline system?
If using PHP, just use strip_tags() to clear out the crap. :)
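A rough command-line analogue of strip_tags(), assuming the page has already been saved locally (page.html is a placeholder), is the classic sed one-liner for removing HTML tags:

sed -e :a -e 's/<[^>]*>//g;/</N;//ba' page.html > page.txt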
Title: Re: Collecting data from htm files
Post by: Pete on December 09, 2007, 07:35:13 PM
I had no idea where to start on this when I posted. It's actually a lot trickier than I first thought.
A bit of googling gave me a couple of leads:
1) Ruby Scraper http://www.igvita.com/blog/2007/02/04/ruby-screen-scraper-in-60-seconds/ - Ruby sorta makes sense, but it's given me such a headache I can't think straight. I can get the scraper to run and output the data in the Ruby console, but I don't get how you can save it to a file.
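If the scraper just prints its data to standard output, the simplest way to save it is shell redirection, with no change to the Ruby code at all (the script name is a placeholder):

ruby scraper.rb > body.txt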
2) iMacros http://www.iopus.com/index.htm - I've spent a while playing with this too, but again it's giving me headaches. At first I could only get it to output the whole source code, and I can't get a clean text file because Google News references thousands of websites and they all have different markers to use as reference points in the macros.
So I'm thinking I need to simplify what I'm doing. I'm gonna have a batch file and get iMacros to capture the CPLs (complete web pages) for a series of websites, as well as their URLs.
The idea is that a cleanly downloaded page will hopefully be more useful than the raw code, and a lot simpler than trying to filter out the correct text.
Problem then is I have lots of keywords and I'm looking to capture a whole series of articles, so I'd have to duplicate the macro for each one and then edit 194 or whatever macro files with the respective keywords.
I could do with a way of having the keywords as a variable in the macro - a %var% sorta thing, where %var% is taken from a text file. I think I can see how to do it with a VBS, but my brain just melted so I'm gonna leave it for tomorrow.
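A sketch of that keyword-as-variable idea as a shell loop rather than VBS, assuming a keywords.txt with one keyword per line and a macro template containing %var% (all filenames here are placeholders):

while read KW; do
    sed "s/%var%/$KW/g" template.iim > "macro_$KW.iim"
done < keywords.txt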
Would PHP be simpler, bearing in mind I don't know much about it?
Title: Re: Collecting data from htm files
Post by: cornet on December 09, 2007, 10:30:57 PM
I don't suppose you have a Linux box, do you?
I could write a shell script in a few minutes that would do this.
Title: Re: Collecting data from htm files
Post by: Pete on December 09, 2007, 11:15:01 PM
I will have one tomorrow :)
Will Ubuntu do?
Title: Re: Collecting data from htm files
Post by: cornet on December 10, 2007, 12:01:13 AM
Ubuntu will do fine... :)
I guess you want to see the script now, don't you... I warn you, it ain't pretty... but it works (well, at least for me). It was written and (semi-)tested in about 20 minutes.
If I were doing this properly I would probably write a Ruby or Perl script and use an RSS parsing module to help me out.
#!/bin/bash

# The program we are going to use to get the resulting URL
BROWSER="lynx -dump"

# Base URL for google search - using the RSS feed for easy parsing :)
G_URL="http://news.google.co.uk/news?hl=en&ned=uk&ie=UTF-8&output=rss&q="

# Google RSS doesn't like curl, so let's pretend we are firefox
USER_AGENT="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.3) Gecko/20010801"

# Change spaces to + signs for the search string
KEYWORDS=`echo $@ | sed -e "s/ /+/g"`

# Form the URL
URL=${G_URL}${KEYWORDS}

#
# Here goes the dirty bit :)
#
# * Get the google search rss page using curl
# * Extract the 3rd link (as the first 2 are for google)
# * Remove the link tags
# * Extract the url
#
RESULT=`curl -q -A "$USER_AGENT" $URL 2>/dev/null | grep -P -o -m1 "<link>(.*?)</link>" | head -n 3 | tail -n 1 | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | sed -e "s/.*url=//" | sed -e "s/&.*//"`

# View the result in a browser
$BROWSER "$RESULT"
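To save the article text and the page URL to files instead of viewing the result (steps 5 and 6 of the original post), the last line could be swapped for something like this (the output paths are placeholders):

$BROWSER "$RESULT" > body.txt
echo "$RESULT" > url.txt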
To run the script, save it into a file (called, say, getnews.sh) and then make it executable by doing:
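chmod +x getnews.sh

Then run it with the search keywords as arguments (they get joined with + signs by the script):

./getnews.sh some search terms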