Collecting data from htm files

Started by Pete, December 09, 2007, 05:20:39 AM


Pete

Is it possible to automate this?

1.   Open Google News in a web browser
2.   Type "query" into the form
3.   Press enter
4.   Click the first returned URL
5.   Extract the main text body, save to f:\query\body.txt
6.   Extract Page URL, save to f:\query\url.txt


edit: I can get as far as opening the first page returned and outputting the whole source code to a text file, but I don't want all the HTML noise, just the article text and the page URL.
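(Roughly speaking, that step boils down to something like this - just a sketch, the query string and output file name are placeholders:)

# Fetch the Google News results page for a search and dump the raw HTML to a file.
# "query" and source.txt are placeholders.
curl -s "http://news.google.co.uk/news?hl=en&q=query" -o source.txt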

 

M3ta7h3ad

Yes, but you'd have to code it in a programming language :)

What are you basing this on? An online or an offline system?

If using PHP, just use strip_tags() to clear out the crap. :)
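Just to illustrate the strip_tags idea - a rough sketch from the command line, assuming you have the PHP CLI installed and have already saved a page to page.html (both file names are made up):

# Strip all HTML tags from a saved page and keep just the text.
php -r 'echo strip_tags(file_get_contents("page.html"));' > body.txt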

Pete

I had no idea where to start on this when I posted. It's actually a lot trickier than I first thought.

A bit of googling gave me a couple of leads:

1) Ruby Scraper
http://www.igvita.com/blog/2007/02/04/ruby-screen-scraper-in-60-seconds/
Ruby sorta makes sense, but it's given me such a headache I can't think straight. I can get the scraper to run and output the data in the Ruby console, but I don't get how you can save it to a file (see the sketch just after this list).

2) iMacros
http://www.iopus.com/index.htm
I've spent a while playing with this too, but again it's giving me headaches. At first I could only get it to output the whole source code, and I can't get a clean text file because Google News references thousands of websites and they all have different markers to use as reference points in the macros.
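(On the save-to-file question from 1) above: if the scraper writes its output to standard output, the simplest fix is probably just to redirect it when you run the script - a sketch with made-up file names:)

# Run the scraper and redirect whatever it prints into a text file.
ruby scraper.rb > body.txt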


So I'm thinking I need to simplify what I'm doing. I'm gonna have a batch file and get iMacros to capture the CPLs (complete web pages) for a series of websites, as well as their URLs.

The idea is that a cleanly downloaded page will hopefully be more useful than the raw code, and a lot simpler than trying to filter out the correct text.

Problem is, I have lots of keywords and I'm looking to capture a whole series of articles, so I'd have to do:

Batch file:

start ....\macro - keyword1
start ....\macro - keyword2
....
....
start ....\macro - keyword194
echo Done
pause


Then edit 194 or whatever macro files with the respective keywords.

I could do with a way of having the keyword as a variable - a macro - %var% sort of thing, where %var% is taken from a text file. I think I can see how to do it with a VBS, but my brain just melted so I'm gonna leave it for tomorrow.
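For comparison, on a Unix-style shell the whole keyword-from-a-text-file idea is just a loop - a rough sketch, where keywords.txt (one keyword per line) and run_macro are placeholder names for whatever actually kicks off the capture:

#!/bin/bash
# Read keywords one per line and run the capture command for each one.
# keywords.txt and run_macro are placeholders.
while read -r keyword; do
    run_macro "$keyword"
done < keywords.txt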

Would PHP be simpler? Bearing in mind I don't know much about it.

cornet

I don't suppose you have a Linux box, do you?

I could write a shell script in a few minutes that would do this.

Pete

I will have tomorrow :)

Will Ubuntu do?

cornet

Ubuntu will do fine... :)

I guess you want to see the script now, don't you... I warn you, it ain't pretty, but it works (well, at least for me). It was written and (semi-)tested in about 20 minutes.

If I were doing this properly I would probably write a Ruby or Perl script and use an RSS parsing module to help me out.


#!/bin/bash

# The program we are going to use to view the resulting URL
BROWSER="lynx -dump"

# Base URL for google search - using the RSS feed for easy parsing :)
G_URL="http://news.google.co.uk/news?hl=en&ned=uk&ie=UTF-8&output=rss&q="

# Google RSS doesn't like curl, so let's pretend we are Firefox
USER_AGENT="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.3) Gecko/20010801"

# Change spaces to + signs for search string
KEYWORDS=`echo $@ | sed -e "s/ /+/g"`

# Form the URL
URL=${G_URL}${KEYWORDS}

#
# Here goes the dirty bit :)
#
# * Get the google search rss page using curl
# * Extract the 3rd link (as first 2 are for google)
# * Remove the link tags
# * Extract the url
#
RESULT=`curl -q -A "$USER_AGENT" $URL 2>/dev/null |
grep -P -o -m1 "<link>(.*?)</link>" | head -n 3 | tail -n 1 |
sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | sed -e "s/.*url=//" | sed -e "s/&.*//"`

# View the result in a browser
$BROWSER "$RESULT"


To run this, save that lot into a file (called, say, getnews.sh), then make it executable by doing:

chmod +x getnews.sh


Then run it, passing some search words to it:

./getnews.sh cows mooing


It will then output the result of "lynx -dump", which dumps the page contents as text with the URLs listed as references after the text.
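If you want it to match the original body.txt / url.txt idea instead of opening a browser, you could swap the last line for something like this (an untested sketch; the file names are made up):

# Dump the article text and the result URL to files instead of viewing them.
lynx -dump "$RESULT" > body.txt
echo "$RESULT" > url.txt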

Modify as you see fit :)


Pete

Cheers, I'm gonna try it as soon as Ubuntu is installed. ETA at current rate: 4 months   :(

Pete

Is there any way to make MS Virtual PC boot from a virtual CD-ROM drive?

I got an Ubuntu CD, burnt it and tried installing from it, but it sorta gives up on installing at 6% and hangs there for 3 hours.

M3ta7h3ad

It should boot from an ISO image. Failing that, VMware Player will do it.

Pete

Got it :) It's "Capture ISO Image" I need to do.

Pete

2.5 hours later and it's finished installing. Shame the mouse doesn't work with it  :roll:

It doesn't like running at my TFT's native res either...

Gonna try enabling Mouse Keys, but man, this is a headache.