
Web Scraping Program Help


hansman1982


Hey there,

I have looked around for something like this and even tried to teach myself Python to do it, since I never seem to get answers that solve my problems (usually it is "GOOGLE IT" or "I paid no attention to what you actually said, so here is a program that uses a Tsar Bomba when you asked for pruning shears to shape a bonsai tree").

Needs from my program:

1. Tell it a specific set of sites to pull information from

2. From those sites, have it pull the Headline, Author, Date, and Link for new articles as they are posted

3. Display this information in a manner similar to how CKAN displays

4. Allow me to categorize these links

5. Create HTML coding for these categories to copy/paste or auto-post to a website

6. Create peace in the Middle East
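For needs 4 and 5, here is a rough Python sketch of turning categorized articles into an HTML fragment that can be copied and pasted. The `Article` class, the `render_category` function, and the `<h2>`/`<ul>` layout are all made up for illustration; the markup the target website actually wants may differ.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Article:
    """One scraped item (hypothetical structure, matching need 2)."""
    headline: str
    author: str
    date: str
    link: str


def render_category(name, articles):
    """Return an HTML fragment: a heading followed by a list of article links."""
    lines = [f"<h2>{escape(name)}</h2>", "<ul>"]
    for a in articles:
        # escape() guards against <, >, & and quotes in scraped text
        lines.append(
            f'  <li><a href="{escape(a.link)}">{escape(a.headline)}</a>'
            f" ({escape(a.author)}, {escape(a.date)})</li>"
        )
    lines.append("</ul>")
    return "\n".join(lines)
```

Calling `render_category("Space", [Article(...), ...])` returns a string that can be pasted into a page as-is. Auto-posting would depend on what the website accepts, so it is left out here.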

So:

1. Is there something like this out there?

2. If there isn't, or if I decide to go the custom route, how long would this take to code and test? (I ask so I can call BS on programmers as necessary.)

The reason why I post this here:

1. KSP has an amazing collection of modders so it seems there are a number of experts (or maybe even someone willing to do it)

2. I tried Reddit and I hate posting to most of the subs on there

3. Help me, Obi-Wan Kenobi, you are my only hope

Link to comment
Share on other sites

Python is meh at best for doing online things. I've never gotten it to work, but have seen examples of internet-enabled code.

I'd suggest Java. Java has very convenient tools that grab stuff from HTML.

Problem is, I don't know how to program in Java. :P

Link to comment
Share on other sites

So, what you are saying is I just need to figure out how to hack into the NSA supercomputer and steal their program?

Sounds easy enough!

Oh, a quick Google search of "Java web scraper open source" gave me what looks to be a promising result...

Link to comment
Share on other sites

No, I've found RSS too slow and I literally only want/need the headline, author, link and date and then an easy way to get it into HTML.

I was able to cobble something together using Excel and the data grab it has but it's clunky and a pain to keep correct.

Link to comment
Share on other sites

How fast exactly does it need to be? Like you say, cobbling something together will hurt you in the end.

Link to comment
Share on other sites

Python, PHP or Perl would accomplish this relatively easily. A bit of googling should turn up any number of examples that you could try adapting. Stack Exchange is probably a good place to start; I'd certainly look there before Reddit.

Your major hurdle is wanting to do this for a number of sites. It's easy enough to write a script to scrape a single site, but not so easy when you're trying to write a one-size-fits-all script. Regular expressions are going to be your friend here.
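One way to handle several sites is to keep a separate regular expression per site and share the rest of the code. A minimal Python sketch of that idea follows. The site name `example-news` and its markup are invented for illustration; each real site needs its own pattern, worked out by looking at its page source. Regexes tend to break when a site changes its layout, so an HTML parser is often the sturdier choice once things get complicated.

```python
import re

# Hypothetical per-site patterns. Each must define the named groups
# link, headline, author and date for that site's own markup.
SITE_PATTERNS = {
    "example-news": re.compile(
        r'<article>\s*<h2><a href="(?P<link>[^"]+)">(?P<headline>[^<]+)</a></h2>\s*'
        r'<span class="author">(?P<author>[^<]+)</span>\s*'
        r"<time>(?P<date>[^<]+)</time>"
    ),
}


def extract_articles(site, html_text):
    """Return one dict per article found in html_text, using the site's pattern."""
    pattern = SITE_PATTERNS[site]
    return [match.groupdict() for match in pattern.finditer(html_text)]
```

Adding a site then means adding one entry to `SITE_PATTERNS`; the page text itself could be fetched with something like the standard library's `urllib.request.urlopen` before being passed in.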

Link to comment
Share on other sites

I'd be fine if it ran once an hour and threw the new links at me, but ideally it would run continuously (which makes me cringe at the processor/memory/power usage).
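A rough Python sketch of the once-an-hour idea: poll, keep a set of links already seen, and only pass along the new ones. The names `new_items` and `poll` and the `fetch`/`handle` callbacks are hypothetical; `fetch` stands in for whatever scraping code returns a list of dicts with a `link` key. A script that sleeps between checks uses next to no processor time while it waits.

```python
import time


def new_items(items, seen_links):
    """Return items whose 'link' has not been seen yet and record them as seen."""
    fresh = []
    for item in items:
        if item["link"] not in seen_links:
            seen_links.add(item["link"])
            fresh.append(item)
    return fresh


def poll(fetch, handle, interval_seconds=3600, max_rounds=None, sleep=time.sleep):
    """Call fetch() every interval_seconds and pass each unseen item to handle().

    max_rounds=None means run until interrupted; a number limits the rounds,
    which is mainly useful for trying the loop out.
    """
    seen = set()
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for item in new_items(fetch(), seen):
            handle(item)
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            sleep(interval_seconds)
```

The seen set lives in memory here, so it is lost on restart; saving it to a file would be the obvious next step.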

- - - Updated - - -

Thanks everyone so far. I'll try stack exchange as listed above.

Link to comment
Share on other sites
