Thread: Collecting and analysing data from publicly accessible sources...

  1. #1
    Join Date
    Mar 2014
    Beans
    70
    Distro
    Kubuntu 14.04 Trusty Tahr

    Collecting and analysing data from publicly accessible sources...

    Hi,

    I'm looking for some kind of "poor man's CIA"... I'm pretty active in nature conservation and I'm looking to build a system which automatically collects information from publicly accessible sources (the press, publicly visible forums, etc.) and analyses the data for certain connections and relations. Specifically, I'm looking to discover information which could help prevent poaching of endangered species.
    I already have a very simple system in place that regularly pulls RSS feeds from certain news sources and websites; if an article or post contains certain keywords, it is stored in a database. That's a good start, but the analysis is still a major task and it's not always easy to see connections.
    Does anybody know of any open source software which could help us analyse this kind of information? What would this kind of software be called (English isn't my native language)?

    -S
    Linux user since 1997, reg'd Linux User #247167, quote: "GUI? Great, I can open multiple console windows side by side!"
    Specialties: Network / security / communications, Asterisk

  2. #2
    Join Date
    Feb 2007
    Location
    West Hills CA
    Beans
    10,044
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: Collecting and analysing data from publicly accessible sources...

    Normally, this activity would be called a Clipping Service. Here's an example: http://www.burrellesluce.com/service...g/full_service

    There are not many open source projects: http://newsclipper.sourceforge.net/

    Your current data format is RSS feeds, which are XML text files (often with HTML embedded in the item descriptions). So you would need an HTML/XML data miner to pull key phrases out of your feeds. Then you would need to review them to determine whether they are relevant to your purpose. Automating that review would require a Big Data infrastructure and some sort of artificial intelligence framework that learns over time. This is not trivial.

    Do you have some small example datasets that you can attach?

    I would keep your pets inside.
    -------------------------------------
    Oooh Shiny: PopularPages

    Unumquodque potest reparantur. Patientia sit virtus.

  3. #3
    Join Date
    Mar 2014
    Beans
    70
    Distro
    Kubuntu 14.04 Trusty Tahr

    Re: Collecting and analysing data from publicly accessible sources...

    So THAT is "Big Data"...? I have written a Perl script which pulls RSS feeds from a number of local and national news sites, and if an article contains certain (hardcoded) keywords, it's pushed into a MySQL database. To prevent duplicates, I also hash the description text and store the hash, so that part is already working. My next step would be to add a second table for the keywords, with some easy "pull keyword out of text into table" function in the GUI. The acquisition and filtering of relevant articles isn't that hard.
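
    For readers who want to try the same kind of pipeline, here is a minimal sketch of it in Python rather than Perl. It is not the actual script: the keyword set, the table layout, the embedded sample feed and the function names are all assumptions made for illustration, and the standard-library sqlite3 module stands in for MySQL so the example runs without a server.

```python
import hashlib
import re
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical hardcoded keywords; the real list is not given in the thread.
KEYWORDS = {"poaching", "bear", "wolf", "lynx"}

# Made-up RSS 2.0 feed so the sketch runs without network access.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example news</title>
<item><title>Lynx found dead</title><link>http://example.org/1</link>
<description>Police suspect poaching after a lynx was found shot.</description></item>
<item><title>Lynx found dead (repost)</title><link>http://example.org/2</link>
<description>Police suspect poaching after a lynx was found shot.</description></item>
<item><title>Town fair opens</title><link>http://example.org/3</link>
<description>The annual fair opened on Saturday.</description></item>
</channel></rss>"""


def matching_keywords(text):
    """Return the sorted keywords that occur as whole words in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(KEYWORDS & words)


def store_feed(conn, feed_xml):
    """Store keyword-matching items; return how many new rows were added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "desc_hash TEXT PRIMARY KEY, title TEXT, link TEXT, description TEXT)"
    )
    before = conn.total_changes
    for item in ET.fromstring(feed_xml).iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        description = item.findtext("description", default="")
        if not matching_keywords(title + " " + description):
            continue
        # The hash of the description is the primary key, so an article
        # with an identical description is silently skipped.
        digest = hashlib.sha256(description.encode("utf-8")).hexdigest()
        conn.execute(
            "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?)",
            (digest, title, link, description),
        )
    conn.commit()
    return conn.total_changes - before


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(store_feed(conn, SAMPLE_FEED), "new article(s) stored")
```

    Hashing the exact description only catches verbatim reposts; an article that a second site has lightly reworded would still be stored twice.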

    The analysis is the hard part. As I wrote, it's fairly hard to see patterns or relations if you look at all those articles manually, e.g. to recognize names, locations, etc. Simply put, I would like a flag to pop up if, for example, a certain place occurs repeatedly in connection with reports about poaching. If it doesn't happen every day or every week, something like that is hard to catch when looking through a ton of daily articles manually.
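
    One simple way to approximate such a flag, without any machine learning, is to count how often a known place name appears in articles that also mention the topic, and to report places that reach a threshold within a time window. The sketch below assumes a hand-maintained list of place names; the function name, the thresholds and the sample articles are invented for illustration, and plain substring matching is only a rough stand-in for real name recognition.

```python
from collections import defaultdict
from datetime import date, timedelta


def flag_recurring_places(articles, places, topic="poaching",
                          min_reports=3, window_days=180):
    """Return {place: report count} for places that appear in at least
    min_reports topic-related articles within window_days of each other.

    articles is an iterable of (publication date, text) pairs.
    """
    dates_by_place = defaultdict(list)
    for published, text in articles:
        lowered = text.lower()
        if topic not in lowered:
            continue  # only count articles that mention the topic
        for place in places:
            if place.lower() in lowered:
                dates_by_place[place].append(published)

    window = timedelta(days=window_days)
    flagged = {}
    for place, dates in dates_by_place.items():
        dates.sort()
        # Slide over runs of min_reports consecutive dates and check
        # whether any run fits inside the window.
        for i in range(len(dates) - min_reports + 1):
            if dates[i + min_reports - 1] - dates[i] <= window:
                flagged[place] = len(dates)
                break
    return flagged


if __name__ == "__main__":
    # Made-up articles and place names, for illustration only.
    sample = [
        (date(2014, 1, 10), "Poaching suspected after wolf shot near Northwood"),
        (date(2014, 3, 2), "Northwood: bear trap found, poaching inquiry opened"),
        (date(2014, 5, 20), "Rangers report poaching incident in Northwood forest"),
        (date(2014, 4, 1), "Lynx poaching case in Eastvale goes to court"),
        (date(2014, 4, 5), "Northwood town fair opens"),
    ]
    print(flag_recurring_places(sample, ["Northwood", "Eastvale"]))
```

    Because the count is taken over a sliding window rather than per day or per week, reports that are months apart still add up, which is exactly the case that is easy to miss when reading the articles one by one.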

    Sample dataset - not yet... But you can basically create or imagine some yourself. Just grab every article which contains e.g. "bear", "wolf", "lynx" and "poaching" from all relevant national and local news services. Then you have an idea of the amount of information I have to process...

    Pets?
