Reading RSS feeds with Magpie RSS


Last semester I was taking a client-side programming class and ended up attempting to design a Twitter-esque web app using PHP and AJAX. The goal was a Twitter-like interface for a news aggregation website. While my team and I were all competent programmers, one problem had us stuck at first: how to add realistic-looking articles to our database while on a serious time crunch. After a little debate and quite a bit of head-against-wall banging, we hit upon the idea of pulling RSS feeds from major news organizations and parsing them into bite-sized pieces (typically just the headline). Since each RSS feed is just a little bit different, we kept things simple and stuck to three feeds: NPR, CNN, and MSNBC World News.

We started out by downloading Magpie RSS and extracting the files into our Aptana Studio project. There were four code files, all of which end in the file extension .inc but are in fact just PHP files. Next, we created a new PHP file called getarticles.php. This file was designed to be run from the PHP command line, not from the browser, so we put it above the root of our webserver’s public area and created a cron job to run it every 2 minutes. In getarticles.php, we included “rss_fetch.inc” from Magpie RSS, as well as our database class, so we could insert the data whenever we had a new article.
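
Since the script is only meant to run from the command line, a small guard at the top can make that explicit. The following is a minimal sketch, not what we actually shipped: the function name running_from_cli is made up for illustration, and the cron entry would look something like `*/2 * * * * php /path/to/getarticles.php`, where the path is a placeholder.

```php
<?php
// Hypothetical helper (not part of Magpie RSS): true only when PHP
// is running from the command line rather than behind a web server.
function running_from_cli(string $sapi = PHP_SAPI): bool
{
    return $sapi === 'cli';
}

// Refuse to do anything if the script is somehow reached through the web server.
if (!running_from_cli()) {
    exit("getarticles.php must be run from the command line\n");
}

// From here on, the real script would include rss_fetch.inc and the database class.
```

Keeping the file above the public web root already prevents browser access; the guard is just a second line of defense in case the file gets moved.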

The next step in the process is the tedious part. Every RSS feed is different, so we have to inspect the feed we want to parse and see what data it holds. Luckily this isn’t too tough in PHP:

$nprrss = fetch_rss('http://www.npr.org/rss/rss.php?id=1001');
echo '<pre>';
print_r($nprrss);
echo '</pre>';

This shows how the array is organized, namely which fields are in the feed, so we can parse through them. We start out by breaking the array into individual entries:

$items = array_slice($nprrss->items, 0);

And then looping through each entry:

foreach ($items as $item )
{
     // ... process $item here ...
}
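
As a rough picture of what each $item looks like inside that loop, here is a self-contained sketch that uses hand-written sample data in place of a real feed. The headlines, links, and the $headlines variable are invented for illustration; a real feed may carry more fields, which is why the print_r step is worth doing first.

```php
<?php
// Hand-written sample data standing in for $nprrss->items; the values are invented.
$items = [
    [
        'title'       => 'First sample headline',
        'link'        => 'http://example.com/1',
        'description' => 'Summary of the first story.',
    ],
    [
        'title'       => 'Second sample headline',
        'link'        => 'http://example.com/2',
        'description' => 'Summary of the second story.',
    ],
];

$headlines = [];
foreach ($items as $item) {
    // Each $item is an associative array keyed by the feed's element names.
    $headlines[] = $item['title'];
}

print_r($headlines);
```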

During each iteration of the above loop, we reference named array keys (much like looping through rows returned from a database) and do whatever we want with the contents of the feed:

foreach ($items as $item )
	{
		$data['TITLE'] = $item['title'];
		$data['AUTHOR'] = "nprbot";
		$data['TEXT'] = $item['description'];
		$data['THUMBS_UP'] = 3;
		$data['THUMBS_DOWN'] = 0;
		$title = array_pad(explode(" ",$data['TITLE']), 5, ""); // pad so $title[0] to $title[4] exist for short headlines
		$sql = "select * from ARTICLE where TITLE like '%".mysql_real_escape_string($title[0])."%".mysql_real_escape_string($title[1])."%".mysql_real_escape_string($title[2])."%".mysql_real_escape_string($title[3])."%".mysql_real_escape_string($title[4])."%' and AUTHOR = 'nprbot'";
	    $result = $db->query($sql);
		if(mysql_num_rows($result) != 0)
		{	
			echo "***************************************************************************************\n\n";
			echo "Article: ".$data['TITLE']." already in db\n";
			echo "***************************************************************************************\n\n";
		}
		else {
			echo "Comparing title and sql for debugging\n";
			echo $sql."\n";
			echo $data['TITLE']."\n";
			echo "Creating article by nprbot titled: ".$data['TITLE']."\n\n" ;
			$db->query_insert("ARTICLE", $data);
			sleep($rest);
		}
	}
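
The duplicate check above builds a LIKE pattern from the first five words of the title. A tidier way to build the same pattern might look like the sketch below. The helper name title_like_pattern is hypothetical, and escaping the % and _ wildcards is an addition the original query does not make. Note also that the mysql_* functions were removed in PHP 7, so on a current PHP the pattern would be passed as a bound parameter through PDO or mysqli rather than concatenated into the SQL string.

```php
<?php
// Hypothetical helper: build a LIKE pattern from up to $maxWords leading words of a title.
function title_like_pattern(string $title, int $maxWords = 5): string
{
    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', trim($title), -1, PREG_SPLIT_NO_EMPTY);
    $words = array_slice($words, 0, $maxWords);

    // Escape the LIKE wildcards so a literal % or _ in a headline
    // is not treated as a wildcard by the database.
    $escaped = array_map(function ($word) {
        return addcslashes($word, '%_\\');
    }, $words);

    return '%' . implode('%', $escaped) . '%';
}

echo title_like_pattern('Senate Passes Budget'), "\n"; // prints %Senate%Passes%Budget%
```

One caveat with this sketch: an empty title produces the pattern %%, which matches every row, so the caller would want to skip items without a title.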

To finish the task at hand, we simply rinse and repeat for the other feeds.
***Note***
The following code is the result of three exhausted, caffeine-deprived college students working late into the night after several days of 6-8 hour coding marathons, and is therefore by no means the most elegant, proper, or even safest code ever written.

<?php
	/*
	 * This script is designed to take ABSOLUTELY FREAKING FOREVER to run
	 * in order to save the server undue stress from the web crawling. Best
	 * to run it when you are bored or going to bed.
	 */
	include('public/rss_fetch.inc');
	include("database.class.php");
	$db->connect();
	/*
	 * The $rest variable sets the sleep time (in seconds) between adding entries into the database.
	 * A value of 40 spaces the articles out nicely in terms of their timestamps; it is set to 1
	 * here so the script runs more quickly. Set it to 0 to skip the pause entirely.
	 */
	$rest = 1;
	/*
	 * Define all rss feeds
	 */
	$nprrss = fetch_rss('http://www.npr.org/rss/rss.php?id=1001');
	$cnnrss = fetch_rss('http://rss.cnn.com/rss/cnn_topstories.rss');
	$msnbcWorldrss = fetch_rss('http://rss.msnbc.msn.com/id/3032506/device/rss/rss.xml');
	/*
	 * Begin processing feeds with the following steps:
	 * 		1. Slice the feed into individual entries
	 * 		2. Loop through the entries and grab what we want
	 * 		3. Insert the data into the database if not already there
	 */
	$items = array_slice($nprrss->items, 0);
	/*
	 * NPR Feed
	 */
	foreach ($items as $item )
	{
		$data['TITLE'] = $item['title'];
		$data['AUTHOR'] = "nprbot";
		$data['TEXT'] = $item['description'];
		$data['THUMBS_UP'] = 3;
		$data['THUMBS_DOWN'] = 0;
		$title = array_pad(explode(" ",$data['TITLE']), 5, ""); // pad so $title[0] to $title[4] exist for short headlines
		$sql = "select * from ARTICLE where TITLE like '%".mysql_real_escape_string($title[0])."%".mysql_real_escape_string($title[1])."%".mysql_real_escape_string($title[2])."%".mysql_real_escape_string($title[3])."%".mysql_real_escape_string($title[4])."%' and AUTHOR = 'nprbot'";
	    $result = $db->query($sql);
		if(mysql_num_rows($result) != 0)
		{	
			echo "***************************************************************************************\n\n";
			echo "Article: ".$data['TITLE']." already in db\n";
			echo "***************************************************************************************\n\n";
		}
		else {
			echo "Comparing title and sql for debugging\n";
			echo $sql."\n";
			echo $data['TITLE']."\n";
			echo "Creating article by nprbot titled: ".$data['TITLE']."\n\n" ;
			$db->query_insert("ARTICLE", $data);
			sleep($rest);
		}
	}
	$items = array_slice($cnnrss->items, 0);
	/*
	 * CNN Feed
	 */
	foreach ($items as $item )
	{
		$text = explode("<",$item['description']);
	 	$data['TITLE'] = $item['title'];
		$data['AUTHOR'] = "cnnbot";
		$data['TEXT'] = strval($text[0]);
		$data['THUMBS_UP'] = 3;
		$data['THUMBS_DOWN'] = 0;
		$title = array_pad(explode(" ",$data['TITLE']), 5, ""); // pad so $title[0] to $title[4] exist for short headlines
		$sql = "select * from ARTICLE where TITLE like '%".mysql_real_escape_string($title[0])."%".mysql_real_escape_string($title[1])."%".mysql_real_escape_string($title[2])."%".mysql_real_escape_string($title[3])."%".mysql_real_escape_string($title[4])."%' and AUTHOR = 'cnnbot'";
	    $result = $db->query($sql);
		if(mysql_num_rows($result) != 0)
		{
			echo "***************************************************************************************\n\n";
			echo "Article: ".$data['TITLE']." already in db\n";
			echo "***************************************************************************************\n\n";
		}
		else {
			echo "Comparing title and sql for debugging\n";
			echo $sql."\n";
			echo $data['TITLE']."\n";
			echo "Creating article by cnnbot titled: ".$data['TITLE']."\n\n" ;
			$db->query_insert("ARTICLE", $data);
			sleep($rest);
		}
	}
	$items = array_slice($msnbcWorldrss->items, 0);
	/*
	 * MSNBC Feed
	 */
	foreach ($items as $item )
	{
		/*
		 * The MSNBC Feed likes to be difficult. Some entries have pictures, others
		 * do not. This chunk of code parses out the description based upon whether
		 * or not there is a picture.
		 */
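		// NOTE: the original description-parsing lines are missing from this listing
		// (everything between explode("< and the query call was lost). The lines below
		// are a reconstruction modelled on the NPR block, not the original code.
		// Entries whose description still contains markup are skipped further down.
		$data['TITLE'] = $item['title'];
		$data['AUTHOR'] = "msnbcbot";
		$data['TEXT'] = $item['description'];
		$data['THUMBS_UP'] = 3;
		$data['THUMBS_DOWN'] = 0;
		$title = array_pad(explode(" ",$data['TITLE']), 5, "");
		$sql = "select * from ARTICLE where TITLE like '%".mysql_real_escape_string($title[0])."%".mysql_real_escape_string($title[1])."%".mysql_real_escape_string($title[2])."%".mysql_real_escape_string($title[3])."%".mysql_real_escape_string($title[4])."%' and AUTHOR = 'msnbcbot'";
		$result = $db->query($sql);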
		if(mysql_num_rows($result) != 0)
		{		
			echo "***************************************************************************************\n\n";
			echo "Article: ".$data['TITLE']." already in db\n";
			echo "***************************************************************************************\n\n";
		}
		else {
			$haystack = strtolower($data['TEXT']);
			$needle = "<";
			$pos = strpos($haystack,$needle);
			if($pos === false) {
				echo "Comparing title and sql for debugging\n";
				echo $sql."\n";
				echo $data['TITLE']."\n";
				echo "Creating article by msnbcbot titled: ".$data['TITLE']."\n\n" ;
				$db->query_insert("ARTICLE", $data);
				sleep($rest);
			}
			else {
 				echo "Skipping article by msnbcbot with the text:\n" ;
 				echo $data['TEXT']."\n\n";
			}
		}	
	}
?>
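
Looking back, the three loops differ only in the bot name and in how the description gets cleaned up, so most of the repetition could be folded into a pair of small functions. The sketch below is one possible shape for that, not what we actually ran: text_before_markup and build_article_row are hypothetical names, the sample item is invented, and the trim() call is an addition the CNN loop does not make.

```php
<?php
// Hypothetical helper: keep only the text before the first "<", as the CNN loop does.
function text_before_markup(string $description): string
{
    $parts = explode('<', $description);
    return trim($parts[0]);
}

// Hypothetical helper: turn one feed item into the row shape used for the ARTICLE table.
function build_article_row(array $item, string $author): array
{
    return [
        'TITLE'       => $item['title'],
        'AUTHOR'      => $author,
        'TEXT'        => text_before_markup($item['description']),
        'THUMBS_UP'   => 3,
        'THUMBS_DOWN' => 0,
    ];
}

// Invented sample item; a real one would come from fetch_rss(...)->items.
$sample = [
    'title'       => 'Sample headline',
    'description' => 'Plain summary.<img src="x.gif" />',
];

$row = build_article_row($sample, 'cnnbot');
print_r($row);
```

With helpers like these, each feed would need only one loop that calls build_article_row with its own bot name. Applying text_before_markup to every feed is a design choice of this sketch; the NPR loop above stores the description untouched.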