Prerequisites
- The code is very simple, but a little scripting experience will help.
- Some of the functions that I use only work with PHP 5, but don't let that stop you. A quick Google search will turn up alternatives if you run into problems.
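function createUrl($url,$lastnum)
{
    $find = "?";
    // Trim the query string off the end of the URL, back to the "?".
    // Note: $trim and $remove_to are not used by the loop below.
    $trim = rtrim ($url,'a..z,A..Z,=,_,&');
    $remove_to = strpbrk($trim, '?');
    $myArray = array();
    $number = 1;
    $counter= 0;
    // Stop before $lastnum; < instead of != avoids an endless loop when $lastnum is below 1.
    while ($number < $lastnum)
    {
        // Insert "page=N&" right after the "?".
        $over = "?page=".$number."&";
        $replace = str_replace($find,$over,$url);
        $myArray[$counter] = $replace;
        $number++;
        $counter++;
    }
    return $myArray;
}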
In the code above, we create a function called createUrl. It takes the very first URL and the number of the last URL as its two arguments. Next we create a variable called $find and give it the character value ?. The next two lines trim the initial URL back to the value in $find, although the loop that follows works on the original $url rather than on those trimmed results. Then comes the while loop, which rebuilds our URLs with a page number inserted after the ? and puts them into an array. Moving on.
This is all nice and cool, but what are we going to do with all of these URLs in our array? Simple: we are going to use PHP's built-in file_get_contents function to open them all up and prepare them for scraping.
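To make the loop concrete, here is a small self-contained sketch of the same idea. The example.com address is a made-up placeholder, not the URL used later in this tutorial. Passing 4 as the last number gives pages 1 to 3, because the loop stops before it reaches $lastnum.

```php
<?php
// Sketch of createUrl as described above: insert "page=N&" right after the "?".
function createUrl($url, $lastnum)
{
    $find = "?";
    $myArray = array();
    $number = 1;
    while ($number < $lastnum) {
        $over = "?page=" . $number . "&";
        $myArray[] = str_replace($find, $over, $url);
        $number++;
    }
    return $myArray;
}

// example.com is a placeholder URL used only for illustration.
$pages = createUrl("http://example.com/search?terms=seo", 4);
foreach ($pages as $page) {
    echo $page, "\n";
}
// Prints:
// http://example.com/search?page=1&terms=seo
// http://example.com/search?page=2&terms=seo
// http://example.com/search?page=3&terms=seo
```

Note that str_replace swaps every ? it finds, so this approach assumes the URL contains exactly one.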
function createList ($url ) {
    $myArray = array();
    $counter=0;
    foreach ($url as $value)
    {
        // Download the whole page as one string.
        $html=file_get_contents ($value);
        $myArray[$counter] = $html;
        $counter++;
    }
    return $myArray;
}
This function takes the URL array we created earlier and opens the URLs one at a time. It then puts their content (i.e. all of the information on each page) into a new array.
Once we have all of this information, we need to go through it and pick out the pieces we want, like the name of the organization, address, state, zip code, phone number, etc. In the code below we use preg_match_all and preg_match to grab those specific pieces of data. It's pretty self-explanatory. It will also output the data to the screen for us.
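file_get_contents returns false when a page cannot be read, so it can be worth checking the result before storing it. The sketch below shows one possible way to do that. The function name fetchPages and the temporary file standing in for a web page are assumptions made for illustration, so the example runs without a network connection.

```php
<?php
// Hypothetical variant of the page-loading function: skips sources that fail to load.
function fetchPages(array $urls)
{
    $pages = array();
    foreach ($urls as $value) {
        // @ suppresses the warning; the false return value is checked instead.
        $html = @file_get_contents($value);
        if ($html === false) {
            continue; // unreadable page, leave it out of the result
        }
        $pages[] = $html;
    }
    return $pages;
}

// A temporary local file stands in for a remote page in this sketch.
$tmp = tempnam(sys_get_temp_dir(), "page");
file_put_contents($tmp, "<h2>Example Org</h2>");

// The second path does not exist, so only one page should come back.
$result = fetchPages(array($tmp, $tmp . ".missing"));
echo count($result), " page(s) loaded\n";

unlink($tmp);
```

Skipping failed pages keeps the later regex step from running on a false value instead of a string.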
foreach ($list as $value){
    // Blue marker between pages of results.
    echo "<span style='width:8px; background:blue'> </span>";
    // Grab every listing block on the page.
    preg_match_all ("/<div class=\"description\">([^`]*?)<\/div>/", $value, $matches);
    foreach ($matches[0] as $match) {
        preg_match ("/<h2>([^`]*?)<\/h2>/", $match, $temp);
        preg_match ("/<p>([^`]*?)<\/p>/" , $match, $desc);
        preg_match ("/<ul>([^`]*?)<\/ul>/" , $match, $num);
        // Fall back to an empty string when a pattern did not match.
        $title = isset($temp[1]) ? $temp[1] : '';
        $title = strip_tags(trim($title));
        $description = isset($desc[1]) ? $desc[1] : '';
        $description = strip_tags(trim($description));
        $phone = isset($num[1]) ? $num[1] : '';
        $phone = strip_tags(trim($phone));
        print "<b>$title</b>\n<br>$description<br>\n$phone<br>\n<br>";
    }
}
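To see what these patterns capture, here is a minimal sketch that runs them against a hand-written HTML snippet. The sample markup and the $rows array are assumptions for illustration; real result pages will be laid out differently. The ([^`]*?) group lazily matches any run of characters other than a backtick, which is why it can span tags and line breaks.

```php
<?php
// Hand-written sample markup; real result pages will differ.
$value = '<div class="description">'
       . '<h2><a href="#">Example SEO Co.</a></h2>'
       . '<p> 123 Main St, Austin, TX 78701 </p>'
       . '<ul><li>(555) 010-0000</li></ul>'
       . '</div>';

$rows = array();
preg_match_all("/<div class=\"description\">([^`]*?)<\/div>/", $value, $matches);
foreach ($matches[0] as $match) {
    preg_match("/<h2>([^`]*?)<\/h2>/", $match, $temp);
    preg_match("/<p>([^`]*?)<\/p>/", $match, $desc);
    preg_match("/<ul>([^`]*?)<\/ul>/", $match, $num);
    // Collect the cleaned-up fields instead of printing them.
    $rows[] = array(
        'title'       => strip_tags(trim($temp[1])),
        'description' => strip_tags(trim($desc[1])),
        'phone'       => strip_tags(trim($num[1])),
    );
}
print_r($rows);
```

Because the outer pattern stops at the first closing div, a listing that contains nested div elements would be cut short, so the patterns need adjusting to the actual page.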
This is the final code. Just replace $lastnum with the number of pages the search has, plus one. Click here for an example of the output.
<?php
ini_set('memory_limit', '99999M');
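function createUrl($url,$lastnum)
{
    $find = "?";
    // Trim the query string off the end of the URL, back to the "?".
    // Note: $trim and $remove_to are not used by the loop below.
    $trim = rtrim ($url,'a..z,A..Z,=,_,&');
    $remove_to = strpbrk($trim, '?');
    $myArray = array();
    $number = 1;
    $counter= 0;
    // Stop before $lastnum; < instead of != avoids an endless loop when $lastnum is below 1.
    while ($number < $lastnum)
    {
        // Insert "page=N&" right after the "?".
        $over = "?page=".$number."&";
        $replace = str_replace($find,$over,$url);
        $myArray[$counter] = $replace;
        $number++;
        $counter++;
    }
    return $myArray;
}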
$url = "http://www.yellowpages.com/TX/Internet-Marketing-Advertising?search_mode=all&search_terms=seo";
$lastnum = 1 + 1; // number of result pages, plus one
$url = createUrl($url,$lastnum);
function createList ($url ) {
    $myArray = array();
    $counter=0;
    foreach ($url as $value)
    {
        // Download the whole page as one string.
        $html=file_get_contents ($value);
        $myArray[$counter] = $html;
        $counter++;
    }
    return $myArray;
}
$list = createList($url);
foreach ($list as $value){
    // Blue marker between pages of results.
    echo "<span style='width:8px; background:blue'> </span>";
    // Grab every listing block on the page.
    preg_match_all ("/<div class=\"description\">([^`]*?)<\/div>/", $value, $matches);
    foreach ($matches[0] as $match) {
        preg_match ("/<h2>([^`]*?)<\/h2>/", $match, $temp);
        preg_match ("/<p>([^`]*?)<\/p>/" , $match, $desc);
        preg_match ("/<ul>([^`]*?)<\/ul>/" , $match, $num);
        // Fall back to an empty string when a pattern did not match.
        $title = isset($temp[1]) ? $temp[1] : '';
        $title = strip_tags(trim($title));
        $description = isset($desc[1]) ? $desc[1] : '';
        $description = strip_tags(trim($description));
        $phone = isset($num[1]) ? $num[1] : '';
        $phone = strip_tags(trim($phone));
        print "<b>$title</b>\n<br>$description<br>\n$phone<br>\n<br>";
    }
}
?>
