Yellowpages Scraper Tutorial
Hello, and welcome to this yellowpages.com scraper tutorial. In this tutorial, I will show you how to create your very own yellowpages scraper for free! That's right, free! It's crazy that someone would charge you $250.00 for such simple code. After this tutorial you will no longer have to pay for databases; you can just get the information yourself. So let's get started.

Prerequisites

  • The coding is very simple, but it may help to have a little scripting experience.
  • Some of the functions that I use require PHP 5 or later, but don't let that stop you. Do a quick Google search for alternatives if you run into problems.
First, take a look at the structure of these two yellowpages URLs.
http://www.yellowpages.com/TX/Internet-Marketing-Advertising?search_mode=all&search_terms=seo
http://www.yellowpages.com/TX/Internet-Marketing-Advertising?sort=content&page=2&search_mode=all&search_terms=seo
Do you see the difference? In the second URL,
sort=content&page=2&
is added for pagination, but we only need to be concerned with the number. If we increase or decrease this number, the browser displays the next or previous page. We need our scraper / spider to start at the first page of the search results, scrape it, and continue in that manner until the very last page. For example, let's do a search for seo in Texas. The URL is:
http://www.yellowpages.com/TX/Internet-Marketing-Advertising?search_mode=all&search_terms=seo
We need to know how many pages there are for this search. The easy way to find out is to look at the number in the URL of the last page link, which is located at the bottom of the results page.
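Reading the last page number by eye can also be done in code. The following is only a sketch: findLastPage is a made-up helper name, and the sample markup is invented for illustration, so the real pagination links may look different. It assumes the links carry a page=N parameter, as in the URLs above, and simply takes the largest N it finds.

```php
<?php
// Hypothetical helper (not part of the original script): returns the largest
// page=N value found in the HTML, or 1 when no pagination links are present.
function findLastPage($html)
{
    // "page=" preceded by "?", "&", or the ";" that ends an HTML-escaped "&amp;".
    if (!preg_match_all('/[?&;]page=(\d+)/', $html, $matches)) {
        return 1;
    }
    return max(array_map('intval', $matches[1]));
}

// Invented pagination markup, for illustration only.
$sample = '<a href="/TX/Internet-Marketing-Advertising?sort=content&amp;page=2&amp;search_terms=seo">2</a>'
        . '<a href="/TX/Internet-Marketing-Advertising?sort=content&amp;page=14&amp;search_terms=seo">14</a>';

echo findLastPage($sample), "\n"; // prints 14
```

If the site only links to a few neighbouring pages rather than the very last one, this would undercount, so treat the result as a starting point.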
For now, let's just remember that number (14). Later on (if you wanted to develop this for a client $$$) you could scrape that number automatically. But first we need to create a function that can dynamically create our URLs. For this example we will need 14 URLs, i.e.
http://www.yellowpages.com/TX/Internet-Marketing-Advertising?sort=content&page=1&search_mode=all&search_terms=seo
http://www.yellowpages.com/TX/Internet-Marketing-Advertising?sort=content&page=2&search_mode=all&search_terms=seo
http://www.yellowpages.com/TX/Internet-Marketing-Advertising?sort=content&page=3&search_mode=all&search_terms=seo
. . . you get the idea.
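function createUrl($url, $lastnum)
{
    // The "?" that starts the query string; the page parameter is inserted right after it.
    $find = "?";
    // Trim the URL back to its "?" (neither result is used by the loop below).
    $trim = rtrim($url, 'a..z,A..Z,=,_,&');
    $remove_to = strpbrk($trim, '?');
    $myArray = array();
    $number = 1;
    $counter = 0;
    // $lastnum is the number of pages plus one, so the loop stops after the last page.
    while ($number < $lastnum)
    {
        $over = "?page=" . $number . "&";
        $replace = str_replace($find, $over, $url);
        $myArray[$counter] = $replace;
        $number++;
        $counter++;
    }
    return $myArray;
}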

In the code above, we create a function called createUrl. createUrl takes two arguments: the very first URL, and the number of the last page plus one (the loop stops just before it reaches that value). Next we create a variable called $find and give it the character value ?. The next two lines trim the initial URL down to the value in $find; note that the loop never reads $trim or $remove_to and works on $url directly. Then comes the while loop, which inserts page=N right after the ? for each page number, recreating our URLs and putting them into an array. Moving on.
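If the string replacement feels fragile (it would also rewrite any other ? in the URL), the same list can be built by parsing the query string. This is only a sketch of an alternative, not the method used in this tutorial: buildPageUrls is a made-up name, it takes the real page count rather than the count plus one, and it assumes the URL has a scheme, host, and path. The page parameter ends up at the end of the query string instead of the front, which should not matter to the server.

```php
<?php
// Hypothetical alternative to createUrl: returns one URL per page, 1..$pageCount.
function buildPageUrls($url, $pageCount)
{
    $parts = parse_url($url);

    // Split the existing query string into an array of name => value pairs.
    $query = array();
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
    }

    // Assumes scheme, host and path are all present in $url.
    $base = $parts['scheme'] . '://' . $parts['host'] . $parts['path'];

    $urls = array();
    for ($page = 1; $page <= $pageCount; $page++) {
        $query['page'] = $page; // add or overwrite the page number
        $urls[] = $base . '?' . http_build_query($query, '', '&');
    }
    return $urls;
}

$urls = buildPageUrls(
    'http://www.yellowpages.com/TX/Internet-Marketing-Advertising?search_mode=all&search_terms=seo',
    3
);
print_r($urls);
```

The first entry comes out as http://www.yellowpages.com/TX/Internet-Marketing-Advertising?search_mode=all&search_terms=seo&page=1, and the other two differ only in the page number.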

This is all nice and cool, but what are we going to do now with all of these URLs in our array? Simple: we are going to use PHP's built-in file_get_contents function to open them all up and prepare them for scraping.

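function createList($url)
{
    $myArray = array();
    $counter = 0;
    foreach ($url as $value)
    {
        // Download the page; file_get_contents returns false on failure.
        $html = file_get_contents($value);
        if ($html === false) {
            continue; // skip pages that could not be fetched
        }
        $myArray[$counter] = $html;
        $counter++;
    }
    return $myArray;
}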

This function takes the URL array we created earlier and opens the URLs one at a time, then puts their content (i.e. all of the information on each page) into a new array.
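A slightly more defensive way to do the fetching could look like the sketch below. fetchPages, the timeout, the pause, and the user agent string are all placeholders I made up for illustration. The idea is to pass a stream context to file_get_contents so that a slow page cannot hang the script, and to skip pages that fail. Unlike createList, the result is keyed by URL. The demo at the bottom reads a temporary local file so that it runs without touching the network.

```php
<?php
// Hypothetical variant of createList: fetches each URL and skips any that fail.
function fetchPages(array $urls, $timeoutSeconds = 10, $pauseSeconds = 0)
{
    // Options under 'http' only apply to http:// and https:// URLs.
    $context = stream_context_create(array(
        'http' => array(
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleScraper/0.1', // placeholder value
        ),
    ));

    $pages = array();
    foreach ($urls as $url) {
        // @ hides the warning; failure is detected through the false return value.
        $html = @file_get_contents($url, false, $context);
        if ($html === false) {
            continue;
        }
        $pages[$url] = $html;
        if ($pauseSeconds > 0) {
            sleep($pauseSeconds); // be gentle with the server between requests
        }
    }
    return $pages;
}

// Demo on a local temporary file, so the sketch runs without network access.
$tmp = tempnam(sys_get_temp_dir(), 'yp');
file_put_contents($tmp, '<html>sample</html>');
$pages = fetchPages(array($tmp, $tmp . '.missing'));
echo count($pages), "\n"; // prints 1: the missing file is skipped
unlink($tmp);
```

With real URLs you would call it as fetchPages($url, 10, 1) on the array returned by createUrl.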

Once we have all of this information, we need to go through it and pick out the pieces we want, like the name of the organization, address, state, zip code, phone number, etc. In the code below we use preg_match_all and preg_match to grab those specific pieces of data. It's pretty self-explanatory. It also outputs the data to the screen for us.

foreach ($list as $value) {
    // Blue marker between result pages.
    echo "<span style='width:8px; background:blue'> </span>";
    // Grab every listing block. [^`] matches any character except a backtick,
    // newlines included, so a block may span several lines.
    preg_match_all("/<div class=\"description\">([^`]*?)<\/div>/", $value, $matches);
    foreach ($matches[0] as $match) {
        preg_match("/<h2>([^`]*?)<\/h2>/", $match, $temp);
        preg_match("/<p>([^`]*?)<\/p>/", $match, $desc);
        preg_match("/<ul>([^`]*?)<\/ul>/", $match, $num);

        // Fall back to an empty string when a pattern did not match.
        $title = isset($temp[1]) ? strip_tags(trim($temp[1])) : '';
        $description = isset($desc[1]) ? strip_tags(trim($desc[1])) : '';
        $phone = isset($num[1]) ? strip_tags(trim($num[1])) : '';

        print "<b>$title</b>\n<br>$description<br>\n$phone<br>\n<br>";
    }
}
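To see what those patterns actually capture, it helps to run them on a small piece of HTML first. The markup below is invented for illustration (the business, address, and phone number are fake, and the live page structure may differ), and firstCapture is a made-up helper. This sketch collects the fields into an array instead of printing them straight away.

```php
<?php
// Hypothetical helper: returns the tag-stripped first capture group,
// or an empty string when the pattern does not match.
function firstCapture($pattern, $text)
{
    return preg_match($pattern, $text, $m) ? strip_tags(trim($m[1])) : '';
}

// Invented sample markup, for illustration only.
$sample = '<div class="description"><h2><a href="#">Acme SEO</a></h2>'
        . '<p>123 Main St, Austin, TX 78701</p>'
        . '<ul><li>(512) 555-0100</li></ul></div>';

// Same patterns as in the tutorial code above.
preg_match_all('/<div class="description">([^`]*?)<\/div>/', $sample, $matches);

$listings = array();
foreach ($matches[0] as $block) {
    $listings[] = array(
        'title'       => firstCapture('/<h2>([^`]*?)<\/h2>/', $block),
        'description' => firstCapture('/<p>([^`]*?)<\/p>/', $block),
        'phone'       => firstCapture('/<ul>([^`]*?)<\/ul>/', $block),
    );
}
print_r($listings);
```

Here $listings[0] holds the title Acme SEO, the address line as the description, and (512) 555-0100 as the phone. Keep in mind that the lazy match stops at the first closing div, so a listing that contains nested div tags would be cut short.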

This is the final code. Just set $lastnum to the number of pages the search has, plus one.

<?php
// Raise the memory limit, since every fetched page is held in memory at once.
ini_set('memory_limit', '99999M');
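function createUrl($url, $lastnum)
{
    // The "?" that starts the query string; the page parameter is inserted right after it.
    $find = "?";
    // Trim the URL back to its "?" (neither result is used by the loop below).
    $trim = rtrim($url, 'a..z,A..Z,=,_,&');
    $remove_to = strpbrk($trim, '?');
    $myArray = array();
    $number = 1;
    $counter = 0;
    // $lastnum is the number of pages plus one, so the loop stops after the last page.
    while ($number < $lastnum)
    {
        $over = "?page=" . $number . "&";
        $replace = str_replace($find, $over, $url);
        $myArray[$counter] = $replace;
        $number++;
        $counter++;
    }
    return $myArray;
}

$url = "http://www.yellowpages.com/TX/Internet-Marketing-Advertising?search_mode=all&search_terms=seo";
// Number of pages to scrape, plus one. Use 14 + 1 for the full example search.
$lastnum = 1 + 1;
$url = createUrl($url, $lastnum);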

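function createList($url)
{
    $myArray = array();
    $counter = 0;
    foreach ($url as $value)
    {
        // Download the page; file_get_contents returns false on failure.
        $html = file_get_contents($value);
        if ($html === false) {
            continue; // skip pages that could not be fetched
        }
        $myArray[$counter] = $html;
        $counter++;
    }
    return $myArray;
}
$list = createList($url);
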
foreach ($list as $value) {
    // Blue marker between result pages.
    echo "<span style='width:8px; background:blue'> </span>";
    // Grab every listing block. [^`] matches any character except a backtick,
    // newlines included, so a block may span several lines.
    preg_match_all("/<div class=\"description\">([^`]*?)<\/div>/", $value, $matches);
    foreach ($matches[0] as $match) {
        preg_match("/<h2>([^`]*?)<\/h2>/", $match, $temp);
        preg_match("/<p>([^`]*?)<\/p>/", $match, $desc);
        preg_match("/<ul>([^`]*?)<\/ul>/", $match, $num);

        // Fall back to an empty string when a pattern did not match.
        $title = isset($temp[1]) ? strip_tags(trim($temp[1])) : '';
        $description = isset($desc[1]) ? strip_tags(trim($desc[1])) : '';
        $phone = isset($num[1]) ? strip_tags(trim($num[1])) : '';

        print "<b>$title</b>\n<br>$description<br>\n$phone<br>\n<br>";
    }
}
?>