Building a search engine with PHP

This tutorial will show you how to build a PHP search engine. More precisely, it shows you how to scrape data from web pages. You can use this data in a variety of ways; one use is to build a search engine.

First you need a way to scrape the data, and you have a few options. I prefer the built-in PHP function file_get_contents or curl. This tutorial uses file_get_contents, since not all PHP configurations have curl installed (note that fetching a URL this way requires allow_url_fopen to be enabled in php.ini). Below is a simple example of how to get the HTML of any page on the internet into a string.

<?php
// the URL you want to scrape
$url = 'http://www.inet411.com';
// this puts the HTML of the page into the string $contents
$contents = file_get_contents($url);
// for display only: htmlentities escapes the markup so the browser does not parse it,
// and nl2br converts \n to <br />
echo nl2br(htmlentities($contents));
?>

And the results:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <meta name="description" content="iNet411.com free and open source php, jquery, css, ajax, javascript and xhtml scripts" />
        <meta name="keywords" content="php, jquery,css,ajax,javascript,xhtml, scripts, tutorials" />
        <title>iNet411.com</title>
        <script type="text/javascript" src="http://www.inet411.com/public/js/jquery.js"></script>
                                <link rel="stylesheet" type="text/css" href="http://www.inet411.com/public/css/style.css" />
                </head>
    <body>
        <div id="body_container">
            <div id="header_container">
                <div id="header">
                    <h1>iNet411.com</h1>
                </div>
                <!-- end header -->
                <div class="mattblacktabs">
                    <ul>
                        <li>
                            <a href="http://www.inet411.com/">Home</a>
                        </li>
                    </ul>
                </div>
            </div>
            <!-- end header_container -->
            <div id="left_right_container">
                <div id="left_column">
                    <div class="urbangreymenu">
                        <h3 class="headerbar">Articles</h3>
                        <ul>
                            <li>
                                <a href="http://www.inet411.com/articles/jquery/wait-plugin.html">jQuery Wait Plugin</a>
                            </li>
                            <li>
                                <a href="http://www.inet411.com/articles/php/array-remove.html">PHP array_remove</a>
                            </li>
                            <li>
                                <a href="http://www.inet411.com/articles/how-to/build-a-php-search-engine-part-1.html">PHP Search engine part 1</a>
                            </li>
                            
                           
                        </ul>
                    </div>
                    <!-- end urbangreymenu -->
                </div>
                <!-- end left_column -->
                <div id="center_column">
                    <div class="content"><h1>Welcome to iNet411.com</h1>
<p>Free and open source php, jquery, javascript, ajax, css and xhtml scripts.  Feel free to use any of the script you
find here for any purpose you choose.  Make sure to check each scripts license before use.</p>
<p>Php tutorials coming soon.</p>                    </div>
                    <!-- end content -->

                                    </div>
                <!-- end center_column -->
                <div id="right_column">

                    <!-- <div class="urbangreymenu2">
                        <h3 class="headerbar">Stats</h3>
                        <ul>
                            <li>
                             
                            </li>
                        </ul>
                    </div> -->
                    <div id="credits">
                        <div id="credits_left">
                            <p>
                            <a href="http://validator.w3.org/check?uri=referer"><img src="http://www.inet411.com/public/images/valid-xhtml10.png" alt="Valid XHTML 1.0 Strict" /></a>
                            </p>
                            <p>
                            <a href="http://validator.w3.org/check?uri=http://www.inet411.com/public/css/style.css"><img src="http://www.inet411.com/public/images/valid-css.gif" alt="Valid XHTML 1.0 Strict" /></a>
                            </p>
                            <p>
                            <a href="http://creativecommons.org/licenses/by-sa/3.0/"><img src="http://www.inet411.com/public/images/88x31.png" alt="Valid XHTML 1.0 Strict" /></a>
                            </p>
                        </div>
                        <div id="credits_right">
                            <p>
                            <a href="http://php.net"><img src="http://www.inet411.com/public/images/php-power-white.png" alt="Valid XHTML 1.0 Strict" /></a>
                            </p>
                            <p>
                            <a href="http://mysql.com"><img src="http://www.inet411.com/public/images/powered-by-mysql-88x31.png" alt="Valid XHTML 1.0 Strict" /></a>
                            </p>
                            <p>
                            <a href="http://jquery.com"><img src="http://www.inet411.com/public/images/88x31_bk01.png" alt="Valid XHTML 1.0 Strict" /></a>
                            </p>    
                        </div>
                        <div class="clear"></div>
                    </div>
                </div>
                <!-- end left_column -->
                <div style="clear:both;"></div>
            </div>
            <!-- end left_right_container -->
            <div id="footer">
                Copyright 2005-2009 iNet411
            </div>
        </div>
        <!-- end body_container -->
    </body>
</html>
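
One thing the fetch example above does not cover is failure: file_get_contents returns false when the request cannot be completed. Here is a minimal sketch of a wrapper that handles that case. The function name fetch_page, the 10 second timeout and the user agent string are my own assumptions for illustration; they are not part of the original script.

```php
<?php
// Hypothetical helper (not from the original tutorial): fetch a URL and
// return its body as a string, or null if the request fails.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    // Options for PHP's http stream wrapper (also applied to https URLs).
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleScraper/0.1', // assumed, pick your own
        ],
    ]);

    // file_get_contents returns false on failure. The @ silences the
    // warning so the caller can decide how to report the problem.
    $contents = @file_get_contents($url, false, $context);

    return $contents === false ? null : $contents;
}
```

You could then call fetch_page('http://www.inet411.com') and check the result for null before doing anything else with the string.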

Now that we have the contents in a string, we can do pretty much anything we want with them.

Here is a quick sample; we'll do more in the next tutorial. For now we will do something that many people need: get all the links from this page and put them into an array so we can use them later.

<?php
$url = 'http://www.inet411.com';
$contents = file_get_contents($url);
// capture the href value of every <a> tag; the values end up in $matches[1]
preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>/", $contents, $matches);
echo '<pre>';
print_r($matches[1]);
echo '</pre>';
?>

That's it. The code grabs the contents of the page you want with file_get_contents and puts them into a string. The preg_match_all call then pulls all of the links out of that string and puts them into an array, $matches[1].

And the results:

Key  Value
0    http://www.inet411.com/
1    http://www.inet411.com/articles/jquery/wait-plugin.html
2    http://www.inet411.com/articles/php/array-remove.html
3    http://www.inet411.com/articles/how-to/build-a-php-search-engine-part-1.html
4    http://validator.w3.org/check?uri=referer
5    http://validator.w3.org/check?uri=http://www.inet411.com/public/css/style.css
6    http://creativecommons.org/licenses/by-sa/3.0/
7    http://php.net
8    http://mysql.com
9    http://jquery.com
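
A regular expression works for a quick sample like this, but it can miss or mangle links when the markup is unusual. As an alternative, here is a minimal sketch that uses PHP's DOM extension (DOMDocument) to collect the href values instead. This is a swapped-in technique, not the method used above, and the function name extract_links and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: return the href value of every <a> element in $html.
function extract_links(string $html): array
{
    // Nothing to parse; also avoids DOMDocument complaining about empty input.
    if (trim($html) === '') {
        return [];
    }

    // Real-world HTML is rarely valid, so collect parser errors internally
    // instead of letting them surface as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($document->getElementsByTagName('a') as $anchor) {
        // Skip anchors without an href, such as <a name="top">.
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }

    return $links;
}

// Sample markup made up for this sketch.
$sample = '<p><a href="http://php.net">PHP</a> <a name="top">no link here</a></p>';
print_r(extract_links($sample));
```

The array it returns has the same shape as $matches[1], so the rest of your code does not need to change if you swap one approach for the other.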

Stay tuned for part 2.
