Monday 6 November 2017

CREATE A SEARCH ENGINE LIKE GOOGLE


Search Engine Architecture

Before going into further details, let us talk about what our goals should be while developing a search engine. Listed below is a brief set of goals we should focus on -
  • A web crawler, indexer and document store capable of handling a large volume of documents, perhaps a million or even more.
  • We should follow test-driven development, which helps enforce good design and modular code.
  • We should be able to support various strategies for things like the index, the document store, search etc.
A typical search engine consists of a few parts -
  • A crawler, which pulls in external documents.
  • An index, where the documents are stored in an inverted index, and
  • A document store to keep the documents.
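To make the idea of an inverted index concrete, here is a minimal sketch of the data structure in plain PHP arrays. The array shape and the document IDs are purely illustrative, not part of the final design -

```php
<?php
// A toy inverted index: each term maps to the documents it occurs in,
// and, for each document, to the word positions where it appears.
$invertedIndex = [
    'search' => ['doc1' => [0, 12], 'doc3' => [4]],
    'engine' => ['doc1' => [1],     'doc2' => [7]],
];

// Looking up a term is then a single array access:
$postings = $invertedIndex['search'] ?? [];
print_r(array_keys($postings)); // the documents containing "search"
```

The key point is that lookup cost depends on the number of query terms, not on the number of documents, which is why every search engine is built around this structure.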
THE CRAWLER
In order to crawl, we need to come up with a list of URLs. There are a few generic ways to do this, as listed under -
  • The most common is to feed the crawler a seed page containing lots of links; our job is then to crawl those links and harvest more as we go down the list.
  • Another approach is to download a ready-made list of URLs and work through it.
Since our aim is to extract only the actual website addresses, let us write a simple parser to pull the appropriate data out. It is quite straightforward, as shown below -
Listing 6: The parser –
        $file_handle = fopen( "Quantcast-Top-Million.txt", "r" );

        while ( !feof( $file_handle ) ) {
                $line = fgets( $file_handle );
                if ( preg_match( '/^\d+/', $line ) ) { # only lines that start with a rank number
                        $tmp  = explode( "\t", $line );
                        $rank = trim( $tmp[0] );
                        $url  = trim( $tmp[1] );
                        if ( $url != 'Hidden profile' ) { # "Hidden profile" appears sometimes; just ignore it
                                echo $url . "\n";
                        }
                }
        }
        fclose( $file_handle );
DOWNLOADING
Downloading the data is going to take some time, so we should be prepared for a long wait. We can write a very basic crawler in PHP simply by using file_get_contents and passing it a URL. Let us have a look at the following code -

Listing 7: The crawler –
        $file_handle = fopen( "urllist.txt", "r" );
        while ( !feof( $file_handle ) ) {
                $url = trim( fgets( $file_handle ) );
                if ( $url == '' ) { continue; } # skip blank lines
                $content = file_get_contents( $url ); # fetch the page
                $document = array( $url, $content ); # keep the url together with its content
                $serialized = serialize( $document );
                $fp = fopen( './documents/' . md5( $url ), 'w' );
                fwrite( $fp, $serialized );
                fclose( $fp );
        }
        fclose( $file_handle );
The above code is essentially a single-threaded crawler. It simply loops over every URL in the file, fetches the content and then saves it to disk. The one thing worth noting is that it stores the URL along with the content in each document, since we might need the URL for ranking purposes, and it is helpful to keep track of where a document came from. We should keep in mind that we may run into file system limits when storing lots of documents in one folder.
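One simple way around those per-folder limits is to shard the stored documents into subdirectories keyed on a prefix of the md5 hash we already compute. A minimal sketch follows; the two-character prefix (256 shards) and the helper name are arbitrary choices, not part of the original code -

```php
<?php
// Shard stored documents across 256 subfolders (00..ff) based on
// the first two hex characters of the url's md5 hash.
function documentPath($url, $baseDir = './documents') {
    $hash  = md5($url);
    $shard = substr($hash, 0, 2);
    $dir   = $baseDir . '/' . $shard;
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true); // create the shard folder on demand
    }
    return $dir . '/' . $hash;
}

// The crawler would then write to documentPath($url) instead of
// './documents/' . md5($url).
echo documentPath('http://example.com', sys_get_temp_dir() . '/documents');
```

With a million documents this keeps each folder at roughly four thousand files, well within what common file systems handle comfortably.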
THE INDEX
The reason I initially talked about test-driven development is that I prefer the bottom-up approach. The index, which we are going to create, should have a few very simple responsibilities, as listed under -
  • It needs to store its contents to disk and retrieve them.
  • It needs to be able to clear itself when we decide to regenerate things.
  • It should validate the documents that it is storing.
Having these tasks defined, let us put the following interface in place -

Listing 8: The interface –
        interface iindex {
                 public function storeDocuments($name,array $documents);
                 public function getDocuments($name);
                 public function clearIndex();
                 public function validateDocument(array $document);
         }
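As a minimal sketch of what an implementation of this interface might look like, here is a naive file-backed version. The storage directory, the use of serialize, and the "document is a [url, content] pair" validation rule are all assumptions carried over from the crawler above, not a definitive design -

```php
<?php
// A naive file-backed index: each named segment of documents is
// serialized to its own file under $dir.
class FileIndex /* implements iindex */ {
    private $dir;

    public function __construct($dir = './index') {
        $this->dir = $dir;
        if (!is_dir($dir)) {
            mkdir($dir, 0777, true);
        }
    }

    public function storeDocuments($name, array $documents) {
        file_put_contents($this->dir . '/' . $name, serialize($documents));
    }

    public function getDocuments($name) {
        $file = $this->dir . '/' . $name;
        return is_file($file) ? unserialize(file_get_contents($file)) : [];
    }

    public function clearIndex() {
        foreach (glob($this->dir . '/*') as $file) {
            unlink($file); // wipe every stored segment
        }
    }

    public function validateDocument(array $document) {
        // expect the [url, content] pairs produced by the crawler
        return count($document) === 2;
    }
}
```

Because callers only depend on the interface, this file-based strategy can later be swapped for a database- or memory-backed one without touching the rest of the engine.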

THE DOCUMENT STORE
The document store is somewhat odd: if we are going to index things, we probably already have the documents stored somewhere else. The most obvious case is that the documents already live in some database.
THE INDEXER
The next step in our approach to build our search is to create the indexer. An indexer takes a document, breaks it apart and feeds it into the index, and also possibly to the document store depending upon our implementation.
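A bare-bones indexer along these lines could tokenise the content and record each term's positions in the inverted-index array. The regex-based tokenisation below is deliberately crude (no stemming, no stop words) and the function name is illustrative -

```php
<?php
// Break a document into lowercase word tokens and feed each term,
// with its word position, into an inverted index array.
function indexDocument($docId, $content, array &$index) {
    $tokens = preg_split('/\W+/', strtolower($content), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $position => $term) {
        $index[$term][$docId][] = $position;
    }
}

$index = [];
indexDocument('doc1', 'Create a search engine', $index);
print_r(array_keys($index)); // the distinct terms of doc1
```

In a fuller implementation this is also the point where the document would be handed to the document store, depending on which strategy we picked.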
INDEXING
Now that we have the ability to store and index some documents, let us go through the steps we need to put the indexing in place -

  • The first thing we are supposed to do here is to set the time limit to unlimited, since the indexing process might take longer than expected.
  • Our next step is to define where the index and the documents are going to live, in order to avoid path errors later.
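The two preparation steps above can be sketched as follows; the directory constants are placeholders for wherever the index and documents actually live -

```php
<?php
// Allow the indexing run to exceed PHP's default max_execution_time.
set_time_limit(0);

// Define where the index and the crawled documents live, so that
// every component agrees on the same locations.
define('INDEX_DIR',    sys_get_temp_dir() . '/index');
define('DOCUMENT_DIR', sys_get_temp_dir() . '/documents');

foreach ([INDEX_DIR, DOCUMENT_DIR] as $dir) {
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true); // avoid "no such directory" errors mid-run
    }
}
```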
SEARCHING
Searching requires a relatively simple interface. In fact, we only require a single method, as shown below -

Listing 9: The search interface –
        interface isearch {
                public function dosearch($searchterms);
        }
Of course, the actual implementation is not that easy. It is rather more complex than it appears.
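To give a flavour of why, here is a minimal sketch of a dosearch over an inverted-index array: it intersects the posting lists of all query terms, which is only the very first step of a real implementation (ranking, stemming and phrase matching are all left out) -

```php
<?php
// AND-search: return only the document IDs that contain every query term.
function dosearch($searchterms, array $index) {
    $terms  = preg_split('/\W+/', strtolower($searchterms), -1, PREG_SPLIT_NO_EMPTY);
    $result = null;
    foreach ($terms as $term) {
        $docs = array_keys($index[$term] ?? []);
        // first term seeds the result; later terms narrow it down
        $result = ($result === null) ? $docs : array_intersect($result, $docs);
    }
    return array_values($result ?? []);
}

$index = [
    'search' => ['doc1' => [0], 'doc3' => [4]],
    'engine' => ['doc1' => [1], 'doc2' => [7]],
];
print_r(dosearch('search engine', $index)); // only doc1 contains both terms
```

On top of this, a production engine has to score the intersected documents, which is where the real complexity lives.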

Conclusion

Through this document, I have tried to cover the different areas of a search engine and its features. I have also discussed how to create our own search engine with the help of PHP and MySQL. Let us conclude this article with the following bullets -
  • A search engine is a powerful and useful tool in today's Internet world.
  • A search engine is based on several complex mathematical formulae which are used to generate the search results.
  • The results obtained for a specific query are then displayed on the SERP, or Search Engine Results Page.
