Improved Search Algorythm (Alternative to FULLTEXT)

learning_brain

New Member
Messages
206
Reaction score
1
Points
0
I have an image search engine (not on this server) which is now crawling nicely, but the search function leaves much to be desired!

Currently, I'm using the fulltext MATCH AGAINST system; which works nicely but has two major limitations:

1) My host limits me to a 4 character search (all 3 character strings are ignored)
2) It doesn't support the %searchstring% feature.

Firstly, I want a 3 character search and 2nd, I have urls which I want to use the %searchstring% functionality.

In addition, it has be be sorted by relevance!!!

OK..... so I have done a lot of work on this and have come up with something that works but I'm VERY concerned about efficiency. With only a thousand or so records, this would be OK, but when the index grows to millions, this is going to be a headache.

Wondering if there is a better way of doing it?

PHP:
<?php
mysql_connect("blah de blah") or die(mysql_error());
mysql_select_db("DB") or die(mysql_error());

?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Search</title>
</head>

<body>
<form id="form1" method="post" action="">
  <label>
    <input type="text" name="search" id="search" />
  </label>
  <label>
    <input type="submit" name="button" id="button" value="Submit" />
  </label>
</form>

<?php
$q=strtolower(strip_tags($_POST['search']));

echo $q."<br/>";

$q_array = explode(" ", $q);
$q_num = count($q_array);

print_r($q_array);

       $query = "SELECT ID, IMG_URL, KEYWORDS FROM IMAGES"; //full table content query
       $result = mysql_query($query); //action result
       $num_results = mysql_num_rows($result); //count rows
       
       for ($i = 0;$i < $num_results; $i++) { //loop through rows
              
              $row = mysql_fetch_object($result); //get row content as object
                    
              //assign object values to variables
              $id = $row->ID;
              $img_url = strtolower(strip_tags($row->IMG_URL));
              $img_url_FC = $row->IMG_URL_FC;
              $keywords = strtolower(strip_tags($row->KEYWORDS));
              $keywords_FC = $row->KEYWORDS_FC;
              
              //------------------------------------------SCORING----------------------------------------
              
              $relevancy = 0;  //set relevance to zero
              
              //--------------------SEARCH THROUGH 1ST COLUMN---------------------------
              //score for exact string
              $relevancy = $relevancy + (substr_count($img_url, $q))*10;
                    
              for ($a = 0; $a < $q_num; $a++)
              {
                      //count characters.  if less than 3, ignore
                    if(strlen($q_array[$a])>2){
                         //gives score of 1 for each occasion words appear
                         $relevancy = $relevancy + substr_count($img_url, $q_array[$a]);
                    }
              } 
              
              //--------------------SEARCH THROUGH 2ND COLUMN---------------------------
              //score for exact string
              $relevancy = $relevancy + (substr_count($keywords, $q))*10;
              
              for ($a = 0; $a < $q_num; $a++)
              { 
                    //count characters.  if less than 3, ignore
                    if(strlen($q_array[$a])>2){ 
                         //gives score of 1 for each occasion words appear
                         $relevancy = $relevancy + substr_count($keywords, $q_array[$a]);
                      }
              }
              //------------------------------------------END SCORING----------------------------------------
              
              
              //add id to relevancy
              $relevancy = $relevancy.".".$id;
              
              //assign results to array
              if ($relevancy>0.999999999999999999999999999999){
                  
                $processed_array[$relevancy]["IMG_URL_FC"] = $img_url_FC;
                  $processed_array[$relevancy]["KEYWORDS_FC"] = $keywords_FC;
                  
              }
              
       } //end for ($i = 0;$i < $num_results; $i++) { //loop through rows
       echo "<br/>";
       
       krsort($processed_array);
       
       $count_processed_array = count($processed_array);
       echo "<br/>".$count_processed_array." results<br/>";
       
       foreach($processed_array as $relevance=>$val){
           echo "<br/>";
           echo "Relevance: ".floor($relevance)."<br/>";
           
           echo $processed_array [$relevance] [IMG_URL_FC].": ";
           echo $processed_array [$relevance] [KEYWORDS_FC]."<br/>";

        }


 
?>
</body>
</html>
 

misson

Community Paragon
Community Support
Messages
2,572
Reaction score
72
Points
48
My host limits me to a 4 character search (all 3 character strings are ignored)
This is the default limit is imposed by MySQL.

It doesn't support the %searchstring% feature.
Do you mean you also want to search for exact matches of the search string?

You're right, when the DB gets too large, performance will be terrible. Since you're only using the DB for persistent storage, the DB isn't used to its full potential.

Here's some ideas on how to approach this, rather than a full solution.
  • Think about writing a stored routine rather than a PHP function.
  • In natural language mode for fulltext, you can use AS SCORE to get the relevance MySQL calculates.
  • Divide the search string into terms longer than three characters and terms three characters long (discard terms of less than three characters). Pass these two values, along with the original search string, to the stored routine.
  • For terms of >3 characters, use the standard fulltext search. For terms of 3 characters, use a regexp (note scoring these will be difficult). For the full string match, use LIKE, giving matched records some constant score.
  • For each row, combine its scores, discard those rows with scores that are too low, sort & return.

For the above, temporary tables and joins are your friend.
 

learning_brain

New Member
Messages
206
Reaction score
1
Points
0
I knew you wouldn't let me down!!

I'm also glad you understand the scope of my challenge. There are some things I need to clarify...

As you may/may not remember, I have an image search engine and I am currently capturing 3 strings... crawled url, image url and keywords found in 'alt'.

This means I have lots of continuous strings i.e. images/wallpaperslordoftherings/amonhen.jpg

The fulltext search does not distinguish "the rings" or "wallpapers" from this string which is why I wanted to use a wildcard either side of each split search word. From what I understand, the natural language mode does not do this either and is still limited to the default 4 character minimum string search (not good). To be honest, I went for the fulltext search becasue of the relavance scoring, which suits me well, but the same issues remain unless my host will change the fulltext setting and then re-boot the MySQL server (unlikely).

I know what you mean about using a regexp for the <4 strings, but as you said, this will be very tricky to pull off convincingly without lots of messy coding.

I have, in the absence of a suitable solution, refined the existing code and will certainly read up on stored routines if it helps later on (now starting reading up on it and it looks like a good solution).

I now however have another issue which is blocking my progess a little.... new thread methinks...

Thank you very much for your time spent on this. It is much appreciated.
 

misson

Community Paragon
Community Support
Messages
2,572
Reaction score
72
Points
48
To generate a search string for all the keywords:
PHP:
$fullsearch = implode('%', $keywords);
However, the order of the keywords is significant. If you don't want this to be the case, you could also generate every permutation of every combination of search terms. You'll have to write a function to do this. I wouldn't recommend it, since it will generate sum(i=1..n; n!/i!) terms in as much time. Three terms goes to ten, which isn't so bad, but four balloons to 41, five to 206 and six to 1237. The increase is more than exponential. The only mitigating factor is that people aren't too likely to enter more than six keywords.

Another option is to use both approaches: use a stored routine with fulltext, regexps and LIKE matching to simply cut down on the number of results the query returns, then use PHP to handle the final scoring & sorting. The regexp MySQL uses doesn't need to be too tricky:
PHP:
$threes = '[[:<:]](' . implode('|', array_filter($keywords, function($word) {return strlen($word) == 3;}) ) . ')[[:>:]]';
Note that this requires PHP 5.3 for anonymous functions; for 5.2, you can use create_function. You've now got a MySQL regexp for the three-character keywords. It's probably better to add the word boundary markers within the stored routine.

If you don't care to use MySQL's fulltext scores, you can search for all rows containing at least one keyword:
PHP:
$searchQuery = $db->prepare(
   'SELECT id, img_url, keywords 
      FROM images 
      WHERE img_url RLIKE :keywords 
        OR keywords RLIKE :keywords');
...

$keywords = preg_split("/[^\w']+/", $_REQUEST['keywords']);
$terms = preg_replace("/'/", "'?"
         '[[:<:]](' 
         . implode('|', array_filter($keywords, 
                                     function($word) {return strlen($word) >= 3;}) ) 
        . ')[[:>:]]'
    );
$searchQuery->execute(array(':keywords' => $terms));
With this option, you don't use a stored routine and PHP still handles scoring & sorting.
 

learning_brain

New Member
Messages
206
Reaction score
1
Points
0
This is a lot to get my head round and areas I'm not totally comfortable with yet.

Your suggestion of implode isn't suitable becasue I don't want to use the entire string to search (unless I do that as well as what I already have... hmmm). A search string may be "Lord of the rings" or even "Rings belonging to a lord" - which is why I wanted to loop through each word.

I love the option of minimising the MySQL result before processing - this will greatly improve efficiency.

Just need to figure out caching now as well!
 
Top