ereg_replace and explode problem

learning_brain

I am trying to analyse a string in a number of ways for my new adaptive learning language program at www.brain.x10hosting.com.

One part is to determine which words in the string are common, and thus remove them.

The test page is at www.brain.x10hosting.com/test.php

The php below goes with a form - obviously.

PHP:
echo "Input: ".$_POST['input']."<br/>";

//clean input
    $input = $_POST['input'];
    $clean_input = ereg_replace("[^A-Za-z]", " ", $input);
    
    echo "Cleaned input: ".$clean_input."<br/>";

//split response into words and count
    $input_array = explode(" ", $clean_input);
    $input_array_count = count($input_array);
    
    echo stripslashes($input_array_count)." words.<br/><br/>";

//loop through each word and check if common.
    for($i=0;$i<$input_array_count;$i++){
        
        echo "Loop ".$i.": ";
        $word_to_test = $input_array[$i];
        echo "Word tested: (".$word_to_test."): ";
        
        $sql_common="SELECT WORD FROM COMMONWORDS WHERE WORD = '".$word_to_test."'  ";
        $result_common = mysql_query($sql_common);
        $row_common = mysql_fetch_array($result_common);
        
        //if common
        if (isset($row_common['WORD'])){
            echo $row_common['WORD']." Found in Database.<br/><br/>";
        }
        //if not common
        if (!isset($row_common['WORD'])){
            echo " Is not common. <br/><br/>";
            $proc_input_array[] = $word_to_test;
        }
    
    }
    
    print_r($proc_input_array);

$proc_input_array should only contain clean words that were not found in the database.

The trouble is that ereg_replace swaps every non-letter character for a space. Then, when it comes to the explode, each added space produces an extra (empty) value in the array.

How can I get rid of these extra spaces and stick with the words alone?

Also, words with punctuation in them get split. For example, "don't" gets split into "don" and "t".
 

as4s1n

The trim() function strips whitespace from the beginning and end of a string. Try:
PHP:
$input_array = explode(" ", trim($clean_input));

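Also, for the special characters you could try htmlspecialchars(), which converts characters that are special in HTML into entities. For example, '&' becomes '&amp;' and '<' becomes '&lt;', and the browser still displays them as '&' and '<'. (The %20 form of a space is URL encoding, which comes from urlencode(), not htmlspecialchars().)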
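A minimal sketch of what trim() does and does not fix here, assuming a made-up sample string (the variable names and the text are for illustration only). trim() only removes whitespace at the two ends, so runs of spaces in the middle still produce empty entries from explode(); filtering with 'strlen' is one way to drop them.

```php
<?php
// Hypothetical cleaned input: punctuation has already been replaced by spaces.
$clean_input = " hello  world   again ";

// trim() strips the leading and trailing space only.
$parts = explode(" ", trim($clean_input));
// $parts still holds an empty string for every extra space between words.

// Drop the empty strings and reindex the array from 0.
$words = array_values(array_filter($parts, 'strlen'));

print_r($words);
```

Here $parts has six entries (three of them empty), while $words holds just the three words.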
 

descalzo

PHP:
echo "Input: ".$_POST['input']."<br/>";


    $input = strtolower(  stripslashes( $_POST['input'] ) ) ;  
   // Lowercase easier to handle, use stripslashes here

    $clean_input = ereg_replace("[^a-z']", " ", $input);  
   // depending on how you are going to handle  it's  dog's, leave  '  in
    
    echo "Cleaned input: ".$clean_input."<br/>";


    $input_array = preg_split('/\s+/' , $clean_input); 
   // use regular expression to overcome multiple whitespace

    $input_array_count = count($input_array);
    
    echo $input_array_count  . " words.<br/><br/>";  
   //Why did you use stripslashes here?

and a bit later:
PHP:
        //if common
        if (isset($row_common['WORD'])){
            echo $row_common['WORD']." Found in Database.<br/><br/>";
        } else {  // no use testing it again
        
            echo " Is not common. <br/><br/>";
            $proc_input_array[] = $word_to_test;
        }
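Note that ereg_replace() was deprecated in PHP 5.3 and removed in PHP 7, so on a current PHP the same cleaning step has to go through preg_replace(). A minimal sketch, assuming a made-up sample string; the helper name clean_input() is hypothetical and not part of the code above:

```php
<?php
// Hypothetical helper: lowercase the text, then replace everything except
// a-z and the apostrophe with a space, mirroring the ereg_replace() call.
function clean_input($input)
{
    return preg_replace("/[^a-z']/", " ", strtolower($input));
}

echo clean_input("Don't panic, it's OK!");
```

The only change to the pattern is the pair of / delimiters that the preg_* functions require.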
 

misson

PHP:
    $clean_input = ereg_replace("[^a-z']", " ", $input);  
   // depending on how you are going to handle  it's  dog's, leave  '  in
Careful. Leaving in single-quotes and using mysql_query without first quoting the input values leaves the query vulnerable to SQL injection.

@learning_brain: note that the input word list can be filtered in a single query:
PHP:
// in some other script
if (get_magic_quotes_gpc()) {
    $_REQUEST; # so $GLOBALS['_REQUEST'] exists
    foreach (array('_GET', '_POST', '_COOKIE', '_REQUEST') as $k) {
        $GLOBALS[$k] = array_map('stripslashes', $GLOBALS[$k]);
    }
}
...
//LocalDB::connect returns a PDO
try {
    $db = LocalDB::connect();
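    $wordStr = strtolower($_POST['input']);
    // PREG_SPLIT_NO_EMPTY drops the empty strings that leading or trailing
    // separators would otherwise produce
    $words = preg_split("/[^a-z']+/", $wordStr, -1, PREG_SPLIT_NO_EMPTY);
    // PDO::quote() adds the surrounding quotes itself, so join with a bare comma
    // (an empty $words would give "IN ()", which is a syntax error)
    $quotedWords = implode(', ', array_map(array($db, 'quote'), $words));

    $commonWordsQuery = $db->query("SELECT word FROM common_words WHERE word IN ($quotedWords)");
    $commonWords = $commonWordsQuery->fetchAll(PDO::FETCH_COLUMN);

    // compare the unquoted words against the words found in the table
    $proc_input_array = array_diff($words, $commonWords);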
} catch (PDOException $exc) {
    ...
}
A peek at the source of array_diff reveals it's about as efficient as you could do with your own implementation, if not more so.
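A minimal sketch of the array_diff() step on its own, with made-up word lists standing in for the input words and the query result:

```php
<?php
// Hypothetical data: words from the input and words found in the table.
$words = array('the', 'cat', 'sat', 'on', 'the', 'mat');
$commonWords = array('the', 'on');

// Keep only the values of $words that do not appear in $commonWords.
$uncommon = array_diff($words, $commonWords);

print_r($uncommon);
```

array_diff() preserves the original keys, so the result here is keyed 1, 2 and 5; wrap it in array_values() if a 0-based list is needed.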
 

learning_brain

@as4s1n. Thanks for this. The trouble is if you trim the result, it wouldn't have any spaces to explode the result.

@descalzo. Love the strtolower idea and built that in.

I've also removed the random stripslashes and altered the if{}else{} weird construct.

Unfortunately, the preg_split was still leaving single spaces at front/back of string, allowing another loop and a " " entry in the final array.
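For reference, those stray entries are the empty strings that preg_split() returns when the string starts or ends with whitespace, and the PREG_SPLIT_NO_EMPTY flag suppresses them. A minimal sketch with a made-up input string:

```php
<?php
// Hypothetical cleaned input with a leading and a trailing space.
$clean_input = " it's a  dog's life ";

// Without the flag, the leading and trailing separators yield empty strings.
$with_empties = preg_split('/\s+/', $clean_input);

// With PREG_SPLIT_NO_EMPTY (limit -1 means "no limit"), only words come back.
$words = preg_split('/\s+/', $clean_input, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```

The first call returns six entries, two of them empty; the second returns the four words only.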

As misson points out, leaving the apostrophe in the string can lead to mysql problems, so I had to figure out a way to eradicate all non-letter characters. This is also important when it comes to variants. People might say it's or its, but they both mean the same thing, so they have to be equal when checked. The only problems come with words like I'll and ill!
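A minimal sketch of that trade-off, assuming a hypothetical normalise_word() helper that lowercases a word and strips everything except letters:

```php
<?php
// Hypothetical helper: lowercase, then remove every character that is not a-z.
function normalise_word($word)
{
    return preg_replace('/[^a-z]/', '', strtolower($word));
}

// Wanted: "It's" and "its" now compare equal.
var_dump(normalise_word("It's") === normalise_word("its"));

// Unwanted: "I'll" and "ill" also collapse to the same string.
var_dump(normalise_word("I'll") === normalise_word("ill"));
```

Both comparisons come out true, which is exactly the I'll/ill collision.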

I have found a more obvious solution that evaded me last night. Simply explode first then clean afterwards in each loop! This way, all spaces are in the correct places, the word count is correct and the cleaning doesn't have to deal with spaces at all.

@misson I'm now trying to get my head round your contribution. I know my approach isn't particularly efficient, so a result based on a single SQL query would be ideal. Reading up now.

----------------------------------

After some reading up, the array_diff() function certainly came in handy, but I didn't fully understand the PDO connection, so I've modified my own code to something similar.

While I haven't achieved a comparative SQL result, I have simply created two arrays: one from the input phrase and one from the COMMONWORDS table.

The array_diff then simply compares the two and strips common words out.

Result!

PS - although it would have been better to clean the string accurately without the loop.

PHP:
<?php

echo "Input: ".$_POST['input']."<br/>";
$input = $_POST['input'];

//split response into words and count
	$input_array = explode(" ", $input);
	$input_array_count = count($input_array);
	
	echo $input_array_count." words.<br/><br/>";

//loop through each word.
	for($i=0;$i<$input_array_count;$i++){
		
		echo "Loop ".$i.": ";
		//assign word to be tested
		$word_to_test = $input_array[$i];
		echo "Word tested: (".$word_to_test."): ";
		
		//change to lowercase
		$word_lower = strtolower($word_to_test);
		//strip out everything other than letters
		$clean_word = ereg_replace("[^A-Za-z]", "", $word_lower);
	
		echo "Cleaned word: (".$clean_word.").<br/> ";
		
		//create growing array with cleaned words
		$clean_phrase_array[] = $clean_word;

	}
	echo "Cleaned phrase is: ";
	print_r($clean_phrase_array);
	echo "<br/>";
	
	//retrieve all common words
	$sql_common="SELECT * FROM COMMONWORDS";
	$result_common = mysql_query($sql_common);
	//assign to array
	while ($row=mysql_fetch_assoc($result_common))
	{
	$common_words[] = $row['WORD'];
	}
	mysql_free_result($result_common);
		
	echo "Common words are: ";
	print_r($common_words);
	echo "<br/>";
	
	//compare differences between arrays.
	$proc_input_array = array_diff($clean_phrase_array,$common_words);
	
	echo "Uncommon phrase to use:";
	print_r($proc_input_array);
?>
 

misson

[...] but I didn't fully understand the PDO connection, [...]
If you're referring to LocalDB::connect, see "Display all that would be secret while Mysql is broken", "Script terminating in middle of script for no reason", "MySQL and PHP" ..., though it has less to do with PDO and more to do with isolating sensitive information (you could write a LocalDB::connect that returns a connection based on the old mysql driver). If you're talking about PDO, read the tutorial "Writing MySQL Scripts with PHP and PDO". PDO has many advantages over the mysqli and (outdated) mysql driver. At this point, there's no good reason to use the latter.


PHP:
<?php [...]
//loop through each word.
	for($i=0;$i<$input_array_count;$i++){
            [...]
	}
This loop can be replaced with (e.g.)
PHP:
$clean_phrase_array=preg_split('/\s+/', 
                               strtolower(trim(preg_replace('/[^a-zA-Z ]+/', '', 
                                                            $_POST['input']))));

PHP:
	//retrieve all common words
	$sql_common="SELECT * FROM COMMONWORDS";
SELECT * with no WHERE clause will result in terrible performance unless there are only a handful of words in table common_words. Even if table common_words is small, you shouldn't fetch all columns, especially since you only need column word. The old version was better.
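A minimal sketch of a query that fetches only the word column for only the input words, using one placeholder per word. The helper name build_common_words_query() is hypothetical, and the PDO connection in $db is assumed rather than shown:

```php
<?php
// Hypothetical helper: build a parameterised IN (...) query for a word list.
// Table and column names follow the common_words / word naming used above.
function build_common_words_query(array $words)
{
    if (count($words) === 0) {
        return null; // "IN ()" is a syntax error, so there is nothing to run
    }
    // One "?" placeholder per word, e.g. "?, ?, ?" for three words.
    $placeholders = implode(', ', array_fill(0, count($words), '?'));
    return "SELECT word FROM common_words WHERE word IN ($placeholders)";
}

$words = array('the', 'cat', 'sat');
$sql = build_common_words_query($words);
echo $sql;

// With a PDO connection in $db (an assumption, not shown here):
// $stmt = $db->prepare($sql);
// $stmt->execute($words);
// $commonWords = $stmt->fetchAll(PDO::FETCH_COLUMN);
```

Because the words are bound as parameters rather than pasted into the SQL, apostrophes in the input need no manual quoting.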

Only SQL keywords should be in uppercase. Column names, tables, databases, user defined functions and procedures should be in camelCase, or in lower case with words separated_by_underscores. This helps a reader distinguish user-defined things from keywords. Otherwise, you might as well type everything in lower case and save your shift (or caps lock) key.
 