Php Regular Expressions - Request For Help.


Latest entry: Post #11 by ^zer0dyer$ on May 2 2005, 08:02 AM.


vujsa
So I'm trying to learn how to pull useful content from a web page.

Here's what I got so far:
CODE
<?php
$filename = "http://www.forum500.com";
$html = file_get_contents($filename);

/* **************************************************************** */
/*                                                                  */
/*             Got this code @ http://www.php.net/                  */
/*                                                                  */
/* **************************************************************** */
// $document should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.
$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                '@([\r\n])[\s]+@',                // Strip out white space
                '@&(quot|#34);@i',                // Replace HTML entities
                '@&(amp|#38);@i',
                '@&(lt|#60);@i',
                '@&(gt|#62);@i',
                '@&(nbsp|#160);@i',
                '@&(iexcl|#161);@i',
                '@&(cent|#162);@i',
                '@&(pound|#163);@i',
                '@&(copy|#169);@i',
                '@&#(\d+);@e');                    // evaluate as php

$replace = array ('',
                '',
                '\1',
                '"',
                '&',
                '<',
                '>',
                ' ',
                chr(161),
                chr(162),
                chr(163),
                chr(169),
                'chr(\1)');

$text = preg_replace($search, $replace, $html);
echo $text;
?>
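
A note for anyone running this on a current PHP: the e modifier used in the last pattern was deprecated in PHP 5.5 and removed in PHP 7.0, so the array above no longer works unchanged. Below is a minimal sketch of one way to get a similar result. The helper name html_to_text() and the sample string are made up for illustration, and html_entity_decode() is swapped in for the hand-written entity patterns, so the output is UTF-8 rather than the single-byte characters chr() produced.

```php
<?php
// Hypothetical helper (not from the thread): strips scripts and tags,
// collapses whitespace after line breaks, then decodes entities.
function html_to_text(string $html): string
{
    $search = [
        '@<script[^>]*?>.*?</script>@si', // strip out javascript
        '@<[\/\!]*?[^<>]*?>@si',          // strip out HTML tags
        '@([\r\n])[\s]+@',                // collapse white space after a line break
    ];
    $replace = ['', '', '\1'];
    $text = preg_replace($search, $replace, $html);

    // html_entity_decode() takes over the job of the entity patterns,
    // including the numeric ones that needed the e modifier.
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Sample input is an assumption; the thread fetches a live page instead.
$sample = '<p>Fish &amp; chips &#169; 2005</p><script>alert("x");</script>';
echo html_to_text($sample), "\n";
```

If the numeric entities have to be handled by hand, preg_replace_callback() is the usual replacement for the e modifier.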


The only line that I don't completely understand is:
CODE
'@<script[^>]*?>.*?</script>@si',
I know what it does, I just don't know how.

Here is the output:
QUOTE(Output)
Test Forum - Index Welcome, Guest. Please login or register. 1 Hour 1 Day 1 Week 1 Month Forever Login with username, password and session length Shoutbox guest123 : Hello SpaceWaste : meh though, Im off to bed, night man SpaceWaste : lol damn bugs....ah well. Its ok, this is just fine SpaceWaste : damn, youre right View All 2 Posts in 1 Topics by 4 Members - Latest Member: marc April 12, 2005, 10:30:12 PM Test Forum News SMF - Just Installed General Category Posts Topics Last post General Discussion Feel free to talk about anything and everything in this board. 2 1 April 06, 2005, 08:50:39 PM Re: Welcome to SMF! by Dooga Test Forum - Info Center Forum Stats Total Topics: 1Total Posts: 2 Latest Post: "Re: Welcome to SMF!" (April 06, 2005, 08:50:39 PM) View the 10 most recent posts on the forum. [More Stats] Total Members: 4 Latest Member: marc Users Online 1 Guest, 0 Users Most users online today: 1 Most users online ever: 4 ( April 10, 2005, 08:40:22 PM ) Test Forum | Powered by SMF 1.0.2. © 2001-2005, Lewis Media. All Rights Reserved. Helios design by Bloc


So let's say I wanted to get the name of the Latest Member.


So, how do I get just that information?

Any help here would be great.

vujsa

mastercomputers
QUOTE(vujsa @ Apr 13 2005, 06:58 PM)
The only line that I don't completely understand is:
CODE
'@<script[^>]*?>.*?</script>@si',
I know what it does, I just don't know how.




Some of this I don't understand, as I haven't learned much PCRE syntax, but it shares a lot with Perl's and grep's way of doing things. I don't understand the @ bit, but it seems to mark the start and end of the regex, and si are some type of modifiers: i is case insensitive; s, I'm not sure.

OK

<script[^>]*?> finds the <script language="javascript"> part. <script matches the exact string. [^>] is a negated character class, and combined with *? it grabs everything that's not >, zero or more times, taking as little as it can (so the attributes are optional). The > on the end then matches the close of the opening tag.

.*? means any characters, zero to endless, again taking as few as possible, so nothing needs to exist between the tags.

and </script> matches the exact string. When you combine it all together it'll match anything of the form <script blah blah>anything here</script>, but the exact strings it asks for, <script, > and </script>, must be present.
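
To see why the lazy .*? matters in that pattern, here is a small sketch that runs it next to a greedy variant. The sample HTML is invented for the comparison; only the first pattern comes from the thread.

```php
<?php
// Made-up sample with two script blocks and content between them.
$html = "<p>one</p><script type=\"text/javascript\">\nvar a = 1;\n</script>"
      . "<p>two</p><script>var b = 2;</script><p>three</p>";

// Lazy .*? stops at the first </script> it can reach, so each block
// is removed separately.
$lazy = preg_replace('@<script[^>]*?>.*?</script>@si', '', $html);

// Greedy .* runs on to the last </script>, swallowing <p>two</p> too.
$greedy = preg_replace('@<script[^>]*?>.*</script>@si', '', $html);

echo $lazy, "\n";   // <p>one</p><p>two</p><p>three</p>
echo $greedy, "\n"; // <p>one</p><p>three</p>
```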

Sorry I have to head off for a bit, but I'll come back and explain anything else I've left out.

Cheers,


MC

vujsa
Well, I've been playing around with this a bit now and am beginning to understand some of it.

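The "@" is used as a delimiter. Nearly any character that isn't a letter, digit, whitespace or the backslash "\" will work, I guess. Pipe "|" or slash "/" could have been used too, although "|" would then no longer be available for alternation such as (quot|#34), and any "/" in the pattern would need escaping.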
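
A quick way to convince yourself about delimiters is to write one pattern three ways. This is only a sketch with an invented subject string:

```php
<?php
$subject = 'path/to/file.php';

// The same pattern written with three different delimiters.
$withAt    = preg_match('@/to/@', $subject);   // no escaping needed
$withHash  = preg_match('#/to/#', $subject);   // no escaping needed
$withSlash = preg_match('/\/to\//', $subject); // slashes inside must be escaped

var_dump($withAt, $withHash, $withSlash); // each is int(1)
```

Picking a delimiter that does not occur in the pattern, as "@" does for HTML, keeps the pattern free of extra backslashes.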

Eagerly awaiting more of your vast knowledge!

vujsa


mastercomputers
What do you need to understand?

[^0-9]*? matches anything that's not a number. The ? after the * is the greedy inverter (a lazy, or non-greedy, quantifier), meaning it matches as few characters as it can and only grows the match when the rest of the pattern needs it to. e.g.

Take the string 'Hello, World!': [^0-9]*?! matches the whole string, because the lazy part has to keep expanding until it reaches the definite ! at the end. On its own, [^0-9]*? is satisfied with an empty match, so it isn't useful without something definite after it. If you want to grab everything that is not a number, leave out the greedy inverter.

[^0-9]* matches a run of anything that's not a number, e.g. in the string Hey1234You it matches Hey and then You (with preg_match_all, along with some empty matches around the digits), skipping the numbers. This is greedy because each match takes as many characters as it can, stopping only at a digit or the end of the input.

The s modifier makes the dot match newline characters too.

Greedy inverters stop greediness, e.g. .* will match anything and everything, except newline characters (unless the s modifier is set). .*? can also match anything, but it takes as little as possible, so it needs a definite end match after it.
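
The difference is easiest to see side by side. A short sketch, using a made-up string with two exclamation marks so the greedy and lazy results differ:

```php
<?php
$subject = 'Hello, World! Bye!';

preg_match('/[^0-9]*!/',  $subject, $greedy); // greedy: runs to the last !
preg_match('/[^0-9]*?!/', $subject, $lazy);   // lazy: stops at the first !
preg_match('/[^0-9]*?/',  $subject, $alone);  // lazy with nothing after it

var_dump($greedy[0]); // "Hello, World! Bye!"
var_dump($lazy[0]);   // "Hello, World!"
var_dump($alone[0]);  // "" (an empty match)
```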

If you could explain what you need more understanding with, I'd be glad to help.


Cheers,


MC



mastercomputers
OK, well, I missed your question, but matching the latest member bit is quite easy. We know for a fact that this is how it goes:

Latest Member: User Month

So from this we could do:

Latest\sMember:\s.+?(?=\s)

Notice any problems with this?

Well, \s matches a whitespace character (space, tab, newline).

Basically, we're assuming that a username does not contain a space. Now I'm not sure what characters a username can or can't contain without viewing the source. If you know for a fact that spaces can't be used inside a username, then this is quite an acceptable approach. I'm not using the s modifier, since the dot never needs to cross a newline here. What I'm assuming is that the username is separated by spaces; if usernames can contain spaces, this expression is useless.

+? means there must be an occurrence here, while not being greedy, i.e. it must have at least 1 match. * means 0 or many, + means 1 or many.

(?=\s) means look ahead to see if the next character is whitespace; this is so we don't include the space in the match. (?=...) is a lookahead.

* is the same as {0,}
+ is the same as {1,}
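
Here is a small sketch of what the lookahead buys you, run against a shortened copy of the sample output. The variable names are invented for the example.

```php
<?php
$text = 'Topics by 4 Members - Latest Member: marc April 12, 2005, 10:30:12 PM';

// With the lookahead, the space after the name is checked but not consumed.
preg_match('/Latest\sMember:\s.+?(?=\s)/', $text, $withLookahead);

// Without it, the space has to be matched and ends up in the result.
preg_match('/Latest\sMember:\s.+?\s/', $text, $withoutLookahead);

// {1,}? behaves the same as +? here.
preg_match('/Latest\sMember:\s.{1,}?(?=\s)/', $text, $braces);

var_dump($withLookahead[0]);    // "Latest Member: marc"
var_dump($withoutLookahead[0]); // "Latest Member: marc " (trailing space)
var_dump($braces[0]);           // "Latest Member: marc"
```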

We could use the modifiers to our advantage.

There are many methods, even using PHP's date function to supply the month name and matching up to, but not including, it.

I'll leave it up to you to decide how this should be done. If you knew what characters aren't allowed in the username, you would have a better chance of solving this problem.


Cheers,


MC


vujsa
Wow, MC, again a lot more information than I thought I needed to know.
That's what's nice about your posts, they answer the next three questions too. The PHP site has a very nice explanation of the differences between PHP and Perl regex. The problem is that I don't know Perl regex either.

As you mentioned, you haven't done much with PCRE, so I really appreciate you going ahead and helping anyhow. I tend to stay away from stuff if I have to research it before I can answer the question.

The explanation you gave for the
CODE
'@<script[^>]*?>.*?</script>@si',
was very helpful. Being the most complex statement in the array, it gives the most opportunity to learn from. Now that I know how it works, this statement will be my starting point for learning regex.

CODE
[^0-9]*?!
really clarified the greedy usage for me.

Here is what I came up with before you got back.
CODE
<?php
$filename = "http://www.forum500.com";
$html = file_get_contents($filename);

// $document should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.
$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                '@([\r\n])[\s]+@',                // Strip out white space
                '@&(quot|#34);@i',                // Replace HTML entities
                '@&(amp|#38);@i',
                '@&(lt|#60);@i',
                '@&(gt|#62);@i',
                '@&(nbsp|#160);@i',
                '@&(iexcl|#161);@i',
                '@&(cent|#162);@i',
                '@&(pound|#163);@i',
                '@&(copy|#169);@i',
                '@&#(\d+);@e',                    // evaluate as php
                );

$replace = array ('',
                '',
                '\1',
                '"',
                '&',
                '<',
                '>',
                ' ',
                chr(161),
                chr(162),
                chr(163),
                chr(169),
                'chr(\1)',
                );

$search2 = array ('@.*?Latest Member:\s@si');

$replace2 = array ('');

$text = preg_replace($search, $replace, $html);
$text2 = preg_replace($search2, $replace2, $text);

echo "<html><head><title>Content Capture</title></head><body><b>Test String:</b><br />\n $text <br /><br /><b>Extracted Data:</b><br />\n \n\n $text2  <br /><br /></body></html>";
?>


With the following output:
QUOTE
Test String:
Test Forum - Index Welcome, Guest. Please login or register. 1 Hour 1 Day 1 Week 1 Month Forever Login with username, password and session length Shoutbox guest123 : Hello SpaceWaste : meh though, Im off to bed, night man SpaceWaste : lol damn bugs....ah well. Its ok, this is just fine SpaceWaste : damn, youre right View All 2 Posts in 1 Topics by 4 Members - Latest Member: marc April 13, 2005, 10:34:58 PM Test Forum News SMF - Just Installed General Category Posts Topics Last post General Discussion Feel free to talk about anything and everything in this board. 2 1 April 06, 2005, 08:50:39 PM Re: Welcome to SMF! by Dooga Test Forum - Info Center Forum Stats Total Topics: 1Total Posts: 2 Latest Post: "Re: Welcome to SMF!" (April 06, 2005, 08:50:39 PM) View the 10 most recent posts on the forum. [More Stats] Total Members: 4 Latest Member: marc Users Online 1 Guest, 0 Users Most users online today: 1 Most users online ever: 4 ( April 10, 2005, 08:40:22 PM ) Test Forum | Powered by SMF 1.0.2. © 2001-2005, Lewis Media. All Rights Reserved. Helios design by Bloc

Extracted Data:
marc
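
One thing to watch with the replace approach: "Latest Member:" appears twice in the page text, and preg_replace() removes everything up to each occurrence but keeps whatever follows the last one. The sketch below uses a shortened, hand-copied piece of the output to show that, and then a capture group as an alternative. The (\S+) group assumes usernames contain no whitespace, which the thread has not confirmed.

```php
<?php
// Shortened copy of the stripped page text, for illustration only.
$text = 'by 4 Members - Latest Member: marc April 13, 2005, 10:34:58 PM Test Forum News '
      . 'Total Members: 4 Latest Member: marc Users Online 1 Guest, 0 Users';

// The replace approach strips up to each "Latest Member: " but leaves
// the text after the final occurrence in place.
$stripped = preg_replace('@.*?Latest Member:\s@si', '', $text);
echo $stripped, "\n"; // marc Users Online 1 Guest, 0 Users

// A capture group picks out just the name (first occurrence).
$name = null;
if (preg_match('@Latest Member:\s(\S+)@i', $text, $m)) {
    $name = $m[1];
}
echo $name, "\n"; // marc
```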


QUOTE(mastercomputers @ Apr 13 2005, 04:43 AM)
OK, well, I missed your question, but matching the latest member bit is quite easy. We know for a fact that this is how it goes.


Actually, the general explanation is quite helpful because ultimately, my example is just a way for me to learn how to do more complex things with regex.

Come to think of it, if I could just borrow your brain for a few days, it would really save on the typing.

Hey, thanks for all the help.

vujsa

By the way, here is the "s" modifier at the end of the regex, as explained by the PHP web site:
QUOTE(www.php.net)
s (PCRE_DOTALL)
  • If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
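
The quoted behaviour is easy to check with a two-line string. A minimal sketch, with an invented subject:

```php
<?php
$subject = "first line\nsecond line";

$withoutS = preg_match('/first.*second/',  $subject);   // dot stops at the newline
$withS    = preg_match('/first.*second/s', $subject);   // dot crosses the newline
$negClass = preg_match('/first[^x]*second/', $subject); // [^x] matches \n regardless

var_dump($withoutS, $withS, $negClass); // int(0), int(1), int(1)
```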


vizskywalker
I followed most of the explanation. But I'm still unclear as to the purpose of the "." in "@<script[^>]*?>.*?</script>@si". Also, what exactly does preg_replace() do? Thanks.

~Viz


vujsa
QUOTE(vizskywalker @ Apr 14 2005, 01:04 AM)
I followed most of the explanation.  But I'm still unclear as to the purpose of the "." in "@<script[^>]*?>.*?</script>@si".  Also, what exactly does preg_replace() do?  Thanks.




The "." matches any single character (with the s modifier, newlines included), and combined with *? it matches a run of them. So here is an example:
CODE
$string = "Hello world!  Welcome to my example. <br />\n Please feel free to increase my reputation points! <br />\n Thank you.";
$string2 = preg_replace('@.*?\sThank you.@si', 'Hello world!  How cool am I!', $string);

echo "<html><head><title>Content Capture</title></head><body><b>Test String:</b><br />\n $string <br /><br /><b>New Data:</b><br />\n \n\n $string2  <br /><br /></body></html>";


Returns:
QUOTE(Output)
Test String:
Hello world! Welcome to my example.
Please feel free to increase my reputation points!
Thank you.

New Data:
Hello world! How cool am I!


So '@.*?\sThank you.@si' means match all characters up to and including one whitespace character followed by "Thank you" and any one character (the unescaped "." at the end).

Basically, it acts like a wild card.

So for '@<script[^>]*?>.*?</script>@si' the "." means match all characters in between the <script ...> and </script> tags.

preg_replace() is the Perl-compatible regular expression replace function.
It works like ereg_replace() but is a little more complex, more flexible, and has a slightly different syntax. Basically, you can use an array of patterns that, when matched, will be replaced by the corresponding value in an array of replacement values.

Instead of typing ereg_replace() over and over, just use the array method with preg_replace().
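
As a rough illustration of the array method, the sketch below applies two patterns in one call and compares the result with chaining the calls by hand. The patterns and input are made up for the example; note that ereg_replace() itself no longer exists in PHP 7 and later.

```php
<?php
$input = "Hello   <b>world</b>,\n  welcome";

// Patterns are applied in order, each one to the result of the previous.
$patterns     = ['@<b>(.*?)</b>@i', '@\s+@'];
$replacements = ['*\1*',            ' '];
$viaArrays = preg_replace($patterns, $replacements, $input);

// The same thing written as two separate calls.
$step1      = preg_replace('@<b>(.*?)</b>@i', '*\1*', $input);
$viaChained = preg_replace('@\s+@', ' ', $step1);

echo $viaArrays, "\n"; // Hello *world*, welcome
```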
For more information, see the PCRE section of the PHP web site.
Hope this helps.

vujsa


overture
I'd just like to thank you for starting this topic, it was very convenient for me. This has helped me as much as it has helped you, Vujsa. An extra thanks to M^E for the extensive explanations. Thanks.


vizskywalker
Yes, it is very helpful to have this post. And I'm sure overture meant MC not M^E.

~Viz


^zer0dyer$
First of all, if you are just trying to escape HTML characters use:
-htmlentities() OR
-htmlspecialchars()
which are built-in PHP functions. If you are just trying to learn regexp, more power to you!
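
For reference, a short sketch of how the two functions differ. The sample link is invented, and the flags and charset arguments are one reasonable choice rather than the only one.

```php
<?php
// Built from pieces so the accented character is unambiguous UTF-8.
$raw = '<a href="x.php?a=1&b=2">caf' . "\u{00E9}" . '</a>';

// htmlspecialchars() only touches the characters that are special in HTML.
$special = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// htmlentities() also converts every character that has a named entity.
$entities = htmlentities($raw, ENT_QUOTES, 'UTF-8');

echo $special, "\n";  // ...&gt;café&lt;/a&gt;
echo $entities, "\n"; // ...&gt;caf&eacute;&lt;/a&gt;
```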

In PCRE regular expressions, there are several types of delimiters you can use for your patterns:
CODE
<?php
# This finds all opening tags
$pattern = "@<\w+[^>]*>@is"; // @ is the delimiter in pattern.
?>

That was a very eloquent pattern, mastercomputers. I just recently started using lookaheads/lookbehinds, and have had fun toying around with them.

Also, if you want to print out mastercomputers' result with preg_match, try the following for some good practice:
CODE
<?php
$file = "path/to/file";
$handle = @file_get_contents($file) or die("File not acquired!\n</body>\n</html>");
// This is safer!

$pattern = "#Latest\sMember:\s.+?(?=\s)#i";

if ( preg_match($pattern, $handle, $matches) )
{
  // print_r() needs true as its second argument to return a string
  print "<pre>\n".print_r($matches, true)."\n</pre>\n";
}
else
{
  print "<p>Nothing found =(";
}

# Finds all matches
if ( preg_match_all($pattern, $handle, $matches) )
{ # $matches is now a 2-dimensional array
  print "<pre>\n".print_r($matches, true)."\n</pre>\n";
}
else
{
  print "Nothing found =(";
}
?>
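
To make the "2-dimensional array" remark concrete, here is a sketch of what preg_match_all() fills in, using a shortened copy of the page text and an added capture group around the name (the group is an addition for illustration, not part of the pattern above).

```php
<?php
$text = 'by 4 Members - Latest Member: marc April 13, 2005 '
      . 'Total Members: 4 Latest Member: marc Users Online';

$pattern = '#Latest\sMember:\s(.+?)(?=\s)#i';

// Default order: $matches[0] holds the full matches, $matches[1] group 1.
$count = preg_match_all($pattern, $text, $matches);

// PREG_SET_ORDER groups the results per match instead.
preg_match_all($pattern, $text, $sets, PREG_SET_ORDER);

print_r($matches[1]); // both entries are "marc"
print_r($sets[0]);    // "Latest Member: marc", then "marc"
```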


Peace,
+CurTis-
