> Php Regular Expressions, Request For Help.
vujsa
post Apr 13 2005, 05:58 AM
Post #1


Absolute Newbie

Group: Admin
Posts: 888
Joined: 20-February 05
From: Indianapolis, Indiana, USA (Midwest)
Member No.: 2,714



So I'm trying to learn how to pull useful content from a web page.

Here's what I got so far:
CODE
<?php
$filename = "http://www.forum500.com";
$html = file_get_contents($filename);

/* **************************************************************** */
/*                                                                  */
/*             Got this code @ http://www.php.net/                  */
/*                                                                  */
/* **************************************************************** */
// $html should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.
$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                '@([\r\n])[\s]+@',                // Strip out white space
                '@&(quot|#34);@i',                // Replace HTML entities
                '@&(amp|#38);@i',
                '@&(lt|#60);@i',
                '@&(gt|#62);@i',
                '@&(nbsp|#160);@i',
                '@&(iexcl|#161);@i',
                '@&(cent|#162);@i',
                '@&(pound|#163);@i',
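                '@&(copy|#169);@i');              // numeric entities are handled below

$replace = array ('',
                '',
                '\1',
                '"',
                '&',
                '<',
                '>',
                ' ',
                chr(161),
                chr(162),
                chr(163),
                chr(169));

$text = preg_replace($search, $replace, $html);
// The /e modifier ('@&#(\d+);@e' => 'chr(\1)') was removed in PHP 7,
// so a callback decodes the remaining numeric entities instead.
$text = preg_replace_callback('@&#(\d+);@',
                function ($m) { return chr((int) $m[1]); }, $text);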
echo $text;
?>


The only line that I don't completely understand is:
CODE
'@<script[^>]*?>.*?</script>@si',
I know what it does, I just don't know how.

Here is the output:
QUOTE(Output)
Test Forum - Index Welcome, Guest. Please login or register. 1 Hour 1 Day 1 Week 1 Month Forever Login with username, password and session length Shoutbox guest123 : Hello SpaceWaste : meh though, Im off to bed, night man SpaceWaste : lol damn bugs....ah well. Its ok, this is just fine SpaceWaste : damn, youre right View All 2 Posts in 1 Topics by 4 Members - Latest Member: marc April 12, 2005, 10:30:12 PM Test Forum News SMF - Just Installed General Category Posts Topics Last post General Discussion Feel free to talk about anything and everything in this board. 2 1 April 06, 2005, 08:50:39 PM Re: Welcome to SMF! by Dooga Test Forum - Info Center Forum Stats Total Topics: 1Total Posts: 2 Latest Post: "Re: Welcome to SMF!" (April 06, 2005, 08:50:39 PM) View the 10 most recent posts on the forum. [More Stats] Total Members: 4 Latest Member: marc Users Online 1 Guest, 0 Users Most users online today: 1 Most users online ever: 4 ( April 10, 2005, 08:40:22 PM ) Test Forum | Powered by SMF 1.0.2. © 2001-2005, Lewis Media. All Rights Reserved. Helios design by Bloc


So let's say I wanted to get the name of the Latest Member.

So, how do I get just that information?

Any help here would be great.

vujsa
mastercomputers
post Apr 13 2005, 07:36 AM
Post #2


BUG.SWAT.PATROL

Group: Members
Posts: 626
Joined: 1-September 04
From: Auckland, New Zealand
Member No.: 27



QUOTE(vujsa @ Apr 13 2005, 06:58 PM)
The only line that I don't completely understand is:
CODE
'@<script[^>]*?>.*?</script>@si',
I know what it does, I just don't know how.
*



Some of this I don't understand, as I haven't learnt much PCRE syntax, but it shares a lot with Perl's and grep's way of doing things. I don't understand the @ bit, but it seems to mark the start and end of the regex, and si are modifiers: i is case insensitive; s, I'm not sure.

OK

<script[^>]*?> finds the <script language="javascript"> part. <script matches the exact string. [^>] is a negated character class, meaning any character that's not >. Combined with *? it grabs zero or more of those characters, as few as possible (the ? after the * makes it lazy, it doesn't make it optional). The > on the end then matches the end of the opening tag.

.*? means any characters, zero or more, again as few as possible, so there needn't be anything there at all.

And </script> matches the exact string. When you combine it all together it'll match anything like <script blah blah>anything here</script>, but it must find the exact strings it's asking for, which here are <script, > and </script>.
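
To see the pieces working together, here is a minimal sketch that runs the same pattern over a made-up snippet. The sample HTML and the variable names are assumptions for illustration only; they are not taken from the page being scraped.

```php
<?php
// Hypothetical sample input; not the real forum page.
$html = "<p>Before</p>\n<script type=\"text/javascript\">\nalert('hi');\n</script>\n<p>After</p>";

// <script    literal text
// [^>]*?     any characters except '>', as few as possible
// >          end of the opening tag
// .*?        anything (newlines too, thanks to the s modifier), as few as possible
// </script>  literal closing tag
$clean = preg_replace('@<script[^>]*?>.*?</script>@si', '', $html);

echo $clean, "\n";
```

Only the script block should disappear; the two paragraphs on either side are left alone.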

Sorry I have to head off for a bit, but I'll come back and explain anything else I've left out.

Cheers,


MC



vujsa
post Apr 13 2005, 07:52 AM
Post #3



Well, I've been playing around with this a bit now and am beginning to understand some of it.

The "@" is used as a delimiter. Nearly any character that isn't a letter, a digit, whitespace or the backslash "\" will work. Pipe "|" or slash "/" could just as easily have been used.
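
A small sketch of the delimiter point, assuming a made-up input string: the same pattern written with three different delimiters gives the same result. The only catch is that the delimiter character has to be escaped if it also appears inside the pattern.

```php
<?php
// Hypothetical input string for illustration.
$subject = 'a<b>c</b>d';

// Same pattern, three different delimiters.
$withAt   = preg_replace('@</?b>@', '', $subject);
$withHash = preg_replace('#</?b>#', '', $subject);
// With "/" as the delimiter, the "/" inside the pattern must be escaped.
$withSlash = preg_replace('/<\/?b>/', '', $subject);

// All three calls strip the <b> and </b> tags in the same way.
var_dump($withAt === $withHash && $withHash === $withSlash);
echo $withAt, "\n";
```

That is probably why the php.net example uses "@": HTML is full of slashes, so "@" saves a lot of escaping.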

Eagerly awaiting more of your vast knowledge!

vujsa
mastercomputers
post Apr 13 2005, 09:13 AM
Post #4




What do you need to understand?

[^0-9]*? matches anything that's not a digit. The ? after the * makes it lazy (non-greedy), meaning it matches as few characters as possible and only extends when the rest of the pattern needs it to. e.g.

With the string 'Hello, World!' you could do [^0-9]*?! and it'd match the whole string, because the lazy part has to keep extending until the ! at the end can match. On its own, [^0-9]*? just matches an empty string, since zero characters is the fewest it can take. If you want to grab a whole run of non-digits, leave the ? off.

[^0-9]* is greedy, so it takes as many non-digits as it can. In the string Hey1234You the first match is Hey, and a global match (preg_match_all) also finds You, skipping the digits (plus some empty matches at the digits, since * allows zero characters). Greedy means it keeps going until it hits a character it can't match or the end of the input.

The s modifier makes the dot match newline characters as well.

The lazy ? stops greediness, e.g. .* will match as much as it can (everything except newline characters, unless the s modifier is set), while .*? matches as little as it can, so it only makes sense with something definite after it to stop at.
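
Here is a minimal sketch of greedy versus lazy matching, using the two strings from this post plus one made-up HTML fragment. The variable names are only for illustration.

```php
<?php
// Lazy quantifier followed by a definite character: it extends until "!" matches.
preg_match('/[^0-9]*?!/', 'Hello, World!', $lazyWithEnd);

// Lazy quantifier on its own: zero characters is the shortest possible match.
preg_match('/[^0-9]*?/', 'Hello, World!', $lazyAlone);

// Greedy runs of non-digits; + is used here so empty matches are not reported.
preg_match_all('/[^0-9]+/', 'Hey1234You', $runs);

// Greedy versus lazy on the same (hypothetical) input.
preg_match('/<.*>/', '<b>bold</b>', $greedy);
preg_match('/<.*?>/', '<b>bold</b>', $lazy);

var_dump($lazyWithEnd[0], $lazyAlone[0], $runs[0], $greedy[0], $lazy[0]);
```

The last pair is the one that matters for tag stripping: the greedy version swallows everything from the first < to the last >, while the lazy version stops at the first > it can.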

If you could explain what you need more understanding with, I'd be glad to help.


Cheers,


MC

mastercomputers
post Apr 13 2005, 09:43 AM
Post #5




OK, well, I missed your question, but matching the latest member bit is quite easy. We know for a fact that this is how it goes:

Latest Member: User Month

So from this we could do:

Latest\sMember:\s.+?(?=\s)

Notice any problems with this?

Well, \s matches any whitespace character (space, tab or newline).

Basically, we're assuming that a username does not contain a space. Now I'm not sure what characters a username can accept or not accept without viewing the source. If you know for a fact that spaces can't be used inside a username, then this is quite an acceptable approach. I'm not using the s modifier because the dot never needs to cross a newline here. What I'm assuming is that the username is separated from the date by whitespace; if usernames can contain spaces, this expression is useless. Note too that the match still includes the "Latest Member: " label, so you'd have to strip that off afterwards.

+? means there must be at least one occurrence here, while not being greedy. * means 0 or more; + means 1 or more.

(?=\s) means look ahead to see if the next character is whitespace, so that we don't include it in the match. (?=...) is a lookahead.

* is the same as {0,}
+ is the same as {1,}

We could use the modifiers to our advantage.

There are many methods, even using PHP's date function and not matching the month as well.

I'll leave it up to you to decide how this should be done. If you knew what characters aren't allowed in the username, you would have a better chance of solving this problem.
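
As a rough sketch of the idea, here is the lookahead pattern run with preg_match() next to a capture-group variant. The capture group is a swapped-in alternative, not something from the original pattern, and both still assume that a username never contains whitespace. The $text excerpt is shortened from the output earlier in the thread.

```php
<?php
// Hypothetical excerpt of the stripped text, shortened from the output above.
$text = 'by 4 Members - Latest Member: marc April 12, 2005, 10:30:12 PM Test Forum News';

// Lookahead version from the post: the match stops before the whitespace,
// but it still includes the "Latest Member: " label.
preg_match('/Latest\sMember:\s.+?(?=\s)/', $text, $withLabel);

// Capture-group variant: \S+ grabs the run of non-whitespace characters,
// still assuming a username never contains whitespace.
preg_match('/Latest\sMember:\s(\S+)/', $text, $captured);

echo $withLabel[0], "\n";
echo $captured[1], "\n";
```

With the capture group the name lands in $captured[1] on its own, so no extra clean-up step is needed.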


Cheers,


MC
vujsa
post Apr 14 2005, 05:50 AM
Post #6




Wow, MC, again a lot more information than I thought I needed to know.
That's what's nice about your posts: they answer the next three questions too. The PHP site has a very nice explanation of the differences between PHP and Perl regex. The problem is that I don't know Perl regex either.

As you mentioned, you haven't done much with PCRE, so I really appreciate you going ahead and helping anyhow. I tend to stay away from stuff if I have to research it before I can answer the question.

The explanation you gave for the
CODE
'@<script[^>]*?>.*?</script>@si',
was very helpful. Being the most complex statement in the array, it gives the most opportunity to learn from. Now that I know how it works, this statement will be my starting point for learning regex.

CODE
[^0-9]*?!
really clarified the greedy usage for me.

Here is what I came up with before you got back.
CODE
<?php
$filename = "http://www.forum500.com";
$html = file_get_contents($filename);

// $html should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.
$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                '@([\r\n])[\s]+@',                // Strip out white space
                '@&(quot|#34);@i',                // Replace HTML entities
                '@&(amp|#38);@i',
                '@&(lt|#60);@i',
                '@&(gt|#62);@i',
                '@&(nbsp|#160);@i',
                '@&(iexcl|#161);@i',
                '@&(cent|#162);@i',
                '@&(pound|#163);@i',
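                '@&(copy|#169);@i',
                );

$replace = array ('',
                '',
                '\1',
                '"',
                '&',
                '<',
                '>',
                ' ',
                chr(161),
                chr(162),
                chr(163),
                chr(169),
                );

// Greedy .* on both sides keeps only the name after the last "Latest Member:"
// (assumes the name contains no whitespace). The lazy '@.*?Latest Member:\s@si'
// with an empty replacement would leave all the text after the name in place.
$search2 = array ('@.*Latest Member:\s(\S+).*@si');

$replace2 = array ('\1');

$text = preg_replace($search, $replace, $html);
// The /e modifier ('@&#(\d+);@e' => 'chr(\1)') was removed in PHP 7,
// so a callback decodes the remaining numeric entities instead.
$text = preg_replace_callback('@&#(\d+);@',
                function ($m) { return chr((int) $m[1]); }, $text);
$text2 = preg_replace($search2, $replace2, $text);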

echo "<html><head><title>Content Capture</title></head><body><b>Test String:</b><br />\n $text <br /><br /><b>Extracted Data:</b><br />\n \n\n $text2  <br /><br /></body></html>";
?>


With the following output:
QUOTE
Test String:
Test Forum - Index Welcome, Guest. Please login or register. 1 Hour 1 Day 1 Week 1 Month Forever Login with username, password and session length Shoutbox guest123 : Hello SpaceWaste : meh though, Im off to bed, night man SpaceWaste : lol damn bugs....ah well. Its ok, this is just fine SpaceWaste : damn, youre right View All 2 Posts in 1 Topics by 4 Members - Latest Member: marc April 13, 2005, 10:34:58 PM Test Forum News SMF - Just Installed General Category Posts Topics Last post General Discussion Feel free to talk about anything and everything in this board. 2 1 April 06, 2005, 08:50:39 PM Re: Welcome to SMF! by Dooga Test Forum - Info Center Forum Stats Total Topics: 1Total Posts: 2 Latest Post: "Re: Welcome to SMF!" (April 06, 2005, 08:50:39 PM) View the 10 most recent posts on the forum. [More Stats] Total Members: 4 Latest Member: marc Users Online 1 Guest, 0 Users Most users online today: 1 Most users online ever: 4 ( April 10, 2005, 08:40:22 PM ) Test Forum | Powered by SMF 1.0.2. © 2001-2005, Lewis Media. All Rights Reserved. Helios design by Bloc

Extracted Data:
marc


QUOTE(mastercomputers @ Apr 13 2005, 04:43 AM)
OK well I missed your question, but to match this the latest member bit is quite easy, we know for a fact that this is how it goes.
Cheers,
MC
*


Actually, the general explanation is quite helpful because ultimately, my example is just a way for me to learn how to do more complex things with regex.

Come to think of it, if I could just borrow your brain for a few days, it would really save on the typing.

Hey, thanks for all the help.

vujsa

By the way, the "s" modifier at the end of the regex as explained by the PHP web site:
QUOTE(www.php.net)
s (PCRE_DOTALL)
  • If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
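
A quick sketch of what that means in practice, using a made-up two-line string. It also checks the last sentence of the quote, that a negated class matches a newline whether or not the modifier is set.

```php
<?php
// Hypothetical two-line input.
$text = "first line\nsecond line";

// Without s, the dot stops at the newline, so the pattern cannot reach "second".
$without = preg_match('/first.*second/', $text);

// With s (PCRE_DOTALL), the dot crosses the newline.
$with = preg_match('/first.*second/s', $text);

// A negated class matches the newline regardless of the modifier.
$negated = preg_match('/first[^x]*second/', $text);

var_dump($without, $with, $negated);
```

This is why the script-stripping pattern needs s: a script block nearly always spans several lines.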
vizskywalker
post Apr 14 2005, 06:04 AM
Post #7


Techno-Necromancer

Group: Members
Posts: 1,018
Joined: 13-January 05
From: The Net
Member No.: 2,127



I followed most of the explanation. But I'm still unclear as to the purpose of the "." in "@<script[^>]*?>.*?</script>@si". Also, what exactly does preg_replace() do? Thanks.

~Viz
vujsa
post Apr 14 2005, 06:59 AM
Post #8




QUOTE(vizskywalker @ Apr 14 2005, 01:04 AM)
I followed most of the explanation.  But I'm still unclear as to the purpose of the "." in "@<script[^>]*?>.*?</script>@si".  Also, what exactly does preg_replace() do?  Thanks.

~Viz
*



The "." matches any single character (with the s modifier, that includes newlines). So here is an example:
CODE
$string = "Hello world!  Welcome to my example. <br />\n Please feel free to increase my reputation points! <br />\n Thank you.";
$string2 = preg_replace('@.*?\sThank you.@si', 'Hello world!  How cool am I!', $string);

echo "<html><head><title>Content Capture</title></head><body><b>Test String:</b><br />\n $string <br /><br /><b>New Data:</b><br />\n \n\n $string2  <br /><br /></body></html>";


Returns:
QUOTE(Output)
Test String:
Hello world! Welcome to my example.
Please feel free to increase my reputation points!
Thank you.

New Data:
Hello world! How cool am I!


So '@.*?\sThank you.@si' means match all characters up to and including one whitespace character followed by "Thank you."

Basically, the "." acts like a wildcard.

So for '@<script[^>]*?>.*?</script>@si' the "." means match all characters in between the <script ...> and </script> tags.

preg_replace() is PHP's Perl-compatible regular expression replace function.
It works like ereg_replace() but is a little more complex, more flexible, and has a slightly different syntax. Basically, you can pass an array of patterns, and each match is replaced by the corresponding value in an array of replacement values.

Instead of calling ereg_replace() over and over, just use the array method with preg_replace().
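
A minimal sketch of the array method, with a made-up input string and a shortened pattern list (not the full list from the first post): the patterns are applied one after another, each paired with the replacement at the same position.

```php
<?php
// Hypothetical input for illustration.
$input = '<b>Fish &amp; chips</b> &copy; 2005';

// Pattern N is replaced by replacement N, in order.
$patterns = array('@<[^>]*>@', '@&amp;@i', '@&copy;@i');
$replacements = array('', '&', '(c)');

$output = preg_replace($patterns, $replacements, $input);

echo $output, "\n";
```

Because the replacements run in sequence, the order of the arrays matters: each later pattern sees the result of the earlier ones.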
For more information, try the PCRE section of the PHP web site.
Hope this helps.

vujsa

This post has been edited by vujsa: Apr 14 2005, 09:59 PM
overture
post Apr 14 2005, 12:40 PM
Post #9


Premium Member

Group: Members
Posts: 208
Joined: 6-September 04
From: England
Member No.: 315



I'd just like to thank you for starting this topic; it was very convenient for me. This has helped me as much as it has helped you, vujsa. An extra thanks to M^E for the extensive explanations. Thanks.
vizskywalker
post Apr 14 2005, 08:56 PM
Post #10




Yes, it is very helpful to have this post. And I'm sure overture meant MC not M^E.

~Viz