Apr 13 2005, 05:58 AM
Post
#1
|
|
|
Absolute Newbie Group: Admin Posts: 888 Joined: 20-February 05 From: Indianapolis, Indiana, USA (Midwest) Member No.: 2,714 |
So I'm trying to learn how to pull useful content from a web page.
Here's what I got so far:
CODE
<?php
$filename = "http://www.forum500.com";
$html = file_get_contents($filename);

/* **************************************************************** */
/*                                                                  */
/*              Got this code @ http://www.php.net/                 */
/*                                                                  */
/* **************************************************************** */

// $html should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.
$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                 '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                 '@([\r\n])[\s]+@',                // Strip out white space
                 '@&(quot|#34);@i',                // Replace HTML entities
                 '@&(amp|#38);@i',
                 '@&(lt|#60);@i',
                 '@&(gt|#62);@i',
                 '@&(nbsp|#160);@i',
                 '@&(iexcl|#161);@i',
                 '@&(cent|#162);@i',
                 '@&(pound|#163);@i',
                 '@&(copy|#169);@i',
                 '@&#(\d+);@e');                   // evaluate as php
$replace = array ('', '', '\1', '"', '&', '<', '>', ' ', chr(161), chr(162), chr(163), chr(169), 'chr(\1)');

$text = preg_replace($search, $replace, $html);
echo $text;
?>
The only line that I don't completely understand is:
CODE
'@<script[^>]*?>.*?</script>@si',
I know what it does, I just don't know how.

Here is the output:
QUOTE(Output)
Test Forum - Index Welcome, Guest. Please login or register. 1 Hour 1 Day 1 Week 1 Month Forever Login with username, password and session length Shoutbox guest123 : Hello SpaceWaste : meh though, Im off to bed, night man SpaceWaste : lol damn bugs....ah well. Its ok, this is just fine SpaceWaste : damn, youre right View All 2 Posts in 1 Topics by 4 Members - Latest Member: marc April 12, 2005, 10:30:12 PM Test Forum News SMF - Just Installed General Category Posts Topics Last post General Discussion Feel free to talk about anything and everything in this board. 2 1 April 06, 2005, 08:50:39 PM Re: Welcome to SMF! by Dooga Test Forum - Info Center Forum Stats Total Topics: 1Total Posts: 2 Latest Post: "Re: Welcome to SMF!" (April 06, 2005, 08:50:39 PM) View the 10 most recent posts on the forum. [More Stats] Total Members: 4 Latest Member: marc Users Online 1 Guest, 0 Users Most users online today: 1 Most users online ever: 4 ( April 10, 2005, 08:40:22 PM ) Test Forum | Powered by SMF 1.0.2. © 2001-2005, Lewis Media. All Rights Reserved. Helios design by Bloc

So let's say I wanted to get the name of the Latest Member (it shows up as "Latest Member: marc" in the output above). So, how do I get just that information? Any help here would be great.
vujsa
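A note for anyone trying the snippet on a current PHP: the /e modifier used in '@&#(\d+);@e' was deprecated in PHP 5.5 and removed in PHP 7.0, so that last pattern no longer works as written. Below is a minimal sketch of the same numeric-entity step using preg_replace_callback(). The function name decodeNumericEntities is made up for illustration and does not come from the thread.

```php
<?php
// Hypothetical helper (not from the thread): decode numeric entities such
// as &#65; without relying on the removed /e modifier.
function decodeNumericEntities(string $text): string
{
    return preg_replace_callback(
        '@&#(\d+);@',
        function (array $m): string {
            // Same limitation as the original chr(\1): chr() yields a single
            // byte, so this is only meaningful for codes 0-255.
            return chr((int) $m[1]);
        },
        $text
    );
}

$decoded = decodeNumericEntities('&#65;&#66; and &#67;');
echo $decoded, "\n";
```

The callback receives the match array, so $m[1] plays the role that \1 played in the old replacement string. For general entity decoding, PHP's built-in html_entity_decode() is probably the simpler route.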
|
|
|
Apr 13 2005, 07:36 AM
Post
#2
|
|
|
BUG.SWAT.PATROL Group: Members Posts: 626 Joined: 1-September 04 From: Auckland, New Zealand Member No.: 27 |
QUOTE(vujsa @ Apr 13 2005, 06:58 PM)
The only line that I don't completely understand is:
CODE
'@<script[^>]*?>.*?</script>@si',
I know what it does, I just don't know how.

Some of this I don't understand, as I haven't learned much PCRE syntax, but it has a lot in common with Perl's and grep's.

I don't understand the @ bit, but it seems to mark the start and end of the regex, and si are modifiers: i means case-insensitive; s, I'm not sure.

OK, <script[^>]*?> finds the <script language="javascript"> part. <script matches that exact string. [^>] is a negated character class, meaning any character that is not >. Combined with *? it matches zero or more of those characters, taking as few as it can (non-greedy). The > after it marks the end of the opening tag. .*? means any characters, zero or more, again as few as possible, so the part between the tags may be empty. </script> matches that exact string.

Combine it all together and it matches anything like <script blah blah>anything here</script>, but the exact strings it asks for must be there: <script, > and </script>.

Sorry, I have to head off for a bit, but I'll come back and explain anything else I've left out.
Cheers, MC
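To see what the lazy .*? buys in that pattern, here is a small sketch that runs it next to a greedy variant. The sample HTML string is invented for the demonstration.

```php
<?php
// Invented sample: two script blocks with a paragraph between them.
$html = "<p>Hi</p><SCRIPT type=\"text/javascript\">\nvar a = 1;\n</script>"
      . "<p>Bye</p><script>b()</script>";

// Lazy .*? stops at the first </script>; i ignores case; s lets . cross newlines.
$lazy = preg_replace('@<script[^>]*?>.*?</script>@si', '', $html);

// Greedy .* runs on to the LAST </script>, swallowing the paragraph in between.
$greedy = preg_replace('@<script[^>]*?>.*</script>@si', '', $html);

echo $lazy, "\n";
echo $greedy, "\n";
```

With the lazy version each script block is removed on its own and "<p>Bye</p>" survives; the greedy version removes everything from the first opening tag to the final closing tag.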
|
|
|
Apr 13 2005, 07:52 AM
Post
#3
|
|
|
Absolute Newbie Group: Admin Posts: 888 Joined: 20-February 05 From: Indianapolis, Indiana, USA (Midwest) Member No.: 2,714 |
Well, I've been playing around with this a bit now and am beginning to understand some of it.
The "@" is used as a delimiter. Nearly any character will work, I guess, as long as it isn't a letter, a digit, whitespace, or the backslash "\". Pipe "|" or slash "/" could just as easily have been used. Eagerly awaiting more of your vast knowledge!
vujsa
|
|
|
Apr 13 2005, 09:13 AM
Post
#4
|
|
|
BUG.SWAT.PATROL Group: Members Posts: 626 Joined: 1-September 04 From: Auckland, New Zealand Member No.: 27 |
What do you need to understand?
[^0-9]*? matches characters that are not digits. *? is the greedy inverter (a lazy quantifier), meaning it takes as few characters as it can and only grows when the rest of the pattern needs it to.

E.g. with the string 'Hello, World!' you could do [^0-9]*?! and it'd match the whole string, because there is a definite match for the ! character after it. If you left the ! out, [^0-9]*? on its own would be satisfied with an empty match, which is no use. If you wanted to match a run of anything that is not a number, you would leave out the greedy inverter: [^0-9]*. E.g. on the string Hey1234You it matches Hey, and matching again further along picks up You, skipping the numbers. This is greedy because nothing stops it other than hitting a digit or reaching the end of the input.

The s modifier makes . match newline characters as well.

Greedy inverters stop greediness. E.g. .* will match as much as it can: everything up to a newline, or everything to the end of the input if the s modifier is set. .*? matches as little as it can, so it needs a definite match after it to be useful.

If you could explain what you need more understanding with, I'd be glad to help.
Cheers, MC
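Greedy versus lazy behaviour is easy to misjudge, so it may be worth checking against a real run. The sketch below uses the two sample strings from this post; the variable names are arbitrary.

```php
<?php
// Greedy: [^0-9]* takes as many non-digits as it can from the start.
preg_match('@[^0-9]*@', 'Hey1234You', $greedy);

// Lazy with nothing after it: satisfied by an empty match at the start.
preg_match('@[^0-9]*?@', 'Hey1234You', $lazyAlone);

// Lazy followed by a literal "!": grows only until the "!" can match.
preg_match('@[^0-9]*?!@', 'Hello, World!', $lazyBang);

// Collect every run of non-digits.
preg_match_all('@[^0-9]+@', 'Hey1234You', $runs);

// Getting "HeyYou" as one string means removing the digits instead.
$noDigits = preg_replace('@[0-9]+@', '', 'Hey1234You');

var_dump($greedy[0], $lazyAlone[0], $lazyBang[0], $runs[0], $noDigits);
```

Note that a single match can never skip over the digits: the pieces "Hey" and "You" come back as separate matches, and producing "HeyYou" takes a replace.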
|
|
|
Apr 13 2005, 09:43 AM
Post
#5
|
|
|
BUG.SWAT.PATROL Group: Members Posts: 626 Joined: 1-September 04 From: Auckland, New Zealand Member No.: 27 |
OK, well, I missed your question, but matching the latest member bit is quite easy. We know for a fact that this is how it goes:
Latest Member: User Month

So from this we could do:
Latest\sMember:\s.+?(?=\s)

Notice any problems with this? Well, \s matches a whitespace character, so basically we're assuming that a username does not contain a space. Now, I'm not sure what characters a username can or can't contain without viewing the source. If you know for a fact that spaces can't be used inside a username, then this is quite an acceptable approach. I'm not using the s modifier, since nothing here needs . to match across a newline. What I'm assuming is that the username is separated from what follows by a space; if usernames can contain spaces, this expression is useless.

+? means there must be an occurrence here, while not being greedy, i.e. it must match at least one character. * means 0 or many, + means 1 or many. (?=\s) means look ahead to see if the next character is whitespace; this is so we don't include the space in the match. (?=) is lookahead.
* is the same as {0,}
+ is the same as {1,}

We could use the modifiers to our advantage. There are many methods, even using PHP's date function to find where the month starts rather than matching on the space. I'll leave it up to you to decide how this should be done. If you knew what characters aren't allowed in the username, you would have a better chance of solving this problem.
Cheers, MC
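Here is a minimal sketch of that expression in use with preg_match(), plus a capture-group variant that returns the name on its own. The sample text is a fragment of the output posted earlier in the thread; the capture-group pattern is an alternative of my own, under the same assumption that usernames contain no spaces.

```php
<?php
// Fragment of the stripped page text from earlier in the thread.
$text = 'View All 2 Posts in 1 Topics by 4 Members - Latest Member: marc '
      . 'April 12, 2005, 10:30:12 PM Test Forum';

// Lookahead version from the post: the match still includes the label.
preg_match('@Latest\sMember:\s.+?(?=\s)@', $text, $withLabel);

// Capture-group variant (an assumption, not from the post): \S+ is one or
// more non-whitespace characters, and the parentheses isolate the name.
preg_match('@Latest\sMember:\s(\S+)@', $text, $captured);

echo $withLabel[0], "\n";
echo $captured[1], "\n";
```

The lookahead keeps the trailing space out of the match but not the "Latest Member: " label, so getting just the name takes either a capture group or a second step to trim the label.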
|
|
|
Apr 14 2005, 05:50 AM
Post
#6
|
|
|
Absolute Newbie Group: Admin Posts: 888 Joined: 20-February 05 From: Indianapolis, Indiana, USA (Midwest) Member No.: 2,714 |
Wow, MC, again a lot more information than I thought I needed to know.
That's what's nice about your posts: they answer the next three questions too. The PHP site has a very nice explanation of the differences between PHP and Perl regex. The problem is that I don't know Perl regex either. As you mentioned, you haven't done much with PCRE, so I really appreciate you going ahead and helping anyhow. I tend to stay away from stuff if I have to research it before I can answer the question.

The explanation you gave for
CODE
'@<script[^>]*?>.*?</script>@si',
was very helpful. Being the most complex statement in the array, it gives the most opportunity to learn from. Now that I know how it works, this statement will be my starting point for learning regex.
CODE
[^0-9]*?!
really clarified greedy usage for me.

Here is what I came up with before you got back.
CODE
<?php
$filename = "http://www.forum500.com";
$html = file_get_contents($filename);

// $html should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.
$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                 '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                 '@([\r\n])[\s]+@',                // Strip out white space
                 '@&(quot|#34);@i',                // Replace HTML entities
                 '@&(amp|#38);@i',
                 '@&(lt|#60);@i',
                 '@&(gt|#62);@i',
                 '@&(nbsp|#160);@i',
                 '@&(iexcl|#161);@i',
                 '@&(cent|#162);@i',
                 '@&(pound|#163);@i',
                 '@&(copy|#169);@i',
                 '@&#(\d+);@e',                    // evaluate as php
                 );
$replace = array ('', '', '\1', '"', '&', '<', '>', ' ', chr(161), chr(162), chr(163), chr(169), 'chr(\1)', );

$search2 = array ('@.*?Latest Member:\s@si');
$replace2 = array ('');

$text = preg_replace($search, $replace, $html);
$text2 = preg_replace($search2, $replace2, $text);

echo "<html><head><title>Content Capture</title></head><body><b>Test String:</b><br />\n $text <br /><br /><b>Extracted Data:</b><br />\n \n\n $text2 <br /><br /></body></html>";
?>
With the following output:
QUOTE
Test String: Test Forum - Index Welcome, Guest. Please login or register. 1 Hour 1 Day 1 Week 1 Month Forever Login with username, password and session length Shoutbox guest123 : Hello SpaceWaste : meh though, Im off to bed, night man SpaceWaste : lol damn bugs....ah well. Its ok, this is just fine SpaceWaste : damn, youre right View All 2 Posts in 1 Topics by 4 Members - Latest Member: marc April 13, 2005, 10:34:58 PM Test Forum News SMF - Just Installed General Category Posts Topics Last post General Discussion Feel free to talk about anything and everything in this board. 2 1 April 06, 2005, 08:50:39 PM Re: Welcome to SMF! by Dooga Test Forum - Info Center Forum Stats Total Topics: 1Total Posts: 2 Latest Post: "Re: Welcome to SMF!" (April 06, 2005, 08:50:39 PM) View the 10 most recent posts on the forum. [More Stats] Total Members: 4 Latest Member: marc Users Online 1 Guest, 0 Users Most users online today: 1 Most users online ever: 4 ( April 10, 2005, 08:40:22 PM ) Test Forum | Powered by SMF 1.0.2. © 2001-2005, Lewis Media. All Rights Reserved. Helios design by Bloc
Extracted Data: marc

QUOTE(mastercomputers @ Apr 13 2005, 04:43 AM)
OK well I missed your question, but to match this the latest member bit is quite easy, we know for a fact that this is how it goes. Cheers, MC

Actually, the general explanation is quite helpful because ultimately, my example is just a way for me to learn how to do more complex things with regex. Come to think of it, if I could just borrow your brain for a few days, it would really save on the typing. Hey, thanks for all the help.
vujsa

By the way, the "s" modifier at the end of the regex, as explained by the PHP web site:
QUOTE(www.php.net)
s (PCRE_DOTALL)
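A quick way to see what PCRE_DOTALL changes is to run one pattern with and without the s modifier against a string that contains newlines. This is only a sketch, with an invented three-line test string.

```php
<?php
// Invented test string spanning three lines.
$text = "start\nmiddle\nend";

// Without s, "." does not match a newline, so .* cannot bridge the lines.
$withoutS = preg_match('@start.*end@', $text);

// With s (PCRE_DOTALL), "." matches newlines too, so the pattern succeeds.
$withS = preg_match('@start.*end@s', $text);

var_dump($withoutS, $withS);
```

preg_match() returns 1 when the pattern matches and 0 when it does not, which is why the script-stripping pattern needs s to cope with scripts that span several lines.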
|
|
|
|
Apr 14 2005, 06:04 AM
Post
#7
|
|
|
Techno-Necromancer Group: Members Posts: 1,018 Joined: 13-January 05 From: The Net Member No.: 2,127 |
I followed most of the explanation. But I'm still unclear as to the purpose of the "." in "@<script[^>]*?>.*?</script>@si". Also, what exactly does preg_replace() do? Thanks.
~Viz |
|
|
|
Apr 14 2005, 06:59 AM
Post
#8
|
|
|
Absolute Newbie Group: Admin Posts: 888 Joined: 20-February 05 From: Indianapolis, Indiana, USA (Midwest) Member No.: 2,714 |
QUOTE(vizskywalker @ Apr 14 2005, 01:04 AM)
I followed most of the explanation. But I'm still unclear as to the purpose of the "." in "@<script[^>]*?>.*?</script>@si". Also, what exactly does preg_replace() do? Thanks. ~Viz

The "." matches any single character (including a newline when the "s" modifier is set), so ".*?" matches a run of any characters. Here is an example:
CODE
$string = "Hello world! Welcome to my example. <br />\n Please feel free to increase my reputation points! <br />\n Thank you.";
$string2 = preg_replace('@.*?\sThank you.@si', 'Hello world! How cool am I!', $string);
echo "<html><head><title>Content Capture</title></head><body><b>Test String:</b><br />\n $string <br /><br /><b>New Data:</b><br />\n \n\n $string2 <br /><br /></body></html>";
Returns:
QUOTE(Output)
Test String: Hello world! Welcome to my example. Please feel free to increase my reputation points! Thank you.
New Data: Hello world! How cool am I!

So '@.*?\sThank you.@si' means match all characters up to and including one whitespace character followed by "Thank you" and one more character (the final "." is not escaped, so it matches any character, here the full stop). Basically it acts like a wild card. So for '@<script[^>]*?>.*?</script>@si' the "." means match all characters in between the <script ...> and </script> tags.

preg_replace() is the Perl-compatible regular expression replace function. It works like ereg_replace() but is a little more complex, more flexible, and has slightly different syntax. Basically, you can use an array of patterns that, when matched, will be replaced by the corresponding value in an array of replacement values. Instead of typing ereg_replace() over and over, just use the array method with preg_replace(). For more information try the PHP Web Site About PCRE.

Hope this helps.
vujsa

This post has been edited by vujsa: Apr 14 2005, 09:59 PM
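The array form of preg_replace() can be sketched on a small scale as below. The patterns and the input string are made up for the demonstration; the point is that each pattern is paired with the replacement at the same index, and that the pairs are applied one after another to the result so far.

```php
<?php
// Illustrative patterns, not the ones from the thread.
$patterns     = array('@<br\s*/?>@i', '@&amp;@i', '@\s+@');
$replacements = array(' ',            '&',        ' ');

$input  = "Tom &amp; Jerry<br />\n  say   hello";

// Pass 1 turns <br /> into a space, pass 2 decodes &amp;,
// pass 3 collapses every run of whitespace into a single space.
$output = preg_replace($patterns, $replacements, $input);

echo $output, "\n";
```

Because the passes run in order, the whitespace clean-up in the last pattern also tidies the space introduced by the first one.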
|
|
|
Apr 14 2005, 12:40 PM
Post
#9
|
|
|
Premium Member Group: Members Posts: 208 Joined: 6-September 04 From: England Member No.: 315 |
I'd just like to thank you for starting this topic; it was very convenient for me.
|
|
|
|
Apr 14 2005, 08:56 PM
Post
#10
|
|
|
Techno-Necromancer Group: Members Posts: 1,018 Joined: 13-January 05 From: The Net Member No.: 2,127 |
Yes, it is very helpful to have this post. And I'm sure overture meant MC not M^E.
~Viz |
|
|
|