Php Regular Expressions Request For Help.


11 replies to this topic

#1 vujsa

vujsa

    Absolute Newbie

  • Members
  • 888 posts
  • Gender:Male
  • Location:Indianapolis, Indiana, USA (Midwest)
  • myCENTs:35.43

Posted 13 April 2005 - 05:58 AM

So I'm trying to learn how to pull useful content from a web page.

Here's what I got so far:
<?php 
$filename = "http://www.forum500.com";
$html = file_get_contents($filename);

/* **************************************************************** */
/*                                                                  */
/*             Got this code @ http://www.php.net/                  */
/*                                                                  */
/* **************************************************************** */
// $html should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.
$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                 '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                 '@([\r\n])[\s]+@',                // Strip out white space
                 '@&(quot|#34);@i',                // Replace HTML entities
                 '@&(amp|#38);@i',
                 '@&(lt|#60);@i',
                 '@&(gt|#62);@i',
                 '@&(nbsp|#160);@i',
                 '@&(iexcl|#161);@i',
                 '@&(cent|#162);@i',
                 '@&(pound|#163);@i',
                 '@&(copy|#169);@i',
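                 );

$replace = array ('',
                 '',
                 '\1',
                 '"',
                 '&',
                 '<',
                 '>',
                 ' ',
                 chr(161),
                 chr(162),
                 chr(163),
                 chr(169));

$text = preg_replace($search, $replace, $html);
// The /e (eval) modifier was removed in PHP 7.0, so numeric entities
// such as &#233; are converted with a callback instead of 'chr(\1)'.
$text = preg_replace_callback('@&#(\d+);@', function ($m) {
    return chr((int) $m[1]);
}, $text);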
echo $text;
?>

The only line that I don't completely understand is:
'@<script[^>]*?>.*?</script>@si',
I know what it does, I just don't know how.

Here is the output:

Test Forum - Index Welcome, Guest. Please login or register. 1 Hour 1 Day 1 Week 1 Month Forever Login with username, password and session length Shoutbox guest123 : Hello SpaceWaste : meh though, Im off to bed, night man SpaceWaste : lol damn bugs....ah well. Its ok, this is just fine SpaceWaste : damn, youre right View All 2 Posts in 1 Topics by 4 Members - Latest Member: marc April 12, 2005, 10:30:12 PM Test Forum News SMF - Just Installed General Category Posts Topics Last post General Discussion Feel free to talk about anything and everything in this board. 2 1 April 06, 2005, 08:50:39 PM Re: Welcome to SMF! by Dooga Test Forum - Info Center Forum Stats Total Topics: 1Total Posts: 2 Latest Post: "Re: Welcome to SMF!" (April 06, 2005, 08:50:39 PM) View the 10 most recent posts on the forum. [More Stats] Total Members: 4 Latest Member: marc Users Online 1 Guest, 0 Users Most users online today: 1 Most users online ever: 4 ( April 10, 2005, 08:40:22 PM ) Test Forum | Powered by SMF 1.0.2. © 2001-2005, Lewis Media. All Rights Reserved. Helios design by Bloc


So let's say I wanted to get the name of the Latest Member, which appears in the output above as "Latest Member: marc".

So, how do I get just that information?

Any help here would be great. :rolleyes:

vujsa

#2 Guest_mastercomputers_*

Guest_mastercomputers_*
  • Guests

Posted 13 April 2005 - 07:36 AM

I don't understand some of this, as I never learned much PCRE syntax, but it shares a lot with Perl's and grep's regex syntax. I don't understand the @ bit, but it seems to mark the start and end of the regex, and s and i are modifiers: i means case insensitive; s, I'm not sure.

OK

<script[^>]*?> finds the <script language="javascript"> part. <script matches that exact string. [^>] is a negated character class: it matches any one character that is not >. The *? after it is a lazy (non-greedy) quantifier: zero or more of those characters, but as few as possible. The > then matches the end of the opening tag.

.*? means any characters, zero or more, again as few as possible, so the script body can be empty.

</script> matches that exact string. Combined, the pattern matches anything of the form <script blah blah>anything here</script>, as long as the literal pieces <script, > and </script> are all present.
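To see the pieces of that pattern working together, here is a minimal sketch that runs it over a made-up snippet. The sample HTML and the variable names are invented for illustration and are not taken from the page in this thread.

```php
<?php
// Hypothetical sample input with two script blocks, one of them multi-line.
$html = "<p>Before</p>\n"
      . "<SCRIPT language=\"javascript\">\nalert('one');\n</SCRIPT>\n"
      . "<p>Between</p>\n"
      . "<script>alert('two');</script>\n"
      . "<p>After</p>";

// <script     literal text (the i modifier also accepts <SCRIPT)
// [^>]*?      any attributes: characters that are not ">"
// >           end of the opening tag
// .*?         the script body, as short as possible (s lets "." cross newlines)
// </script>   literal closing tag
$stripped = preg_replace('@<script[^>]*?>.*?</script>@si', '', $html);

echo $stripped, "\n";
```

Because .*? is lazy, each script block is removed on its own and the paragraph between the two blocks survives.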

Sorry I have to head off for a bit, but I'll come back and explain anything else I've left out.

Cheers,


MC

#3 vujsa

Posted 13 April 2005 - 07:52 AM

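Well, I've been playing around with this a bit now and am beginning to understand some of it.

The "@" is used as a delimiter. Any character that is not alphanumeric, a backslash "\", or whitespace can be a delimiter. Pipe "|" or slash "/" could just as easily have been used, though the delimiter then has to be escaped wherever it appears inside the pattern.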
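As a small sketch of the delimiter point, the same pattern can be written with different delimiters and behaves identically. The sample string and variable names here are made up for the example.

```php
<?php
$text = 'Total Members: 4';

// The same pattern written with three different delimiters.
$patterns = array(
    '@Members:\s(\d+)@',
    '/Members:\s(\d+)/',
    '#Members:\s(\d+)#',
);

$counts = array();
foreach ($patterns as $pattern) {
    preg_match($pattern, $text, $m);
    $counts[] = $m[1];
}
print_r($counts);

// If the delimiter occurs inside the pattern it must be escaped,
// which is why "@" or "#" is handy for patterns full of slashes.
$closesWithSlash = preg_match('/<\/script>/', '</script>');
$closesWithAt    = preg_match('@</script>@', '</script>');
```

Picking a delimiter that does not occur in the pattern keeps HTML-matching expressions free of backslashes.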

Eagerly awaiting more of your vast knowledge! :rolleyes:

vujsa

#4 Guest_mastercomputers_*

Posted 13 April 2005 - 09:13 AM

What do you need to understand?

[^0-9]*? matches anything that's not a digit. The *? is the lazy (non-greedy) form of *: it matches as few characters as possible and only takes more when the rest of the pattern needs them in order to match. e.g.

On the string 'Hello, World!' you could use [^0-9]*?! and it would match the whole string, because the lazy part has to expand until the ! at the end can match. If you left the ! out, [^0-9]*? on its own would still succeed, but it would match nothing (an empty string), since zero characters is the fewest possible. If you want to match a whole run of non-digits, leave out the ?.

[^0-9]* matches as many non-digit characters as it can, e.g. on the string Hey1234You it matches Hey and stops at the first digit; a second match would pick up You. This is greedy: it keeps going until it hits a character it cannot match or the end of the input.

The s modifier makes . match newline characters as well.

The ? after a quantifier turns off greediness, e.g. .* matches as much as it can (everything up to the end of the line, or the end of the input with the s modifier), while .*? matches as little as it can, so it is only useful with something definite after it to match up to.
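The difference between the greedy and lazy forms is easiest to see side by side. This is a minimal sketch; the test string is the one from the post with an extra " Bye!" appended (my addition) so that the two forms give different results.

```php
<?php
$subject = 'Hello, World! Bye!';

// Greedy: [^0-9]* takes as much as it can, so it runs to the LAST "!".
preg_match('@[^0-9]*!@', $subject, $greedy);

// Lazy: [^0-9]*? takes as little as it can, so it stops at the FIRST "!".
preg_match('@[^0-9]*?!@', $subject, $lazy);

// With nothing after it, a lazy * is satisfied by zero characters.
preg_match('@[^0-9]*?@', $subject, $lazyAlone);

// A greedy negated class still stops at the first character it excludes.
preg_match('@[^0-9]*@', 'Hey1234You', $untilDigit);

var_dump($greedy[0], $lazy[0], $lazyAlone[0], $untilDigit[0]);
```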

If you could explain what you need more understanding with, I'd be glad to help.


Cheers,


MC

#5 Guest_mastercomputers_*

Posted 13 April 2005 - 09:43 AM

OK, I missed your question, but matching the latest member bit is quite easy. We know for a fact that the text looks like this:

Latest Member: User Month

So from this we could do:

Latest\sMember:\s.+?(?=\s)

Notice any problems with this?

Well, \s matches any whitespace character (space, tab, newline).

Basically, we're assuming that a username does not contain a space. I'm not sure which characters a username can contain without viewing the source. If you know for a fact that spaces can't be used inside a username, this is quite an acceptable approach. I'm not using the s modifier because the match never needs to cross a newline. The expression relies on the username being delimited by whitespace; if usernames can contain spaces, it will only capture the first word of the name.

+? means there must be an occurrence here, while not being greedy, i.e. at least 1 match. * means 0 or more; + means 1 or more.

(?=\s) means look ahead to check that the next character is whitespace without including it in the match; (?=...) is a lookahead.

* is the same as {0,}
+ is the same as {1,}
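A short sketch of what the lookahead changes, and of the brace form of the quantifier. The sample text is a hypothetical excerpt modelled on the stripped page output in this thread, and the variable names are mine.

```php
<?php
// Hypothetical excerpt of the stripped page text.
$text = 'by 4 Members - Latest Member: marc April 12, 2005, 10:30:12 PM';

// With the lookahead, the trailing space is checked but not consumed.
preg_match('@Latest\sMember:\s.+?(?=\s)@', $text, $withLookahead);

// Without it, the space becomes part of the match.
preg_match('@Latest\sMember:\s.+?\s@', $text, $withoutLookahead);

// + and {1,} are interchangeable, as are * and {0,}.
preg_match('@Latest\sMember:\s.{1,}?(?=\s)@', $text, $withBraces);

var_dump($withLookahead[0], $withoutLookahead[0], $withBraces[0]);
```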

We could use the modifiers to our advantage.

There are many methods, even using PHP's date function and not matching the month as well.

I'll leave it up to you to decide how this should be done. If you knew which characters aren't allowed in the username, you would have a better chance of solving this problem.


Cheers,


MC

#6 vujsa

Posted 14 April 2005 - 05:50 AM

Wow, MC, again a lot more information than I thought I needed to know.
That's what's nice about your posts, they answer the next three questions too. The PHP site has a very nice explanation of the differences between PHP and Perl regex. The problem is that I don't know Perl regex either. :rolleyes:

As you mentioned, you haven't done much with PCRE so I really appreciate you going ahead and helping anyhow. I tend to stay away from stuff if I have to research it before I can answer the question.

The explanation you gave for the
'@<script[^>]*?>.*?</script>@si',
was very helpful. It being the most complex statement in the array, it gives the most opportunity to learn from. Now that I know how it works, this statement will be my starting point for learning regex.

[^0-9]*?!
really clarified the greedy usage for me.

Here is what I came up with before you got back.
<?php 
$filename = "http://www.forum500.com";
$html = file_get_contents($filename);

// $html should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.
$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                 '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                 '@([\r\n])[\s]+@',                // Strip out white space
                 '@&(quot|#34);@i',                // Replace HTML entities
                 '@&(amp|#38);@i',
                 '@&(lt|#60);@i',
                 '@&(gt|#62);@i',
                 '@&(nbsp|#160);@i',
                 '@&(iexcl|#161);@i',
                 '@&(cent|#162);@i',
                 '@&(pound|#163);@i',
                 '@&(copy|#169);@i',
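                 );

$replace = array ('',
                 '',
                 '\1',
                 '"',
                 '&',
                 '<',
                 '>',
                 ' ',
                 chr(161),
                 chr(162),
                 chr(163),
                 chr(169),
                 );

$search2 = array ('@.*?Latest Member:\s@si');

$replace2 = array ('');

$text = preg_replace($search, $replace, $html);
// The /e (eval) modifier was removed in PHP 7.0, so numeric entities
// such as &#233; are converted with a callback instead of 'chr(\1)'.
$text = preg_replace_callback('@&#(\d+);@', function ($m) {
    return chr((int) $m[1]);
}, $text);
$text2 = preg_replace($search2, $replace2, $text);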

echo "<html><head><title>Content Capture</title></head><body><b>Test String:</b><br />\n $text <br /><br /><b>Extracted Data:</b><br />\n \n\n $text2  <br /><br /></body></html>";
?>

With the following output: :)

Test String:
Test Forum - Index Welcome, Guest. Please login or register. 1 Hour 1 Day 1 Week 1 Month Forever Login with username, password and session length Shoutbox guest123 : Hello SpaceWaste : meh though, Im off to bed, night man SpaceWaste : lol damn bugs....ah well. Its ok, this is just fine SpaceWaste : damn, youre right View All 2 Posts in 1 Topics by 4 Members - Latest Member: marc April 13, 2005, 10:34:58 PM Test Forum News SMF - Just Installed General Category Posts Topics Last post General Discussion Feel free to talk about anything and everything in this board. 2 1 April 06, 2005, 08:50:39 PM Re: Welcome to SMF! by Dooga Test Forum - Info Center Forum Stats Total Topics: 1Total Posts: 2 Latest Post: "Re: Welcome to SMF!" (April 06, 2005, 08:50:39 PM) View the 10 most recent posts on the forum. [More Stats] Total Members: 4 Latest Member: marc Users Online 1 Guest, 0 Users Most users online today: 1 Most users online ever: 4 ( April 10, 2005, 08:40:22 PM ) Test Forum | Powered by SMF 1.0.2. © 2001-2005, Lewis Media. All Rights Reserved. Helios design by Bloc

Extracted Data:
marc
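One hedged alternative to deleting everything in front of the label is to capture the name directly. In the sketch below the input is a shortened, hypothetical version of the test string above, and it assumes usernames contain no whitespace. It also shows that the preg_replace() approach on its own keeps whatever text follows the last label.

```php
<?php
// Shortened, hypothetical version of the stripped text shown above.
$text = 'View All 2 Posts in 1 Topics by 4 Members - Latest Member: marc '
      . 'April 13, 2005, 10:34:58 PM Test Forum News Total Members: 4 '
      . 'Latest Member: marc Users Online 1 Guest, 0 Users';

// Deleting everything up to the label leaves the trailing text behind.
$afterReplace = preg_replace('@.*?Latest Member:\s@si', '', $text);

// Capturing the first run of non-whitespace after the label keeps only the name.
// Assumption: usernames contain no whitespace.
$name = null;
if (preg_match('@Latest Member:\s(\S+)@i', $text, $m)) {
    $name = $m[1];
}

echo "After preg_replace: $afterReplace\n";
echo "Captured name: $name\n";
```

The capture group (\S+) is what ends up in $m[1]; $m[0] would hold the whole match including the label.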


OK well I missed your question, but to match this the latest member bit is quite easy, we know for a fact that this is how it goes.
Cheers,
MC

Actually, the general explanation is quite helpful because ultimately, my example is just a way for me to learn how to do more complex things with regex.

Come to think of it, if I could just borrow your brain for a few days, it would really save on the typing. :)

Hey, thanks for all the help. :)

vujsa

By the way, the "s" modifier at the end of the regex as explained by the PHP web site:

s (PCRE_DOTALL)

  • If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
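The effect described in that definition can be checked with a few lines. This is only a sketch; the sample HTML is invented for the example.

```php
<?php
$html = "<script>\nalert('hi');\n</script>";

// Without s, "." stops at a newline, so the multi-line body is not matched.
$withoutS = preg_match('@<script>.*?</script>@i', $html);

// With s (PCRE_DOTALL), "." also matches newlines and the whole block matches.
$withS = preg_match('@<script>.*?</script>@si', $html);

// A negated class matches newlines regardless of the modifier.
$negatedClass = preg_match('@<script>[^<]*</script>@i', $html);

var_dump($withoutS, $withS, $negatedClass);
```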



#7 vizskywalker

vizskywalker

    Techno-Necromancer

  • Members
  • 1,018 posts
  • Location:The Net

Posted 14 April 2005 - 06:04 AM

I followed most of the explanation. But I'm still unclear as to the purpose of the "." in "@<script[^>]*?>.*?</script>@si". Also, what exactly does preg_replace() do? Thanks.

~Viz

#8 vujsa

Posted 14 April 2005 - 06:59 AM

The "." matches any single character (and, with the s modifier, newlines too). So here is an example:
$string = "Hello world!  Welcome to my example. <br />\n Please feel free to increase my reputation points! <br />\n Thank you.";
$string2 = preg_replace('@.*?\sThank you.@si', 'Hello world!  How cool am I!', $string);

echo "<html><head><title>Content Capture</title></head><body><b>Test String:</b><br />\n $string <br /><br /><b>New Data:</b><br />\n \n\n $string2  <br /><br /></body></html>";

Returns:

Test String:
Hello world! Welcome to my example.
Please feel free to increase my reputation points!
Thank you.

New Data:
Hello world! How cool am I!


So '@.*?\sThank you.@si' means match all characters up to and including one whitespace character followed by "Thank you" and one more character (the final "." is unescaped, so it matches any character; write \. to match only a literal period).

Basically, it acts like a wild card.

So for '@<script[^>]*?>.*?</script>@si' the "." means match all characters in between the <script ...> and </script> tags.

preg_replace() is the Perl-compatible regular expression replace function.
It works like ereg_replace() (which was removed in PHP 7.0) but is a little more complex, more flexible, and has slightly different syntax. Basically, you can pass an array of patterns, and each match is replaced by the corresponding value in an array of replacement values.

Instead of typing ereg_replace() over and over, just use the array method with preg_replace().
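A minimal sketch of the array form, using three of the entity patterns from the script earlier in this thread. The input string is made up for the example.

```php
<?php
$text = 'Tom &amp; Jerry &lt;3 &quot;cheese&quot;';

// Pattern N is replaced with replacement N, in array order.
$search  = array('@&(quot|#34);@i', '@&(amp|#38);@i', '@&(lt|#60);@i');
$replace = array('"',               '&',              '<');

$decoded = preg_replace($search, $replace, $text);

echo $decoded, "\n";
```

The patterns are applied one after another, each to the result of the previous one, so the order of the arrays can matter when one replacement produces text that a later pattern matches.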
For more information try the PHP Web Site About PCRE
Hope this helps. :rolleyes:

vujsa

Edited by vujsa, 14 April 2005 - 09:59 PM.


#9 overture

overture

    Premium Member

  • Members
  • 208 posts
  • Location:England
  • Interests:Web Design, Web Programming, Software Programming, Drawing, Digital Art :D

Posted 14 April 2005 - 12:40 PM

I'd just like to thank you for starting this topic, it was very convenient for me :rolleyes: this has helped me as much as it has helped you Vujsa. An extra thanks to M^E for the extensive explanations :). thanks.

#10 vizskywalker

Posted 14 April 2005 - 08:56 PM

Yes, it is very helpful to have this post. And I'm sure overture meant MC not M^E.

~Viz

#11 overture

Posted 15 April 2005 - 01:05 PM

lol yes i did Viz :rolleyes:

#12 ^zer0dyer$

^zer0dyer$

    Newbie [ Level 1 ]

  • Members
  • 4 posts
  • Location:Alpha Centauri
  • Interests:Whatever my girlfriend's are.

Posted 02 May 2005 - 08:02 AM

First of all, if you are just trying to escape HTML characters use:
-htmlentities() OR
-htmlspecialchars()
which are built-in PHP functions. If you are just trying to learn regexp, more power to you! :(
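As a rough sketch of the built-in route: htmlspecialchars() escapes, while strip_tags() and html_entity_decode() go in the other direction, which is closer to what the script in this thread does by hand. The sample strings are invented, and the explicit ENT_QUOTES and 'UTF-8' arguments are my choice for the example.

```php
<?php
$raw = '<b>Tom & Jerry</b>';

// Escaping: turn special characters into entities for safe display.
$escaped = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// The reverse direction, closer to what the script in this thread does:
// drop the tags, then turn entities back into characters.
$page  = '<p>Latest Member: <a href="#">marc</a> &amp; friends &copy; 2005</p>';
$plain = html_entity_decode(strip_tags($page), ENT_QUOTES, 'UTF-8');

echo $escaped, "\n", $plain, "\n";
```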

In PCRE regular expressions, there are several types of delimiters you can use for your patterns:
<?php
# This finds all tags, opening and closing
$pattern = "@</?\w+[^>]*>@is"; // @ is the delimiter in pattern.
?>
That was a very eloquent pattern, mastercomputers. I just recently started using look aheads/behinds, and have had fun toying around with them :D

Also, if you want to print out mastercomputers' result with preg_match, try the following for some good practice:
<?php
$file = "path/to/file";
// file_get_contents() returns the contents as a string, or false on failure.
$contents = @file_get_contents($file);
if ($contents === false)
{
   die("File not acquired!\n</body>\n</html>");
}

$pattern = "#Latest\sMember:\s.+?(?=\s)#i";

# Finds the first match only
if ( preg_match($pattern, $contents, $matches) )
{
   // Pass true so print_r() returns the text instead of printing it directly
   print "<pre>\n".print_r($matches, true)."\n</pre>\n";
}
else
{
   print "<p>Nothing found =(</p>";
}

# Finds all matches
if ( preg_match_all($pattern, $contents, $matches) )
{ # $matches is now a 2-dimensional array
   print "<pre>\n".print_r($matches, true)."\n</pre>\n";
}
else
{
   print "<p>Nothing found =(</p>";
}
?>
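To make the difference in the shape of $matches concrete, here is a small sketch. The sample text is a hypothetical excerpt in which the label appears twice, and a capturing group has been added to the pattern (my addition) so the name can be read on its own.

```php
<?php
// Hypothetical excerpt containing the label twice.
$text = 'Latest Member: marc April 12, 2005 ... Total Members: 4 '
      . 'Latest Member: marc Users Online';

$pattern = '#Latest\sMember:\s(.+?)(?=\s)#i';

// preg_match stops at the first match: $first[0] is the whole match,
// $first[1] is the first capturing group.
$found = preg_match($pattern, $text, $first);

// preg_match_all collects every match: $all[0] lists the whole matches,
// $all[1] lists the captured groups.
$count = preg_match_all($pattern, $text, $all);

print_r($first);
print_r($all);
```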

Peace,
+CurTis-


