[Javascript] regexp - how to exclude a substring?


  • From: paul at novitskisoftware.com (Paul Novitski)
  • Subject: [Javascript] regexp - how to exclude a substring?
  • Date: Mon May 23 17:41:27 2005

Shawn,

Consider the interesting problem of selecting a chunk of HTML based on a 
complex CSS-style selector:

         div p.intro span

Ideally, I'd be able to convert this into a single regular expression, 
something like:

         /<div[ >].*<p [^>]*class=\"[^\"]*intro[ \"].*<span/si

This will locate a <div followed by a <p class="intro" followed by a <span 
-- but won't guarantee any parent-child-grandchild relationship between 
them.  It will match both of these:

         <div>
            <p class="intro">
               <span>
and:
         <div></div>
         <p class="intro"></p>
         <span>

That's why I've wanted to exclude a string in the regexp, not just a 
character.  However, it appears that I have hit the ceiling of what regular 
expressions can do in this area so I'll let go of that.
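
To make the problem concrete, here is a small sketch that runs the single-regex idea against both snippets. It is adapted rather than copied: [\s\S] stands in for "." plus the s flag (so it does not depend on dotAll support), and the backslash-escaped quotes are dropped because the pattern is a JavaScript regex literal here. The variable names are made up for the example.

```javascript
// Single-regex attempt: <div ... <p class="...intro..." ... <span, in that order.
const singleRe = /<div[ >][\s\S]*<p [^>]*class="[^"]*intro[ "][\s\S]*<span/i;

// Properly nested: div > p.intro > span
const nested = '<div>\n  <p class="intro">\n    <span>';

// Three siblings, no parent-child relationship at all
const siblings = '<div></div>\n<p class="intro"></p>\n<span>';

console.log(singleRe.test(nested));   // true
console.log(singleRe.test(siblings)); // true
```

Both calls report a match, because the pattern only checks the order in which the tags appear, not how they nest.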


My current strategy is to initialize the template engine by walking the 
document recording the lineage of each element:

<html>                          0:html
   <body>                        1:html body
     <div id="content">          2:html body div#content
        <h2>                     3:html body div#content h2
        </h2>
        <p class="intro">        4:html body div#content p.intro
          <span>                 5:html body div#content p.intro span
          </span>
        </p>
     </div>
     <ul id="nav" class="menu">  6:html body ul#nav.menu
...

(I wonder if this is how some rendering engines work internally, so they 
don't have to keep re-parsing the tree?)
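
A minimal sketch of that walk, for illustration only. The function name buildLineages is hypothetical, and the tokenizer is deliberately naive: it assumes well-formed markup in which every element has an explicit end tag (no void or self-closing elements, comments, or scripts) and no attribute value contains ">".

```javascript
// Hypothetical helper: returns one "n:ancestor ... element" string per start tag.
function buildLineages(html) {
  const tagRe = /<(\/?)([a-z][a-z0-9]*)([^>]*)>/gi;
  const stack = [];     // labels of the elements that are currently open
  const lineages = [];  // one entry per start tag, in document order
  let m;
  while ((m = tagRe.exec(html)) !== null) {
    const [, slash, name, attrs] = m;
    if (slash) {        // end tag: step back out of the current element
      stack.pop();
      continue;
    }
    const id = /\sid="([^"]*)"/i.exec(attrs);
    const cls = /\sclass="([^"]*)"/i.exec(attrs);
    let label = name.toLowerCase();
    if (id) label += '#' + id[1];   // #id comes first
    if (cls) {                      // then one .class per class name
      label += cls[1].split(/\s+/).filter(Boolean).map(c => '.' + c).join('');
    }
    stack.push(label);
    lineages.push(lineages.length + ':' + stack.join(' '));
  }
  return lineages;
}

const doc =
  '<html><body><div id="content"><h2></h2>' +
  '<p class="intro"><span></span></p></div>' +
  '<ul id="nav" class="menu"></ul></body></html>';

console.log(buildLineages(doc).join('\n'));
```

For the sample document this prints the same seven lineage strings as the listing above, numbered 0 through 6.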

Then I can search those lineage strings for the matches I want.  Assuming 
that every tag name is preceded by a space, and that #ids come before 
.classes, then this regex should work to pinpoint the desired element:

         /(\d+):.* div.* p[^ ]*\.intro.* span/
will match:
         div p.intro span
in:
         5:html body div#content p.intro span


Then my parenthetical expression (\d+) will yield the key number, n'est-ce pas?
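
A sketch of that lookup, with two adjustments that are assumptions of the example rather than part of the plan above. The key is captured with the quantifier inside the parentheses, so multi-digit keys come back whole. The pattern is also anchored with ^ and [^ ]*$, so the final span has to be the element itself rather than one of its ancestors. The helper name findKeys and the sample data are made up.

```javascript
// Lineage pattern for "div p.intro span", anchored at both ends.
const lineageRe = /^(\d+):.* div.* p[^ ]*\.intro.* span[^ ]*$/;

// Hypothetical helper: numeric keys of every lineage string that matches.
function findKeys(lineages) {
  const keys = [];
  for (const line of lineages) {
    const m = lineageRe.exec(line);
    if (m) keys.push(Number(m[1]));
  }
  return keys;
}

const sample = [
  '4:html body div#content p.intro',
  '5:html body div#content p.intro span',
  '6:html body ul#nav.menu',
  '12:html body div#footer p.intro span.note',
  '13:html body div#footer p.intro span.note em',
];

console.log(findKeys(sample)); // [ 5, 12 ]

// A repeated group keeps only its last repetition, so the + belongs inside.
console.log(/(\d)+:/.exec('12:html body')[1]); // "2"
```

Entry 13 is rejected only because of the trailing anchor. Without it, the em inside the span would be reported as a match too.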

Paul


At 11:24 AM 5/23/2005, Shawn Milo wrote:
 > Maybe you can answer a more general question I have about regular
 > expressions: why, when you search for
 >          <div.*<\/div
 > does regexp return a string that stretches all the way to the last </div
 > found and not simply to the first one it encounters?

Easy: It's called "greedy matching," and it is the default behavior of
quantifiers such as * in most regex engines.

That's where I thought maybe something like a lookahead or lookbehind
might come in handy, because you can say something like:

<div, then a </div, without another </div before it
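
As a rough illustration of the difference, the sketch below compares the greedy pattern with two alternatives: a lazy quantifier, and a negative lookahead that refuses to step over a "</div". The sample string and variable names are made up for the example.

```javascript
const html = '<div>one</div><p>two</p><div>three</div>';

// Greedy: .* grabs as much as it can, then backs up to the LAST </div.
const greedy = /<div.*<\/div/.exec(html)[0];

// Lazy: .*? grows one character at a time and stops at the FIRST </div.
const lazy = /<div.*?<\/div/.exec(html)[0];

// Lookahead: consume a character only if "</div" does not start there.
const guarded = /<div(?:(?!<\/div).)*<\/div/.exec(html)[0];

console.log(greedy);  // <div>one</div><p>two</p><div>three</div
console.log(lazy);    // <div>one</div
console.log(guarded); // <div>one</div
```

The lookahead form is one way to exclude a whole substring rather than a single character. Neither alternative understands nesting, though: with a div inside a div, both stop at the inner closing tag.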

As for your other comments, I'll re-read and see if a thought is born.

Shawn