Skip to content Skip to sidebar Skip to footer

Php Preg_split On Spaces, But Not Within Tags

i am using preg_split('/\'[^\']*\'(*SKIP)(*F)|\x20/', $input_line); and run it on phpliveregex.com it produce array : array(10 0=>test 1=>or 2=>&l

Solution 1:

It's easier to use the DOMDocument since you don't need to describe what a html tag is and how it looks. You only need to check the nodeType. When it's a textNode, split it with preg_match_all(it's more handy than to design a pattern for preg_split):

$html = 'spaces in a text node <b>test</b> or <em>oh yeah</em> and <i>oh yeah</i>
"ye we \'hold\' it"
"unclosed double quotes at the end';

$dom = new DOMDocument;
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED);

$nodeList = $dom->documentElement->childNodes;

$results = [];

foreach ($nodeListas$childNode) {
    if ($childNode->nodeType == XML_TEXT_NODE &&
        preg_match_all('~[^\s"]+|"[^"]*"?~', $childNode->nodeValue, $m))
        $results = array_merge($results, $m[0]);
    else$results[] = $dom->saveHTML($childNode);
}

print_r($results);

Note: I have chosen a default behaviour when a double quote part stays unclosed (without a closing quote), feel free to change it.

Note2: Sometimes LIBXML_ constants are not defined. You can solve this problem testing it before and defining it when needed:

if (!defined('LIBXML_HTML_NOIMPLIED'))
    define('LIBXML_HTML_NOIMPLIED', 8192);

Solution 2:

Description

Instead of using a split command just match the sections you want

<(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*

Regular expression visualization

Example

Live Demo

https://regex101.com/r/bK8iL3/1

Sample text

Note the difficult edge case in the second paragraph

<b>test</b> or <strong> this </strong><em> oh yeah </em> and <i>oh yeah</i> Here we are "ye we 'hold' it"

some<img/>gfsf<a html="droids.html" onmouseover=' var x=" Not the droid I am looking for " ; '>droides</a><pre></pre><code></code><strong></strong><b></b><em></em><i></i>

Sample Matches

MATCH 1
0.  [0-11]  `<b>test</b>`

MATCH 2
0.  [11-15] ` or `

MATCH 3
0.  [15-38] `<strong> this </strong>`

MATCH 4
0.  [38-56] `<em> oh yeah </em>`

MATCH 5
0.  [56-61] ` and `

MATCH 6
0.  [61-75] `<i>oh yeah</i>`

MATCH 7
0.  [75-111]    ` Here we are "ye we 'hold' it" some`

MATCH 8
0.  [111-117]   `<img/>`

MATCH 9
0.  [117-121]   `gfsf`

MATCH 10
0.  [121-213]   `<a html="droids.html" onmouseover=' var x=" Not the droid I am looking for " ; '>droides</a>`

MATCH 11
0.  [213-224]   `<pre></pre>`

MATCH 12
0.  [224-237]   `<code></code>`

MATCH 13
0.  [237-254]   `<strong></strong>`

MATCH 14
0.  [254-261]   `<b></b>`

MATCH 15
0.  [261-270]   `<em></em>`

MATCH 16
0.  [270-277]   `<i></i>`

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  <                        '<'
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      img                      'img'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
    (?=                      look ahead to see if there is:
----------------------------------------------------------------------
      [\s>\/]                  any character of: whitespace (\n, \r,
                               \t, \f, and " "), '>', '\/'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        '                        '\''
----------------------------------------------------------------------
        [^']*                    any character except: ''' (0 or more
                                 times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        '                        '\''
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
        [^"]*                    any character except: '"' (0 or more
                                 times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        [^'"\s>]*                any character except: ''', '"',
                                 whitespace (\n, \r, \t, \f, and "
                                 "), '>' (0 or more times (matching
                                 the most amount possible))
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    \s?                      whitespace (\n, \r, \t, \f, and " ")
                             (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    \/?                      '/' (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      a                        'a'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      span                     'span'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      pre                      'pre'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      code                     'code'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      strong                   'strong'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      b                        'b'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      em                       'em'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      i                        'i'
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    (?=                      look ahead to see if there is:
----------------------------------------------------------------------
      [\s>\\]                  any character of: whitespace (\n, \r,
                               \t, \f, and " "), '>', '\\'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        '                        '\''
----------------------------------------------------------------------
        [^']*                    any character except: ''' (0 or more
                                 times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        '                        '\''
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
        [^"]*                    any character except: '"' (0 or more
                                 times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        [^'"\s>]*                any character except: ''', '"',
                                 whitespace (\n, \r, \t, \f, and "
                                 "), '>' (0 or more times (matching
                                 the most amount possible))
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    \s?                      whitespace (\n, \r, \t, \f, and " ")
                             (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    \/?                      '/' (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
    .*?                      any character (0 or more times (matching
                             the least amount possible))
----------------------------------------------------------------------
    <                        '<'
----------------------------------------------------------------------
    \/                       '/'
----------------------------------------------------------------------
    \1                       what was matched by capture \1
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    [^"<]*                   any character except: '"', '<' (0 or
                             more times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------

Post a Comment for "Php Preg_split On Spaces, But Not Within Tags"