Class: Oga::XML::Lexer

Inherits: Object

Defined in: lib/oga/xml/lexer.rb

Overview

Low-level lexer that supports both XML and HTML. To lex HTML input, set the :html option to true when creating an instance of the lexer:

lexer = Oga::XML::Lexer.new('...', :html => true)

This lexer can process both String and IO instances. IO instances are processed line by line, which can greatly reduce memory usage in exchange for a slightly slower runtime.

Thread Safety

Because this class keeps internal state, the same instance cannot be used from multiple threads at the same time. For example, the following will not work reliably:

# Don't do this!
lexer   = Oga::XML::Lexer.new('....')
threads = []

2.times do
  threads << Thread.new do
    lexer.advance do |*args|
      p args
    end
  end
end

threads.each(&:join)

However, it is perfectly safe to use a different instance per thread. This lexer does not use any global state.
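A minimal sketch of the safe pattern, with one instance per thread. `StatefulLexer` is a hypothetical stand-in that yields whitespace-separated words, used here only so the example runs without Oga installed. With the real class, each thread would create its own `Oga::XML::Lexer` in the same place.

```ruby
# Hypothetical stand-in for a lexer that keeps per-instance state.
class StatefulLexer
  def initialize(data)
    @data = data
    @line = 1
  end

  # Yields each word together with the line number it was found on.
  def advance
    @data.each_line do |line|
      line.split.each { |word| yield word, @line }
      @line += 1
    end
  end
end

inputs = ["a b\nc", "d\ne f"]

# Each thread builds its own instance, so no lexer state is shared.
threads = inputs.map do |input|
  Thread.new do
    tokens = []
    StatefulLexer.new(input).advance { |word, line| tokens << [word, line] }
    tokens
  end
end

# Thread#value joins the thread and returns the block's result.
results = threads.map(&:value)
```

Because every thread owns its instance, the line counter of one lexer can never be advanced by another thread.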

Strict Mode

By default the lexer is rather permissive regarding the input. For example, missing closing tags are inserted by default. To disable this behaviour the lexer can be run in “strict mode” by setting :strict to true:

lexer = Oga::XML::Lexer.new('...', :strict => true)

Strict mode only applies to XML documents.

Constant Summary

HTML_SCRIPT = These are all constant/frozen to remove the need for String allocations every time they are referenced in the lexer.

'script'.freeze

HTML_STYLE =

'style'.freeze

HTML_TABLE_ALLOWED = Elements that are allowed directly in a <table> element.

Whitelist.new(
  %w{thead tbody tfoot tr caption colgroup col}
)

HTML_SCRIPT_ELEMENTS =

Whitelist.new(%w{script template})

HTML_TABLE_ROW_ELEMENTS = The elements that may occur in a thead, tbody, or tfoot. Technically “th” is not allowed per the HTML5 spec, but it’s so commonly used in these elements that we allow it anyway.

Whitelist.new(%w{tr th}) + HTML_SCRIPT_ELEMENTS

HTML_CLOSE_SELF = Elements that should be closed automatically before a new opening tag is processed.

{
  'head' => Blacklist.new(%w{head body}),
  'body' => Blacklist.new(%w{head body}),
  'li'   => Blacklist.new(%w{li}),
  'dt'   => Blacklist.new(%w{dt dd}),
  'dd'   => Blacklist.new(%w{dt dd}),
  'p'    => Blacklist.new(%w{
    address article aside blockquote details div dl fieldset figcaption
    figure footer form h1 h2 h3 h4 h5 h6 header hgroup hr main menu nav
    ol p pre section table ul
  }),
  'rb'       => Blacklist.new(%w{rb rt rtc rp}),
  'rt'       => Blacklist.new(%w{rb rt rtc rp}),
  'rtc'      => Blacklist.new(%w{rb rtc}),
  'rp'       => Blacklist.new(%w{rb rt rtc rp}),
  'optgroup' => Blacklist.new(%w{optgroup}),
  'option'   => Blacklist.new(%w{optgroup option}),
  'colgroup' => Whitelist.new(%w{col template}),
  'caption'  => HTML_TABLE_ALLOWED.to_blacklist,
  'table'    => HTML_TABLE_ALLOWED + HTML_SCRIPT_ELEMENTS,
  'thead'    => HTML_TABLE_ROW_ELEMENTS,
  'tbody'    => HTML_TABLE_ROW_ELEMENTS,
  'tfoot'    => HTML_TABLE_ROW_ELEMENTS,
  'tr'       => Whitelist.new(%w{td th}) + HTML_SCRIPT_ELEMENTS,
  'td'       => Blacklist.new(%w{td th}) + HTML_TABLE_ALLOWED,
  'th'       => Blacklist.new(%w{td th}) + HTML_TABLE_ALLOWED
}
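The following sketch shows one way such a table could be consulted. It rests on an assumption that this page does not confirm: the currently open element is closed when its entry does not allow the incoming element name. `MiniWhitelist`, `MiniBlacklist`, `CLOSE_SELF` and `close_before?` are hypothetical names, and the table is a three-entry excerpt.

```ruby
# Hypothetical stand-ins for the Whitelist and Blacklist classes used above.
class MiniWhitelist
  def initialize(names)
    @names = names
  end

  # A whitelist allows only the listed names.
  def allow?(name)
    @names.include?(name)
  end
end

class MiniBlacklist < MiniWhitelist
  # A blacklist allows everything except the listed names.
  def allow?(name)
    !super
  end
end

# Excerpt of the table above, rebuilt with the stand-ins.
CLOSE_SELF = {
  'li' => MiniBlacklist.new(%w{li}),
  'dt' => MiniBlacklist.new(%w{dt dd}),
  'tr' => MiniWhitelist.new(%w{td th script template})
}

# Assumption: close the open element when its entry does not allow the
# incoming name. Elements without an entry are never closed automatically.
def close_before?(current, incoming)
  rule = CLOSE_SELF[current]
  !rule.nil? && !rule.allow?(incoming)
end

close_before?('li', 'li')  # true:  a new <li> closes the open <li>
close_before?('li', 'p')   # false: <p> may be nested in <li>
close_before?('tr', 'td')  # false: <td> is whitelisted inside <tr>
close_before?('tr', 'tr')  # true:  <tr> is not whitelisted inside <tr>
close_before?('div', 'p')  # false: no entry for <div>
```

Under this reading, a Blacklist entry names the tags that end the current element, while a Whitelist entry names the only tags that may appear without ending it.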

LITERAL_HTML_ELEMENTS = Names of HTML tags of which the content should be lexed as-is.

Whitelist.new([HTML_SCRIPT, HTML_STYLE])

Instance Method Summary

#advance {|type, value, line| ... } ⇒ Object
Advances through the input and generates the corresponding tokens.
#html? ⇒ TrueClass|FalseClass
#html_script? ⇒ TrueClass|FalseClass
#html_style? ⇒ TrueClass|FalseClass
#initialize(data, options = {}) ⇒ Lexer constructor
A new instance of Lexer.
#lex ⇒ Array
Gathers all the tokens for the input and returns them as an Array.
#read_data {|| ... } ⇒ String
Yields the data to lex to the supplied block.
#strict? ⇒ TrueClass|FalseClass

Constructor Details

#initialize(data, options = {}) ⇒ `Lexer`

Returns a new instance of Lexer

Parameters:

data (String|IO) —
The data to lex. This can either be a String or an IO instance.
options (Hash) (defaults to: {})

Options Hash (options):

:html (TrueClass|FalseClass) —
When set to true the lexer will treat the input as HTML instead of XML. This makes it possible to lex HTML void elements such as <link href="">.
:strict (TrueClass|FalseClass) —
Enables/disables strict parsing of XML documents, disabled by default.

# File 'lib/oga/xml/lexer.rb', line 115

def initialize(data, options = {})
  @data   = data
  @html   = options[:html]
  @strict = options[:strict] || false
  @line     = 1
  @elements = []
  reset_native
end

Instance Method Details

#advance {|type, value, line| ... } ⇒ `Object`

Advances through the input and generates the corresponding tokens. Each token is yielded to the supplied block.

Each token is an Array in the following format:

[TYPE, VALUE]

The type is a Symbol, the value is either nil or a String. The line number is passed to the block as a third argument (see the yield parameters below).

This method stores the supplied block in @block and resets it after the lexer loop has finished.

Yield Parameters:

type (Symbol)
value (String)
line (Fixnum)

# File 'lib/oga/xml/lexer.rb', line 172

def advance(&block)
  @block = block

  read_data do |chunk|
    advance_native(chunk)
  end

  # Add any missing closing tags
  if !strict? and !@elements.empty?
    @elements.length.times { on_element_end }
  end
ensure
  @block = nil
end
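The two mechanisms in this method, storing the block for the duration of the loop and emitting end tokens for elements left open, can be sketched in isolation. `ToyLexer` is hypothetical: it consumes a list of `[:open, name]` and `[:close]` events rather than real XML, and the token names `:element_start` and `:element_end` are invented for the example.

```ruby
# Hypothetical toy lexer. Input is a list of [:open, name] and [:close]
# events instead of real XML, and the token names are made up.
class ToyLexer
  def initialize(events, options = {})
    @events   = events
    @strict   = options[:strict] || false
    @elements = []
  end

  def advance(&block)
    @block = block

    @events.each do |kind, name|
      if kind == :open
        on_element_start(name)
      else
        on_element_end
      end
    end

    # Emit an end token for every element that is still open.
    @elements.length.times { on_element_end } unless @strict
  ensure
    # The block is cleared even if lexing raises.
    @block = nil
  end

  private

  def on_element_start(name)
    @elements << name
    @block.call(:element_start, name)
  end

  def on_element_end
    @elements.pop
    @block.call(:element_end, nil)
  end
end

# <a><b></b> with the closing tag of <a> missing.
events = [[:open, 'a'], [:open, 'b'], [:close]]

lenient = []
ToyLexer.new(events).advance { |type, value| lenient << [type, value] }

strict = []
ToyLexer.new(events, :strict => true).advance { |type, value| strict << [type, value] }
```

Here `lenient` ends with an extra `:element_end` token for the unclosed `a`, while `strict` contains only the three tokens that correspond to the input events.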

#html? ⇒ `TrueClass|FalseClass`

Returns:

(TrueClass|FalseClass)

# File 'lib/oga/xml/lexer.rb', line 188

def html?
  @html == true
end

#html_script? ⇒ `TrueClass|FalseClass`

Returns:

(TrueClass|FalseClass)

# File 'lib/oga/xml/lexer.rb', line 198

def html_script?
  html? && current_element == HTML_SCRIPT
end

#html_style? ⇒ `TrueClass|FalseClass`

Returns:

(TrueClass|FalseClass)

# File 'lib/oga/xml/lexer.rb', line 203

def html_style?
  html? && current_element == HTML_STYLE
end

#lex ⇒ `Array`

Gathers all the tokens for the input and returns them as an Array.

Returns:

(Array)
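The source of this method is not shown on this page. As a rough sketch of how a collecting method can be layered on a yielding one, the hypothetical `TokenSource` class below replays a fixed token list from `advance` and gathers it in `lex`. The token type names are invented, and the real implementation may differ.

```ruby
# Hypothetical token source standing in for a lexer.
class TokenSource
  def initialize(tokens)
    @tokens = tokens
  end

  # Streams tokens one at a time to the supplied block.
  def advance
    @tokens.each { |type, value, line| yield type, value, line }
  end

  # Collects every yielded token into an Array.
  def lex
    tokens = []
    advance { |type, value, line| tokens << [type, value, line] }
    tokens
  end
end

source = TokenSource.new([[:name, 'a', 1], [:text, 'hi', 1], [:close, nil, 2]])
all    = source.lex
```

Collecting is convenient for small inputs and tests. Streaming through the block avoids holding every token in memory at once.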

#read_data {|| ... } ⇒ `String`

Yields the data to lex to the supplied block.

Yield Parameters:

(String)

Returns:

(String)

# File 'lib/oga/xml/lexer.rb', line 128

def read_data
  if @data.is_a?(String)
    yield @data

  # IO, StringIO, etc
  # THINK: read(N) would be nice, but currently this screws up the C code
  elsif @data.respond_to?(:each_line)
    @data.each_line { |line| yield line }

  # Enumerator, Array, etc
  elsif @data.respond_to?(:each)
    @data.each { |chunk| yield chunk }
  end
end
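To see what each branch yields, the dispatch above can be copied into a standalone method that takes the data as an argument. `each_chunk` and `chunks_for` are hypothetical helper names; the branching logic is the same as in `read_data`.

```ruby
require 'stringio'

# Standalone copy of the dispatch in #read_data.
def each_chunk(data)
  if data.is_a?(String)
    yield data
  elsif data.respond_to?(:each_line)
    data.each_line { |line| yield line }
  elsif data.respond_to?(:each)
    data.each { |chunk| yield chunk }
  end
end

# Collects the yielded chunks so they can be inspected.
def chunks_for(data)
  chunks = []
  each_chunk(data) { |chunk| chunks << chunk }
  chunks
end

xml = "<root>\n  <a/>\n</root>"

from_string = chunks_for(xml)                    # one chunk: the whole String
from_io     = chunks_for(StringIO.new(xml))      # three chunks, one per line
from_array  = chunks_for(['<root>', '</root>'])  # two chunks, one per element
```

A String is handed over in one piece even though it also responds to `each_line`, because the `is_a?(String)` check comes first. Only IO-like objects are split into lines.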

#strict? ⇒ `TrueClass|FalseClass`

Returns:

(TrueClass|FalseClass)

# File 'lib/oga/xml/lexer.rb', line 193

def strict?
  @strict
end
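The option handling in the constructor and the two predicates `html?` and `strict?` treat their values differently, which is easiest to see in isolation. `OptionDemo` is a hypothetical class that copies only those three method bodies from this page.

```ruby
# Hypothetical class that copies only the option handling shown on this page.
class OptionDemo
  def initialize(options = {})
    @html   = options[:html]
    @strict = options[:strict] || false
  end

  def html?
    @html == true
  end

  def strict?
    @strict
  end
end

OptionDemo.new.html?                      # false, :html was not given
OptionDemo.new(:html => true).html?       # true
OptionDemo.new(:html => 'yes').html?      # false, only the literal true counts
OptionDemo.new.strict?                    # false
OptionDemo.new(:strict => 'yes').strict?  # "yes", the value is returned as given
```

As written, `html?` always returns a boolean, while `strict?` returns whatever truthy value was passed, so passing the literals true or false keeps both predicates consistent with their documented return types.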
