Class: Oga::XML::Lexer

Inherits:
Object
Defined in:
lib/oga/xml/lexer.rb

Overview

Low-level lexer that supports both XML and HTML (using an extra option). To lex HTML input, set the :html option to true when creating an instance of the lexer:

lexer = Oga::XML::Lexer.new('...', :html => true)

This lexer can process both String and IO instances. IO instances are processed line by line, which can greatly reduce memory usage in exchange for a slightly slower runtime.

Thread Safety

Since this class keeps track of internal state, you cannot share the same instance between multiple threads at the same time. For example, the following will not work reliably:

# Don't do this!
lexer   = Oga::XML::Lexer.new('....')
threads = []

2.times do
  threads << Thread.new do
    lexer.advance do |*args|
      p args
    end
  end
end

threads.each(&:join)

However, it is perfectly safe to use a different instance per thread. There is no global state used by this lexer.
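The one-instance-per-thread pattern can be sketched as follows. ToyLexer is a hypothetical stand-in written for this example, not Oga's implementation; it only mimics the fact that the lexer keeps per-instance state while #advance runs.

```ruby
# Hypothetical stand-in for a stateful lexer; it is NOT Oga's implementation.
# Like the real class, it stores the block in an instance variable while it runs.
class ToyLexer
  def initialize(data)
    @data  = data
    @block = nil
  end

  # Yields one type/value pair per whitespace-separated word.
  def advance(&block)
    @block = block
    @data.split.each { |word| @block.call(:T_TEXT, word) }
  ensure
    @block = nil
  end
end

inputs = ['a b c', 'd e f']

# Each thread builds and uses its own instance, so no state is shared.
threads = inputs.map do |input|
  Thread.new do
    tokens = []
    ToyLexer.new(input).advance { |type, value| tokens << [type, value] }
    tokens
  end
end

# Thread#value waits for the thread and returns the result of its block.
results = threads.map(&:value)
```

With the real lexer the same shape would presumably apply: construct the instance inside the thread instead of sharing one created outside it.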

Strict Mode

By default the lexer is rather permissive regarding the input. For example, missing closing tags are inserted by default. To disable this behaviour the lexer can be run in “strict mode” by setting :strict to true:

lexer = Oga::XML::Lexer.new('...', :strict => true)

Strict mode only applies to XML documents.

Constant Summary

HTML_SCRIPT =

These are all constant/frozen to remove the need for String allocations every time they are referenced in the lexer.

'script'.freeze
HTML_STYLE =
'style'.freeze
HTML_TABLE_ALLOWED =

Elements that are allowed directly in a <table> element.

Whitelist.new(
  %w{thead tbody tfoot tr caption colgroup col}
)
HTML_SCRIPT_ELEMENTS =
Whitelist.new(%w{script template})
HTML_TABLE_ROW_ELEMENTS =

The elements that may occur in a thead, tbody, or tfoot.

Technically “th” is not allowed per the HTML5 spec, but it’s so commonly used in these elements that we allow it anyway.

Whitelist.new(%w{tr th}) + HTML_SCRIPT_ELEMENTS
HTML_CLOSE_SELF =

Elements that should be closed automatically before a new opening tag is processed.

{
  'head' => Blacklist.new(%w{head body}),
  'body' => Blacklist.new(%w{head body}),
  'li'   => Blacklist.new(%w{li}),
  'dt'   => Blacklist.new(%w{dt dd}),
  'dd'   => Blacklist.new(%w{dt dd}),
  'p'    => Blacklist.new(%w{
    address article aside blockquote details div dl fieldset figcaption
    figure footer form h1 h2 h3 h4 h5 h6 header hgroup hr main menu nav
    ol p pre section table ul
  }),
  'rb'       => Blacklist.new(%w{rb rt rtc rp}),
  'rt'       => Blacklist.new(%w{rb rt rtc rp}),
  'rtc'      => Blacklist.new(%w{rb rtc}),
  'rp'       => Blacklist.new(%w{rb rt rtc rp}),
  'optgroup' => Blacklist.new(%w{optgroup}),
  'option'   => Blacklist.new(%w{optgroup option}),
  'colgroup' => Whitelist.new(%w{col template}),
  'caption'  => HTML_TABLE_ALLOWED.to_blacklist,
  'table'    => HTML_TABLE_ALLOWED + HTML_SCRIPT_ELEMENTS,
  'thead'    => HTML_TABLE_ROW_ELEMENTS,
  'tbody'    => HTML_TABLE_ROW_ELEMENTS,
  'tfoot'    => HTML_TABLE_ROW_ELEMENTS,
  'tr'       => Whitelist.new(%w{td th}) + HTML_SCRIPT_ELEMENTS,
  'td'       => Blacklist.new(%w{td th}) + HTML_TABLE_ALLOWED,
  'th'       => Blacklist.new(%w{td th}) + HTML_TABLE_ALLOWED
}
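How a table like this could drive automatic closing can be sketched as follows. This is an illustration under assumptions, not the lexer's actual code: ToyWhitelist, ToyBlacklist, CLOSE_SELF and close_current? are hypothetical names, and the sketch assumes both list types answer allow?(name), with a whitelist allowing only its names and a blacklist allowing everything except its names.

```ruby
require 'set'

# Hypothetical stand-ins for Whitelist and Blacklist; the real classes may
# differ. A whitelist allows only the listed names.
class ToyWhitelist
  def initialize(names)
    @names = names.to_set
  end

  def allow?(name)
    @names.include?(name)
  end
end

# A blacklist allows every name except the listed ones.
class ToyBlacklist < ToyWhitelist
  def allow?(name)
    !@names.include?(name)
  end
end

# Two entries modelled on the table above (assumed semantics).
CLOSE_SELF = {
  'li' => ToyBlacklist.new(%w{li}),
  'tr' => ToyWhitelist.new(%w{td th script template})
}

# Returns true when the currently open element should be closed before an
# element called new_name is opened.
def close_current?(current, new_name)
  rule = CLOSE_SELF[current]

  !rule.nil? && !rule.allow?(new_name)
end

close_current?('li', 'li') # an open <li> is closed by the next <li>
close_current?('tr', 'td') # <td> is allowed inside <tr>, so nothing is closed
```

Under these assumptions, a Blacklist entry lists the tags that force the current element to close, while a Whitelist entry lists the only tags that do not.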
LITERAL_HTML_ELEMENTS =

Names of HTML tags of which the content should be lexed as-is.

Whitelist.new([HTML_SCRIPT, HTML_STYLE])

Constructor Details

#initialize(data, options = {}) ⇒ Lexer

Returns a new instance of Lexer

Parameters:

  • data (String|IO)

    The data to lex. This can either be a String or an IO instance.

  • options (Hash) (defaults to: {})

Options Hash (options):

  • :html (TrueClass|FalseClass)

    When set to true the lexer will treat the input as HTML instead of XML. This makes it possible to lex HTML void elements such as <link href="">.

  • :strict (TrueClass|FalseClass)

    Enables/disables strict parsing of XML documents, disabled by default.



# File 'lib/oga/xml/lexer.rb', line 115

def initialize(data, options = {})
  @data   = data
  @html   = options[:html]
  @strict = options[:strict] || false
  @line     = 1
  @elements = []
  reset_native
end

Instance Method Details

#advance {|type, value, line| ... } ⇒ Object

Advances through the input and generates the corresponding tokens. Each token is yielded to the supplied block.

Each token is yielded as three values in the following order:

[TYPE, VALUE, LINE]

The type is a Symbol, the value is either nil or a String, and the line is the line number on which the token was found.

This method stores the supplied block in @block and resets it after the lexer loop has finished.

Yield Parameters:

  • type (Symbol)
  • value (String)
  • line (Fixnum)


# File 'lib/oga/xml/lexer.rb', line 172

def advance(&block)
  @block = block

  read_data do |chunk|
    advance_native(chunk)
  end

  # Add any missing closing tags
  if !strict? and !@elements.empty?
    @elements.length.times { on_element_end }
  end
ensure
  @block = nil
end
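The "add any missing closing tags" step can be illustrated with a small standalone sketch. toy_advance is a hypothetical helper, not part of Oga; it takes pre-split tag events instead of real markup, and the token names are illustrative only.

```ruby
# Hypothetical sketch, not Oga's code: walks pre-split tag events, tracks open
# elements on a stack, and closes any leftovers unless strict is true.
def toy_advance(events, strict: false)
  open   = []
  tokens = []

  events.each do |kind, name|
    if kind == :open
      open << name
      tokens << [:T_ELEMENT_START, name]
    else
      open.pop
      tokens << [:T_ELEMENT_END, nil]
    end
  end

  # Mirrors the step above: one closing token per element that is still open.
  open.length.times { tokens << [:T_ELEMENT_END, nil] } unless strict

  tokens
end

# <a><b></b> with a missing </a>
events = [[:open, 'a'], [:open, 'b'], [:close, 'b']]

lenient = toy_advance(events)               # ends with an extra closing token for <a>
strict  = toy_advance(events, strict: true) # leaves <a> unclosed
```

In the sketch the permissive run produces one more closing token than the strict run, which is the difference the :strict option is described as controlling.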

#html? ⇒ TrueClass|FalseClass

Returns:

  • (TrueClass|FalseClass)


# File 'lib/oga/xml/lexer.rb', line 188

def html?
  @html == true
end

#html_script? ⇒ TrueClass|FalseClass

Returns:

  • (TrueClass|FalseClass)


# File 'lib/oga/xml/lexer.rb', line 198

def html_script?
  html? && current_element == HTML_SCRIPT
end

#html_style? ⇒ TrueClass|FalseClass

Returns:

  • (TrueClass|FalseClass)


# File 'lib/oga/xml/lexer.rb', line 203

def html_style?
  html? && current_element == HTML_STYLE
end

#lex ⇒ Array

Gathers all the tokens for the input and returns them as an Array.

Returns:

  • (Array)

# File 'lib/oga/xml/lexer.rb', line 147

def lex
  tokens = []

  advance do |type, value, line|
    tokens << [type, value, line]
  end

  tokens
end

#read_data {|| ... } ⇒ String

Yields the data to lex to the supplied block.

Yield Parameters:

  • (String)

Returns:

  • (String)


# File 'lib/oga/xml/lexer.rb', line 128

def read_data
  if @data.is_a?(String)
    yield @data

  # IO, StringIO, etc
  # THINK: read(N) would be nice, but currently this screws up the C code
  elsif @data.respond_to?(:each_line)
    @data.each_line { |line| yield line }

  # Enumerator, Array, etc
  elsif @data.respond_to?(:each)
    @data.each { |chunk| yield chunk }
  end
end
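The effect of this dispatch on different input types can be seen with a standalone copy. each_chunk is a hypothetical helper written for this example; it repeats the same three branches outside the class and returns an Enumerator when no block is given, which the original method does not do.

```ruby
require 'stringio'

# Hypothetical standalone version of the dispatch above, for illustration only.
def each_chunk(data)
  return enum_for(:each_chunk, data) unless block_given?

  if data.is_a?(String)
    # A String is handed over in one piece.
    yield data
  elsif data.respond_to?(:each_line)
    # IO-like objects are read one line at a time.
    data.each_line { |line| yield line }
  elsif data.respond_to?(:each)
    # Any other enumerable yields its entries as-is.
    data.each { |chunk| yield chunk }
  end
end

from_string = each_chunk("<a>\n</a>").to_a               # the whole String as one chunk
from_io     = each_chunk(StringIO.new("<a>\n</a>")).to_a # one chunk per line
from_array  = each_chunk(['<a>', '</a>']).to_a           # one chunk per Array entry
```

The String check comes first because String also responds to each_line; without it a String would be split into lines like an IO.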

#strict? ⇒ TrueClass|FalseClass

Returns:

  • (TrueClass|FalseClass)


# File 'lib/oga/xml/lexer.rb', line 193

def strict?
  @strict
end