Class: Oga::XML::Lexer
- Inherits: Object
  - Object
  - Oga::XML::Lexer
- Defined in: lib/oga/xml/lexer.rb
Overview
Low level lexer that supports both XML and HTML (using an extra option).
To lex HTML input, set the :html option to true when creating an instance of the lexer:
lexer = Oga::XML::Lexer.new('...', :html => true)
This lexer can process both String and IO instances. IO instances are processed on a line by line basis. This can greatly reduce memory usage in exchange for a slightly slower runtime.
Thread Safety
Because this class keeps track of internal state, the same instance cannot be used by multiple threads at the same time. For example, the following will not work reliably:
# Don't do this!
lexer = Oga::XML::Lexer.new('....')
threads = []

2.times do
  threads << Thread.new do
    lexer.advance do |*args|
      p args
    end
  end
end

threads.each(&:join)
However, it is perfectly safe to use a different instance per thread. This lexer does not use any global state.
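The one-instance-per-thread pattern can be sketched without Oga itself. ToyLexer below is a hypothetical stand-in written for this example (it is not part of Oga); like the real lexer it keeps mutable state on the instance, so each thread builds its own:

```ruby
# ToyLexer is a hypothetical stand-in for Oga::XML::Lexer, used only to
# illustrate the one-instance-per-thread pattern. It is not Oga's API.
class ToyLexer
  def initialize(data)
    @data = data
    @line = 1 # per-instance state, comparable to the real lexer's state
  end

  # Yields a type, a value and a line number for every line of input.
  def advance
    @data.each_line do |text|
      yield :T_TEXT, text.chomp, @line
      @line += 1
    end
  end
end

inputs = ["a\nb", "c\nd"]

# Each thread creates and uses its own lexer instance, so no state is shared.
threads = inputs.map do |input|
  Thread.new do
    tokens = []
    ToyLexer.new(input).advance { |*args| tokens << args }
    tokens
  end
end

# Thread#value joins the thread and returns the block's result.
results = threads.map(&:value)
```

With a real Oga lexer the same shape applies: call Oga::XML::Lexer.new inside the thread rather than outside it.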
Strict Mode
By default the lexer is rather permissive regarding the input. For example, missing closing tags are inserted by default. To disable this behaviour, run the lexer in “strict mode” by setting :strict to true:
lexer = Oga::XML::Lexer.new('...', :strict => true)
Strict mode only applies to XML documents.
Constant Summary
- HTML_SCRIPT =
These are all constant/frozen to remove the need for String allocations every time they are referenced in the lexer.
'script'.freeze
- HTML_STYLE =
'style'.freeze
- HTML_TABLE_ALLOWED =
Elements that are allowed directly in a <table> element.
Whitelist.new( %w{thead tbody tfoot tr caption colgroup col} )
- HTML_SCRIPT_ELEMENTS =
Whitelist.new(%w{script template})
- HTML_TABLE_ROW_ELEMENTS =
The elements that may occur in a thead, tbody, or tfoot.
Technically “th” is not allowed per the HTML5 spec, but it’s so commonly used in these elements that we allow it anyway.
Whitelist.new(%w{tr th}) + HTML_SCRIPT_ELEMENTS
- HTML_CLOSE_SELF =
Elements that should be closed automatically before a new opening tag is processed.
{
  'head' => Blacklist.new(%w{head body}),
  'body' => Blacklist.new(%w{head body}),
  'li' => Blacklist.new(%w{li}),
  'dt' => Blacklist.new(%w{dt dd}),
  'dd' => Blacklist.new(%w{dt dd}),
  'p' => Blacklist.new(%w{
    address article aside blockquote details div dl fieldset figcaption
    figure footer form h1 h2 h3 h4 h5 h6 header hgroup hr main menu nav ol p
    pre section table ul
  }),
  'rb' => Blacklist.new(%w{rb rt rtc rp}),
  'rt' => Blacklist.new(%w{rb rt rtc rp}),
  'rtc' => Blacklist.new(%w{rb rtc}),
  'rp' => Blacklist.new(%w{rb rt rtc rp}),
  'optgroup' => Blacklist.new(%w{optgroup}),
  'option' => Blacklist.new(%w{optgroup option}),
  'colgroup' => Whitelist.new(%w{col template}),
  'caption' => HTML_TABLE_ALLOWED.to_blacklist,
  'table' => HTML_TABLE_ALLOWED + HTML_SCRIPT_ELEMENTS,
  'thead' => HTML_TABLE_ROW_ELEMENTS,
  'tbody' => HTML_TABLE_ROW_ELEMENTS,
  'tfoot' => HTML_TABLE_ROW_ELEMENTS,
  'tr' => Whitelist.new(%w{td th}) + HTML_SCRIPT_ELEMENTS,
  'td' => Blacklist.new(%w{td th}) + HTML_TABLE_ALLOWED,
  'th' => Blacklist.new(%w{td th}) + HTML_TABLE_ALLOWED
}
- LITERAL_HTML_ELEMENTS =
Names of HTML tags of which the content should be lexed as-is.
Whitelist.new([HTML_SCRIPT, HTML_STYLE])
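How a table such as HTML_CLOSE_SELF might be consulted can be sketched as follows. The AllowList and DenyList classes, their allow? method and the close_before? helper are hypothetical simplifications written for this example. They only echo the Whitelist and Blacklist names used above and make no claim about how Oga implements them:

```ruby
require 'set'

# Hypothetical stand-in for a Whitelist value: only the listed names may
# appear inside the current element.
class AllowList
  def initialize(names)
    @names = Set.new(names)
  end

  def allow?(name)
    @names.include?(name)
  end
end

# Hypothetical stand-in for a Blacklist value: every name except the listed
# ones may appear inside the current element.
class DenyList
  def initialize(names)
    @names = Set.new(names)
  end

  def allow?(name)
    !@names.include?(name)
  end
end

# A small excerpt modelled on two HTML_CLOSE_SELF entries.
CLOSE_SELF = {
  'li' => DenyList.new(%w{li}),
  'tr' => AllowList.new(%w{td th script template})
}

# Returns true when the element named `current` should be closed before an
# element named `new_name` is opened.
def close_before?(current, new_name)
  list = CLOSE_SELF[current]

  !list.nil? && !list.allow?(new_name)
end

close_before?('li', 'li')  # => true: a new <li> closes the open <li>
close_before?('li', 'p')   # => false
close_before?('tr', 'div') # => true: only td, th, script and template stay in <tr>
close_before?('div', 'p')  # => false: there is no entry for div
```

Under this reading, a deny list names the tags that force the current element closed, while an allow list names the only tags that do not.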
Instance Method Summary
-
#advance {|type, value, line| ... } ⇒ Object
Advances through the input and generates the corresponding tokens.
-
#html? ⇒ TrueClass|FalseClass
-
#html_script? ⇒ TrueClass|FalseClass
-
#html_style? ⇒ TrueClass|FalseClass
-
#initialize(data, options = {}) ⇒ Lexer
constructor
A new instance of Lexer.
-
#lex ⇒ Array
Gathers all the tokens for the input and returns them as an Array.
-
#read_data {|| ... } ⇒ String
Yields the data to lex to the supplied block.
-
#strict? ⇒ TrueClass|FalseClass
Constructor Details
#initialize(data, options = {}) ⇒ Lexer
Returns a new instance of Lexer
# File 'lib/oga/xml/lexer.rb', line 115
def initialize(data, options = {})
  @data = data
  @html = options[:html]
  @strict = options[:strict] || false
  @line = 1
  @elements = []

  reset_native
end
Instance Method Details
#advance {|type, value, line| ... } ⇒ Object
Advances through the input and generates the corresponding tokens. Each token is yielded to the supplied block.
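Each token is passed to the block as three arguments, in the following order:
[TYPE, VALUE, LINE]
The type is a symbol and the value is either nil or a String. LINE is the line argument shown in the block signature; #lex stores it as the third element of each token.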
This method stores the supplied block in @block and resets it after the lexer loop has finished.
# File 'lib/oga/xml/lexer.rb', line 172
def advance(&block)
  @block = block

  read_data do |chunk|
    advance_native(chunk)
  end

  # Add any missing closing tags
  if !strict? and !@elements.empty?
    @elements.length.times { on_element_end }
  end
ensure
  @block = nil
end
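The control flow of #advance (store the block, emit tokens, close whatever is still open unless strict, always reset the block) can be tried in isolation. ToyElementLexer below is a hypothetical miniature written for this example; its token names and its open_element and close_element helpers are made up and do not correspond to Oga's native lexer:

```ruby
# Hypothetical miniature of the control flow in #advance. The :open/:close
# token types and the private helpers are invented for this sketch.
class ToyElementLexer
  attr_reader :block

  def initialize(strict = false)
    @strict   = strict
    @elements = []
    @block    = nil
  end

  def advance(names, &block)
    @block = block

    names.each { |name| open_element(name) }

    # Close whatever is still open, unless running in strict mode.
    if !@strict && !@elements.empty?
      @elements.length.times { close_element }
    end
  ensure
    # Runs even when the block raises, so no stale block is kept around.
    @block = nil
  end

  private

  def open_element(name)
    @elements << name
    @block.call(:open, name)
  end

  def close_element
    @block.call(:close, @elements.pop)
  end
end

permissive = []
ToyElementLexer.new.advance(%w{a b}) { |type, name| permissive << [type, name] }
# permissive is now [[:open, "a"], [:open, "b"], [:close, "b"], [:close, "a"]]

strict = []
ToyElementLexer.new(true).advance(%w{a b}) { |type, name| strict << [type, name] }
# strict is now [[:open, "a"], [:open, "b"]]
```

Placing the reset in the method-level ensure clause means the stored block is cleared whether the loop finishes normally or an exception propagates out of the block.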
#html? ⇒ TrueClass|FalseClass
# File 'lib/oga/xml/lexer.rb', line 188
def html?
  @html == true
end
#html_script? ⇒ TrueClass|FalseClass
# File 'lib/oga/xml/lexer.rb', line 198
def html_script?
  html? && current_element == HTML_SCRIPT
end
#html_style? ⇒ TrueClass|FalseClass
# File 'lib/oga/xml/lexer.rb', line 203
def html_style?
  html? && current_element == HTML_STYLE
end
#lex ⇒ Array
Gathers all the tokens for the input and returns them as an Array.
# File 'lib/oga/xml/lexer.rb', line 147
def lex
  tokens = []

  advance do |type, value, line|
    tokens << [type, value, line]
  end

  tokens
end
#read_data {|| ... } ⇒ String
Yields the data to lex to the supplied block.
# File 'lib/oga/xml/lexer.rb', line 128
def read_data
  if @data.is_a?(String)
    yield @data

  # IO, StringIO, etc
  # THINK: read(N) would be nice, but currently this screws up the C code
  elsif @data.respond_to?(:each_line)
    @data.each_line { |line| yield line }

  # Enumerator, Array, etc
  elsif @data.respond_to?(:each)
    @data.each { |chunk| yield chunk }
  end
end
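To see which chunks each kind of input produces, the same type dispatch can be run outside the lexer. chunks_for is a hypothetical helper written for this example; it mirrors the branches of #read_data but returns the chunks instead of yielding them:

```ruby
require 'stringio'

# Hypothetical helper mirroring the type dispatch of #read_data. It collects
# the chunks into an Array so the result is easy to inspect.
def chunks_for(data)
  chunks = []

  if data.is_a?(String)
    chunks << data                           # a String is one single chunk
  elsif data.respond_to?(:each_line)
    data.each_line { |line| chunks << line } # IO, StringIO: one chunk per line
  elsif data.respond_to?(:each)
    data.each { |chunk| chunks << chunk }    # Enumerator, Array: one per item
  end

  chunks
end

chunks_for("<a>\n</a>")               # => ["<a>\n</a>"]
chunks_for(StringIO.new("<a>\n</a>")) # => ["<a>\n", "</a>"]
chunks_for(['<a>', '</a>'])           # => ["<a>", "</a>"]
```

The String check comes first because String also responds to each_line; without it a String would be split into lines like an IO.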
#strict? ⇒ TrueClass|FalseClass
# File 'lib/oga/xml/lexer.rb', line 193
def strict?
  @strict
end