Replace the tokenizer with a flex-based scanner (#3846)

* Lex everything except SGML, multiline, SHEBANG

* Prepend SHEBANG#! to tokens

* Support SGML tag/attribute extraction

* Multiline comments

* WIP cont'd; productionifying

* Compile before test

* Add extension to gemspec

* Add flex task to build lexer

* Reentrant extra data storage

* regenerate lexer

* use prefix

* rebuild lexer on linux

* Optimise a number of operations:

* Don't read and split the entire file if we only ever use the first/last n
  lines

* Only consider the first 50KiB when using heuristics/classifying.  This can
  save a *lot* of time; running a large number of regexes over 1MiB of text
  takes a while.

* Memoize File.size/read/stat; re-reading in a 500KiB file every time `data` is
  called adds up a lot.

* Use single regex for C++

* act like #lines

* [1][-2..-1] => nil, ffs

* k may not be set
This commit is contained in:
Ashe Connor
2017-10-31 11:06:56 +11:00
committed by GitHub
parent 21babbceb1
commit 99eaf5faf9
15 changed files with 8914 additions and 202 deletions

View File

@@ -10,8 +10,9 @@ Gem::Specification.new do |s|
s.homepage = "https://github.com/github/linguist"
s.license = "MIT"
s.files = Dir['lib/**/*'] + Dir['grammars/*'] + ['LICENSE']
s.files = Dir['lib/**/*'] + Dir['ext/**/*'] + Dir['grammars/*'] + ['LICENSE']
s.executables = ['linguist', 'git-linguist']
s.extensions = ['ext/linguist/extconf.rb']
s.add_dependency 'charlock_holmes', '~> 0.7.5'
s.add_dependency 'escape_utils', '~> 1.1.0'
@@ -19,6 +20,7 @@ Gem::Specification.new do |s|
s.add_dependency 'rugged', '>= 0.25.1'
s.add_development_dependency 'minitest', '>= 5.0'
s.add_development_dependency 'rake-compiler', '~> 0.9'
s.add_development_dependency 'mocha'
s.add_development_dependency 'plist', '~>3.1'
s.add_development_dependency 'pry'