* Lex everything except SGML, multiline, SHEBANG
* Prepend SHEBANG#! to tokens
* Support SGML tag/attribute extraction
* Multiline comments
* WIP cont'd; productionifying
* Compile before test
* Add extension to gemspec
* Add flex task to build lexer
* Reentrant extra data storage
* regenerate lexer
* use prefix
* rebuild lexer on linux
* Optimise a number of operations:
  * Don't read and split the entire file if we only ever use the first/last n
    lines
  * Only consider the first 50KiB when using heuristics/classifying. This can
    save a *lot* of time; running a large number of regexes over 1MiB of text
    takes a while.
  * Memoize File.size/read/stat; re-reading in a 500KiB file every time `data` is
    called adds up a lot.
* Use single regex for C++
* act like #lines
* [1][-2..-1] => nil, ffs
* k may not be set
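
The optimisations listed above (capped reads, memoized size/data, first/last n lines) can be sketched roughly as follows. This is an illustration only, not the actual Linguist code: the `LazyBlob` class, its method names, and the `tail_bytes` parameter are hypothetical. It also shows why `[1][-2..-1] => nil` matters: `Array#last(n)` is used instead of a negative range.

```ruby
# Hypothetical sketch; LazyBlob and its methods are illustrative names,
# not part of Linguist.
class LazyBlob
  MAX_BYTES = 50 * 1024 # only the first 50KiB is read for classification

  def initialize(path)
    @path = path
  end

  # Memoized: the file size is looked up once.
  def size
    @size ||= File.size(@path)
  end

  # Memoized: at most MAX_BYTES are read from disk, and only once.
  def data
    @data ||= File.binread(@path, MAX_BYTES) || ""
  end

  # First n lines, read lazily instead of splitting the whole file.
  def first_lines(n)
    File.foreach(@path).first(n).map(&:chomp)
  end

  # Last n lines, taken from the final tail_bytes of the file (assumed to be
  # large enough to hold them; the first tail line may be partial otherwise).
  # Array#last(n) returns a shorter array when there are fewer than n lines,
  # whereas lines[-n..-1] returns nil in that case.
  def last_lines(n, tail_bytes = 4096)
    File.open(@path, "rb") do |f|
      f.seek([size - tail_bytes, 0].max)
      f.read.lines.last(n).map(&:chomp)
    end
  end
end
```
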
The purpose of this gem is to package up the language grammars that are
used for syntax highlighting on github.com. The grammars are TextMate,
Sublime Text, or Atom language grammars, converted to JSON and given the
filename SCOPE.json, where SCOPE is the language scope that the grammar
defines.

The github-linguist-grammars gem packages up all the grammars, and also
exports a Linguist::Grammars.path method to locate the directory
containing the grammars.
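
A minimal sketch of how a consumer might use that directory, assuming only the SCOPE.json naming described above. The `load_grammar` helper is hypothetical, and the require path in the trailing comment is an assumption.

```ruby
require "json"

# Hypothetical helper: load the grammar for a scope from a grammars
# directory, relying on the SCOPE.json file naming convention.
def load_grammar(grammars_dir, scope)
  file = File.join(grammars_dir, "#{scope}.json")
  return nil unless File.file?(file)

  JSON.parse(File.read(file))
end

# With the gem installed, this might be called as (require path assumed):
#   require "linguist/grammars"
#   load_grammar(Linguist::Grammars.path, "source.ruby")
```
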
To build the gem, run `rake build_grammars_gem`. The grammars.yml
file lists all the repositories we download grammars from, as well as
which scopes are defined by each repository. The
script/download-grammars script takes that list and downloads and
processes the grammars into the format expected by the gem.
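
As a rough illustration of the repository-to-scopes list, the sketch below inverts such a mapping into a scope-to-repository index. The YAML layout, the example URLs, and the `scope_index` helper are all assumptions made for this example, not taken from the real grammars.yml or download script.

```ruby
require "yaml"

# Hypothetical grammars.yml content; the layout (repository => list of
# scopes) and the URLs are assumed for illustration.
SAMPLE_GRAMMARS_YML = <<~YAML
  https://example.com/user/ruby-grammar:
  - source.ruby
  - text.html.erb
  https://example.com/user/go-grammar:
  - source.go
YAML

# Build a scope => repository index, failing if two repositories
# claim the same scope.
def scope_index(yaml_text)
  repos = YAML.safe_load(yaml_text)
  repos.each_with_object({}) do |(repo, scopes), index|
    scopes.each do |scope|
      raise ArgumentError, "#{scope} defined twice" if index.key?(scope)

      index[scope] = repo
    end
  end
end
```

An index like this makes it easy to check that every SCOPE.json file written to the gem comes from exactly one repository.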