47 Commits

Author SHA1 Message Date
Ashe Connor
5fbe9c0902 Allow classifier to run on symlinks as usual (#3948)
* Fixups for symlink detection, incl. test

* assert the heuristics return none for symlink
2018-01-08 09:01:16 +11:00
Ashe Connor
d4c2d83af9 Do not traverse symlinks in heuristics (#3946) 2017-12-12 21:53:36 +11:00
Ashe Connor
99eaf5faf9 Replace the tokenizer with a flex-based scanner (#3846)
* Lex everything except SGML, multiline, SHEBANG

* Prepend SHEBANG#! to tokens

* Support SGML tag/attribute extraction

* Multiline comments

* WIP cont'd; productionifying

* Compile before test

* Add extension to gemspec

* Add flex task to build lexer

* Reentrant extra data storage

* regenerate lexer

* use prefix

* rebuild lexer on linux

* Optimise a number of operations:

* Don't read and split the entire file if we only ever use the first/last n
  lines

* Only consider the first 50KiB when using heuristics/classifying.  This can
  save a *lot* of time; running a large number of regexes over 1MiB of text
  takes a while.

* Memoize File.size/read/stat; re-reading in a 500KiB file every time `data` is
  called adds up a lot.

* Use single regex for C++

* act like #lines

* [1][-2..-1] => nil, ffs

* k may not be set
2017-10-31 11:06:56 +11:00
Colin Seymour
01de40faaa Return early in Classifier.classify if no languages supplied (#3471)
* Return early if no languages supplied

There's no need to tokenise the data when attempting to classify without a limited language scope as no action will be performed when it comes to scoring anyway.

* Add test for empty languages array
2017-02-13 18:22:54 +00:00
Brandon Keepers
e42ccf0d82 docs 2014-11-27 11:40:48 -05:00
Brandon Keepers
bf4baff363 Move call method into existing Classifier class 2014-11-27 11:29:38 -05:00
Patrick Reynolds
bd4204b89e fix refactoring from #836 2013-12-29 01:32:56 -06:00
Ted Nyman
a282b56f46 Fix debug method 2013-12-16 20:55:00 -08:00
Ted Nyman
6a8de63d2d Nicer debug factoring 2013-12-14 15:24:26 -08:00
Ted Nyman
4f656c200b Minor docs/naming 2013-11-15 18:42:53 -08:00
Ted Nyman
6a15ae47ee Some space here 2013-07-07 14:07:03 -07:00
Joshua Peek
490afdddd1 some air 2013-06-10 10:37:55 -05:00
Joshua Peek
9822b153eb ws 2013-06-10 10:36:56 -05:00
Patrick Reynolds
e7ac4e0a29 helpful comments 2013-06-06 17:04:28 -05:00
Patrick Reynolds
b275e53b08 use LINGUIST_DEBUG to debug the Bayesian filter 2013-06-06 16:54:18 -05:00
Pascal Borreli
70eafb2ffc Fixed typos 2013-03-03 21:26:31 +00:00
Joshua Peek
bf944f6d1a Make classify a function on the Classifier 2012-07-23 13:47:15 -05:00
Joshua Peek
0c9a947f39 Load classifer db into sample data hash 2012-07-23 13:13:52 -05:00
Joshua Peek
97ae7c1a11 Move classifer db to samples.yml 2012-07-23 13:05:08 -05:00
Joshua Peek
3172bf5b46 Remove gc for now 2012-07-23 12:23:20 -05:00
Joshua Peek
5b28336d56 Move db verification into tests 2012-07-23 12:21:26 -05:00
Joshua Peek
b7f58d96cb Compare md5s of dbs 2012-07-23 12:17:32 -05:00
Joshua Peek
db88e143ba Dump classifier as plain hash 2012-07-23 11:21:55 -05:00
Joshua Peek
95c0985952 Drop defaults in classifier hash 2012-07-23 10:46:54 -05:00
Joshua Peek
7292bdc180 Change Classifier to accept language name Strings 2012-07-20 15:52:27 -05:00
Joshua Peek
bc84a98b54 Set unused var to _ 2012-07-20 15:43:23 -05:00
Joshua Peek
2637d8dc55 Add tokenize helper to Tokenize class 2012-07-20 15:14:58 -05:00
Joshua Peek
189a123760 Quote all YAML keys cause fuck it 2012-06-22 10:30:50 -05:00
Joshua Peek
a7108e4086 Fix calling to_yaml on 1.9 2012-06-22 10:15:14 -05:00
Joshua Peek
2b712dc790 Guard against classify nil data 2012-06-21 11:47:32 -05:00
Joshua Peek
0067f28246 YAML sucks 2012-06-20 16:54:29 -05:00
Joshua Peek
516a220d9f Verify classifer counts 2012-06-20 15:48:46 -05:00
Joshua Peek
7bcf90c527 Skip gc step for now 2012-06-20 15:13:06 -05:00
Joshua Peek
4324971cea Remove debug line 2012-06-20 14:11:23 -05:00
Joshua Peek
2672089154 Ensure language is loaded 2012-06-20 14:10:34 -05:00
Joshua Peek
5daaee88b4 Sort classifier yaml output 2012-06-20 12:50:05 -05:00
Joshua Peek
4484011f08 Switch to log probabilities to avoid float underflows 2012-06-19 16:33:29 -05:00
Joshua Peek
176f6483d0 Ensure token probability is less than 1.0 2012-06-19 15:26:56 -05:00
Joshua Peek
ddf3ec4a5b Warn if classifier instance is out of date 2012-06-19 14:32:04 -05:00
Joshua Peek
d566b35020 Allow classifer languages to be scoped 2012-06-19 14:21:42 -05:00
Joshua Peek
8f85a447de Allow tokens to be passed directly to classify 2012-06-19 14:17:27 -05:00
Joshua Peek
d0691988a9 More classifier docs 2012-06-19 14:15:10 -05:00
Joshua Peek
8a75d4d208 GC classifier db 2012-06-08 16:04:43 -05:00
Joshua Peek
8351d55c56 Don't crash if classifier data is missing 2012-06-08 14:46:06 -05:00
Joshua Peek
9ecab364d1 Dump classifier results 2012-06-08 14:13:26 -05:00
Joshua Peek
e5ae9c328b Use language name as hash key 2012-06-08 13:43:57 -05:00
Joshua Peek
f747b49347 Add simple classifier 2012-06-07 17:10:28 -05:00