Ashe Connor
5fbe9c0902
Allow classifier to run on symlinks as usual (#3948)
* Fixups for symlink detection, incl. test
* assert the heuristics return none for symlink
2018-01-08 09:01:16 +11:00
Ashe Connor
d4c2d83af9
Do not traverse symlinks in heuristics (#3946)
2017-12-12 21:53:36 +11:00
Ashe Connor
99eaf5faf9
Replace the tokenizer with a flex-based scanner (#3846)
* Lex everything except SGML, multiline, SHEBANG
* Prepend SHEBANG#! to tokens
* Support SGML tag/attribute extraction
* Multiline comments
* WIP cont'd; productionifying
* Compile before test
* Add extension to gemspec
* Add flex task to build lexer
* Reentrant extra data storage
* regenerate lexer
* use prefix
* rebuild lexer on linux
* Optimise a number of operations:
* Don't read and split the entire file if we only ever use the first/last n lines
* Only consider the first 50KiB when using heuristics/classifying. This can save a *lot* of time; running a large number of regexes over 1MiB of text takes a while.
* Memoize File.size/read/stat; re-reading in a 500KiB file every time `data` is called adds up a lot.
* Use single regex for C++
* act like #lines
* [1][-2..-1] => nil, ffs
* k may not be set
2017-10-31 11:06:56 +11:00
Colin Seymour
01de40faaa
Return early in Classifier.classify if no languages supplied (#3471)
* Return early if no languages supplied
There's no need to tokenise the data when attempting to classify without a limited language scope as no action will be performed when it comes to scoring anyway.
* Add test for empty languages array
2017-02-13 18:22:54 +00:00
Brandon Keepers
e42ccf0d82
docs
2014-11-27 11:40:48 -05:00
Brandon Keepers
bf4baff363
Move call method into existing Classifier class
2014-11-27 11:29:38 -05:00
Patrick Reynolds
bd4204b89e
fix refactoring from #836
2013-12-29 01:32:56 -06:00
Ted Nyman
a282b56f46
Fix debug method
2013-12-16 20:55:00 -08:00
Ted Nyman
6a8de63d2d
Nicer debug factoring
2013-12-14 15:24:26 -08:00
Ted Nyman
4f656c200b
Minor docs/naming
2013-11-15 18:42:53 -08:00
Ted Nyman
6a15ae47ee
Some space here
2013-07-07 14:07:03 -07:00
Joshua Peek
490afdddd1
some air
2013-06-10 10:37:55 -05:00
Joshua Peek
9822b153eb
ws
2013-06-10 10:36:56 -05:00
Patrick Reynolds
e7ac4e0a29
helpful comments
2013-06-06 17:04:28 -05:00
Patrick Reynolds
b275e53b08
use LINGUIST_DEBUG to debug the Bayesian filter
2013-06-06 16:54:18 -05:00
Pascal Borreli
70eafb2ffc
Fixed typos
2013-03-03 21:26:31 +00:00
Joshua Peek
bf944f6d1a
Make classify a function on the Classifier
2012-07-23 13:47:15 -05:00
Joshua Peek
0c9a947f39
Load classifier db into sample data hash
2012-07-23 13:13:52 -05:00
Joshua Peek
97ae7c1a11
Move classifier db to samples.yml
2012-07-23 13:05:08 -05:00
Joshua Peek
3172bf5b46
Remove gc for now
2012-07-23 12:23:20 -05:00
Joshua Peek
5b28336d56
Move db verification into tests
2012-07-23 12:21:26 -05:00
Joshua Peek
b7f58d96cb
Compare md5s of dbs
2012-07-23 12:17:32 -05:00
Joshua Peek
db88e143ba
Dump classifier as plain hash
2012-07-23 11:21:55 -05:00
Joshua Peek
95c0985952
Drop defaults in classifier hash
2012-07-23 10:46:54 -05:00
Joshua Peek
7292bdc180
Change Classifier to accept language name Strings
2012-07-20 15:52:27 -05:00
Joshua Peek
bc84a98b54
Set unused var to _
2012-07-20 15:43:23 -05:00
Joshua Peek
2637d8dc55
Add tokenize helper to Tokenize class
2012-07-20 15:14:58 -05:00
Joshua Peek
189a123760
Quote all YAML keys cause fuck it
2012-06-22 10:30:50 -05:00
Joshua Peek
a7108e4086
Fix calling to_yaml on 1.9
2012-06-22 10:15:14 -05:00
Joshua Peek
2b712dc790
Guard against classify nil data
2012-06-21 11:47:32 -05:00
Joshua Peek
0067f28246
YAML sucks
2012-06-20 16:54:29 -05:00
Joshua Peek
516a220d9f
Verify classifier counts
2012-06-20 15:48:46 -05:00
Joshua Peek
7bcf90c527
Skip gc step for now
2012-06-20 15:13:06 -05:00
Joshua Peek
4324971cea
Remove debug line
2012-06-20 14:11:23 -05:00
Joshua Peek
2672089154
Ensure language is loaded
2012-06-20 14:10:34 -05:00
Joshua Peek
5daaee88b4
Sort classifier yaml output
2012-06-20 12:50:05 -05:00
Joshua Peek
4484011f08
Switch to log probabilities to avoid float underflows
2012-06-19 16:33:29 -05:00
Joshua Peek
176f6483d0
Ensure token probability is less than 1.0
2012-06-19 15:26:56 -05:00
Joshua Peek
ddf3ec4a5b
Warn if classifier instance is out of date
2012-06-19 14:32:04 -05:00
Joshua Peek
d566b35020
Allow classifier languages to be scoped
2012-06-19 14:21:42 -05:00
Joshua Peek
8f85a447de
Allow tokens to be passed directly to classify
2012-06-19 14:17:27 -05:00
Joshua Peek
d0691988a9
More classifier docs
2012-06-19 14:15:10 -05:00
Joshua Peek
8a75d4d208
GC classifier db
2012-06-08 16:04:43 -05:00
Joshua Peek
8351d55c56
Don't crash if classifier data is missing
2012-06-08 14:46:06 -05:00
Joshua Peek
9ecab364d1
Dump classifier results
2012-06-08 14:13:26 -05:00
Joshua Peek
e5ae9c328b
Use language name as hash key
2012-06-08 13:43:57 -05:00
Joshua Peek
f747b49347
Add simple classifier
2012-06-07 17:10:28 -05:00