Merge remote-tracking branch 'origin/master' into filename-matches-multiple-langages

* origin/master:
  Allow mime-types 2.x to be used with Linguist
  Upgrade to rugged 0.22.0b1
  Mention that languages need to be quite popular
  fix vendor/cache
  Gemfile.lock is nolonger considered generated
  Tests for BlobHelper#empty?
  remove reference to empty.js
  Remove more empty samples
  Bail earlier if the file is empty.
  Moving comments
  Use heuristics earlier to inform the rest of the classification process
  Removing inconsistency of `find_by_heuristics` (was sometimes returning nil and sometimes returning and empty array)
  Removing unused array of candidate languages.
  Reworking most heuristics to only return one match
This commit is contained in:
Brandon Keepers
2014-11-18 14:09:15 -05:00
14 changed files with 114 additions and 92 deletions

@@ -100,12 +100,8 @@ module Linguist
     def self.detect(blob)
       name = blob.name.to_s
 
-      # Check if the blob is possibly binary and bail early; this is a cheap
-      # test that uses the extension name to guess a binary binary mime type.
-      #
-      # We'll perform a more comprehensive test later which actually involves
-      # looking for binary characters in the blob
-      return nil if blob.likely_binary? || blob.binary?
+      # Bail early if the blob is binary or empty.
+      return nil if blob.likely_binary? || blob.binary? || blob.empty?
 
       # A bit of an elegant hack. If the file is executable but extensionless,
       # append a "magic" extension so it can be classified with other
@@ -124,16 +120,18 @@ module Linguist
       if possible_languages.length > 1
         data = blob.data
         possible_language_names = possible_languages.map(&:name)
+        heuristic_languages = Heuristics.find_by_heuristics(data, possible_language_names)
+
+        if heuristic_languages.size > 1
+          possible_language_names = heuristic_languages.map(&:name)
+        end
 
-        # Don't bother with binary contents or an empty file
-        if data.nil? || data == ""
-          nil
         # Check if there's a shebang line and use that as authoritative
-        elsif (result = find_by_shebang(data)) && !result.empty?
+        if (result = find_by_shebang(data)) && !result.empty?
           result.first
         # No shebang. Still more work to do. Try to find it with our heuristics.
-        elsif (determined = Heuristics.find_by_heuristics(data, possible_language_names)) && !determined.empty?
-          determined.first
+        elsif heuristic_languages.size == 1
+          heuristic_languages.first
         # Lastly, fall back to the probabilistic classifier.
         elsif classified = Classifier.classify(Samples.cache, data, possible_language_names).first
           # Return the actual Language object based of the string language name (i.e., first element of `#classify`)
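
The control flow after this change can be sketched in isolation. The following is a minimal, self-contained approximation and not Linguist's code: `Blob`, `SHEBANG_LANGUAGES`, the toy heuristic rules, and the alphabetical `classify` placeholder are all hypothetical stand-ins, and languages are plain strings rather than `Language` objects. It only illustrates the order of decisions: bail out on an empty blob, run the heuristics once up front, let them narrow the candidates when they return several matches, treat a shebang as authoritative, accept a single heuristic match, and otherwise fall back to a classifier.

```ruby
# Hypothetical stand-in for a blob; not Linguist's BlobHelper.
Blob = Struct.new(:name, :data) do
  def empty?
    data.nil? || data.empty?
  end
end

# Assumed interpreter-to-language table, for illustration only.
SHEBANG_LANGUAGES = { "ruby" => ["Ruby"], "perl" => ["Perl"] }.freeze

# Returns an array of language names; empty when there is no usable shebang.
def find_by_shebang(data)
  first_line = data.lines.first.to_s
  return [] unless first_line.start_with?("#!")
  interpreter = first_line.split("/").last.to_s.split.last
  SHEBANG_LANGUAGES.fetch(interpreter, [])
end

# Toy heuristics. Always returns an array (possibly empty), never nil.
def find_by_heuristics(data, candidates)
  matches = []
  matches << "Perl" if candidates.include?("Perl") && data.include?("use strict")
  matches << "Prolog" if candidates.include?("Prolog") && data.include?(":-")
  matches
end

# Placeholder for a probabilistic classifier: picks the alphabetically
# first candidate so the example stays deterministic.
def classify(_data, candidates)
  candidates.min
end

def detect(blob, candidates)
  # Bail early if the blob is empty.
  return nil if blob.empty?
  return candidates.first if candidates.length <= 1

  data = blob.data
  heuristic_languages = find_by_heuristics(data, candidates)

  # Several heuristic matches: use them to narrow the classifier's input.
  candidates = heuristic_languages if heuristic_languages.size > 1

  if (result = find_by_shebang(data)) && !result.empty?
    result.first                 # a shebang is authoritative
  elsif heuristic_languages.size == 1
    heuristic_languages.first    # exactly one heuristic match
  else
    classify(data, candidates)   # last resort
  end
end
```

Running the heuristics before the branch, as the diff does, means their result is available both for the single-match case and for narrowing what the classifier has to choose between.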