Handle case where newline chars don't transcode to detected encoding

We've seen binary files detected as encodings such as ISO-8859-8-I. This usually happens when the files are short: the detector is mistaken, but it has very little data to work with, so the error is understandable. In these cases, converting the ASCII newline characters to the detected encoding fails because no converter exists between the two encodings. We now simply assume that the data is all one line in those cases. In reality the data is binary, but that is obviously difficult to detect reliably.
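
The fallback described above can be sketched in isolation roughly as follows. This is a minimal illustration, not the Linguist code itself: `split_lines` is a hypothetical standalone helper, and it takes the detected encoding name as a plain argument instead of reading it from a blob.

```ruby
# Hypothetical helper (not part of Linguist): split `data` on newlines as
# they would appear in `encoding`, without re-encoding `data` itself.
def split_lines(data, encoding)
  # Transcode each newline sequence into the detected encoding, then relabel
  # the resulting bytes with data's encoding so the regexp can match them.
  encoded_newlines = ["\r\n", "\r", "\n"].map do |nl|
    nl.encode(encoding, "ASCII-8BIT").force_encoding(data.encoding)
  end

  # A limit of -1 keeps trailing empty strings.
  data.split(Regexp.union(encoded_newlines), -1)
rescue Encoding::ConverterNotFoundError
  # Ruby has no converter to the detected encoding, so the data cannot be
  # split. Treat it as one big line.
  [data]
end

p split_lines("a\nb\r\nc", "UTF-8")          # → ["a", "b", "c"]
p split_lines("%\xE8".b, "ISO-8859-8-I").size
```

Going by the failure this commit describes, the second call should hit the rescue branch and report a single line, because Ruby cannot transcode to ISO-8859-8-I. The byte string passed in is made up for the example and is not the content of the sample file.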
2025-10-29 17:50:22 +00:00 · 2014-06-03 12:21:07 -04:00
parent a5b6331ab5
commit aa5a94cc3e
3 changed files with 11 additions and 3 deletions
--- a/lib/linguist/blob_helper.rb
+++ b/lib/linguist/blob_helper.rb
@@ -256,10 +256,16 @@ module Linguist
          # without changing the encoding of `data`, and
          # also--importantly--without having to duplicate many (potentially
          # large) strings.
-          encoded_newlines = ["\r\n", "\r", "\n"].
-            map { |nl| nl.encode(encoding).force_encoding(data.encoding) }
+          begin
+            encoded_newlines = ["\r\n", "\r", "\n"].
+              map { |nl| nl.encode(encoding, "ASCII-8BIT").force_encoding(data.encoding) }

-          data.split(Regexp.union(encoded_newlines), -1)
+            data.split(Regexp.union(encoded_newlines), -1)
+          rescue Encoding::ConverterNotFoundError
+            # The data is not splittable in the detected encoding.  Assume it's
+            # one big line.
+            [data]
+          end
        else
          []
        end
--- a/samples/Text/iso8859-8-i.txt
+++ b/samples/Text/iso8859-8-i.txt
@@ -0,0 +1 @@
+%<25><><EFBFBD>
--- a/test/test_blob.rb
+++ b/test/test_blob.rb
@@ -97,6 +97,7 @@ class TestBlob < Test::Unit::TestCase
  def test_sloc
    assert_equal 2, blob("Ruby/foo.rb").sloc
    assert_equal 3, blob("Text/utf16le-windows.txt").sloc
+    assert_equal 1, blob("Text/iso8859-8-i.txt").sloc
  end

  def test_encoding