Merge pull request #3054 from Alhadis/srt

Add support for SubRip Text files and SRecode Templates
2025-10-29 17:50:22 +00:00 · 2016-06-20 07:35:31 +02:00
parent 96bd08e391 02fe28eb25
commit a9f366aed2
8 changed files with 337 additions and 0 deletions
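
Note on the `.srt` heuristic: both languages added here claim the `.srt` extension, and the rule added to lib/linguist/heuristics.rb chooses SubRip Text when the file contains a SubRip timing line. Below is a minimal standalone sketch of that check, not Linguist's own code. The helper name `guess_srt_language` and the two inline sample strings are assumptions made for illustration, and the helper returns a plain string where the real rule returns `Language["SubRip Text"]`.

```ruby
# Same pattern as the ".srt" rule in this commit. It matches a SubRip timing
# line such as "00:00:01,250 --> 00:00:03,740". In Ruby, ^ and $ anchor at
# line boundaries, so the timing line may appear anywhere in the file.
SUBRIP_TIMING = /^(\d{2}:\d{2}:\d{2},\d{3})\s*(-->)\s*(\d{2}:\d{2}:\d{2},\d{3})$/

# Hypothetical helper (not part of Linguist): returns a language name,
# or nil when the pattern gives no answer.
def guess_srt_language(data)
  "SubRip Text" if SUBRIP_TIMING.match(data)
end

# Abbreviated stand-ins for the two sample files added in this commit.
subrip  = "1\n00:00:01,250 --> 00:00:03,740\nAdding NCL language.\n"
srecode = ";;; linguist.srt --- Template for linguist-example-mode\n\nset mode \"default\"\n"

p guess_srt_language(subrip)   # "SubRip Text"
p guess_srt_language(srecode)  # nil
```

The rule only ever answers "SubRip Text". An SRecode template produces no match, so in this sketch the helper returns nil and the decision is left to whatever runs next.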
--- a/.gitmodules
+++ b/.gitmodules
@@ -743,3 +743,6 @@
 [submodule "vendor/grammars/language-turing"]
 	path = vendor/grammars/language-turing
 	url = https://github.com/Alhadis/language-turing
+[submodule "vendor/grammars/atom-language-srt"]
+	path = vendor/grammars/atom-language-srt
+	url = https://github.com/314eter/atom-language-srt
--- a/grammars.yml
+++ b/grammars.yml
@@ -180,6 +180,8 @@ vendor/grammars/atom-language-clean:
 - source.clean
 vendor/grammars/atom-language-purescript/:
 - source.purescript
+vendor/grammars/atom-language-srt:
+- text.srt
 vendor/grammars/atom-language-stan/:
 - source.stan
 vendor/grammars/atom-salt:
--- a/lib/linguist/heuristics.rb
+++ b/lib/linguist/heuristics.rb
@@ -391,6 +391,12 @@ module Linguist
      end
    end
    
+    disambiguate ".srt" do |data|
+      if /^(\d{2}:\d{2}:\d{2},\d{3})\s*(-->)\s*(\d{2}:\d{2}:\d{2},\d{3})$/.match(data)
+        Language["SubRip Text"]
+      end
+    end
+    
    disambiguate ".t" do |data|
      if /^\s*%|^\s*var\s+\w+\s*:\s*\w+/.match(data)
        Language["Turing"]
--- a/lib/linguist/languages.yml
+++ b/lib/linguist/languages.yml
@@ -3336,6 +3336,14 @@ SQLPL:
  - .sql
  - .db2

+SRecode Template:
+  type: markup
+  color: "#348a34"
+  tm_scope: source.lisp
+  ace_mode: lisp
+  extensions:
+  - .srt
+
 STON:
  type: data
  group: Smalltalk
@@ -3585,6 +3593,13 @@ Stylus:
  tm_scope: source.stylus
  ace_mode: stylus

+SubRip Text:
+  type: data
+  extensions:
+  - .srt
+  ace_mode: text
+  tm_scope: text.srt
+
 SuperCollider:
  type: programming
  color: "#46390b"
--- a/Template/linguist.srt
+++ b/Template/linguist.srt
@@ -0,0 +1,45 @@
+;;; linguist.srt --- Template for linguist-example-mode
+
+;; Not copyrighted whatsoever.
+;;
+;; GPL can bite my shiny metal ass.
+;;
+;; GitHub:   1
+;; Stallman: 0
+
+set mode "default"
+
+set comment_start ";"
+
+set LICENSE "It's public domain, baby. This was written for the sole
+purpose of the format's inclusion and recognition by GitHub Linguist.
+This block of multiline text was added because every other .srt file
+I could find was GPL-licensed and had long-winded copyright blobs in
+the file's header. Also, check out my sick line-wrapping abilities."
+
+set DOLLAR "$"
+
+context file
+
+
+template license
+----
+{{LICENSE:srecode-comment-prefix}}
+----
+
+
+template filecomment :file :user :time
+----
+{{comment_start}} {{FILENAME}} --- {{^}}
+{{comment_prefix}} YUO WAN GPL?
+{{comment_prefix}} 
+{{comment_prefix}} Copyright (C) {{YEAR}} {{?AUTHOR}}
+{{comment_prefix}}
+{{comment_prefix}} TUO BAD
+{{comment_prefix}} WE EXPAT PEOPLE
+{{comment_prefix}} {{EXPLETIVE}} YOU!
+{{>:copyright}}
+{{comment_end}}
+----
+
+;; end
--- a/Text/Adding.NCL.Language.S01E01.1080p.BluRay.x264.srt
+++ b/Text/Adding.NCL.Language.S01E01.1080p.BluRay.x264.srt
@@ -0,0 +1,240 @@
+1
+00:00:01,250 --> 00:00:03,740
+Adding NCL language.
+
+2
+00:00:04,600 --> 00:00:08,730
+Thanks for the pull request! Do you know if these files are NCL too?
+
+3
+00:00:09,800 --> 00:00:13,700
+Those are poorly-named documentation files for NCL functions.
+
+4
+00:00:14,560 --> 00:00:17,200
+- What's better?
+- This is better.
+
+5
+00:00:18,500 --> 00:00:23,000
+- Would it be correct to recognise these files as text?
+- Yes.
+
+6
+00:00:23,890 --> 00:00:30,000
+In that case, could you add "NCL" to the text entry in languages.yml too?
+
+7
+00:00:30,540 --> 00:00:35,250
+I added the example to "Text" and updated the license in the grammar submodule.
+
+8
+00:00:38,500 --> 00:00:42,360
+Cloning the submodule fails for me in local with this URL.
+
+9
+00:00:42,360 --> 00:00:45,250
+Could you use Git or HTTPS...?
+
+10
+00:00:46,810 --> 00:00:50,000
+I updated the grammar submodule link to HTTPS.
+
+11
+00:00:51,100 --> 00:00:57,000
+It's still failing locally. I don't think you can just update the .gitmodules file.
+
+12
+00:00:57,750 --> 00:01:03,000
+You'll probably have to remove the submodule and add it again to be sure.
+
+13
+00:01:04,336 --> 00:01:11,800
+- I'll see first if it's not an issue on my side...
+- I removed the submodule and added it back with HTTPS.
+
+14
+00:01:13,670 --> 00:01:18,000
+I tested the detection of NCL files with 2000 samples.
+
+15
+00:01:18,000 --> 00:01:25,000
+The Bayesian classifier doesn't seem to be very good at distinguishing text from NCL.
+
+16
+00:01:25,000 --> 00:01:30,740
+We could try to improve it by adding more samples, or we can define a new heuristic rule.
+
+17
+00:01:31,300 --> 00:01:36,200
+- Do you want me to send you the sample files?
+- Yes, please do.
+
+18
+00:01:37,500 --> 00:01:39,500
+In your inbox.
+
+19
+00:01:41,285 --> 00:01:48,216
+- So if I manually go through these and sort out the errors, would that help?
+- Not really.
+
+20
+00:01:48,540 --> 00:01:55,145
+It's a matter of keywords so there's not much to do there except for adding new samples.
+
+21
+00:01:55,447 --> 00:02:02,000
+If adding a few more samples doesn't improve things, we'll see how to define a new heuristic rule.
+
+22
+00:02:04,740 --> 00:02:09,600
+- I added quite a few NCL samples.
+- That's a bit over the top, isn't it?
+
+23
+00:02:10,250 --> 00:02:16,000
+We currently can't add too many samples because of #2117.
+
+24
+00:02:18,000 --> 00:02:20,830
+(sigh) I decreased the number of added samples.
+
+25
+00:02:21,630 --> 00:02:25,300
+Could you test the detection results in local with the samples I gave you?
+
+26
+00:02:26,000 --> 00:02:28,670
+- What is the command to run that test?
+- Here...
+
+27
+00:02:28,716 --> 00:02:38,650
+[Coding intensifies]
+
+28
+00:02:38,650 --> 00:02:43,330
+It is getting hung up on a false detection of Frege in one of the Text samples.
+
+29
+00:02:43,540 --> 00:02:46,115
+Do you have any suggestions for implementing a heuristic?
+
+30
+00:02:47,640 --> 00:02:55,200
+#2441 should fix this. In the meantime, you can change this in "test_heuristics.rb"
+
+31
+00:02:55,165 --> 00:02:57,240
+Why did you have to change this?
+
+32
+00:02:57,777 --> 00:03:04,480
+- It doesn't work for me unless I do that.
+- Hum, same for me. Arfon, does it work for you?
+
+33
+00:03:04,920 --> 00:03:08,830
+Requiring linguist/language doesn't work for me either.
+
+34
+00:03:09,300 --> 00:03:13,885
+We restructured some of the requires a while ago and I think this is just out-of-date code.
+
+35
+00:03:14,065 --> 00:03:20,950
+From a large sample of known NCL files taken from Github, it's now predicting with about 98% accuracy.
+
+36
+00:03:21,183 --> 00:03:28,000
+For a large sample of other files with the NCL extension, it is around 92%.
+
+37
+00:03:27,880 --> 00:03:30,950
+From those, nearly all of the errors come from one GitHub repository,
+
+38
+00:03:30,950 --> 00:03:34,160
+and they all contain the text strings, "The URL" and "The Title".
+
+39
+00:03:35,660 --> 00:03:43,260
+- Do you mean 92% files correctly identified as text?
+- Yes, it correctly identifies 92% as text.
+
+40
+00:03:44,000 --> 00:03:46,150
+I'd really like to see this dramatically reduced.
+
+41
+00:03:46,150 --> 00:03:51,150
+What happens if we reduce to around 5 NCL sample files?
+
+42
+00:03:51,150 --> 00:03:52,600
+Does Linguist still do a reasonable job?
+
+43
+00:03:53,470 --> 00:03:58,190
+I reduced it to 16 NCL samples and 8 text samples.
+
+44
+00:03:58,190 --> 00:04:01,720
+It correctly classifies my whole set of known NCL files.
+
+45
+00:04:01,870 --> 00:04:05,730
+I tried with 5 samples but could not get the same level of accuracy.
+
+46
+00:04:06,670 --> 00:04:10,400
+It incorrectly classifies all of the NCL files in this GitHub repository.
+
+47
+00:04:11,130 --> 00:04:14,660
+All of these files contain the text strings, "THE_URL:" and "THE_TITLE:".
+
+48
+00:04:14,660 --> 00:04:19,500
+It did not misclassify any other text-files with the extension NCL.
+
+49
+00:04:19,970 --> 00:04:25,188
+With 100% accuracy? Does that mean that the results are better with fewer samples??
+
+50
+00:04:25,610 --> 00:04:31,190
+I also removed a sample text-file which should have been classified as an NCL file.
+
+51
+00:04:31,000 --> 00:04:35,895
+I think that probably made most of the difference, although I didn't test it atomically.
+
+52
+00:04:35,895 --> 00:04:38,370
+Okay, that makes more sense.
+
+53
+00:04:39,515 --> 00:04:43,450
+I don't get the same results for the text files. Full results here.
+
+54
+00:04:44,650 --> 00:04:50,000
+They all look correctly classified to me, except for the ones in Fanghuan's repository.
+
+55
+00:04:50,000 --> 00:04:55,920
+I manually went through all of the ones where I didn't already know based on the filename or the repository owner.
+
+56
+00:04:56,526 --> 00:05:00,000
+[Presses button] It now correctly classifies all of my test files.
+
+57
+00:05:00,000 --> 00:05:05,970
+R. Pavlick, thanks for this. These changes will be live in the next release of Linguist. In the next couple of weeks.
+
+58
+00:05:05,970 --> 00:05:07,450
+Great! Thanks.
--- a/vendor/grammars/atom-language-srt
+++ b/vendor/grammars/atom-language-srt
--- a/vendor/licenses/grammar/atom-language-srt.txt
+++ b/vendor/licenses/grammar/atom-language-srt.txt
@@ -0,0 +1,25 @@
+---
+type: grammar
+name: atom-language-srt
+license: mit
+---
+Copyright (c) 2016 Pieter Goetschalckx
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.