Update README and CONTRIBUTING documentation (#3955)

* Add more troubleshooting info

* Add more updates

* A lot more words and reformatting

* Few more tweaks

* Add how it works on GitHub.com

* More clarifications

* Feedback tweaks

* Add missing run

* Learn grammar
This commit is contained in:
Colin Seymour
2017-12-21 13:17:20 +00:00
committed by GitHub
parent d4c2d83af9
commit d0b906f128
2 changed files with 197 additions and 113 deletions

View File

@@ -4,64 +4,9 @@ Hi there! We're thrilled that you'd like to contribute to this project. Your hel
The majority of contributions won't need to touch any Ruby code at all.
## Adding an extension to a language
## Getting started
We try only to add new extensions once they have some usage on GitHub. In most cases we prefer that extensions be in use in hundreds of repositories before supporting them in Linguist.
To add support for a new extension:
1. Add your extension to the language entry in [`languages.yml`][languages], keeping the extensions in alphabetical order.
1. Add at least one sample for your extension to the [samples directory][samples] in the correct subdirectory.
1. Open a pull request, linking to a [GitHub search result](https://github.com/search?utf8=%E2%9C%93&q=extension%3Aboot+NOT+nothack&type=Code&ref=searchresults) showing in-the-wild usage.
In addition, if this extension is already listed in [`languages.yml`][languages] then sometimes a few more steps will need to be taken:
1. Make sure that example `.yourextension` files are present in the [samples directory][samples] for each language that uses `.yourextension`.
1. Test the performance of the Bayesian classifier with a relatively large number (1000s) of sample `.yourextension` files. (ping **@lildude** to help with this) to ensure we're not misclassifying files.
1. If the Bayesian classifier does a bad job with the sample `.yourextension` files then a [heuristic](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb) may need to be written to help.
## Adding a language
We try only to add languages once they have some usage on GitHub. In most cases we prefer that each new file extension be in use in hundreds of repositories before supporting them in Linguist.
To add support for a new language:
1. Add an entry for your language to [`languages.yml`][languages]. Omit the `language_id` field for now.
1. Add a grammar for your language: `script/add-grammar https://github.com/JaneSmith/MyGrammar`. Please only add grammars that have [one of these licenses][licenses].
1. Add samples for your language to the [samples directory][samples] in the correct subdirectory.
1. Add a `language_id` for your language using `script/set-language-ids`. **You should only ever need to run `script/set-language-ids --update`. Anything other than this risks breaking GitHub search :cry:**
1. Open a pull request, linking to a [GitHub search result](https://github.com/search?utf8=%E2%9C%93&q=extension%3Aboot+NOT+nothack&type=Code&ref=searchresults) showing in-the-wild usage.
In addition, if your new language defines an extension that's already listed in [`languages.yml`][languages] (such as `.foo`) then sometimes a few more steps will need to be taken:
1. Make sure that example `.foo` files are present in the [samples directory][samples] for each language that uses `.foo`.
1. Test the performance of the Bayesian classifier with a relatively large number (1000s) of sample `.foo` files. (ping **@lildude** to help with this) to ensure we're not misclassifying files.
1. If the Bayesian classifier does a bad job with the sample `.foo` files then a [heuristic](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb) may need to be written to help.
Remember, the goal here is to try and avoid false positives!
## Fixing a misclassified language
Most languages are detected by their file extension defined in [languages.yml][languages]. For disambiguating between files with common extensions, linguist applies some [heuristics](/lib/linguist/heuristics.rb) and a [statistical classifier](lib/linguist/classifier.rb). This process can help differentiate between, for example, `.h` files which could be either C, C++, or Obj-C.
Misclassifications can often be solved by either adding a new filename or extension for the language or adding more [samples][samples] to make the classifier smarter.
## Fixing syntax highlighting
Syntax highlighting in GitHub is performed using TextMate-compatible grammars. These are the same grammars that TextMate, Sublime Text and Atom use. Every language in [languages.yml][languages] is mapped to its corresponding TM `scope`. This scope will be used when picking up a grammar for highlighting.
Assuming your code is being detected as the right language, in most cases this is due to a bug in the language grammar rather than a bug in Linguist. [`grammars.yml`][grammars] lists all the grammars we use for syntax highlighting on github.com. Find the one corresponding to your code's programming language and submit a bug report upstream. If you can, try to reproduce the highlighting problem in the text editor that the grammar is designed for (TextMate, Sublime Text, or Atom) and include that information in your bug report.
You can also try to fix the bug yourself and submit a Pull Request. [TextMate's documentation](https://manual.macromates.com/en/language_grammars) offers a good introduction on how to work with TextMate-compatible grammars. You can test grammars using [Lightshow](https://github-lightshow.herokuapp.com).
Once the bug has been fixed upstream, we'll pick it up for GitHub in the next release of Linguist.
## Testing
For development you are going to want to checkout out the source. To get it, clone the repo and run [Bundler](http://gembundler.com/) to install its dependencies.
Before you can start contributing to Linguist, you'll need to set up your environment first. Clone the repo and run `script/bootstrap` to install its dependencies.
git clone https://github.com/github/linguist.git
cd linguist/
@@ -77,7 +22,91 @@ To run Linguist from the cloned repository:
bundle exec bin/linguist --breakdown
To run the tests:
### Dependencies
Linguist uses the [`charlock_holmes`](https://github.com/brianmario/charlock_holmes) character encoding detection library which in turn uses [ICU](http://site.icu-project.org/), and the libgit2 bindings for Ruby provided by [`rugged`](https://github.com/libgit2/rugged). These components have their own dependencies - `icu4c`, and `cmake` and `pkg-config` respectively - which you may need to install before you can install Linguist.
For example, on macOS with [Homebrew](http://brew.sh/): `brew install cmake pkg-config icu4c` and on Ubuntu: `apt-get install cmake pkg-config libicu-dev`.
## Adding an extension to a language
We try only to add new extensions once they have some usage on GitHub. In most cases we prefer that extensions be in use in hundreds of repositories before supporting them in Linguist.
To add support for a new extension:
1. Add your extension to the language entry in [`languages.yml`][languages], keeping the extensions in alphabetical and case-sensitive (uppercase before lowercase) order, with the exception of the primary extension; the primary extension should be first.
1. Add at least one sample for your extension to the [samples directory][samples] in the correct subdirectory. We'd prefer examples of real-world code showing common usage. The more representative of the structure of the language, the better.
1. Open a pull request, linking to a [GitHub search result](https://github.com/search?utf8=%E2%9C%93&q=extension%3Aboot+NOT+nothack&type=Code&ref=searchresults) showing in-the-wild usage.
If you are adding a sample, please state clearly the license covering the code in the sample, and if possible, link to the original source of the sample.
Additionally, if this extension is already listed in [`languages.yml`][languages] and associated with another language, then sometimes a few more steps will need to be taken:
1. Make sure that example `.yourextension` files are present in the [samples directory][samples] for each language that uses `.yourextension`.
1. Test the performance of the Bayesian classifier with a relatively large number (1000s) of sample `.yourextension` files. (ping **@lildude** to help with this) to ensure we're not misclassifying files.
1. If the Bayesian classifier does a bad job with the sample `.yourextension` files then a [heuristic](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb) may need to be written to help.
## Adding a language
We try only to add languages once they have some usage on GitHub. In most cases we prefer that each new file extension be in use in hundreds of repositories before supporting them in Linguist.
To add support for a new language:
1. Add an entry for your language to [`languages.yml`][languages]. Omit the `language_id` field for now.
1. Add a syntax-highlighting grammar for your language using: `script/add-grammar https://github.com/JaneSmith/MyGrammar`
This command will analyze the grammar and, if no problems are found, add it to the repository. If problems are found, please report them to the grammar maintainer as you will not be able to add the grammar if problems are found.
**Please only add grammars that have [one of these licenses][licenses].**
1. Add samples for your language to the [samples directory][samples] in the correct subdirectory.
1. Add a `language_id` for your language using `script/set-language-ids`.
**You should only ever need to run `script/set-language-ids --update`. Anything other than this risks breaking GitHub search :cry:**
1. Open a pull request, linking to a [GitHub search results](https://github.com/search?utf8=%E2%9C%93&q=extension%3Aboot+NOT+nothack&type=Code&ref=searchresults) showing in-the-wild usage.
Please state clearly the license covering the code in the samples. Link directly to the original source if possible.
In addition, if your new language defines an extension that's already listed in [`languages.yml`][languages] (such as `.foo`) then sometimes a few more steps will need to be taken:
1. Make sure that example `.foo` files are present in the [samples directory][samples] for each language that uses `.foo`.
1. Test the performance of the Bayesian classifier with a relatively large number (1000s) of sample `.foo` files. (ping **@lildude** to help with this) to ensure we're not misclassifying files.
1. If the Bayesian classifier does a bad job with the sample `.foo` files then a [heuristic](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb) may need to be written to help.
Remember, the goal here is to try and avoid false positives!
## Fixing a misclassified language
Most languages are detected by their file extension defined in [`languages.yml`][languages]. For disambiguating between files with common extensions, Linguist applies some [heuristics](/lib/linguist/heuristics.rb) and a [statistical classifier](lib/linguist/classifier.rb). This process can help differentiate between, for example, `.h` files which could be either C, C++, or Obj-C.
Misclassifications can often be solved by either adding a new filename or extension for the language or adding more [samples][samples] to make the classifier smarter.
## Fixing syntax highlighting
Syntax highlighting in GitHub is performed using TextMate-compatible grammars. These are the same grammars that TextMate, Sublime Text and Atom use. Every language in [`languages.yml`][languages] is mapped to its corresponding TextMate `scopeName`. This scope name will be used when picking up a grammar for highlighting.
Assuming your code is being detected as the right language, in most cases syntax highlighting problems are due to a bug in the language grammar rather than a bug in Linguist. [`vendor/README.md`][grammars] lists all the grammars we use for syntax highlighting on GitHub.com. Find the one corresponding to your code's programming language and submit a bug report upstream. If you can, try to reproduce the highlighting problem in the text editor that the grammar is designed for (TextMate, Sublime Text, or Atom) and include that information in your bug report.
You can also try to fix the bug yourself and submit a Pull Request. [TextMate's documentation](https://manual.macromates.com/en/language_grammars) offers a good introduction on how to work with TextMate-compatible grammars. You can test grammars using [Lightshow](https://github-lightshow.herokuapp.com).
Once the bug has been fixed upstream, we'll pick it up for GitHub in the next release of Linguist.
## Changing the source of a syntax highlighting grammar
We'd like to ensure Linguist and GitHub.com are using the latest and greatest grammars that are consistent with the current usage but understand that sometimes a grammar can lag behind the evolution of a language or even stop being developed. This often results in someone grasping the opportunity to create a newer and better and more actively maintained grammar, and we'd love to use it and pass on it's functionality to our users.
Switching the source of a grammar is really easy:
script/add-grammar --replace MyGrammar https://github.com/PeterPan/MyGrammar
This command will analyze the grammar and, if no problems are found, add it to the repository. If problems are found, please report these problems to the grammar maintainer as you will not be able to add the grammar if problems are found.
**Please only add grammars that have [one of these licenses][licenses].**
Please then open a pull request for the updated grammar.
## Testing
You can run the tests locally with:
bundle exec rake test
@@ -102,7 +131,7 @@ Linguist is maintained with :heart: by:
As Linguist is a production dependency for GitHub we have a couple of workflow restrictions:
- Anyone with commit rights can merge Pull Requests provided that there is a :+1: from a GitHub member of staff
- Anyone with commit rights can merge Pull Requests provided that there is a :+1: from a GitHub staff member.
- Releases are performed by GitHub staff so we can ensure GitHub.com always stays up to date with the latest release of Linguist and there are no regressions in production.
### Releasing
@@ -125,7 +154,7 @@ If you are the current maintainer of this gem:
1. Tag and push: `git tag vx.xx.xx; git push --tags`
1. Push to rubygems.org -- `gem push github-linguist-3.0.0.gem`
[grammars]: /grammars.yml
[grammars]: /vendor/README.md
[languages]: /lib/linguist/languages.yml
[licenses]: https://github.com/github/linguist/blob/257425141d4e2a5232786bf0b13c901ada075f93/vendor/licenses/config.yml#L2-L11
[samples]: /samples

161
README.md
View File

@@ -5,34 +5,124 @@
This library is used on GitHub.com to detect blob languages, ignore binary or vendored files, suppress generated files in diffs, and generate language breakdown graphs.
See [Troubleshooting](#troubleshooting) and [`CONTRIBUTING.md`](/CONTRIBUTING.md) before filing an issue or creating a pull request.
See [Troubleshooting](#troubleshooting) and [`CONTRIBUTING.md`](CONTRIBUTING.md) before filing an issue or creating a pull request.
## How Linguist works
Linguist takes the list of languages it knows from [`languages.yml`](/lib/linguist/languages.yml) and uses a number of methods to try and determine the language used by each file, and the overall repository breakdown.
Linguist starts by going through all the files in a repository and excludes all files that it determines to be binary data, [vendored code](#vendored-code), [generated code](#generated-code), [documentation](#documentation), or are defined as `data` (e.g. SQL) or `prose` (e.g. Markdown) languages, whilst taking into account any [overrides](#overrides).
If an [explicit language override](#using-gitattributes) has been used, that language is used for the matching files. The language of each remaining file is then determined using the following strategies, in order, with each step either identifying the precise language or reducing the number of likely languages passed down to the next strategy:
- Vim or Emacs modeline,
- commonly used filename,
- shell shebang,
- file extension,
- heuristics,
- naïve Bayesian classification
The result of this analysis is used to produce the language stats bar which displays the languages percentages for the files in the repository. The percentages are calculated based on the bytes of code for each language as reported by the [List Languages](https://developer.github.com/v3/repos/#list-languages) API.
![language stats bar](https://cloud.githubusercontent.com/assets/173/5562290/48e24654-8ddf-11e4-8fe7-735b0ce3a0d3.png)
### How Linguist works on GitHub.com
When you push changes to a repository on GitHub.com, a low priority background job is enqueued to analyze your repository as explained above. The results of this analysis are cached for the lifetime of your repository and are only updated when the repository is updated. As this analysis is performed by a low priority background job, it can take a while, particularly during busy periods, for your language statistics bar to reflect your changes.
## Usage
Install the gem:
$ gem install github-linguist
#### Dependencies
Linguist uses the [`charlock_holmes`](https://github.com/brianmario/charlock_holmes) character encoding detection library which in turn uses [ICU](http://site.icu-project.org/), and the libgit2 bindings for Ruby provided by [`rugged`](https://github.com/libgit2/rugged). These components have their own dependencies - `icu4c`, and `cmake` and `pkg-config` respectively - which you may need to install before you can install Linguist.
For example, on macOS with [Homebrew](http://brew.sh/): `brew install cmake pkg-config icu4c` and on Ubuntu: `apt-get install cmake pkg-config libicu-dev`.
### Application usage
Linguist can be used in your application as follows:
```ruby
require 'rugged'
require 'linguist'
repo = Rugged::Repository.new('.')
project = Linguist::Repository.new(repo, repo.head.target_id)
project.language #=> "Ruby"
project.languages #=> { "Ruby" => 119387 }
```
### Command line usage
A repository's languages stats can also be assessed from the command line using the `linguist` executable. Without any options, `linguist` will output the breakdown that correlates to what is shown in the language stats bar. The `--breakdown` flag will additionally show the breakdown of files by language.
You can try running `linguist` on the root directory in this repository itself:
```console
$ bundle exec bin/linguist --breakdown
68.57% Ruby
22.90% C
6.93% Go
1.21% Lex
0.39% Shell
Ruby:
Gemfile
Rakefile
bin/git-linguist
bin/linguist
ext/linguist/extconf.rb
github-linguist.gemspec
lib/linguist.rb
```
## Troubleshooting
### My repository is detected as the wrong language
![language stats bar](https://cloud.githubusercontent.com/assets/173/5562290/48e24654-8ddf-11e4-8fe7-735b0ce3a0d3.png)
If the language stats bar is reporting a language that you don't expect:
The Language stats bar displays languages percentages for the files in the repository. The percentages are calculated based on the bytes of code for each language as reported by the [List Languages](https://developer.github.com/v3/repos/#list-languages) API. If the bar is reporting a language that you don't expect:
1. Click on the name of the language in the stats bar to see a list of the files that are identified as that language.
1. If you see files that you didn't write, consider moving the files into one of the [paths for vendored code](/lib/linguist/vendor.yml), or use the [manual overrides](#overrides) feature to ignore them.
1. If the files are being misclassified, search for [open issues][issues] to see if anyone else has already reported the issue. Any information you can add, especially links to public repositories, is helpful.
1. Click on the name of the language in the stats bar to see a list of the files that are identified as that language.
Keep in mind this performs a search so the [code search restrictions](https://help.github.com/articles/searching-code/#considerations-for-code-search) may result in files identified in the language statistics not appearing in the search results. [Installing Linguist locally](#usage) and running it from the [command line](#command-line-usage) will give you accurate results.
1. If you see files that you didn't write in the search results, consider moving the files into one of the [paths for vendored code](/lib/linguist/vendor.yml), or use the [manual overrides](#overrides) feature to ignore them.
1. If the files are misclassified, search for [open issues][issues] to see if anyone else has already reported the issue. Any information you can add, especially links to public repositories, is helpful. You can also use the [manual overrides](#overrides) feature to correctly classify them in your repository.
1. If there are no reported issues of this misclassification, [open an issue][new-issue] and include a link to the repository or a sample of the code that is being misclassified.
Keep in mind that the repository language stats are only [updated when you push changes](#how-linguist-works-on-github-com), and the results are cached for the lifetime of your repository. If you have not made any changes to your repository in a while, you may find pushing another change will correct the stats.
### My repository isn't showing my language
Linguist does not consider [vendored code](#vendored-code), [generated code](#generated-code), [documentation](#documentation), or `data` (e.g. SQL) or `prose` (e.g. Markdown) languages (as defined by the `type` attribute in [`languages.yml`](/lib/linguist/languages.yml)) when calculating the repository language statistics.
If the language statistics bar is not showing your language at all, it could be for a few reasons:
1. Linguist doesn't know about your language.
1. The extension you have chosen is not associated with your language in [`languages.yml`](/lib/linguist/languages.yml).
1. All the files in your repository fall into one of the categories listed above that Linguist excludes by default.
If Linguist doesn't know about the language or the extension you're using, consider [contributing](CONTRIBUTING.md) to Linguist by opening a pull request to add support for your language or extension. For everything else, you can use the [manual overrides](#overrides) feature to tell Linguist to include your files in the language statistics.
### There's a problem with the syntax highlighting of a file
Linguist detects the language of a file but the actual syntax-highlighting is powered by a set of language grammars which are included in this project as a set of submodules [and may be found here](https://github.com/github/linguist/blob/master/vendor/README.md).
Linguist detects the language of a file but the actual syntax-highlighting is powered by a set of language grammars which are included in this project as a set of submodules [as listed here](/vendor/README.md).
If you experience an issue with the syntax-highlighting on GitHub, **please report the issue to the upstream grammar repository, not here.** Grammars are updated every time we build the Linguist gem so upstream bug fixes are automatically incorporated as they are fixed.
If you experience an issue with the syntax-highlighting on GitHub, **please report the issue to the upstream grammar repository, not here.** Grammars are updated every time we build the Linguist gem and so upstream bug fixes are automatically incorporated as they are fixed.
## Overrides
Linguist supports a number of different custom overrides strategies for language definitions and vendored paths.
Linguist supports a number of different custom override strategies for language definitions and file paths.
### Using gitattributes
Add a `.gitattributes` file to your project and use standard git-style path matchers for the files you want to override to set `linguist-documentation`, `linguist-language`, `linguist-vendored`, and `linguist-generated`. `.gitattributes` will be used to determine language statistics and will be used to syntax highlight files. You can also manually set syntax highlighting using [Vim or Emacs modelines](#using-emacs-or-vim-modelines).
Add a `.gitattributes` file to your project and use standard git-style path matchers for the files you want to override using the `linguist-documentation`, `linguist-language`, `linguist-vendored`, and `linguist-generated` attributes. `.gitattributes` will be used to determine language statistics and will be used to syntax highlight files. You can also manually set syntax highlighting using [Vim or Emacs modelines](#using-emacs-or-vim-modelines).
```
$ cat .gitattributes
@@ -41,7 +131,7 @@ $ cat .gitattributes
#### Vendored code
Checking code you didn't write, such as JavaScript libraries, into your git repo is a common practice, but this often inflates your project's language stats and may even cause your project to be labeled as another language. By default, Linguist treats all of the paths defined in [lib/linguist/vendor.yml](https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml) as vendored and therefore doesn't include them in the language statistics for a repository.
Checking code you didn't write, such as JavaScript libraries, into your git repo is a common practice, but this often inflates your project's language stats and may even cause your project to be labeled as another language. By default, Linguist treats all of the paths defined in [`vendor.yml`](/lib/linguist/vendor.yml) as vendored and therefore doesn't include them in the language statistics for a repository.
Use the `linguist-vendored` attribute to vendor or un-vendor paths.
@@ -53,7 +143,7 @@ jquery.js linguist-vendored=false
#### Documentation
Just like vendored files, Linguist excludes documentation files from your project's language stats. [lib/linguist/documentation.yml](lib/linguist/documentation.yml) lists common documentation paths and excludes them from the language statistics for your repository.
Just like vendored files, Linguist excludes documentation files from your project's language stats. [`documentation.yml`](/lib/linguist/documentation.yml) lists common documentation paths and excludes them from the language statistics for your repository.
Use the `linguist-documentation` attribute to mark or unmark paths as documentation.
@@ -65,7 +155,9 @@ docs/formatter.rb linguist-documentation=false
#### Generated code
Not all plain text files are true source files. Generated files like minified js and compiled CoffeeScript can be detected and excluded from language stats. As an added bonus, unlike vendored and documentation files, these files are suppressed in diffs.
Not all plain text files are true source files. Generated files like minified JavaScript and compiled CoffeeScript can be detected and excluded from language stats. As an added bonus, unlike vendored and documentation files, these files are suppressed in diffs. [`generated.rb`](/lib/linguist/generated.rb) lists common generated paths and excludes them from the language statistics of your repository.
Use the `linguist-generated` attribute to mark or unmark paths as generated.
```
$ cat .gitattributes
@@ -90,52 +182,15 @@ vim: set ft=cpp:
-*- mode: php;-*-
```
## Usage
Install the gem:
```
$ gem install github-linguist
```
Then use it in your application:
```ruby
require 'rugged'
require 'linguist'
repo = Rugged::Repository.new('.')
project = Linguist::Repository.new(repo, repo.head.target_id)
project.language #=> "Ruby"
project.languages #=> { "Ruby" => 119387 }
```
These stats are also printed out by the `linguist` executable. You can use the
`--breakdown` flag, and the binary will also output the breakdown of files by language.
You can try running `linguist` on the root directory in this repository itself:
```
$ bundle exec linguist --breakdown
100.00% Ruby
Ruby:
Gemfile
Rakefile
bin/linguist
github-linguist.gemspec
lib/linguist.rb
```
## Contributing
Please check out our [contributing guidelines](CONTRIBUTING.md).
## License
The language grammars included in this gem are covered by their repositories'
respective licenses. `grammars.yml` specifies the repository for each grammar.
respective licenses. [`vendor/README.md`](/vendor/README.md) lists the repository for each grammar.
All other files are covered by the MIT license, see `LICENSE`.