Sample file mpq.d by Georg Lukas; license GPL 2.
Sample file counts.d by Kate Turner; public domain.
Sample file javascript-race.d by unknown; license MPL 1.1/GPL 2.0/LGPL 2.1.
Sample file probes.d by momjian; license TBD.
While XML is technically a markup language, in the majority of cases it
is just a serialization format for a tool (e.g., project files for IDEs)
rather than hand-authored markup. As such, it isn't useful to include it
in repository language statistics. For example, a C# project doesn't
care whether Visual Studio uses XML, JSON, or some other format to
serialize its project files.
Documentation is an important part of a software project but is not
generally thought of as part of the code for that project. Repository
language statistics are used to quantify the project's code, so it makes
sense to exclude documentation from those computations.
Documentation files are recognized similarly to vendored files.
lib/linguist/documentation.yml contains regular expressions to match
common names for documentation files. A new linguist-documentation Git
attribute can be used to override those conventions.
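
The matching described above could be sketched roughly as follows. This
is only an illustration in Python, not the project's actual code: the
patterns below are hypothetical stand-ins for whatever
lib/linguist/documentation.yml really contains, and the is_documentation
helper and its attribute parameter are made-up names that model an
explicit linguist-documentation override.

```python
import re

# Hypothetical patterns for illustration only; the real list lives in
# lib/linguist/documentation.yml and may differ.
DOCUMENTATION_PATTERNS = [
    r"^docs?/",
    r"(^|/)README(\.|$)",
    r"(^|/)CHANGELOG(\.|$)",
]


def is_documentation(path, attribute=None):
    """Return True if `path` should be treated as documentation.

    `attribute` models an explicit linguist-documentation override:
    True or False forces the answer, None falls back to the patterns.
    """
    if attribute is not None:
        return attribute
    return any(re.search(pattern, path) for pattern in DOCUMENTATION_PATTERNS)
```

Checking the override before the patterns means a path-specific setting
always wins over the naming conventions, which is the behavior the
attribute is meant to provide. In a repository, the attribute itself
would be set or unset per path in .gitattributes using Git's standard
attribute syntax.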
Originally, only "programming" languages were included in repository
language statistics. In 33ebee0f6a we started detecting a few selected
"markup" languages as well. We didn't include all "markup" languages
because, at the time, formats like Markdown and AsciiDoc were labeled as
"markup" languages, and we thought that including those prose (i.e.,
non-code) languages in repository statistics on github.com was
misleading for repositories that are largely about code but also
contain a lot of documentation (e.g., rails/rails).
This hand-picked set of whitelisted "markup" languages can cause strange
categorization for some repositories. For example, it includes CSS (and
some variants) but not HTML. This results in repositories that contain
the source code for a static website being classified as either a
JavaScript (programming) or CSS (markup) repository, with no mention of
HTML anywhere.
Fast-forward to today, and prose languages are no longer "markup"
languages; they're now "prose" languages. So now we can include all
"markup" languages in repository language statistics without worrying
about undesirable effects for documentation-heavy repositories.
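
To make the effect concrete, here is a rough Python sketch of
type-based filtering. It is an illustration under stated assumptions,
not the project's implementation: the language-to-type table, the
function names, and the byte counts in the usage note are all
hypothetical.

```python
from collections import defaultdict

# Hypothetical language-to-type table for illustration; the real
# classification is maintained by the project and may differ.
LANGUAGE_TYPES = {
    "JavaScript": "programming",
    "CSS": "markup",
    "HTML": "markup",
    "Markdown": "prose",
}

# Types counted toward repository statistics in this sketch.
INCLUDED_TYPES = {"programming", "markup"}


def language_stats(file_sizes):
    """Sum bytes per language, keeping only languages of an included type.

    `file_sizes` is a list of (language, size_in_bytes) pairs.
    """
    totals = defaultdict(int)
    for language, size in file_sizes:
        if LANGUAGE_TYPES.get(language) in INCLUDED_TYPES:
            totals[language] += size
    return dict(totals)


def primary_language(file_sizes):
    """Return the included language with the most bytes, or None."""
    stats = language_stats(file_sizes)
    return max(stats, key=stats.get) if stats else None
```

With made-up sizes such as HTML 9000, CSS 3000, JavaScript 2000, and
Markdown 20000 bytes, this sketch reports HTML as the primary language:
Markdown is skipped because it is typed as "prose", and HTML is counted
because every "markup" language is included rather than a hand-picked
subset.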