blob: 7874f4fa6f56d0e87d9a479a878f94263fc25199 [file] [log] [blame] [edit]
This license collection script is, fundamentally, one giant pile of
special cases. As such, while there is an attempt to model the rules
that apply to licenses and apply some sort of order to the process,
the code is less than clear. This file attempts to provide an overview.
main.dart is the core of the operation. It first walks the entire
directory tree starting from the root of the repository (which is to
be specified on the command line as the only argument), creating an
in-memory representation of the project (make sure to run this only
after you've run gclient sync, so that all dependencies are on disk).
This is the step that is labeled "Preparing data structures".
Then, it walks this in-memory representation, attempting to assign to
each file one or more licenses. This is the step labeled "Collecting
licenses", which takes a long time.
Finally, it prints out these licenses.
The in-memory representation is a tree of RepositoryEntry objects.
There's three important types of these objects: RepositoryDirectory
objects, which represent directories; RepositoryLicensedFile, which
represents source files and resources that might end up in the binary,
and RepositoryLicenseFile, which represents license files that do not
themselves end up in the binary other than as a side-effect of this
script.
RepositoryDirectory objects contain three lists, the list of
RepositoryDirectory subdirectories, the list of RepositoryLicensedFile
children, and the list of RepositoryLicenseFile children.
RepositoryDirectory objects are the objects that crawl the filesystem.
While the script is pretty conservative (including probably more
licenses than strictly necessary), it tries to avoid including
material that isn't actually used. To do this, RepositoryDirectory
objects only crawl directories and files for which shouldRecurse
returns true. For example, shouldRecurse returns false for ".git"
files. To simplify the configuration of such rules, the default is to
apply the rules in the paths.dart file.
Some directories and files require special handling, and have specific
subclasses of the above classes. To create the appropriate objects,
RepositoryDirectory calls createSubdirectory and createFile to create
the nodes of the tree.
The low-level handling of files is done by classes in filesystem.dart.
This code supports transparently crawling into archives (e.g. .jar
files), as well as handling UTF-8 vs latin1. It contains much magic
and hard-coded file names and so on to handle distinguishing binary
files from text files, and so forth.
This code uses the cache described in cache.dart to try to avoid
having to repeatedly reopen the same file many times in a row.
In the case of a binary file, the license is found by crawling around
the directory structure looking for a "default" license file. In the
case of text files, though, it's often the case that the file itself
mentions the license and therefore the file itself is inspected
looking for copyright or license text. This scanning is done by
determineLicensesFor() in licenses.dart.
This function uses patterns that are themselves in patterns.dart. In
this file we find all manner of long complicated and somewhat crazy
regular expressions. This is where you see quite how absurd this work
can actually be. It is left as an exercise to the reader to look for
the implications of many of the regular expressions; as one example,
though, consider the case of the pattern that matches the AFL/LGPL
dual license statement: there is one file in which the ZIP code for
the Free Software Foundation is off by one, for no clear reason,
leading to the pattern ending with "MA 0211[01]-1307, USA".
The license.dart file also contains the License object, the code that
attempts to determine what copyrights apply to which licenses, and the
code that attempts to identify the licenses themselves (at a high
level), to make sure that appropriate clauses are followed (e.g.
including the copyright with a BSD notice).
In formatter.dart you will find the only part of this codebase that is
actually tested at this time; this is the code that reformats blobs of
text to remove comments and decorations and the like.
The biggest problem with this script right now is that it is absurdly
slow, largely because of the many, many applications of regular
expressions in pattern.dart (especially the ones that use _linebreak).
In regexp_debug.dart we have a wrapper around RegExp that attempts to
quantify this cost, you can see the results if you run the script with
`--verbose`. Long term, the right solution is to change from these
all-in-one patterns to a world where we first tokenize each file, then
apply word-by-word pattern matching to the tokenized input.