blob: 3699c766ce0cf48ad5b6670b1241708b2a1fd5a7 [file] [log] [blame] [edit]
This license collection script is, fundamentally, one giant pile of
special cases. As such, while there is an attempt to model the rules
that apply to licenses and apply some sort of order to the process,
the code is less than clear. This file attempts to provide an overview.
main.dart is the core of the operation. It first walks the entire
directory tree starting from the root of the repository (which is to
be specified on the command line as the only argument), creating an
in-memory representation of the project (make sure to run this only
after you've run gclient sync, so that all dependencies are on disk).
This is the step that is labeled "Preparing data structures".
Then, it walks this in-memory representation, attempting to assign to
each file one or more licenses. This is the step labeled "Collecting
licenses", which takes a long time.
Finally, it prints out these licenses.
The in-memory representation is a tree of RepositoryEntry objects.
There's three important types of these objects: RepositoryDirectory
objects, which represent directories; RepositoryLicensedFile, which
represents source files and resources that might end up in the binary,
and RepositoryLicenseFile, which represents license files that do not
themselves end up in the binary other than as a side-effect of this
script.
RepositoryDirectory objects contain three lists, the list of
RepositoryDirectory subdirectories, the list of RepositoryLicensedFile
children, and the list of RepositoryLicenseFile children.
RepositoryDirectory objects are the objects that crawl the filesystem.
While the script is pretty conservative (including probably more
licenses than strictly necessary), it tries to avoid including
material that isn't actually used. To do this, RepositoryDirectory
objects only crawl directories and files for which shouldRecurse
returns true. For example, shouldRecurse returns false for ".git"
files.
Some directories and files require special handling, and have specific
subclasses of the above classes. To create the appropriate objects,
RepositoryDirectory calls createSubdirectory and createFile to create
the nodes of the tree.
The low-level handling of files is done by classes in filesystem.dart.
This code supports transparently crawling into archives (e.g. .jar
files), as well as handling UTF-8 vs latin1. It contains much magic
and hard-coded file names and so on to handle distinguishing binary
files from text files, and so forth.
This code uses the cache described in cache.dart to try to avoid
having to repeatedly reopen the same file many times in a row.
In the case of a binary file, the license is found by crawling around
the directory structure looking for a "default" license file. In the
case of text files, though, it's often the case that the file itself
mentions the license and therefore the file itself is inspected
looking for copyright or license text. This scanning is done by
determineLicensesFor() in licenses.dart.
This function uses patterns that are themselves in patterns.dart. In
this file we find all manner of long complicated and somewhat crazy
regular expressions. This is where you see quite how absurd this work
can actually be. It is left as an exercise to the reader to look for
the implications of many of the regular expressions; as one example,
though, consider the case of the pattern that matches the AFL/LGPL
dual license statement: there is one file in which the ZIP code for
the Free Software Foundation is off by one, for no clear reason,
leading to the pattern ending with "MA 0211[01]-1307, USA".
The license.dart file also contains the License object, the currently
simplistic normalizer (_reformat) for license text (which mostly just
removes comment syntax), the code that attempts to determine what
copyrights apply to which licenses, and the code that attempts to
identify the licenses themselves (at a high level), to make sure that
appropriate clauses are followed (e.g. including the copyright with a
BSD notice).