tools/licenses/lib/README - mirrors/engine - Git at Google

 This license collection script is, fundamentally, one giant pile of
 special cases. As such, while there is an attempt to model the rules
 that apply to licenses and apply some sort of order to the process,
 the code is less than clear. This file attempts to provide an overview.

 main.dart is the core of the operation. It first walks the entire
 directory tree starting from the root of the repository (which is to
 be specified on the command line as the only argument), creating an
 in-memory representation of the project (make sure to run this only
 after you've run gclient sync, so that all dependencies are on disk).
 This is the step that is labeled "Preparing data structures".

 Then, it walks this in-memory representation, attempting to assign to
 each file one or more licenses. This is the step labeled "Collecting
 licenses", which takes a long time.

 Finally, it prints out these licenses.

 The in-memory representation is a tree of RepositoryEntry objects.
 There's three important types of these objects: RepositoryDirectory
 objects, which represent directories; RepositoryLicensedFile, which
 represents source files and resources that might end up in the binary,
 and RepositoryLicenseFile, which represents license files that do not
 themselves end up in the binary other than as a side-effect of this
 script.

 RepositoryDirectory objects contain three lists, the list of
 RepositoryDirectory subdirectories, the list of RepositoryLicensedFile
 children, and the list of RepositoryLicenseFile children.

 RepositoryDirectory objects are the objects that crawl the filesystem.

 While the script is pretty conservative (including probably more
 licenses than strictly necessary), it tries to avoid including
 material that isn't actually used. To do this, RepositoryDirectory
 objects only crawl directories and files for which shouldRecurse
 returns true. For example, shouldRecurse returns false for ".git"
 files. To simplify the configuration of such rules, the default is to
 apply the rules in the paths.dart file.

 Some directories and files require special handling, and have specific
 subclasses of the above classes. To create the appropriate objects,
 RepositoryDirectory calls createSubdirectory and createFile to create
 the nodes of the tree.


 The low-level handling of files is done by classes in filesystem.dart.
 This code supports transparently crawling into archives (e.g. .jar
 files), as well as handling UTF-8 vs latin1. It contains much magic
 and hard-coded file names and so on to handle distinguishing binary
 files from text files, and so forth.

 This code uses the cache described in cache.dart to try to avoid
 having to repeatedly reopen the same file many times in a row.


 In the case of a binary file, the license is found by crawling around
 the directory structure looking for a "default" license file. In the
 case of text files, though, it's often the case that the file itself
 mentions the license and therefore the file itself is inspected
 looking for copyright or license text. This scanning is done by
 determineLicensesFor() in licenses.dart.

 This function uses patterns that are themselves in patterns.dart. In
 this file we find all manner of long complicated and somewhat crazy
 regular expressions. This is where you see quite how absurd this work
 can actually be. It is left as an exercise to the reader to look for
 the implications of many of the regular expressions; as one example,
 though, consider the case of the pattern that matches the AFL/LGPL
 dual license statement: there is one file in which the ZIP code for
 the Free Software Foundation is off by one, for no clear reason,
 leading to the pattern ending with "MA 0211[01]-1307, USA".


 The license.dart file also contains the License object, the code that
 attempts to determine what copyrights apply to which licenses, and the
 code that attempts to identify the licenses themselves (at a high
 level), to make sure that appropriate clauses are followed (e.g.
 including the copyright with a BSD notice).


 In formatter.dart you will find the only part of this codebase that is
 actually tested at this time; this is the code that reformats blobs of
 text to remove comments and decorations and the like.


 The biggest problem with this script right now is that it is absurdly
 slow, largely because of the many, many applications of regular
 expressions in pattern.dart (especially the ones that use _linebreak).
 In regexp_debug.dart we have a wrapper around RegExp that attempts to
 quantify this cost, you can see the results if you run the script with
 `--verbose`. Long term, the right solution is to change from these
 all-in-one patterns to a world where we first tokenize each file, then
 apply word-by-word pattern matching to the tokenized input.
	This license collection script is, fundamentally, one giant pile of
	special cases. As such, while there is an attempt to model the rules
	that apply to licenses and apply some sort of order to the process,
	the code is less than clear. This file attempts to provide an overview.

	main.dart is the core of the operation. It first walks the entire
	directory tree starting from the root of the repository (which is to
	be specified on the command line as the only argument), creating an
	in-memory representation of the project (make sure to run this only
	after you've run gclient sync, so that all dependencies are on disk).
	This is the step that is labeled "Preparing data structures".

	Then, it walks this in-memory representation, attempting to assign to
	each file one or more licenses. This is the step labeled "Collecting
	licenses", which takes a long time.

	Finally, it prints out these licenses.

	The in-memory representation is a tree of RepositoryEntry objects.
	There's three important types of these objects: RepositoryDirectory
	objects, which represent directories; RepositoryLicensedFile, which
	represents source files and resources that might end up in the binary,
	and RepositoryLicenseFile, which represents license files that do not
	themselves end up in the binary other than as a side-effect of this
	script.

	RepositoryDirectory objects contain three lists, the list of
	RepositoryDirectory subdirectories, the list of RepositoryLicensedFile
	children, and the list of RepositoryLicenseFile children.

	RepositoryDirectory objects are the objects that crawl the filesystem.

	While the script is pretty conservative (including probably more
	licenses than strictly necessary), it tries to avoid including
	material that isn't actually used. To do this, RepositoryDirectory
	objects only crawl directories and files for which shouldRecurse
	returns true. For example, shouldRecurse returns false for ".git"
	files. To simplify the configuration of such rules, the default is to
	apply the rules in the paths.dart file.

	Some directories and files require special handling, and have specific
	subclasses of the above classes. To create the appropriate objects,
	RepositoryDirectory calls createSubdirectory and createFile to create
	the nodes of the tree.


	The low-level handling of files is done by classes in filesystem.dart.
	This code supports transparently crawling into archives (e.g. .jar
	files), as well as handling UTF-8 vs latin1. It contains much magic
	and hard-coded file names and so on to handle distinguishing binary
	files from text files, and so forth.

	This code uses the cache described in cache.dart to try to avoid
	having to repeatedly reopen the same file many times in a row.


	In the case of a binary file, the license is found by crawling around
	the directory structure looking for a "default" license file. In the
	case of text files, though, it's often the case that the file itself
	mentions the license and therefore the file itself is inspected
	looking for copyright or license text. This scanning is done by
	determineLicensesFor() in licenses.dart.

	This function uses patterns that are themselves in patterns.dart. In
	this file we find all manner of long complicated and somewhat crazy
	regular expressions. This is where you see quite how absurd this work
	can actually be. It is left as an exercise to the reader to look for
	the implications of many of the regular expressions; as one example,
	though, consider the case of the pattern that matches the AFL/LGPL
	dual license statement: there is one file in which the ZIP code for
	the Free Software Foundation is off by one, for no clear reason,
	leading to the pattern ending with "MA 0211[01]-1307, USA".


	The license.dart file also contains the License object, the code that
	attempts to determine what copyrights apply to which licenses, and the
	code that attempts to identify the licenses themselves (at a high
	level), to make sure that appropriate clauses are followed (e.g.
	including the copyright with a BSD notice).


	In formatter.dart you will find the only part of this codebase that is
	actually tested at this time; this is the code that reformats blobs of
	text to remove comments and decorations and the like.


	The biggest problem with this script right now is that it is absurdly
	slow, largely because of the many, many applications of regular
	expressions in pattern.dart (especially the ones that use _linebreak).
	In regexp_debug.dart we have a wrapper around RegExp that attempts to
	quantify this cost, you can see the results if you run the script with
	`--verbose`. Long term, the right solution is to change from these
	all-in-one patterns to a world where we first tokenize each file, then
	apply word-by-word pattern matching to the tokenized input.