This CI is used on top of (not as a replacement for) AOSP's TreeHugger. It gives early testing signals, plus coverage on other OSes and older Android devices that TreeHugger does not support.
See the Testing page for more details about the project testing strategy.
There are four major components:
They are coupled via the Firebase DB. The DB is the source of truth for the whole CI.
The Controller orchestrates the CI. It's the most trusted piece of the system.
It is based on a background AppEngine service. This service is triggered only by deferred tasks and periodic Cron jobs.
The Controller is the only entity that performs authenticated access to Gerrit. It uses a non-privileged gmail account and has no meaningful voting power.
The controller loop mainly does the following:
N new jobs in the database, one for each configuration defined in config.py (e.g. linux-debug, android-release, ...).
Presubmit-Ready.
The frontend is an AppEngine service that hosts the CI website at ci.perfetto.dev. Unlike the Controller, it is exposed to the public via HTTP.
The actual testing job happens inside these Google Compute Engine VMs. The GCE instance is running a CrOS-based Container-Optimized OS.
The whole system image is read-only. The VM itself is stateless. No state is persisted outside of the DB and Google Cloud Storage (only for UI artifacts). The SSD is used only as a scratch disk and is cleared on each reboot.
VMs are dynamically spawned using the Google Cloud Autoscaler, which uses a Stackdriver Custom Metric pushed by the Controller as its cost function. This metric is the number of queued + running jobs.
Each VM runs two types of Docker containers: the worker and the sandbox. They are in a 1:1 relationship: each worker controls at most one associated sandbox. Workers are always alive (they work in polling mode), while sandboxes are started and stopped by the worker on demand.
On each GCE instance there are M (currently 10) worker containers running and hence up to M sandboxes.
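The scaling signal described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the `vms_needed` estimate assumes one job per worker, which is an inference from the 1:1 worker/sandbox relationship rather than a documented autoscaler target.

```python
import math

# "M" in the text: worker containers per GCE instance (currently 10).
WORKERS_PER_VM = 10


def ci_load(jobs_queued: dict, jobs_running: dict) -> int:
    """Number of queued + running jobs, i.e. the metric described above."""
    return len(jobs_queued) + len(jobs_running)


def vms_needed(load: int, workers_per_vm: int = WORKERS_PER_VM) -> int:
    """Hypothetical capacity estimate, assuming each worker runs one job."""
    return math.ceil(load / workers_per_vm)


# Only the keys matter in these sub-trees; the values are always 0.
queued = {"20190708153242--cls-1000515-65--android-clang-arm-debug": 0}
running = {"20190707123422--cls-1000515-66--android-clang-arm-rel": 0}

load = ci_load(queued, running)
print(load, vms_needed(load))  # prints "2 1"
```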
Worker containers are trusted entities. They can impersonate the GCE service account and have R/W access to the DB. They can also spawn sandbox containers.
Their behavior depends only on code that is manually deployed and doesn't depend on the checkout under test. The reason why workers are Docker containers is NOT security but only reproducibility and maintenance.
Each worker does the following:
/jobs_queued sub-tree of the DB.
/jobs_running.
/logs sub-tree of the DB.
/jobs_running and updating the /jobs/$jobId/status fields.
Sandbox containers are untrusted entities. They can access the internet (for git pull / install-build-deps) but they cannot impersonate the GCE service account, write into the DB, or write into GCS buckets. Docker is used here both as an isolation boundary and for reproducibility / debugging.
Each sandbox does the following:
A sandbox container is almost completely stateless with the only exception of the semi-ephemeral /ci/cache mount-point. This mount-point is tmpfs-based (hence cleared on reboot) but is shared across all sandboxes. It's used only to maintain the shared ccache.
The whole CI is based on Firebase Realtime DB. It is a high-scale JSON object accessible via a simple REST API. Clients can GET/PUT/PATCH/DELETE individual sub-nodes without having a local full-copy of the DB.
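As a rough sketch of what sub-node access over the REST API looks like, the helper below builds (but does not send) a request for a single node. The database URL and the `db_request` helper are hypothetical; the only Firebase convention relied on is that a node is addressed by appending `.json` to its path.

```python
import json
import urllib.request

# Hypothetical database URL; the real one is not given in this document.
DB_URL = "https://example-ci.firebaseio.com"


def db_request(method, path, body=None):
    """Build a REST request for one sub-node of the DB, without sending it."""
    url = f"{DB_URL}/{path.strip('/')}.json"
    data = None if body is None else json.dumps(body).encode("utf-8")
    req = urllib.request.Request(url, data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


# PATCH touches only the listed fields of the job node; the rest of the
# node (and of the DB) is left alone.
job_id = "20190708153242--cls-1000515-65--android-clang-arm-debug"
req = db_request("PATCH", f"/jobs/{job_id}",
                 {"status": "STARTED", "worker": "zqz2-worker-2"})
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would perform the call; authentication is deliberately left out of this sketch.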
/ci
    # For post-submit jobs.
    /branches
        /master-20190626000853
        # ┃     ┗━ Committer-date of the HEAD of the branch.
        # ┗━ Branch name
        {
            author: "primiano@google.com"
            rev: "0552edf491886d2bb6265326a28fef0f73025b6b"
            subject: "Cloud-based CI"
            time_committed: "2019-07-06T02:35:14Z"
            jobs:
            {
                20190708153242--branches-master-20190626000853--android-...: 0
                20190708153242--branches-master-20190626000853--linux-...: 0
                ...
            }
        }
        /master-20190701235742 {...}

    # For pre-submit jobs.
    /cls
        /1000515-65
        {
            change_id: "platform%2F...~I575be190"
            time_queued: "2019-07-08T15:32:42Z"
            time_ended: "2019-07-08T15:33:25Z"
            revision_id: "18c2e4d0a96..."
            wants_vote: true
            voted: true
            jobs: {
                20190708153242--cls-1000515-65--android-clang: 0
                ...
                20190708153242--cls-1000515-65--ui-clang: 0
            }
        }
        /1000515-66 {...}
        ...
        /1011130-3 {...}

    /cls_pending
        # Effectively this is an array of pending CLs that we might need to
        # vote on at the end. Only the keys matter, the values have no
        # semantic and are always 0.
        /1000515-65: 0

    /jobs
        /20190708153242--cls-1000515-65--android-clang-arm-debug:
        #  ┃               ┃               ┗━ Job type.
        #  ┃               ┗━ Path of the CL or branch object.
        #  ┗━ Datetime when the job was created.
        {
            src: "cls/1000515-66"
            status: "QUEUED"
                    "STARTED"
                    "COMPLETED"
                    "FAILED"
                    "TIMED_OUT"
                    "CANCELLED"
                    "INTERRUPTED"
            time_ended: "2019-07-07T12:47:22Z"
            time_queued: "2019-07-07T12:34:22Z"
            time_started: "2019-07-07T12:34:25Z"
            type: "android-clang-arm-debug"
            worker: "zqz2-worker-2"
        }
        /20190707123422--cls-1000515-66--android-clang-arm-rel {..}

    /jobs_queued
        # Effectively this is an array. Only the keys matter, the values
        # have no semantic and are always 0.
        /20190708153242--cls-1000515-65--android-clang-arm-debug: 0

    /jobs_running
        # Effectively this is an array. Only the keys matter, the values
        # have no semantic and are always 0.
        /20190707123422--cls-1000515-66--android-clang-arm-rel

    /logs
        /20190707123422--cls-1000515-66--android-clang-arm-rel
            /00a053-0000: "+ chmod 777 /ci/cache /ci/artifacts"
            # ┃      ┗━ Monotonic counter to establish total order on log lines
            # ┃         retrieved within the same read() batch.
            # ┃
            # ┗━ Hex-encoded timestamp, relative since start of test.
            /00a053-0001: "+ chown perfetto.perfetto /ci/ramdisk"
            ...
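The key formats annotated in the schema above can be decoded mechanically. The following is a minimal sketch, not the CI's actual code: `JobKey`, `parse_job_key` and `log_sort_key` are hypothetical names, and the rule that the first dash in the middle component maps to a `/` is inferred from the `src` field shown in the schema.

```python
from datetime import datetime
from typing import NamedTuple


class JobKey(NamedTuple):
    created: datetime  # Datetime when the job was created.
    src: str           # Path of the CL or branch object.
    job_type: str      # Job type, e.g. "android-clang-arm-debug".


def parse_job_key(key: str) -> JobKey:
    """Split '<datetime>--<src>--<type>' into its three components."""
    created, src, job_type = key.split("--")
    # Assumption: "cls-1000515-65" corresponds to the path "cls/1000515-65",
    # i.e. only the first dash separates the collection from the object.
    collection, _, obj = src.partition("-")
    return JobKey(datetime.strptime(created, "%Y%m%d%H%M%S"),
                  f"{collection}/{obj}", job_type)


def log_sort_key(key: str):
    """Order log keys by (hex timestamp, monotonic counter)."""
    ts_hex, counter = key.split("-")
    return (int(ts_hex, 16), int(counter))


job = parse_job_key("20190708153242--cls-1000515-65--android-clang-arm-debug")
print(job.src, job.job_type)

log_keys = ["00a053-0001", "0000ff-0003", "00a053-0000"]
print(sorted(log_keys, key=log_sort_key))
```

Because the timestamp and counter are fixed-width in the examples, a plain string sort would give the same order; the numeric key only makes the intent explicit.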
This is what happens, in order, on a worker instance from boot to the test run.
make -C /infra/ci worker-start
┗━ gcloud start ...

[GCE] # From /infra/ci/worker/gce-startup-script.sh
docker run worker-1 ...
...
docker run worker-N ...

[worker-X] # From /infra/ci/worker/Dockerfile
┗━ /infra/ci/worker/worker.py
   ┗━ docker run sandbox-X ...

[sandbox-X] # From /infra/ci/sandbox/Dockerfile
┗━ /infra/ci/sandbox/init.sh
   ┗━ /infra/ci/sandbox/testrunner.sh
      ┣━ git fetch refs/changes/...
      ┇  ...
      ┇  # This env var is passed by the test definition
      ┇  # specified in /infra/ci/config.py .
      ┗━ $PERFETTO_TEST_SCRIPT
         ┣━ # Which is one of these:
         ┣━ /test/ci/android_tests.sh
         ┣━ /test/ci/fuzzer_tests.sh
         ┣━ /test/ci/linux_tests.sh
         ┗━ /test/ci/ui_tests.sh
            ┣━ ninja ...
            ┗━ out/dist/{unit,integration,...}test
iptables (for the sandboxed network).
N worker containers in Docker.
Test-locally: make -C infra/ci/frontend test
Deploy with make -C infra/ci/frontend deploy
Deploy with make -C infra/ci/controller deploy
It is possible to try it locally via make -C infra/ci/controller test, but this involves:
gcloud cli doesn't seem to work, b/136828660)
test-credentials.json (they are in the internal Team drive).
Build and push the new docker containers with:
make -C infra/ci build push
Restart the GCE instances, either manually or via
make -C infra/ci restart-workers
This can be useful when there is an outage and too many jobs pile up.
make -C infra/ci stop-workers
jobs_running, jobs_queued, workers subtrees
make -C infra/ci start-workers
Both the Firebase DB and the gs://perfetto-artifacts GCS bucket are world-readable and writable by the GAE and GCE service accounts.
The GAE service account also has the ability to log into Gerrit using a dedicated gmail.com account. The GCE service account doesn't.
Overall, no account in this project has any interesting privilege:
This CI deals only with functional and performance testing and doesn't deal with any sort of continuous deployment.
Presubmit jobs are only triggered if at least one of the following is true:
Sandboxes are not too hard to escape (Docker is the only boundary) and can pollute each other via the shared ccache.
As such, neither pre-submit nor post-submit build artifacts are considered trusted. They are only used for establishing functional correctness and performance regression testing.
Binaries built by the CI are not run on any machine outside of the CI project. They are deliberately not downloadable.
The only build artifacts that are retained (for up to 30 days) and uploaded to the GCS bucket are the UI artifacts. This is solely to provide visual previews of the HTML changes.
UI artifacts are served from a different origin (the GCS per-bucket API) than the production UI.