# Perfetto CI design document

This CI is used on top of (not in replacement of) AOSP's TreeHugger.
It gives early testing signals and coverage on other OSes and older Android
devices not supported by TreeHugger.

See the [Testing](/docs/contributing/testing.md) page for more details about the
project testing strategy.

## Architecture diagram

![Architecture diagram](/docs/images/continuous-integration.png)

There are four major components:

1. Frontend: AppEngine.
2. Controller: AppEngine BG service.
3. Workers: Compute Engine + Docker.
4. Database: Firebase realtime database.

They are coupled via the Firebase DB. The DB is the source of truth for the
whole CI.

## Controller

The Controller orchestrates the CI. It's the most trusted piece of the system.

It is based on a background AppEngine service. This service is triggered only
by deferred tasks and periodic Cron jobs.

The Controller is the only entity which performs authenticated access to Gerrit.
It uses a non-privileged gmail account and has no meaningful voting power.

The controller loop mainly does the following:

- It periodically (every 5s) polls Gerrit for CLs updated in the last 24h.
- It checks the list of CLs against the list of already known CLs in the DB.
- For each new CL it enqueues `N` new jobs in the database, one for each
  configuration defined in [config.py](/infra/ci/config.py) (e.g. `linux-debug`,
  `android-release`, ...).
- It monitors the state of jobs. When all jobs for a CL have been completed,
  it posts a comment and adds the vote if the CL is marked as `Presubmit-Ready`.
- It does some other less-relevant bookkeeping.

AppEngine is highly reliable and self-healing. If a task fails (e.g. because
of a Gerrit 500) it will be automatically retried with exponential backoff.
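The enqueueing step can be sketched as follows. This is a minimal illustration, not the Controller's actual code: `JOB_TYPES`, `make_job_id` and `build_enqueue_patch` are hypothetical names, the job list stands in for the one in `config.py`, and the key layout is inferred from the data model shown later in this document.

```python
from datetime import datetime, timezone

# Hypothetical stand-in for the job types defined in /infra/ci/config.py.
JOB_TYPES = ["linux-debug", "android-release"]


def make_job_id(now, cl_key, job_type):
    """Builds a key like 20190708153242--cls-1000515-65--linux-debug."""
    return "%s--cls-%s--%s" % (now.strftime("%Y%m%d%H%M%S"), cl_key, job_type)


def build_enqueue_patch(cl_key, known_cls, now):
    """Returns a {db_path: value} patch for a new CL, or {} if already known."""
    if cl_key in known_cls:
        return {}
    patch = {}
    job_ids = {}
    for job_type in JOB_TYPES:
        job_id = make_job_id(now, cl_key, job_type)
        job_ids[job_id] = 0
        patch["jobs/" + job_id] = {
            "src": "cls/" + cl_key,
            "type": job_type,
            "status": "QUEUED",
            "time_queued": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        }
        # Only the key matters in the queue; the value is always 0.
        patch["jobs_queued/" + job_id] = 0
    # Simplified CL entry: the real object carries more fields.
    patch["cls/" + cl_key] = {"jobs": job_ids}
    patch["cls_pending/" + cl_key] = 0
    return patch


now = datetime(2019, 7, 8, 15, 32, 42, tzinfo=timezone.utc)
patch = build_enqueue_patch("1000515-65", set(), now)
```

Expressing the change as a single path-to-value patch keeps the job entries, the queue and the CL entry consistent with each other when they are written to the DB.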

## Frontend

The frontend is an AppEngine service that hosts the CI website at
[ci.perfetto.dev](https://ci.perfetto.dev).
Unlike the Controller, it is exposed to the public via HTTP.

- It's an almost fully static website based on HTML and JavaScript.
- The only backend-side code ([frontend.py](/infra/ci/frontend/frontend.py))
  is used to proxy XHR GET requests to Gerrit, due to the lack of Gerrit
  CORS headers.
- Such XHR requests are GET-only and anonymous.
- The frontend Python code also serves as a memcache layer for Gerrit requests
  that return immutable data (e.g. revision logs) to reduce the likelihood of
  hitting Gerrit errors / timeouts.
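The caching idea can be illustrated with a small sketch. It is not the code in `frontend.py`: `GerritProxyCache` and its `fetch` callable are hypothetical, and a plain dict stands in for memcache.

```python
class GerritProxyCache:
    """Caches responses only for requests whose result never changes."""

    def __init__(self, fetch):
        self._fetch = fetch  # Hypothetical callable(url) -> response body.
        self._cache = {}     # Stand-in for memcache.

    def get(self, url, immutable=False):
        if immutable and url in self._cache:
            return self._cache[url]
        body = self._fetch(url)
        if immutable:
            self._cache[url] = body
        return body


calls = []


def fake_fetch(url):
    # Records each upstream request so cache hits are observable.
    calls.append(url)
    return "body:" + url


cache = GerritProxyCache(fake_fetch)
cache.get("/revision-log", immutable=True)
cache.get("/revision-log", immutable=True)  # Served from the cache.
cache.get("/cl-status")
cache.get("/cl-status")  # Mutable data is always re-fetched.
```

Only requests flagged as immutable are cached, so mutable data such as the current state of a CL is never served stale.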

## Worker GCE VM

The actual testing job happens inside these Google Compute Engine VMs.
The GCE instance is running a CrOS-based
[Container-Optimized](https://cloud.google.com/container-optimized-os/docs/) OS.

The whole system image is read-only. The VM itself is stateless. No state is
persisted outside of the DB and Google Cloud Storage (only for UI artifacts).
The SSD is used only as a scratch disk and is cleared on each reboot.

VMs are dynamically spawned using the Google Cloud Autoscaler, which uses a
Stackdriver Custom Metric pushed by the Controller as its cost function.
This metric is the number of queued + running jobs.
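As a rough sketch, the metric can be derived from a snapshot of the DB as shown below. The function name is hypothetical, and the snapshot is assumed to be the `/ci` sub-tree loaded as a Python dict.

```python
def autoscaler_metric(ci_snapshot):
    """Returns the number of queued + running jobs in a /ci snapshot."""
    # Firebase does not store empty nodes, so either key may be missing.
    queued = ci_snapshot.get("jobs_queued") or {}
    running = ci_snapshot.get("jobs_running") or {}
    return len(queued) + len(running)


snapshot = {
    "jobs_queued": {"20190708153242--cls-1000515-65--android-clang-arm-debug": 0},
    "jobs_running": {"20190707123422--cls-1000515-66--android-clang-arm-rel": 0},
}
metric = autoscaler_metric(snapshot)
```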

Each VM runs two types of Docker containers: _worker_ and _sandbox_.
They are in a 1:1 relationship: each worker controls at most one associated
sandbox. Workers are always alive (they work in polling mode), while
sandboxes are started and stopped by the worker on demand.

On each GCE instance there are M (currently 10) worker containers running and
hence up to M sandboxes.

### Worker containers

Worker containers are trusted entities. They can impersonate the GCE service
account and have R/W access to the DB. They can also spawn sandbox containers.

Their behavior depends only on code that is manually deployed and doesn't depend
on the checkout under test. The reason why workers are Docker containers is NOT
security but only reproducibility and maintenance.

Each worker does the following:

- Poll for an available job from the `/jobs_queued` sub-tree of the DB.
- Move such job into `/jobs_running`.
- Start the sandbox container, passing down the job config and the git revision
  via env vars.
- Stream the sandbox stdout to the `/logs` sub-tree of the DB.
- Terminate the sandbox container prematurely in case of timeouts or job
  cancellations requested by the Controller.
- Upload UI artifacts to GCS.
- Update the DB to reflect completion of jobs, removing the entry from
  `/jobs_running` and updating the `/jobs/$jobId/status` fields.
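The state transitions above can be sketched with an in-memory dict standing in for the DB. This is an illustration only, not the logic of `worker.py`: `run_one_job` and `run_sandbox` are hypothetical names, and timeouts, cancellations, the GCS upload and races between workers are left out.

```python
def run_one_job(db, worker_name, run_sandbox):
    """Claims one queued job, runs it, and records the outcome in `db`.

    `db` is a plain dict standing in for the /ci sub-tree. `run_sandbox` is a
    hypothetical callable(job_type, src) -> (exit_code, log_lines).
    Returns the job id, or None if the queue is empty.
    """
    queued = db.setdefault("jobs_queued", {})
    if not queued:
        return None
    job_id = sorted(queued)[0]  # Keys start with a timestamp: oldest first.

    # Move the job from /jobs_queued to /jobs_running.
    del queued[job_id]
    db.setdefault("jobs_running", {})[job_id] = 0
    job = db["jobs"][job_id]
    job.update(status="STARTED", worker=worker_name)

    exit_code, log_lines = run_sandbox(job["type"], job["src"])
    # Simplification: the real /logs sub-tree uses one keyed entry per line.
    db.setdefault("logs", {})[job_id] = list(log_lines)

    # Reflect completion: drop the running entry and set the final status.
    del db["jobs_running"][job_id]
    job["status"] = "COMPLETED" if exit_code == 0 else "FAILED"
    return job_id


db = {
    "jobs_queued": {"j1": 0},
    "jobs": {"j1": {"type": "linux-debug", "src": "cls/1-1", "status": "QUEUED"}},
}
finished = run_one_job(db, "worker-1", lambda job_type, src: (0, ["+ ninja"]))
```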

### Sandbox containers

Sandbox containers are untrusted entities. They can access the internet
(for git pull / install-build-deps) but they cannot impersonate the GCE service
account, cannot write into the DB, and cannot write into GCS buckets.
Docker here is used both as an isolation boundary and for reproducibility /
debugging.

Each sandbox does the following:

- Checkout the code at the revision specified in the job config.
- Run one of the [test/ci/](/test/ci/) scripts which will build and run tests.
- Return either a success (0) or fail (!= 0) exit code.

A sandbox container is almost completely stateless with the only exception of
the semi-ephemeral `/ci/cache` mount-point. This mount-point is tmpfs-based
(hence cleared on reboot) but is shared across all sandboxes. It's used only to
maintain the shared ccache.

## Data model

The whole CI is based on
[Firebase Realtime DB](https://firebase.google.com/docs/database).
It is a high-scale JSON object accessible via a simple REST API.
Clients can GET/PUT/PATCH/DELETE individual sub-nodes without having a local
full-copy of the DB.
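As an example of sub-node access, the sketch below builds (without sending) a `PATCH` request for a single job. The Firebase REST API addresses a node by appending `.json` to its path. The database URL and the job id are placeholders, `db_request` is a hypothetical helper, and authentication is omitted.

```python
import json
import urllib.request

# Hypothetical database URL; the real one is not stated in this document.
DB_URL = "https://example-ci.firebaseio.com"


def db_request(method, path, body=None):
    """Builds (without sending) a REST request for one sub-node of the DB."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    req = urllib.request.Request(
        "%s/%s.json" % (DB_URL, path.strip("/")), data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


# Update a single field of one job without touching the rest of the tree.
req = db_request("PATCH", "/ci/jobs/some-job-id", {"status": "STARTED"})
```

Passing the request to `urllib.request.urlopen()` would perform the actual call.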

```bash
/ci
    # For post-submit jobs.
    /branches
        /main-20190626000853
        # ┃     ┗━ Committer-date of the HEAD of the branch.
        # ┗━ Branch name
        {
            author: "primiano@google.com"
            rev: "0552edf491886d2bb6265326a28fef0f73025b6b"
            subject: "Cloud-based CI"
            time_committed: "2019-07-06T02:35:14Z"
            jobs:
            {
                20190708153242--branches-main-20190626000853--android-...: 0
                20190708153242--branches-main-20190626000853--linux-...:  0
                ...
            }
        }
        /main-20190701235742 {...}

    # For pre-submit jobs.
    /cls
        /1000515-65
        {
            change_id:    "platform%2F...~I575be190"
            time_queued:  "2019-07-08T15:32:42Z"
            time_ended:   "2019-07-08T15:33:25Z"
            revision_id:  "18c2e4d0a96..."
            wants_vote:   true
            voted:        true
            jobs: {
                20190708153242--cls-1000515-65--android-clang:  0
                ...
                20190708153242--cls-1000515-65--ui-clang:       0
            }
        }
        /1000515-66 {...}
        ...
        /1011130-3 {...}

    /cls_pending
       # Effectively this is an array of pending CLs that we might need to
       # vote on at the end. Only the keys matter; the values carry no
       # meaning and are always 0.
       /1000515-65: 0

    /jobs
        /20190708153242--cls-1000515-65--android-clang-arm-debug:
        #  ┃               ┃             ┗━ Job type.
        #  ┃               ┗━ Path of the CL or branch object.
        #  ┗━ Datetime when the job was created.
        {
            src:          "cls/1000515-66"
            status:       "QUEUED"
                          "STARTED"
                          "COMPLETED"
                          "FAILED"
                          "TIMED_OUT"
                          "CANCELLED"
                          "INTERRUPTED"
            time_ended:   "2019-07-07T12:47:22Z"
            time_queued:  "2019-07-07T12:34:22Z"
            time_started: "2019-07-07T12:34:25Z"
            type:         "android-clang-arm-debug"
            worker:       "zqz2-worker-2"
        }
        /20190707123422--cls-1000515-66--android-clang-arm-rel {..}

    /jobs_queued
        # Effectively this is an array. Only the keys matter; the values
        # carry no meaning and are always 0.
        /20190708153242--cls-1000515-65--android-clang-arm-debug: 0

    /jobs_running
        # Effectively this is an array. Only the keys matter; the values
        # carry no meaning and are always 0.
        /20190707123422--cls-1000515-66--android-clang-arm-rel: 0

    /logs
        /20190707123422--cls-1000515-66--android-clang-arm-rel
            /00a053-0000: "+ chmod 777 /ci/cache /ci/artifacts"
            # ┃      ┗━ Monotonic counter to establish total order on log lines
            # ┃         retrieved within the same read() batch.
            # ┃
            # ┗━ Hex-encoded timestamp, relative to the start of the test.
            /00a053-0001: "+ chown perfetto.perfetto /ci/ramdisk"
            ...

```
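The job key encodes its creation time, source object and job type, so it can be split back into its parts. The helper below is a sketch with a hypothetical name; it assumes that `--` never appears inside a component.

```python
from datetime import datetime


def parse_job_id(job_id):
    """Splits '<datetime>--<src>--<type>' into its three components."""
    created, src, job_type = job_id.split("--")
    return {
        "created": datetime.strptime(created, "%Y%m%d%H%M%S"),
        # 'cls-1000515-65' -> 'cls/1000515-65' (path of the CL or branch).
        "src": src.replace("-", "/", 1),
        "type": job_type,
    }


info = parse_job_id("20190708153242--cls-1000515-65--android-clang-arm-debug")
```

Because the key starts with a fixed-width timestamp, sorting keys lexicographically also sorts jobs by creation time.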

## Sequence diagram

This is what happens, in order, on a worker instance from boot to the test run.

```bash
make -C /infra/ci worker-start
┗━ gcloud start ...

[GCE] # From /infra/ci/worker/gce-startup-script.sh
docker run worker-1 ...
...
docker run worker-N ...

[worker-X] # From /infra/ci/worker/Dockerfile
┗━ /infra/ci/worker/worker.py
  ┗━ docker run sandbox-X ...

[sandbox-X] # From /infra/ci/sandbox/Dockerfile
┗━ /infra/ci/sandbox/init.sh
  ┗━ /infra/ci/sandbox/testrunner.sh
    ┣━ git fetch refs/changes/...
    ┇  ...
    ┇  # This env var is passed by the test definition
    ┇  # specified in /infra/ci/config.py .
    ┗━ $PERFETTO_TEST_SCRIPT
       ┣━ # Which is one of these:
       ┣━ /test/ci/android_tests.sh
       ┣━ /test/ci/fuzzer_tests.sh
       ┣━ /test/ci/linux_tests.sh
       ┗━ /test/ci/ui_tests.sh
          ┣━ ninja ...
          ┗━ out/dist/{unit,integration,...}test
```

### [gce-startup-script.sh](/infra/ci/worker/gce-startup-script.sh)

- It runs once per GCE VM, at (re)boot.
- It prepares the tmpfs mountpoint for the shared ccache.
- It wipes the SSD scratch disk for the build artifacts.
- It pulls the latest {worker, sandbox} container images from
  the Google Cloud Container registry.
- It sets up Docker and `iptables` (for the sandboxed network).
- It starts `N` worker containers in Docker.

### [worker.py](/infra/ci/worker/worker.py)

- It polls the DB to retrieve a job.
- When a job is retrieved, it starts a sandbox container.
- It streams the container stdout/stderr to the DB.
- It uploads the build artifacts to GCS.

### [testrunner.sh](/infra/ci/sandbox/testrunner.sh)

- It is pinned in the container image and does NOT depend on the particular
  revision being tested.
- It checks out the repo at the revision specified (by the Controller) in the
  job config pulled from the DB.
- It sets up ccache.
- It deals with caching of buildtools/.
- It runs the test script specified in the job config from the checkout.

### [{android,fuzzer,linux,ui}_tests.sh](/test/ci/linux_tests.sh)

- They are NOT pinned in the container and are run from the checked-out
  revision.
- They build and run the tests.

## Playbook

### Frontend (JS/HTML/CSS) changes

Test locally with `make -C infra/ci/frontend test`

Deploy with `make -C infra/ci/frontend deploy`

### Controller changes

Deploy with `make -C infra/ci/controller deploy`

It is possible to test locally via `make -C infra/ci/controller test`,
but this involves:

- Manually stopping the production AppEngine instance via the Cloud Console
  (stopping via the `gcloud` cli doesn't seem to work, b/136828660)
- Downloading the testing service credentials `test-credentials.json`
  (they are in the internal Team drive).

### Worker/Sandbox changes

1. Build and push the new docker containers with:

   `make -C infra/ci build push`

2. Restart the GCE instances, either manually or via

   `make -C infra/ci restart-workers`

### Purging the job queue

This can be useful when there is an outage and too many jobs pile up.

1. Stop the workers: `make -C infra/ci stop-workers`
2. Open https://console.firebase.google.com/u/0/project/perfetto-ci/database/perfetto-ci/data/~2Fci
3. Delete the `jobs_running`, `jobs_queued`, `workers` subtrees
4. Restart the workers: `make -C infra/ci start-workers`

## Security considerations

- Both the Firebase DB and the gs://perfetto-artifacts GCS bucket are
  world-readable, and are writable by the GAE and GCE service accounts.

- The GAE service account also has the ability to log into Gerrit using a
  dedicated gmail.com account. The GCE service account doesn't.

- Overall, no account in this project has any interesting privilege:
  - The Gerrit account used for commenting on CLs is just a random gmail account
    and has no special voting power.
  - The service accounts of GAE and GCE don't have any special capabilities
    outside of the CI project itself.

- This CI deals only with functional and performance testing and doesn't deal
  with any sort of continuous deployment.

- Presubmit jobs are only triggered if at least one of the following is true:
  - The owner of the CL is a @google.com account.
  - The user that applied the Presubmit-Ready label is a @google.com account.

- Sandboxes are not too hard to escape (Docker is the only boundary) and can
  pollute each other via the shared ccache.

- As such neither pre-submit nor post-submit build artifacts are considered
  trusted. They are only used for establishing functional correctness and
  performance regression testing.

- Binaries built by the CI are not run on any machines outside of the
  CI project. They are deliberately not downloadable.

- The only build artifacts that are retained (for up to 30 days) and uploaded to
  the GCS bucket are the UI artifacts. This is solely to provide visual
  previews of the HTML changes.

- UI artifacts are served from a different origin (the GCS per-bucket API) than
  the production UI.
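
The presubmit trigger rule listed above can be expressed as a small predicate. This is a sketch with hypothetical function names, not the Controller's actual check; it assumes the e-mail addresses of the CL owner and of the user who applied the label are already known.

```python
def is_google_account(email):
    """True if the address belongs to the google.com domain."""
    return email.lower().endswith("@google.com")


def should_trigger_presubmit(cl_owner, label_applier=None):
    """True if the CL owner or whoever set Presubmit-Ready is @google.com."""
    if is_google_account(cl_owner):
        return True
    return label_applier is not None and is_google_account(label_applier)


# An external contribution runs only once a @google.com user applies the label.
external_alone = should_trigger_presubmit("dev@example.org")
external_approved = should_trigger_presubmit("dev@example.org", "reviewer@google.com")
```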
