# Perfetto CI design document

This CI is used on top of (not as a replacement for) AOSP's TreeHugger.
It gives early testing signals and coverage on other OSes and older Android
devices not supported by TreeHugger.

See the [Testing](/docs/contributing/testing.md) page for more details about the
project testing strategy.

## Architecture diagram

There are four major components:

1. Frontend: AppEngine.
2. Controller: AppEngine BG service.
3. Workers: Compute Engine + Docker.
4. Database: Firebase realtime database.

They are coupled via the Firebase DB. The DB is the source of truth for the
whole CI.

## Controller

The Controller orchestrates the CI. It's the most trusted piece of the system.

It is based on a background AppEngine service, which is triggered only by
deferred tasks and periodic Cron jobs.

The Controller is the only entity that performs authenticated access to Gerrit.
It uses a non-privileged gmail account and has no meaningful voting power.

The controller loop mainly does the following:

- It periodically (every 5s) polls Gerrit for CLs updated in the last 24h.
- It checks the list of CLs against the list of already known CLs in the DB.
- For each new CL it enqueues `N` new jobs in the database, one for each
  configuration defined in [config.py](/infra/ci/config.py) (e.g. `linux-debug`,
  `android-release`, ...).
- It monitors the state of jobs. When all jobs for a CL have been completed,
  it posts a comment and adds the vote if the CL is marked as `Presubmit-Ready`.
- It does some other less-relevant bookkeeping.

AppEngine is highly reliable and self-healing. If a task fails (e.g. because
of a Gerrit 500) it is automatically retried with exponential backoff.

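The enqueue step of the controller loop can be sketched as follows. The job ID layout follows the keys shown in the Data model section; the `JOB_TYPES` list, the helper names and the idea of collecting all writes into one `{path: value}` update are assumptions made for illustration, not the actual controller code.

```python
from datetime import datetime, timezone

# Hypothetical stand-in for the job types defined in /infra/ci/config.py.
JOB_TYPES = ["linux-debug", "android-release"]


def make_job_id(now, src, job_type):
    """Builds '<datetime>--<src path, dash-separated>--<job type>'."""
    return "--".join(
        [now.strftime("%Y%m%d%H%M%S"), src.replace("/", "-"), job_type])


def enqueue_jobs(now, src, job_types):
    """Returns a {db_path: value} update that enqueues one job per type."""
    update = {}
    for job_type in job_types:
        job_id = make_job_id(now, src, job_type)
        update["jobs/" + job_id] = {
            "src": src,
            "type": job_type,
            "status": "QUEUED",
            "time_queued": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        }
        # In these index nodes only the key matters; the value is always 0.
        update["jobs_queued/" + job_id] = 0
        update[src + "/jobs/" + job_id] = 0
    return update


now = datetime(2019, 7, 8, 15, 32, 42, tzinfo=timezone.utc)
update = enqueue_jobs(now, "cls/1000515-65", JOB_TYPES)
for path in sorted(update):
    print(path)
```

Each new CL produces three entries per job type: the job object itself, its key in `/jobs_queued`, and a back-reference under the CL's `jobs` node.
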
## Frontend

The frontend is an AppEngine service that hosts the CI website at
[ci.perfetto.dev](https://ci.perfetto.dev).
Unlike the Controller, it is exposed to the public via HTTP.

- It's an almost fully static website based on HTML and JavaScript.
- The only backend-side code ([frontend.py](/infra/ci/frontend/frontend.py))
  is used to proxy XHR GET requests to Gerrit, due to the lack of Gerrit
  CORS headers.
- Such XHR requests are GET-only and anonymous.
- The frontend Python code also serves as a memcache layer for Gerrit requests
  that return immutable data (e.g. revision logs) to reduce the likelihood of
  hitting Gerrit errors / timeouts.

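The caching behaviour described above can be illustrated with a minimal sketch. It is not the code in `frontend.py`: a plain dict stands in for memcache, the Gerrit request is replaced by an injected `fetch` callable, and the example paths and the `is_immutable` predicate are hypothetical.

```python
class CachingProxy:
    """Caches responses only for paths known to return immutable data."""

    def __init__(self, fetch, is_immutable):
        self._fetch = fetch                # Callable: path -> response body.
        self._is_immutable = is_immutable  # Callable: path -> bool.
        self._cache = {}                   # Stand-in for memcache.

    def get(self, path):
        if path in self._cache:
            return self._cache[path]
        body = self._fetch(path)
        if self._is_immutable(path):
            self._cache[path] = body
        return body


calls = []


def fake_gerrit_fetch(path):
    """Records each upstream request instead of contacting Gerrit."""
    calls.append(path)
    return "response for " + path


proxy = CachingProxy(fake_gerrit_fetch, lambda p: p.startswith("/log/"))
proxy.get("/log/abc")
proxy.get("/log/abc")    # Served from the cache, no upstream request.
proxy.get("/changes/1")
proxy.get("/changes/1")  # Mutable data: fetched again.
print(calls)
```

Only the immutable path is answered from the cache on the second request, so fewer requests reach Gerrit and fewer can fail or time out.
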
## Worker GCE VM

The actual testing job happens inside these Google Compute Engine VMs.
The GCE instance runs a CrOS-based
[Container-Optimized](https://cloud.google.com/container-optimized-os/docs/) OS.

The whole system image is read-only. The VM itself is stateless. No state is
persisted outside of the DB and Google Cloud Storage (only for UI artifacts).
The SSD is used only as a scratch disk and is cleared on each reboot.

VMs are dynamically spawned using the Google Cloud Autoscaler, which uses a
Stackdriver Custom Metric pushed by the Controller as its cost function.
The metric is the number of queued + running jobs.

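As a small illustration, the metric can be derived from the two index nodes of the DB. The function name is hypothetical, and how the value is pushed to Stackdriver is out of scope here.

```python
def autoscaler_metric(jobs_queued, jobs_running):
    """Returns the number of queued + running jobs.

    Both arguments are {job_id: 0} maps. An empty node is read back from
    the DB as null (None in Python), so None is treated as empty.
    """
    return len(jobs_queued or {}) + len(jobs_running or {})


queued = {"20190708153242--cls-1000515-65--android-clang-arm-debug": 0}
running = {"20190707123422--cls-1000515-66--android-clang-arm-rel": 0}
print(autoscaler_metric(queued, running))  # 2
print(autoscaler_metric(None, running))    # 1
```
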
Each VM runs two types of Docker containers: the _worker_ and the _sandbox_.
They are in a 1:1 relationship: each worker controls at most one associated
sandbox. Workers are always alive (they work in polling mode), while
sandboxes are started and stopped by the worker on demand.

On each GCE instance there are M (currently 10) worker containers running and
hence up to M sandboxes.

### Worker containers

Worker containers are trusted entities. They can impersonate the GCE service
account and have R/W access to the DB. They can also spawn sandbox containers.

Their behavior depends only on code that is manually deployed and doesn't depend
on the checkout under test. Workers are Docker containers NOT for security but
only for reproducibility and ease of maintenance.

Each worker does the following:

- Poll for an available job from the `/jobs_queued` sub-tree of the DB.
- Move such job into `/jobs_running`.
- Start the sandbox container, passing down the job config and the git revision
  via env vars.
- Stream the sandbox stdout to the `/logs` sub-tree of the DB.
- Terminate the sandbox container prematurely in case of timeouts or job
  cancellations requested by the Controller.
- Upload UI artifacts to GCS.
- Update the DB to reflect completion of jobs, removing the entry from
  `/jobs_running` and updating the `/jobs/$jobId/status` fields.

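The DB bookkeeping in the steps above can be sketched against an in-memory dict that mimics the DB layout. This is an illustration under assumptions, not the logic of `worker.py`: the function names, the oldest-first choice and the mapping from exit code to status are hypothetical, and a real worker would also have to prevent two workers from claiming the same job.

```python
def claim_job(db, worker_name):
    """Moves one job from jobs_queued to jobs_running and returns its ID."""
    queued = db.get("jobs_queued") or {}
    if not queued:
        return None
    # Assumption: oldest first. IDs start with a datetime, so they sort by age.
    job_id = sorted(queued)[0]
    del queued[job_id]
    db.setdefault("jobs_running", {})[job_id] = 0
    db["jobs"][job_id].update(status="STARTED", worker=worker_name)
    return job_id


def complete_job(db, job_id, exit_code, timed_out=False):
    """Removes the job from jobs_running and records its final status."""
    if timed_out:
        status = "TIMED_OUT"
    elif exit_code == 0:
        status = "COMPLETED"
    else:
        status = "FAILED"
    del db["jobs_running"][job_id]
    db["jobs"][job_id]["status"] = status
    return status


db = {
    "jobs": {"20190708153242--cls-1000515-65--linux-debug": {"status": "QUEUED"}},
    "jobs_queued": {"20190708153242--cls-1000515-65--linux-debug": 0},
}
job_id = claim_job(db, "worker-1")
final_status = complete_job(db, job_id, exit_code=1)
print(job_id, final_status)
```
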
### Sandbox containers

Sandbox containers are untrusted entities. They can access the internet
(for git pull / install-build-deps) but they cannot impersonate the GCE service
account, write into the DB, or write into GCS buckets.
Docker here is used both as an isolation boundary and for reproducibility /
debugging.

Each sandbox does the following:

- Check out the code at the revision specified in the job config.
- Run one of the [test/ci/](/test/ci/) scripts, which builds and runs the tests.
- Return either a success (0) or failure (!= 0) exit code.

A sandbox container is almost completely stateless, with the sole exception of
the semi-ephemeral `/ci/cache` mount point. This mount point is tmpfs-based
(hence cleared on reboot) but is shared across all sandboxes. It's used only to
maintain the shared ccache.

## Data model

The whole CI is based on
[Firebase Realtime DB](https://firebase.google.com/docs/database).
It is a high-scale JSON object accessible via a simple REST API.
Clients can GET/PUT/PATCH/DELETE individual sub-nodes without having a full
local copy of the DB.

```bash
/ci
    # For post-submit jobs.
    /branches
        /master-20190626000853
        # ┃     ┗━ Committer-date of the HEAD of the branch.
        # ┗━ Branch name
        {
            author: "primiano@google.com"
            rev: "0552edf491886d2bb6265326a28fef0f73025b6b"
            subject: "Cloud-based CI"
            time_committed: "2019-07-06T02:35:14Z"
            jobs:
            {
                20190708153242--branches-master-20190626000853--android-...: 0
                20190708153242--branches-master-20190626000853--linux-...: 0
                ...
            }
        }
        /master-20190701235742 {...}

    # For pre-submit jobs.
    /cls
        /1000515-65
        {
            change_id: "platform%2F...~I575be190"
            time_queued: "2019-07-08T15:32:42Z"
            time_ended: "2019-07-08T15:33:25Z"
            revision_id: "18c2e4d0a96..."
            wants_vote: true
            voted: true
            jobs: {
                20190708153242--cls-1000515-65--android-clang: 0
                ...
                20190708153242--cls-1000515-65--ui-clang: 0
            }
        }
        /1000515-66 {...}
        ...
        /1011130-3 {...}

    /cls_pending
        # Effectively this is an array of pending CLs that we might need to
        # vote on at the end. Only the keys matter, the values have no
        # semantics and are always 0.
        /1000515-65: 0

    /jobs
        /20190708153242--cls-1000515-65--android-clang-arm-debug:
        # ┃              ┃               ┗━ Job type.
        # ┃              ┗━ Path of the CL or branch object.
        # ┗━ Datetime when the job was created.
        {
            src: "cls/1000515-66"
            status: "QUEUED"
                    "STARTED"
                    "COMPLETED"
                    "FAILED"
                    "TIMED_OUT"
                    "CANCELLED"
                    "INTERRUPTED"
            time_ended: "2019-07-07T12:47:22Z"
            time_queued: "2019-07-07T12:34:22Z"
            time_started: "2019-07-07T12:34:25Z"
            type: "android-clang-arm-debug"
            worker: "zqz2-worker-2"
        }
        /20190707123422--cls-1000515-66--android-clang-arm-rel {...}

    /jobs_queued
        # Effectively this is an array. Only the keys matter, the values
        # have no semantics and are always 0.
        /20190708153242--cls-1000515-65--android-clang-arm-debug: 0

    /jobs_running
        # Effectively this is an array. Only the keys matter, the values
        # have no semantics and are always 0.
        /20190707123422--cls-1000515-66--android-clang-arm-rel: 0

    /logs
        /20190707123422--cls-1000515-66--android-clang-arm-rel
            /00a053-0000: "+ chmod 777 /ci/cache /ci/artifacts"
            # ┃     ┗━ Monotonic counter to establish total order on log lines
            # ┃        retrieved within the same read() batch.
            # ┃
            # ┗━ Hex-encoded timestamp, relative to the start of the test.
            /00a053-0001: "+ chown perfetto.perfetto /ci/ramdisk"
            ...

```

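As a rough illustration of how a client addresses the tree above, the sketch below builds the REST URL of a sub-node and splits a job key into its parts. The database host `example-ci.firebaseio.com` and both helper names are hypothetical; the `.json` suffix is how the Firebase REST API addresses a node.

```python
from datetime import datetime

# Hypothetical database URL; the real one is not part of this document.
DB_URL = "https://example-ci.firebaseio.com"


def node_url(path):
    """REST URL of a sub-node: the DB path followed by '.json'."""
    return DB_URL + "/" + path.strip("/") + ".json"


def parse_job_id(job_id):
    """Splits '<datetime>--<src>--<type>' into its three components."""
    created, src, job_type = job_id.split("--")
    return datetime.strptime(created, "%Y%m%d%H%M%S"), src, job_type


job_id = "20190708153242--cls-1000515-65--android-clang-arm-debug"
print(node_url("/ci/jobs/" + job_id + "/status"))
print(parse_job_id(job_id))
```

A GET on the printed URL would return only the `status` field of that job, which is what lets clients work on individual sub-nodes without a local copy of the whole DB.
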
## Sequence diagram

This is what happens, in order, on a worker instance from boot to the test run.

```bash
make -C /infra/ci worker-start
┗━ gcloud start ...

[GCE] # From /infra/ci/worker/gce-startup-script.sh
docker run worker-1 ...
...
docker run worker-N ...

[worker-X] # From /infra/ci/worker/Dockerfile
┗━ /infra/ci/worker/worker.py
  ┗━ docker run sandbox-X ...

[sandbox-X] # From /infra/ci/sandbox/Dockerfile
┗━ /infra/ci/sandbox/init.sh
  ┗━ /infra/ci/sandbox/testrunner.sh
    ┣━ git fetch refs/changes/...
    ┇  ...
    ┇  # This env var is passed by the test definition
    ┇  # specified in /infra/ci/config.py .
    ┗━ $PERFETTO_TEST_SCRIPT
       ┣━ # Which is one of these:
       ┣━ /test/ci/android_tests.sh
       ┣━ /test/ci/fuzzer_tests.sh
       ┣━ /test/ci/linux_tests.sh
       ┗━ /test/ci/ui_tests.sh
          ┣━ ninja ...
          ┗━ out/dist/{unit,integration,...}test
```

### [gce-startup-script.sh](/infra/ci/worker/gce-startup-script.sh)

- It is run once per GCE VM, at (re)boot.
- It prepares the tmpfs mount point for the shared ccache.
- It wipes the SSD scratch disk for the build artifacts.
- It pulls the latest {worker, sandbox} container images from
  the Google Cloud Container registry.
- It sets up Docker and `iptables` (for the sandboxed network).
- It starts `N` worker containers in Docker.

### [worker.py](/infra/ci/worker/worker.py)

- It polls the DB to retrieve a job.
- When a job is retrieved, it starts a sandbox container.
- It streams the container stdout/stderr to the DB.
- It uploads the build artifacts to GCS.

### [testrunner.sh](/infra/ci/sandbox/testrunner.sh)

- It is pinned in the container image. It does NOT depend on the particular
  revision being tested.
- It checks out the repo at the revision specified (by the Controller) in the
  job config pulled from the DB.
- It sets up ccache.
- It deals with caching of buildtools/.
- It runs the test script specified in the job config from the checkout.

### [{android,fuzzer,linux,ui}_tests.sh](/test/ci/linux_tests.sh)

- They are NOT pinned in the container and are run from the checked-out
  revision.
- They finally build and run the tests.

## Playbook

### Frontend (JS/HTML/CSS) changes

Test locally with `make -C infra/ci/frontend test`

Deploy with `make -C infra/ci/frontend deploy`

### Controller changes

Deploy with `make -C infra/ci/controller deploy`

It is possible to test locally via `make -C infra/ci/controller test`,
but this involves:

- Manually stopping the production AppEngine instance via the Cloud Console
  (stopping via the `gcloud` cli doesn't seem to work, b/136828660).
- Downloading the testing service credentials `test-credentials.json`
  (they are in the internal Team drive).

### Worker/Sandbox changes

1. Build and push the new Docker containers with:

   `make -C infra/ci build push`

2. Restart the GCE instances, either manually or via:

   `make -C infra/ci restart-workers`

## Security considerations

- Both the Firebase DB and the `gs://perfetto-artifacts` GCS bucket are
  world-readable, and writable by the GAE and GCE service accounts.

- The GAE service account also has the ability to log into Gerrit using a
  dedicated gmail.com account. The GCE service account doesn't.

- Overall, no account in this project has any interesting privilege:
    - The Gerrit account used for commenting on CLs is just a random gmail
      account and has no special voting power.
    - The service accounts of GAE and GCE don't have any special capabilities
      outside of the CI project itself.

- This CI deals only with functional and performance testing and doesn't deal
  with any sort of continuous deployment.

- Presubmit jobs are only triggered if at least one of the following is true:
    - The owner of the CL is a @google.com account.
    - The user that applied the Presubmit-Ready label is a @google.com account.

- Sandboxes are not too hard to escape (Docker is the only boundary) and can
  pollute each other via the shared ccache.

- As such, neither pre-submit nor post-submit build artifacts are considered
  trusted. They are only used for establishing functional correctness and
  performance regression testing.

- Binaries built by the CI are not run on any machine outside of the
  CI project. They are deliberately not downloadable.

- The only build artifacts that are retained (for up to 30 days) and uploaded to
  the GCS bucket are the UI artifacts. This is solely for the sake of getting
  visual previews of the HTML changes.

- UI artifacts are served from a different origin (the GCS per-bucket API) than
  the production UI.
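
The presubmit trigger rule listed above can be written as a small predicate. This is a sketch only: the function and parameter names are hypothetical, and the real check lives in the controller code.

```python
def is_google_account(email):
    """True for addresses whose domain is exactly google.com."""
    return email.lower().endswith("@google.com")


def should_run_presubmit(cl_owner, presubmit_ready_by=None):
    """True if the CL owner or the Presubmit-Ready labeller is @google.com.

    presubmit_ready_by is None when nobody has applied the label.
    """
    if is_google_account(cl_owner):
        return True
    return presubmit_ready_by is not None and is_google_account(presubmit_ready_by)


print(should_run_presubmit("dev@google.com"))
print(should_run_presubmit("external@example.org"))
print(should_run_presubmit("external@example.org", "reviewer@google.com"))
```

Matching on the full `@google.com` suffix, rather than on `google.com` alone, keeps look-alike domains such as `notgoogle.com` from passing the check.
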