State of Performance Test Engineering (H2/2019)

It’s a new year, and time for me to post an update on the state of Firefox performance test engineering. The last update was in July 2019 and covered the first half of the year. This update covers the second half of 2019.

Team

The team consists of 9 engineers based in California, Toronto, Montreal, and Romania.

Tests

We currently support 4 frameworks:

Are We Slim Yet (AWSY) - memory consumption by the browser
build_metrics - build times, installer size, and other compiler-specific insights
Raptor - page load and browser benchmarks
Talos - various time-based performance KPIs on the browser

At the time of writing, there are 511 test suites (query) for the above frameworks. This is up from 263 in the H1/2019 report.

The following are some highlights of tests introduced in H2/2019:

Dashboards

In addition to updating health.graphics and related dashboards based on test changes, we have also contributed summary dashboards for media playback performance and power usage.

We have also contributed several improvements to the Perfherder tool used by the performance sheriffs for monitoring and investigating regression and improvement alerts. Some examples are listed below:

Ability for sheriffs to assign alerts
Allow users to retrigger from the compare view
New view showing active performance tests
Highlight prioritised alerts for investigation
Show measurement unit in graphs

Hardware

Our tests are running on the following hardware in continuous integration:

38 Moto G5 devices (with two of these dedicated to power testing)
23 Pixel 2 devices (with two of these dedicated to power testing)
16 Acer Aspire 15 laptops (2017 reference hardware)
35 Lenovo Yoga C630 laptops (Windows ARM64)
2 Apple MacBook Pro laptops (used for power testing)

See this wiki page for more details on the hardware used.

Sheriffs

We have 3 performance sheriffs 🤠 dedicating up to 50% of their time to this role. Outside of this, they assist with improving our tools and test harnesses.

Q3/2019

During the third quarter of 2019 our performance tools generated 1110 alert summaries. This is an average of 12 alert summaries every day.

Of the four test frameworks, build_metrics caused the most alert summaries, accounting for 42% of the total. The raptor framework was the second biggest contributor, with 30%.

Alert Summaries by Framework (Q3/2019)

Of the alert summaries generated, 13% were improvements. Of the alert summaries showing regressions, 51% were determined to be invalid, 5% were backed out, 23% were accepted, and 18% were fixed.

Regressions by Status (Q3/2019)

Our performance sheriffs triaged 63% of alerts within a day, and an additional 22% within 3 days.

Triage Response (Q3/2019)

Bugs were filed within 3 days for 53% of confirmed regressions, with a further 16% within 5 days.

Regression Bug Response (Q3/2019)

Here are some highlights for some of our sheriffed frameworks:

Are We Slim Yet (AWSY)

On July 22nd we detected an improvement of up to 18.9% in AWSY for macOS. This was caused by Paul Bone’s patch on bug 1567366, which switched to using MADV_FREE_REUSABLE.
The largest fixed regression detected from AWSY was noticed on August 2nd, and had an impact of up to 21.57% regression to images. This was attributed to bug 1570745 and was fixed by switching AWSY to the common scenario for new tab page rather than the new user experience.

Raptor

The improvement alert with the highest magnitude for Raptor was created on July 16th, which saw up to a massive 46.06% improvement to page load time on desktop. The gains were attributed to bug 1541229, which tweaked idle detection during page load.
The largest regression alert that was ultimately fixed for Raptor was generated on August 13th. This showed up to 21.86% regression to the cold page load tests on Android, and was caused by bug 1557282. This was also noticed in Telemetry and was fixed via bug 1575794.

Talos

For Talos, the alert showing the largest improvement was created on August 13th, which showed up to 13.09% improvement to ts_paint. It was attributed to bug 1572646, which optimised picture cache tiles that are solid colours.
The largest fixed regression alert for Talos was spotted on August 8th, and included a 4.24% hit to ts_paint on Windows. It was caused by bug 1539651, and was fixed by Brian Grinstead in bug 1573158.

Q4/2019

In the final quarter, our performance tools generated 996 alert summaries. This is an average of 11 alert summaries every day.
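
For anyone wanting to check the daily averages quoted for each quarter, here is a minimal sketch of the arithmetic. It assumes the averages are taken over calendar days in the quarter (the report does not say whether weekends are excluded), and the helper name `daily_average` is only illustrative.

```python
from datetime import date


def daily_average(alert_summaries: int, start: date, end: date) -> float:
    """Average alert summaries per calendar day, inclusive of both end dates."""
    days = (end - start).days + 1
    return alert_summaries / days


# Alert summary totals as quoted in this report.
q3 = daily_average(1110, date(2019, 7, 1), date(2019, 9, 30))
q4 = daily_average(996, date(2019, 10, 1), date(2019, 12, 31))

print(f"Q3/2019: {q3:.2f} per day, Q4/2019: {q4:.2f} per day")
```

Both quarters have 92 calendar days, so rounding these values to whole numbers gives the 12 and 11 per day reported for Q3 and Q4.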

Of the four test frameworks, build_metrics caused the most alert summaries, accounting for 49% of the total. The raptor framework was the second biggest contributor, with 25%.

Alert Summaries by Framework (Q4/2019)

Of the alert summaries generated, 14% were improvements. Of the alert summaries showing regressions, 50% were determined to be invalid, 8% were backed out, 17% were accepted, and 10% were fixed.

Regressions by Status (Q4/2019)

Our performance sheriffs triaged 61% of alerts within a day, and an additional 26% within 3 days.

Triage Response (Q4/2019)

Bugs were filed within 3 days for 67% of confirmed regressions, with a further 11% within 5 days.

Regression Bug Response (Q4/2019)

Here are some highlights for some of our sheriffed frameworks:

Are We Slim Yet (AWSY)

The largest improvement noticed for AWSY was on October 11th, where we saw up to a 3.88% decrease in Base Content JS. This was attributed to the work by André Bargull in bug 1570370 to move language tag parsing and Intl.Locale to C++.
On September 30th, a 6.15% regression was noticed in JS, caused by bug 1345830. This was fixed by Gijs in bug 1586220.
The open regression with the highest impact for AWSY was detected on June 12th. Whilst there were a large number of improvements in this alert, the 12.67% regression to images was also significant. The regression was caused by bug 1558763, which changed the value of a preference within Marionette. It looks like this may have been fixed, but our performance sheriffs have yet to verify this.

Raptor

The most significant improvement detected by Raptor was on December 2nd, with up to a 35.59% boost to many page load and benchmark tests on macOS. This was due to Nathan Froyd’s patch in bug 1599133 to enable constructing Sequence from moved nsTArrays.
On November 4th, a regression alert of 9.53% was reported against the Wikipedia page load test for bing.com on desktop. It turned out that bug 1591717 caused the unexpected regression. It was fixed by a patch by Emilio Cobos Álvarez to turn layout.css.notify-of-unvisited off for now.
Due to many page recordings being recreated, there are a lot of open alerts that require closer examination.

Talos

On December 18th we detected an improvement of up to 32.21% to several tests on macOS. This was thanks to Chris Manchester’s work on enabling PGO (Profile-Guided Optimisations) for the platform in bug 1604578.
Talos detected a 35.21% regression to perf_reftest_singletons on November 14th, which was caused by bug 1588431 and fixed by Emilio Cobos Álvarez in bug 1596712.
The open regression alert with the highest magnitude was opened on November 26th, and reports up to 196.47% regression to tp5o. It was caused by bug 1512011, which replaced mozhttpd with wptserve in Talos.