Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(android): AOSP observability blog post #534

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Blisse
Copy link
Contributor

@Blisse Blisse commented Nov 25, 2024

No description provided.

Copy link

cloudflare-workers-and-pages bot commented Nov 25, 2024

Deploying interrupt with  Cloudflare Pages  Cloudflare Pages

Latest commit: 3837eac
Status: ✅  Deploy successful!
Preview URL: https://8d209cc3.interrupt.pages.dev
Branch Preview URL: https://victorlai-interrupt.interrupt.pages.dev

View logs

@Blisse Blisse force-pushed the victorlai/interrupt branch from 4827e2f to 3837eac Compare November 26, 2024 00:57
@Blisse Blisse marked this pull request as ready for review November 26, 2024 18:16
@Blisse Blisse requested a review from a team as a code owner November 26, 2024 18:16

There are definitely many good reasons to build on top of AOSP. Android is really the only option for custom, touch-first experiences, and the popularity and price point of phone, tablet, and watch form factors is really attractive for a lot of different product experiences. Android SoC vendors bundle support for a lot of common capabilities by default: Bluetooth, LTE, Wi-Fi, cameras, batteries, sensors, and more. And Android also has an incredibly strong community and ecosystem, the continued support of Google, and millions of apps and developers.

But people often don’t distinguish between Android developers and AOSP developers enough. While Android *app* development is incredibly popular especially as a hobby, AOSP development is only really done for work due to the custom SoC required (outside of the LineageOS or GrapheneOS folks). More importantly, the actual work done by Android versus AOSP developers is often totally different. Android *app* developers are generally only involved with product-centric app development (which in itself is a huge problem space), whereas AOSP developers may develop some apps, but also work much closer to the kernel, by writing drivers, implementing vendor or hardware interfaces, creating and updating sepolicy, and more.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/for work/professionally


Android developers are very familiar with Android **app** observability tools - everyone by default integrates Crashlytics, Bugsnag, Sentry, Instabug, or another observability SDK. When I worked on an AOSP product, we actually built 10 or so apps to replicate some device functionality, so we had to integrate the observability SDK into all 10 apps. But what about the other 10 to 100 apps and native services that we don't normally touch that were part of the generic AOSP implementation?

When you build on top of AOSP, you’re actually potentially running 4 different sources of "apps": your own custom-built apps, your SoC vendor’s apps, the generic AOSP apps, and if you have an app store, third-party apps installed from the app store. You can only install an app observability SDK in your own custom-built apps, you can’t install SDKs into all these other apps, so you immediately hit the limit of what information app observability SDKs can provide.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you build a device on top of AOSP


![Apps you don't build yourself can't integrate app observability SDKs](/img/aosp-observability/app-sources.svg)

There are also 4 kinds of “apps”: (1) the common Android app written in Java or Kotlin, (2) native apps using C++ using the Android NDK, (3) binaries written in C++ or Rust, and (4) init.rc shell services which can invoke the binaries and more. App observability SDKs often come with NDK support, but they run into the same problem of only working on apps you build yourself, and there are 10 to 100 other apps running on the device that aren’t monitored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 4 types seem a bit muddled (what's the difference between native apps using C++ vs binaries written in C++?). Mention the kernel? Or maybe divide between "apps" whether native or not, vs system services, vs kernel?


As time went on though, we started running into more and more situations where the cached logs didn’t actually span the time of the incident, so they were basically useless. This is the natural result of a reactive pull-based model, and so we got to brainstorming what other solutions were possible.

We eventually landed on the desire for a push-based model - literally a push messaging system, where we could poke the device remotely to trigger an upload. And we designed an app that could query FileProviders and ContentProviders for any additional data. But we hit a couple of roadblocks: our device was not GMS-compliant so we didn’t have Firebase Cloud Messaging and would have to integrate our own push messaging system, and we needed to enlist the help of a backend team to build the push backend and the push trigger, but also a storage solution, and no team had the budget to help and our teams didn’t have the expertise to contribute. This ends up being a common sticking point with internal tools for mobile teams - Android and AOSP developers have a lot of mobile experience and expertise, but often need help creating a maintainable backend solution that makes sense for the volume and cost of storing and processing the data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define GMS


### Android Bug Reports

Logs are one tool to diagnose an issue, but Android has so much more information if you can access it - namely, Android Bug Reports, which gathers data from all over the device into a single zip file. So any AOSP observability solution that can capture and upload logcat logs automatically, may also want to build a way to trigger and upload Android Bug Reports when needed, too. And so Bug Reports are often the second thing that developers build when they just need more AOSP observability.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section makes it sounds like bugreports are the answer (not sure how deep you want to go into the alternatives, but maybe mention e.g. that they contain a complete dumpsys dump, which could also be captured individually)


## Understand the trends

As we encountered more and more bugs, and downloaded more and more logs, several colleagues decided to write a number of different parsers in Python that they would run on the logs to rule out certain problems, or filter the logs for certain crashes. As they encountered more and more issues, if they could identify the issue in the log, they could write a parser for that issue and re-use it for next time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add an example here of what we'd be looking for in the logs


Most teams that develop Android hardware already buy Android app observability tools, like Crashlytics, plus a general purpose product analytics tool, like Grafana, for their backend, so it seems natural to try to fit AOSP data into these tools. But these tools are not designed for AOSP, they’re designed for apps or for servers, and so their data models or pricing structures can feel like they don’t make sense.

- Existing Android app observability tools have a battle-tested data model for Java and Kotlin app crashes and C++ tombstones. But there’s really no way to record any other forms of data, so you have to coerce other kinds of crashes, like kernel panics, into the data model, which can be finicky and tedious to manually manage. -
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not really tombstones?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also feels like we should mention somewhere here: that part of the solution is collecting the right information on the device (e.g. dropbox + dumpsys vs bugreport) - signal/noise.


There's a ton of content on what observability means for Android apps, but content and tooling for AOSP device observability feels sparse in comparison. Most AOSP developers end up building internal observability tooling because of the lack of standard tooling. And most AOSP developers end on a similar journey: they discover the limits of the their apps observability tool, and they figure they want to pull logcat logs, then they find a way to push the logs from the device, and then they think of ways to pull the logs, and then discover even more Android data they want to pull, and then they finally realize they don't want to pull and parse the data individually and manually every time there's a problem.

One of the reasons I joined Memfault was to build the tooling I wish I had previously, and to help other Android developers working on AOSP avoid going through the same years of struggles, to learn the same lessons I had. And there is still many more lessons in AOSP observability to learn, if you have experience in observability at scale, I'd love to hear from you.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the narrative - I think it would be great with some example problems that you were trying to diagnose at scale, how logs might have helped you find them on an individual device, but not at the fleet level, etc - then whether either FWLS/L2M or other on-device stuff e.g. dumpsys feeding into a metrics-like system would have helped you.

If this blog is targeted at "monitoring" (vs "debugging") then I think that's a good start (just needs a bit of editing). But I'm wondering if we should also discuss how to make sense of the data for an individual device (i.e. timeline + device dashboard replaces that python tooling, to visualize)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants