Ship It Weekly - DevOps, SRE, Platform and Cloud Engineering News

by Teller's Tech - DevOps, SRE and Cloud Podcast
Ship It Conversations: Guardsquare’s Joel DeStefano on Mobile App Security, Runtime Protection, App Hardening, and Why Scanning Isn’t Enough
This is a guest conversation episode of Ship It Weekly, separate from the weekly news recaps. In this Ship It: Conversations episode, I talk with Joel DeStefano from Guardsquare about mobile app security, why it is different from backend and cloud security, and why scanning alone is not enough once an app is shipped into the real world.
We talk about the shift in trust model that happens with mobile apps. In backend and cloud systems, teams usually have more control over the runtime, infrastructure, policies, and monitoring. With mobile, the app becomes a public artifact running on someone else’s device, in an environment you do not fully control.
The bigger theme here is that mobile security is not just “scan it before release.” Scanning matters, but teams also need to think about app hardening, obfuscation, runtime protection, monitoring, and whether the app connecting back to their APIs is genuine and uncompromised.
Highlights
• Why mobile changes the trust model compared to backend and cloud systems
• What DevOps, SRE, and platform teams should understand about mobile app risk
• Why scanning is useful, but not enough by itself
• The danger of assuming app store approval means an app is secure
• Why “we do not store sensitive data in the app” can be a misleading security argument
• How attackers can reverse engineer apps, inspect workflows, and learn how the app talks to backend APIs
• What code hardening and obfuscation actually help protect against
• Why runtime checks matter for rooted devices, compromised environments, debuggers, hooking frameworks, overlays, and accessibility abuse
• The difference between Android and iOS security assumptions
• Why the OS is not responsible for protecting your app’s business logic
• How mobile security should fit into CI/CD without destroying release velocity
• What should block a release versus what should become tracked risk
• Why testing, hardening, runtime protection, and monitoring should work together as one strategy
• How AI may speed up attackers without fundamentally changing the need for strong security fundamentals
• Joel’s advice for improving mobile security posture: start with the app’s critical workflows, backend interactions, and real business risk
Joel / Guardsquare links
• Guardsquare: https://hubs.ly/Q04fJgkJ0
• Guardsquare Blog: https://www.guardsquare.com/blog
OWASP mobile security links
• OWASP Mobile Application Security: https://owasp.org/www-project-mobile-app-security/
• OWASP MASVS: https://mas.owasp.org/MASVS/
Our links
More episodes + show notes + links: https://shipitweekly.fm
On Call Brief: https://oncallbrief.com
PeopleSoft Zero-Day Exploited, npm v12 Install Script Changes, GitHub Agentic Tokens, Anthropic Model Risk, and Default Trust Breaking
This episode of Ship It Weekly is about default trust getting punished. Brian covers Oracle’s emergency PeopleSoft advisory for CVE-2026-35273, npm v12 changing install-script defaults, GitHub Agentic Workflows moving away from long-lived personal access tokens, and Anthropic disabling Fable 5 and Mythos 5 after a U.S. export-control directive. The common thread: legacy ERP systems, package installs, CI/CD agents, and AI models all become production risks when teams trust the default without checking what that trust can actually do. In the lightning round, Brian covers Tekton CloudEvents moving to a dedicated events controller, NVIDIA Triton Inference Server 26.04 changing inference defaults, AWS Nitro Isolation Engine bringing formal verification to Graviton5-based isolation, and Homebrew 6.0 adding explicit trust for third-party taps. The bigger theme: production does not care why you trusted the default. It only cares what that default was allowed to do.
Links Oracle PeopleSoft CVE-2026-35273 advisory https://www.oracle.com/security-alerts/alert-cve-2026-35273.html npm v12 breaking changes https://github.blog/changelog/2026-06-09-upcoming-breaking-changes-for-npm-v12/ GitHub Agentic Workflows no longer need PATs https://github.blog/changelog/2026-06-11-agentic-workflows-no-longer-need-a-personal-access-token/ Anthropic Fable 5 / Mythos 5 access statement https://www.anthropic.com/news/fable-mythos-access Tekton Pipelines releases https://github.com/tektoncd/pipeline/releases NVIDIA Triton Inference Server 26.04 release notes https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-26-04.html AWS Nitro Isolation Engine https://aws.amazon.com/blogs/compute/aws-nitro-isolation-engine-formally-verifying-the-hypervisor-in-the-aws-nitro-system/ Homebrew 6.0.0 https://brew.sh/2026/06/11/homebrew-6.0.0/ This week’s On Call Brief https://www.tellerstech.com/on-call-brief-news/2026-W25/ More episodes and show notes https://shipitweekly.fm/
Ship It Conversations: Meta’s Francois Richard on AI Incident Response, SLOs, and Reliability at Scale
This is a guest conversation episode of Ship It Weekly, separate from the weekly news recaps. In this Ship It: Conversations episode, I talk with Francois Richard, Engineering Director at Meta, about reliability at scale, how AI is changing production risk, what teams actually learn from incidents, and why recovery practice matters just as much as prevention. We talk about the proactive and reactive sides of reliability, why SLOs should represent a promise to users instead of just another dashboard number, how incident reviews should drive real system improvements, and how teams can practice recovery before production forces the lesson on them. The bigger theme here is that reliability is not just about avoiding failure. It is about knowing what happens when prevention fails. That means practicing regional failure, understanding overload behavior, improving incident response, using AI carefully during investigation, and making reliability targets match the actual lifecycle and importance of the system. 
Highlights
• Why reliability work starts with both prevention and recovery
• The difference between reactive incident response and proactive reliability engineering
• How Meta thinks about disaster recovery testing and regional failure practice
• Why an SLO should be treated like a promise to users, not just a dashboard metric
• How SLO trends help teams decide when to invest more in reliability or take more product risk
• What engineers actually learn during the “pressure cooker” of an incident
• Why incident reviews should produce follow-up work, not just a nicer explanation of what broke
• The difference between finding the cause of an incident and improving the system
• Where AI agents can help with incident investigation, telemetry, metrics, and query building
• Why AI-generated code can increase change volume while reducing human context
• How faster code generation changes the kinds of reliability problems teams should expect
• Why recovery practice matters, especially for region loss, traffic spikes, overload, and restart behavior
• What smaller DevOps and SRE teams can learn from Meta-scale reliability patterns
• Why not every system needs six nines, especially early in a product lifecycle
• How to think about reliability investment based on user promise, product maturity, and operational risk
• Why At Scale Systems & Reliability is focused on the infrastructure behind AI and the use of AI to operate large-scale systems
Francois’ links
• LinkedIn: https://www.linkedin.com/in/francoisrichard/
At Scale links
• Systems & Reliability 2026: https://bit.ly/4xd2FdG
• At Scale Conferences: https://atscaleconference.com/
Our links
More episodes + show notes + links: https://shipitweekly.fm
On Call Brief: https://oncallbrief.com
Coinbase Outage, Meta AI Account Recovery, AWS AgentCore Code Injection, Apigee Tenant Isolation, and the Glue That Breaks Production
This episode of Ship It Weekly is about the hidden glue holding production together. Brian covers Coinbase’s May 7 outage postmortem, where an AWS us-east-1 cooling failure exposed the difference between being “multi-AZ” on paper and actually being able to recover when stateful, low-latency systems are tied to a failed zone. Then he looks at Meta’s AI-assisted Instagram support issue and why account recovery is identity infrastructure, not just customer support. If AI can influence password resets, email changes, MFA resets, or account ownership flows, that workflow needs to be treated like a production control plane. The episode also covers AWS AgentCore CLI CVE-2026-11393, where collaborator metadata could break out into generated Python code during agent import, and an Apigee cross-tenant issue from Google’s Apigee security bulletins that shows why tenant isolation has to be tested beyond the obvious happy path. Links Coinbase May 7 outage postmortem https://www.coinbase.com/blog/a-postmortem-of-our-may-7-2026-outage Meta AI support / Instagram account recovery reporting https://www.theverge.com/tech/945658/meta-ai-support-chatbot-exploit-instagram-accounts AWS AgentCore CLI CVE-2026-11393 https://aws.amazon.com/security/security-bulletins/2026-040-aws/ AgentCore CLI GitHub advisory https://github.com/aws/agentcore-cli/security/advisories/GHSA-m4x6-gwgp-4pm7 Google Apigee security bulletins https://docs.cloud.google.com/apigee/docs/security-bulletins/security-bulletins Cloudflare real-time threat intel WAF rules https://blog.cloudflare.com/realtime-threat-intel-waf-rules/ AWS Lambda tenant isolation with event source mappings https://aws.amazon.com/blogs/compute/integrating-event-source-mappings-with-aws-lambda-tenant-isolation-mode/ Amazon OpenSearch Serverless next generation https://aws.amazon.com/about-aws/whats-new/2026/05/amazon-opensearch-serverless-next-generation-generally-available/ GitHub Enterprise Managed Users IP allow list coverage 
https://github.blog/changelog/2026-06-08-ip-allow-list-coverage-for-emu-namespaces-in-general-availability/ This week’s On Call Brief https://www.tellerstech.com/on-call-brief-news/2026-W24/ More episodes and show notes https://shipitweekly.fm/
Kiro CLI Approval Bypass, Amazon Braket Pickle Risk, AWS Org Logging, KEDA Upgrades, and Automation’s Hidden Boundaries
This episode of Ship It Weekly is about automation’s hidden boundaries. Brian covers Kiro CLI CVE-2026-9255, where piped stdin could act like user approval, Amazon Braket SDK CVE-2026-9291 and the very normal Python pickle risk hiding inside quantum job results, AWS Organizations finally emitting CloudTrail events when accounts join or leave an org, and KEDA updates that remind us autoscaling upgrades are production behavior changes. The bigger thread this week is that automation does not remove boundaries. It moves them. Approval paths, trusted data, account membership, scaling signals, platform access, and AI-generated output all need clear ownership and visibility. Brian also covers Kubernetes Dashboard being archived with Headlamp as the path forward, Google Cloud Remote MCP Server for AlloyDB, Apache Kafka 4.3.0, and Atlassian’s AI-native SDLC productivity claims. Sponsored by @Scale: Systems & Reliability, happening June 25 at the Meydenbauer Center in Bellevue, Washington. Register at https://bit.ly/4xd2FdG Links Kiro CLI CVE-2026-9255 https://aws.amazon.com/security/security-bulletins/2026-035-aws/ Amazon Braket SDK CVE-2026-9291 https://aws.amazon.com/security/security-bulletins/2026-036-aws/ AWS Organizations CloudTrail account events https://aws.amazon.com/about-aws/whats-new/2026/05/aws-organizations-cloudtrail/ KEDA v2.20.0 release https://github.com/kedacore/keda/releases/tag/v2.20.0 KEDA v2.19.0 release https://github.com/kedacore/keda/releases/tag/v2.19.0 Kubernetes Dashboard archived / Headlamp path forward https://kubernetes.io/blog/2026/06/04/dashboard-archived-what-now/ Google Cloud Remote MCP Server for AlloyDB https://cloud.google.com/blog/products/databases/alloydb-remote-mcp-server-now-ga Apache Kafka 4.3.0 https://www.confluent.io/blog/apache-kafka-4-3-release-announcement/ Atlassian AI-native SDLC productivity claims https://www.atlassian.com/blog/software-teams/ai-native-sdlc This week’s On Call Brief 
https://www.tellerstech.com/on-call-brief/2026-W23/ More episodes and show notes https://shipitweekly.fm/
GitHub Supply Chain Attacks, Railway’s GCP Outage, Discord’s Voice Failure, AWS Retry Changes, and Trusted Tool Risk
This episode of Ship It Weekly is about trusted tools becoming production dependencies. Brian covers a rough GitHub supply chain week, including the compromised Nx Console VS Code extension tied to exposed GitHub internal repositories and the Megalodon campaign abusing GitHub Actions workflows across thousands of public repos. The bigger thread this week is that the tools around production are increasingly part of production. Brian also covers Railway’s GCP account suspension outage, Discord’s voice outage during a Kubernetes migration, AWS changing SDK retry behavior, CVE-2026-9133 in the RabbitMQ AWS plugin, and a Reddit story about stolen AWS keys turning into a $14,000 Bedrock bill. Brian also touches on OpenTelemetry graduating from the CNCF, Claude Code security risk, GitLab Secrets Manager, Google Cloud AI spend caps, and a Redshift Python driver RCE. Full source list and extra links are available on this episode’s page at shipitweekly.fm. Links Nx Console compromise https://www.stepsecurity.io/blog/nx-console-vs-code-extension-compromised Megalodon GitHub Actions attack https://www.stepsecurity.io/blog/megalodon-mass-github-actions-secret-exfiltration-across-5-500-public-repositories Railway GCP outage https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage Discord voice outage https://discord.com/blog/behind-the-scenes-of-the-3-25-26-voice-outage AWS SDK retry changes https://aws.amazon.com/blogs/developer/announcing-updated-retry-behavior-for-aws-sdks-and-tools/ RabbitMQ AWS plugin CVE-2026-9133 https://aws.amazon.com/security/security-bulletins/2026-034-aws/ AWS Bedrock cost spike Reddit thread https://www.reddit.com/r/aws/comments/1tm3ydo/aws_bedrock_cost_spike_14000_usd/ This week’s On Call Brief https://www.tellerstech.com/on-call-brief/2026-W22/ More episodes and show notes https://shipitweekly.fm/
Ship It Conversations: Jake Warner on Cycle.io, Bare Metal’s Comeback, and Why Private Cloud Is Getting Interesting Again
This is a guest conversation episode of Ship It Weekly, separate from the weekly news recaps. In this Ship It: Conversations episode, I talk with Jake Warner, founder and CEO of Cycle.io, about private cloud, bare metal, Kubernetes fatigue, and why some teams are rethinking how much infrastructure complexity they actually want to carry. We talk about why bare metal and private cloud are getting interesting again, especially around cost, performance, data sovereignty, compliance, and platform ownership. Jake explains how Cycle approaches infrastructure as a pool of resources, why he thinks in terms of “environments as code” instead of traditional infrastructure as code, and how teams can run containers and VMs together across bare metal, cloud, and hybrid environments. The bigger theme here is that this is not really a “cloud versus bare metal” conversation. It is about choosing the right level of abstraction. Sometimes Kubernetes is the right answer. Sometimes managed cloud services make sense. And sometimes teams just need a more opinionated platform that lets developers ship without requiring a large DevOps army to keep everything running. 
Highlights
• Why some teams are moving back toward private cloud and bare metal
• The role of cost, data sovereignty, compliance, and performance in infrastructure decisions
• Why bare metal does not have to mean going back to old-school racking and stacking pain
• How Cycle turns raw compute into a private cloud-style resource pool
• Why Jake thinks about “environments as code” instead of only infrastructure as code
• What “no DevOps army required” means in practice for engineering-heavy teams
• Why some companies need VMs and containers running together on the same platform
• Where Kubernetes still makes sense, especially for highly customized infrastructure needs
• Why opinionated platforms can be valuable when teams want fewer knobs and better defaults
• Active-active thinking, failover risk, and why application-level replication often matters more than platform-level storage magic
• Why bandwidth, performance density, and predictable pricing can make bare metal attractive again
• The weird continued gravity of AWS us-east-1, even for teams trying to move workloads elsewhere
• How AI workloads, GPUs, and hype cycles fit into the private cloud and platform conversation
• Jake’s advice for modernizing hybrid or on-prem infrastructure: containerize first, then look hard at your dependencies
Jake’s links
• Cycle.io: https://cycle.io/
• Cycle Slack community: https://slack.cycle.io/
• Jake Warner on LinkedIn: https://www.linkedin.com/in/jakewarner/
Our links
More episodes + show notes + links: https://shipitweekly.fm
On Call Brief: https://oncallbrief.com
CISA’s GitHub Leak, AI Root Cause Analysis, Copilot Agents, Claude Code in CI/CD, and Kubernetes Seccomp Risk
This episode of Ship It Weekly is about secrets, agents, risky defaults, and follow-up work that never gets done. Brian covers the CISA contractor GitHub leak involving AWS keys, internal docs, Terraform, Kubernetes, Argo CD, and CI/CD context, plus AWS DevOps Agent doing automated RCA across Datadog, Elasticsearch, CloudTrail, and EKS. Brian also covers MS Copilot Studio computer-using agents, Claude Code in Bitbucket Agentic Pipelines, CVE-2026-46333 and Kubernetes seccomp defaults, GitHub OIDC for Dependabot, Java pods getting OOMKilled, LLM-generated SQL that can be wrong but still run, and why postmortem action items die without ownership. Sponsored by Guardsquare https://hubs.ly/Q04fJgkJ0 Links CISA GitHub leak https://blog.gitguardian.com/how-we-got-a-cisa-github-leak-taken-down-in-26-hours/ AWS DevOps Agent RCA https://aws.amazon.com/blogs/devops/automate-root-cause-analysis-across-datadog-and-elasticsearch-with-aws-devops-agent/ Microsoft Copilot Studio computer-using agents https://techcommunity.microsoft.com/blog/copilot-studio-blog/computer-using-agents-in-microsoft-copilot-studio-are-now-generally-available/4519427 Atlassian Agentic Pipelines with Claude Code https://support.atlassian.com/bitbucket-cloud/docs/agentic-pipelines/ CVE-2026-46333 https://nvd.nist.gov/vuln/detail/CVE-2026-46333 Kubernetes seccomp https://kubernetes.io/docs/reference/node/seccomp/ GitHub OIDC for Dependabot and code scanning https://github.blog/changelog/2026-05-19-expanded-oidc-support-for-dependabot-and-code-scanning/ Java pods OOMKilled in Kubernetes https://dzone.com/articles/java-pod-oomkill-kubernetes LLM-generated SQL risks https://readyset.io/blog/why-llms-write-incorrect-sql-and-what-that-means-for-your-database Postmortem action items https://incident.io/blog/why-do-post-mortem-action-items-fail-how-to-make-incident-follow-ups-actually-get-done On Call Brief https://www.tellerstech.com/on-call-brief/2026-W21/ More episodes + show notes https://shipitweekly.fm/
AI Agents Get API Access and Identity: GitHub Copilot Cloud Agents, MCP Auth, Ansible Automation, OpenAI Daybreak, and the New Production Risk
This episode of Ship It Weekly is about AI agents moving from helpful coding assistants into real operational actors. Brian covers GitHub making Copilot cloud agent tasks available through a REST API, Auth0 bringing authentication and authorization to MCP servers, Red Hat positioning Ansible as a trusted execution layer for agentic IT operations, and OpenAI Daybreak pushing AI deeper into security research and remediation. The bigger thread this week is authority: what these agents can reach, what they can change, who approved the action, and who owns the outcome when something breaks. Brian also covers Discord’s ScyllaDB automation work, AWS GuardDuty crypto mining detection, queues and back pressure, and a Datadog PostgreSQL case where an index scan was still painfully slow. Sponsored by Guardsquare https://hubs.ly/Q04fJgkJ0 Links GitHub Copilot cloud agent tasks via REST API https://github.blog/changelog/2026-05-13-start-copilot-cloud-agent-tasks-via-the-rest-api/ GitHub REST API endpoints for agent tasks https://docs.github.com/en/rest/agent-tasks/agent-tasks Auth0 Auth for MCP is now generally available https://auth0.com/blog/auth0-auth-for-mcp-servers-generally-available/ Red Hat on Ansible as the execution layer for agentic IT https://www.redhat.com/en/about/press-releases/red-hat-establishes-ansible-automation-platform-trusted-execution-layer-it-operations-agentic-era OpenAI Daybreak https://openai.com/daybreak/ Discord automates ScyllaDB clusters at scale https://discord.com/blog/how-discord-automates-scylladb-clusters-at-scale AWS GuardDuty crypto mining detection and prevention https://aws.amazon.com/blogs/security/detecting-and-preventing-crypto-mining-in-your-aws-environment/ Queues do not absorb load, they delay failure https://dzone.com/articles/queues-dont-absorb-load-they-delay-bankruptcy Datadog on inefficient PostgreSQL index scans https://www.datadoghq.com/blog/detect-inefficient-index-scans-with-dbm/ This week’s On Call Brief 
https://www.tellerstech.com/on-call-brief/2026-W20/ More episodes and show notes https://shipitweekly.fm/
Cursor Deletes PocketOS Prod DB, .de DNSSEC Outage, Bluesky Postmortem, Argo CD, and Copy Fail
This episode of Ship It Weekly is about modern reliability getting squeezed from both directions. Old-school failures still hit hard, like broken DNSSEC, kernel privilege escalation bugs, and GitOps behavior changes. But newer automation layers add a second kind of risk, where AI agents, machine identity, and cloud control planes can do real damage fast when authority is too broad. Brian covers the Cursor and PocketOS production database wipe, the .de DNSSEC outage and Cloudflare’s response, Bluesky’s April outage postmortem, Argo CD v3.1.16 reaching end of life plus the v3.4.1 behavior change, Linux kernel CVE-2026-31431 under active exploitation, and why Google Cloud Agent Identity and AWS MCP Server GA both point to agents becoming first-class infrastructure actors. Sponsored by Guardsquare https://hubs.ly/Q04fJgkJ0 Links Cursor / PocketOS production database wipe https://www.tellerstech.com/on-call-brief/2026-W19/ Cloudflare on the .de DNSSEC outage https://blog.cloudflare.com/de-tld-outage-dnssec/ Bluesky April 2026 outage postmortem https://pckt.blog/b/jcalabro/april-2026-outage-post-mortem-219ebg2 Argo CD releases: v3.1.16 final release and v3.4.1 behavior change https://github.com/argoproj/argo-cd/releases Linux kernel CVE-2026-31431 https://nvd.nist.gov/vuln/detail/CVE-2026-31431 AWS bulletin for CVE-2026-31431 https://aws.amazon.com/security/security-bulletins/rss/2026-026-aws/ Google Cloud Agent Identity https://cloud.google.com/blog/products/identity-security/whats-new-in-iam-security-governance-and-runtime-defense AWS MCP Server is now generally available https://aws.amazon.com/blogs/aws/the-aws-mcp-server-is-now-generally-available/ Cross-region disaster recovery for Amazon EKS using AWS Backup https://aws.amazon.com/blogs/containers/cross-region-disaster-recovery-for-amazon-eks-using-aws-backup/ Google Ads new data retention policy starting June 1, 2026 https://ads-developers.googleblog.com/2026/05/new-data-retention-policy-for-google.html This 
week’s On Call Brief https://www.tellerstech.com/on-call-brief/2026-W19/ More episodes and show notes https://shipitweekly.fm/