AWS · June 9, 2026 · 2 sources

Apache Spark 4.0 reaches general availability on Amazon EMR

AI Analysis

AWS made Apache Spark 4.0 generally available across its full EMR lineup: EMR Serverless, EMR on EC2, and EMR on EKS. Headline capabilities include Spark Connect (a decoupled client-server architecture that lets developers run interactive PySpark sessions from a remote client against the cluster), the new Variant data type for semi-structured data, SQL scripting, Python API improvements, and streaming enhancements.
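To make the Spark Connect and Variant features concrete, here is a minimal PySpark sketch. It assumes a reachable Spark Connect endpoint; the host name, port, and token are placeholders, and the way EMR Serverless exposes and authenticates its endpoint may differ from the plain `sc://` string built here. `parse_json` and `variant_get` are the Spark 4.0 SQL functions for creating and reading Variant values.

```python
from typing import Optional


def spark_connect_url(host: str, port: int = 15002,
                      use_ssl: bool = False,
                      token: Optional[str] = None) -> str:
    """Build a Spark Connect string of the form sc://host:port/;key=value."""
    params = []
    if use_ssl:
        params.append("use_ssl=true")
    if token:
        params.append(f"token={token}")
    suffix = "/;" + ";".join(params) if params else ""
    return f"sc://{host}:{port}{suffix}"


def variant_demo(endpoint_url: str):
    """Open a remote session and read one field out of a Variant value."""
    # Imported lazily so the URL helper works without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(endpoint_url).getOrCreate()
    # parse_json yields a VARIANT; variant_get extracts a typed field by path.
    return spark.sql(
        """
        SELECT variant_get(v, '$.device.id', 'string') AS device_id
        FROM (SELECT parse_json('{"device": {"id": "sensor-7"}}') AS v)
        """
    ).collect()


# Hypothetical usage; requires pyspark 4.x and a live Spark Connect server:
# rows = variant_demo(spark_connect_url("spark.example.internal"))
```

Because the client only needs the connection string, the same notebook code can be pointed at a different Spark runtime by changing the URL rather than the job logic.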

Complementing the GA, Amazon SageMaker Unified Studio Notebooks now support EMR Serverless with Spark Connect, giving data engineers and analysts flexibility to pick the optimal Spark runtime for interactive analytics and data engineering. AWS also showcased an agentic migration path: the AWS Spark Upgrade Agent iteratively validates apps moving from Spark 3.5 to 4.0 on EMR Serverless and auto-diagnoses failures from CloudWatch logs until jobs succeed.
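The iterate-until-success pattern described for the upgrade agent can be sketched generically as follows. This is not the AWS Spark Upgrade Agent's API: `run_job`, `diagnose`, and `apply_fix` are hypothetical callables standing in for submitting the job, reading its failure logs, and patching the application, and the attempt cap is an added assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunResult:
    succeeded: bool
    log_text: str = ""  # failure log contents, e.g. pulled from a log service


def upgrade_until_green(run_job: Callable[[], RunResult],
                        diagnose: Callable[[str], str],
                        apply_fix: Callable[[str], None],
                        max_attempts: int = 5) -> List[str]:
    """Run, diagnose, patch, and retry until the job succeeds or attempts run out."""
    fixes: List[str] = []
    for _ in range(max_attempts):
        result = run_job()
        if result.succeeded:
            return fixes  # every fix applied before the successful run
        fix = diagnose(result.log_text)  # turn the failure log into a proposed fix
        apply_fix(fix)                   # patch the application before retrying
        fixes.append(fix)
    raise RuntimeError(f"job still failing after {max_attempts} attempts")
```

Returning the list of applied fixes keeps an audit trail of what changed between the Spark 3.5 and 4.0 versions of the job.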

While less glamorous than frontier-model launches, the Spark 4.0 GA matters because data pipelines are the substrate of enterprise AI: feature engineering, training-data preparation, and analytics all run on Spark at scale. The agent-driven upgrade tooling is itself a sign of the agentic theme reaching infrastructure operations, where AI agents now handle migration drudgery. It fits a broad week of AWS agentic releases: Strands Agents with Bedrock AgentCore for insurance first-notice-of-loss (FNOL) intake, an incident-triage agent built with Amazon Quick and New Relic, Isaac Lab robot reinforcement learning on SageMaker, and medical-record digitization with Bedrock Data Automation and HealthLake. Together, these releases underscore AWS's strategy of embedding agents across the data and ops stack rather than competing head-on at the frontier-model layer.

Sources

aws.amazon.com

https://aws.amazon.com/blogs/big-data/announcing-general-availability-of-apache-spark-4-0-on-amazon-emr/

aws.amazon.com

https://aws.amazon.com/blogs/big-data/upgrade-pyspark-from-spark-3-5-to-spark-4-0-with-aws-spark-upgrade-agent/