The Apache Parquet project is making significant strides in decoupling its core Java library from Apache Hadoop, a move that promises to dramatically reduce footprint and complexity for developers. Recent progress, highlighted by a tweet from "dex" stating, "> removing hadoop dependencies from parquet (as a bit) https://t.co/ScD7eDQVP4", underscores a long-standing community effort to make Parquet lighter and more versatile. This ongoing work has already enabled developers to cut shaded JAR sizes by more than 85%.
Historically, the official Parquet Java implementation has been tightly coupled to Hadoop, pulling in a substantial tree of transitive dependencies. That integration is useful inside the Hadoop ecosystem, but it created challenges for developers who wanted to use Parquet in smaller systems or in environments without Hadoop: the bulky dependencies often produced large application binaries and complicated dependency management.
The community has actively sought solutions to this problem for years, with issues such as PARQUET-1822 and PARQUET-1775 advocating a cleaner separation. Current workarounds involve switching to Parquet's non-Hadoop interfaces, removing Hadoop imports from application code, and meticulously managing transitive dependencies. This "dependency surgery" lets applications read and write Parquet files through standard Java NIO interfaces for I/O, even against cloud storage like S3, without the overhead of Hadoop's filesystem abstractions.
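A minimal sketch of the NIO-based approach, assuming a recent parquet-java release in which `org.apache.parquet.io.LocalInputFile` is available (the file name `data.parquet` is a placeholder):

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.io.LocalInputFile;
import org.apache.parquet.schema.MessageType;

public class NioParquetExample {
    public static void main(String[] args) throws IOException {
        // A plain java.nio.file.Path, no Hadoop Configuration or FileSystem.
        Path path = Paths.get("data.parquet");

        // LocalInputFile adapts the NIO Path to Parquet's InputFile
        // interface, bypassing Hadoop's filesystem abstractions entirely.
        try (ParquetFileReader reader =
                ParquetFileReader.open(new LocalInputFile(path))) {
            MessageType schema =
                reader.getFooter().getFileMetaData().getSchema();
            System.out.println("rows=" + reader.getRecordCount());
            System.out.println(schema);
        }
    }
}
```

The same `InputFile`/`OutputFile` seam is what lets cloud-storage clients plug in without Hadoop: any store that can serve seekable byte ranges can implement the interface directly.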
The benefits of this decoupling are substantial, primarily manifesting in significantly smaller library sizes and reduced classpath sprawl. For instance, some production systems have seen their shaded JARs shrink from approximately 657MB to around 96MB by cutting out most of the Hadoop transitive dependencies. This reduction makes Parquet more appealing for microservices, serverless functions, and other lightweight applications where a minimal footprint is critical.
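Until an officially Hadoop-free artifact ships, the usual way to achieve this kind of shrinkage is to exclude the Hadoop transitive tree in the build and supply only what the application actually touches. A Maven sketch, where the choice of `parquet-avro` and the version number are illustrative, not prescribed by the project:

```xml
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.14.1</version>
  <exclusions>
    <!-- Drop the Hadoop client and its large transitive tree. -->
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Code paths that still reference Hadoop classes (for example, anything built around `org.apache.hadoop.conf.Configuration`) will fail at runtime after this exclusion, which is why the exclusion goes hand in hand with migrating to the non-Hadoop interfaces described above.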
While the tweet from "dex" indicates an incremental step, it reflects the continuous progress towards a more independent Parquet. The community anticipates that future major releases, potentially Parquet 2.0, will officially remove this explicit Hadoop coupling, further simplifying its adoption across diverse data processing landscapes. Parquet, an open-source columnar storage format, is already widely used beyond Hadoop, especially with cloud storage systems and data lakehouse frameworks like Apache Iceberg, Delta Lake, and Apache Hudi, making its independence even more crucial for modern data architectures.