Parquet, Delta, Iceberg & DuckLake - An introduction for developers

About This Session

CSVs are inefficient. Everyone knows that. And yet CSV is probably still the most widely used file format, even among data scientists. At the same time, data engineers talk about Parquet, Iceberg, and DuckLake, and roll their eyes when someone still uses CSV or JSON. As a software engineer, there's often nothing left to do but close your eyes and go for it, even if you don't really understand it. You read CSV, you write Delta or Iceberg. The main thing is that the data guys don't complain. But what are the differences? Why do we store data in files in the first place? What is all this metadata? What do I really need, and what not? And why can't we just look at storage, but also have to take compute into account? It's high time to dive in.