Postgres Data Export to S3
The cluster focuses on tools, scripts, and strategies for exporting or syncing data from Postgres (and similar databases) to S3 in Parquet or Iceberg formats for querying with Athena, DuckDB, BigQuery, or ClickHouse, often for analytics workloads.
Sample Comments
Thank you! Yes, absolutely! 1) You could use BemiDB to sync your Postgres data (e.g., partitioned time-series tables) to S3 in Iceberg format. Iceberg is essentially a "table" abstraction on top of columnar Parquet data files with a schema, history, etc. 2) If you don't need strong consistency and are fine with delayed data (the main trade-off), you can use just BemiDB to query and visualize all data directly from S3. From a query perspective, it's like DuckDB that talks Postgres.
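As a rough illustration of the Iceberg-on-S3 pattern this comment describes (not BemiDB's actual interface), a query against such a table can be sketched with DuckDB's iceberg extension; the bucket, table path, and credentials below are placeholders:

```python
# Minimal sketch: query an Iceberg table stored on S3 with DuckDB.
# Assumes the table already exists at the given (hypothetical) S3 path.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")    # S3 access
con.execute("INSTALL iceberg; LOAD iceberg;")  # Iceberg table format support

# Placeholder credentials; in practice use an IAM role or a secrets manager.
con.execute("""
    CREATE SECRET s3_creds (
        TYPE S3,
        KEY_ID 'AKIA...',
        SECRET '...',
        REGION 'us-east-1'
    );
""")

# Read the Iceberg metadata and underlying Parquet data files directly from S3.
rows = con.execute("""
    SELECT date_trunc('day', created_at) AS day, count(*) AS events
    FROM iceberg_scan('s3://analytics-bucket/iceberg/events')
    GROUP BY 1
    ORDER BY 1;
""").fetchall()
```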
I've been using DuckDB to import data into Postgres (especially CSVs and JSON) and it has been really effective. DuckDB can run SQL across the different data formats and insert or update directly into Postgres. I run DuckDB with Python and Prefect for batch jobs, but you can use whatever language or scheduler you prefer. I can't recommend this setup enough. The only weird thing I've run into is that a really complex join across multiple Postgres tables and Parquet files had a bug.
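A minimal sketch of that setup, assuming DuckDB's postgres extension and placeholder connection details, table, and file names:

```python
# Load CSV + JSON into Postgres with DuckDB; schedule from Prefect, cron, etc.
import duckdb

def load_batch(csv_path: str, json_path: str) -> None:
    con = duckdb.connect()
    con.execute("INSTALL postgres; LOAD postgres;")
    # Attach the target Postgres database so DuckDB can insert into it directly.
    con.execute("ATTACH 'host=localhost dbname=app user=etl' AS pg (TYPE postgres);")

    # One SQL statement can read both file formats and push the result
    # straight into a Postgres table.
    con.execute(f"""
        INSERT INTO pg.public.orders
        SELECT c.order_id, c.amount, j.metadata
        FROM read_csv_auto('{csv_path}') AS c
        JOIN read_json_auto('{json_path}') AS j USING (order_id);
    """)

if __name__ == "__main__":
    load_batch("orders.csv", "orders.json")
```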
Cool, would this be better than using a ClickHouse / DuckDB extension that reads Postgres and saves to Parquet? What would be recommended for regularly exporting old data to S3 as Parquet files? A cron job that launches a second Postgres process connecting to the database and extracting the data, or using the regular database instance? Doesn't that slow down the instance too much?
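One possible shape of the cron-job approach asked about here, sketched with DuckDB reading Postgres over a normal client connection and writing Parquet to S3; the table, bucket, and retention window are made up:

```python
# Archive rows older than 90 days to S3 as Parquet; run from cron or a scheduler
# during a quiet window, since the read is just a large SELECT on the instance.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")      # assumes S3 credentials are
con.execute("INSTALL postgres; LOAD postgres;")  # configured via a DuckDB secret or env
con.execute("ATTACH 'host=localhost dbname=app user=archiver' AS pg (TYPE postgres);")

con.execute("""
    COPY (
        SELECT * FROM pg.public.events
        WHERE created_at < now() - INTERVAL 90 DAY
    )
    TO 's3://analytics-bucket/archive/events_old.parquet' (FORMAT PARQUET);
""")
```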
Why not use parquet files + AWS Athena?
Yeah... it would be much easier to copy the data to S3 (or any object storage), ideally converting it into a columnar format like Parquet, and query it directly using a SQL-on-lake engine like Dremio or Athena; S3 Select would work too.
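For illustration, querying those Parquet files on S3 with Athena from Python might look roughly like this; the database, table, and bucket names are assumptions, and the external table is assumed to already be registered (e.g., via a Glue crawler or CREATE EXTERNAL TABLE):

```python
# Run an Athena query over Parquet on S3 and fetch the results.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

qid = athena.start_query_execution(
    QueryString="SELECT count(*) FROM events WHERE year = '2024'",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://analytics-bucket/athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then read the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    print(rows)
```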
Why wouldn't you just copy your Parquet data into Postgres?
You should be able to achieve that with this tool paired with postgres foreign data wrappers!
Built this recently to help a friend set up a Snowflake warehouse from their Postgres database. Also tested it with ClickHouse, which is cool for running locally. It uses simple COPY, which is fast and doesn't require binlog access, but as a result it doesn't support real-time replication.
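The COPY-based, non-CDC approach mentioned here can be illustrated with a small psycopg2 snapshot sketch; this is not the linked tool's actual code, and the connection string and table are placeholders:

```python
# Snapshot a table over a plain connection with COPY; no binlog/WAL access needed.
import psycopg2

conn = psycopg2.connect("host=localhost dbname=app user=readonly")
with conn, conn.cursor() as cur, open("orders_snapshot.csv", "w") as out:
    # COPY streams the whole result quickly; re-running it re-snapshots the table,
    # which is why this approach can't do real-time replication.
    cur.copy_expert("COPY (SELECT * FROM public.orders) TO STDOUT WITH CSV HEADER", out)
```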
Cloud SQL has BigQuery connections that can be leveraged. But yeah, this seems like a nice solution if you have a Postgres instance outside of Cloud SQL. Another approach would be to write the CDC stream to a message queue and archive that to Parquet.
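The CDC-to-queue-to-Parquet idea could look roughly like this sketch; the broker, topic, batch size, and file naming are all assumptions, and a real pipeline would also handle schemas, partitioning, and uploading the files to S3:

```python
# Consume change events from Kafka and periodically flush them to Parquet.
import json
import time
from confluent_kafka import Consumer
import pyarrow as pa
import pyarrow.parquet as pq

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "cdc-archiver",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["postgres.cdc.events"])

batch = []
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        batch.append(json.loads(msg.value()))
        if len(batch) >= 10_000:
            # Flush a batch of change events to a timestamped Parquet file.
            pq.write_table(pa.Table.from_pylist(batch), f"cdc_batch_{int(time.time())}.parquet")
            batch.clear()
finally:
    consumer.close()
```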
I know it's a pain to get data into it, but how about BigQuery?