Postgres Data Export to S3
The cluster focuses on tools, scripts, and strategies for exporting or syncing data from Postgres (and similar databases) to S3 in Parquet or Iceberg formats for querying with Athena, DuckDB, BigQuery, or ClickHouse, often for analytics workloads.
Sample Comments
Thank you! Yes, absolutely! 1) You could use BemiDB to sync your Postgres data (e.g., partitioned time-series tables) to S3 in Iceberg format. Iceberg is essentially a "table" abstraction on top of columnar Parquet data files with a schema, history, etc. 2) If you don't need strong consistency and are fine with delayed data (the main trade-off), you can use just BemiDB to query and visualize all data directly from S3. From a query perspective, it's like DuckDB that talks Postgres.
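As a rough illustration of the Iceberg-on-S3 pattern this comment describes (not BemiDB's actual interface), a query against such a table can be sketched with DuckDB's iceberg extension; the bucket, table path, and credentials below are placeholders:

```python
# Minimal sketch: query an Iceberg table stored on S3 with DuckDB.
# Assumes the table already exists at the given (hypothetical) S3 path.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")    # S3 access
con.execute("INSTALL iceberg; LOAD iceberg;")  # Iceberg table format support

# Placeholder credentials; in practice use an IAM role or a secrets manager.
con.execute("""
    CREATE SECRET s3_creds (
        TYPE S3,
        KEY_ID 'AKIA...',
        SECRET '...',
        REGION 'us-east-1'
    );
""")

# Read the Iceberg metadata and underlying Parquet data files directly from S3.
rows = con.execute("""
    SELECT date_trunc('day', created_at) AS day, count(*) AS events
    FROM iceberg_scan('s3://analytics-bucket/iceberg/events')
    GROUP BY 1
    ORDER BY 1;
""").fetchall()
```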
I've been using DuckDB to import data into Postgres (especially CSVs and JSON) and it has been really effective. DuckDB can run SQL across the different data formats and insert or update directly into Postgres. I run DuckDB with Python and Prefect for batch jobs, but you can use whatever language or scheduler you prefer. I can't recommend this setup enough. The only weird thing I've run into is that a really complex join across multiple Postgres tables and Parquet files had a bug.
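A minimal sketch of that setup, assuming DuckDB's postgres extension and placeholder connection details, table, and file names:

```python
# Load CSV + JSON into Postgres with DuckDB; schedule from Prefect, cron, etc.
import duckdb

def load_batch(csv_path: str, json_path: str) -> None:
    con = duckdb.connect()
    con.execute("INSTALL postgres; LOAD postgres;")
    # Attach the target Postgres database so DuckDB can insert into it directly.
    con.execute("ATTACH 'host=localhost dbname=app user=etl' AS pg (TYPE postgres);")

    # One SQL statement can read both file formats and push the result
    # straight into a Postgres table.
    con.execute(f"""
        INSERT INTO pg.public.orders
        SELECT c.order_id, c.amount, j.metadata
        FROM read_csv_auto('{csv_path}') AS c
        JOIN read_json_auto('{json_path}') AS j USING (order_id);
    """)

if __name__ == "__main__":
    load_batch("orders.csv", "orders.json")
```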
Cool, would this be better than using a ClickHouse / DuckDB extension that reads Postgres and saves to Parquet? What would be recommended for regularly exporting old data to S3 as Parquet files? A cron job that launches a second Postgres process connecting to the database and extracting the data, or using the regular database instance? Doesn't that slow down the instance too much?
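One possible shape of the cron-job approach asked about here, sketched with DuckDB reading Postgres over a normal client connection and writing Parquet to S3; the table, bucket, and retention window are made up:

```python
# Archive rows older than 90 days to S3 as Parquet; run from cron or a scheduler
# during a quiet window, since the read is just a large SELECT on the instance.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")      # assumes S3 credentials are
con.execute("INSTALL postgres; LOAD postgres;")  # configured via a DuckDB secret or env
con.execute("ATTACH 'host=localhost dbname=app user=archiver' AS pg (TYPE postgres);")

con.execute("""
    COPY (
        SELECT * FROM pg.public.events
        WHERE created_at < now() - INTERVAL 90 DAY
    )
    TO 's3://analytics-bucket/archive/events_old.parquet' (FORMAT PARQUET);
""")
```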
Why not use parquet files + AWS Athena?
Yeah... it would be much easier to copy the data to S3 (or any object storage), ideally converting it into a columnar format like Parquet, and query it directly using a SQL-on-lake engine like Dremio or Athena; S3 Select would work too.
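For illustration, querying those Parquet files on S3 with Athena from Python might look roughly like this; the database, table, and bucket names are assumptions, and the external table is assumed to already be registered (e.g., via a Glue crawler or CREATE EXTERNAL TABLE):

```python
# Run an Athena query over Parquet on S3 and fetch the results.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

qid = athena.start_query_execution(
    QueryString="SELECT count(*) FROM events WHERE year = '2024'",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://analytics-bucket/athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then read the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    print(rows)
```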
Why wouldn't you just copy your Parquet data into Postgres?
You should be able to achieve that with this tool paired with postgres foreign data wrappers!
Built this recently to help a friend set up a Snowflake warehouse from their Postgres database. Also tested it with ClickHouse, which is cool for running locally. It uses simple COPY, which is fast and doesn't require binlog access, but as a result it doesn't support real-time replication.
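The COPY-based, non-CDC approach mentioned here can be illustrated with a small psycopg2 snapshot sketch; this is not the linked tool's actual code, and the connection string and table are placeholders:

```python
# Snapshot a table over a plain connection with COPY; no binlog/WAL access needed.
import psycopg2

conn = psycopg2.connect("host=localhost dbname=app user=readonly")
with conn, conn.cursor() as cur, open("orders_snapshot.csv", "w") as out:
    # COPY streams the whole result quickly; re-running it re-snapshots the table,
    # which is why this approach can't do real-time replication.
    cur.copy_expert("COPY (SELECT * FROM public.orders) TO STDOUT WITH CSV HEADER", out)
```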
Cloud SQL has BigQuery connections that can be leveraged. But yeah, this seems like a nice solution if you have a Postgres instance outside of Cloud SQL. Another approach would be to write the CDC stream to a message queue and archive that to Parquet.
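The CDC-to-queue-to-Parquet idea could look roughly like this sketch; the broker, topic, batch size, and file naming are all assumptions, and a real pipeline would also handle schemas, partitioning, and uploading the files to S3:

```python
# Consume change events from Kafka and periodically flush them to Parquet.
import json
import time
from confluent_kafka import Consumer
import pyarrow as pa
import pyarrow.parquet as pq

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "cdc-archiver",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["postgres.cdc.events"])

batch = []
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        batch.append(json.loads(msg.value()))
        if len(batch) >= 10_000:
            # Flush a batch of change events to a timestamped Parquet file.
            pq.write_table(pa.Table.from_pylist(batch), f"cdc_batch_{int(time.time())}.parquet")
            batch.clear()
finally:
    consumer.close()
```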
I know it's a pain to get data into it, but how about BigQuery?