Does anyone have a process that can generate a small but representative subset of the data in the Data.fs and blobs?
We want to be able to run tests and develop on representative data without having to pull the entire dataset into development/testing.
We prefer to use data that is representative of the production data. In fact, we often just sync the Data.fs and blobs from production to development. On larger projects this becomes impractical once the data outgrows what our development instances can hold. A system that generates a representative subset would give us a smaller, more manageable footprint.
I'm also open to a totally different approach to solving the general issue.
You can use the statistics (Zope -> objects in cache and used, web analytics) to get an idea of what could be "representative of the production data".
Why can't the testing instances handle the production database?
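As a rough illustration of how such statistics could drive the selection, here is a minimal Python sketch. It assumes you have already exported one record per content object with a path, a content type, a hit count from your analytics, and a size in bytes. The function name `pick_subset` and the keys `path`, `portal_type`, `hits` and `size` are hypothetical, not part of any Zope or Plone API, and the sketch only chooses paths; copying the chosen objects into a fresh Data.fs is a separate step it does not cover.

```python
from collections import defaultdict


def pick_subset(items, per_type=2, budget_bytes=None):
    """Pick the most-visited objects of each content type.

    items: iterable of dicts with the hypothetical keys
    'path', 'portal_type', 'hits' and 'size' (bytes).
    Returns a list of paths.
    """
    # Group the records by content type so every type is represented.
    by_type = defaultdict(list)
    for item in items:
        by_type[item["portal_type"]].append(item)

    chosen = []
    for portal_type in sorted(by_type):
        # Most-visited first; ties broken by path so the result is deterministic.
        ranked = sorted(by_type[portal_type], key=lambda i: (-i["hits"], i["path"]))
        chosen.extend(ranked[:per_type])

    if budget_bytes is not None:
        # Optional size cap: keep the most-visited picks that still fit.
        chosen.sort(key=lambda i: (-i["hits"], i["path"]))
        kept, total = [], 0
        for item in chosen:
            if total + item["size"] <= budget_bytes:
                kept.append(item)
                total += item["size"]
        chosen = kept

    return [item["path"] for item in chosen]
```

Calling `pick_subset(records, per_type=50, budget_bytes=5 * 1024**3)` would then give a list of paths that covers every content type and stays under roughly 5 GB, assuming the recorded sizes include the blobs.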
The short answer is the size of our testing instances: they are pretty small, 5 to 10 GB. This is fine for modest projects and wonderful for our preferred cloud-centric CI workflow.
That works until a project's database and accompanying blobs grow past 10 GB. Then the 'representative subset' approach starts to become very appealing. In fact, even if we were to test on the full database at critical points, there are many scenarios where the 'representative subset' is not only adequate for testing and development but far more efficient.
Even if you disregard our cloud-based 5 GB testing instances, think of the inconvenience of having to pull down 800 GB to a local development machine.
(Maybe these are small-project, modest-budget problems.)