Custom Dexterity Field Data Type

Hi experts,
Right now I am working on a custom field (like Plone's JSONField) and I need to think about the space (data size) it takes in the ZODB.
My question: which Python data type would save space and be faster to pickle as a field value in the ZODB, a plain string or a dict?

Update 1

Thank you :pray: so much, guys, for your valuable opinions. I have now tested the version at https://pypi.org/project/plone.app.fhirfield/5.0.0b1/, which stores a JSON string as the field value, but found no significant difference in ZODB size (my primary concern) compared to storing the value as a Persistent object (https://pypi.org/project/plone.app.fhirfield/4.0.0/). In fact, the value as an object is faster than the value as a JSON string. The metrics:

Total JSON files: 19,500

JSON string as field value:
  • ZODB size after creating objects via the REST API: 22.3+ GB
  • ZODB size after zeopack: 236 MB

Persistent object as field value:
  • ZODB size after creating objects via the REST API: 22.5+ GB
  • ZODB size after zeopack: 288 MB

I would store JSON as what it is: text (so as a string). JSON handling in Python is usually "good enough" or fast enough; if in doubt, use a custom JSON module like orjson. Depending on the nature of your JSON, you may also think about compressing the JSON string...it depends on your usage of this functionality.
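
For example (a minimal sketch; orjson is a third-party package and the payload here is made up):

import orjson

dict_data = {"resourceType": "Patient", "id": "p-1"}  # example payload

raw = orjson.dumps(dict_data)   # returns bytes, typically faster than json.dumps
text = raw.decode("utf-8")      # decode if you need a str as the field value
dict_back = orjson.loads(raw)   # parses bytes or str back into a dict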

Not sure if I completely understand what you mean by 'where you need to save space', but for database size (not speed) you could consider saving it as a blob: https://docs.plone.org/external/plone.app.dexterity/docs/advanced/files-and-images.html
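
A minimal sketch of that (the field name json_file and the context object are made up for illustration):

import json
from plone.namedfile.file import NamedBlobFile

payload = json.dumps({"resourceType": "Patient", "id": "p-1"})

# `context` is an existing Dexterity object whose schema has a
# NamedBlobFile field named `json_file` (hypothetical)
context.json_file = NamedBlobFile(
    data=payload.encode("utf-8"),
    contentType="application/json",
    filename=u"patient.json",
)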

Thank you so much, I agree with you and will store it as JSON text (the json.dumps version).

@espenmn I am sorry for any misunderstanding; it is probably because of my bad phrasing.

What I actually meant here is the size of the same JSON data in its dict (json.load) and string (json.dumps) representations.
For example, I have a JSON file named `patient.json`.

import json

with open("patient.json", "r") as fp:
    dict_data = json.load(fp)
    str_data = json.dumps(dict_data)

So which one will be smaller in the ZODB, dict_data or str_data, and which one will be faster to unpickle (deserialize) when the object is loaded from the ZODB?
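
For a rough first comparison outside the ZODB (ZODB's own record pickles carry extra class metadata, so treat these as approximate numbers):

import json
import pickle

with open("patient.json", "r") as fp:
    dict_data = json.load(fp)
str_data = json.dumps(dict_data)

print("str :", len(pickle.dumps(str_data, protocol=2)), "bytes")
print("dict:", len(pickle.dumps(dict_data, protocol=2)), "bytes")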

I think you'll have to experiment and see what would work for you.

A small experiment I did with a small FHIR JSON dump (a 2.4K file) ended up as 2514 bytes pickled for text, and 1547 bytes pickled for a dict. This might be because the dict variant skips the whitespace.

I used ZODBbrowser to look at the raw pickle data and I used the following code:

import json
import transaction
from persistent import Persistent

class Kek(Persistent):
    data = None

# using bin/instance debug, where `app` is the Zope application root
app.kek_text = Kek()
app.kek_text.data = open('problem.json').read()  # raw JSON text

app.kek_dict = Kek()
app.kek_dict.data = json.load(open('problem.json'))  # parsed dict

transaction.commit()

I added the class Kek to bin/zodbbrowser so I don't have broken objects.

In my/this case, storing a dict was smaller in storage. I expect loading a dict will be faster too, as you don't need to parse JSON on load anymore.

As for memory usage, I guess you want a dict to do stuff with anyhow, so storing the JSON string would (in this case) waste 2.4K of memory per stored object.

Note: you could store gzipped JSON, which would bring the storage down to 881B vs 2400B for text.
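
A minimal sketch of that gzip variant (the exact numbers will vary with the input):

import gzip
import json

with open('problem.json') as fp:
    str_data = json.dumps(json.load(fp))

compressed = gzip.compress(str_data.encode('utf-8'))
print(len(str_data), '->', len(compressed))

# reading it back: decompress first, then parse as usual
restored = json.loads(gzip.decompress(compressed).decode('utf-8'))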

Nice... sounds like FHIR stuff on top of Plone :slight_smile:
How does this relate to the stuff you were doing on top of Guillotina?
(just a guess, I could be completely wrong here)

The core questions are:

  • how many of these JSON files do you have?
  • what is their average size?
  • do you need to process them in Plone?
  • do you need to display/edit them in Plone?

General rule of thumb: store JSON as-is...nowadays I would not care about some MB more because of some JSON files. If your requirements regarding the processing of JSON become more complicated (e.g. when you need to query the JSON, or if you have lots of JSON files), consider using an external document database like ArangoDB for storing and searching arbitrary JSON documents...as said: it all depends on numbers, use cases and requirements.

  1. It would be millions (the running system already has more than a million objects).
  2. The average size of a JSON file in the filesystem is about 35KB as far as I can see right now (that is with indent 4, so it could be less).
  3. Yes, Plone does update/patch, delete and add through the REST API.
  4. They don't always need to be shown in a browser; mostly they are served through the REST API.

So those are my answers to the questions.

Let me explain some background.

  1. We have Index - FHIR v5.0.0, a FHIR-based REST API server on top of Plone and an Elasticsearch server.
  2. For that, we are using plone.app.fhirfield · PyPI.
  3. If you look at plone.app.fhirfield/src/plone/app/fhirfield/value.py at 2.x.x · nazrulworld/plone.app.fhirfield · GitHub, an object of FhirResourceValue is used as the field value.
  4. And finally we see a huge problem with size in the ZODB (not sure how to explain it; one Dexterity object creation causes many entries in the various ZODB buckets). For example, the current situation:
    Total objects: 1,060,552+ (plus more than 50k Plone users)
    Data.fs size: up to 58 GB (it was previously more than 80 GB, but shrank to maybe 7-8 GB after zeopack)

Elasticsearch
  • Number of docs: 4,082,198
  • Deleted docs: 1,049,272
  • Size: 3,640 MB


I am really inspired by ES in terms of size; that's why I am planning to refactor plone.app.fhirfield to store the JSON string directly in the ZODB instead of a complex field value.
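
The refactored field value would then be plain text holding the json.dumps output, roughly like this (a hypothetical schema sketch, not the actual plone.app.fhirfield code):

from plone.supermodel import model
from zope import schema

class IFhirContent(model.Schema):
    # the field simply holds the json.dumps() output as text
    resource = schema.Text(
        title=u"FHIR resource (JSON)",
        required=False,
    )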


I really like this idea! Can you please share some ideas about how I can use it while keeping Plone security features like workflow, permissions and local roles intact?

@zopyx sorry for the long discussion; hope you don't mind :pray:

@pigeonflight you got it right :slight_smile:

I think fhirpath-guillotina · PyPI is going well; there is nothing to worry about in a situation like this, as Guillotina handles JSON perfectly!

You see, the problem here is the size in the ZODB and storing a FhirResourceValue object as the field value directly in the ZODB (I am not sure whether this causes a flat ZODB).

Ask yourself whether Plone is the right choice for storing millions of JSON files. My answer would be: NO

@zopyx agree with you :slight_smile:

The problems are like this:

  1. We have been using Plone for more than 10 years and almost all of our projects are based on Plone.
  2. You know our team is really geared up with Plone.
  3. We haven't yet found any alternative to Plone security features like workflow, local_roles, etc. (honestly speaking, because of this we have to stick with Plone).
  4. We are trying to improve Plone's capabilities in the healthcare sector.

We are also looking at Guillotina as an alternative.

Let's assume you can gzip a 35 KB JSON file down to 10 KB; then the resulting data will be 10 GB for 1 million JSON files. You can add this to your already existing 60 GB Data.fs or create 1 million files as blobs...I would go for Data.fs storage in this case...moving a 70 GB Data.fs file around is more straightforward than dealing with one million single JSON files as blobs.

Side note: I am currently looking into other options for handling larger amounts of data besides the Data.fs.

My blueprint for most of our upcoming use cases goes like this:

  • create a data model in Python based on pydantic: https://pydantic-docs.helpmanual.io/usage/schema/
  • a pydantic schema can be converted into a JSON schema (see the sketch after this list)
  • store all data in ArangoDB (a multi-model database), which supports validation of incoming data against a JSON schema
  • create edit forms based on the JSON schema; there are some options for Vue, React and Angular (I think there was something in the making for Guillotina)
  • integration with Plone: a tiny Dexterity wrapper holding a reference to the related data ID in ArangoDB

So most of the "real data" would be stored in a scalable database (omitting a long and serious ZODB rant here) and the ZODB would only be used for holding the references. Dexterity here and there is a nice approach, but it has many issues and is too closely bound to z3c.form and the ZODB to build half-way decent, large and scalable solutions with Plone.
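
A minimal sketch of the pydantic part (model and field names are made up for illustration):

from pydantic import BaseModel

class Document(BaseModel):
    title: str
    pages: int = 1

# pydantic (v1) can emit a JSON schema for the model, which both
# ArangoDB-side validation and form generation can consume
print(Document.schema_json(indent=2))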

You will be pleased to see that the latest plone.app.fhirfield is actually using pydantic via https://pypi.org/project/fhir.resources/6.0.0b3/!
And my question about string JSON versus dict JSON data came up because of pydantic!
You know that the pydantic.BaseModel.json() API returns the JSON data as a string, so if I am going to store dict JSON data in the ZODB, I have to run json.loads again to make a dict from that string.
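
For illustration (the Patient model below is a made-up stand-in for a real fhir.resources class):

import json
from pydantic import BaseModel

class Patient(BaseModel):
    resourceType: str = "Patient"
    id: str

patient = Patient(id="p-1")

str_json = patient.json()          # JSON string, ready to store as text
dict_json = json.loads(str_json)   # the extra parse step if I store a dict
# pydantic (v1) also offers .dict(), which avoids the string round trip
dict_direct = patient.dict()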