Indexing multi-lingual content with Elasticsearch

Hi there,

I am currently working on a fork of collective.elasticindex to make various improvements to the Elasticsearch integration with Plone. One aspect is better support for multi-lingual content. Elasticsearch supports language-specific analyzers. For this reason our implementation will use different mapping types in Elasticsearch: one for each language used in a Plone site, each with the related language-specific analyzer. The value of the Language metadata determines the analyzer to be used in Elasticsearch. Unfortunately, the 'Language' metadata in Plone content is often not very well maintained (in particular in multi-lingual sites).

Can anyone shed some light on how plone.app.multilingual and LinguaPlone deal with the Language metadata in the case of translations? Can we assume that the Language metadata of the translations of a particular document is set correctly, or is there some magic under the hood that fetches the language code, e.g. through Acquisition from the root language folder? The goal is obvious: having consistent Language metadata for all content in order to get proper indexing and proper query behavior in Elasticsearch.
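
To make the idea concrete, here is a rough sketch of how the Language value could be turned into an analyzer name. The analyzer names ("english", "german", and so on) are built-in Elasticsearch language analyzers; the helper name analyzer_for, the choice of languages, and the fallback to "standard" are only assumptions for illustration, not what the fork actually does.

```python
# Map ISO 639-1 codes to built-in Elasticsearch language analyzers.
# The selection of languages is illustrative, not exhaustive.
ANALYZERS = {
    "en": "english",
    "de": "german",
    "fr": "french",
    "es": "spanish",
    "it": "italian",
    "nl": "dutch",
}

# Assumption: content without a usable language code gets the generic analyzer.
FALLBACK_ANALYZER = "standard"


def analyzer_for(language_code):
    """Return an analyzer name for a Plone language code (hypothetical helper)."""
    if not language_code:
        return FALLBACK_ANALYZER
    # Reduce combined codes such as "de-at" or "pt_BR" to the base language.
    base = language_code.strip().lower().replace("_", "-").split("-")[0]
    return ANALYZERS.get(base, FALLBACK_ANALYZER)
```

With a badly maintained Language field, most content would end up on the fallback analyzer, which is exactly why consistent metadata matters here.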


AFAIK p.a.multilingual sets and uses the language metadata field (for DX content it's in the "language" attribute, but the DX base class defines a DC-compatible "Language()" getter for it).