Download all of my files from Plone

I am trying to download all of my files from Plone because we are planning to migrate our documents to another CMS. I have set up WebDAV and can connect to the Plone site, but I cannot find the documents (PDF, DOC, XLS, PPT, etc.) anywhere. Am I looking in the wrong place, or using the wrong tool to download my documents? I can see the folders we created for our different divisions, but I cannot find any of the documents inside them.

Please advise. Any help would be most appreciated.

Thanks so much,
Angela

Angela Wong via Plone Community wrote at 2023-08-04 19:44 +0000:


You are looking in the right places: if a folder (called a "collection" in WebDAV terms) contains a content object, then the content object should show up below the folder in the WebDAV view as well.

However, WebDAV discovers only content objects marked as WebDAV resources (via the attribute __dav_resource__). Perhaps this attribute is missing on the objects you do not see. Furthermore, folders can override the method listDAVObjects, e.g. to exclude some of their objects from the WebDAV view.
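As a toy illustration of those two mechanisms (this is deliberately not the exact Zope API, and the filtering policy is made up):

```python
class Doc:
    """Stand-in for a content object; __dav_resource__ marks it as
    visible to WebDAV (assumption: simplified from the real machinery)."""
    def __init__(self, oid, dav=True):
        self.id = oid
        self.__dav_resource__ = dav


class DummyFolder:
    """Stand-in for a folderish object, showing how an overridden
    listDAVObjects can hide children from a WebDAV listing even though
    the objects still exist in the folder."""
    def __init__(self, children):
        self._children = children

    def objectValues(self):
        return list(self._children)

    def listDAVObjects(self):
        # Only expose children flagged as WebDAV resources
        return [ob for ob in self.objectValues()
                if getattr(ob, "__dav_resource__", False)]
```

If your real folders behave like this, objects missing the flag will be present in the ZMI but invisible to your WebDAV client.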

The content hierarchy is discovered via the WebDAV PROPFIND method, implemented in webdav.davcmds.PropFind.apply. You might want to debug this function to learn why you do not see the expected result.
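Besides debugging server-side, you can exercise PROPFIND by hand. A minimal stdlib client sketch, assuming a WebDAV-enabled Zope at a placeholder host/port with placeholder Basic-auth credentials:

```python
import base64
import http.client

# An allprop PROPFIND body asks the server for all properties of each resource
PROPFIND_BODY = (b'<?xml version="1.0" encoding="utf-8"?>\n'
                 b'<propfind xmlns="DAV:"><allprop/></propfind>')


def dav_headers(user, password, depth="1"):
    """Headers for a Depth-limited PROPFIND with HTTP Basic auth."""
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    return {
        "Depth": depth,  # "1" = the folder itself plus its direct children
        "Content-Type": "text/xml",
        "Authorization": "Basic " + token,
    }


def propfind(host, port, path, user, password, depth="1"):
    """Return (status, body) of a PROPFIND against the given path.

    Use http.client.HTTPSConnection instead if the site is served over TLS.
    """
    conn = http.client.HTTPConnection(host, port)
    conn.request("PROPFIND", path, body=PROPFIND_BODY,
                 headers=dav_headers(user, password, depth))
    resp = conn.getresponse()
    try:
        return resp.status, resp.read()
    finally:
        conn.close()
```

Comparing what this returns for a folder against what the ZMI shows should tell you quickly whether objects are being filtered out of the WebDAV view.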

We have well-working solutions for exports:

  • collective.exportimport
  • plone.restapi
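For instance, with plone.restapi installed, its @search endpoint can enumerate every File in the site. A stdlib-only sketch, where the site URL and credentials are placeholders (verify the endpoint and parameter names against the plone.restapi documentation):

```python
import base64
import json
import urllib.parse
import urllib.request


def search_url(base, portal_type, b_size=100):
    """Build a plone.restapi @search URL filtered by portal_type."""
    query = urllib.parse.urlencode({
        "portal_type": portal_type,
        "b_size": b_size,
    })
    return "{}/@search?{}".format(base.rstrip("/"), query)


def fetch_json(url, user, password):
    """GET a URL as JSON with HTTP Basic auth."""
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    req = urllib.request.Request(url, headers={
        "Accept": "application/json",
        "Authorization": "Basic " + token,
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Each search result carries an @id URL; for Dexterity Files the raw payload is typically available at @id + "/@@download/file".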

Andreas, thank you.

I tried to install collective.exportimport but I keep getting an error. I am using Plone 5.1.

eggs +=
    collective.exportimport
    plone.restapi

zcml =
    collective.exportimport

Error message:
Installing client1.
Couldn't find index page for 'collective.exportimport' (maybe misspelled?)
Getting distribution for 'collective.exportimport'.
Couldn't find index page for 'collective.exportimport' (maybe misspelled?)
While:
  Installing client1.
  Getting distribution for 'collective.exportimport'.
Error: Couldn't find a distribution for 'collective.exportimport'.

Please advise. Thanks so much.

Works for me
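When buildout says "Couldn't find index page", it usually cannot reach a package index that knows the name at all. Old Plone 5.1 buildouts often still point at the retired pypi.python.org index (or a plain-HTTP URL that PyPI no longer serves). A sketch of the relevant buildout.cfg pieces; the index URL is the current PyPI simple index, and the version pin is purely illustrative (check PyPI for a release that still supports Python 2 / Plone 5.1):

```ini
[buildout]
# Older configs may still point at the retired pypi.python.org index
index = https://pypi.org/simple/

eggs +=
    collective.exportimport
    plone.restapi

[versions]
# Illustrative pin -- pick a release from PyPI that still
# supports Python 2 / Plone 5.1
collective.exportimport = 1.12
```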

Maybe you could just download the documents with wget?
Maybe make a collection showing all your files and download from there.

wget can also download all the links on a page or in a sitemap. To download just the PDFs, the syntax is something like:

wget -r -l1 -A pdf https://example.com/my-collection

(Search for something like "download all files linked from a single page using wget" and "wget parse sitemap.xml".)
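The sitemap approach can also be sketched in Python with the stdlib alone. Plone serves a gzipped sitemap at /sitemap.xml.gz when the sitemap option is enabled; the site URL below is a placeholder:

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_bytes):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]


def download_pdfs(site, out_dir="."):
    """Fetch the site's gzipped sitemap and download every .pdf it lists."""
    with urllib.request.urlopen(site.rstrip("/") + "/sitemap.xml.gz") as resp:
        xml_bytes = gzip.decompress(resp.read())
    for url in urls_from_sitemap(xml_bytes):
        if url.lower().endswith(".pdf"):
            name = url.rsplit("/", 1)[-1]
            urllib.request.urlretrieve(url, "{}/{}".format(out_dir, name))
```

Note this only finds documents whose own URLs appear in the sitemap; files attached to pages would need the page URLs crawled as well.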

Hi!

For older versions of Plone, you can use these scripts (you need External Methods):

extract_folder.py

import os
import base64

try:
    import json
except ImportError:
    # Python 2.4 / Plone 3.3 have no stdlib json module;
    # fall back to simplejson (e.g. version 2.3.3)
    import simplejson as json

from Products.Five.browser import BrowserView
from Products.CMFCore.interfaces import IFolderish
from DateTime import DateTime

#: Private attributes we add to the export list
EXPORT_ATTRIBUTES = ["portal_type", "id"]

#: Do we dump out binary data? On by default, but can be disabled by
#: setting the EXPORT_BINARY environment variable to anything but "true"
EXPORT_BINARY = os.getenv("EXPORT_BINARY", "true") == "true"


class ExportFolderAsJSON(BrowserView):
    """
    Exports the current context folder Archetypes as JSON.

    Returns downloadable JSON from the data.
    """

    def convert(self, value, obj):
        """
        Convert value to more JSON friendly format.
        """
        if isinstance(value, DateTime):
            # Zope DateTime
            # https://pypi.python.org/pypi/DateTime/3.0.2
            return value.ISO8601()
        elif hasattr(value, "isBinary") and (value.isBinary('file') or value.isBinary('image')):

            if not EXPORT_BINARY:
                return None

            # Archetypes FileField and ImageField payloads are binary,
            # stored as OFS.Image.File objects
            file_info = {}
            file_info['download'] = obj.absolute_url() + '/at_download/file'
            file_info['content-type'] = value.content_type
            file_info['filename'] = value.filename
            file_info['size'] = value.get_size()
            file_info['encoding'] = "base64"
            data = getattr(value, "data", None)
            if data is not None:
                # data may be a Pdata chain for large files; str() flattens it
                file_info['data'] = base64.b64encode(str(data))
            return file_info
        else:
            # Passthrough
            return value

    def grabArchetypesData(self, obj):
        """
        Export Archetypes schemad data as dictionary object.

        Binary fields are encoded as BASE64.
        """
        data = {}
        for field in obj.Schema().fields():
            name = field.getName()
            value = field.getRaw(obj)
            data[name] = self.convert(value, obj)
        data['UID'] = obj.UID()
        data['@id'] = obj.absolute_url()
        data['@type'] = obj.Type()
        item = data
        item = migrate_field(item, "excludeFromNav", "exclude_from_nav")
        item = migrate_field(item, "allowDiscussion", "allow_discussion")
        item = migrate_field(item, "subject", "subjects")

        # Some Date fields
        item = migrate_field(item, "expirationDate", "expires")
        item = migrate_field(item, "effectiveDate", "effective")
        item = migrate_field(item, "creation_date", "created")
        item = migrate_field(item, "modification_date", "modified")

        # Event fields
        item = migrate_field(item, "startDate", "start")
        item = migrate_field(item, "endDate", "end")
        item = migrate_field(item, "openEnd", "open_end")
        item = migrate_field(item, "wholeDay", "whole_day")
        item = migrate_field(item, "contactEmail", "contact_email")
        item = migrate_field(item, "contactName", "contact_name")
        item = migrate_field(item, "contactPhone", "contact_phone")
        item = migrate_field(item, "geolocation", "geolocation")
        item["review_state"] = "published"
        item["version"] = "current"
        item["versioning_enabled"] = True
        item["working_copy"] = None
        item["working_copy_of"] = None
        if "text" in item:
            item["text"] = {
                "content-type": "text/html",
                "data": item["text"],
                "encoding": "utf-8"
            }
        # Site-specific: hardcoded site language; adjust for your site
        item["language"] = {
            "encoding": "utf-8",
            "title": "Italiano",
            "token": "it"
        }
        if item['@type'] == 'Page':
            item['@type'] = 'Document'
        if 'homepage_url' in item and item['homepage_url'].strip() == '':
            item['homepage_url'] = None
        if 'email' in item and item['email'].strip() == '':
            item['email'] = None
        return data

    def grabAttributes(self, obj):
        data = {}
        for key in EXPORT_ATTRIBUTES:
            data[key] = self.convert(getattr(obj, key, None), obj)
        return data

    def export(self, folder, recursive=False, rs=''):
        """
        Export content items.

        Possible to do recursively nesting into the children.

        :return: list of dictionaries
        """

        array = []
        for obj in folder.listFolderContents(contentFilter={'review_state': rs}):
            data = self.grabArchetypesData(obj)
            data.update(self.grabAttributes(obj))

            if recursive and IFolderish.providedBy(obj):
                data["children"] = self.export(obj, True, rs)

            array.append(data)

        return array

    def __call__(self):
        """
        """
        folder = self.context.aq_inner
        data = self.export(folder, rs='published')
        pretty = json.dumps(data, sort_keys=True, indent=4)
        self.request.response.setHeader("Content-type", "application/json")
        return pretty

def migrate_field(item, old, new):
    """Copy an Archetypes field value over to its Dexterity/plone.restapi key."""
    if old in item:
        item[new] = item[old]
        if old == 'geolocation':
            item['geolocation'] = {"latitude": item['geolocation'][0],
                                   "longitude": item['geolocation'][1]}
    return item

def extract(self, path, REQUEST):
    folder = self.unrestrictedTraverse(path)
    view = ExportFolderAsJSON(folder, None)
    data = view.export(folder, recursive=False, rs='published')

    pretty = json.dumps(data, sort_keys=True, indent=4)
    REQUEST.RESPONSE.setHeader("Content-type", "application/json")
    return pretty
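Once extract is registered in the ZMI as an External Method (module extract_folder, function extract; the ids, instance URL, and credentials here are assumptions), it can be fetched like this; Zope marshals the ?path= query variable onto the path argument:

```python
import base64
import urllib.parse
import urllib.request


def export_url(site, method_id, folder_path):
    """URL that invokes the External Method with a path query variable."""
    return "{}/{}?{}".format(site.rstrip("/"), method_id,
                             urllib.parse.urlencode({"path": folder_path}))


def download_export(site, method_id, folder_path, user, password, out_file):
    """Fetch the JSON export with HTTP Basic auth and save it to disk."""
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    req = urllib.request.Request(export_url(site, method_id, folder_path),
                                 headers={"Authorization": "Basic " + token})
    with urllib.request.urlopen(req) as resp, open(out_file, "wb") as out:
        out.write(resp.read())
```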

extract_users.py:

# Memberdata export script for Plone 3.0
#
# based on:
#    http://transcyberia.info/archives/23-howto-sync-mailman-from-plone.html
#    http://www.zopelabs.com/cookbook/1140753093
#    http://plone.org/documentation/how-to/export-member-data-to-csv
#
# desc:
#    None of the scripts above can extract password hashes on Plone 3.0,
#    BUT THIS ONE CAN!
#
#    Execute this as a normal External Method, and DON'T make it publicly
#    accessible (unless you don't mind people having your hashes).
#    You have been warned. Have fun (^,^)
#


from StringIO import StringIO
import csv
import time

def getMembersCSV(self):

    request = self.REQUEST
    text = StringIO()
    writer = csv.writer(text)

    # core properties (username/password)
    core_properties = ['member_id','password']

    # extra portal_memberdata properties
    extra_properties = ['fullname',
                        'email',
                        'location',
                        'home_page',
                        'description']

    properties = core_properties + extra_properties

    writer.writerow(properties)

    membership = self.portal_membership
    passwdlist = self.acl_users.source_users._user_passwords

    for memberId in membership.listMemberIds():
        row = []
        for prop in properties:
            if prop == 'member_id':
                row.append(memberId)
            elif prop == 'password':
                row.append(passwdlist[memberId])
            else:
                member = membership.getMemberById(memberId)
                row.append(member.getProperty(prop))

        writer.writerow(row)

    request.RESPONSE.setHeader('Content-Type', 'text/csv')
    request.RESPONSE.setHeader('Content-Length', len(text.getvalue()))
    request.RESPONSE.setHeader('Content-Disposition',
                               'inline;filename=members-%s.csv' %
                               time.strftime("%Y%m%d-%H%M%S", time.localtime()))

    return text.getvalue()

Treat this as a last resort...

I don't see a folder called Collection but I do see _data in all of my content folders.

Thanks so much @espenmn! This worked great for my file types.

However, I have Dexterity content types, such as a "policy" type that allows PDFs to be uploaded to it. I can use wget to download the PDF from each individual policy page.

However, when I created a collection of only the "policy" type and tried to run wget recursively on that collection, it did not work.

This is what I used:
$ wget --force-directories --content-disposition --restrict-file-names=nocontrol --no-check-certificate -e robots=off -A.pdf,.ppt,.doc -r https://website.com/collection-policies

Any help would be most appreciated.

Thanks so much,
Angela