How do I query portal_catalog? (beginner)

Hi, I'm a beginner and I want to run some queries on the portal_catalog (mostly to figure out where the really big files are hiding).

I found this page about querying it: https://docs.plone.org/4/en/develop/plone/searching_and_indexing/query.html

However, I think it's a bit advanced for me. For example, it starts off giving this example code: "portal_catalog = site.portal_catalog"

But I don't even know where I would go to type that in! Yes I am that kinda beginner. Can someone point me in the direction of getting to the place where I can actually type in commands like that?

Also, I went to ZMI, and clicked on "portal_catalog" thinking this might be where I could query it, but I can't figure out what to do once I'm in there. Under the Catalog View, I see a list of all my files, but there is no "file size" column for me to click on to order it ascending or descending. So I don't know if this would be very helpful unless I can write a custom query.

Thanks again and please excuse my newbie-ness.

Welcome dcplebranch! The tutorial that you are following provides help when you are developing with Plone from the filesystem. But there is a quick and dirty way this query can be done in the ZMI, and it does involve writing some Python code, as shown in the tutorial. I'll also note that the code here will be different depending on which version of Plone you are using. Your link is pointing to the Plone 4 docs, so I've tested the following steps in Plone 4.

  1. Go to the ZMI (/manage), and in the upper right, choose 'Script (Python)' from the dropdown and click 'Add' if the form doesn't automatically submit.
  2. Name the script whatever id you want, like 'find_files'. Click 'Add and Edit'.
  3. Clear out the default code given, and paste this in:
from Products.CMFCore.utils import getToolByName
catalog = getToolByName(context, 'portal_catalog')

files = catalog.searchResults({'portal_type': 'File'})

for file in files:
    obj = file.getObject()
    print(obj.absolute_url())
    file = obj.getFile()
    print('{} kb'.format(file.size() / 1024))

return printed

This code queries all the Files added to the site and prints the URL of each file, followed by its size in kilobytes. To see the results of this script, click the 'Test' tab at the top of the Script. It's nothing fancy, but it will help you get the information you are looking for.

From here you can modify the script to better fit your needs. You may also want to look for all the Images in the site; for that, change the portal_type to 'Image' and call obj.getImage() instead of obj.getFile(). You can also add a condition to only print out files larger than a certain size. This will require some basic Python.
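As a rough illustration of the "larger than a certain size" idea, here is a minimal sketch in plain Python that runs outside Plone. The sample list and the names sizes, MIN_KB and big_files are made up for illustration. In the Script (Python) you would collect the same (url, size) pairs inside the loop instead of printing right away. I have not tested this inside a Script (Python), so treat it as a starting point only.

```python
# Hypothetical sample data: (url, size in bytes) pairs. In the real script
# these would be collected inside the loop instead of being printed.
sizes = [
    ('http://example.com/site/report.pdf', 5 * 1024 * 1024),
    ('http://example.com/site/notes.txt', 2 * 1024),
    ('http://example.com/site/video.mp4', 300 * 1024 * 1024),
]

MIN_KB = 1024  # assumed threshold: only report files of 1 MB or more


def big_files(pairs, min_kb):
    """Return (url, size_kb) pairs at or above min_kb, largest first."""
    kept = [(url, size // 1024) for url, size in pairs
            if size // 1024 >= min_kb]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


for url, kb in big_files(sizes, MIN_KB):
    print('{0} kb  {1}'.format(kb, url))
```

Sorting largest first puts the worst offenders at the top of the output, which is usually what you want when hunting for big files.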

Thanks @cdw9 !

I'm running Plone 4.0 by the way. Yes, I know I should upgrade to 4.3... will get around to that :)

I tried your code, but I had to make a few changes (I changed image.getObject() to file.getObject(), and I had to change {} to {0}).

But then it gave me this error...

Module ZPublisher.Publish, line 127, in publish
Module ZPublisher.mapply, line 77, in mapply
Module ZPublisher.Publish, line 47, in call_object
Module Shared.DC.Scripts.Bindings, line 324, in call
Module Shared.DC.Scripts.Bindings, line 361, in _bindAndExec
Module Products.PythonScripts.PythonScript, line 344, in _exec
Module script, line 10, in find_files
<PythonScript at /olli/find_files>
Line 10
Module Products.ATContentTypes.content.base, line 279, in size
Module Products.ATContentTypes.content.base, line 197, in get_size
Module plone.app.blob.field, line 277, in get_size
Module plone.app.blob.field, line 86, in get_size
Module plone.app.blob.utils, line 52, in openBlob
Module ZODB.Connection, line 838, in setstate
Module ZODB.Connection, line 914, in _setstate
Module ZODB.blob, line 652, in loadBlob
POSKeyError: 'No blob file'

My site is working fine and the blobstorage folder looks in place, so I'm not sure why it can't find the blob file. Anyway, just wanted to put this here in case you have any ideas. If not, I'll continue googling to see if I can figure it out. You've already been a huge help, thanks again for that reply!

Just thinking out loud here... I was looking around some more and in ZMI under portal_catalog > Advanced, there is a button that says "Clear and Rebuild"... I wonder if this will fix the blob file error? Or maybe it's unrelated.

Looks like there's at least one object in the site that is broken. You can wrap part of the code in a try/except:

try:
    file = obj.getFile()
    print('{} kb'.format(file.size() / 1024))
except POSKeyError:
    print('broken')

So for each file that has the error, the 'broken' text will print out instead of the file size.

Clearing the catalog won't fix the blob error. With the 'broken' bit printed out, you'll be able to see which objects are having a problem, so you can investigate further on those.
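To show the pattern in isolation, here is a small plain-Python sketch that runs outside Plone. FakeBlobError and FakeFile are hypothetical stand-ins for POSKeyError and for a content object; they are not Plone or ZODB APIs. The only point is that a failing item gets recorded and the loop carries on to the next one.

```python
class FakeBlobError(KeyError):
    """Hypothetical stand-in for the POSKeyError seen in the traceback."""


class FakeFile(object):
    """Hypothetical stand-in for a content object with a size lookup."""

    def __init__(self, url, size=None):
        self.url = url
        self._size = size

    def get_size(self):
        # A missing size simulates an object whose blob file is gone.
        if self._size is None:
            raise FakeBlobError('No blob file')
        return self._size


def report(objects):
    """Return (lines, broken_urls) without stopping at a broken object."""
    lines, broken = [], []
    for obj in objects:
        try:
            lines.append('{0}: {1} kb'.format(obj.url, obj.get_size() // 1024))
        except FakeBlobError:
            broken.append(obj.url)
    return lines, broken


lines, broken = report([FakeFile('/a.pdf', 4096), FakeFile('/b.pdf')])
print(lines)
print(broken)
```

Keeping the broken URLs in their own list makes it easier to go back and inspect just those objects afterwards.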

I also fixed a typo in the original code I posted (it now calls file.getObject()).

Thank you @cdw9 ! It's working now!

Just a follow up... it was throwing another error:

TypeError: 'long' object is not callable

Although with the try/except in place it works fine. But I think what it means is that at some point in that loop the file variable is of the type long... which is odd. What could it mean?

If you have a bare except: in place, then it's catching all errors, including the TypeError. But if you left the except POSKeyError:, then the TypeError is likely another result of those same broken objects.
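One possible reading of that TypeError, which is an assumption and not something confirmed in this thread: on some objects size is a plain number rather than a method, so calling it with () fails. A minimal plain-Python sketch of a helper that copes with both cases is below. The two classes are hypothetical stand-ins, and I have not checked whether callable is usable inside a Script (Python).

```python
class SizeAsMethod(object):
    """Hypothetical object whose size is a method."""

    def size(self):
        return 2048


class SizeAsNumber(object):
    """Hypothetical object whose size is a plain attribute."""

    size = 4096


def size_in_bytes(f):
    """Return f.size, calling it first if it turns out to be callable."""
    value = f.size
    return value() if callable(value) else value


print(size_in_bytes(SizeAsMethod()))
print(size_in_bytes(SizeAsNumber()))
```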

Ah, ok, thanks again! :) This was really helpful.

Just another follow up.

I added up all the file sizes listed, and it came out to 1.4 gigs. And yet my blobstorage is 25 gigs. Any ideas what would cause this discrepancy?

The code I provided only looks at the File content type. Did you also search over the Image content type? Do you have any other content types that allow files to be uploaded?

Yes, I copied the script for Images and it came out to 11 megabytes. Also, I recently packed the database (like this morning), so I'm really confused. Thanks for any insight.

Oh, and here are all the different content types:

  • Collection
  • Discussion Item
  • File
  • Folder
  • Image
  • Link
  • News Item
  • Page

News Items also have an image field, but none of the other types have this in default Plone.

There's only one News Item, and it is 40 K

You linked some Plone 4 documentation, is that for a reason..? Either way, you should use the plone.api library for querying content in the portal_catalog. It makes this sort of thing a lot easier to do...

from plone import api

documents = api.content.find(portal_type='Document')

"You linked some Plone 4 documentation, is that for a reason..?"

Yes, it was mentioned earlier that they are using Plone 4.0, so plone.api is not available.

So I looked at the blobstorage itself and ran it through a disk usage graph type thing... and it seems there are many copies of a 2.2 gig blob.

Now, I did already delete a file that was around that big earlier... so it's weird if that 2.2 gigs is that file (since I already deleted it)...

but I wonder if Plone is still storing old versions of that file somehow, even though it has been deleted. Is that possible? Again, the already deleted file does not show up when I list all the files out using the script above... but maybe there's some ghostly version of it still in blobstorage...

Any ideas appreciated!

Did you pack the database after removing the file?

Yes I did. And another clue, the file is under the tmp/savepoints/ folders...

I can try packing again, but I swear I already packed it after deleting those files.
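For anyone who wants to repeat the disk usage check without a graphical tool, here is a minimal sketch using only the Python standard library, run from a normal shell and not from the ZMI. It totals file sizes per top-level folder under a directory, which should show how much of the space sits under tmp. The path var/blobstorage is an assumption; adjust it to wherever your install keeps its blobs. The function name usage_by_top_level is made up for this example.

```python
import os


def usage_by_top_level(root):
    """Map each top-level entry under root to its total size in bytes."""
    totals = {}
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = rel.split(os.sep)[0]  # '.' for files directly in root
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so nothing is counted twice
            totals[top] = totals.get(top, 0) + os.path.getsize(path)
    return totals


if __name__ == '__main__':
    # Assumed location; adjust to wherever your buildout keeps blobstorage.
    root = 'var/blobstorage'
    if os.path.isdir(root):
        ranked = sorted(usage_by_top_level(root).items(),
                        key=lambda item: item[1], reverse=True)
        for top, size in ranked:
            print('{0:>10.1f} MB  {1}'.format(size / (1024.0 * 1024.0), top))
```

This only reads file sizes and never modifies anything, so it should be safe to run against a live blobstorage.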