Indexing media values for documents

By default, Umbraco Search doesn’t index anything for item picker property types. The reasoning behind this is two-fold:

First and foremost, it’s rather hard to index anything universally meaningful for a picked item. The item name might seem like an obvious choice, but what if it’s a media item? Do media names really convey that much meaning? Perhaps a property value on the item, then - but which property? Or perhaps the ID of the item? 🤔

Secondly, it’s really expensive to keep the search indexes up to date for these kinds of relations. It’s definitely doable - Umbraco already tracks all the relevant references. But is it really worth it at scale? Probably not.

Luckily, the Umbraco Search extension model allows for changing this default behavior when needed. In this post I’ll show you how media property values can be indexed to produce search results for documents with a media picker.

Custom indexing of media picker values

The property value handler is a core concept in Umbraco Search. I wrote about it in one of my previous posts, so I won’t go into detail here.

A custom property value handler for the media picker is the obvious way to customize the indexed values. This one indexes the altText property of the picked media, so the document containing the media picker will match full-text searches for these altText values:

using Umbraco.Cms.Core.Models;
using Umbraco.Cms.Core.Models.PublishedContent;
using Umbraco.Cms.Core.PublishedCache;
using Umbraco.Cms.Core.Serialization;
using Umbraco.Cms.Search.Core.Models.Indexing;
using Umbraco.Cms.Search.Core.PropertyValueHandlers;
using Umbraco.Extensions;

namespace My.Site.PropertyValueHandlers;

public sealed class MediaPickerPropertyValueHandler : IPropertyValueHandler
{
    private readonly IJsonSerializer _jsonSerializer;
    private readonly IPublishedMediaCache _publishedMediaCache;

    public MediaPickerPropertyValueHandler(IJsonSerializer jsonSerializer, IPublishedMediaCache publishedMediaCache)
    {
        _jsonSerializer = jsonSerializer;
        _publishedMediaCache = publishedMediaCache;
    }

    public bool CanHandle(string propertyEditorAlias)
        => propertyEditorAlias is Umbraco.Cms.Core.Constants.PropertyEditors.Aliases.MediaPicker3;

    public IEnumerable<IndexField> GetIndexFields(IProperty property, string? culture, string? segment, bool published, IContentBase contentContext)
    {
        if (property.GetValue(culture, segment, published) is not string value)
        {
            // Expecting a string value (serialized picked media entities).
            return [];
        }

        // Deserialize the picked media entities.
        MediaPickerEntity[]? entities = _jsonSerializer.Deserialize<MediaPickerEntity[]>(value);
        if (entities?.Any() is not true)
        {
            return [];
        }

        // Fetch the picked media items from the published media cache.
        IPublishedContent[] mediaItems = entities
            .Select(entity => _publishedMediaCache.GetById(entity.MediaKey))
            .WhereNotNull()
            .ToArray();

        // Grab the alt texts (if any) from the picked media items.
        var altTexts = mediaItems
            .Select(media => media.Value<string>("altText"))
            .WhereNotNull()
            .ToArray();

        // Return the alt texts for indexing as searchable texts.
        return altTexts.Length > 0
            ? [
                new IndexField(
                    property.Alias,
                    new IndexValue
                    {
                        Texts = altTexts,
                    },
                    culture,
                    segment)
            ]
            : [];
    }

    // Record for deserialization of picked media entities (we only need the media key).
    private record MediaPickerEntity(Guid MediaKey);
}
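
To make the deserialization step a bit more tangible, here is a minimal standalone sketch. It uses System.Text.Json as a stand-in for Umbraco's IJsonSerializer, and the JSON shape in the usage note is my assumption of what a stored media picker value looks like. The class name MediaPickerValueParser and the method ParseMediaKeys are hypothetical helpers, not part of Umbraco or Umbraco Search.

```csharp
using System;
using System.Text.Json;

public static class MediaPickerValueParser
{
    // Only the media key is needed; all other JSON properties are ignored.
    public sealed record MediaPickerEntity(Guid MediaKey);

    private static readonly JsonSerializerOptions Options = new()
    {
        // Assumption: the stored JSON uses camelCase ("mediaKey"), so match case-insensitively.
        PropertyNameCaseInsensitive = true,
    };

    // Returns the media keys found in a serialized media picker value.
    public static Guid[] ParseMediaKeys(string json)
    {
        MediaPickerEntity[]? entities = JsonSerializer.Deserialize<MediaPickerEntity[]>(json, Options);
        if (entities is null)
        {
            // A JSON "null" literal deserializes to null; treat it as "nothing picked".
            return Array.Empty<Guid>();
        }

        return Array.ConvertAll(entities, entity => entity.MediaKey);
    }
}
```

Calling MediaPickerValueParser.ParseMediaKeys with a value like [{"key":"…","mediaKey":"11111111-2222-3333-4444-555555555555","crops":[]}] should yield that single media key; the unrelated "key" property does not bind to MediaKey.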

Keeping up with changes

The property value handler works fine in itself. But … what happens when an editor updates the altText of a media item? Exactly nothing. At least not with respect to the documents that have that media item picked.

If that’s fine for your requirements, your job is done, and you don’t need to read any further. I think you should anyway, though, ‘cause this is where things get geeky 🤓

What’s really needed is a notification handler for media updates, to reindex all the documents where the updated media is picked.

The media picker already implements reference tracking, so the relevant documents are readily available 👏

The trick then is to trigger a reindex of these documents. Fortunately, Umbraco Search has that all covered with the IContentIndexingService. Here’s the full notification handler:

using Microsoft.Extensions.Logging;
using Umbraco.Cms.Core;
using Umbraco.Cms.Core.Events;
using Umbraco.Cms.Core.Models;
using Umbraco.Cms.Core.Notifications;
using Umbraco.Cms.Core.Services;
using Umbraco.Cms.Core.Services.OperationStatus;
using Umbraco.Cms.Search.Core.Models.Indexing;
using Umbraco.Cms.Search.Core.Services.ContentIndexing;
using Umbraco.Extensions;

namespace My.Site.NotificationHandlers;

// This notification handler is invoked whenever media is saved.
public class UpdateRelatedDocumentsNotificationHandler : INotificationAsyncHandler<MediaSavedNotification>
{
    private readonly ITrackedReferencesService _trackedReferencesService;
    private readonly IContentIndexingService _contentIndexingService;
    private readonly IIndexDocumentService _indexDocumentService;
    private readonly ILogger<UpdateRelatedDocumentsNotificationHandler> _logger;

    public UpdateRelatedDocumentsNotificationHandler(
        ITrackedReferencesService trackedReferencesService,
        IContentIndexingService contentIndexingService,
        IIndexDocumentService indexDocumentService,
        ILogger<UpdateRelatedDocumentsNotificationHandler> logger)
    {
        _trackedReferencesService = trackedReferencesService;
        _contentIndexingService = contentIndexingService;
        _logger = logger;
        _indexDocumentService = indexDocumentService;
    }

    public async Task HandleAsync(MediaSavedNotification notification, CancellationToken cancellationToken)
    {
        // Get the keys of the documents that reference any of the saved media.
        Guid[] relatedDocumentKeys = await GetRelatedDocumentKeys(notification.SavedEntities);

        // Clear the cached index values to force a rebuild of the document index values.
        await _indexDocumentService.DeleteAsync(relatedDocumentKeys, true);

        // Trigger a reindex for the documents.
        _contentIndexingService.Handle(
            relatedDocumentKeys.Select(key =>
                ContentChange.Document(key, ChangeImpact.Refresh, ContentState.Published)));
    }

    private async Task<Guid[]> GetRelatedDocumentKeys(IEnumerable<IMedia> mediaItems)
    {
        var keys = new List<Guid>();

        foreach (IMedia media in mediaItems)
        {
            // Get all relations for the media.
            // NOTE: For simplicity we just fetch the first 1000 relations here. If you expect to have images with
            //       more relations than that, consider iterating through the relations with proper pagination.
            Attempt<PagedModel<RelationItemModel>, GetReferencesOperationStatus> attempt = await _trackedReferencesService
                .GetPagedRelationsForItemAsync(media.Key, UmbracoObjectTypes.Media, 0, 1000, true);

            if (attempt.Success is false)
            {
                _logger.LogError(
                    "Could not retrieve relations for media: {mediaKey}. The attempt failed with status: {status}",
                    media.Key,
                    attempt.Status);
                continue;
            }

            // Collect the keys of the relations of type document (these are the documents that have picked the media).
            keys.AddRange(attempt
                .Result
                .Items
                .Where(reference => Constants.UdiEntityType.Document.InvariantEquals(reference.NodeType))
                .Select(reference => reference.NodeKey));
        }

        return keys.Distinct().ToArray();
    }
}

That’s not too bad, is it?

The keen observer will have noticed the IIndexDocumentService above. This is part of an optimization effort in Umbraco Search.

Long story short, Umbraco Search stores a cache of index data in the Umbraco database. This allows for a much, much faster rebuild of search indexes when that is needed 🚀

Normally, Umbraco Search flushes the relevant cache before reindexing anything. However, since the change handling here is related to a media item and not the related documents, the document cache remains intact, and Umbraco Search will use that when reindexing. So - it needs to be explicitly flushed first.
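
The handler above only fetches the first 1000 relations per media item. If that is not enough, the paging loop could look something like the sketch below. It is deliberately generic and knows nothing about Umbraco: Paging and FetchAllAsync are hypothetical names, and the fetchPage delegate is assumed to return one page of items together with the total item count.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Paging
{
    // Calls fetchPage(skip, take) repeatedly until all items have been collected.
    public static async Task<List<T>> FetchAllAsync<T>(
        Func<long, long, Task<(IReadOnlyCollection<T> Items, long Total)>> fetchPage,
        long pageSize = 1000)
    {
        if (pageSize <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(pageSize), "Page size must be positive.");
        }

        var all = new List<T>();
        long skip = 0;

        while (true)
        {
            (IReadOnlyCollection<T> items, long total) = await fetchPage(skip, pageSize);
            all.AddRange(items);
            skip += pageSize;

            // Stop when the reported total is reached, or when a page comes back empty
            // (a safeguard in case the total changes between calls).
            if (items.Count == 0 || skip >= total)
            {
                break;
            }
        }

        return all;
    }
}
```

In GetRelatedDocumentKeys, the delegate would wrap the GetPagedRelationsForItemAsync call with the skip and take it is handed, and return the items and total from the attempt result. That assumes the paged result exposes a total count next to its items, so double-check it against the Umbraco version you are on.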

Unlike property value handlers, a notification handler must be explicitly registered with Umbraco, so a composer is also required:

using My.Site.NotificationHandlers;
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;
using Umbraco.Cms.Core.Notifications;

namespace My.Site.DependencyInjection;

public class UpdateRelatedDocumentsNotificationHandlerComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
        => builder.AddNotificationAsyncHandler<MediaSavedNotification, UpdateRelatedDocumentsNotificationHandler>();
}

Use with caution

Being able to trigger selective content reindexing is quite handy. This is just one example; I’ll follow up with more when I find the time to write them up ⌛

Now, just in case I didn’t make myself abundantly clear, allow me to reiterate: you must be very careful how you put this to use. Handy as it is, it also comes with a potentially massive performance penalty.

Or even worse, you can quickly find yourself in an infinite reindexing loop ☠️

On that cheery note, happy reindexing 💜