April 9, 2024

AI and Cultural Narratives: What Are the Limitations and Opportunities of Video Indexing for Arab Content?

By Frederic Petitpont

The ratio of English to Arabic datasets is vastly disproportionate, and we see the resulting biases within AI video indexing models

Today, multimodal and generative AI significantly shapes how media companies and broadcasters tell stories, enabling them to share more compelling narratives. With the help of state-of-the-art AI models, media companies can better understand and describe what is in audiovisual media, enabling quick and relevant content searches. 

Yet, one of the challenges with using AI for video indexing is that the datasets used to train foundation models do not reflect the diversity of the world. Most datasets are created in Western countries and in English, regardless of the actual distribution of language speakers worldwide. 

Some initiatives exist, such as the European BLOOM model, which emerged from a consortium of open-source AI advocates, or the Chinese Yi models from 01.AI. But pieces are still missing before we have large AI models that cover a broad range of cultures, opinions, and beliefs.

On Hugging Face, more than 65% of datasets are in English, and less than 4% are in Arabic. The ratio of English to Arabic datasets is vastly disproportionate given that around 400 million people worldwide speak English as their native language, while 370 million speak Arabic.

Language distribution in Hugging Face datasets.

This discrepancy arises from several factors. English serves as a universal language for research and communication, which influences dataset creation. Moreover, market size drives language-related investment and quality, affecting aspects such as transcription accuracy.

The biases within AI datasets and models have the potential to greatly affect how stories are told and cultural narratives are conveyed by broadcasters and media companies.

Take, for example, this shot of a man wearing a kufiya or keffiyeh, which the AI incorrectly detected and described as a hijab.

A kufiya or keffiyeh is incorrectly detected and described by the AI as a hijab.

This hallucination is a result of the vision-language model used for image captioning being trained on data that contained few or no examples of hijabs and keffiyehs, so it lacked important knowledge of this traditional Middle Eastern clothing and who wears it.

To address this critical issue of AI hallucinations, it is important to continuously add diverse cultural information from various sources to existing datasets. Building an unbiased AI is almost utopian, but one can mitigate cultural biases and make the most of AI video understanding in a few ways.

How to Reduce Bias in AI-Generated Content

One solution is prompt engineering. This approach applies to any model driven by a text prompt, and it involves adjusting the initial prompt to achieve less biased generation. Asking a raw diffusion-based foundation model for a picture of a researcher making a discovery will likely produce four variations of a Caucasian male in his 40s in a white shirt or lab coat. Adjusting the prompt to suggest other cultures and genders will improve the diversity of the generated results. It's important to note that this approach can introduce new biases of its own: Google Gemini recently attempted to address this issue, but there were drawbacks, including inaccuracies in some historical image generation depictions.
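
As a rough illustration of this kind of prompt adjustment, the sketch below expands a single generic image prompt into explicit variations before it is sent to a text-to-image model. The subject string, the attribute lists, and the expand_prompt helper are hypothetical, and any real attribute list would itself need careful curation with cultural experts to avoid swapping one bias for another.

```python
import itertools

# Hypothetical subject and attribute lists; a real deployment would curate these
# with editorial and cultural experts rather than hard-code them.
BASE_SUBJECT = "researcher making a discovery in a laboratory"
GENDERS = ["woman", "man"]
BACKGROUNDS = ["Middle Eastern", "North African", "East Asian", "European", "West African"]

def expand_prompt(subject: str) -> list[str]:
    """Turn one generic prompt into explicit variations so the diffusion model is not
    left to default to the most frequent depiction in its training data."""
    return [
        f"photo of a {background} {gender} {subject}"
        for background, gender in itertools.product(BACKGROUNDS, GENDERS)
    ]

for prompt in expand_prompt(BASE_SUBJECT):
    print(prompt)  # each variation is sent to the image model in place of the raw prompt
```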

Another option is fine-tuning the model. Instead of training a new model from scratch, extra data is added to a base model to improve its results on a specific set of data. This solution still requires a significant amount of sufficiently diverse, often annotated, data. It is therefore important for AI scientists and business experts, including media managers and producers, to work together to provide real-world use cases and the data needed to fine-tune the AI model that will be used in a media management context. Well executed, fine-tuning empowers media companies to create custom metadata, such as video summaries or shot descriptions that respect their editorial guidelines, and to instantly generate additional value through the significant time saved on manual media logging and writing SEO copy for online publishing.
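
As a minimal sketch of what such fine-tuning might look like under the hood, the example below continues training an openly available captioning model (the Salesforce/blip-image-captioning-base checkpoint on Hugging Face) on a couple of editorially annotated image-caption pairs. The file paths and captions are invented for illustration; a production fine-tuning run would need far more diverse data, a validation set, and careful hyperparameter and evaluation choices.

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load a public captioning model as the base to adapt.
model_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

# Hypothetical image-caption pairs annotated by archive or editorial experts.
pairs = [
    ("frames/shot_0042.jpg", "A man wearing a black-and-white keffiyeh speaks to reporters."),
    ("frames/shot_0107.jpg", "A news presenter wearing a hijab reads the evening bulletin."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for image_path, caption in pairs:
        inputs = processor(images=Image.open(image_path), text=caption, return_tensors="pt")
        # The annotated caption doubles as the training target, so the model learns
        # to reproduce the editorially approved wording for this kind of imagery.
        outputs = model(
            input_ids=inputs.input_ids,
            pixel_values=inputs.pixel_values,
            labels=inputs.input_ids,
        )
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```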

A multimodal approach, one that draws on multiple modalities, can also further improve video indexing results and reduce the likelihood of generating hallucinations. Rather than relying on a single source for indexing, multimodal AI models can take numerous signals into account, such as objects, context, geo-location, facial recognition, the Wikidata knowledge graph, brand logos and other visual patterns, transcriptions, translations, web search results, and more. By combining these forms of collective memory, learned knowledge, hearing, and a notion of space and time, the metadata produced by multimodal AI indexing leads users to the exact moment and gives them the precise context they need.
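
To make the idea concrete, here is a simplified sketch of how signals from several modalities could be fused into a single searchable record and used to cross-check a generated caption. The field names, the fuse function, and the corroboration logic are assumptions made for this example, not a description of any particular indexing system.

```python
from dataclasses import dataclass, field

@dataclass
class ShotMetadata:
    caption: str = ""                  # description from a vision-language model
    transcript: str = ""               # speech-to-text for the shot
    detected_objects: list[str] = field(default_factory=list)  # object detection output
    recognized_faces: list[str] = field(default_factory=list)  # facial recognition hits
    logos: list[str] = field(default_factory=list)             # brand logos, visual patterns
    location: str = ""                 # geo-location or knowledge-graph entity

def fuse(shot: ShotMetadata) -> dict:
    """Combine all modalities into one search document and flag which caption words
    are corroborated by at least one independent signal."""
    evidence = (
        shot.detected_objects
        + shot.recognized_faces
        + shot.logos
        + shot.transcript.split()
        + ([shot.location] if shot.location else [])
    )
    evidence_lower = {item.lower() for item in evidence}
    corroborated = sorted({
        word.strip(".,").lower()
        for word in shot.caption.split()
        if word.strip(".,").lower() in evidence_lower
    })
    return {
        "search_text": " ".join([shot.caption, shot.transcript, *evidence]),
        "corroborated_caption_words": corroborated,
    }

shot = ShotMetadata(
    caption="A man wearing a keffiyeh addresses a crowd.",
    transcript="we gather here in the old city",
    detected_objects=["keffiyeh", "microphone", "crowd"],
    location="Jerusalem",
)
print(fuse(shot)["corroborated_caption_words"])  # ['crowd', 'keffiyeh']
```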

In the long run, combining all these methods is probably key to having AI technologies that can efficiently understand videos from various cultures and in various languages. 

Looking further ahead, sharing additional data with the open-source AI community and promoting diversity among research teams will go a long way toward directly improving cultural representation. As the media industry, along with the global community, increasingly incorporates diverse cultural perspectives, the precision of AI outcomes will improve correspondingly.

Moments Lab is helping broadcasters and media companies worldwide find the key moments in their petabytes of archives or live streams with its groundbreaking MXT-1.5 AI indexing models.

A version of this article was originally published by BroadcastPro ME.

Watch Frederic Petitpont's presentation at the 3rd Arab Media Congress in Tunis, Tunisia, organized by the Arab States Broadcasting Union.