AI Cannot Read Your Mind
Can AI read your mind? That is the question on many minds these days. A recent study, to appear at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023, has been cited in the popular media as evidence of AI’s mind-reading capability. We take a closer look.
Let’s start with a summary of the paper’s main premise and contributions.
The paper aims to find a machine-learning solution that maps neural activations, as measured by fMRI scans, to what the person might have been seeing to produce those activations. This question has been investigated in the past, and several methods have been proposed, including several of the references cited in the paper. Where the paper distinguishes itself is in the quality of its image reconstructions; the ones shown in the paper are very good.
In more detail, the study uses a previously collected database of image+fMRI pairs – the Natural Scenes Dataset (NSD). The NSD data were collected by showing participants 10,000 images from the MS-COCO corpus while recording their neural activity in an fMRI machine. Each presented image is tagged with an average annotation (in turn derived from the annotations in the MS-COCO dataset); the annotations span a total of 80 different object categories represented in the images of the original MS-COCO dataset. The categories are: ‘person’, ‘bicycle’, ‘car’, ‘motorcycle’, ‘airplane’, ‘bus’, ‘train’, ‘truck’, ‘boat’, ‘traffic light’, ‘fire hydrant’, ‘stop sign’, ‘parking meter’, ‘bench’, ‘bird’, ‘cat’, ‘dog’, ‘horse’, ‘sheep’, ‘cow’, ‘elephant’, ‘bear’, ‘zebra’, ‘giraffe’, ‘backpack’, ‘umbrella’, ‘handbag’, ‘tie’, ‘suitcase’, ‘frisbee’, ‘skis’, ‘snowboard’, ‘sports ball’, ‘kite’, ‘baseball bat’, ‘baseball glove’, ‘skateboard’, ‘surfboard’, ‘tennis racket’, ‘bottle’, ‘wine glass’, ‘cup’, ‘fork’, ‘knife’, ‘spoon’, ‘bowl’, ‘banana’, ‘apple’, ‘sandwich’, ‘orange’, ‘broccoli’, ‘carrot’, ‘hot dog’, ‘pizza’, ‘donut’, ‘cake’, ‘chair’, ‘couch’, ‘potted plant’, ‘bed’, ‘dining table’, ‘toilet’, ‘tv’, ‘laptop’, ‘mouse’, ‘remote’, ‘keyboard’, ‘cell phone’, ‘microwave’, ‘oven’, ‘toaster’, ‘sink’, ‘refrigerator’, ‘book’, ‘clock’, ‘vase’, ‘scissors’, ‘teddy bear’, ‘hair drier’, ‘toothbrush’. (Read more at: https://viso.ai/computer-vision/coco-dataset/)
There were 8 subjects in the original NSD corpus. The study protocol was as follows: each participant was shown an image while their neural activity was recorded; each image was shown 3 times, for 3 seconds each time. The total recording time for each subject was ~25 hours, collected over the course of a year. Only 4 of the 8 subjects completed the experiment, and these four were the ones used in the paper’s experiments.
The machine-learning innovation that maps fMRI data to image reconstructions uses the latest advances in generative AI – specifically, a class of methods referred to as diffusion models. A diffusion model can take in a text prompt to generate an image, and it can also take in an image prompt (e.g. a sketch) to generate a realistic-looking picture. Diffusion models can also process a text prompt and an image prompt together to create a new image, with the semantic information provided by the text description and the style provided by the input image.
The key innovation of the study is the method by which the fMRI output is connected to the diffusion model. Internally, the diffusion model maps the input prompt image to a numerical representation called an embedding (variable z in the paper). The same is done for the input text description (variable c in the paper). In the experiment, the images in each fMRI+image pair are first processed by the part of the diffusion model that creates the z and c embeddings described above. The authors then train a simple linear model to predict the z and c vectors individually from the fMRI recordings associated with the image in question.
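As a rough illustration of this linear-mapping step (a sketch of the general technique, not the authors’ actual code – the array shapes, noise levels, and variable names below are our own assumptions), a per-subject map from fMRI voxels to an embedding vector can be fit with closed-form ridge regression:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins; shapes are illustrative, not the paper's.
# X: fMRI responses, one row per presented image (n_images x n_voxels)
# Z: target embeddings from the diffusion model's encoder (n_images x embed_dim)
n_images, n_voxels, embed_dim = 500, 1000, 64
W_true = rng.normal(size=(n_voxels, embed_dim))          # hidden "true" mapping
X = rng.normal(size=(n_images, n_voxels))                # simulated voxel responses
Z = X @ W_true + 5.0 * rng.normal(size=(n_images, embed_dim))  # noisy targets

# Ridge regression in closed form: W = (X^T X + lam*I)^(-1) X^T Z
lam = 10.0
W = np.linalg.solve(X.T @ X + lam * np.eye(n_voxels), X.T @ Z)

# Predicted embeddings, which would then stand in for the true z
# (and, analogously, c) as the diffusion model's conditioning input.
Z_pred = X @ W
print(Z_pred.shape)  # (500, 64)
```

The same fit would be repeated separately for z and for c, and separately for each subject.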
The paper shows a proof of concept, but one needs to check its ability to generalize beyond the conditions of the paper. All of the training and testing was done within subjects – meaning that the training and testing subjects were the same, although the images were different. Even under this idealized condition, the linear model does not capture the relationship between the fMRI and the true embeddings very well: the correlation between true and predicted embeddings is on the order of r = 0.2–0.35. The diffusion model takes these noisy predicted embeddings and generates images that human raters deemed more similar to the original image than to a random image.
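To make the r = 0.2–0.35 figure concrete, here is a hedged sketch (our own toy numbers, not the paper’s data) of what that level of per-dimension correlation between true and predicted embeddings looks like:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate held-out true embeddings and weakly predictive estimates.
n_test, embed_dim = 200, 64
z_true = rng.normal(size=(n_test, embed_dim))
# Mix in mostly noise so the prediction is only weakly related to the
# truth, mimicking the r ~ 0.2-0.35 regime reported in the paper.
z_pred = 0.3 * z_true + rng.normal(size=(n_test, embed_dim))

def column_correlations(a, b):
    """Pearson correlation for each embedding dimension (column)."""
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    return (a * b).mean(axis=0)

r = column_correlations(z_true, z_pred)
print(float(r.mean()))  # weak correlation, by construction
```

A correlation in this range means the predicted embedding carries some signal about the image, but most of its variance is noise – which is why the diffusion model’s job is, in effect, to hallucinate a plausible image consistent with a very noisy hint.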
This is an interesting and clever approach that connects the NSD dataset to diffusion models using a simple linear mapping. It has nonetheless been sensationalized in the press, with many talking about reading thoughts. That is nowhere near what is actually happening.
Every image elicits a different neural response in different individuals. Neural responses are individualized and heterogeneous. This makes sense: for some individuals, seeing a lake and a bird might evoke feelings of calm and peace, whereas for others the semantic categories might stand out. The neural responses for these two interpretations would be different, but both would provide some ability to predict which image was shown.
In other words, it is likely that there is neural activity that can distinguish between some of the 80 categories represented in the NSD dataset – however, that activity is very much specific to the individual. A model trained for one person is not likely to generalize to another person, and vice versa. It is important to note that this limitation is independent of the diffusion model or any other advanced AI approach. Generalizability is all the more questionable because the authors use a simple linear model for the key step of deriving the image and text embedding vectors from fMRI. The role the diffusion model plays is that of a denoiser: it converts the noisy embeddings into a realistic image.
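The cross-subject problem can be illustrated with a small simulation (entirely our own toy construction, not an analysis from the paper): give two simulated subjects different voxel encodings of the same images, fit a linear decoder on one subject, and test it on both.

```python
import numpy as np

rng = np.random.default_rng(2)

n_images, n_voxels, embed_dim = 400, 300, 32
Z = rng.normal(size=(n_images, embed_dim))  # shared image embeddings

# Each simulated subject encodes the same images with a different,
# individual voxel mapping (plus noise) -- the heterogeneity discussed above.
map_a = rng.normal(size=(embed_dim, n_voxels))
map_b = rng.normal(size=(embed_dim, n_voxels))
X_a = Z @ map_a + 0.5 * rng.normal(size=(n_images, n_voxels))
X_b = Z @ map_b + 0.5 * rng.normal(size=(n_images, n_voxels))

# Fit a ridge decoder on subject A only.
lam = 1.0
W = np.linalg.solve(X_a.T @ X_a + lam * np.eye(n_voxels), X_a.T @ Z)

def mean_corr(z_true, z_pred):
    """Average per-dimension Pearson correlation."""
    a = (z_true - z_true.mean(0)) / z_true.std(0)
    b = (z_pred - z_pred.mean(0)) / z_pred.std(0)
    return float((a * b).mean())

r_same = mean_corr(Z, X_a @ W)   # within-subject: decodes well
r_cross = mean_corr(Z, X_b @ W)  # cross-subject: near chance
print(r_same, r_cross)
```

Under these (admittedly cartoonish) assumptions, the within-subject correlation is high while the cross-subject correlation hovers around zero – a decoder fit for one person’s voxel layout tells you essentially nothing about another’s.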
Is this useful? Perhaps one can find ways to use this approach, but it is not immediately clear to us how. Just to replicate the same experiment on a different person, one would need to develop an individualized linear mapping from that person’s neural response data to the embeddings z and c. That means collecting ~25 hours of fMRI data from the new individual.
The bottleneck in this kind of work is the resolution of fMRI and the between-subject variability. We simply do not know how to map between natural images and neural activity in a way that is repeatable across people. This mapping is complex and not well understood, and none of the generative models that exist today ease that problem – which is likely why the authors chose a linear model. More importantly, this is a very specific protocol that does not generalize to the use cases that have been postulated.
Our key takeaways are:
- The ethical implications of this type of work should be considered first and foremost. The primary question we should answer as a community is whether these use cases should even be under consideration. That is a question we are not qualified to answer.
- If the answer to the ethical question is in the affirmative, then it is important to note the limitations of the current study relative to the claims made about it in the press. The authors demonstrate that the predicted embeddings can generate better-than-random images for 80 categories – but only after collecting ~25 hours of fMRI data for each individual.
- This is not a realistic requirement for real-world use, as fMRI data at this scale exist for very few people.
- The second important constraint is that the model was trained on a closed set of categories. Real-world use cases involve far more than 80 categories: humans can make sense of thousands of different categories, each in countless contexts.
- Finally, “thought” is, as far as we are aware, not yet a mathematically definable or measurable concept. Defining “thought” as the names of objects in a picture is very limiting; thus any notion of a “mind-reading AI” is at best fantastical.