Comment: Semantic Reconstruction of Continuous Language from Non-Invasive Brain Recordings
Authors: Pouria Saidi, Sean Kinahan, and Visar Berisha (Arizona State University)
A recent article published in Nature Neuroscience, titled Semantic Reconstruction of Continuous Language from Non-Invasive Brain Recordings, has attracted widespread attention and raised questions about whether AI-powered machines may be able to read the stories in people’s minds. But is this really the case? Do the results in this paper mean that one can read people’s minds from their brain recordings? The short answer is no, and the paper itself makes no claims about mind reading; in fact, the technology is far from reaching such capabilities. Here we provide a summary of the article and our view on the results.
Summary of the paper’s main premise and contribution
The paper presents a machine-learning decoder of continuous natural language from non-invasive brain recordings. Existing decoders of continuous language rely on invasive recordings from implanted electrodes, while previous non-invasive methods could identify only a limited set of stimuli. The key innovation of this study is the non-invasive reconstruction of the semantic meaning of continuous natural language.
The non-invasive decoder presented in the study reconstructs natural language by analyzing cortical representations of semantic meaning recorded through functional magnetic resonance imaging (fMRI). The decoder can generate intelligible word sequences that capture the meaning of perceived speech, imagined speech, and even silent videos, indicating its versatility across semantic tasks. The researchers tested the decoder on different cortical networks and found that natural language can be decoded from multiple networks in each hemisphere of the brain. They additionally evaluated aspects of mental privacy by testing the effect of subject cooperation on the decoding process. The decoder’s performance exceeds random chance, demonstrating that the semantic meaning of continuous language can be decoded under these conditions.
Relative to the rate of speech (2+ words per second), the blood-oxygen-level-dependent (BOLD) signal measured via fMRI is slow: a single impulse of neural activity takes approximately 10 seconds to rise and fall. Each measurement of brain activity is therefore affected by 20 or more words. The decoder tackles this ill-posed inverse problem by generating several candidate word sequences, evaluating how likely each sequence is to have evoked the recorded brain responses, and finally selecting the best candidate.
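The temporal mismatch above can be illustrated with a small simulation. This is a sketch under our own assumptions, not the paper's model: we treat each word as a neural impulse arriving at 2 words/second and convolve with a crude gamma-shaped hemodynamic response, then count how many words contribute appreciably to a single BOLD sample.

```python
import numpy as np

# Hypothetical illustration of BOLD temporal smearing (not the paper's HRF):
# a gamma-shaped kernel peaking near 5 s and decaying over ~15 s.
def hrf(t):
    """Crude gamma(6)-shaped hemodynamic response sketch."""
    return t ** 5 * np.exp(-t) / 120.0

dt = 0.5                              # one word every 0.5 s (2 words/s)
word_times = np.arange(0, 30, dt)     # 60 words over 30 s of speech
t = np.arange(0, 15, dt)              # HRF support (~15 s)
kernel = hrf(t)

# One neural "impulse" per word, smeared by the slow hemodynamic response:
impulses = np.ones_like(word_times)
bold = np.convolve(impulses, kernel)[: len(word_times)]

# How many words contribute non-negligibly (>1% of peak) to one sample:
contributing = int(np.sum(kernel > 0.01 * kernel.max()))
```

Under these toy assumptions, well over 20 words overlap within each measurement, which is what makes the inverse problem ill-posed.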
A dataset of paired fMRI recordings and speech transcripts was compiled for the seven subjects who participated in this study. Participants listened to 16 hours of narrative stories during fMRI across 16 separate scanning sessions; the stimuli comprised 82 autobiographical podcast stories of 5–15 minutes each. To investigate the roles that distinct cortical regions play in language processing, fMRI data from each subject were partitioned into three distinct cortical networks. Decoder predictions were generated using a combination of all three networks, and this process was repeated using data from each network individually.
The authors used a fine-tuned Generative Pre-trained Transformer (GPT) neural network to extract semantic features from the language stimuli. Linear regression was then used to map these semantic features to the corresponding brain responses. The resulting encoding model predicts how a subject’s brain would respond to a given word sequence.
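The feature-to-response mapping described above can be sketched in a few lines. The dimensions, data, and use of a closed-form ridge (regularized linear) regression here are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 1000 time points, 64-d semantic features, 20 voxels.
n_samples, n_features, n_voxels = 1000, 64, 20
X = rng.standard_normal((n_samples, n_features))        # semantic features
W_true = rng.standard_normal((n_features, n_voxels))
Y = X @ W_true + 0.1 * rng.standard_normal((n_samples, n_voxels))  # simulated BOLD

# Ridge regression, closed form: W = (X'X + lam*I)^{-1} X'Y
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ Y)

# The fitted encoding model predicts the response a new word sequence
# (represented by its semantic features) would evoke:
X_new = rng.standard_normal((5, n_features))
Y_pred = X_new @ W
```

The key point is the direction of the mapping: the model goes from stimulus features to brain responses, which is what later lets the decoder score candidate word sequences against recorded activity.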
During reconstruction of language from brain recordings, the decoder keeps a small number of candidate word sequences. It employs a language model to propose extensions to each sequence, while the encoding model evaluates how well each extension’s predicted brain response matches the recorded responses. The decoder then retains the most probable extensions, effectively guiding the language model toward the most likely word sequence based on the subject’s brain recordings.
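This propose-score-prune loop is a beam search. The toy sketch below uses stand-ins of our own invention for the language model and encoding model (none of the functions, vocabulary, or dimensions come from the paper), only to make the control flow concrete:

```python
import numpy as np

# Hypothetical beam-search decoding loop (not the paper's implementation).
VOCAB = ["the", "dog", "ran", "sat", "fast"]
BEAM_WIDTH = 3
SEQ_LEN = 3

def lm_propose(seq):
    """Language-model stand-in: treat every vocabulary word as a candidate next word."""
    return [seq + [w] for w in VOCAB]

def encode(seq):
    """Encoding-model stand-in: map a word sequence to a predicted 8-d brain response."""
    h = np.zeros(8)
    for i, w in enumerate(seq):
        h += np.cos(sum(map(ord, w)) + np.arange(8) * (i + 1))
    return h

def score(seq, recorded):
    """Higher when the predicted response is closer to the recorded one."""
    return -np.linalg.norm(encode(seq) - recorded)

recorded = encode(["the", "dog", "ran"])  # pretend: the measured fMRI response

beam = [[]]
for _ in range(SEQ_LEN):
    candidates = [c for seq in beam for c in lm_propose(seq)]
    beam = sorted(candidates, key=lambda s: score(s, recorded), reverse=True)[:BEAM_WIDTH]
```

Note that the brain recording never generates words itself; it only ranks sequences the language model has already proposed, which is central to the critique later in this commentary.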
To assess the performance of the decoder framework, test stories that were not part of the model’s training corpus were used. Language similarity metrics were used to compare the decoded text with the stimulus transcripts. The results showed that around 50% of the time, the decoder generated output text that closely aligned with the intended meaning of the original input sequence.
Our thoughts
The paper proposes an interesting approach to non-invasive decoding, providing results that suggest a practical pathway toward decoding continuous language from fMRI. These findings are impressive and open new avenues of research. However, the way the paper has been covered in the press has left the impression that a mind-reading technology has arrived. On the contrary, we believe that technology that helps people communicate using non-invasive brain recordings is still in its infancy. Here, we highlight a few concerns regarding the interpretation of the results presented in this article.
1) The encoding model at the core of the proposed language decoder is trained on an immense amount of data recorded specifically for each subject: each subject spent up to 16 hours in the MRI scanner to provide the fMRI recordings required for training. Based on the results in the paper, the decoder’s performance deteriorates substantially in the absence of subject-specific training data, suggesting that within-subject data is critical. One therefore cannot adopt a model trained on one subject to decode the fMRI recordings of new subjects. This is one of the principal limitations of the proposed model. From a different perspective, however, this could be good news: without a subject’s willingness and cooperation, no such model can be developed.
2) The performance of the language decoder depends heavily on the pre-trained language model, GPT. GPT generates the candidate word sequences, while the encoding model scores how well the predicted brain responses match the recorded ones; in other words, the language model generates many candidate sentences, and the encoding model selects the better ones. The encoding model is trained on semantic features, extracted from the middle layers of the language model, that capture the meaning of the stimulus phrases. The brain recordings are thus used to guide the language model toward word sequences aligned with the semantics of the original word sequence. Consequently, the language decoder may be solving an easier problem: generating word sequences within the class characterized by the semantics of the original sentence. This implies that the true performance of the decoder is obscured, and inflated, by the performance of the language model. Indeed, as discussed in the paper, the decoder often reconstructs the meaning of the language stimuli but fails to recover the exact words. It is noteworthy that many words and sentences may induce similar neural responses in the brain, underscoring that this technology cannot reproduce the same words, or even the exact same stories.
3) Language models such as GPT are pre-trained on a very large corpus, and the model may already have been exposed to sentences similar to those in the test set. Care must be taken to ensure that the use of pre-trained language models does not introduce data leakage.
4) Lastly, it is of paramount importance to consider the ethical concerns intrinsic to studies of this nature. There should be no doubt that the privacy of the subjects is of utmost concern, and all necessary steps should be taken to ensure that it is respected. Although the article emphasizes the importance of subject cooperation for achieving high performance, it also discusses the possibility of developing future models that could overcome this obstacle. Even though the technology is still far from such capabilities, ethical considerations should be addressed in parallel with technical development.
Image credit: svstudioart on Freepik