Book Review: "The Worlds I See: Curiosity, Exploration, and Discovery at the Dawn of AI", by Dr. Fei-Fei Li
The story of a woman in STEM following her dream of becoming a scientist, the birth of modern computer vision, and the implications of AI
I have been following Dr. Fei-Fei Li’s research since my master’s degree days, and when I saw the news that she had just published her memoir, I couldn’t have been more excited to find out more about her days as a researcher.
Dr. Li is a renowned computer scientist who played a crucial role in shaping the evolution of contemporary artificial intelligence. Her new memoir presents her journey from her childhood years in China, navigating the hurdles of relocating to the United States during high school, to her eventual ascent as a prominent researcher and professor at Stanford, at the forefront of technological innovation.
In “The Worlds I See”, she presents her experiences and challenges as an immigrant, a woman in STEM, a teacher and a scientist, emphasising her profound passion for the work that has solidified her standing as a well-known thought leader in the field of AI.
Born in Chengdu, China, she moved to Parsippany, New Jersey in the early 1990s, where she met Mr Sabella, her math teacher and her first American friend. She majored in physics at Princeton on a full scholarship, followed by a PhD at Caltech, where she completed her dissertation, titled “Visual Recognition: Computational Models and Human Psychophysics”, under the primary supervision of Pietro Perona and the secondary supervision of Christof Koch. Years later, as a professor, she put together ImageNet, a fourteen-million-image dataset that enabled rapid advances in computer vision in the 2010s.
The birth of ImageNet
During an era when the predominant focus of AI research was on models and algorithms, Dr Li aimed to broaden and enhance the data available for training AI algorithms. She admits this was not an easy challenge, recounting her struggles and efforts to encourage other academics to use the dataset in its early years:
“We’d dedicated years of our lives to a data set that was orders of magnitude beyond anything that has ever existed, orchestrated an international competition to explore its capabilities, and, for all that, accomplished little more than simply reifying the status quo. If ImageNet was a bet, it was time to start wondering if we’d simply lost.”
ImageNet contains more than 20,000 categories, with a typical category consisting of several hundred images. Images are annotated to indicate what objects are pictured, and for at least one million of the images, bounding boxes are also provided.
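To make the dataset's structure concrete, here is a minimal sketch of what one annotated record conceptually contains: a category label plus optional bounding boxes marking where the object appears. The field names and identifier below are hypothetical illustrations, not ImageNet's actual file format.

```python
# One conceptual annotation record (hypothetical field names, not
# ImageNet's real on-disk format): a category plus bounding boxes.
annotation = {
    "image_id": "n02084071_1234",      # hypothetical WordNet-style identifier
    "category": "dog",                 # one of the 20,000+ categories
    "bounding_boxes": [                # present for at least a million images
        {"x_min": 34, "y_min": 50, "x_max": 210, "y_max": 190},
    ],
}

def box_area(box):
    """Area in pixels of one bounding box."""
    return (box["x_max"] - box["x_min"]) * (box["y_max"] - box["y_min"])

print(box_area(annotation["bounding_boxes"][0]))  # 176 * 140 = 24640
```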
As her husband Silvio put it, “That’s another thing about hypotheses - sometimes they take time to win everyone over.” And he was right. During one edition of the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC), in which programs compete to correctly classify and detect objects and scenes, one contestant’s algorithm achieved a surprising result of roughly 85% accuracy. It used a neural network, a concept first explored decades earlier but by then considered “a dusty artifact, encased in glass and protected by velvet ropes.” More precisely, the algorithm was based on a CNN (convolutional neural network), in which each layer integrates further details into higher and higher levels of awareness until a real-world object comes fully into view.
A CNN behaves like a retina: “its outermost layer applies thousands of receptive fields to the pixels of a photograph, each tuned to a unique, tiny pattern and activating when it encounters it. […] At this level of awareness, these filters could be reacting to anything […] Rather than arbitrarily deciding in advance which feature to look for, the authors allowed each of its hundreds of thousands of neurons to learn their own sensitivities from the training data, without manual intervention. […] Signals from those thousands of receptive fields travel deep into the network, merging and clustering into larger, clearer hints. Each new layer, operating at a more sophisticated level of perception than the last, responds when sensing something familiar - something it had been trained to recognize.”
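The layering described above can be sketched in a few lines of numpy. This toy hand-sets a single edge filter for readability, whereas, as the quote stresses, a real CNN learns its filters from data; the image, filters, and layer sizes are all made up for illustration.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one small receptive field (kernel) over the image and record
    how strongly each patch matches it (a 'valid' cross-correlation)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    """Activation: a filter 'fires' only where its pattern is present."""
    return np.maximum(x, 0.0)

# A tiny 4x4 "image" containing a vertical edge (dark left, bright right).
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)

# Layer 1: a filter tuned to vertical edges. In a real CNN these weights
# are learned from the training data, not hand-set as here.
edge_filter = np.array([[-1.0, 1.0]] * 2)
layer1 = relu(conv2d(image, edge_filter))

# Layer 2: a coarser filter that responds where layer 1 found edges nearby,
# i.e. a "higher level of awareness" built on the first layer's output.
pool_filter = np.ones((2, 2)) / 4.0
layer2 = relu(conv2d(layer1, pool_filter))

print(layer1.max(), layer2.max())
```

Each layer's output becomes the next layer's input, which is how the signals "merge and cluster into larger, clearer hints" as they travel deeper.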
The winning entry came from a three-person team: Alex Krizhevsky (hence the algorithm’s name, AlexNet), Ilya Sutskever, and Geoffrey E. Hinton (the same researcher who had helped develop the backpropagation technique in the mid-80s).
The intersection of computer vision with other domains
Andrej Karpathy, then a second-year grad student under Dr Li’s supervision, focused his thesis on the intersection of natural language processing with computer vision, investigating deep learning models suited for this task (“Connecting Images and Natural Language”, 2016). The exploration started with the idea of an algorithm that could describe an entire scene, not just label whichever objects appear in it.
This posed a couple of challenges. The fundamental unit of a spoken language is the word, and unlike a pixel, words convey distinct meanings, whether in isolation or as part of a group of words (e.g. “rock bottom”, “rock fragments”, “rock music”).
Karpathy’s idea for solving the challenge was to pair a CNN with an RNN. The RNN (recurrent neural network), another architecture dating back a couple of decades, would now start to gain attention. The model is recurrent, meaning that the output from some nodes feeds back into subsequent inputs to the same nodes, which lets it infer basic properties of text.
“One to encode visual information and pair it with words, and the other to generate language. We’ll train our model on pairs of images and human-written description. […] the RNN generates each new word in the description conditionally, based on the words already in the sentence.”
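That conditional generation loop can be sketched as a toy. Everything here is a made-up stand-in: the five-word vocabulary, the random weights (which real training on image/description pairs would learn), and the four-number "image features" standing in for the CNN encoder's output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; index 0 doubles as the start/end token. All weights
# below are random stand-ins for what training would actually learn.
vocab = ["<end>", "a", "dog", "on", "grass"]
dim, V = 8, len(vocab)

W_img = rng.normal(size=(dim, 4))   # projects CNN image features -> hidden state
W_h = rng.normal(size=(dim, dim))   # recurrence: hidden -> hidden
W_x = rng.normal(size=(dim, V))     # previous word (one-hot) -> hidden
W_out = rng.normal(size=(V, dim))   # hidden -> word scores

def caption(image_features, max_len=10):
    """Generate each word conditionally on the words already emitted,
    starting from the encoded image (the CNN half is stubbed out here)."""
    h = np.tanh(W_img @ image_features)    # the image conditions the RNN's start
    word = np.zeros(V); word[0] = 1.0      # begin from the start token
    out = []
    for _ in range(max_len):
        h = np.tanh(W_h @ h + W_x @ word)  # fold in the previous word
        idx = int(np.argmax(W_out @ h))    # greedy pick of the next word
        if idx == 0:                       # <end> stops the description
            break
        out.append(vocab[idx])
        word = np.zeros(V); word[idx] = 1.0
    return out

print(caption(rng.normal(size=4)))
```

The key mechanic matches the quote: each new word depends on both the encoded image and the words already in the sentence, via the recurrent hidden state.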
And the birth of deep learning
A sense of eagerness gripped the field as the demand for more intensified: deeper neural networks, more layers, more compute power to speed up training, more data, more of… everything. It wasn’t merely the era of machine learning; with ever more heavily layered networks being experimented with across labs, it was the era of deep learning.
In 2015, a retrospective on the ILSVRC competition was published, including an experiment conducted by Andrej estimating the human error rate when labelling a thousand images at around 5.1%. In 2014, GoogLeNet, Google’s neural network classifier, had achieved an all-time-low error rate of just 6.67%, coming close to human-level performance.
In 2015, the Deep Residual Network (ResNet), led by Kaiming He, a network with 152 layers, presented a unique innovation: some of the layers could be bypassed during the training phase, allowing different images to direct their influence toward smaller subregions of the network.
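The bypass idea rests on a simple trick, sketchable in a few lines: each block adds its input back to its own output via a skip connection, so a block whose weights contribute nothing simply passes its input straight through. The weights and input below are illustrative, not from the actual ResNet.

```python
import numpy as np

def residual_block(x, weights, activation=np.tanh):
    """A ResNet-style block: the layer learns only a *residual* F(x),
    and the input is added back via a skip connection. If F(x) is ~0,
    the block is effectively bypassed, i.e. y = x."""
    return x + activation(weights @ x)

x = np.array([0.5, -0.3, 0.8])  # toy activations entering the block

# With zero weights the block passes its input straight through
# (tanh(0) = 0), which is what lets very deep stacks, 152 layers
# in ResNet's case, stay trainable instead of degrading.
identity_like = residual_block(x, np.zeros((3, 3)))
print(np.allclose(identity_like, x))  # True
```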
Google had meanwhile acquired DeepMind, with the mission of working on AGI (artificial general intelligence), a form of AI that, rather than just performing tasks like classification or translation, can replicate all human cognitive capabilities.
In 2018, at the Google Brain conference, one of the topics was “neural architecture search”, a domain looking into optimising the neural network’s architecture itself. Models are defined by a set of parameters that govern trade-offs such as speed vs accuracy and memory vs efficiency, and finding a way to balance these trade-offs is a complex task. One way to solve this is to automate it: have a machine learning model design other machine learning models. From a computational standpoint, though, this is a really expensive task.
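The idea can be sketched with a deliberately crude stand-in: random search over a two-parameter architecture space. The search space, the “accuracy” and “cost” formulas, and the penalty weight below are all invented for illustration; in real neural architecture search, evaluating a single candidate means training a network, which is exactly why the approach is so computationally expensive.

```python
import random

random.seed(42)

# Hypothetical search space: an "architecture" is just (depth, width).
# Both formulas are made-up proxies for what would really require
# training and benchmarking each candidate network.
def accuracy(depth, width):
    return 1.0 - 1.0 / (depth * width)   # deeper/wider helps (toy model)

def cost(depth, width):
    return depth * width                 # compute grows with size (toy model)

def score(depth, width, penalty=0.01):
    """The trade-off being balanced: accuracy minus a compute penalty."""
    return accuracy(depth, width) - penalty * cost(depth, width)

# The "designer": sample candidate architectures and keep the best
# trade-off; a crude stand-in for learned architecture search.
candidates = [(random.randint(1, 20), random.randint(1, 64)) for _ in range(200)]
best = max(candidates, key=lambda a: score(*a))
print(best, round(score(*best), 3))
```

Even this toy makes the expense visible: every point sampled is one full model to evaluate, and real search spaces have vastly more than two knobs.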
More innovations followed, such as transformers, LLMs, and “multimodal” networks, which allowed models to tackle ever more complex tasks.
Future challenges concerning AI policies
Another project Dr Li’s lab worked on was in the healthcare domain: an automated technique for ensuring that caregivers wash their hands consistently and thoroughly throughout hospital spaces. The aim was to have technology that would blend into the background discreetly, keeping a silent watch and speaking up only when it sensed danger (what Dr Li called ambient intelligence). But what followed unexpectedly was not just the sheer technical complexity of the task, but its implications from a human perspective. The technology was intended for patient safety, yet it raised concerns when it came to performance reviews. This was Dr Li’s first encounter with an aspect of AI that would soon haunt the public’s mind: its capacity for surveillance.
Another concern about AI would soon be raised by the emerging threat known as “adversarial attacks”, in which input is designed for the sole purpose of confusing an ML algorithm to counterintuitive or destructive ends. For instance, an input photo can be modified with subtle fluctuations in the colours of individual pixels in such a way as to trigger a cascade of failures within the neural network. This could have enormous consequences in the self-driving field, where, for example, a stop sign could be misclassified, putting our own safety at risk.
In 2016, ProPublica investigations documented the widespread use of biased algorithms, from helping lenders process loan applications to assisting judges in making decisions about parole. Similar reports uncovered other questionable uses of technology, such as pre-screening job applicants before they are called for interviews.
“As scary as each of these issues was in isolation, they pointed toward a future that would be characterised by less oversight, more inequality, and, in the wrong hands, possibly even a kind of looming, digital authoritarianism.”
Providing concrete examples of AI’s processes and challenges from its early days up to the present, Dr. Li contemplates the principles that should underpin AI in both academic and private-sector contexts. Lopsided datasets, susceptibility to bias, adversarial influence, fairness, and privacy are all concerns that need to be addressed. This is another challenge Dr Li takes on as Co-Director of the Stanford Institute for Human-Centered Artificial Intelligence.
A memoir of one woman’s curiosity and courage
In one of her talks, she mentions her friend John Etchemendy, who read her first manuscript, told her it was no good, and convinced her to write a memoir instead, something she was reluctant to do at first.
“Many AI scientists and technologists can write this book, but there are so many young women, young people, immigrants, people of all walks of life out there, who need to identify with the voice that is not a typical AI voice, and you have a journey that you need to share with these people.”
The challenges of being an immigrant and a woman in STEM, paired with the demands of a career, a marriage, and motherhood, are some of the hardships Dr. Li describes so movingly. As an immigrant and a woman in engineering, I’m really happy she listened to her friend’s advice, as her book became a cosy journey of reminiscing about my own struggles and the spark and passion that kept me moving forward.
Her book’s final chapter contains perhaps the most fitting message for all the immigrant scientists out there, “No matter who you are, or where you came from - you belong here”, so that we can build the future together.
The Worlds I See is an inspiring story of one woman’s curiosity and courage in following her passion to become a scientist. Today, Dr. Li is a computer science professor and Co-Director of the Stanford Institute for Human-Centered AI, which she co-founded, while also advising Senate and House subcommittees and the Executive branch on matters concerning AI security and policy.