Fei-Fei Li's Latest Interview: AGI Is Incomplete Without Spatial Intelligence

Industry & Startups · Published: Jul 03, 2025 · Priya Sharma · ~26 min read

Author

Priya Sharma · Enterprise AI & Governance Editor

Regulation, enterprise adoption, and what teams should verify before they deploy.

The burden of proof for defining artificial general intelligence (AGI) has shifted again, this time resting on spatial reasoning rather than just language or pattern recognition. In her latest interview, Fei-Fei Li argues that AGI remains incomplete without it. As an editor focused on governance and technical reality, I view this not merely as a philosophical stance but as a critical infrastructure requirement for the next generation of AI agents.

Li, often called the “Godmother of AI,” frames her lifelong mission as enabling agents to tell the story of the world. She insists this is impossible without spatial intelligence. Her assessment carries weight because she has spent decades building the data foundations that make modern AI possible.

“I have spent my entire career chasing extremely difficult, almost crazy questions.”

Li now targets one of those “crazy” questions: 3D world modeling. She posits that understanding, generating, reasoning within, and acting in a 3D space are fundamental problems for AI. Her goal is to build a world model that moves beyond flat pixels and language barriers to capture the actual structure of the three-dimensional world.

In this conversation, she traces the origins of ImageNet, discusses paradigm shifts in AI, and addresses the severe lack of data for spatial intelligence. The following summary reflects her insights on how we got here and where we must go next.

ImageNet Built the Data Skeleton for Modern Computer Vision

Q: One of your earliest projects was ImageNet, created in 2009, which is now 16 years ago. That paper has been cited over 80,000 times and truly touched upon a critical issue in artificial intelligence: the data problem. Please tell us how that project came about—at the time, it was groundbreaking work.

Fei-Fei Li: Actually, we conceived of this (ImageNet) nearly 18 years ago. I was an assistant professor at Princeton University then. The world of AI and machine learning was completely different; data was very scarce. At least in computer vision, algorithms didn’t really work, and there was no industry there. To the general public, the term “AI” did not even exist.

But among us, from the founders of AI such as John McCarthy to people like Geoffrey Hinton, I think we were all dreaming an artificial intelligence dream: we really, truly wanted machines to be able to think and act. My personal dream was to enable machines to see, because seeing is the cornerstone of intelligence.

Visual intelligence is not merely perception; its true meaning lies in understanding the world and acting within it. I became obsessed with the problem of enabling machines to see. When I was passionately developing machine learning algorithms back then, we tried neural networks, but they failed, so we turned to other methods such as support vector machines.

But one issue always troubled me: generalization. If you work in machine learning, you must recognize that generalization is the core mathematical foundation or goal of machine learning. For generalization, these algorithms need data. But at that time, no one had data in computer vision. I was among the first graduate students to start working with data because I belonged to the generation that saw the early massive development of the internet and the Internet of Things.

Fast forward to the 21st century, around 2007, my students and I decided we had to make a bold bet: we had to bet that machine learning needed a paradigm shift, and this shift must be led by a data-driven approach, but there was simply no data at the time.

So we thought, well, let’s download a billion images from the internet—that was the largest amount we could access then—and create an entire visual taxonomy of the world to train and evaluate machine learning algorithms. That is how ImageNet was conceived and born.

I think spatial reasoning adds regulatory complexity for autonomous systems operating in physical spaces. My sense is that enterprises must verify whether their current models can handle 3D context, not just 2D images. What concerns me is that the shift from general data scarcity to spatial-data scarcity creates a new bottleneck for AI development.

The Fusion of Natural Language and Visual Signals Enables Agents to Tell the Story of the World

Q: It took some time for this process to develop promising algorithms until AlexNet emerged in 2012, which constituted the second key part toward artificial intelligence: gaining computing power and investing sufficient resources into algorithms. Please tell us when you began to realize this—specifically, that moment when you discovered that the method of “sowing data” started working, and the entire AI community achieved further breakthroughs based on it.

Fei-Fei Li: In 2009, we published a very small poster at CVPR. Between 2009 and 2012, those three years, we truly believed that data would drive artificial intelligence, but we had almost no signals as to whether it was effective. So we did some things; one of them was open-sourcing. From the beginning, we believed we must open-source this project to the entire research community so everyone could participate.

Another thing was that we launched a challenge, hoping that the smartest and best students and researchers from around the world would come together to solve it. This is what we called the ImageNet Challenge. We released a test dataset every year and publicly invited everyone to participate. In the first few years, we were essentially establishing baselines—the recognition error rate hovered around 30%, which was not quite random guessing levels, but certainly unsatisfactory.

However, in the third year, 2012 (I wrote about this in a book I published), I still remember it was near the end of summer. We were processing all the results from the ImageNet Challenge and running them on our servers. Then one late night, I received a message from my graduate student saying we had obtained an exceptionally outstanding result and that I should take a look. So we examined it closely; it was something like convolutional neural networks.

At that time, Geoffrey Hinton’s team did not yet use the name “AlexNet”; they called their entry “SuperVision,” a clever pun on “super vision” and “supervised learning.”

Let’s look at what they did—it was an old algorithm; convolutional neural networks had emerged in the 1980s, but they made some adjustments to the algorithm. Seeing such a leap initially surprised us. You know, we presented this at the ICCV challenge workshop in Florence, Italy, that year. Alex Krizhevsky and many researchers attended.

This moment has now been recorded in history as the “AlexNet Moment of the ImageNet Challenge.” It was not merely an application of a convolutional neural network; it was Alex and his team’s first feat of using two GPUs in parallel for deep learning computation. So this was actually the first time data, GPUs, and neural networks were combined.

I think open-source collaboration remains critical for validating new architectural claims. My sense is that enterprises must verify whether their current GPU infrastructure can meet the scaling requirements this history implies. What concerns me is that the shift from isolated object detection to scene understanding is a governance risk for autonomous agents.

Q: Now, following the trend of computer vision intelligence development, ImageNet truly became key to solving object recognition concepts, and subsequently, artificial intelligence reached a level where it could parse visual scenes. Because you and your students, such as Andrej Karpathy, did significant work that enabled AI to achieve scene description capabilities for the first time. Please explain how this transition from objects to scenes occurred.

Fei-Fei Li: The core problem solved by ImageNet was: when a system receives an image, it can accurately identify the objects within it, such as “there is a cat here” or “that is a chair,” and so on. This is a fundamental issue in visual recognition.

Since I entered the field of artificial intelligence as a graduate student, I have had a dream—a dream that spans a century—which is to enable agents to tell stories about the world: when you open your eyes in this room, you do not just see people, chairs, and more chairs; you actually perceive an entire conference room with screens, a stage, people, an audience, and cameras… You can describe the whole scene you see. This is a foundational capability of human visual intelligence and is crucial to our daily lives.

So I truly believed this problem would haunt me for my entire life, literally. When I graduated as a graduate student, I told myself that if I could create an algo

The burden of defining true artificial general intelligence (AGI) has shifted from mere pattern recognition to a deeper requirement for spatial understanding. As we evaluate the trajectory of enterprise AI, it is critical to distinguish between generative novelty and genuine cognitive completeness. The governance challenge lies in verifying whether current models possess the structural reasoning necessary for high-stakes applications.

Li reflects on the foundational shift triggered by what she terms the “AlexNet moment,” which catalyzed explosive growth in deep learning. She notes that when Andrej Karpathy and later Justin Johnson joined her lab, researchers began observing the convergence of natural language and visual signals. This period marked a pivotal transition toward multimodal capabilities.

Li recounts proposing the challenge of image captioning—essentially asking computers to tell stories about scenes—as a primary research direction around 2015. She published several papers on this topic alongside similar studies, viewing it as her lifelong goal. The complexity of the task initially overwhelmed her: “Oh my god, how will I spend the rest of my life?”

I think multimodal convergence is no longer experimental; it is the baseline for modern AI governance. Enterprises must audit these models for spatial reasoning gaps before deployment.

During a recent TED talk, Li referenced a tweet that Karpathy posted years ago after completing his image captioning work. She recalls joking with him about reversing the process: “Hey Andrej, why don’t we do the reverse? Take a sentence and generate an image.” Karpathy’s response highlighted the skepticism of that era: “Haha, I’m out~ The world isn’t ready yet.”

Today, generative AI has fulfilled that reversed promise, allowing users to create images from single sentences. Li uses this evolution to illustrate the incredible growth of the field, noting her personal fortune in starting her career during the early stages after the AI winter. She feels proud that her work contributed to this transformation.

My sense is the speed of generative adoption outpaces regulatory frameworks, creating compliance blind spots for visual content liability.

Li concludes by expressing gratitude for witnessing and contributing to the industry’s takeoff from its earliest days. However, the implication remains that while generation is solved, understanding—specifically spatial intelligence—is still incomplete. This distinction is vital for any organization relying on AI for decision-making rather than just content creation.

What concerns me is that spatial intelligence is a compliance differentiator; models lacking it cannot be trusted in regulated physical-world environments.

The Burden of 3D Understanding

The shift from recognizing objects to comprehending entire scenes—and now the world itself—marks a critical inflection point in computer vision. This evolution demands that we move beyond flat pixels and language models to capture the structural reality of three-dimensional space. As Fei-Fei Li transitions from academia to leading World Labs, she argues that true Artificial General Intelligence (AGI) cannot exist without spatial intelligence. The burden of proof now lies with enterprises and researchers to demonstrate how systems can navigate, interact with, and reason about physical environments, not just process text or 2D images.

I think spatial reasoning is the missing link for embodied AI agents in real-world deployments. My sense is that enterprises must verify whether their models understand physics, not just semantics. What concerns me is that the gap between language fluency and physical interaction remains a significant compliance risk.

From Pixels to Civilizational Progress

When asked what makes scene understanding harder than object detection, Li reflects on the rapid acceleration of technology over the past five to six years. She describes a civilizational moment where computer vision has evolved from simple image recognition to complex description and generation via diffusion models. Simultaneously, the rise of Large Language Models (LLMs) like ChatGPT in November 2022 demonstrated generative capabilities that challenge traditional Turing test benchmarks.

Li draws inspiration from evolutionary theory and brain science to identify the next frontier. She notes that complex human language evolved over only about 300,000 to 500,000 years, and that it makes humans unique in their use of communication, reasoning, and abstraction. Vision, by contrast, is far older and more foundational. The ability to perceive a three-dimensional world, navigate it, and interact with it took roughly 540 million years to evolve, starting with the visual perception of trilobites underwater.

The Evolutionary Case for Spatial Intelligence

Before vision emerged, life on Earth remained extremely simple. However, once organisms gained the ability to observe their environment, an evolutionary arms race began. Over the subsequent 500 million years, animal intelligence competed and refined itself through spatial awareness. For Li, solving how machines can understand, generate, and reason about this 3D world is not just a technical challenge but a fundamental requirement for AI.

She asserts that without spatial intelligence, Artificial General Intelligence (AGI) is incomplete. To address this, she co-founded World Labs with three world-class technologists: Justin Johnson, Ben Mildenhall, and Christoph Lassner. The team aims to build creative world models that transcend flat representations, capturing the true structure of the 3D world.

I think founders must ensure their technical claims are backed by rigorous spatial benchmarks. My sense is governance frameworks need updates for AI systems interacting with physical spaces.

An Elite Team Tackling Hard Problems

Li highlights the talent driving this initiative, noting that Christoph Lassner created Pulsar, a differentiable renderer that was a precursor to Gaussian Splatting techniques. Justin Johnson brings strong systems engineering thinking and is known for real-time neural style transfer, while Ben Mildenhall is an author of NeRF (Neural Radiance Fields). This “elite super-team” is tasked with solving what Li considers the most difficult problem in AI: bridging the gap between one-dimensional language models and complex visual tasks.

My advice is to verify that vendor teams have proven expertise in 3D reconstruction, not just 2D generation. I think the complexity of spatial AI requires stricter validation protocols for enterprise adoption.

My read: Spatial data scarcity is a structural bottleneck, not just a compute problem. My sense is enterprises must audit their 3D asset pipelines for physical accuracy before deployment. What concerns me is that the shift from generative text to constructive world models demands new governance frameworks. I think relying on “brightest minds” without standardized verification is a compliance risk.

Fei-Fei Li’s latest interview underscores a critical divergence in AI development: the gap between linguistic fluency and spatial understanding. As an editor focused on enterprise governance, I see this not merely as a technical challenge but as a fundamental shift in how we define intelligence—and accountability—in autonomous systems. The burden of proof now shifts from “can it speak?” to “does it understand the physical constraints of its environment?”

Why Spatial Intelligence Lags Behind Language Research

When asked why understanding 3D structure remains elusive compared to language, Li points to the inherent simplicity of linguistic data versus the complexity of physical reality.

Fei-Fei Li: I appreciate that you realize how difficult our problem is, haha. Language is inherently one-dimensional, right? Those syllables are arranged in sequence, which is why sequence-to-sequence and sequence modeling are so classic. There are also aspects of language that people do not realize: language is purely generative. Language does not exist in nature; you cannot touch or see it. Language originates from everyone’s brain, and this is a pure generative signal—of course, if you write it on paper, it exists physically.

But the generation, construction, and utility of language are highly creative, whereas the real world is far more complex.

First, the real world is three-dimensional. If we add time, it becomes four-dimensional, but let’s stick to space and assume the world is essentially 3D; this itself presents a significantly higher level of combinatorial complexity.

Second, visual perception and reception of the world are projections. Whether through your eyes, retina, or camera, the process always converts 3D into 2D. You must understand how difficult this is: mathematically speaking, recovering the 3D world from that 2D projection is an ill-posed problem. This is precisely why humans and animals possess multiple sensors.

Third, the world is not entirely generative. We can generate virtual 3D worlds that still adhere to physical laws, but there is also a real world outside. Currently, virtual worlds are switching between generation and reconstruction in a very fluid manner, while user behavior, utility, and use cases differ vastly. If we dial toward generation, we might discuss games or the metaverse; if we dial toward the real world, we find ourselves talking about embodied intelligence. Yet all of these exist on a continuum of world modeling and spatial intelligence.
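To make the second point above concrete, here is a minimal sketch of why projecting 3D into 2D loses information. It assumes an idealized pinhole camera, which is an illustrative simplification and not something described in the interview; the function name `project` and the sample points are hypothetical.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(point: Point3D, focal_length: float = 1.0) -> Point2D:
    """Project a camera-frame 3D point onto the image plane (pinhole model)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    # Dividing by depth is the step that discards the depth information.
    return (focal_length * x / z, focal_length * y / z)


near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)  # same viewing ray, twice as far from the camera

print(project(near))  # (0.25, 0.5)
print(project(far))   # (0.25, 0.5)
```

Both points land on the same image coordinates, so a single 2D observation cannot say which one was actually there. Additional viewpoints or sensors are one way to narrow that ambiguity.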

Li highlights a data asymmetry that enterprises often overlook. While the internet is saturated with text, spatial intelligence lacks comparable public datasets.

Fei-Fei Li: An obvious yet often avoided question is: The internet is flooded with language data, but where is the data for spatial intelligence? Of course, this information exists within the human brain, but it is not as easily accessible as language, which contributes to its difficulty. Honestly, this excites me because if it were easy, someone else would have solved it by now. My entire career has been dedicated to pursuing extremely difficult, almost crazy problems, and I believe this is that kind of crazy problem.

Architectural Differences: LLMs vs. Constructive World Models

The interview also touches on the architectural divergence between Large Language Models (LLMs) and models designed for spatial reasoning. Li suggests that current LLM capabilities are largely driven by brute-force self-supervision, whereas spatial models require structured guidance.

Fei-Fei Li: That is actually an excellent question. There are still many differing viewpoints today. Much of what we see in large language models is essentially writing: extending stories to perfect endings through writing skill, which allows almost entirely self-supervised learning via brute force.

In contrast, constructive world models may be more complex. The world is more structured and may require signals to guide it, which can be viewed as a form of prior knowledge or data supervision.

I believe these are some of the open questions we must address. Moreover, we do not even fully understand all human perception; we have not yet solved how 3D operates in human vision. Although mechanically, we use our eyes to perform three-dimensional measurements on objects, where does the mathematical model come into play afterward?

Humans are not as inherently proficient at 3D processing as other animals, so there is still much to be answered. I am simply relying on one thing: I expect some of the brightest minds among us to solve this problem.
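As a rough illustration of what “self-supervised” means for language in the answer above, the sketch below builds training targets from raw text alone, with no human labels. The function name `next_token_pairs`, the whitespace tokenization, and the fixed context size are assumptions made for illustration; they do not describe how any particular model is trained.

```python
from typing import List, Tuple

Pair = Tuple[Tuple[str, ...], str]


def next_token_pairs(tokens: List[str], context_size: int = 2) -> List[Pair]:
    """Build (context, next token) pairs; the text itself supplies the targets."""
    pairs: List[Pair] = []
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        pairs.append((context, tokens[i]))
    return pairs


tokens = "the cat sat on the mat".split()
for context, target in next_token_pairs(tokens):
    print(context, "->", target)
```

Every sentence yields its own supervision this way, which is part of why text scales so readily. For 3D worlds there is no equally cheap target lying around, which is consistent with Li’s suggestion that world models may need additional guiding signals or priors.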

The Utility of Outputted 3D Worlds

Finally, the discussion turns to World Labs’ specific ambitions: building a foundational model that outputs 3D worlds. This raises immediate questions about utility and the tension between generative and discriminative tasks in enterprise applications.

Fei-Fei Li: Regarding spatial intelligence, much like language models, th

The burden of proof for the metaverse’s viability has shifted from pure software ambition to a necessary convergence of hardware and software capabilities. As Fei-Fei Li argues, current limitations are not just about virtual environments but about the underlying generative infrastructure required to populate them.

My sense is the metaverse cannot survive on software alone; it needs robust world models. What concerns me is that hardware constraints remain a primary bottleneck for immersive experiences. I think enterprises should monitor spatial AI as a prerequisite for next-gen interfaces.

Li emphasizes that the utility of spatial intelligence extends far beyond gaming or creative design, reaching into robotics and industrial applications. She acknowledges the skepticism surrounding the metaverse but points to an impending technological convergence.

“Actually, I am very interested in the metaverse. I know many people still feel it doesn’t work, and I acknowledge that it currently cannot function as intended. However, I believe a convergence of hardware and software is on the horizon, which represents another excellent use case for the future.”

When pressed on this convergence by an interviewer who has previously attempted to solve these issues at other companies, Li identifies hardware as a critical obstacle. She links the metaverse’s success directly to content generation, which in turn relies on world models.

“I think hardware is part of the obstacle in the metaverse. The metaverse requires content generation, and content generation necessitates world models.”

Fearlessness in Thought

My sense is that her path from running a laundry to leading AI research underscores the value of operational grit over pedigree. I read the founding of HAI as a strategic move to anchor AI ethics before commercialization took hold. I think “fearlessness in thought” is a cultural filter, not just a hiring slogan.

For some viewers, Fei-Fei Li’s transition from academia to founder and CEO might seem abrupt. Yet her journey has been defined by building from scratch repeatedly. When she immigrated to the United States as a teenager without speaking English, she ran a laundry store for several years. I asked how these experiences shaped her current leadership style.

Fei-Fei Li: “I was 19 years old and needed to study physics at Princeton University, and I had no other way to support my family. So I opened a decent dry-cleaning business. In Silicon Valley terms, I started raising capital.”

She served as founder, CEO, and cashier. She notes that the current generation’s talent is striking compared to her own era. “Regardless of the roles, looking at you all makes me incredibly excited because your ages are roughly half mine, or perhaps only 30% of my age, yet you are so talented. Just go ahead and do what you want to do.”

Her academic path was equally unconventional. She joined institutions where she was the first computer vision professor, defying advice to seek established mentorship. “Of course, I would have appreciated having senior mentors there, but if they weren’t available, I would carve my own path and make my way without fear.”

After joining Google to understand corporate culture, she founded a startup at Stanford. By 2018, AI had become a global human issue. She emphasizes that while technology progresses, humanity must not be lost. “I care deeply about positive guidance in AI development and want AI to benefit humanity by being human-centric.”

This led her back to Stanford to establish the Institute for Human-Centered Artificial Intelligence (HAI), which she ran for five years. “Some may not understand this choice, but I am very proud of it. In a way, I feel that I truly love being an entrepreneur.”

She describes her comfort zone as starting from zero: “I enjoy the feeling of starting from ground level, forgetting everything done in the past, disregarding others’ opinions, and simply focusing on hard work and building.”

Beyond her achievements, Li has mentored legendary researchers like Andrej Karpathy, Jim Fan from NVIDIA, and Jia Deng, who co-created ImageNet with her. I asked what set them apart during their student years.

Fei-Fei Li: “First of all, I am a lucky person. I believe students mean more to me than anything else; they genuinely made me a better person, teacher, and researcher. As you said, having the honor of working with so many legendary students is truly one of the highlights of my life.”

She notes their diversity: pure scientists, industry leaders, and knowledge disseminators. However, she identifies one unifying trait that also serves as her recruitment criterion for World Labs: “I look for fearlessness in thought.”

“The courage and fearlessness to embrace difficult tasks, go all out, and find ways to resolve problems is the core trait of successful people. I learned this from them, and I am genuinely looking for individuals who possess this quality.”

This philosophy drives her current hiring at World Labs. When asked if she was recruiting extensively, she confirmed: “Yes, we are recruiting engineering talent, product talent, 3D specialists, and generative model experts. So, if you consider yourself fearless and passionate about solving spatial intelligence problems, please reach out to me or visit our website.”

Finding Life’s Optimal Solution via Gradient Descent

The burden of proof for AI advancement has shifted from resource-rich academia to the broader ecosystem. As I read through this exchange, it becomes clear that enterprises must verify where true innovation occurs—not just in compute-heavy labs, but in interdisciplinary spaces where data is scarce and problems are fundamental.

Audience Member 1: Hi, Fei-Fei, I’m your superfan. My question is: Over twenty years ago, you worked on visual recognition. If I want to start a PhD now, what direction should I choose to become a legend like you?

Fei-Fei Li: While I could say “do whatever excites you,” I’d rather give you a more thoughtful answer. First, I believe AI research has changed because academia no longer holds the majority of AI resources, which is very different from my era. Resources such as chips, compute power, and data are indeed scarce in academic settings.

As a PhD student, I suggest you seek out problems that can be solved without relying on superior computation or larger datasets. In academia, we can still identify fundamental issues where significant progress can be made regardless of how many chips you have.

Secondly, interdisciplinary AI is a very exciting field in academia, particularly in scientific discovery. There are too many disciplines to intersect with AI; I believe this is a promising area for theoretical advancement.

Interestingly, AI capabilities have already surpassed theory: we don’t know how to do certain things, we lack interpretability, we don’t know how to identify causality, and there are too many things we don’t understand… so people can continue pushing forward.

And this list could go on: in computer vision, there are still representation problems we haven’t solved. Additionally, small data is another very interesting area; these represent possibilities.

My sense is that enterprises should audit their reliance on massive datasets for competitive advantage. I also think interdisciplinary teams offer higher resilience against compute-centric market shifts, and that governance frameworks must address the interpretability gaps Li highlights.

Defining AGI: Theory vs. Functionality

The definition of Artificial General Intelligence remains a liability rather than an asset, creating ambiguity in regulatory compliance and strategic planning. I followed the release closely; what stood out to me was Li’s insistence that “scale is intelligence,” which simplifies a complex governance challenge into a technical metric that may not hold up legally or ethically.

Audience Member 2: Congratulations again on receiving the honorary doctorate from Yale University; I had the privilege of witnessing that moment there a month ago. My question is: In your view, is AGI more likely to emerge as a single, unified model or as a multi-agent system?

Fei-Fei Li: The way you phrased this question already reflects two different definitions. One definition is more theoretical, as if there were an IQ test that, once passed, defined AGI. The other definition is more functional: if it is agent-based, does it possess functionality, and what tasks can it perform?

Honestly, I am also confused by the definition of AGI. In 1956, AI pioneers like John McCarthy and Marvin Minsky gathered at Dartmouth with the goal of solving machine thinking. This was a question posed by Turing ten years earlier; in that statement, it wasn’t narrow AI but rather a formulation of intelligence itself.

So I am not quite sure how to distinguish between the definitions of AI and this new term, AGI. To me, they are one and the same. But I understand that today’s industry likes to refer to AGI as something beyond AI, which confuses me because I don’t know exactly what differentiates AGI from AI.

If we say that current AGI systems perform better than narrow AI systems from the 80s, 70s, or 90s, I believe this is simply a matter of progress in the field. Fundamentally, however, I think “scale is intelligence.” We aim to create machines capable of thinking and acting as intelligently as humans, or even more so.

I don’t know how to define AGI, nor do I know if it must be singular without defining it. If you view the brain as a whole, it indeed has different functions. There are specialized language areas, visual cortices, and motor cortices. So I really don’t know how to answer that question.

My sense is that regulatory bodies need clear definitions that go beyond industry marketing terms. What concerns me is whether “scale” actually correlates with safety outcomes; enterprises should verify that before relying on it.

The Role of Curiosity in Research and Enterprise

The distinction between academic curiosity and commercial viability is a critical governance boundary. I read this section carefully; the friction Li describes between pure inquiry and investor demands mirrors the tension we see in enterprise AI deployment, where risk appetite often overrides theoretical rigor.

Audience Member 3: It is truly inspiring to see a woman playing a leading role in this field. I would like to ask: With the rapid rise of AI, as a researcher, educator, and entrepreneur, what kind of person should pursue a graduate degree?

Fei-Fei Li: That’s a great question. It’s even something parents ask me. I believe graduate school is those four to five years when you are filled with intense curiosity. You are led by that curiosity. The drive is so strong that there is no better time than this period to satisfy it.

Pursuing a graduate degree differs from entrepreneurship because you cannot lead a startup solely on curiosity; otherwise, your investors would be furious. For a startup with clear business goals, curiosity is part of the motivation, but not the whole story.

For grassroots researchers, curiosity about solving problems or asking the right questions is crucial. I believe those who dive in with strong cur


My read: Open source strategy is a business decision, not a moral absolute. Data quality matters as much as sheer volume in world model development. Hybrid data approaches are the pragmatic path forward for AI labs. Resilience in STEM requires focusing on creation over identity politics.

The discussion shifted toward the structural integrity of the AI ecosystem, specifically regarding open-source governance. An audience member highlighted the fragmentation in current licensing models—ranging from fully closed to completely open—and asked Li how an AI company should navigate this landscape.

Audience Member 4: You mentioned that open source was a vital component of ImageNet’s development. Now, with the latest releases of large language models, we see organizations adopting different approaches to open source: some go fully closed-source, others release their entire research stack openly, and some fall in between by releasing weights or using restrictive licenses. So I’d like to ask: How do you view these different open-source methods, and what do you think is the correct way for an AI company to handle open sourcing?

Fei-Fei Li: I am not dogmatic about whether one must be open source or closed source. It depends on the company’s business strategy.

For example, Facebook/Meta’s reason for wanting to go open source is obvious: their current business model does not rely on selling models for profit. They are leveraging it to build an ecosystem so that people come to their platform. So, going open source makes sense for them.

Other companies make money through either open-source or closed-source models. So I am quite open-minded about this issue. I believe open source should be protected; open-source initiatives in both the public sector (such as academia) and the private sector are crucial for the startup ecosystem. I think it deserves technical protection.

My read: Enterprises must audit license restrictions before integrating third-party weights. Ecosystem lock-in drives Meta’s openness more than altruism does.

The conversation then turned to the data infrastructure required for World Labs’ spatial intelligence initiatives. An audience member pressed on the scarcity of high-fidelity spatial data, noting that such information exists in human cognition but not readily on the open internet, and asked whether Li’s team relies on real-world collection, synthetic generation, or historical priors.

Audience Member 4: I have a question about data: Since you are now working on world models, you pointed out the shift in machine learning toward data-driven methods represented by ImageNet, and you mentioned that such spatial data does not exist on the internet—it only exists in our minds. How do you solve this problem? Are you collecting this data from the real world? Is it synthetic data? Or do you rely on those ancient priors? Thank you.

Fei-Fei Li: You should join World Labs, and I’ll tell you then.

As a company, I can’t reveal too much, but I admit we are taking a hybrid approach. Having massive amounts of data is important, but having high-quality data is equally crucial. Ultimately, if you don’t pay attention to data quality, you will still get “garbage in, garbage out.”

My read: Hybrid data pipelines mitigate the risk of synthetic bias. Governance frameworks must enforce strict data provenance standards.

Li also addressed the personal challenges faced by women and immigrants in STEM fields, referencing her book The Worlds I See. When asked if she felt like a minority in the workplace and how she navigated those dynamics, she emphasized a universal experience of vulnerability rather than a singular identity-based struggle.

Fei-Fei Li's Latest Interview: AGI Is Incomplete Without Spatial Intelligence — figure 14

Fei-Fei Li: Thank you for asking this question. I want to answer very carefully and thoughtfully because we all come from different backgrounds, and everyone’s feelings are unique. Actually, it doesn’t matter who we are; we have all had moments where we felt like a minority.

Sometimes it depends on who I am, sometimes it is based on my ideas, and sometimes it is just about something trivial like the color of the shirt I’m wearing. But this is exactly what I want to encourage everyone to do: having come here from a young age, I have already tested…

As a female immigrant who understands the essence of this matter, I have almost cultivated an ability not to dwell on it excessively. Like each of you, I am here to learn, work, or create.

At the end of the interview, Fei-Fei Li offered her best wishes to all young people:

You are about to embark on a journey, or perhaps you are already in the midst of one. There will be moments of vulnerability, and strange things may happen. In entrepreneurship, I feel this way every day; sometimes I think, “Goodness, I have no idea what I’m doing.” But just focus on moving forward and find the optimal solution through gradient descent.

Interview link: https://www.youtube.com/watch?v=_PioN-CpOP0
