Interview with Alexandr Wang

Founder and CEO @ Scale.ai

by 20VC with Harry Stebbings, 2024-06-12

In a candid conversation with Harry Stebbings on 20VC, Alexandr Wang, CEO of Scale AI, challenged conventional wisdom about the current AI landscape and spotlighted what he sees as the true bottleneck to next-generation model performance. While the world obsesses over compute, Wang argues that the real race, and the potential differentiator, lies not in silicon but in data.

The Data Wall: Why Compute Isn't Enough Anymore

The interview dove straight into a provocative question: are we seeing diminishing returns in AI model performance, where more compute no longer guarantees better results? Wang's answer was a resounding "yes." He pointed out that despite an exponential surge in Nvidia GPU expenditure since late 2022 (from $5 billion to over $20 billion a quarter), we haven't seen a "jaw-droppingly better" base model than GPT-4, which predates this massive compute inflection.

Wang explained that AI progress rests on three pillars: compute, algorithms, and data. While compute has scaled dramatically, the other two haven't kept pace. Crucially, he believes the industry has hit a "data wall." The "easy data"—everything readily available on the open internet, scraped from common crawls or torrents—has largely been consumed. These models are now "exceptionally good at emulating the internet," but that's not enough for the complex tasks and reasoning required for true AGI or effective AI agents.

Key Insights:

  • AI progress relies on compute, data, and algorithms advancing in tandem.
  • Massive investments in compute post-GPT-4 haven't yielded commensurate leaps in base model performance.
  • The industry has largely exhausted "easy data" (internet data), leading to a performance plateau.

Forging the Frontier: Cultivating Data Abundance

To overcome this data wall, Wang introduced the concept of "Frontier data." He highlighted that much of the complex reasoning and problem-solving that powers today's economy – like a fraud analyst's deductive process – doesn't get written down online. This means models trained solely on internet data lack the ability to learn from this deeper human intelligence.

So, how do we capture this elusive Frontier data? Wang outlined two main avenues. First, there is a colossal trove of proprietary data locked within enterprises. He cited JPMorgan's 150 petabytes of internal data, which dwarfs the less than one petabyte of internet data used to train GPT-4. This data is highly sensitive, however, so enterprises would need to mine and refine it for their own AI systems, likely on-prem or with strong guarantees against external use.

Second, and more critical for generalized breakthroughs, is "forward data production": not just collecting existing data but creating new, highly complex data. This involves a "human-synthetic hybrid process" in which AI generates data and human experts act as "safety drivers," guiding the AI, correcting errors, and providing crucial input when models get stuck. Wang views these "AI trainers" or "contributors" as holding some of the highest-leverage jobs for societal impact. "As a human expert," he noted, "you have the ability to have society-wide impact by producing data to help improve these models."
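The "human-synthetic hybrid process" can be pictured as a simple loop: a model proposes the next step of a reasoning trace, and a human expert accepts it, corrects it, or supplies a step when the model is stuck. The Python sketch below is only an illustration of that idea, not Scale AI's actual pipeline. Every name in it (`Step`, `produce_trace`, `toy_model`, `toy_reviewer`) is hypothetical, and the model and reviewer are hard-coded stand-ins that borrow the fraud-analyst example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    text: str
    author: str  # "model" if the proposal was accepted as-is, else "human"


def produce_trace(
    task: str,
    model_step: Callable[[str, List[Step]], Optional[str]],
    human_review: Callable[[str, List[Step], Optional[str]], str],
    max_steps: int = 10,
) -> List[Step]:
    """Build one reasoning trace with a human acting as 'safety driver'."""
    trace: List[Step] = []
    for _ in range(max_steps):
        proposal = model_step(task, trace)            # None means the model is stuck
        final = human_review(task, trace, proposal)   # accept, correct, or fill in
        author = "model" if final == proposal else "human"
        trace.append(Step(final, author))
        if final.startswith("DONE"):
            break
    return trace


# Hypothetical stand-ins for a real model and a real expert reviewer.
def toy_model(task: str, trace: List[Step]) -> Optional[str]:
    script = ["check account age", None, "flag transaction", "DONE: escalate"]
    return script[len(trace)] if len(trace) < len(script) else None


def toy_reviewer(task: str, trace: List[Step], proposal: Optional[str]) -> str:
    if proposal is None:                      # model got stuck: expert supplies a step
        return "compare billing and shipping address"
    if proposal == "flag transaction":        # expert corrects an imprecise step
        return "flag transaction for manual review"
    return proposal                           # expert accepts the step unchanged


if __name__ == "__main__":
    for step in produce_trace("review a suspicious payment", toy_model, toy_reviewer):
        print(f"[{step.author}] {step.text}")
```

Recording who authored each step is a design choice of this sketch: it makes it easy to see where the expert had to intervene, which is the part of the trace that the open internet would not have supplied.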

Key Changes:

  • The transition from readily available "easy data" to "Frontier data" is essential for advanced AI.
  • Frontier data encompasses complex reasoning chains, tool use, and agentic behavior not found on the open internet.
  • Data abundance will be achieved through mining proprietary enterprise data and actively producing new, high-quality data.
  • New human roles will emerge to guide and correct AI systems in generating synthetic data, akin to autonomous vehicle safety drivers.

The Geopolitical Data Race: A New Cold War?

The conversation turned to the profound geopolitical implications of AI, a topic Wang believes is under-discussed. He stated starkly, "At its core this AI technology has the potential to be one of the greatest military assets that humanity has ever seen, potentially even more of a military asset than nukes." He painted a chilling scenario in which a totalitarian regime with AGI could conquer a nation without it.

Wang expressed significant concern over China's rapid AI progress. While two years ago Chinese labs might have been "nowhere near" GPT-4's capabilities, a recent Chinese model, Yi-Large from 01.AI, is now ranked among the world's best, just behind GPT-4o, Gemini, and Claude 3 Opus. He attributed this to the CCP's exceptional ability to implement "very aggressive centralized action and centralized industrial policy to drive forward critical industries." This pattern, seen in solar and EVs, suggests China has "a clear shot at racing forward and racing ahead of us." Given this, Wang believes there is a "dichotomy that must emerge": cutting-edge, truly powerful AI systems should be kept closed for military and geopolitical reasons, while less advanced, open models can continue to drive economic value.

Key Learnings:

  • AI, particularly AGI, could be humanity's most potent military asset, with profound geopolitical consequences.
  • China's centralized industrial policy enables rapid AI advancement, quickly closing the gap with Western capabilities.
  • A strategic distinction between open and closed AI systems is critical: cutting-edge models may need to be closed for security, while less powerful ones can remain open for broad economic benefit.

Redefining Competition: Data as the Ultimate Moat

In the fiercely competitive world of foundation models, Wang firmly believes data will be the ultimate differentiator. He explained that algorithms can eventually be reverse-engineered or become common knowledge, and compute can simply be purchased. "Data is one of the few areas," he asserted, "where you can actually produce a long-term sustainable competitive advantage." He cited OpenAI's partnerships with the Financial Times and Axel Springer as early indicators of this shift.

Wang predicted a future where AI leaders won't brag about their GPU count, but about "what data they have access to and what are their sort of unique rights to different data sources." This emphasis on unique, proprietary data will drive market differentiation. He also anticipates a significant shift in software, away from "walled garden" SaaS and toward highly customized, purpose-built applications for enterprises, reminiscent of Palantir's early approach. This will be fueled by AI dramatically lowering the cost of creating software, leading to a new era of personalized software solutions. Consequently, the long-standing per-seat pricing model will likely give way to consumption-based pricing that reflects the work done by both human employees and AI agents.

Key Insights:

  • Data is emerging as the primary and most durable competitive advantage in the foundation model race.
  • Future competition will revolve around proprietary data access, ownership, and the ability to produce unique datasets.
  • The commoditization of software creation will lead to bespoke, customized applications for enterprises, moving beyond generic SaaS.
  • Software pricing models will evolve from per-seat to consumption-based, aligning with value delivered by both humans and AI agents.

Shifting gears to company building, Wang shared his unconventional approach to public relations: "the best PR is no PR." He argued that traditional media, often driven by clicks, tends to sensationalize and distort narratives, building up and tearing down companies for engagement. He shared a surprising personal experience: "I've received more fair treatment testifying in front of Congress than I have from various media outlets over the years."

This perspective has led Scale AI to prioritize direct channels, such as podcasts and company blogs, where it can deliver its message authentically and without alteration. Owning the narrative in this way keeps the story in its "purest" form, fostering trust and clarity with its audience.

Key Practices:

  • Adopt a strategy of "no PR" or minimal engagement with traditional media to avoid sensationalism and narrative distortion.
  • Prioritize direct communication channels (podcasts, company content) for authentic and unaltered messaging.
  • Founders and companies must actively own and manage their narrative in an increasingly noisy information landscape.

"At its core this AI technology has the potential to be one of the greatest military assets that humanity has ever seen, potentially even more of a military asset than nukes." - Alexandr Wang