artificial-intelligence

Ride the Wave, Build the Future: Scientific Computing in an AI World

Posted on February 6, 2026February 6, 2026 by cloud4science

By Jack Dongarra, Daniel Reed, and Dennis Gannon

Abstract: The rapid rise of generative AI has shifted the center of gravity in advanced computing toward hyperscale AI platforms, reshaping the hardware, software, and economic landscape that scientific computing depends on. This paper argues that scientific and technical computing must “ride the wave” of AI-driven infrastructure while “building the future” through deliberate investments in new foundations. It presents seven maxims that frame the emerging reality: (1) HPC is increasingly defined by integrated numerical modeling and generative AI as peer processes; (2) energy and data movement—not peak FLOPS—are the dominant constraints, motivating “joules per trusted solution” as a primary metric; (3) benchmarks should reflect end-to-end hybrid workflows rather than isolated kernels; (4) winning systems require true end-to-end co-design, workflow first; (5) progress demands prototyping at scale with tolerance for failure; (6) curated data and trained models are durable strategic assets; and (7) new public–private collaboration models are essential in an AI-dominated market. The paper concludes with a call for a national next-generation system design “moonshot” targeting orders-of-magnitude reductions (≈1/100) in energy per validated scientific outcome via energy-aware algorithms, architecture innovation focused on memory/interconnect efficiency, and software stacks that optimize hybrid AI+simulation workflows.

1. Introduction

In 2023 [18], we argued that the center of gravity in advanced computing had already shifted away from traditional scientific and engineering high-performance computing (HPC), with the locus of influence now centered on hyperscale service providers and consumer smartphone companies. We enumerated five maxims to guide future activities in HPC:

Semiconductor constraints dictate new approaches,
End-to-end hardware/software co-design is essential,
Prototyping at scale is required to test new ideas,
The space of leading-edge HPC applications is far broader now than in the past, and
Cloud economics have changed the supply-chain ecosystem.

Since then, given the meteoric rise of generative artificial intelligence (AI), the computing landscape has shifted more dramatically than even the most disruptive technology forecasts might have anticipated. Today, the dominant computing markets are unequivocally AI-driven; the energy and cooling demands of hyperscale systems are measured in hundreds of megawatts, making them public issues; high-precision floating point hardware is giving way to reduced precision arithmetic in support of AI models; and national strategies increasingly treat AI-capable clouds and scientific supercomputers as a fused strategic resource, with deep geopolitical implications.

Consequently, scientific and technical computing is increasingly a specialized, policy-driven niche riding atop hardware and software stacks optimized for other, much larger markets. The challenge for scientific computing is to adapt to this rapidly changing world, albeit with a more holistic perspective on the global landscape, one that looks beyond the narrow, but important design of next-generation computing systems to how an integrated ecosystem of new, nascent, and still-to-be developed computing technologies enables scientific discovery, economic opportunities, public health, and global security. We must ride the wave of AI, while simultaneously building the future.

In this paper, we outline seven new maxims that define the present and the future of advanced scientific computing. From these new maxims, we conclude with a proposal for a “moon shot” to build a new foundation for future computer systems for research, one that would benefit both scientific computing and AI..

2. Current Technical and Economic Reality

Each high-performance computing transition has been driven by a combination of market forces and semiconductor economics, requiring the scientific computing community to develop and embrace new algorithms and software to use the systems effectively. Each time, there were those who initially resisted inevitability, only to suffer the consequences of delayed adoption, whether clinging to vector supercomputers or refusing to embrace scalable message passing. Today is no different. The scientific computing community must again adapt and embrace the new realities of our AI-dominated technology world.

The first sea change is one of economic and technical influence. The scientific computing community has long been a driver of computing innovation, even in the commodity hardware space, by specifying and buying the earliest and largest instances of new technology. Today, that is no longer possible, especially under current procurement models. Today, the scale of “AI factories” dwarfs that of even the fastest machines on the list of the TOP500 supercomputers, and the gap widens each year.

Moreover, unlike the rise of the modern microprocessor, when all hardware was available for public purchase, a substantial portion of the most advanced AI hardware is designed and built by the AI hyperscalers themselves. Prominent examples include Google’s TPUs [7], Amazon’s Trainium [24], and Microsoft’s Maia hardware. The largest clusters and newest accelerator generations are often accessible only to internal AI teams within the hyperscaler or to a small set of strategic partners under commercial terms.

Although both scientific computing and generative AI benefit from high floating point operation rates, machine learning flourishes with 32, 16, 8, and even 4-bit operands. In contrast, scientific computing has long depended on high-precision, 64-bit floating point. The shift in hardware design points for hardware designed by both hyperscalers and NVIDIA, the largest supplier of AI accelerators, raises important concerns for traditional computational modeling.

In addition, the now mainstream cloud software ecosystem, including storage systems, scheduling models, and software services, differs markedly from current technical computing practices. Lest this seem heretical, remember that UNIX and open source software were once viewed as high risk by the scientific computing community, even as they became mainstream in the commercial computing world.

3. Modeling and AI As Peer Processes

Maxim One: HPC is now synonymous with integrated numerical modeling and generative AI.

The need to embrace AI is more than an economic imperative; it is also an intellectual and scientific necessity. Just as computational science became a complement to theory and experiment, later augmented by data science [25], HPC and AI are now peer processes in scientific discovery. Both are now needed to integrate deductive (computational science) and inductive (learning from data) models.

It is worth pausing to understand why there was initial resistance to AI in the computational science community. First, traditional computational simulation and modeling are deductive, based on mathematical models of phenomena based on the laws of classical or quantum physics, typically expressed as discretized differential equations. This approach reflects the classical mathematical and scientific training of most computational scientists.

In contrast, generative AI models are inductive, with models trained using large volumes of data. Just as computational models can approximate solutions to differential equations to arbitrary precision, so too can AI models learn to approximate unknown functions to arbitrary precision. Crucially, it is not a matter of choosing to invest in simulation and modeling or AI. Both are critical and complementary, each offering capabilities and efficiencies lacking in the other.

Consider weather modeling, an area long dominated by complex, numerical models. When trained on 40 years of analysis, AI can predict 10-day forecasts in seconds rather than hours, with results now competitive with the European Center for Medium Range Weather Forecasts (ECMWF) on standard metrics [11, 12, 26]. In biology, the protein folding systems, AlphaFold and RoseTTAFold, accurately predict protein 3-D structure from sequences [8,9], which many now consider to be a solved problem. AI is also a great help with inverse problems. Similarly, the AI diffusion methods used to create images can also be used to remove noise and reconstruct diagnostic-quality medical images [27]. Similar techniques can aid in searching for gravitational lensing in large scale survey data [28]. Drug and materials discovery have also been aided by AI methods that reduce search spaces prior to expensive experimentation.

Despite their great promise, AI methods are not without problems, just as numerical models face challenges regarding uncertainty quantification. Simply put, AI methods fail when applied outside the boundaries of their training data. As we noted earlier, AI methods have proven highly effective for weather prediction given historical data, but they are unable to predict the emergence of chaotic, rare events such as tornadoes. In contrast, tornadoes can now be predicted with HPC fine-grained CFD simulations, an example of the complementary utility of AI and numerical models. Nor can generative AI models readily incorporate well-known physical laws, though physics-based neural networks offer promise.

The complementary strengths and weaknesses of numerical and AI models has led to their integration as hybrid models, notably the use of AI models as numerical surrogates. First, one trains a neural network to approximate an expensive simulation, then uses the AI surrogate for rapid parameter space exploration – taking care to not push beyond its domain of applicability,, and finally uses the computationally intensive numerical simulation for verification of promising results. Similarly, for adaptive grid methods, AI can also be used to predict the region where mesh refinement may be most beneficial. These hybrid techniques incorporate the AI directly into the workflow of a large scale HPC computation.

The message is clear. AI and numerical models each have advantages and domains of applicability. Equally important, their integration creates opportunities not possible with either alone.

4. Energy and Data Movement Dominate

Maxim Two: Energy and data movement, not floating point operations, are the scarce resources.

Energy As a Design Constraint

As semiconductor scaling has slowed and architectural complexity has grown, energy consumption and heat dissipation have become limiting factors for both AI data centers and traditional supercomputers. Systems that draw hundreds of megawatts now define flagship deployments, driven by both the rising scale of deployments and the energy requirements of modern semiconductors. At these scales, every aspect of system design becomes an energy problem: how to deliver power from the grid, how to remove heat efficiently, and how to align operations with carbon reduction commitments. Liquid cooling isde rigueur with direct-to-chip, immersion, and hybrid schemes now the norm.

In this context, traditional performance metrics such as peak floating point operations per

second (FLOPS) or even time-to-solution are no longer sufficient. What matters is “joules per solution”—the total energy cost of producing a scientifically meaningful answer or training a model to an acceptable level of quality. This metric forces new trade-offs among fidelity, resolution, model size, and energy consumption. It also highlights the role of algorithmic innovation: mixed-precision methods, communication-avoiding algorithms, data compression, and smarter sampling and surrogate models can all reduce joules per solution, sometimes dramatically, without sacrificing reliability.

Critically, the time scales for computing system design and energy infrastructure decisions are increasingly mismatched. A new hyperscale data center and associated computing infrastructure can be designed and built in a few months. Upgrading power generation, transmission, or distribution infrastructure often takes much longer, especially when it involves regulatory approvals, environmental review, and large capital projects. This asymmetry means that unless the system design also includes building and operating a utility (e.g., a reactor or wind farm), the power envelope for systems is often effectively fixed years in advance, long before architectural details are finalized. As a result, future systems must be conceived as configurations that operate within pre-defined energy and cooling budgets, not as free variables to be optimized later.

Consequently, as Figure 1 shows, the energy demand for AI factories is now outpacing the capacity of energy grids [33]. In addition to the mismatch in construction timescales, it also reflects inadequate investment, at least in the U.S., in grid modernization. Rising energy demand, from both the proliferation of data centers and their growing scale, is now a bottleneck for data center deployment. In consequence, some hyperscalers are now embracing temporary solutions, such as arrays of gas turbine generators.

Sustainability is no longer a public-relations story; it is a design constraint and an operating condition. Policy mandates, institutional climate goals, and community expectations will increasingly require large-scale computing projects to quantify and justify their energy usage in terms of joules per solution, not just peak capability. Energy efficiency must be a first-class objective across hardware, software, and workload design—not as a downstream optimization once the systems are built.

Data Movement Costs and Floating Point Arithmetic

In the past, the energy cost of arithmetic operations dominated. Today, moving data (within and between chips) consumes more energy than the arithmetic operations enabled by that data movement, yet our measures of software efficiency still center on arithmetic operation counts. Simply put, performance metrics that ignore power and communication costs encourage architectures that look impressive on paper but are increasingly impractical to operate at scale.

If facilities are to operate within tight energy envelopes while supporting both AI and high-fidelity simulation, algorithmic co-design must also extend beyond kernels and into the fundamental treatment of precision and data movement. In this view, arithmetic precision and communication are not merely implementation details; they are explicit algorithmic resources to be budgeted alongside time and memory.

This shift has already begun, with hardware designed for AI already focusing on reduced precision arithmetic to reduce energy and data movement costs. NVIDIA’s latest hardware exemplifies this trend, as illustrated in Table 1.

Operations	Peak Performance
	2022	2024	2026
	NVIDIA Hopper (H200)	NVIDIA Blackwell (B200)	NVIDIA Vera Rubin
FP64 FMA	33.5 TFLOPS/s	40 TFLOPS/s	33 TFLOP/s
FP64 Tensor Core	67 TFLOPS/s	40 TFLOPS/s	33 TFLOP/s
FP16 Tensor Core	989 TFLOPS/s	2250 TFLOPS/s	4000 TFLOP/S
BF16 Tensor Core	989 TFLOPS/s	2250 TFLOPS/s	4000 TFLOP/S
INT8 Tensor Core	1979 Teraops/s	4500 Teraops/s	2500 Teraop/s
Memory bandwidth	4.8 TB/s	8 TB/s	22 TB/s

Table 1 NVIDIA Floating Point Performance

Mixed-precision methods exemplify this shift [13,14]. Rather than assuming uniform 64-bit (FP64) floating point arithmetic, future numerical solvers will partition computations across FP64, FP32, BF16, FP8, and integer-emulated formats, using high precision only where it is most needed for stability or accuracy. Iterative refinement [21], stochastic rounding [23], randomized sketching [22], and hierarchical preconditioners [20] will allow most floating point operations to be executed on low-precision units. At the same time, small high-precision components provide correction and certification. In AI workflows, similar ideas apply to training and inference, with dynamic precision schedules and quantization strategies tuned to minimize joules per unit of practical learning.

Communication-avoiding and energy-aware algorithms add a complementary dimension [15]. Classical work on minimizing messages and data movement must be reinterpreted in the context of modern communication fabrics, offload engines, and hierarchical memory systems. Runtimes will need to be aware of both energy and communication costs, scheduling tasks to minimize expensive data motion across racks or facilities and to exploit near-memory or in-network computation where possible. Hybrid AI+simulation workflows will rely on asynchronous, event-driven communication patterns that allow different parts of the system to operate at their own natural time scales without constant global synchronization.

This algorithmic work must be conducted in deliberate co-design with emerging hardware—just as hyperscalers already do for AI, where they face similar energy cost and data movement challenges [29]. Scientific computing cannot simply await new architectures and adapt afterward. Instead, targeted collaborations are needed in which hardware features (numerical precision formats, on-die networks, memory hierarchies, and DPUs) are shaped in dialogue with scientific algorithms, and in which software stacks expose those features in usable, portable ways.

5. Benchmarking and Evaluation

Maxim Three: Benchmarks are mirrors, not levers.

Performance metrics such as High-Performance Linpack (HPL), High-Performance Conjugate Gradient (HPCG), or any other next-generation benchmark reflect the systems vendors are already building; they rarely reshape the broader market trajectory on their own. Put another way, they generally reward incremental improvements rather than transformative alternatives.

New benchmarks must span both simulation and AI partitions, exercising end-to-end workflows rather than isolated kernels. For example, a climate benchmark might couple high-resolution dynamical core simulations with AI-based subgrid parametrizations and data assimilation, measuring not only time-to-solution but also energy consumed, data moved, and robustness of the resulting forecasts. A materials benchmark might link quantum-level calculations, surrogate models, and large-scale screening workflows.

Energy- and carbon-aware metrics should be central, not peripheral. Joules per trusted solution—and, where possible, estimated emissions per solution—provide a more meaningful measure of a system’s value than peak floating point performance. Benchmarks can incorporate these metrics directly, reporting performance as a Pareto frontier among time, energy, and fidelity. This will encourage architectures and algorithms that balance, rather than chasing single-number records.

Equally important is the need to benchmark the data fabric itself. Future metrics should stress test data ingestion from instruments, movement across simulation and AI partitions, access to long-term archives, and enforcement of security and access policies. They should evaluate not just raw bandwidth and latency, but also how well facilities support governed, equitable access to data and models—key concerns for national platforms that serve diverse communities.

Finally, benchmarks should reflect the hybrid nature of public-private computing infrastructure. Some workloads will span on-premise facilities and secure cloud regions; others will rely heavily on AI services coupled with local simulations. Measurement frameworks must be able to attribute performance and energy across these boundaries, enabling comparisons of different design and deployment choices.

In short, if we want design patterns for future scientific facilities that genuinely align with societal and scientific goals, we must update the mirrors we use to see ourselves. New benchmarks and metrics—rooted in AI+simulation workflows, energy and carbon efficiency, and equitable access—are as essential as new chips, racks, and cooling systems.

6. Co-Design Really Matters

Maxim Four: Winning systems are co-designed end-to-end—workflow first, parts list second.

Although the hyperscaler and AI community has aggressively embraced hardware-software co-design, in scientific computing, the story is less encouraging. There are notable examples of co-design in specific missions—fusion devices, climate modeling initiatives, and some exascale application teams have worked closely with vendors to shape features or software paths. However, most production scientific codes must still adapt to extant architectures. Porting and tuning cycles are long; exploitation of new features (tensor cores, DPUs, new memory tiers) is partial, ad hoc, and large segments of the scientific software ecosystem remain effectively frozen on older models of the machine.

Is this because the community is risk-averse, or simply because it is resource-constrained? The honest answer is both. Co-design at scale requires sustained funding, institutional continuity, and the ability to place substantial bets on uncertain outcomes. In reality, most scientific teams operate with fragmented funding and short time horizons; they cannot afford to gamble entire codes on speculative hardware features. Most tellingly, this has proven true even for the largest, mission-driven applications such as nuclear stockpile stewardship. Meanwhile, vendors are understandably reluctant to optimize for niche workloads when AI and cloud customers dominate revenue.

The net result is that co-design remains the exception rather than the rule in scientific computing. Where it has worked, it has done so in contexts that resemble AI—concentrated workloads, strong institutional commitment, and substantial aligned resources. For co-design to enable a broader spectrum of scientific codes, governance and funding structures must look more like those of AI ecosystems: fewer, more focused efforts with the scale and longevity to justify genuine hardware–software co-evolution.

7. Prototyping at Scale

Maxim Five: Research requires prototyping at scale (and risking failure), otherwise it is procurement.

In 2023 [18], we advocated for more aggressive prototyping of next-generation systems at scale. The idea was simple – if we want new architectures and programming models, ones better matched to the needs of scientific computing, we must first build and let real users test them in realistic configurations. Since then, we have seen a handful of promising large-scale prototypes and early-access systems. Nevertheless, these efforts remain scattered and, in many cases, closed or narrowly scoped, with inadequate funding and little ability to take calculated risks.

Such prototyping and development will require larger scale investments (i.e., tens of millions of dollars), either in startup companies or laboratory teams, that embrace targeted technological risks (e..g, custom chiplets) that leverage the extant hardware ecosystem. Only with scalable testbeds can new hardware, software stacks, and energy-management strategies be exercised by a wide range of scientific workloads under realistic conditions. This is neither simple nor easy, but it is essential if we are to address the limitations of hardware designed for commercial markets.

Equally importantly, advanced prototyping means being willing to accept failure while drawing lessons from the failure. Put another way, we must embrace calculated risks to explore promising new ideas. Such risk-taking was once more common in computing. One need look no further than the 1960s experiments with the IBM Stretch and the Illinois/Burroughs ILLIAC IV, followed more recently by DARPA’s targeted parallel computing program in the 1990s, which led to a host of novel parallel hardware prototypes, including the Stanford DASH and Illinois Cedar systems.

Pursued seriously, advanced prototyping may push scientific+AI HPC toward a “bespoke instrument” model. Rather than building generic machines and layering everything on top, designs might explicitly target particular classes of workflows (e.g., climate + energy systems, fusion + materials, or life sciences + health analytics) with algorithmic patterns, precision strategies, and data topologies tuned to those missions. The challenge will be to retain enough generality and openness that such bespoke instruments remain shared national resources, not single-experiment machines.

Software Stack Interoperability and Malleability

Nor can the world of prototypes be limited to software; it must also encompass interoperability between computational modeling and cloud services. In a world where traditional supercomputing and modern AI clouds are not separate worlds but interoperable layers, a climate scientist, materials chemist, or nuclear engineer would move fluidly between running large-scale simulations on government HPC systems, invoking scientific foundation models hosted in secure clouds, and using AI agents to orchestrate end-to-end workflows that span both environments.

Alternative Computing Models

Building the future means more than just riding AI hardware trends, it also means investing in alternative computing models, ones that address precisely those areas where constraints are becoming first-order: energy, data movement, and domain-specific computing.

For example, neuromorphic computing [30] can be more aptly characterized as an “energy-first” approach for event-driven, sparse inference, or control. Asynchronous, spiking networks with co‑located memory and compute are inherently suited to always‑on sensing, edge scientific instrumentation, autonomous laboratories, fast triggers, and adaptive control. The priority, not just in neuromorphic computing, but in sensing generally, really, ever since Einstein’s earliest days in physics, has been ‘act quickly, with minimal joules.’

Quantum computing [31 also represents an accelerator for a class of problems. Specifically, a quantum computer can be integrated into a hybrid processing pipeline involving chemistry/materials simulations (specific electron-structure problems), small- to medium-scale combinatorial optimization, sampling problems, and perhaps cryptology/security applications. However, the bar is relatively high, as the potential to lower the cost of communications and synchronization is becoming increasingly dominant.

8. Multidisciplinary Data Curation and Fusion

Maxim Six: Data and models are intellectual gold.

In an era when many countries can buy similar hardware and access similar cloud platforms, the differentiators are increasingly the quality of curated datasets, the sophistication of the trained models, and the legal and institutional frameworks that govern their use. High-value scientific datasets—long climate reanalyses, fusion diagnostics, high-resolution Earth observation archives, curated materials, and molecular databases—are expensive to generate and maintain.

When combined with frontier AI and hybrid AI+simulation workflows, they allow a given amount of computation to yield more insight, faster and more reliably, than would otherwise be possible. Similarly, scientific foundation models trained on such data—models for weather, climate, molecular design, materials discovery, or engineering design—become reusable assets that can be fine-tuned, coupled to simulations, and deployed across a wide range of applications.

Data stewardship must be a central element of national and institutional strategy. Investments in high-quality metadata, provenance tracking, curation, and long-term preservation are investments in future scientific leverage. Thus, the design and training of scientific foundation models must be treated as infrastructure. Just as we do not rebuild compilers and linear algebra libraries for every application, we should not treat domain foundation models as disposable experiments.

9. New Public-Private Partnerships

Maxim Seven: New collaborative models define 21st-century computing.

Frontier AI+HPC has moved from the realm of research strategy to national geopolitical policy. Executive orders and national strategies now explicitly identify AI+science platforms, secure cloud AI, and supercomputers as components of national competitiveness and security. Genesis-style [17] missions recast a historically technical conversation as a matter of national priority.

Concurrently, the shift to an AI-dominated computing market forces a rethinking of how to fund and organize scientific computing. In a world where hyperscalers and AI platform companies set the pace of hardware innovation, traditional models—incremental upgrades to on-premise systems funded through periodic capital campaigns—are no longer sufficient to sustain leadership in HPC for science. Instead, future government funding models must recognize that advanced computing is now a mixed public–private ecosystem, in which strategic consortia, pre-competitive platforms, and mission-driven initiatives play central roles.

In turn, this means articulating explicit AI+HPC requirements linked to national and global challenge problems – climate resilience, health, energy transition, national security, and economic competitiveness. Funding calls that tie hardware, software, data, and workforce development together—anchored in concrete mission outcomes—are more likely to produce durable ecosystems than one-off hardware acquisitions.

Genesis-style initiatives are one example of this logic: they frame AI+science platforms as critical infrastructure for national goals rather than as isolated technology experiments. The core lesson is that publicly funded scientific computing cannot succeed by passively purchasing available computing hardware. It needs proactive, coalition-based funding models that treat AI+HPC as a long-term strategic national asset, integrating hardware, software, data, and people under coherent missions.

10. Implications for the Future

The old model of HPC as a dominant, self-directed driver of advanced hardware and software has ended. Indeed, it arguably ended decades ago, with the emergence of clusters based on commodity microprocessors. Absent strategic investment in new architectures, what remains is a role dependent on AI-centric, hyperscaler investments for technology advances.

In such a world, Genesis is a pragmatic bridge into the AI-factory era, but it should not become the ceiling of our ambition. “AI factories” cannot continue growing without bounds; there are practical energy and carbon constraints. Equally importantly, the future trajectory of semiconductor innovation and cost curves is also uncertain.

If the dominant commercial trajectory is toward ever larger, ever more energy-intensive clusters (e.g., xAI-style “Colossus” builds, Oracle’s OCCI-class deployments, and other zettascale-aspirational AI campuses), then science needs a countervailing national program whose primary objective is not peak capability, but orders-of-magnitude reduction in joules per trusted solution.

We believe the scientific computing community must play a distinctive role in reshaping this ecosystem. This includes serving as a co-designer of AI infrastructure, drawing on decades of experience in numerical methods, performance engineering, and uncertainty quantification to collaborate on the design of AI-centric systems that support both scientific computing and AI-mediated discovery. Doing so will require embracing new models of collaborative public-private partnership, identifying leverage points where early research can shape technology futures.

11. A Call To Action: A National Next-Generation System Design Moonshot

Consider the following Gedanken challenge: deliver the same validated scientific results as today’s frontier AI datacenters, but at roughly 1/100th the energy per solution? Such a target requires a fundamentally different design point that includes: energy-proportional computing [32], extreme data-movement frugality, and algorithm-architecture co-design that treats numerical precision, communication, and verification as first-class resources, not afterthoughts.

Why has this not been the default design point, and a sociotechnical imperative, given the clear and ever more looming challenges of today’s approach? Simply put, because it is far more challenging than incrementalism and procurement. A true moonshot requires accepting risk (and failure), building prototypes early, and resisting the temptation to equate “national leadership” with the largest single installation. It also challenges existing incentives: vendors optimize for hyperscale utilization; government procurement cycles favor incremental upgrades; and “largest machine” headlines still crowd out efficiency metrics.

The scientific case for such a moonshot is compelling. AI factories and HPC systems face similar technical challenges, including inadequate memory bandwidth, high and rising energy requirements, and semiconductor scaling issues. Moreover, many of the highest-value workflows (i.e., climate and weather ensembles, materials screening, fusion design loops, health analytics, inverse problems, and hybrid AI+simulation pipelines) scale best when one can run many jobs in parallel with predictable energy cost. A fleet of smaller, efficient systems can deliver more scientific throughput per dollar and per megawatt than a single monolithic machine, while improving resilience, availability, and breadth of access.

Note that we are not suggesting that we abandon the desire for higher performance, merely that our current approach to increasing performance has reached the point of diminishing returns. We must first rebuild the foundations of computing, then leverage these foundations to build both leading edge systems and a set of grid-deployable “science engines” – modular systems small enough to locate at multiple research institutions and regional power nodes, and numerous enough to support diverse communities.

In many ways, computing became most transformative when it became small enough and economical enough for personal use; the national analogue is to make advanced capability compact, repeatable, and ubiquitous enough that science can own the workflows end-to-end. The same is true for AI engines; broad access is needed for scientific discovery.

Concretely, such a moonshot would couple (i) aggressive energy-aware algorithms (mixed precision with certification, communication-avoiding methods, learned surrogates with validation), (ii) architecture innovation focused on memory and interconnect efficiency rather than raw FLOPS, and (iii) software stacks that measure and optimize joules per trusted outcome across hybrid AI+simulation workflows. The outcome of such a project would not replace Genesis; it would complement it, making sure that public science is not forever constrained to renting computing and storage resources designed for someone else’s business model.

References

[1] G. E. Moore, “Cramming More Components Onto Integrated Circuits,” Electronics, vol. 38, no. 8, pp. 114–117, Apr. 1965. DOI: https://doi.org/10.1109/JSSC.1965.1051903

[2] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed. Morgan Kaufmann, 2019.

[3] R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V. L. Rideout, E. Bassous, and A. R. LeBlanc, “Design of Ion-Implanted MOSFET’s with Very Small Physical Dimensions,” IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256–268, 1974. DOI: https://doi.org/10.1109/JSSC.1974.1050511

[4] J. Dongarra et al., “The International Exascale Software Project Roadmap,” International Journal of High Performance Computing Applications, vol. 25, no. 1, pp. 3–60, 2011. DOI: https://doi.org/10.1177/1094342010391989

[5] OpenAI, “AI and Compute,” OpenAI Blog, 2018.

[6] T. B. Brown et al., “Language Models are Few-Shot Learners,” Advances in Neural Information Processing Systems, vol. 33, 2020.

[7] N. P. Jouppi et al., “In-datacenter Performance Analysis of a Tensor Processing Unit,” in Proc. 44th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), 2017. DOI: https://doi.org/10.1145/3079856.3080246

[8] J. Jumper et al., “Highly Accurate Protein Structure Prediction with AlphaFold,” Nature, vol. 596, no. 7873, pp. 583–589, 2021. DOI: https://doi.org/10.1038/s41586-021-03819-2

[9] M. Baek et al., “Accurate Prediction of Protein Structures and Interactions Using a Three-Track Neural Network,” Science, vol. 373, no. 6557, pp. 871–876, 2021. DOI: https://doi.org/10.1126/science.abj8754

[10] G. Carleo et al., “Machine Learning and the Physical Sciences,” Reviews of Modern Physics, vol. 91, no. 4, p. 045002, 2019. DOI: https://doi.org/10.1103/RevModPhys.91.045002

[11] S. Rasp, M. S. Pritchard, and P. Gentine, “WeatherBench: A Benchmark Dataset for Data-Driven Weather Forecasting,” Journal of Advances in Modeling Earth Systems, vol. 12, no. 11, 2020. DOI: https://doi.org/10.1029/2020MS002203

[12] R. Nguyen et al., “Learning Skillful Medium-Range Global Weather Forecasting,” Science, vol. 382, pp. 1416–1422, 2023. DOI: https://doi.org/10.1126/science.adi2336

[13] N. J. Higham, “Accuracy and Stability of Numerical Algorithms,” 2nd ed. SIAM, 2002. DOI: https://doi.org/10.1137/1.9780898718027

[14] A. Haidar, S. Tomov, J. Dongarra, and N. Higham, “Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed Up Mixed-Precision Iterative Refinement Solvers,” in Proc. SC18, 2018. DOI: https://doi.org/10.1109/SC.2018.00034

[15] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, “Communication-Avoiding Algorithms,” Acta Numerica, vol. 23, pp. 1–111, 2014. DOI: https://doi.org/10.1017/S0962492914000038

[16] U.S. Congress, “CHIPS and Science Act of 2022,” Public Law 117-167, Aug. 9, 2022.

[17] Executive Office of the U.S. President, “Executive Order on the American Science and Security Platform and the Genesis Mission,” Washington, DC, 2025, https://www.whitehouse.gov/presidential-actions/2025/11/launching-the-genesis-mission/

[18] Reed. D., Gannon, D., Dongarra, J., “HPC Forecast: Cloudy and Uncertain,” Communications of the ACM, Vol. 66, No. 2, pp. 82-90, https://doi.org/10.1145/3552309, January 2023.

[19] Price, I., Sanchez-Gonzalez, A., Alet, F. et al. Probabilistic Weather Forecasting with Machine Learning. Nature 637, 84–90 (2025). https://doi.org/10.1038/s41586-024-08252-9

[20] Halko, N.; Martinsson, P.-G.; Tropp, J. A. “Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.” SIAM Review, 2011. DOI: 10.1137/090771806.

[21] Abdelfattah, Anzt, Boman, Carson, Cojean, Dongarra, et al. A Survey of Numerical Linear Algebra Methods Utilizing Mixed-Precision Arithmetic, Int’l J. High Performance Computing Applications (2021). DOI: 10.1177/10943420211003313.

[22] Riley Murray, James Demmel, Michael W. Mahoney, et al., “Randomized Numerical Linear Algebra: A Perspective on the Field With an Eye to Software” (arXiv:2302.11474v2, Apr 12, 2023).

[23] Croci, Fasi, Higham, Mary, Mikaitis, “Stochastic Rounding: Implementation, Error Analysis and Applications,” Royal Society Open Science, 2022. DOI: 10.1098/rsos.211631.

[24] Xinwei Fu, Zhen Zhang, Haozheng Fan, Guangtai Huang, Mohammad El-Shabani, Randy Huang, Rahul Solanki, Fei Wu, Ron Diamant, and Yida Wang, Distributed Training of Large Language Models on AWS Trainium, SoCC ’24: Proceedings of the 2024 ACM Symposium on Cloud Computing, pp. 961-976, https://doi.org/10.1145/3698038.369853

[25] Tony Hey, Stewart Tansley, and Kristin Tolle, eds., The Fourth Paradigm: Data-Intensive Scientific Discovery (Redmond, WA: Microsoft Research, 2009).

[26] Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen. Weihua Hu, Alexander Merose, Stephan Hoyer https://orcid.org/0000-0002-5207-0380, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, and Peter Battaglia. Learning Skillful Medium-Range Global Weather Forecasting,” Science, 382,1416-1421(2023).DOI:10.1126/science.adi2336

[27] Mohammed Alsubaie, Wenxi Liu, Linxia Gu, Ovidiu C. Andronesi, Sirani M. Perera, and Xianqi Li, “Conditional Denoising Diffusion Model-Based Robust MR Image Reconstruction from Highly Undersampled Data,” March 2025, https://arxiv.org/html/2510.06335v1

[28] Supranta S. Boruah and Michael Jacob, “Diffusion-based Mass Map Reconstruction From Weak Lensing Data,” February 2025, https://arxiv.org/html/2502.04158

[29] Xiaoyu Ma and David Patterson, “Challenges and Research Directions for Large Language Model Inference Hardware,” https://arxiv.org/abs/2601.05047, 2026

[30] Dennis V Christensen, Regina Dittmann, Bernabe Linares-Barranco, Abu Sebastian, Manuel Le Gallo, Andrea Redaelli, Stefan Slesazeck, Thomas Mikolajick, Sabina Spiga, Stephan Menzel, “2022 Roadmap on Neuromorphic Computing and Engineering, ”022 Neuromorphic Computing and Engineering, 2 02250, DOI 10.1088/2634-4386/ac4a83, 2022

[31] National Academies of Sciences, Engineering, and Medicine. Quantum Computing: Progress and Prospects. National Academies Press, 2019 (DOI: 10.17226/25196).

[32] Luiz A. Barroso and U. Hölzle, “The Case for Energy-Proportional Computing,” IEEE Computer, 40 (12): 33–37. doi:10.1109/mc.2007.443. S2CID 6161162, 2007

[33] Arman Shehabi, Sarah Josephine Smith, Alex Hubbard, Alexander Newkirk, Nuoa Lei, Md AbuBakar Siddik, Billie Holecek, Jonathan G Koomey, Eric R Masanet, and Dale A Sartor, “2024 United States Data Center Energy Usage Report,” Lawrence Berkeley National Laboratory, DOI 10.71468/P1WC7Q, 2024

Jack Dongarra is Professor Emeritus at the University of Tennessee, EECS Department, Knoxville, Tennessee, USA and Univeristy of Manchester, UK.

Daniel Reed is a Presidential Professor at the University of Utah, Computer Science and Electrical & Computer Engineering, Salt Lake City, Utah, USA.

Dennis Gannon is Professor Emeritus at the Indiana University, Luddy School of Informatics, Computing and Engineering, Bloomington, Indiana, USA.

Augmenting Generative AI with Knowledge Graphs

Posted on February 17, 2024February 17, 2024 by cloud4science

Introduction

As an organization or enterprise grows, the knowledge needed to keep it going explodes. The shear complexity of the information sustaining a large operation can become overwhelming. Consider, for example the American Museum of Natural History. Who does one contact to gain an understanding of the way the different collections interoperate? Relational databases provide one way to organize information about an organization, but extracting information from an RDBMS can require expertise concerning the database schema and the query languages. Large language models like GPT4 promise to make it easier to solve problems by asking open-ended, natural language questions and having the answers returned in well-organized and thoughtful paragraphs. The challenge in using a LLM lies in training the model to fully understand where fact and fantasy leave off.

Another approach to organizing facts about a topic of study or a complex organization is to build a graph where the nodes are the entities and the edges in the graph are the relationships between them. Next you train or condition a large language model to act as the clever frontend which knows how to navigate the graph to generate accurate answers. This is an obvious idea and others have written about it. Peter Lawrence discusses the relation to query languages like SPAQL and RDF. Venkat Pothamsetty has explored how threat knowledge can be used as the graph. A more academic study from Pan, et.al. entitled ‘Unifying Large Language Models and Knowledge Graphs: A Roadmap’ has an excellent bibliography and covers the subject well.

There is also obvious commercial potential here as well. Neo4J.com, the graph database company, already has a product linking generative AI to their graph system. “Business information tech firm Yext has introduced an upcoming new generative AI chatbot building platform combining large language models from OpenAI and other developers.” See article from voicebot.ai. Cambridge Semantics has integrated the Anzo semantic knowledge graph with generative AI (GPT-4) to build a system called Knowledge Guru that “doesn’t hallucinate”.

Our goal in this post is to provide a simple illustration of how one can augment a generative large language model with a knowledge graph. We will use AutoGen together with GPT4 and a simple knowledge graph to build an application that answers non-trivial English language queries about the graph content. The resulting systems is small enough to run on a laptop.

The Heterogeneous ACM Knowledge Graph

To illustrate how to connect a knowledge graph to the backend of a large language model, we will program Microsoft’s AutoGen multiagent system to recognize the nodes and links of a small heterogeneous graph. The language model we will use is OpenAI’s GPT4 and the graph is the ACM paper citation graph that was first recreated for a KDD cup 2003 competition for the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Its current form, the graph consists of 17,431 author nodes from 1,804 intuition nodes, 12,499 paper titles and abstracts nodes from 14 conference nodes and 196 conference proceedings covering 73 ACM subject topics. It is a snapshot in time from the part of computer science represented by KDD, SIGMOD, WWW, SIGIR, CIKM, SODA, STOC, SOSP, SPAA, SIGCOMM, MobilCOM, ICML, COLT and VLDB. The edges of the graph represent (node, relationship, node) triples as follows.

(‘paper’, ‘written-by’, ‘author’)
(‘author’, ‘writing’, ‘paper’)
(‘paper’, ‘citing’, ‘paper’)
(‘paper’, ‘cited-by’, ‘paper’)
(‘paper’, ‘is-about’, ‘subject’)
(‘subject’, ‘has’, ‘paper’)
(‘paper’, ‘venue’, ‘conference’)
(‘paper’, ‘in’, ‘;proceedings’)
(‘proceedings’, ‘of-conference’, ‘conference’)
(‘author’, ‘from’, ‘institution’)

Figure 1 illustrates the relations between the classes of nodes. (This diagram is also known as the metagrapah for the heterogeneous graph.) Within each class the induvial nodes are identified by an integer identifier. Each edge can be thought of as a partial function from one class of nodes to another. (It is only a partial function because a paper can have multiple authors and some papers are not cited by any other. )

Figure 1. Relations between node classes. We have not represented every possible edge. For example, proceedings are “of” conferences, but many conferences have a proceeding for each year they are held.

Connecting the Graph to GPT4 with AutoGen.

Autogen is a system that we have described in a previous post, so we will not describe it in detail here. However the application here is easy to understand. We will use a system of two agents.

A UserProxyAgent called user_proxy that is capable of executing the functions that can interrogate our ACM knowledge graph. ( It can also execute Python program, but that feature is not used here.)
An AssistantAgent called the graph interrogator. This agent takes the English language search requests from the human user and breaks them down into operations that can be invoked by the user_proxy on the graph. The user_proxy executes the requests and returns the result to the graph interrogator agent who uses that result to formulate the next request. This dialog continues until the question is answered and the graph interrogator returns a summary answer to the user_proxy for display to the human.

The list of graph interrogation functions mirrors the triples that define the edges of the graph. They are:

find_author_by_name( string )
find_papers_by_authors (id list)
find_authors (id list)
paper_appeared_in (id list)
find_papers_cited_by (id list)
find_papers_citing (id list)
find_papers_by_id (id list)
find_papers_by_title (string )
paper_is_about (id list)
find_papers_with_topic (id list)
find_proceedings_for_papers (id list)
find_conference_of_proceedings (id list)
where_is_author_from (id list)

Except for find_author_by_name and find_papers_by_title which take strings for input, the others all take graph node id lists. They all return nod id lists or list of (node id, strings) pairs. It is easiest to understand the dialog is to see an example. Consider the query message.

Msg = ‘Find the authors and their home institutions of the paper “A model for hierarchical memory”.’

We start the dialog by asking the user_proxy to pass this to the graph_interrogator.

user_proxy.initiate_chat(graph_interogator, message=msg)

The graph interrogator agent responds to the user proxy with a suggestion for a function to call.

Finally, the graph interrogator responds with the summary:

To compare this to GPT-4 based Microsoft Copilot in “precise” answer mode, we get:

Asking the same question in “creative” mode, Copilot lists four papers, one of which is correct and has the authors’ affiliation as IBM which was correct at the time of the writing. The other papers are not related.

(Below we look at a few more example queries and the responses. We will skip the dialogs. The best way to see the details is to try this out for yourself. The entire graph can be loaded on a laptop and the AutoGen program runs there as well. You will only need an OpenAI account to run it, but it may be possible to use other LLMs. We have not tried that. The Jupyter notebook with the code and the data are in the GitHub repo.)

Here is another example:

msg = ”’find the name of authors who have written papers that cite paper “Relational learning via latent social dimensions”. list the conferences proceedings where these papers appeared and the year and name of the conference where the citing papers appeared.”’

user_proxy.initiate_chat(graph_interrogator, message=msg)

Skipping the detail of the dialog, the final answer is

The failing here is that the graph does not have the year of the conference.

Here is another example:

msg = ”’find the topics of papers by Lawrence Snyder and find five other papers on the same topic. List the titles and proceedings each appeared in. ”’

user_proxy.initiate_chat(graph_interogator, message=msg)

Note: The acm topic for Snyder’s paper is “Operating Systems” and that is ACM topic D.4.

Final Thoughts

This demo is, of course, very limited. Our graph is very small. It only covers a small fraction of ACM’s topics and scope. One must then ask how well this scale to a very large KG. In this example we only have a dozen edge types. And for each edge type we needed a function that the AI can invoke. These edges correspond to the verbs in the language of the graph and a graph big enough to describe a complex organization or a field of study may require many more. Consider for example a large natural history museum. The nodes of the graph may be objects in the collection and the categorical groups in which they are organized, their location in the museum, the historical provenance of the pieces, the scientific importance of the piece and many more. The edge “verbs” could be extremely large and reflect the way these nodes relate to each other. The American Natural History Museum in New York has many on-line databases that describe its collections. One could build the KG by starting with these databases and knitting them together. This raises an interesting question. Can an AI solution create a KG from the databases alone? In principle, it is possible to extract the data from the databases and construct a text corpus that could be used to (re)train a BERT or GPT like transformer network. Alternatively, one could use a named entity recognition pipeline and relation extraction techniques to build the KG. One must then connect the language model query front end. There are probably already start-ups working on automating this process.

A Brief Look at Autogen: a Multiagent System to Build Applications Based on Large Language Models.

Posted on January 13, 2024January 15, 2024 by cloud4science

Abstract

Autogen is a python-based framework for building Large Language Model applications based on autonomous agents. Released by Microsoft Research, Autogen agents operate as a conversational community that collaborate in surprisingly lucid group discussions to solve problems. The individual agents can be specialized to encapsulate very specific behavior of the underlying LLM or endowed with special capabilities such as function calling and external tool use. In this post we describe the communication and collaboration mechanisms used by Autogen. We illustrate its capabilities with two examples. In the first example, we show how an Autogen agent can generate the Python code to read an external file while another agent uses the content of the file together with the knowledge the LLM has to do basic analysis and question answering. The second example stresses two points. As we have shown in a previous blog Large Language Models are not very good a advanced algebra or non-trivial computation. Fortunately, Autogen allows us to invoke external tools. In this example, we show how to use an Agent that invokes Wolfram Alpha to do the “hard math”. While GPT-4 is very good at generating Python code, it is far from perfect when formulating Alpha queries. To help with the Wolfram Alpha code generation we incorporate a “Critic” agent which inspects code generated by a “Coder” agent, looking for errors. These activities are coordinated with a Group Chat feature of Autogen. We do not attempt to do any quantitative analysis of Autogen here. This post only illustrates these ideas.

Introduction

Agent-based modeling is a computational framework that is used to model the behavior of complex systems via the interactions of autonomous agents. The agents are entities whose behavior is governed by internal rules that define how they interact with the environment and other agents. Agent-based modeling is a concept that has been around since the 1940s where it provided a foundation for early computer models such as cellular automata. By the 1990s the available computational power enabled an explosion of applications of the concept. These included modeling of social dynamics and biological systems. (see agents and philosophy of science). Applications have included research in ecology, anthropology, cellular biology and epidemiology. Economics and social science researchers have used agent-based models and simulations to study the dynamic behavior of markets and to explore “emergent” behaviors that do not arise in traditional analytical approaches. Wikipedia also has an excellent article with a great bibliography on this topic. Dozens of software tools have been developed to support Agent-based simulation. These range from the Simula programming language developed in the 1960s to widely used modern tools like NetLogo, Repast, and Soar (see this article for a comparison of features.)

Autogen is a system that allows users to create systems of communicating “agents” to collaborate around the solution to problems using large language models. Autogen was created by a team at Microsoft Research, Pennsylvania State University, the University of Washington, and Xidian University consisting of Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger and Chi Wang. Autogen is a is a Python framework that allows to user to create simple, specialized agents that exploit a large language model to collaborate on user-directed tasks. Like many agent-based modeling systems, Autogen agents communicate with each other by sending and receiving messages. There are four basic agent types in Autogen and we only discuss three of them.

A UserProxyAgent is an important starting point. It is literally a proxy for a human in the agent-agent conversations. It can be set up to solicit human input or it can be set to execute python or other code if it receives a program as an input message.
An AssistantAgent is an AI assistant. It can be configured to play different roles. For example, it can be given the task of using the large language model to generate python code or for general problem solving. It may also be configured to play specific roles. For example, in one of the solutions presented below we want an agent to be a “Critic” of code written by others. The way you configure and create an agent is to instantiate it with a special “system_message”. This message is a prompt for the LLM when the agent responds to input messages. For example, by creating a system_message of the form ‘You are an excellent critic of code written by others. Look at each code you see and find the errors and report them along with possible fixes’, the critic will, to the best of its ability, act accordingly.

Communication between Agents is relatively simple. Each Agent has a “send” and a “receive” method. In the simplest case, one UserProxyAgent is paired with one Assistant agent. The communication begins with

              user_proxy.initiate_chat(
                             Assistant,
                             message = “the text of the message to the assistant”
                             )

The user_proxy generates a “send” message to the Assistant. Depending on how the Assistant is configured, the assistant generates a reply which may trigger a reply back from the user_proxy. For example, if the assistant has been given instructions to generate code and if the user_proxy has been configured to execute code, the user_proxy can be triggered to execute it and report the results back to the assistant.

Figure 1. Communication patterns used in the examples in this post.

Agents follow a hierarchy of standard replies to received messages. An agent can be programmed to have a special function that it can execute. Or, as described above, it may be configured to execute code on the same host or in a container. Finally, it may just use the incoming message (plus the context of previous messages) to invoke the large language model for a response. Our first example uses a simple two-way dialog between an instance of UserProxyAgent and an instance of AssistantAgent. Our second example uses a four-way dialog as illustrated in Figure 1. This employs an example of a third type of agent:

GroupChatManager. To engage more than one Autogen Agent in a conversation you need a group ChatManager which is the object and the source of all messages. (Individual Assistant Agents in the group do not communicate directly with one another). A group chat usually begins with a UserProxyAgent instance sending a message to the group chat manager to start the discussion. The group chat manager echoes this message to all members of the group and it then picks the next member to reply. There are several ways this selection may happen. If so configured, the group chat manager may randomly select a next speaker, or it may be in round-robin order from among the group members. The next speaker may also be selected by human input. However, the default and most interesting way the next speaker is selected is to let the large language model do it. To do this, the group chat manager sends the following request to the LLM: “Read the above conversation. Then select the next role from [list of agents in the group] to play. Only return the role.” As we shall see this works surprisingly well.

In the following pages we describe our two examples in detail. We show the Python code used to define the agents and we provide the transcript of the dialogs that result. Because this is quite lengthy, we edited it in a few places. GPT-4 likes to ‘do math’ using explicit, raw Latex. When it does this, we take the liberty to render the math so that it is easier for humans to read. However, we include the full code and unedited results in our GitHub repository https://github.com/dbgannon/autogen.

Example 1. Using External Data to Drive Analysis.

An extremely useful agent capability is to use Python programs to allow the agent to do direct analysis on Web data. (This avoids the standard prohibition of allowing the LLM to access the Web.) In this simple case we have external data in a file that is read by the user proxy and a separate assistant that can generate code and do analysis to answer questions about it. Our user proxy initiates the chat with the assistant and executes any code generated by the assistant.

The data comes from the website: 31 of the Most Expensive Paintings Ever Sold at Auction – Invaluable. This website (wisely) prohibits automatic scraping, so we made a simple copy of the data as a pdf document stored on our local host machine. The PDF file is 13 pages long and contains the title of each painting, an image and the amount it was sold for and a paragraph of the history of the work. (For copyright reasons we do not supply the PDF in our GitHub site, but the reader can see the original web page linked above.)

We begin with a very basic assistant agent.

We configure a user proxy agent that can execute code on the local host. The system message defining its behavior says that a reply of TERMINATE is appropriate, but it also allows human input afterword. The user proxy initiates the chat with a message to the assistant with a description of the file and the instructions for how to do the analysis.

Before listing the complete dialog, here is a summary of the discussion

The user_proxy send a description of the problem to the assistant.
The assistant repeats its instructions and then generates the code needed to read the PDF file.
The user_proxy executes the code but there is a small error.
The assistant recognizes the error. It was using an out-of-date version of the pdf reader library. It corrected the code and gave that back to the user_proxy.
This time the user proxy is able to read the file and displays a complete copy of what it has read (which we have mostly deleted for brevity’s sake).
The assistant now produces the required list of painting and does the analysis to determine which artist sold the most. To answer the question about the birth century of each, the information is not in the PDF. So It uses its own knowledge (i.e. the LLM training) of the artists to answer this question. Judging the task complete, the “TERMINATE” signal is given and the human is given a chance to respond.
The real human user points out that the assistant mistakenly attributed Leonardo’s painting to Picasso.
The assistant apologizes and corrects the error.

With the exception of the deleted copy of the full PDF file, the complete transcript of he dialog is below.

Using External Computational Tools: Python and Wolfram Alpha

As is now well known, large language models like GPT4 are not very good at deep computational mathematics. Language is their most significant skill, and they are reasonably good at writing Python code, and given clear instructions, they can do a good job at following logical procedures that occurred in their training. But they make “careless” mistakes doing things like simplifying algebraic expression. In this case we seek the solution to the following problem.

“Find the point on the parabola (x-3)**2 – (y+5)**2 = 7 that is closes to the origin.”

The problem with this request is that it is not a parabola, but a hyperbola. (An error on my part.) As a hyperbola it has two branches as illustrated in figure 3 below. There is a point on each branch that is closes to the origin.

Figure 3. Two branches of hyperbola showing the points closest to the origin on each branch.

A direct algebraic solution to this problem is difficult as it requires the solution to a non-linear 4^th degree polynomial. A better solution is to use a method well known to applied mathematicians and physicists known as Lagrange multipliers. Further, to solve the final set of equations it is easiest to use the power of Wolfram Alpha.

We use four agents. One is a MathUserProxy Agent which is provided in the Autogen library. Its job will be execution of Alpha and Python programs.

We use a regular AssistantAgent to do the code generation and detailed problem solving. While great at Python, GPT-4 is not as good writing Alpha code. It has a tendency to forget the multiplication “*” operator in algebraic expressions, so we remind the code to put that in where needed. It does not always help. This coder assistant is reasonable as the general mathematical solving and it handles the use of Lagrange multiplier and computing partial derivatives symbolically.

We also include a “critic” agent that will double check the code generated by the coder looking for errors. As you will see below, it does a good job catching the Alpha coding error.

Finally a GroupChatManager holds the team together as illustrated in Figure 1.

The dialog that follows from this discussion proceeds as follows.

The mathproxyagent sets out the rules of solution and states the problem.
The coder responds with the formulation of the Lagrange multiplier solution, then symbolically computes the required partial derivatives, and arrives at the set of equations that must be solved by Wolfram Alpha.
The Critic looks at the derivation and equations sees an error. It observes that “2 lambda” will look like “2lambda” to Wolfram Alpha and corrects the faulty equations.
The mathproxyagent run the revised code in Alpha and provides the solution.
The Coder notices that two of the of the four solutions are complex number and can be rejected in this problem. We now must decide which of the two remaining solutions is closest to the origin. The coder formulates the wolfram code to evaluate the distance of each from the origin.
The Critic once again examines the computation and notices a problem. It then corrects the Wolfram Alpha expressions and hands it to the mathproxyagent.
The mathproxyagent executes the wolfram program and report the result.
The Coder announces the final result.
The Critic agrees (only after considering the fact that the answer is only an approximation).

Final Observations

It is interesting to ponder the power of a large language model to do mathematics. Consider the following remarkable language ability. Ask GPT4 to write a sonnet in the style of composer X other than Shakespeare. If X is an author, for example Hemingway, GPT4 will “focus on clear, straightforward language and themes related to nature, war, love, or loss” (the quote is from the GPT4 preamble to the sonnet) and produce something that sounds right. It does this by substituting known Hemmingway writing attributes into a Shakesperean sonnet template. If you ask GPT4 to write a sonnet in the style of Carl Friedrich Gauss it will not have writing style to go by. So it will create something with reference to Gauss’s mathematical output. (see Appendix A below).

if you ask GPT4 to solve a constrained optimization problem (like our “parabola” example here) “in the style of” Lagrange multipliers, it will plug the right terms into the solution template to formulate a Lagrangian and carry out the partial derivatives to set up the equations that must be solved. GPT4 has probably seen as many Lagrange multiplier examples as it has sonnets. So this is, in some ways, an example of its linguistic power. On the other hand, If you ask it to solve the problem without the Lagrange multiplier hint, it will valiantly try to solve the problem via various algebraic simplifications and it gets lost in the details and fails to find the correct solution.

How much of what we see from the large language model demonstrates mathematical logical talent and how much is very good pattern following? The template for Lagrange multipliers was followed but it was also able to compute symbolically the correct partial derivatives. The only weakness we can see is frequent small errors in algebra and poor computational skill. Clearly better training on basic algebra will help. It is also clear that being able to invoke tools like Mathematica or Python numerical libraries would be an essential component of a true Mathematica co-pilot. But can a LLM go beyond pattern matching? Can a LLM create original mathematics with the same “talent” it shows when creating original sonnets?

Appendix A. A Shakesperean Sonnet in the style of Carl Friedrich Gauss

By GPT4

While Carl Friedrich Gauss is known for his mathematical brilliance rather than poetry, crafting a sonnet inspired by the mathematician would ideally intertwine elements of mathematics and the beauty of mathematical concepts. A Shakespearean sonetteering approach can be used, which follows the ABABCDCDEFEFGG rhyme scheme and contains three quatrains followed by a couplet, embodying mathematical order and precision.

In the realm of numbers, where truths align,
The primes, those stars scattered ‘cross the night sky,
Their sequence lures the mind to dare define
The pattern within, hidden to the eye.

In graphs and curves, analytics took form,
Each theorem bloomed like a proof's quiet rose.
Gaussian realms, serene and far from norm,
Where numbers dance and variables compose.

His bell curve gently sways with nature's grace,
Life’s variance secured within its sway,
Within this normal field, finds every place
A measure, chance’s role in clear display.

In math’s own language, Gauss sought to converse,
And in its whispers, found the universe.