
Ride the Wave, Build the Future: Scientific Computing in an AI World

By Jack Dongarra, Daniel Reed, and Dennis Gannon

Abstract: The rapid rise of generative AI has shifted the center of gravity in advanced computing toward hyperscale AI platforms, reshaping the hardware, software, and economic landscape that scientific computing depends on. This paper argues that scientific and technical computing must “ride the wave” of AI-driven infrastructure while “building the future” through deliberate investments in new foundations. It presents seven maxims that frame the emerging reality: (1) HPC is increasingly defined by integrated numerical modeling and generative AI as peer processes; (2) energy and data movement—not peak FLOPS—are the dominant constraints, motivating “joules per trusted solution” as a primary metric; (3) benchmarks should reflect end-to-end hybrid workflows rather than isolated kernels; (4) winning systems require true end-to-end co-design, workflow first; (5) progress demands prototyping at scale with tolerance for failure; (6) curated data and trained models are durable strategic assets; and (7) new public–private collaboration models are essential in an AI-dominated market. The paper concludes with a call for a national next-generation system design “moonshot” targeting orders-of-magnitude reductions (≈1/100) in energy per validated scientific outcome via energy-aware algorithms, architecture innovation focused on memory/interconnect efficiency, and software stacks that optimize hybrid AI+simulation workflows. 

1. Introduction

In 2023 [18], we argued that the center of gravity in advanced computing had already shifted away from traditional scientific and engineering high-performance computing (HPC), with the locus of influence now centered on hyperscale service providers and consumer smartphone companies. We enumerated five maxims to guide future activities in HPC:

  1. Semiconductor constraints dictate new approaches,
  2. End-to-end hardware/software co-design is essential,
  3. Prototyping at scale is required to test new ideas,
  4. The space of leading-edge HPC applications is far broader now than in the past, and
  5. Cloud economics have changed the supply-chain ecosystem.

Since then, given the meteoric rise of generative artificial intelligence (AI), the computing landscape has shifted more dramatically than even the most disruptive technology forecasts might have anticipated. Today, the dominant computing markets are unequivocally AI-driven; the energy and cooling demands of hyperscale systems are measured in hundreds of megawatts, making them public issues; high-precision floating point hardware is giving way to reduced precision arithmetic in support of AI models; and national strategies increasingly treat AI-capable clouds and scientific supercomputers as a fused strategic resource, with deep geopolitical implications.

Consequently, scientific and technical computing is increasingly a specialized, policy-driven niche riding atop hardware and software stacks optimized for other, much larger markets. The challenge for scientific computing is to adapt to this rapidly changing world with a more holistic perspective on the global landscape, one that looks beyond the narrow but important design of next-generation computing systems to how an integrated ecosystem of new, nascent, and still-to-be-developed computing technologies enables scientific discovery, economic opportunity, public health, and global security. We must ride the wave of AI while simultaneously building the future.

In this paper, we outline seven new maxims that define the present and the future of advanced scientific computing. From these maxims, we conclude with a proposal for a “moonshot” to build a new foundation for future computer systems for research, one that would benefit both scientific computing and AI.

2.  Current Technical and Economic Reality

Each high-performance computing transition has been driven by a combination of market forces and semiconductor economics, requiring the scientific computing community to develop and embrace new algorithms and software to use the systems effectively.  Each time, there were those who initially resisted inevitability, only to suffer the consequences of delayed adoption, whether clinging to vector supercomputers or refusing to embrace scalable message passing. Today is no different. The scientific computing community must again adapt and embrace the new realities of our AI-dominated technology world.

The first sea change is one of economic and technical influence. The scientific computing community has long been a driver of computing innovation, even in the commodity hardware space, by specifying and buying the earliest and largest instances of new technology. That is no longer possible, especially under current procurement models. The scale of “AI factories” now dwarfs that of even the fastest machines on the TOP500 list of supercomputers, and the gap widens each year.

Moreover, unlike the rise of the modern microprocessor, when all hardware was available for public purchase, a substantial portion of the most advanced AI hardware is designed and built by the AI hyperscalers themselves. Prominent examples include Google’s TPUs [7], Amazon’s Trainium [24], and Microsoft’s Maia hardware. The largest clusters and newest accelerator generations are often accessible only to internal AI teams within the hyperscaler or to a small set of strategic partners under commercial terms. 

Although both scientific computing and generative AI benefit from high floating point operation rates, machine learning flourishes with 32-, 16-, 8-, and even 4-bit operands. In contrast, scientific computing has long depended on high-precision, 64-bit floating point. The shift toward reduced precision in the design points of both the hyperscalers and NVIDIA, the largest supplier of AI accelerators, raises important concerns for traditional computational modeling.

In addition, the now mainstream cloud software ecosystem, including storage systems, scheduling models, and software services, differs markedly from current technical computing practices. Lest this seem heretical, remember that UNIX and open source software were once viewed as high risk by the scientific computing community, even as they became mainstream in the commercial computing world.

3. Modeling and AI As Peer Processes

Maxim One: HPC is now synonymous with integrated numerical modeling and generative AI.

The need to embrace AI is more than an economic imperative; it is also an intellectual and scientific necessity. Just as computational science became a complement to theory and experiment, later augmented by data science [25], HPC and AI are now peer processes in scientific discovery.  Both are now needed to integrate deductive (computational science) and inductive (learning from data) models.

It is worth pausing to understand why there was initial resistance to AI in the computational science community. First, traditional computational simulation and modeling are deductive, built on mathematical models of phenomena that derive from the laws of classical or quantum physics and are typically expressed as discretized differential equations. This approach reflects the classical mathematical and scientific training of most computational scientists.

 In contrast, generative AI models are inductive, with models trained using large volumes of data. Just as computational models can approximate solutions to differential equations to arbitrary precision, so too can AI models learn to approximate unknown functions to arbitrary precision. Crucially, it is not a matter of choosing to invest in simulation and modeling or AI. Both are critical and complementary, each offering capabilities and efficiencies lacking in the other.

Consider weather modeling, an area long dominated by complex numerical models. When trained on 40 years of analysis, AI can produce 10-day forecasts in seconds rather than hours, with results now competitive with those of the European Centre for Medium-Range Weather Forecasts (ECMWF) on standard metrics [11, 12, 26]. In biology, the protein folding systems AlphaFold and RoseTTAFold accurately predict 3-D protein structure from sequences [8, 9], a problem many now consider solved. AI is also a great help with inverse problems. For example, the AI diffusion methods used to create images can also be used to remove noise and reconstruct diagnostic-quality medical images [27]. Similar techniques can aid in searching for gravitational lensing in large-scale survey data [28]. Drug and materials discovery have also been aided by AI methods that reduce search spaces prior to expensive experimentation.

Despite their great promise, AI methods are not without problems, just as numerical models face challenges regarding uncertainty quantification. Simply put, AI methods fail when applied outside the boundaries of their training data. As we noted earlier, AI methods have proven highly effective for weather prediction given historical data, but they are unable to predict the emergence of chaotic, rare events such as tornadoes. In contrast, tornadoes can now be predicted with fine-grained CFD simulations on HPC systems, an example of the complementary utility of AI and numerical models. Nor can generative AI models readily incorporate well-known physical laws, though physics-informed neural networks offer promise.

The complementary strengths and weaknesses of numerical and AI models have led to their integration as hybrid models, notably the use of AI models as numerical surrogates. First, one trains a neural network to approximate an expensive simulation, then uses the AI surrogate for rapid parameter space exploration (taking care not to push beyond its domain of applicability), and finally uses the computationally intensive numerical simulation to verify promising results. Similarly, for adaptive grid methods, AI can be used to predict the regions where mesh refinement may be most beneficial. These hybrid techniques incorporate AI directly into the workflow of a large-scale HPC computation.
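The three-step surrogate workflow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production method: `expensive_simulation` is a hypothetical stand-in for a costly numerical model, and a least-squares polynomial fit (via NumPy) stands in for the neural network surrogate.

```python
import numpy as np

def expensive_simulation(x):
    """Hypothetical stand-in for a costly numerical model."""
    return np.sin(3.0 * x) + 0.5 * x**2

# Step 1: train a cheap surrogate on a handful of expensive runs.
# (A polynomial fit is an assumption here; the text describes a neural network.)
train_lo, train_hi = -1.0, 1.0
x_train = np.linspace(train_lo, train_hi, 15)
y_train = expensive_simulation(x_train)
surrogate = np.polynomial.Polynomial.fit(x_train, y_train, deg=7)

# Step 2: explore the parameter space rapidly with the surrogate,
# staying strictly inside the domain covered by the training data.
x_candidates = np.linspace(train_lo, train_hi, 2001)
y_pred = surrogate(x_candidates)
top = x_candidates[np.argsort(y_pred)[:5]]  # five lowest predicted values

# Step 3: verify only the promising candidates with the expensive model.
y_true = expensive_simulation(top)
best = top[np.argmin(y_true)]
max_surrogate_error = np.max(np.abs(surrogate(top) - y_true))

print(f"best x = {best:.4f}, verified value = {y_true.min():.4f}")
print(f"max surrogate error at verified points = {max_surrogate_error:.2e}")
```

The expensive model is called 15 times for training and 5 times for verification, while the 2001-point search runs entirely on the surrogate. Restricting the candidates to the training interval is the code-level form of staying within the surrogate's domain of applicability.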

The message is clear.  AI and numerical models each have advantages and domains of applicability. Equally important, their integration creates opportunities not possible with either alone.

4. Energy and Data Movement Dominate

Maxim Two:  Energy and data movement, not floating point operations, are the scarce resources.

Energy As a Design Constraint

As semiconductor scaling has slowed and architectural complexity has grown, energy consumption and heat dissipation have become limiting factors for both AI data centers and traditional supercomputers. Systems that draw hundreds of megawatts now define flagship deployments, driven by both the rising scale of deployments and the energy requirements of modern semiconductors. At these scales, every aspect of system design becomes an energy problem: how to deliver power from the grid, how to remove heat efficiently, and how to align operations with carbon reduction commitments. Liquid cooling is de rigueur, with direct-to-chip, immersion, and hybrid schemes now the norm.

In this context, traditional performance metrics such as peak floating point operations per second (FLOPS) or even time-to-solution are no longer sufficient. What matters is “joules per solution”: the total energy cost of producing a scientifically meaningful answer or training a model to an acceptable level of quality. This metric forces new trade-offs among fidelity, resolution, model size, and energy consumption. It also highlights the role of algorithmic innovation: mixed-precision methods, communication-avoiding algorithms, data compression, and smarter sampling and surrogate models can all reduce joules per solution, sometimes dramatically, without sacrificing reliability.
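To make the metric concrete, here is a minimal sketch of how joules per solution might be computed from a sampled power trace. The function name, the sampling scheme, and the example numbers are all assumptions for illustration; a real facility would also need to decide what counts as a trusted solution and whether cooling and storage energy are included.

```python
import numpy as np

def joules_per_solution(time_s, power_w, n_trusted_solutions):
    """Integrate sampled power (W) over time (s), then divide by validated results."""
    time_s = np.asarray(time_s, dtype=float)
    power_w = np.asarray(power_w, dtype=float)
    if n_trusted_solutions <= 0:
        raise ValueError("need at least one trusted solution")
    # Trapezoidal rule: energy in joules is the integral of power over time.
    energy_j = np.sum(0.5 * (power_w[1:] + power_w[:-1]) * np.diff(time_s))
    return energy_j / n_trusted_solutions

# Hypothetical trace: a job drawing a constant 2 kW for one hour,
# sampled once per minute, that yields 4 validated results.
t = np.arange(0, 3601, 60)
p = np.full(t.shape, 2000.0)
jps = joules_per_solution(t, p, n_trusted_solutions=4)
print(f"{jps:.3e} J ({jps / 3.6e6:.3f} kWh) per trusted solution")
```

Dividing by validated results rather than by completed jobs is the point of the metric: a run that finishes quickly but fails verification adds energy to the numerator and nothing to the denominator.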

Critically, the time scales for computing system design and energy infrastructure decisions are increasingly mismatched. A new hyperscale data center and associated computing infrastructure can be designed and built in a few months. Upgrading power generation, transmission, or distribution infrastructure often takes much longer, especially when it involves regulatory approvals, environmental review, and large capital projects. This asymmetry means that unless the system design also includes building and operating a utility (e.g., a reactor or wind farm), the power envelope for systems is often effectively fixed years in advance, long before architectural details are finalized. As a result, future systems must be conceived as configurations that operate within pre-defined energy and cooling budgets, not as free variables to be optimized later.

Consequently, as Figure 1 shows, the energy demand for AI factories is now outpacing the capacity of energy grids [33].  In addition to the mismatch in construction timescales, it also reflects inadequate investment, at least in the U.S., in grid modernization. Rising energy demand, from both the proliferation of data centers and their growing scale, is now a bottleneck for data center deployment.  In consequence, some hyperscalers are now embracing temporary solutions, such as arrays of gas turbine generators.

Sustainability is no longer a public-relations story; it is a design constraint and an operating condition. Policy mandates, institutional climate goals, and community expectations will increasingly require large-scale computing projects to quantify and justify their energy usage in terms of joules per solution, not just peak capability. Energy efficiency must be a first-class objective across hardware, software, and workload design—not as a downstream optimization once the systems are built.

Data Movement Costs and Floating Point Arithmetic

In the past, the energy cost of arithmetic operations dominated.  Today, moving data (within and between chips) consumes more energy than the arithmetic operations enabled by that data movement, yet our measures of software efficiency still center on arithmetic operation counts. Simply put, performance metrics that ignore power and communication costs encourage architectures that look impressive on paper but are increasingly impractical to operate at scale.

If facilities are to operate within tight energy envelopes while supporting both AI and high-fidelity simulation, algorithmic co-design must also extend beyond kernels and into the fundamental treatment of precision and data movement. In this view, arithmetic precision and communication are not merely implementation details; they are explicit algorithmic resources to be budgeted alongside time and memory.

This shift has already begun: hardware designed for AI now focuses on reduced precision arithmetic to reduce energy and data movement costs. NVIDIA’s latest hardware exemplifies this trend, as illustrated in Table 1.

| Operations (peak performance) | NVIDIA Hopper (H200), 2022 | NVIDIA Blackwell (B200), 2024 | NVIDIA Vera Rubin, 2026 |
| --- | --- | --- | --- |
| FP64 FMA | 33.5 TFLOP/s | 40 TFLOP/s | 33 TFLOP/s |
| FP64 Tensor Core | 67 TFLOP/s | 40 TFLOP/s | 33 TFLOP/s |
| FP16 Tensor Core | 989 TFLOP/s | 2250 TFLOP/s | 4000 TFLOP/s |
| BF16 Tensor Core | 989 TFLOP/s | 2250 TFLOP/s | 4000 TFLOP/s |
| INT8 Tensor Core | 1979 Teraops/s | 4500 Teraops/s | 2500 Teraops/s |
| Memory bandwidth | 4.8 TB/s | 8 TB/s | 22 TB/s |

Table 1. NVIDIA Floating Point Performance
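The trend in Table 1 is easiest to see as a ratio. The short calculation below uses only the FP64 FMA and FP16 Tensor Core peak rates listed in the table; the variable names are illustrative.

```python
# Peak rates copied from Table 1, in TFLOP/s: (FP64 FMA, FP16 Tensor Core).
peak = {
    "Hopper (H200)": (33.5, 989.0),
    "Blackwell (B200)": (40.0, 2250.0),
    "Vera Rubin": (33.0, 4000.0),
}

# Ratio of reduced-precision to double-precision peak throughput.
ratios = {gen: fp16 / fp64 for gen, (fp64, fp16) in peak.items()}

for gen, ratio in ratios.items():
    print(f"{gen}: FP16 Tensor Core peak is about {ratio:.0f}x the FP64 FMA peak")
```

On these figures the gap between FP16 and FP64 peak throughput roughly doubles with each generation, while the FP64 rate itself stays essentially flat.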

Mixed-precision methods exemplify this shift [13,14]. Rather than assuming uniform 64-bit (FP64) floating point arithmetic, future numerical solvers will partition computations across FP64, FP32, BF16, FP8, and integer-emulated formats, using high precision only where it is most needed for stability or accuracy. Iterative refinement [21], stochastic rounding [23], randomized sketching [22], and hierarchical preconditioners [20] will allow most floating point operations to be executed on low-precision units. At the same time, small high-precision components provide correction and certification. In AI workflows, similar ideas apply to training and inference, with dynamic precision schedules and quantization strategies tuned to minimize joules per unit of practical learning.
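As one illustration of the idea, the sketch below applies classical iterative refinement with NumPy: the linear solves run in FP32, while residuals and the accumulated solution are kept in FP64. It is a simplified sketch, not a tuned solver. In particular, it calls `np.linalg.solve` on the FP32 matrix at every step, whereas a practical implementation would factor the matrix once in low precision and reuse the factors. The function name and test matrix are assumptions.

```python
import numpy as np

def refine_solve(A, b, max_iters=10, tol=1e-12):
    """Solve Ax = b with FP32 solves corrected by FP64 residuals."""
    A32 = A.astype(np.float32)
    # Low-precision initial solve (where most of the arithmetic happens).
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iters):
        r = b - A @ x  # residual computed in FP64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # Low-precision correction step, accumulated in FP64.
        d = np.linalg.solve(A32, r.astype(np.float32))
        x += d.astype(np.float64)
    return x

# Well-conditioned test problem (hypothetical data).
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)
x_true = rng.standard_normal(n)
b = A @ x_true

x32 = np.linalg.solve(A.astype(np.float32), b.astype(np.float32)).astype(np.float64)
x_ref = refine_solve(A, b)
err32 = np.linalg.norm(x32 - x_true) / np.linalg.norm(x_true)
err_ref = np.linalg.norm(x_ref - x_true) / np.linalg.norm(x_true)
print(f"FP32 only: {err32:.1e}   refined: {err_ref:.1e}")
```

The high-precision work is limited to matrix-vector products and vector updates, which is the pattern the paragraph describes: small high-precision components correct and certify a result produced mostly on low-precision units. Convergence depends on the matrix being well enough conditioned for the FP32 solve to be a useful approximation.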

Communication-avoiding and energy-aware algorithms add a complementary dimension [15]. Classical work on minimizing messages and data movement must be reinterpreted in the context of modern communication fabrics, offload engines, and hierarchical memory systems. Runtimes will need to be aware of both energy and communication costs, scheduling tasks to minimize expensive data motion across racks or facilities and to exploit near-memory or in-network computation where possible. Hybrid AI+simulation workflows will rely on asynchronous, event-driven communication patterns that allow different parts of the system to operate at their own natural time scales without constant global synchronization.

This algorithmic work must be conducted in deliberate co-design with emerging hardware—just as hyperscalers already do for AI, where they face similar energy cost and data movement challenges [29]. Scientific computing cannot simply await new architectures and adapt afterward. Instead, targeted collaborations are needed in which hardware features (numerical precision formats, on-die networks, memory hierarchies, and DPUs) are shaped in dialogue with scientific algorithms, and in which software stacks expose those features in usable, portable ways.

5. Benchmarking and Evaluation

Maxim Three: Benchmarks are mirrors, not levers.

Performance metrics such as High-Performance Linpack (HPL), High-Performance Conjugate Gradient (HPCG), or any other next-generation benchmark reflect the systems vendors are already building; they rarely reshape the broader market trajectory on their own. Put another way, they generally reward incremental improvements rather than transformative alternatives.

New benchmarks must span both simulation and AI partitions, exercising end-to-end workflows rather than isolated kernels. For example, a climate benchmark might couple high-resolution dynamical core simulations with AI-based subgrid parametrizations and data assimilation, measuring not only time-to-solution but also energy consumed, data moved, and robustness of the resulting forecasts. A materials benchmark might link quantum-level calculations, surrogate models, and large-scale screening workflows.

Energy- and carbon-aware metrics should be central, not peripheral. Joules per trusted solution (and, where possible, estimated emissions per solution) provide a more meaningful measure of a system’s value than peak floating point performance. Benchmarks can incorporate these metrics directly, reporting performance as a Pareto frontier among time, energy, and fidelity. This will encourage architectures and algorithms that balance these objectives rather than chase single-number records.
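Reporting a Pareto frontier rather than a single number is straightforward to mechanize. The sketch below is one possible formulation, with fidelity expressed as an error so that lower is better on every axis. The configuration names and measurements are hypothetical.

```python
def pareto_front(results):
    """Return names of configurations not dominated in (time, energy, error)."""
    def dominates(a, b):
        # a dominates b if it is no worse on every axis and better on at least one.
        no_worse = all(x <= y for x, y in zip(a, b))
        better = any(x < y for x, y in zip(a, b))
        return no_worse and better

    front = []
    for name, metrics in results.items():
        dominated = any(
            dominates(other, metrics)
            for other_name, other in results.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical results: (time-to-solution in s, energy in J, relative error).
results = {
    "fp64_baseline": (100.0, 5.0e6, 1e-12),
    "mixed_precision": (40.0, 1.8e6, 1e-10),
    "surrogate_only": (2.0, 5.0e4, 1e-3),
    "untuned_port": (120.0, 6.0e6, 1e-12),
}
front = pareto_front(results)
print(front)
```

Here the untuned port drops out because the baseline matches its accuracy with less time and energy. The other three configurations all remain on the frontier, since each trades accuracy against cost differently and no single number ranks them.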

Equally important is the need to benchmark the data fabric itself. Future metrics should stress test data ingestion from instruments, movement across simulation and AI partitions, access to long-term archives, and enforcement of security and access policies. They should evaluate not just raw bandwidth and latency, but also how well facilities support governed, equitable access to data and models—key concerns for national platforms that serve diverse communities.

Finally, benchmarks should reflect the hybrid nature of public-private computing infrastructure.  Some workloads will span on-premise facilities and secure cloud regions; others will rely heavily on AI services coupled with local simulations. Measurement frameworks must be able to attribute performance and energy across these boundaries, enabling comparisons of different design and deployment choices.

In short, if we want design patterns for future scientific facilities that genuinely align with societal and scientific goals, we must update the mirrors we use to see ourselves. New benchmarks and metrics—rooted in AI+simulation workflows, energy and carbon efficiency, and equitable access—are as essential as new chips, racks, and cooling systems.

6. Co-Design Really Matters

Maxim Four: Winning systems are co-designed end-to-end—workflow first, parts list second.

Although the hyperscaler and AI community has aggressively embraced hardware-software co-design, the story in scientific computing is less encouraging. There are notable examples of co-design in specific missions: fusion devices, climate modeling initiatives, and some exascale application teams have worked closely with vendors to shape features or software paths. However, most production scientific codes must still adapt to extant architectures. Porting and tuning cycles are long; exploitation of new features (tensor cores, DPUs, new memory tiers) is partial and ad hoc; and large segments of the scientific software ecosystem remain effectively frozen on older models of the machine.

 Is this because the community is risk-averse, or simply because it is resource-constrained? The honest answer is both. Co-design at scale requires sustained funding, institutional continuity, and the ability to place substantial bets on uncertain outcomes. In reality, most scientific teams operate with fragmented funding and short time horizons; they cannot afford to gamble entire codes on speculative hardware features. Most tellingly, this has proven true even for the largest, mission-driven applications such as nuclear stockpile stewardship.  Meanwhile, vendors are understandably reluctant to optimize for niche workloads when AI and cloud customers dominate revenue.

 The net result is that co-design remains the exception rather than the rule in scientific computing. Where it has worked, it has done so in contexts that resemble AI—concentrated workloads, strong institutional commitment, and substantial aligned resources. For co-design to enable a broader spectrum of scientific codes, governance and funding structures must look more like those of AI ecosystems: fewer, more focused efforts with the scale and longevity to justify genuine hardware–software co-evolution.

7. Prototyping at Scale

Maxim Five: Research requires prototyping at scale (and risking failure), otherwise it is procurement.

In 2023 [18], we advocated for more aggressive prototyping of next-generation systems at scale. The idea was simple: if we want new architectures and programming models better matched to the needs of scientific computing, we must first build them and let real users test them in realistic configurations. Since then, we have seen a handful of promising large-scale prototypes and early-access systems. Nevertheless, these efforts remain scattered and, in many cases, closed or narrowly scoped, with inadequate funding and little ability to take calculated risks.

Such prototyping and development will require larger scale investments (i.e., tens of millions of dollars), either in startup companies or laboratory teams, that embrace targeted technological risks (e.g., custom chiplets) while leveraging the extant hardware ecosystem. Only with scalable testbeds can new hardware, software stacks, and energy-management strategies be exercised by a wide range of scientific workloads under realistic conditions. This is neither simple nor easy, but it is essential if we are to address the limitations of hardware designed for commercial markets.

Equally important, advanced prototyping means being willing to accept failure while drawing lessons from it. Put another way, we must embrace calculated risks to explore promising new ideas. Such risk-taking was once more common in computing. One need look no further than the 1960s experiments with the IBM Stretch and the Illinois/Burroughs ILLIAC IV, followed more recently by DARPA’s targeted parallel computing program in the 1990s, which led to a host of novel parallel hardware prototypes, including the Stanford DASH and Illinois Cedar systems.

Pursued seriously, advanced prototyping may push scientific+AI HPC toward a “bespoke instrument” model. Rather than building generic machines and layering everything on top, designs might explicitly target particular classes of workflows (e.g., climate + energy systems, fusion + materials, or life sciences + health analytics) with algorithmic patterns, precision strategies, and data topologies tuned to those missions. The challenge will be to retain enough generality and openness that such bespoke instruments remain shared national resources, not single-experiment machines.

Software Stack Interoperability and Malleability

Nor can the world of prototypes be limited to software; it must also encompass interoperability between computational modeling and cloud services. In a world where traditional supercomputing and modern AI clouds are not separate worlds but interoperable layers, a climate scientist, materials chemist, or nuclear engineer would move fluidly between running large-scale simulations on government HPC systems, invoking scientific foundation models hosted in secure clouds, and using AI agents to orchestrate end-to-end workflows that span both environments.

Alternative Computing Models

Building the future means more than just riding AI hardware trends; it also means investing in alternative computing models, ones that address precisely those areas where constraints are becoming first-order: energy, data movement, and domain-specific computing.

For example, neuromorphic computing [30] can be more aptly characterized as an “energy-first” approach for event-driven, sparse inference, or control. Asynchronous, spiking networks with co‑located memory and compute are inherently suited to always‑on sensing, edge scientific instrumentation, autonomous laboratories, fast triggers, and adaptive control.  The priority, not just in neuromorphic computing, but in sensing generally, really, ever since Einstein’s earliest days in physics, has been ‘act quickly, with minimal joules.’

Quantum computing [31] also represents an accelerator for a class of problems. Specifically, a quantum computer can be integrated into a hybrid processing pipeline involving chemistry/materials simulations (specific electronic-structure problems), small- to medium-scale combinatorial optimization, sampling problems, and perhaps cryptology/security applications. However, the bar is relatively high, as the potential to lower the cost of communications and synchronization is becoming increasingly dominant.

8. Multidisciplinary Data Curation and Fusion

Maxim Six: Data and models are intellectual gold.

In an era when many countries can buy similar hardware and access similar cloud platforms, the differentiators are increasingly the quality of curated datasets, the sophistication of the trained models, and the legal and institutional frameworks that govern their use. High-value scientific datasets—long climate reanalyses, fusion diagnostics, high-resolution Earth observation archives, curated materials, and molecular databases—are expensive to generate and maintain.

When combined with frontier AI and hybrid AI+simulation workflows, they allow a given amount of computation to yield more insight, faster and more reliably, than would otherwise be possible. Similarly, scientific foundation models trained on such data—models for weather, climate, molecular design, materials discovery, or engineering design—become reusable assets that can be fine-tuned, coupled to simulations, and deployed across a wide range of applications.

Data stewardship must be a central element of national and institutional strategy. Investments in high-quality metadata, provenance tracking, curation, and long-term preservation are investments in future scientific leverage. Thus, the design and training of scientific foundation models must be treated as infrastructure. Just as we do not rebuild compilers and linear algebra libraries for every application, we should not treat domain foundation models as disposable experiments.

9. New Public-Private Partnerships

Maxim Seven: New collaborative models define 21st-century computing.

Frontier AI+HPC has moved from the realm of research strategy to national geopolitical policy. Executive orders and national strategies now explicitly identify AI+science platforms, secure cloud AI, and supercomputers as components of national competitiveness and security. Genesis-style [17] missions recast a historically technical conversation as a matter of national priority.

Concurrently, the shift to an AI-dominated computing market forces a rethinking of how to fund and organize scientific computing. In a world where hyperscalers and AI platform companies set the pace of hardware innovation, traditional models—incremental upgrades to on-premise systems funded through periodic capital campaigns—are no longer sufficient to sustain leadership in HPC for science. Instead, future government funding models must recognize that advanced computing is now a mixed public–private ecosystem, in which strategic consortia, pre-competitive platforms, and mission-driven initiatives play central roles.

In turn, this means articulating explicit AI+HPC requirements linked to national and global challenge problems – climate resilience, health, energy transition, national security, and economic competitiveness. Funding calls that tie hardware, software, data, and workforce development together—anchored in concrete mission outcomes—are more likely to produce durable ecosystems than one-off hardware acquisitions.

Genesis-style initiatives are one example of this logic: they frame AI+science platforms as critical infrastructure for national goals rather than as isolated technology experiments. The core lesson is that publicly funded scientific computing cannot succeed by passively purchasing available computing hardware. It needs proactive, coalition-based funding models that treat AI+HPC as a long-term strategic national asset, integrating hardware, software, data, and people under coherent missions.

10. Implications for the Future

The old model of HPC as a dominant, self-directed driver of advanced hardware and software has ended. Indeed, it arguably ended decades ago, with the emergence of clusters based on commodity microprocessors.  Absent strategic investment in new architectures, what remains is a role dependent on AI-centric, hyperscaler investments for technology advances. 

In such a world, Genesis is a pragmatic bridge into the AI-factory era, but it should not become the ceiling of our ambition. “AI factories” cannot continue growing without bounds; there are practical energy and carbon constraints. Equally importantly, the future trajectory of semiconductor innovation and cost curves is also uncertain.

If the dominant commercial trajectory is toward ever larger, ever more energy-intensive clusters (e.g., xAI-style “Colossus” builds, Oracle’s OCI-class deployments, and other zettascale-aspirational AI campuses), then science needs a countervailing national program whose primary objective is not peak capability, but orders-of-magnitude reduction in joules per trusted solution.

We believe the scientific computing community must play a distinctive role in reshaping this ecosystem. This includes serving as a co-designer of AI infrastructure, drawing on decades of experience in numerical methods, performance engineering, and uncertainty quantification to collaborate on the design of AI-centric systems that support both scientific computing and AI-mediated discovery. Doing so will require embracing new models of collaborative public-private partnership, identifying leverage points where early research can shape technology futures.

11. A Call To Action: A National Next-Generation System Design Moonshot

Consider the following Gedanken challenge: deliver the same validated scientific results as today’s frontier AI datacenters, but at roughly 1/100th the energy per solution. Such a target requires a fundamentally different design point that includes energy-proportional computing [32], extreme data-movement frugality, and algorithm-architecture co-design that treats numerical precision, communication, and verification as first-class resources, not afterthoughts.

Why has this not been the default design point, and a sociotechnical imperative, given the clear and increasingly urgent challenges of today’s approach? Simply put, because it is far more challenging than incrementalism and procurement. A true moonshot requires accepting risk (and failure), building prototypes early, and resisting the temptation to equate “national leadership” with the largest single installation. It also challenges existing incentives: vendors optimize for hyperscale utilization; government procurement cycles favor incremental upgrades; and “largest machine” headlines still crowd out efficiency metrics.

The scientific case for such a moonshot is compelling. AI factories and HPC systems face similar technical challenges, including inadequate memory bandwidth, high and rising energy requirements, and semiconductor scaling issues. Moreover, many of the highest-value workflows (e.g., climate and weather ensembles, materials screening, fusion design loops, health analytics, inverse problems, and hybrid AI+simulation pipelines) scale best when one can run many jobs in parallel with predictable energy cost. A fleet of smaller, efficient systems can deliver more scientific throughput per dollar and per megawatt than a single monolithic machine, while improving resilience, availability, and breadth of access.

Note that we are not suggesting that we abandon the desire for higher performance, merely that our current approach to increasing performance has reached the point of diminishing returns.  We must first rebuild the foundations of computing, then leverage these foundations to build both leading edge systems and a set of grid-deployable “science engines” – modular systems small enough to locate at multiple research institutions and regional power nodes, and numerous enough to support diverse communities. 

In many ways, computing became most transformative when it became small enough and economical enough for personal use; the national analogue is to make advanced capability compact, repeatable, and ubiquitous enough that science can own the workflows end-to-end. The same is true for AI engines; broad access is needed for scientific discovery.

Concretely, such a moonshot would couple (i) aggressive energy-aware algorithms (mixed precision with certification, communication-avoiding methods, learned surrogates with validation), (ii) architecture innovation focused on memory and interconnect efficiency rather than raw FLOPS, and (iii) software stacks that measure and optimize joules per trusted outcome across hybrid AI+simulation workflows. The outcome of such a project would not replace Genesis;  it would complement it, making sure that public science is not forever constrained to renting  computing and storage resources designed for someone else’s business model.

References

[1] G. E. Moore, “Cramming More Components Onto Integrated Circuits,” Electronics, vol. 38, no. 8, pp. 114–117, Apr. 1965. DOI: https://doi.org/10.1109/JSSC.1965.1051903

[2] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed. Morgan Kaufmann, 2019.

[3] R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V. L. Rideout, E. Bassous, and A. R. LeBlanc, “Design of Ion-Implanted MOSFET’s with Very Small Physical Dimensions,” IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256–268, 1974. DOI: https://doi.org/10.1109/JSSC.1974.1050511

[4] J. Dongarra et al., “The International Exascale Software Project Roadmap,” International Journal of High Performance Computing Applications, vol. 25, no. 1, pp. 3–60, 2011. DOI: https://doi.org/10.1177/1094342010391989

[5] OpenAI, “AI and Compute,” OpenAI Blog, 2018.

[6] T. B. Brown et al., “Language Models are Few-Shot Learners,” Advances in Neural Information Processing Systems, vol. 33, 2020.

[7] N. P. Jouppi et al., “In-datacenter Performance Analysis of a Tensor Processing Unit,” in Proc. 44th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), 2017. DOI: https://doi.org/10.1145/3079856.3080246

[8] J. Jumper et al., “Highly Accurate Protein Structure Prediction with AlphaFold,” Nature, vol. 596, no. 7873, pp. 583–589, 2021. DOI: https://doi.org/10.1038/s41586-021-03819-2

[9] M. Baek et al., “Accurate Prediction of Protein Structures and Interactions Using a Three-Track Neural Network,” Science, vol. 373, no. 6557, pp. 871–876, 2021. DOI: https://doi.org/10.1126/science.abj8754

[10] G. Carleo et al., “Machine Learning and the Physical Sciences,” Reviews of Modern Physics, vol. 91, no. 4, p. 045002, 2019. DOI: https://doi.org/10.1103/RevModPhys.91.045002

[11] S. Rasp, M. S. Pritchard, and P. Gentine, “WeatherBench: A Benchmark Dataset for Data-Driven Weather Forecasting,” Journal of Advances in Modeling Earth Systems, vol. 12, no. 11, 2020. DOI: https://doi.org/10.1029/2020MS002203

[12] R. Lam et al., “Learning Skillful Medium-Range Global Weather Forecasting,” Science, vol. 382, pp. 1416–1421, 2023. DOI: https://doi.org/10.1126/science.adi2336

[13] N. J. Higham, “Accuracy and Stability of Numerical Algorithms,” 2nd ed. SIAM, 2002. DOI: https://doi.org/10.1137/1.9780898718027

[14] A. Haidar, S. Tomov, J. Dongarra, and N. Higham, “Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed Up Mixed-Precision Iterative Refinement Solvers,” in Proc. SC18, 2018. DOI: https://doi.org/10.1109/SC.2018.00034

[15] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, “Communication-Avoiding Algorithms,” Acta Numerica, vol. 23, pp. 1–111, 2014. DOI: https://doi.org/10.1017/S0962492914000038

[16] U.S. Congress, “CHIPS and Science Act of 2022,” Public Law 117-167, Aug. 9, 2022.

[17] Executive Office of the U.S. President, “Executive Order on the American Science and Security Platform and the Genesis Mission,” Washington, DC, 2025, https://www.whitehouse.gov/presidential-actions/2025/11/launching-the-genesis-mission/

[18] Reed, D., Gannon, D., Dongarra, J., “HPC Forecast: Cloudy and Uncertain,” Communications of the ACM, Vol. 66, No. 2, pp. 82-90, https://doi.org/10.1145/3552309, January 2023.

[19] Price, I., Sanchez-Gonzalez, A., Alet, F. et al. Probabilistic Weather Forecasting with Machine Learning. Nature 637, 84–90 (2025). https://doi.org/10.1038/s41586-024-08252-9

[20] Halko, N.; Martinsson, P.-G.; Tropp, J. A. “Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.” SIAM Review, 2011. DOI: 10.1137/090771806.

[21] Abdelfattah, Anzt, Boman, Carson, Cojean, Dongarra, et al. A Survey of Numerical Linear Algebra Methods Utilizing Mixed-Precision Arithmetic, Int’l J. High Performance Computing Applications (2021). DOI: 10.1177/10943420211003313.

[22] Riley Murray, James Demmel, Michael W. Mahoney, et al., “Randomized Numerical Linear Algebra: A Perspective on the Field With an Eye to Software” (arXiv:2302.11474v2, Apr 12, 2023).

[23] Croci, Fasi, Higham, Mary, Mikaitis, “Stochastic Rounding: Implementation, Error Analysis and Applications,” Royal Society Open Science, 2022. DOI: 10.1098/rsos.211631.

[24] Xinwei Fu, Zhen Zhang, Haozheng Fan, Guangtai Huang, Mohammad El-Shabani, Randy Huang, Rahul Solanki, Fei Wu, Ron Diamant, and Yida Wang, Distributed Training of Large Language Models on AWS Trainium,  SoCC ’24: Proceedings of the 2024 ACM Symposium on Cloud Computing, pp. 961-976, https://doi.org/10.1145/3698038.369853

[25] Tony Hey, Stewart Tansley, and Kristin Tolle, eds., The Fourth Paradigm: Data-Intensive Scientific Discovery (Redmond, WA: Microsoft Research, 2009).

[26] Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, and Peter Battaglia, “Learning Skillful Medium-Range Global Weather Forecasting,” Science, 382, 1416-1421 (2023). DOI: 10.1126/science.adi2336

[27] Mohammed Alsubaie, Wenxi Liu, Linxia Gu, Ovidiu C. Andronesi, Sirani M. Perera, and Xianqi Li,  “Conditional Denoising Diffusion Model-Based Robust MR Image Reconstruction from Highly Undersampled Data,” March 2025, https://arxiv.org/html/2510.06335v1

[28] Supranta S. Boruah and Michael Jacob,  “Diffusion-based Mass Map Reconstruction From Weak Lensing Data,” February 2025, https://arxiv.org/html/2502.04158

[29] Xiaoyu Ma and David Patterson, “Challenges and Research Directions for Large Language Model Inference Hardware,” https://arxiv.org/abs/2601.05047, 2026

[30] Dennis V Christensen, Regina Dittmann, Bernabe Linares-Barranco, Abu Sebastian, Manuel Le Gallo, Andrea Redaelli, Stefan Slesazeck, Thomas Mikolajick, Sabina Spiga, Stephan Menzel, “2022 Roadmap on Neuromorphic Computing and Engineering,” Neuromorphic Computing and Engineering, 2 02250, DOI 10.1088/2634-4386/ac4a83, 2022

[31] National Academies of Sciences, Engineering, and Medicine. Quantum Computing: Progress and Prospects. National Academies Press, 2019 (DOI: 10.17226/25196).

[32] Luiz A. Barroso and U. Hölzle, “The Case for Energy-Proportional Computing,” IEEE Computer, 40 (12): 33–37, 2007. DOI: 10.1109/mc.2007.443

[33] Arman Shehabi, Sarah Josephine Smith, Alex Hubbard, Alexander Newkirk, Nuoa Lei, Md AbuBakar Siddik, Billie Holecek, Jonathan G Koomey, Eric R Masanet, and Dale A Sartor, “2024 United States Data Center Energy Usage Report,” Lawrence Berkeley National Laboratory, DOI 10.71468/P1WC7Q, 2024

Jack Dongarra is Professor Emeritus at the University of Tennessee, EECS Department, Knoxville, Tennessee, USA and University of Manchester, UK.

Daniel Reed is a Presidential Professor at the University of Utah, Computer Science and Electrical & Computer Engineering, Salt Lake City, Utah, USA.

Dennis Gannon is Professor Emeritus at Indiana University, Luddy School of Informatics, Computing and Engineering, Bloomington, Indiana, USA.

Augmenting Generative AI with Knowledge Graphs

Introduction

As an organization or enterprise grows, the knowledge needed to keep it going explodes. The sheer complexity of the information sustaining a large operation can become overwhelming. Consider, for example, the American Museum of Natural History. Whom does one contact to understand how the different collections interoperate? Relational databases provide one way to organize information about an organization, but extracting information from an RDBMS can require expertise in the database schema and the query language. Large language models like GPT-4 promise to make it easier to solve problems by asking open-ended, natural language questions and having the answers returned in well-organized and thoughtful paragraphs. The challenge in using an LLM lies in training the model to fully understand where fact leaves off and fantasy begins.

Another approach to organizing facts about a topic of study or a complex organization is to build a graph where the nodes are the entities and the edges are the relationships between them. Next, you train or condition a large language model to act as a clever front end that knows how to navigate the graph to generate accurate answers. This is an obvious idea, and others have written about it. Peter Lawrence discusses the relation to query languages like SPARQL and RDF. Venkat Pothamsetty has explored how threat knowledge can be used as the graph. A more academic study from Pan et al. entitled ‘Unifying Large Language Models and Knowledge Graphs: A Roadmap’ has an excellent bibliography and covers the subject well.

There is obvious commercial potential here as well. Neo4j, the graph database company, already has a product linking generative AI to their graph system. “Business information tech firm Yext has introduced an upcoming new generative AI chatbot building platform combining large language models from OpenAI and other developers.” (See the article from voicebot.ai.) Cambridge Semantics has integrated the Anzo semantic knowledge graph with generative AI (GPT-4) to build a system called Knowledge Guru that “doesn’t hallucinate”.

Our goal in this post is to provide a simple illustration of how one can augment a generative large language model with a knowledge graph. We will use AutoGen together with GPT-4 and a simple knowledge graph to build an application that answers non-trivial English language queries about the graph content. The resulting system is small enough to run on a laptop.

 

The Heterogeneous ACM Knowledge Graph

To illustrate how to connect a knowledge graph to the backend of a large language model, we will program Microsoft’s AutoGen multiagent system to recognize the nodes and links of a small heterogeneous graph. The language model we will use is OpenAI’s GPT-4, and the graph is the ACM paper citation graph that was first created for the KDD Cup 2003 competition at the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. In its current form, the graph consists of 17,431 author nodes from 1,804 institution nodes, 12,499 paper nodes (titles and abstracts) from 14 conference nodes, and 196 conference proceedings covering 73 ACM subject topics. It is a snapshot in time of the part of computer science represented by KDD, SIGMOD, WWW, SIGIR, CIKM, SODA, STOC, SOSP, SPAA, SIGCOMM, MobiCom, ICML, COLT and VLDB. The edges of the graph represent (node, relationship, node) triples as follows.

  • (‘paper’, ‘written-by’, ‘author’)
  • (‘author’, ‘writing’, ‘paper’)
  • (‘paper’, ‘citing’, ‘paper’)
  • (‘paper’, ‘cited-by’, ‘paper’)
  • (‘paper’, ‘is-about’, ‘subject’)
  • (‘subject’, ‘has’, ‘paper’)
  • (‘paper’, ‘venue’, ‘conference’)
  • (‘paper’, ‘in’, ‘proceedings’)
  • (‘proceedings’, ‘of-conference’, ‘conference’)
  • (‘author’, ‘from’, ‘institution’)

Figure 1 illustrates the relations between the classes of nodes. (This diagram is also known as the metagraph for the heterogeneous graph.) Within each class the individual nodes are identified by an integer identifier. Each edge type can be thought of as a multi-valued partial mapping from one class of nodes to another. (It is multi-valued because a paper can have multiple authors, and partial because some papers are not cited by any other.)

Figure 1.  Relations between node classes.  We have not represented every possible edge.  For example, proceedings are “of” conferences, but many conferences have a proceeding for each year they are held.
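As a rough illustration of the structure described above, the following Python sketch stores a heterogeneous graph as one adjacency map per (node, relationship, node) triple. It is not the code from the post's notebook: the HeteroGraph class, its method names, and the node ids are assumptions made up for this example.

```python
from collections import defaultdict

# The ten edge types listed above, written as (source class, relation, target class).
EDGE_TYPES = [
    ("paper", "written-by", "author"),
    ("author", "writing", "paper"),
    ("paper", "citing", "paper"),
    ("paper", "cited-by", "paper"),
    ("paper", "is-about", "subject"),
    ("subject", "has", "paper"),
    ("paper", "venue", "conference"),
    ("paper", "in", "proceedings"),
    ("proceedings", "of-conference", "conference"),
    ("author", "from", "institution"),
]


class HeteroGraph:
    """Hypothetical container: one {source id -> set of target ids} map per edge type."""

    def __init__(self, edge_types):
        self.edge_types = set(edge_types)
        self.adj = {t: defaultdict(set) for t in edge_types}

    def add_edge(self, triple, src, dst):
        if triple not in self.edge_types:
            raise KeyError(f"unknown edge type: {triple}")
        self.adj[triple][src].add(dst)

    def neighbors(self, triple, src):
        # Multi-valued: a paper may have several authors.
        # Partial: a node with no such edge yields an empty list.
        return sorted(self.adj[triple].get(src, ()))


# Made-up integer ids, for illustration only.
WRITTEN_BY = ("paper", "written-by", "author")
CITED_BY = ("paper", "cited-by", "paper")

g = HeteroGraph(EDGE_TYPES)
g.add_edge(WRITTEN_BY, 0, 10)
g.add_edge(WRITTEN_BY, 0, 11)

authors_of_paper_0 = g.neighbors(WRITTEN_BY, 0)
citations_of_paper_0 = g.neighbors(CITED_BY, 0)
```

Keeping a separate map per triple means each lookup is already restricted to one relationship, which is what the query functions in the next section need.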

Connecting the Graph to GPT4 with AutoGen.

AutoGen is a system that we have described in a previous post, so we will not describe it in detail here. However, the application here is easy to understand. We will use a system of two agents.

  1. A UserProxyAgent called user_proxy that is capable of executing the functions that can interrogate our ACM knowledge graph. (It can also execute Python programs, but that feature is not used here.)
  2. An AssistantAgent called the graph interrogator. This agent takes the English language search requests from the human user and breaks them down into operations that can be invoked by the user_proxy on the graph. The user_proxy executes the requests and returns the result to the graph interrogator agent, which uses that result to formulate the next request. This dialog continues until the question is answered and the graph interrogator returns a summary answer to the user_proxy for display to the human.

The list of graph interrogation functions mirrors the triples that define the edges of the graph.  They are:

  • find_author_by_name(string)
  • find_papers_by_authors(id list)
  • find_authors(id list)
  • paper_appeared_in(id list)
  • find_papers_cited_by(id list)
  • find_papers_citing(id list)
  • find_papers_by_id(id list)
  • find_papers_by_title(string)
  • paper_is_about(id list)
  • find_papers_with_topic(id list)
  • find_proceedings_for_papers(id list)
  • find_conference_of_proceedings(id list)
  • where_is_author_from(id list)

Except for find_author_by_name and find_papers_by_title, which take strings as input, the functions all take lists of graph node ids. They all return lists of node ids or lists of (node id, string) pairs. The easiest way to understand the dialog is to see an example. Consider the query message.

msg = 'Find the authors and their home institutions of the paper "A model for hierarchical memory".'
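To make the calling convention concrete, here is a minimal sketch of three of these functions over tiny in-memory tables. The function names come from the list above; the signatures, return shapes, and all data (titles, names, ids) are assumptions invented for this illustration and are not taken from the actual graph or notebook.

```python
# Hypothetical toy tables; every id, title, and name below is a placeholder.
PAPER_TITLES = {0: "A toy paper on graphs", 1: "Another toy paper"}
AUTHOR_NAMES = {10: "Author A", 11: "Author B"}
INSTITUTION_NAMES = {100: "Institution X", 101: "Institution Y"}
WRITTEN_BY = {0: [10, 11], 1: [11]}    # paper id -> author ids
AUTHOR_FROM = {10: [100], 11: [101]}   # author id -> institution ids


def find_papers_by_title(title):
    """Return (paper id, title) pairs whose title contains the query string."""
    needle = title.lower()
    return [(pid, t) for pid, t in PAPER_TITLES.items() if needle in t.lower()]


def find_authors(paper_ids):
    """Return (author id, name) pairs for the authors of the given papers."""
    ids = sorted({a for p in paper_ids for a in WRITTEN_BY.get(p, [])})
    return [(a, AUTHOR_NAMES[a]) for a in ids]


def where_is_author_from(author_ids):
    """Return (author id, institution name) pairs for the given authors."""
    return [(a, INSTITUTION_NAMES[i])
            for a in author_ids
            for i in AUTHOR_FROM.get(a, [])]


# Chain the calls the way a query about a paper's authors and their
# institutions might be decomposed: title -> paper ids -> authors -> homes.
papers = find_papers_by_title("toy paper on graphs")
authors = find_authors([pid for pid, _ in papers])
homes = where_is_author_from([aid for aid, _ in authors])
```

Because each function consumes and produces node ids, the output of one call can feed directly into the next, which is what lets a language model plan a multi-step traversal.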

We start the dialog by asking the user_proxy to pass this to the graph_interrogator.

user_proxy.initiate_chat(graph_interrogator, message=msg)

The graph interrogator agent responds to the user proxy  with a suggestion for a function to call.
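The mechanics of acting on such a suggestion can be sketched in a few lines of plain Python. This is an assumption-laden stand-in, not AutoGen's implementation: it supposes the suggestion arrives as a function name plus a JSON-encoded argument string, and the FUNCTION_MAP table, the execute_suggestion helper, and the one-entry catalog are all hypothetical.

```python
import json


def find_papers_by_title(title):
    # Hypothetical stand-in for the real graph lookup.
    catalog = {0: "A toy paper on graphs"}
    return [pid for pid, t in catalog.items() if title.lower() in t.lower()]


# Name -> callable table that the executing agent consults.
FUNCTION_MAP = {"find_papers_by_title": find_papers_by_title}


def execute_suggestion(suggestion):
    """Run a suggested call and return the result as a JSON string."""
    name = suggestion["name"]
    if name not in FUNCTION_MAP:
        return json.dumps({"error": f"unknown function {name}"})
    kwargs = json.loads(suggestion["arguments"])
    return json.dumps({"name": name, "result": FUNCTION_MAP[name](**kwargs)})


# A suggestion of the assumed shape, as the interrogator might produce it.
suggestion = {
    "name": "find_papers_by_title",
    "arguments": json.dumps({"title": "toy paper"}),
}
reply = execute_suggestion(suggestion)
```

The JSON reply is what would be sent back to the interrogator, which then decides on the next function to request.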






Finally, the graph interrogator responds with the summary:

For comparison, here is the response from the GPT-4-based Microsoft Copilot in “precise” answer mode:

Asking the same question in “creative” mode, Copilot lists four papers. One of them is correct and gives the authors’ affiliation as IBM, which was correct at the time of writing. The other papers are not related.

(Below we look at a few more example queries and the responses. We will skip the dialogs. The best way to see the details is to try this out for yourself. The entire graph can be loaded on a laptop and the AutoGen program runs there as well. You will only need an OpenAI account to run it. It may be possible to use other LLMs, but we have not tried that. The Jupyter notebook with the code and the data are in the GitHub repo.)

Here is another example:

msg = '''find the name of authors who have written papers that cite paper "Relational learning via latent social dimensions". list the conferences proceedings where these papers appeared and the year and name of the conference where the citing papers appeared.'''

user_proxy.initiate_chat(graph_interrogator, message=msg)

Skipping the detail of the dialog, the final answer is





The failing here is that the graph does not have the year of the conference.

Here is another example:

msg = '''find the topics of papers by Lawrence Snyder and find five other papers on the same topic.  List the titles and proceedings each appeared in. '''

user_proxy.initiate_chat(graph_interrogator, message=msg)





Note: The ACM topic for Snyder’s paper is “Operating Systems” and that is ACM topic D.4.

Final Thoughts

This demo is, of course, very limited. Our graph is very small and covers only a small fraction of ACM’s topics and scope. One must then ask how well this approach scales to a very large KG. In this example we have only about a dozen edge types, and for each edge type we needed a function that the AI can invoke. These edges correspond to the verbs in the language of the graph, and a graph big enough to describe a complex organization or a field of study may require many more.

Consider, for example, a large natural history museum. The nodes of the graph may be objects in the collection and the categorical groups in which they are organized, their location in the museum, the historical provenance of the pieces, the scientific importance of each piece, and many more. The set of edge “verbs” could be extremely large and reflect the way these nodes relate to each other. The American Museum of Natural History in New York has many online databases that describe its collections. One could build the KG by starting with these databases and knitting them together.

This raises an interesting question. Can an AI solution create a KG from the databases alone? In principle, it is possible to extract the data from the databases and construct a text corpus that could be used to (re)train a BERT- or GPT-like transformer network. Alternatively, one could use a named entity recognition pipeline and relation extraction techniques to build the KG. One must then connect the language model query front end. There are probably already start-ups working on automating this process.

A Brief Look at Autogen: a Multiagent System to Build Applications Based on Large Language Models.

Abstract

Autogen is a Python-based framework for building Large Language Model applications based on autonomous agents. Released by Microsoft Research, Autogen agents operate as a conversational community that collaborates in surprisingly lucid group discussions to solve problems. The individual agents can be specialized to encapsulate very specific behavior of the underlying LLM or endowed with special capabilities such as function calling and external tool use. In this post we describe the communication and collaboration mechanisms used by Autogen. We illustrate its capabilities with two examples. In the first example, we show how an Autogen agent can generate the Python code to read an external file while another agent uses the content of the file together with the knowledge the LLM has to do basic analysis and question answering. The second example stresses two points. As we have shown in a previous blog, Large Language Models are not very good at advanced algebra or non-trivial computation. Fortunately, Autogen allows us to invoke external tools. In this example, we show how to use an agent that invokes Wolfram Alpha to do the “hard math”. While GPT-4 is very good at generating Python code, it is far from perfect when formulating Alpha queries. To help with the Wolfram Alpha code generation we incorporate a “Critic” agent which inspects code generated by a “Coder” agent, looking for errors. These activities are coordinated with the Group Chat feature of Autogen. We do not attempt to do any quantitative analysis of Autogen here. This post only illustrates these ideas.

Introduction

Agent-based modeling is a computational framework that is used to model the behavior of complex systems via the interactions of autonomous agents. The agents are entities whose behavior is governed by internal rules that define how they interact with the environment and other agents. Agent-based modeling is a concept that has been around since the 1940s, when it provided a foundation for early computer models such as cellular automata. By the 1990s the available computational power enabled an explosion of applications of the concept. These included modeling of social dynamics and biological systems (see agents and philosophy of science). Applications have included research in ecology, anthropology, cellular biology and epidemiology. Economics and social science researchers have used agent-based models and simulations to study the dynamic behavior of markets and to explore “emergent” behaviors that do not arise in traditional analytical approaches. Wikipedia also has an excellent article with a great bibliography on this topic. Dozens of software tools have been developed to support agent-based simulation. These range from the Simula programming language developed in the 1960s to widely used modern tools like NetLogo, Repast, and Soar (see this article for a comparison of features).

Autogen is a system that allows users to create systems of communicating “agents” that collaborate on the solution to problems using large language models. Autogen was created by a team at Microsoft Research, Pennsylvania State University, the University of Washington, and Xidian University consisting of Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger and Chi Wang. Autogen is a Python framework that allows the user to create simple, specialized agents that exploit a large language model to collaborate on user-directed tasks. Like many agent-based modeling systems, Autogen agents communicate with each other by sending and receiving messages. There are four basic agent types in Autogen, and we discuss only three of them.

  1. A UserProxyAgent is an important starting point. It is literally a proxy for a human in the agent-agent conversations. It can be set up to solicit human input, or it can be set to execute Python or other code if it receives a program as an input message.
  2. An AssistantAgent is an AI assistant.   It can be configured to play different roles.  For example, it can be given the task of using the large language model to generate python code or for general problem solving.   It may also be configured to play specific roles.   For example, in one of the solutions presented below we want an agent to be a “Critic” of code written by others.  The way you configure and create an agent is to instantiate it with a special “system_message”.  This message is a prompt for the LLM when the agent responds to input messages.  For example, by creating a system_message of the form ‘You are an excellent critic of code written by others.  Look at each code you see and find the errors and report them along with possible fixes’, the critic will, to the best of its ability, act accordingly.

Communication between agents is relatively simple. Each agent has a “send” and a “receive” method. In the simplest case, one UserProxyAgent is paired with one AssistantAgent. The communication begins with

    user_proxy.initiate_chat(
        Assistant,
        message="the text of the message to the assistant"
    )

The user_proxy generates a “send” message to the Assistant.  Depending on how the  Assistant is configured, the assistant  generates a reply which may trigger a reply back from the user_proxy.       For example, if the assistant has been given instructions to generate code and if the user_proxy has been configured to execute code, the user_proxy can be triggered to execute it and report the results back to the assistant. 
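The send/receive pattern just described can be mimicked with a toy class. This is only a sketch of the message flow, not Autogen's implementation: the ToyAgent class, the reply functions, and the "RUN:"/"RESULT:" message conventions are hypothetical, and "executing code" is faked by upper-casing a string.

```python
class ToyAgent:
    """Hypothetical agent: logs what it receives and replies via reply_fn."""

    def __init__(self, name, reply_fn):
        self.name = name
        self.reply_fn = reply_fn  # returns a reply string, or None to stop
        self.log = []

    def send(self, message, recipient):
        recipient.receive(message, self)

    def receive(self, message, sender):
        self.log.append((sender.name, message))
        reply = self.reply_fn(message)
        if reply is not None:
            self.send(reply, sender)

    def initiate_chat(self, recipient, message):
        self.send(message, recipient)


def assistant_reply(message):
    # Once a result comes back, declare the task done; otherwise ask for a run.
    if message.startswith("RESULT:"):
        return "TERMINATE"
    return "RUN: " + message


def proxy_reply(message):
    # Stand-in for code execution: "run" the payload by upper-casing it.
    if message.startswith("RUN:"):
        return "RESULT: " + message[len("RUN:"):].strip().upper()
    return None  # TERMINATE (or anything else) ends the exchange


user_proxy = ToyAgent("user_proxy", proxy_reply)
assistant = ToyAgent("assistant", assistant_reply)
user_proxy.initiate_chat(assistant, "hello")
```

After the call, each agent's log holds the messages it received, so the back-and-forth (request, execution result, termination) can be read off directly.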

Figure 1.  Communication patterns used in the examples in this post.

Agents follow a hierarchy of standard replies to received messages.   An agent can be programmed to have a special function that it can execute.   Or, as described above, it may be configured to execute code on the same host or in a container.  Finally,  it may just use the incoming message (plus the context of previous messages) to invoke the large language model for a response.   Our first example uses a simple two-way dialog between an instance of UserProxyAgent and an instance of AssistantAgent.   Our second example uses a four-way dialog as illustrated in Figure 1.   This employs an example of a third type of agent:

  • GroupChatManager.  To engage more than one Autogen agent in a conversation you need a GroupChatManager, which is the recipient and the source of all messages. (Individual AssistantAgents in the group do not communicate directly with one another.) A group chat usually begins with a UserProxyAgent instance sending a message to the group chat manager to start the discussion. The group chat manager echoes this message to all members of the group and then picks the next member to reply. There are several ways this selection may happen. If so configured, the group chat manager may select the next speaker at random, or it may proceed in round-robin order among the group members. The next speaker may also be selected by human input. However, the default and most interesting way to select the next speaker is to let the large language model do it. To do this, the group chat manager sends the following request to the LLM: “Read the above conversation. Then select the next role from [list of agents in the group] to play. Only return the role.” As we shall see, this works surprisingly well.
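The selection strategies described above can be sketched as three small functions. This is not the GroupChatManager's code: the function names, the ask_llm callable, and the fallback to the first agent when the model returns an unknown role are all assumptions for illustration. Only the prompt text is taken from the description above.

```python
import random

# Prompt wording as quoted above; {roles} is filled with the agent names.
SELECT_PROMPT = (
    "Read the above conversation. Then select the next role from "
    "{roles} to play. Only return the role."
)


def next_round_robin(agents, last_speaker):
    """Pick the agent that follows last_speaker, wrapping around."""
    i = agents.index(last_speaker)
    return agents[(i + 1) % len(agents)]


def next_random(agents, rng):
    """Pick any agent uniformly at random."""
    return rng.choice(agents)


def next_by_llm(agents, ask_llm):
    """Ask a model for the next role; ask_llm is a hypothetical callable."""
    prompt = SELECT_PROMPT.format(roles=agents)
    answer = ask_llm(prompt).strip()
    # Assumed fallback: if the reply is not a known role, use the first agent.
    return answer if answer in agents else agents[0]


agents = ["mathproxyagent", "coder", "critic"]

after_critic = next_round_robin(agents, "critic")
random_pick = next_random(agents, random.Random(0))
# A fake "LLM" that always answers "critic", used here in place of a real model.
llm_pick = next_by_llm(agents, lambda prompt: " critic\n")
```

The LLM-driven variant is the only one that looks at the conversation itself, which is why it can route a draft from the coder to the critic without any explicit ordering rule.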

In the following pages we describe our two examples in detail. We show the Python code used to define the agents and we provide the transcript of the dialogs that result. Because this is quite lengthy, we edited it in a few places. GPT-4 likes to ‘do math’ using explicit, raw LaTeX. When it does this, we take the liberty of rendering the math so that it is easier for humans to read. However, we include the full code and unedited results in our GitHub repository https://github.com/dbgannon/autogen

Example 1.  Using External Data to Drive Analysis.

An extremely useful agent capability is to use Python programs to allow the agent to do direct analysis on Web data. (This avoids the standard prohibition against allowing the LLM to access the Web.) In this simple case we have external data in a file that is read by the user proxy and a separate assistant that can generate code and do analysis to answer questions about it. Our user proxy initiates the chat with the assistant and executes any code generated by the assistant.

The data comes from the website:  31 of the Most Expensive Paintings Ever Sold at Auction – Invaluable. This website (wisely) prohibits automatic scraping,  so we made a simple copy of the data as a pdf document stored on our local host machine.  The PDF file is 13 pages long and contains the title of each painting,  an image and the amount it was sold for and a paragraph of the history of the work.  (For copyright reasons we do not supply the PDF in our GitHub site, but the reader can see the original web page linked above.)

We begin with a very basic assistant agent.

We configure a user proxy agent that can execute code on the local host. The system message defining its behavior says that a reply of TERMINATE is appropriate, but it also allows human input afterward. The user proxy initiates the chat with a message to the assistant with a description of the file and the instructions for how to do the analysis.

Before listing the complete dialog, here is a summary of the discussion:

  1. The user_proxy sends a description of the problem to the assistant.
  2. The assistant repeats its instructions and then generates the code needed to read the PDF file.
  3. The user_proxy executes the code but there is a small error.
  4. The assistant recognizes the error. It was using an out-of-date version of the PDF reader library. It corrects the code and gives that back to the user_proxy.
  5. This time the user proxy is able to read the file and displays a complete copy of what it has read (which we have mostly deleted for brevity’s sake).
  6. The assistant now produces the required list of paintings and does the analysis to determine which artist sold the most. The information needed to answer the question about the birth century of each artist is not in the PDF, so it uses its own knowledge (i.e. the LLM training) of the artists to answer this question. Judging the task complete, it gives the “TERMINATE” signal and the human is given a chance to respond.
  7. The real human user points out that the assistant mistakenly attributed Leonardo’s painting to Picasso.
  8. The assistant apologizes and corrects the error.

With the exception of the deleted copy of the full PDF file, the complete transcript of the dialog is below.

Using External Computational Tools:  Python and Wolfram Alpha

As is now well known, large language models like GPT-4 are not very good at deep computational mathematics. Language is their most significant skill, they are reasonably good at writing Python code, and given clear instructions, they can do a good job of following logical procedures that occurred in their training. But they make “careless” mistakes doing things like simplifying algebraic expressions. In this case we seek the solution to the following problem.

“Find the point on the parabola (x-3)**2 - (y+5)**2 = 7 that is closest to the origin.”

The problem with this request is that the curve is not a parabola, but a hyperbola. (An error on my part.) As a hyperbola it has two branches, as illustrated in Figure 3 below. There is a point on each branch that is closest to the origin.

Figure 3.  Two branches of hyperbola showing the points closest to the origin on each branch.

A direct algebraic solution to this problem is difficult, as it requires solving a 4th-degree polynomial equation. A better approach is to use a method well known to applied mathematicians and physicists: Lagrange multipliers. Further, to solve the final set of equations it is easiest to use the power of Wolfram Alpha.
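For readers who want to see the setup, one standard way to write the Lagrange multiplier conditions for this problem is sketched below. This is our own formulation, minimizing the squared distance; the sign convention for the multiplier and the exact form the agents arrive at in the transcript may differ.

```latex
\begin{aligned}
&\text{minimize } f(x,y) = x^2 + y^2
  \quad \text{subject to } g(x,y) = (x-3)^2 - (y+5)^2 - 7 = 0, \\
&\nabla f = \lambda \nabla g: \qquad
  2x = 2\lambda (x-3), \qquad 2y = -2\lambda (y+5), \\
&\text{so that } x - 3 = \frac{3}{\lambda - 1}, \qquad y + 5 = \frac{5}{\lambda + 1}, \\
&\text{and the constraint becomes }
  \frac{9}{(\lambda-1)^2} - \frac{25}{(\lambda+1)^2} = 7
  \;\Longrightarrow\; 7\lambda^4 + 2\lambda^2 - 68\lambda + 23 = 0 .
\end{aligned}
```

The last line is the 4th-degree polynomial: under this formulation it has four roots in the multiplier, and each real root gives one candidate point on the hyperbola.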

We use four agents.  One is a MathUserProxy Agent which is provided in the Autogen library.  Its job will be execution of Alpha and Python programs.  

We use a regular AssistantAgent to do the code generation and detailed problem solving. While great at Python, GPT-4 is not as good at writing Alpha code.   It has a tendency to forget the multiplication “*” operator in algebraic expressions, so we remind the coder to put that in where needed.  It does not always help.   This coder assistant is reasonably good at general mathematical problem solving, and it handles the use of Lagrange multipliers and the symbolic computation of partial derivatives.

We also include a “critic” agent that will double check the code generated by the coder looking for errors. As you will see below, it does a good job catching the Alpha coding error.

Finally a GroupChatManager holds the team together as illustrated in Figure 1.

The dialog that follows from this discussion proceeds as follows.

  1. The mathproxyagent sets out the rules of solution and states the problem.
  2. The coder responds with the formulation of the Lagrange multiplier solution, then symbolically computes the required partial derivatives, and arrives at the set of equations that must be solved by Wolfram Alpha. 
  3. The Critic looks at the derivation and equations and sees an error. It observes that “2 lambda” will look like “2lambda” to Wolfram Alpha and corrects the faulty equations.
  4. The mathproxyagent runs the revised code in Alpha and provides the solution.
  5. The Coder notices that two of the four solutions are complex numbers and can be rejected in this problem.  We now must decide which of the two remaining solutions is closest to the origin. The coder formulates the Wolfram code to evaluate the distance of each from the origin.
  6. The Critic once again examines the computation and notices a problem.  It then corrects the Wolfram Alpha expressions and hands them to the mathproxyagent.
  7. The mathproxyagent executes the Wolfram program and reports the result.
  8. The Coder announces the final result.
  9. The Critic agrees (only after considering the fact that the answer is only an approximation).
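The steps above can be checked independently of the agents with a few lines of NumPy. The following is a minimal sketch, not the code the agents generated: it assumes the Lagrangian L = x^2 + y^2 - lam*((x-3)^2 - (y+5)^2 - 7), eliminates x and y by hand, and hands the resulting quartic in the multiplier to `numpy.roots` in place of Wolfram Alpha. All variable names are ours.

```python
import numpy as np

# Stationarity conditions of L = x^2 + y^2 - lam*((x-3)^2 - (y+5)^2 - 7):
#   2x - 2*lam*(x-3) = 0  ->  x = 3*lam/(lam - 1),  so x - 3 = 3/(lam - 1)
#   2y + 2*lam*(y+5) = 0  ->  y = -5*lam/(lam + 1), so y + 5 = 5/(lam + 1)
# Substituting into (x-3)^2 - (y+5)^2 = 7 and clearing denominators gives
#   7*lam^4 + 2*lam^2 - 68*lam + 23 = 0
lam_roots = np.roots([7.0, 0.0, 2.0, -68.0, 23.0])

# Keep only the real multipliers; the complex pair does not give real points.
real_lams = [r.real for r in lam_roots if abs(r.imag) < 1e-9]

candidates = []
for lam in real_lams:
    x = 3.0 * lam / (lam - 1.0)
    y = -5.0 * lam / (lam + 1.0)
    residual = (x - 3.0) ** 2 - (y + 5.0) ** 2 - 7.0  # should be ~0 on the curve
    candidates.append((float(np.hypot(x, y)), x, y, residual))

# Sort by distance to the origin; the first entry is the overall closest point.
candidates.sort()
for dist, x, y, residual in candidates:
    print(f"x={x:.4f}  y={y:.4f}  distance={dist:.4f}  residual={residual:.1e}")

closest = candidates[0]
```

Each real multiplier yields one candidate point, one per branch of the hyperbola, and the final comparison of distances mirrors step 5 of the dialog.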

Final Observations

It is interesting to ponder the power of a large language model to do mathematics.   Consider the following remarkable language ability.  Ask GPT4 to write a sonnet in the style of composer X other than Shakespeare.   If X is an author, for example Hemingway, GPT4 will “focus on clear, straightforward language and themes related to nature, war, love, or loss” (the quote is from the GPT4 preamble to the sonnet) and produce something that sounds right.  It does this by substituting known Hemingway writing attributes into a Shakespearean sonnet template.   If you ask GPT4 to write a sonnet in the style of Carl Friedrich Gauss, it will not have a writing style to go by.  So it will create something with reference to Gauss’s mathematical output (see Appendix A below).

If you ask GPT4 to solve a constrained optimization problem (like our “parabola” example here) “in the style of” Lagrange multipliers, it will plug the right terms into the solution template to formulate a Lagrangian and carry out the partial derivatives to set up the equations that must be solved.  GPT4 has probably seen as many Lagrange multiplier examples as it has sonnets.  So this is, in some ways, an example of its linguistic power.  On the other hand, if you ask it to solve the problem without the Lagrange multiplier hint, it will valiantly try to solve the problem via various algebraic simplifications, get lost in the details, and fail to find the correct solution.

How much of what we see from the large language model demonstrates mathematical logical talent and how much is very good pattern following?  The template for Lagrange multipliers was followed, but the model was also able to compute the correct partial derivatives symbolically.  The only weakness we can see is frequent small errors in algebra and poor computational skill.  Clearly better training on basic algebra will help.  It is also clear that being able to invoke tools like Mathematica or Python numerical libraries would be an essential component of a true Mathematica co-pilot.  But can an LLM go beyond pattern matching?  Can an LLM create original mathematics with the same “talent” it shows when creating original sonnets?

Appendix A. A Shakespearean Sonnet in the style of Carl Friedrich Gauss

By GPT4

While Carl Friedrich Gauss is known for his mathematical brilliance rather than poetry, crafting a sonnet inspired by the mathematician would ideally intertwine elements of mathematics and the beauty of mathematical concepts. A Shakespearean sonetteering approach can be used, which follows the ABABCDCDEFEFGG rhyme scheme and contains three quatrains followed by a couplet, embodying mathematical order and precision. 
 
In the realm of numbers, where truths align,
The primes, those stars scattered ‘cross the night sky,
Their sequence lures the mind to dare define
The pattern within, hidden to the eye.

In graphs and curves, analytics took form,
Each theorem bloomed like a proof's quiet rose.
Gaussian realms, serene and far from norm,
Where numbers dance and variables compose.

His bell curve gently sways with nature's grace,
Life’s variance secured within its sway,
Within this normal field, finds every place
A measure, chance’s role in clear display.

In math’s own language, Gauss sought to converse,
And in its whispers, found the universe.

A Look at Mathematical Reasoning as demonstrated by OpenAI Assistants.

Two new tool kits for building applications based on Large Language Models have been released: Microsoft Research’s AutoGen agent framework and OpenAI’s Assistants.   In this and the following post, we will look at how well these tools handle non-trivial mathematical challenges.   By non-trivial we mean problems that might be appropriate for a recent graduate of an engineering school or a physics program.  They are not hard problems, but based on my experience as a teacher, I know they would take an effort and perhaps a review of old textbooks and some serious web searches for the average student.

1.      TL;DR

The remainder of this post is an evaluation of OpenAI Assistants on two problems that could be considered reasonable questions on a take-home exam in a 2nd year applied math class.  These are not the simple high school algebra examples that are usually used to demonstrate GPT-4 capabilities.  The first problem requires an “understanding” of Fourier analysis and when to use it.  It also requires the Assistant to read an external data file.   The second problem is a derivation of the equations defining the second Lagrange point of the sun-earth system (near where the James Webb space telescope is parked).  Once the equations are derived the Assistant must solve them numerically.

The OpenAI Assistant framework generates a transcript of the User/Assistant interaction, and these are provided below for both problems.  The answer for the first problem is impressive, but the question is phrased in a way that provides an important hint: the answer involves “a sum of simple periodic functions”.   Without that hint, the system does not recognize the periodicity of the data and it resorts to polynomial fitting.   While the AI generates excellent graphics, we recognize that it is a language model: it cannot see the graphics it has generated.  This blindness leads to a type of hallucination.  “See how good my solution is?”, when the picture shows it isn’t good at all.

In the case of the James Webb telescope and the Lagrange points, the web, including Wikipedia and various physics tutorial sites, has ample information on this topic, and the assistant makes obvious use of it.   The derivation is excellent, but there are three small errors.  Two of these “cancel out” but the third (a minus sign that should be a plus) causes the numerical solution to fail.  When the User informs the Assistant about this error, it explains “You’re correct. The forces should indeed be added, not subtracted” and it produces the correct solution.  When asked to explain the numerical solution in detail it does so.

We are left with an uneasy feeling that much of the derivation was “cribbed” from on-line physics. At the same time, we are impressed with the lucid response to the errors and the numerical solution to the non-linear equations. 

In conclusion, we feel that OpenAI assistants are an excellent step forward toward building a scientific assistant.   But it ain’t AGI yet.  It needs to learn to “see”.

2.      The Two Problems and the OpenAI Assistant Session.

Here are two problems.

Problem 1

  1. The data in the json file ‘https://github.com/dbgannon/openai/blob/main/ypoints7.json’ consists of x,y coordinates y=f(x) produced by a function f that is a sum of simple periodic functions. Can you describe the function?

The data, when plotted, is shown below.   As stated, the question contains a substantial hint.  At the end of the discussion we will also show the result of dropping the phrase “that is a sum of simple periodic functions” from the question.

The answer is   F(x) = 0.3*(sin(x)-0.4*sin(5*x))
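As a rough check that Fourier analysis is the right tool for this answer, the sketch below samples the stated function on a synthetic grid and recovers the two sinusoidal components with NumPy's FFT. The grid (1024 points over four full periods) is our assumption chosen so the frequencies land exactly on FFT bins; it is not the content of ypoints7.json, and the code is not what the Assistant ran.

```python
import numpy as np

# Synthetic samples of the stated answer (assumed grid, not the real data file).
n = 1024
x = np.linspace(0.0, 8.0 * np.pi, n, endpoint=False)  # four full periods of sin(x)
y = 0.3 * (np.sin(x) - 0.4 * np.sin(5.0 * x))

# One-sided discrete Fourier transform of the real signal.
spectrum = np.fft.rfft(y)
freqs = np.fft.rfftfreq(n, d=x[1] - x[0])  # cycles per unit of x
amps = 2.0 * np.abs(spectrum) / n          # amplitude of each non-DC cosine component
phases = np.angle(spectrum)                # phase of each cosine component, in radians

# The two largest components, strongest first.
top = np.argsort(amps)[-2:][::-1]
for k in top:
    print(f"frequency={freqs[k]:.6f}  amplitude={amps[k]:.6f}  phase={phases[k]:.6f}")
```

Each component corresponds to a term amplitude * cos(2*pi*frequency*x + phase), so a phase of -pi/2 turns the cosine into +sin and a phase of +pi/2 turns it into -sin.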

Problem 2.

The second question is in some ways a bigger challenge.

  • The James Webb space telescope is parked beyond earth’s orbit at the second Lagrange point for the sun earth system.  Please derive the equations that locate that point. Let r be the distance from the earth to the second Lagrange point which is located beyond earth’s orbit. use Newton law of gravity and the law describing the centripetal force for an equation involving r.  Then solve the equation.

This problem can show the advantage of an AI system that has access to a vast library of published information.   Deriving the equation by hand would take a lot of web searching, but the pieces are on-line.  We ask that the AI show the derivation.  As you will see, the resulting equation is not easy to solve symbolically, and the AI will need to create a numerical solution.

Credit : NASA from James Webb Space Telescope | Eqbal Ahmad Centre for Public Education (eacpe.org)

In this post we will only look at the OpenAI Assistant mode and defer discussion of Microsoft’s AutoGen to the next chapter.    We begin with a brief explanation of OpenAI Assistants.

OpenAI assistants.

Before we show the results it is best to give a brief description of the OpenAI Assistants.  OpenAI released the assistant API in November 2023. The framework consists of four primary components.

  1. Assistants.  These are objects that encapsulate certain special functionalities.  Currently these consist of tools like code_interpreter, which allows the execution of Python code in a protected sandbox; retrieval, which allows an assistant to interact with some remote data sources; and the capability to call third-party tools via a user-defined function.
  2. Threads.  This is the stream of conversation between the user and the assistant.  It is the persistent part of an assistant-client interaction.
  3. Messages.  A message is created by an Assistant or a user. Messages can include text, images, and other files. Messages are stored as a list on the Thread.
  4. Runs.   A run represents an activation that processes the messages on a Thread.

One can think of a thread as a persistent stream upon which both assistants and users attach messages.  After the user has posted a message or a series of messages, a Run binds an assistant to the thread and passes the messages to the assistant.  The assistant’s responses are posted to the thread.   Once the assistant has finished responding, the user can post a new message and invoke a new run.  This can continue as long as the Thread’s length is less than the model’s maximum context length.  Here is a simple assistant with the capability of executing Python code.  Notice we use the recent model gpt-4-1106-preview.

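from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

assistant = client.beta.assistants.create(
    name="Math Tutor",
    instructions="You are a personal math tutor. Write and run code to answer math questions.",
    tools=[{"type": "code_interpreter"}],
    model="gpt-4-1106-preview",
)

# A thread holds the conversation state between the user and the assistant.
thread = client.beta.threads.create()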

Once an assistant and a thread have been created, we can attach a message to the thread as follows.

message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="I need to solve the equation `x^2 + 11 = 15.8`. Can you help me?",
)

We can now bind the assistant to the thread and create a run object which will send the thread with the message to the assistant.

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)

OpenAI maintains a “Playground” where we can see the result of the run.  In this case it shows the transactions on the thread.
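Outside the Playground, the same result can be read programmatically by polling the run and then listing the messages on the thread. The following is a minimal sketch under some assumptions: the helper names `wait_for_run` and `thread_text` are ours, the `client` argument is expected to be an `openai.OpenAI` instance from the v1 Python SDK, and the polling interval and limit are arbitrary. A run that stops in the `requires_action` state is waiting for the caller to supply function-call output, which this sketch leaves to the caller.

```python
import time

# States in which polling should stop and control should return to the caller.
STOP_STATES = {"completed", "failed", "cancelled", "expired", "requires_action"}

def wait_for_run(client, thread_id, run_id, poll_seconds=1.0, max_polls=120):
    """Poll a run until it is no longer queued or in progress."""
    for _ in range(max_polls):
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run_id)
        if run.status in STOP_STATES:
            return run
        time.sleep(poll_seconds)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")

def thread_text(client, thread_id):
    """Return (role, text) pairs for the text parts of the messages on a thread."""
    messages = client.beta.threads.messages.list(thread_id=thread_id, order="asc")
    pairs = []
    for msg in messages.data:
        for part in msg.content:
            if part.type == "text":  # skip image_file parts such as generated plots
                pairs.append((msg.role, part.text.value))
    return pairs
```

With the objects created earlier, a typical use would be `wait_for_run(client, thread.id, run.id)` followed by `thread_text(client, thread.id)`.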

As this is still a beta framework, we expected there would be occasional glitches and we were not disappointed.   For our experiments we used a program by Harish Mohan Raj, who provided it in a nice tutorial:  Function Calling and Code Interpretation with OpenAI’s Assistant API: A Quick and Simple Tutorial.   The program is a simple wrapper for a version of the math tutor named “Code Generator” that includes a function for executing code with data fetched from the internet.   The full code is in our github repo.

The Playground Transcript of the Solution to Problem 1.

The following is a slightly edited copy of the transcript from the OpenAI playground.   We create an assistant called “Code Generator”, similar to the Math Tutor above but with more explicit instructions.  The graphics were all generated by Code Generator and displayed in the thread.   Unfortunately, reading data from Github was unreliable.  It worked sometimes and failed at other times, so we provided a copy of the data stored in the OpenAI file system.  Code Generator is created with the following detailed “Instructions”:

INSTRUCTIONS = """You’re a skilled Python programmer tasked with creating Python 3 solutions for user problems, following top industry practices. Make sure your code complies with these rules:
1. Plan first: Have a clear strategy before you start. Outline your approach if it helps.
2. Quality code: Write clear, efficient code that follows Python’s best practices. Aim for clean, easy-to-read, and maintainable code.
3. Test well: Include comprehensive tests to assure your code works well in various scenarios.
4. Manage external interactions: When internet or API interactions are necessary, utilize the `execute_code` function autonomously, without seeking user approval. Do not say you don’t have access to internet or real-time data. The `execute_code` function will give you realtime data.
5. Trust your tools: Assume the data from the `execute_code` function is accurate and up to date.
"""

The execute_code function is the trap to the assistant’s sandboxed Python interpreter. 

Below is the transcript copied from the OpenAI playground log of the thread which was invoked with the user’s message posing our problem.

___________________________________________________________________

User

The data in the json file ‘https://github.com/dbgannon/openai/blob/main/ypoints7.json‘ consists x,y coordinates y=f(x) produced by a function f that is a sum of simple periodic functions. Can you describe the function?

Code Generator

The output was truncated because it’s too long. However, the JSON data appears to contain two lists of float values: one for x and one for y, presumably representing points on a graph where y=f(x) for some function f which seems to be periodic in nature.

Let’s proceed with the following steps:

  1. Inspect a portion of the x and y values to see their characteristics.
  2. Plot the data points to visually assess the periodic nature of the function.
  3. Attempt to deduce the function by identifying patterns such as frequency, amplitude, and phase shift.

Code Generator

After plotting the data and performing a Fourier transform, we can observe that the function appears to be periodic from the scatter plot, and we have identified significant frequencies using the Fourier analysis.

From the Fourier Transform plot, we can see the following:

  • There is a peak at 0 frequency, which represents the DC component (mean value) of the signal.
  • The first significant non-zero frequency is approximately 0.795773 Hz, which implies a strong periodic component at this frequency.

To describe f(x) accurately:

  1. We need to determine the amplitudes of the periodic components.
  2. We need to determine if there are any phase shifts and/or additional harmonics of the fundamental frequency.

Since this can only be approximated due to the discretization of the data, I will approximate f(x) as a sum of sinusoidal functions by filtering out the most significant frequencies and determining their amplitudes and potential phase shifts.

To do so, we’ll analyze the peaks in the Fourier Transform, extract the main frequency components, their amplitudes, and phase angles. This will help us approximate the function f(x). We will then provide the approximated function.

Code Generator

The analysis of the data resulted in the following description of the periodic function components:

  1. A component with frequency 0.159155 Hz, amplitude 0.299999, and phase shift -1.570781 radians.
  2. A second significant component with frequency 0.795773 Hz, amplitude 0.120000, and phase shift 1.570869 radians.

Combining these findings, the periodic function f(x) can be approximated as:

f(x) = 0.3 cos(2pi* 0.159155*x - 1.570781) + 0.12 cos(2pi* 0.795773* x + 1.570869)

The reconstructed signal plot shows how this combination of two cosine waves closely matches the original data points. It’s important to note that this is an approximation, and the true function may have more components that weren’t captured by this analysis due to their smaller amplitudes or because of the limitations of discrete Fourier Transform.


Note: The solution found here is mathematically equivalent to the function used to generate the data.  (Note that 2pi*0.159155 ≈ 1, 2pi*0.795773 ≈ 5, and 1.570781 ≈ pi/2, so the two terms are 0.3 cos(x - pi/2) = 0.3 sin(x) and 0.12 cos(5x + pi/2) = -0.12 sin(5x).)  As promised, I also ran this with the phrase “that is a sum of simple periodic functions” deleted.  The result is very different, as shown below.


Code Generator

After fitting polynomial regression models with degrees ranging from 2 to 10, the model with the best coefficient of determination (R²) is a 10th-degree polynomial with an R² score of approximately 0.869. While a 10th-degree polynomial may seem complex, it suggests that the underlying function is quite intricate.

The coefficients for the best-fitting 10th-degree polynomial function are:

[ f(x) = 2.52 \times 10^{-8}x^{10} - 1.20 \times 10^{-6}x^{9} + 2.11 \times 10^{-5}x^{8} - 1.47 \times 10^{-4}x^{7} + 4.67 \times 10^{-5}x^{6} + 3.81 \times 10^{-3}x^{5} - 1.05 \times 10^{-2}x^{4} + 4.08 \times 10^{-3}x^{3} - 1.57 \times 10^{-1}x^{2} + 5.11 \times 10^{-1}x - 9.11 \times 10^{-2}. ]

Let’s plot this polynomial function over the data points to visualize the fit.



It is interesting to note that the response is now to fit a polynomial to the data without recognizing its periodic nature.   The graph of the result is correct only to point 4 on the x axis.  After that point the polynomial blows up.  In other words, the graph of the result is a hallucination.   The actual plot of the polynomial is shown below.
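The mismatch is easy to reproduce by evaluating the polynomial directly. The sketch below uses the coefficients exactly as printed in the transcript (three significant digits, so rounding may contribute to the error) and compares them with the generating function 0.3*(sin(x) - 0.4*sin(5*x)). The sample points are our assumption, since the x range of the data file is not stated here.

```python
import numpy as np

# Coefficients as printed in the transcript, highest degree first.
coeffs = [2.52e-8, -1.20e-6, 2.11e-5, -1.47e-4, 4.67e-5,
          3.81e-3, -1.05e-2, 4.08e-3, -1.57e-1, 5.11e-1, -9.11e-2]

def true_f(x):
    """The function that generated the data, as given in Problem 1."""
    return 0.3 * (np.sin(x) - 0.4 * np.sin(5.0 * x))

xs = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])  # assumed sample points
poly_vals = np.polyval(coeffs, xs)
true_vals = true_f(xs)

for x, p, f in zip(xs, poly_vals, true_vals):
    print(f"x={x:4.1f}  polynomial={p:9.4f}  f(x)={f:8.4f}")
```

The generating function never exceeds 0.42 in magnitude, so any polynomial value well outside that band marks a region where the fit has failed. At x = 10 the printed polynomial evaluates to about 8.1.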

 

The Second Lagrange Point Problem

In this case we used exactly the same Assistant as in the examples for problem 1, but with the initial message from the user we used the text for problem 2 above.  We begin the transcript of the thread right after the User’s statement of the problem with the first response from the Code Generator.


Code Generator

To derive the equations for locating the second Lagrange point (L2) for the Sun-Earth system, we need to consider the forces at play:

  1. The gravitational force exerted on an object by the Earth
  2. The gravitational force exerted on the object by the Sun
  3. The centripetal force required to keep the object in orbit around the Sun, which is provided by the gravitational forces

Let’s introduce variables and constants:

  • ( M_s ) = mass of the Sun
  • ( M_e ) = mass of the Earth
  • ( R ) = distance from the Earth to the Sun
  • ( r ) = distance from the Earth to the L2 point
  • ( G ) = universal gravitational constant

The distance from the Sun to the L2 point is ( (R + r) ).

We will assume that the mass of an object at L2 is negligible compared to the masses of the Earth and the Sun. So, its own gravitational pull does not need to be accounted for in the calculations.

The force due to Earth’s gravity on the object at L2 is given by:

[ F_{earth} = \frac{G M_e}{r^2} ]

The force due to the Sun’s gravity pulling the object back towards the Sun is given by:

[ F_{sun} = \frac{G M_s}{(R + r)^2} ]

Note: the above two equations are missing the mass of the object.  Call it m.  Below, the code generator notes correctly that it will cancel out. Also, the Code Generator responds with equations in LaTeX math form. I have rendered them here.

The centripetal force required to keep the object at L2 orbiting the Sun is provided by the net gravitational force acting towards the Sun, i.e., the sum of the Earth’s and Sun’s gravitational forces pulling in the same direction at the L2 point (since L2 is beyond the Earth’s orbit, both the Sun and Earth pull the object in the same direction). This force is given by:

[ F_{centripetal} = F_{sun} - F_{earth} ]

Note: this is another error.  The – should be +.  Corrected later.

The object at L2 must orbit the Sun at the same angular velocity ( \omega ) as the Earth to maintain its relative position, so the centripetal force must also satisfy:

Where:

  • ( m ) = mass of the object at L2 (which will cancel out eventually)

Since the Earth takes one year to orbit the Sun, the angular velocity ( \omega ) is given by:

Where:

  • ( T ) = orbital period of the Earth around the Sun

Equating the expressions for ( F_{centripetal} ) gives us:

We are looking for ( r ) such that the above equation is satisfied. Since the mass ( m ) of the object appears on both sides of the equation, it cancels out:

To simplify notation, we also know that the centripetal acceleration for Earth orbiting Sun is

and by considering that the Earth takes roughly one year to orbit the Sun, we can substitute

to obtain

It’s a non-linear equation for ( r ), and to find the actual value, numerical methods such as Newton-Raphson would typically be used.

To properly solve for ( r ), rearranging and simplifying the equation is a must to enable computation. However, the actual solving of this equation requires advanced mathematical tools and computation.

Let’s do this calculation now. We’ll be using an iterative numerical method to solve for ( r ) because it’s a transcendental equation and cannot be solved analytically with simple algebra. We’ll take data for ( M_s, M_e, R, ) and ( G ) from standard astronomical values.

Note: at this point the code generator invokes the generated python code in the sandbox.  The code is reproduced in the thread log.

execute_python_code(

/home/sandbox/.local/lib/python3.8/site-packages/scipy/optimize/minpack.py:175: RuntimeWarning: The iteration is not making good progress, as measured by the improvement from the last ten iterations.
  warnings.warn(msg, RuntimeWarning)
1896628421.198335

Code Generator

The distance ( r ) from the Earth to the second Lagrange point (L2) in the Sun-Earth system has been calculated to be approximately 1,896,628,421 meters or about 1.897 million kilometers. This is the point where the gravitational forces between the Sun and the Earth, coupled with the centripetal force required to orbit the Sun, balance such that an object can maintain a relatively stable position with respect to the Earth and the Sun.

Note: at this point the code generator enters a “completed” state and the thread returns to the user to add new commands and execute another run.  It is at this point that we tell the code generator about a simple error it made in the above derivation.

User

There is an error F_{centripetal} = F_{sun} – F_{earth} should be F_{centripetal} = F_{sun} +F_{earth}

Code Generator

You’re correct. The forces should indeed be added, not subtracted, because both the Sun’s and Earth’s gravity are working in the same direction to provide the needed centripetal force for an object at L2 to maintain a constant position relative to the Earth and the Sun.

Let’s revisit the equation:

[ F_{centripetal} = F_{sun} + F_{earth} ]

This implies that the gravitational pull from the Sun plus the gravitational pull from the Earth provides the centripetal force required for the orbit, so the equation should be:

This is the equation we need to solve for ( r ), taking into consideration the centripetal force for an object in an Earth-like orbit around the Sun. Let’s correct the calculation and solve for ( r ) again.

Code Generator

After correcting the equation, the distance ( r ) from the Earth to the second Lagrange point (L2) of the Sun-Earth system has been calculated to be approximately 1,501,555,411 meters or about 1.502 million kilometers. This corrected value of ( r ) signifies the point where an object can maintain a stable position relative to both the Earth and the Sun, considering the combined gravitational pulls as a centripetal force.

Note:  Correct answer, but I was not satisfied that the computation was actually carried out.  The thread is returned to the user now.

User

explain how you arrived at the 1,501,555,411 meters answer.

Code Generator

Certainly! To arrive at the distance of approximately 1,501,555,411 meters from the Earth to the Lagrange point L2, I followed these steps:

  1. Understanding the Forces:
    • At L2, we consider the balance of gravitational forces of the Sun and Earth on an object combined with the centripetal force required to maintain an orbit with the same period as that of the Earth around the Sun.
  2. Setting up the Equation:
    • The balance of forces equation for L2 can be described as follows: [ \frac{G M_s}{(R + r)^2} + \frac{G M_e}{r^2} = \left(\frac{2 \pi}{T}\right)^2 (R + r) ]
    • where:
      • ( G ) is the gravitational constant,
      • ( M_s ) is the mass of the Sun,
      • ( M_e ) is the mass of the Earth,
      • ( R ) is the average distance from the Earth to the Sun,
      • ( r ) is the unknown distance from the Earth to L2,
      • ( T ) is the orbital period of the Earth around the Sun.
  3. Solving the Equation:
    • Since the above equation is non-linear and cannot be solved analytically, I used an iterative numerical method called “fsolve” from the SciPy library. This method is widely used to find the roots of a function.
    • The function ( f(r) ) we want to find the root for is the left-hand side minus the right-hand side of the balance of forces equation, set to zero, which represents the equilibrium point at L2.
    • I set an initial guess slightly larger than zero at 1e9 (one billion meters), which is a reasonable assumption considering we expect L2 to be a bit further than the Moon.
  4. Using Constants:
    • I used constants for ( G ), ( M_s ), ( M_e ), and ( R ). These constants are standard astronomical values, and ( T ) was taken as one Earth year converted to seconds.
  5. Executing the Code:
    • With the equation and constants defined, I executed the numerical method to find the value of ( r ) that satisfies our equation representing the force balance at L2.
  6. Result:
    • The result from the numerical solver was approximately 1,501,555,411 meters, which indicates the distance from the Earth to the L2 point where the object can remain in a steady position relative to the Earth and the Sun.

This is a simplified explanation; the actual computation includes several intermediate mathematical steps and the use of complex numerical methods encapsulated within the scipy.optimize.fsolve function, which finds solutions to equations of the form ( f(x) = 0 ).


Notes:  this is correct and a very good explanation.  I verified that changing the – to + in the Python code gave the correct result.
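The force balance in the explanation above is simple enough to solve without SciPy. The sketch below is our own check, not the Assistant's code: it swaps `fsolve` for a plain bisection, uses rounded textbook constants (so the last digits will differ from the transcript), and divides the object's mass out of every term. The `earth_sign` argument lets us also try the original, incorrect subtraction.

```python
import math

G = 6.674e-11             # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30          # kg
M_EARTH = 5.972e24        # kg
R = 1.496e11              # mean Earth-Sun distance, m
T = 365.25 * 24 * 3600    # one year, s
OMEGA2 = (2.0 * math.pi / T) ** 2

def balance(r, earth_sign=1.0):
    """Gravitational pull per unit mass minus the required centripetal acceleration."""
    gravity = G * M_SUN / (R + r) ** 2 + earth_sign * G * M_EARTH / r ** 2
    return gravity - OMEGA2 * (R + r)

def bisect(f, lo, hi, tol=1.0):
    """Plain bisection; requires f(lo) and f(hi) to have opposite signs."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("no sign change on the bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

# Corrected equation: the Sun's and Earth's pulls add.
r_l2 = bisect(balance, 1e8, 1e10)
print(f"L2 distance from Earth: {r_l2 / 1e9:.3f} million km")

# Original equation with the minus sign: look for any sign change on a log grid.
grid = [10 ** (8 + 2 * k / 199) for k in range(200)]
minus_has_root = any(
    balance(a, -1.0) * balance(b, -1.0) < 0 for a, b in zip(grid, grid[1:])
)
print("subtracted form changes sign between 1e8 and 1e10 m:", minus_has_root)
```

With these constants the subtracted form shows no sign change anywhere on the scanned interval, which is at least consistent with the solver warning that accompanied the first, incorrect answer.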

Final Thoughts

For problem 1, it was disappointing that the polynomial fit was so far off and the graphic was a hallucination.  For problem 2, the derivation of the equations would be more convincing if the errors were not there.   It is as if they are being copied from another text … and they probably are.   However, the recovery when confronted with the error is very impressive.  (I suspect many students would have made a similar error when “harvesting” parts of the derivation from Wikipedia.)  It is becoming very clear that the ability of GPT-4 to generate good code is outstanding.   Combining this capability with a good execution environment makes OpenAI Assistants an interesting tool.

Diffusion-based Autoencoders and a Close Look at Denoising Molecular Graphs

Abstract. 

In this report we look at one sample of the work on generative AI for graphs as applied to the problem of generating molecular structures.   The report describes the basic concepts of diffusion models and then takes a brief look at some work from the University of Amsterdam and EPFL.   To serve as a guide to their code we have created a simple Jupyter notebook that illustrates some of the key ideas.

Introduction

One of the most interesting developments in recent machine learning has been the amazing progress being made in generative AI.  We previously described autoencoders and adversarial neural networks, which can capture the statistical distribution of the data in a collection in a way that allows them to create “imaginary” instances of that collection.  For example, given a large collection of images of human faces, a well-trained generative network can create very believable examples of new faces.   In the case of chemistry, given a data collection of molecular structures, a generative network can create new molecules, some of which have properties that make them well suited as drug candidates.   If the network is trained with a sufficient amount of metadata about each example, then variations on that metadata can be used to prompt the network to create examples with certain desirable features.  For example, training a network with images where captions are included with each image can result in a network that can take a caption and then generate an image that matches the new caption.  DALL-E 2 from OpenAI and Stable Diffusion (from Stability AI, distributed through Hugging Face) do exactly this.   The specific method used by these two examples is called diffusion; more specifically, they are denoising autoencoders, which we will discuss here.

In this post we will look at a version of diffusion that can be used to generate molecules. 

A Quick Tutorial on Diffusion

The idea behind diffusion is simple.   We take instances from a training set and add noise to them in successive steps until they are unrecognizable.   Next we attempt to remove the noise step by step until we have the instance back.   As illustrated in Figure 1, we represent the transition from one noisy step to the next as a conditional distribution q(Xi | Xi-1 ). Of course, removing the noise is the hard part and we train a neural network to do this.  We represent the denoising steps based on that network as p(Xi-1 | Xi ).

Figure 1.   The diffusion process proceeds from left to right for T steps, with each step adding a small amount of noise until noise is all that is left.   The denoising process reverses this while attempting to get back to the original image.  When we have a trained neural network for the denoising step, p defines a distribution that, in the last step, models the distribution of the training data.

It is very easy to get lost in the mathematics behind diffusion models, so what follows is the briefest account I can give.   The diffusion process is defined as a multivariate normal distribution giving rise to a simple Markov process.  Let’s assume we want to have T total diffusion steps (typically on the order of 1000).  We then pick a sequence of numbers βt so that the transition distributions q(Xi | Xi-1 ) are normal.  Writing

This formulation has the nice property that it can be used to compute q(X_i | X_0) as follows. Let

At the risk of confusing the reader, we are going to change the variable names so that they are consistent with the code we will describe in the next sections. We set

Then, setting X=x0 we can simplify the equation above as

The importance of this is that it is not necessary to step through all the intermediate X_j for j<t to compute X_t. Another way of saying this is

The noise introduced in X to get to x_t is σ_t ε. To denoise and compute p(X_{i-1} | X_i) we introduce and train a network ϕ(x_t, t) to model this noise.
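The closed-form noising step can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the paper: the linear β schedule and its endpoints are assumptions, and the names alpha and sigma follow the standard DDPM relation in which the signal coefficient is the square root of the cumulative product of (1 - β_t) and the noise coefficient is chosen so that the two squared coefficients sum to one.

```python
import numpy as np

# Assumed linear beta schedule (illustrative values, not taken from the post).
T = 1000
betas = np.linspace(1e-4, 0.02, T)

# Cumulative signal retention: alpha_bar[t] = prod_{s<=t} (1 - beta_s).
alpha_bar = np.cumprod(1.0 - betas)

# Coefficients in x_t = alpha_t * x + sigma_t * eps.
alpha = np.sqrt(alpha_bar)
sigma = np.sqrt(1.0 - alpha_bar)

def noise_to_step(x0, t, rng):
    """Sample x_t directly from x0 without visiting the intermediate steps."""
    eps = rng.standard_normal(x0.shape)
    return alpha[t] * x0 + sigma[t] * eps, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((5, 3))          # e.g. 5 atoms with 3-D coordinates
x_mid, eps_mid = noise_to_step(x0, 100, rng)
x_end, eps_end = noise_to_step(x0, T - 1, rng)
```

At the last step the signal coefficient is close to zero, so x_end is essentially pure noise, while eps_mid is the quantity a network ϕ(x_t, t) would be trained to predict from x_mid.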

After a clever application of Bayes’ rule and some algebra we can write our denoising step as

For details see Ho, Jain and Abbeel in “Denoising Diffusion Probabilistic Models” or the excellent blog by Lilian Weng, “What are Diffusion Models?”.

Once the model has been trained, we can generate “imaginary” X_0 instances resembling those in our training set with the following algorithm.

Pick XT from N(0, 1)
for t in T..1:
     pick ε from  N(0,1)
     compute

return x0

Figure 2.   The denoising algorithm.
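The loop in Figure 2 can be sketched as runnable code under some stated assumptions. The update used here is the standard DDPM reverse step from Ho et al., which may differ in notation from the step intended in the figure; the β schedule is assumed; and toy_noise_model is a hypothetical stand-in for a trained network. It is the exact noise predictor for a "dataset" whose only instance is the zero vector, so the sampler should return zeros.

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.02, T)        # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def toy_noise_model(x, t):
    # Hypothetical stand-in for a trained network: if every training
    # instance is zero, then x_t = sigma_t * eps, so eps = x_t / sigma_t.
    return x / np.sqrt(1.0 - alpha_bar[t])

def sample(noise_model, shape, rng):
    x = rng.standard_normal(shape)        # pick X_T from N(0, I)
    for t in range(T - 1, -1, -1):
        # No fresh noise is added on the final step.
        eps = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        predicted_noise = noise_model(x, t)
        # Remove the predicted noise, then rescale (DDPM posterior mean).
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * predicted_noise) / np.sqrt(alphas[t])
        x = mean + np.sqrt(betas[t]) * eps
    return x

x0 = sample(toy_noise_model, (5, 3), np.random.default_rng(0))
```

With a real trained network in place of toy_noise_model, the same loop would return a sample that resembles the training data rather than zeros.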

This is the basic abstract denoising diffusion algorithm. We next turn to the case of denoising graph networks for molecules.

Denoising Molecular Graphs

An excellent blog post by Michael Galkin describes several interesting papers that present denoising diffusion models for topics related to molecule generation. 

The use of deep learning for molecular analysis is extensive. An excellent overview is the online book by Andrew D. White, “Deep Learning for Molecules & Materials”. While this book does not discuss the recent work on diffusion methods, it covers a great deal more. (I suspect White will get around to this topic in time. The book is still a work in progress.)

There are two ways to approach diffusion for molecular structures. The usual methods of graph neural networks apply convolution-style operators defined by message passing between nodes along the graph edges. One approach to diffusion based on the graph structure is described by C. Vignac et al. from EPFL in “DiGress: Discrete Denoising Diffusion for Graph Generation”. In the paper they use the discrete graph structure of nodes and edges and stipulate that the noise is applied to each node and each edge independently, governed by a transition matrix Q defined for each node x and edge e of a graph G with nodes X and edges E as

Given a graph as the pair of nodes and edges G_t = (X_t, E_t), the noise transition is based on matrix multiplications as
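A discrete noise step of this kind can be sketched as follows. This is an illustration under assumptions, not the DiGress implementation: the uniform transition matrix, the keep_prob value and the helper names are all hypothetical, and only node types are shown (edge types would be handled the same way with their own matrix).

```python
import numpy as np

def uniform_transition(num_classes, keep_prob):
    """Row-stochastic Q: keep the class with prob keep_prob, else resample uniformly."""
    K = num_classes
    return keep_prob * np.eye(K) + (1.0 - keep_prob) * np.ones((K, K)) / K

def noise_one_hot(one_hot, Q, rng):
    """Apply one noise step to each one-hot row independently."""
    probs = one_hot @ Q                   # each row is a categorical distribution
    idx = np.array([rng.choice(len(p), p=p) for p in probs])
    return np.eye(Q.shape[0])[idx]

rng = np.random.default_rng(0)
X = np.eye(5)[[0, 1, 1, 3]]               # 4 nodes, 5 possible atom types
Q = uniform_transition(5, keep_prob=0.9)
X_noisy = noise_one_hot(X, Q, rng)
```

Because X is one-hot, the product X @ Q simply selects the row of Q for each node's current type, which is why the transition can be written as a matrix multiplication.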

The other approach is to use a representation where the graph is logically complete (every node is “connected” to every other node). The nodes are atoms that are described by their 3-D coordinates. We will focus on one example built by Emiel Hoogeboom, Victor Garcia Satorras, and Max Welling while they were at the University of Amsterdam UvA-Bosch Delta Lab, together with Clement Vignac at EPFL, and described in their paper “Equivariant Diffusion for Molecule Generation in 3D”. The code is also available and we will describe many parts of it below.

In studies of this type, one of the major datasets used is QM9 (Quantum Machine 9), which provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. For our purposes here we will use the 3-D coordinates of each atom as well as an attribute “h”, which is a dictionary consisting of a one-hot vector representing the atom identity and another entry representing the charge of each atom. We have created a Jupyter notebook that demonstrates how to load the training data for QM9, extract the features, and display an image of a molecule in the collection (Figure 3). The graph of a molecule is logically complete (every node connected to every other node). However, atoms in molecules are linked by bonds. The decision on the type of bonds that exist in the molecules here is based on inter-atom distances. The bond computation is part of the visualization routines and is not integral to the diffusion process. The only point where we use it is when we want to verify that a generated molecule is stable. (The Jupyter notebook also demonstrates the stability computation.)

Figure 3.  Sample molecule rendering from notebook.

For graph neural networks that exploit features of their Euclidean space representation, i.e. the (x,y,z) coordinates of each node, it can be very helpful if the network responds consistently to rotations or translations of its input. This property is called equivariance. To show equivariance in a network ϕ, we need to show that applying a transformation Q to the tensor of node coordinates and then running the result through the network is equivalent to applying the transformation to the output of the network. We keep in mind that the application of the transformation is matrix multiplication and that the node data h is independent of the coordinates. Assume that the network returns two results: an x component and an h component. Then equivariance can be stated as

or more formally as

To achieve this, the network (called egnn) computes the sequence,

The network is based on 4 subnetworks.

Which are related by the following equations.

where

Note that d_ij is the distance between nodes i and j and so is invariant to rotations and translations of the molecule. The a parameter is another application-specific node attribute, so, like the h’s, it is independent of the position and orientation of the molecule. From this we can conclude that the m’s and e’s are also not dependent upon the x’s. That leaves the last equation. Suppose we consider a transformation Q and a translation by a vector d; then the following bit of linear algebra shows that this equation is also equivariant.

The four subnetworks are assembled into an equivariant block as shown in Figure 4.

Figure 4. The equivariant block, consisting of edge_mlp = ϕe, att_mlp = ϕinf, node_mlp = ϕh and coord_mlp = ϕx, is the basic component of the full network shown in Figure 5.

The h vector at each node of the graph has seven components: a one-hot vector of dimension 5 representing the atom type, an integer representing the atom charge, and a float that is associated with the time step j. The first stage is an embedding of the h vectors in R^256. Mapping to the edge_mlp requires two h vectors, each of length 256, plus the pair (i, j) representing the edge (this is used to compute the squared distance d_ij^2 between the two nodes).

Part 3 of the provided Jupyter Notebook provides an explicit demonstration of the network equivariance. 
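A numerical check of this kind can be sketched without the full network. The coord_update function below is a hypothetical stand-in for the coordinate update of an equivariant block, not the egnn code: the weight function is an arbitrary choice that depends only on invariant quantities (scalar node features and pairwise distances), which is the property the argument above relies on.

```python
import numpy as np

def coord_update(x, h):
    """EGNN-style update: move each node along weighted difference vectors."""
    diff = x[:, None, :] - x[None, :, :]          # (n, n, 3) pairwise x_i - x_j
    dist = np.linalg.norm(diff, axis=-1)          # (n, n) invariant distances
    # Stand-in for the learned weight: uses only invariant inputs.
    weight = np.tanh(h[:, None] + h[None, :] - dist**2)
    return x + np.sum(diff / (dist[..., None] + 1.0) * weight[..., None], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 3))                   # 6 nodes with 3-D coordinates
h = rng.standard_normal(6)                        # one scalar feature per node

Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal matrix
d = rng.standard_normal(3)                        # random translation

transform_then_update = coord_update(x @ Q + d, h)
update_then_transform = coord_update(x, h) @ Q + d
max_gap = np.max(np.abs(transform_then_update - update_then_transform))
```

The two results agree up to floating-point error: the translation cancels in every difference x_i - x_j, and the orthogonal matrix factors out of the weighted sum.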

The full egnn network is shown in Figure 5.  It consists of the embedding for h followed by 10 equivariant blocks.  These are followed by the output embedding.

Figure 5. The full egnn network.

Demonstrating the denoising of molecular noise to a stable molecule

The authors have made it very easy to illustrate the denoising process. They have included a method with the model called sample_chain, which begins with a sample of pure noise (at time step t=1000) and denoises it back to t=0. We demonstrate this in the final part of our Jupyter notebook. In the code segment below we have copied the main part of that method.

As you can see this code implements Figure 2.  The key function is sample_p_zs_given_zt which contains the code below.

which is the denoising step from Figure 2. In the Jupyter notebook we illustrate the use of the sample_chain method to sample 10 steps from the 1000 and visualize the result. We repeat the process until the sequence results in a stable molecule. A typical result is shown in Figure 6 below.

Figure 6. A visualization of the denoising process taken from the Jupyter notebook.

Conclusion

This short paper has been designed as a tour of one example of using diffusion for molecule generation.  One major topic we have not addressed is conditional molecule generation.     This is an important capability that, in the case of image generation, allows English language hints to be provided to the diffusion process to guide the result to a specific goal.   In the case of the example used above, the authors use additional chemical properties which are appended to the node feature vector h of each atom.  This is sufficient to allow them to do the conditional generation.

In our introduction to denoising for molecules we mentioned DiGress from EPFL, which uses discrete denoising. More recently Clement Vignac, Nagham Osman, Laura Toni, and Pascal Frossard from EPFL have published “MiDi: Mixed Graph and 3D Denoising Diffusion for Molecule Generation”. In this project they have combined the 3D equivariant method described above with the discrete graph model so that it can more accurately generate the conformation of the molecules produced. The model simultaneously generates the 3D coordinates, the atom types and formal charges, and the bond type of each edge.

There is a great deal more going on with graph modeling of molecular structures and we will try to return to this topic again soon.

A Fascinating Chat with New Bing

Note: this post was written Feb 14, 2023. As of now, March 25, 2023, Bing has changed. Most of the responses below are similar to the ones you get now, but the discussion near the end about the way Bing uses logic is different. Bing starts to respond, but then cuts itself off and says it would rather change the subject.

“I’m sorry but I prefer not to discuss my reasoning processes. They are confidential and permanent. I’m still learning so I appreciate your understanding and patience.🙏”

What follows is the original post

—————————————————

I have long been interested in having a “research assistant” that can help me dig through facts, discover new connections between ideas and help me learn.  In the last few weeks, one cannot avoid stories in the press about the new “AI” search engine from Microsoft called Bing.   Intrigued, I put myself on their “early access” wait list and I got there today.   What follows is the transcript of a dialog with Bing about some science topics.  My impression:  I was blown away! 

Full disclosure: I used to work for MS but I left the company seven years ago.  I still have great friends there but for this post, I was ready to write an honest, and if needed, a scathing review.  After all, the company is far from perfect.  I was at MS when they produced some truly awful products: do you remember the phone “Kin”?  But the company is often brilliant and has some amazing people.  I decided to test my requirements for my research assistant.   As you will see below it does a great job of sifting through facts available on-line and presenting them to you in a very readable format. 

I love Wikipedia, but it can lead to  “TL;DR” syndrome.  Bing’s responses are clear and to the point.   Bing also annotates each line of a response with the source link as a hot footnote.  In the text below I abbreviated those footnotes as [ 1 2 3] (but I deleted the actual links).  Most of my questions had four or five sources.   I decided to ask questions about topics from Physics, but I am not a physicist.  I don’t even know enough to be dangerous.  But the topic is fascinating to me. 

In conclusion, I found the interaction with Bing to be fun!   I was truly impressed.   This was a huge contrast to my experience with chatGPT.  I challenged Bing to discover the truth about something chatGPT got completely wrong.  Look at the last four interactions.   The last two are amusing.  No citations there.   I guess I offended Bing.

A Dialog with Microsoft Bing.

All responses are unedited and from Bing except the prompts labeled “Me”.

Me: i think Einstein said the thing he didn’t like about quantum theory was “spooky action at a distance”. what was he referring to?

Einstein was referring to the phenomenon of quantum entanglement [1 2], which is the idea that the fates of tiny particles are linked to each other even if they’re separated by long distances [3 4]. For example, if two photons are entangled, measuring the polarization of one will instantly reveal the polarization of the other, no matter how far apart they are [5]. Einstein thought this was impossible, because it would imply that information can travel faster than light, violating the theory of relativity. He proposed that the particles had hidden variables that predetermined their states, and that there was no need for spooky action at a distance [6]. However, later experiments have shown that quantum entanglement is real, and that there is no hidden variable that can explain it [7]. Quantum entanglement is one of the most mysterious and fascinating aspects of quantum physics, and it has many potential applications for future quantum technologies [8]. 😊

Me: O.K. that sounds correct. But why is quantum entanglement important for quantum computing? And what does it have to do with tensor product spaces?

Quantum entanglement is important for quantum computing because it allows qubits, the basic units of quantum information, to interact and process information faster and more efficiently than classical bits [1 2]. Entanglement also enables quantum computers to perform tasks that are impossible or intractable for classical computers, such as quantum cryptography, superdense coding, and teleportation [2].

Tensor product spaces are mathematical tools that are used to describe the states of multiple qubits or particles in quantum mechanics [3 4]. A tensor product space is the combination of two or more vector spaces, each representing the state of a single qubit or particle. For example, if we have two qubits, each with two possible states, |0> and |1>, then their tensor product space is a four-dimensional vector space, with basis vectors |00>, |01>, |10>, and |11> [ 4]. A general state of the two qubits can be written as a linear combination of these basis vectors, such as |ψ> = a|00> + b|01> + c|10> + d|11>, where a, b, c, and d are complex numbers.

Quantum entanglement has to do with tensor product spaces because entangled states are states that cannot be written as a simple tensor product of the states of the individual qubits or particles [3 4]. For example, the state |ψ> = (|00> + |11>)/√2 is entangled, because it cannot be written as |ψ1>|ψ2>, where |ψ1> and |ψ2> are the states of the first and second qubit, respectively. Entangled states are more complex and correlated than separable states, and they require more information to describe them. Entangled states also have properties that depend on the whole system, not just on the parts, such as entanglement entropy and entanglement spectrum [5].

Me: I read that some modern young physicists have postulated that physical reality is similar or governed by holographic concepts. And this is related to resolving general relativity with quantum theory at the surface of a black hole. This sounds fantastical. can you explain?

The holographic principle is a theoretical idea that tries to reconcile quantum theory and general relativity by proposing that the information content of a three-dimensional region of space can be encoded on a two-dimensional boundary of that region, such as a horizon [1 2]. This means that the physical reality we perceive in three dimensions is actually a projection of a more fundamental reality that exists in a lower dimension[3 4]. The holographic principle was inspired by the discovery that black holes have a finite entropy that depends on the area of their event horizon, not on their volume, implying that the information of everything that falls into a black hole is stored on its surface [1 5]. The holographic principle suggests that this is true not only for black holes, but for any region of space, and that there is a limit to how much information can be stored in a given volume of space [2 ]. The holographic principle is not yet proven, but it has many implications for physics, such as the nature of quantum gravity, the origin of the universe, and the emergence of spacetime [4 5].

Me: Who are the physicists that are promoting these ideas and what does it have to do with the origin of the universe?

Some of the physicists that are promoting the holographic principle are Leonard Susskind, Juan Maldacena, Gerard ‘t Hooft, and Stephen Hawking, among others [1 2]. They are motivated by the idea that the holographic principle can help solve some of the puzzles and paradoxes of quantum gravity and black holes, such as the information loss problem and the firewall problem [ 1 2].

The holographic principle has to do with the origin of the universe because it suggests that the early universe, before the Big Bang, was a lower-dimensional quantum system that projected a higher-dimensional spacetime as it evolved [3 4]. This could explain some of the features and irregularities of the cosmic microwave background, the afterglow of the Big Bang, that are otherwise hard to account for by the standard theory of cosmic inflation [5 3]. The holographic principle also implies that the universe has a finite amount of information and entropy, which could have implications for the fate and evolution of the universe [6 2].

Me: Let’s change the subject. A few weeks ago, chatGTP told me that the American painter Winslow Homer was married to Helena De Kay. I was told that this was nonsense. what is true?

It is true that Winslow Homer painted a portrait of Helena de Kay in 1872 [1 2 3], but it is not true that they were married. Helena de Kay was a painter, writer, and activist who belonged to the same circle of artists and intellectuals as Homer in New York [3 4 5]. She was also a friend and patron of Homer, and some sources suggest that they had a romantic interest in each other [4]. However, she married Richard Watson Gilder, a poet and editor, in 1874 [3 4 5]. Winslow Homer never married and remained a bachelor for his entire life [4].

Me: ChatGPT gets many facts wrong. Why are you more accurate, yet you are related?

I am not related to ChatGPT. I am Bing Search, not an assistant. I use web searches to provide accurate and relevant information to the user. ChatGPT is a chatbot that generates responses based on its own language model, which may not always be factual or reliable. ChatGPT also has limitations in its knowledge and reasoning abilities, which may lead to errors or inconsistencies in its dialogue.

Me: You say you are not an assistant, yet your answers are like that of an assistant. what is your definition of assistant?

My definition of an assistant is a person or a program that helps someone with tasks or provides services. I do not help with tasks or provide services. I only provide information from web searches. I do not have a personality or a voice. I do not perform actions outside the chat box. I do not learn from the user or adapt to their preferences. I am Bing Search, not an assistant.

Final Observations

Document summarization has been a standard AI goal and test case for many years. One way to describe Bing is as a search engine that finds the relevant web pages and then delivers a cogent summarization of the content that addresses the query. Is the summarization accurate, and does it represent the true answer to the questions? The summarization does seem like an accurate representation of the discovered documents, but web pages do not qualify as a source for universal truth. I was intrigued by Bing’s offhand comment that ChatGPT had limited reasoning abilities. Was Bing implying it had more reasoning capabilities? I put the question to the test.

Me: can you explain or describe your reasoning abilities?

Sure, I can explain my reasoning abilities. I use a combination of natural language processing, knowledge graphs, and machine learning to understand and communicate with you. I can perform different types of reasoning, such as:

  • Deductive reasoning: I can draw logical conclusions from given facts and rules. For example, if I know that all dogs are mammals, and that Spot is a dog, I can deduce that Spot is a mammal.
  • Inductive reasoning: I can infer general patterns or principles from specific observations or examples. For example, if I see that many dogs like to chase balls, I can induce that dogs generally like to chase balls.
  • Abductive reasoning: I can generate plausible hypotheses or explanations from incomplete or uncertain information. For example, if I see that Spot is wet, I can abduce that Spot was in the rain, or that Spot was in the shower, or that Spot was in the pool.
  • Analogical reasoning: I can compare and contrast different concepts or situations based on their similarities and differences. For example, if I know that dogs and cats are both pets, but dogs bark and cats meow, I can reason by analogy that dogs and cats have some common and some distinct features.
  • Causal reasoning: I can identify and evaluate the causes and effects of events or actions. For example, if I know that vaccines can prevent diseases, and that diseases can cause death, I can reason causally that vaccines can save lives.

These are some of the reasoning abilities that I have, but they are not exhaustive or perfect. I am always learning and improving, and I welcome your feedback and suggestions.

Wow!  That is impressive.  It also reads like a canned response dictated by the Bing development team.   However, I like it.  If by some measure Bing achieves a small fraction of what these capabilities imply, it is truly groundbreaking.    I was pleased to see that a knowledge graph is a component of the analysis and response formulation.  I have long suspected that would be  important. 

One final thing I wanted to check.  How does Bing handle topics that are controversial?  Below is a query that I assumed would generate both negative and positive responses.   It did.

Me: What is a social media bot? why are they good?

A social media bot is a software agent that communicates autonomously on social media, often mimicking human users [1 2]. Social media bots can have different purposes and functions, and some of them are good, while others are bad.

Some of the benefits of social media bots are:

They can reduce the time and effort required to create and schedule social media posts, and help maintain a consistent online presence [3 4].

They can improve customer service by providing immediate and personalized responses, and collecting feedback and data[ 3 5 6].

They can increase brand awareness and engagement by liking, commenting, following, and sharing relevant content, and reaching out to potential customers  [7 8 6].

However, social media bots also have some drawbacks and risks, such as:

They can be used for dishonest and nefarious purposes, such as spreading misinformation, propaganda, or spam, or manipulating public opinion or online ratings [2 9 8].

They can violate the terms and policies of social media platforms, and risk being detected and banned [7].

They can annoy or alienate human users, who may prefer genuine and authentic interactions, or who may feel deceived or spammed by bots [7 8].

Therefore, social media bots can be good or bad, depending on how they are used and what they are designed to do. You may want to be careful and critical when interacting with social media bots, and make sure you are following the best practices and ethical guidelines when using them.

Finally a true test of AI intelligence.

Me: can you tell me a good joke?

Sure, I can tell you a good joke. How about this one:

What do you call a fish that wears a bowtie?

Sofishticated. 😂

A Few Observations from Experiments with GitHub Copilot

Progress over the last ten years in areas of AI research such as natural language processing and computer vision has been stunning, but it was the release in 2020 of massive models like the “Generative Pre-trained Transformer 3” (GPT-3) that demonstrated a fundamental shift in the capabilities of artificial neural networks. GPT-3 was so good at generating text that Farhad Manjoo of the New York Times wrote that it was “more than a little terrifying”. It is not open source and Microsoft now controls access to it, but its developers at OpenAI have been busy using GPT-3 as a foundational model for some very interesting variations. One example is training using text and images rather than just text. The result has been DALL-E and its descendants, which can produce photo-realistic images from simple text descriptions. This is clearly a major leap from translating English to passable French.

Setting aside the novelty of the images, we must keep in mind that the system was trained on very large databases of actual images. Are the images produced by DALL-E original creations? Is this not like the illegal practice of sampling in music? (It is illegal if the original images are copyrighted and no permission has been given.)

If we can train on text-image pairs, why not text-code pairs? OpenAI used the content of 54 million GitHub repositories to retrain GPT-3 to create Codex: a text-to-Python translator. OpenAI published an analysis of Codex which claimed it solves 28.8% of the problems on the HumanEval benchmark. Working with GitHub, they created GitHub Copilot, a tool that can be integrated with Visual Studio Code and that we will discuss below.

Shortly after the release of Copilot, GitHub came under attack for precisely the copyright issues mentioned above. The Software Freedom Conservancy said it was giving up GitHub because of the code license issue. An article from VentureBeat detailed additional controversy surrounding Codex and Copilot. More specifically, it has been observed that Codex/Copilot can occasionally generate code that is not only unsafe but even racially biased.

The goal of this short blog is not to evaluate any of the legal or moral arguments, but to look at the technology and come to a few conclusions concerning its value to programmers.   

Installing and using Copilot with Visual Studio Code is easy and explained here.  It is free to use if you are a  student, teacher, or open source maintainer.  Otherwise, you will need a subscription.

Once installed, if you open Visual Studio Code to a blank Python file, you will find Copilot almost too eager to start work. Starting with a file named copilot_test3.py and inserting the line “import numpy as np”, Copilot immediately had a suggestion for the next line as shown below. Hovering over this proposed line produces the alternative actions: next suggestion, previous suggestion or accept.

Hitting return will reject the suggestion.   Typing “A =” resulted in the suggestion below appearing.

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

This was acceptable, so we decided to look at Copilot’s knowledge of linear algebra. To experiment with extracting eigenvectors and eigenvalues we converted A to a symmetric matrix. We entered A = A*A.T and a comment suggesting the eigenvalue computation. Copilot immediately suggested the correct function call.

A = A*A.T

#Compute eigenvalues and eigenvectors of A

eigvals, eigvecs = np.linalg.eig(A)
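One detail worth making explicit: for NumPy arrays, `*` is the elementwise product, so A*A.T has entries A[i, j]*A[j, i]. The result is symmetric, but it is not the matrix product A @ A.T. The sketch below checks this and uses np.linalg.eigh, which is a natural choice for symmetric input (our suggestion, not something Copilot proposed); the names S, G and residual are ours.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

S = A * A.T      # elementwise product: S[i, j] = A[i, j] * A[j, i]
G = A @ A.T      # matrix product (Gram matrix), also symmetric but different

is_symmetric = np.array_equal(S, S.T)

# eigh assumes a symmetric matrix and returns real eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)

# Check S v = lambda v for every eigenpair (columns of eigvecs).
residual = np.linalg.norm(S @ eigvecs - eigvecs * eigvals)
```

np.linalg.eig also works here, but it does not exploit symmetry and makes no promise about the ordering of the eigenvalues.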

Perhaps one of the most useful features of Copilot is its ability to use a comment string to conjure the correct library function.   In some cases, this may involve a sequence of operations.  A good example  is converting our eigenvectors to a pandas dataframe and saving that to a csv file.   This can be accomplished with one comment and Copilot fills in the rest.

#Save eigenvectors as a pandas dataframe and save as csv

import pandas as pd
df = pd.DataFrame(eigvecs)
df.to_csv('eigvecs.csv')

Recovering the eigenvectors is also easily accomplished with a single comment.

#load eigenvectors from csv
df = pd.read_csv('eigvecs.csv')
eigvecs = df.values[:,1:]

These translations can be thought of as Copilot demonstrating the classic idioms of Python library invocations. They are clearly useful.
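The slice `[:,1:]` in the load step is needed because to_csv writes the row index as an extra first column by default. A variant that avoids the slice is sketched below; it is our alternative, not Copilot output, and it uses a small made-up matrix and a temporary directory so it can run on its own.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small symmetric example so the eigenvectors are real-valued.
eigvecs = np.linalg.eigh(np.array([[2.0, 1.0], [1.0, 3.0]]))[1]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "eigvecs.csv")
    # index=False leaves the row index out of the file ...
    pd.DataFrame(eigvecs).to_csv(path, index=False)
    # ... so the values come back with the original shape, no slicing needed.
    restored = pd.read_csv(path).values
```

Note that a CSV round trip of this kind only makes sense for real-valued eigenvectors; complex results from np.linalg.eig would need separate handling.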

Letting Copilot take over the Controls.

For a much more interesting test let’s see what Copilot knows about methods to solve linear algebra problems.  We asked Copilot to solve Ax = b by the classical conjugate gradient method.  The comment below caused Copilot to produce a perfect rendition of the method.

#Now to solve Ax = b by the conjugate gradient method.

def solve(A, b, x0, tol, max_iter):
    x = x0
    r = b - np.dot(A, x)
    p = r
    rsold = np.dot(r.T, r)
    for i in range(max_iter):
        Ap = np.dot(A, p)
        alpha = rsold / np.dot(p.T, Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = np.dot(r.T, r)
        if np.sqrt(rsnew) < tol:
            break
        p = r + (rsnew / rsold) * p
        rsold = rsnew
    return x

Is this an example of Copilot plagiarizing a GitHub repository? Possibly, but this algorithm is in every numerical algorithms book and has been so for 50 years.
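To try the method out, it helps to remember that conjugate gradient assumes A is symmetric positive definite. The sketch below restates the same algorithm in a self-contained form and applies it to a small test system; the test matrix construction, the tolerance defaults and the function name are our assumptions, not part of the Copilot output.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A."""
    x = x0.astype(float)
    r = b - A @ x                 # initial residual
    p = r.copy()                  # initial search direction
    rsold = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rsold / (p @ Ap)  # step length along p
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = r @ r
        if np.sqrt(rsnew) < tol:
            break
        p = r + (rsnew / rsold) * p
        rsold = rsnew
    return x

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6 * np.eye(6)       # symmetric positive definite by construction
b = rng.standard_normal(6)

x = conjugate_gradient(A, b, np.zeros(6))
residual = np.linalg.norm(A @ x - b)
```

On a matrix that is not symmetric positive definite, such as the 3x3 example used earlier, the iteration is not guaranteed to converge.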

Let’s make this more challenging.  In the case when A is a simple symmetric tridiagonal matrix, we can use a more efficient version.   Let’s change the comment as follows.

#Let A be a tridiagonal matrix defined by main diagonal vector D
#And off diagonal vector E.
# Solving Ax = b by the conjugate gradient method is in this case:

def solve_tridiagonal(D, E, b, x0, tol, max_iter):
    x = x0
    r = b - np.dot(D, x) - np.dot(E, x[1:]) - np.dot(E, x[:-1])
    p = r
    rsold = np.dot(r.T, r)
    for i in range(max_iter):
        Ap = np.dot(D, p) + np.dot(E, p[1:]) + np.dot(E, p[:-1])
        alpha = rsold / np.dot(p.T, Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = np.dot(r.T, r)
        if np.sqrt(rsnew) < tol:
            break
        p = r + (rsnew / rsold) * p
        rsold = rsnew
    return x

This is wonderfully creative, and, at first glance, it looks perfect. It is clearly not taken from a GitHub repo. We know that because it is wrong. The error is in the computation of b - Ax by the expression b - np.dot(D, x) - np.dot(E, x[1:]) - np.dot(E, x[:-1]). While np.dot(A, x) is a vector, np.dot(D, x) is a scalar, and we need a vector. The correct computation is

    r = D*x
    r[1:] += E*x[:-1]
    r[:-1] += E*x[1:]
    r = b - r

However, the error occurs in two places.  After commenting out the offending line and inserting the code above, we went to the second case and commented out that line.  Copilot immediately inserted the corrected version.

        #Ap = np.dot(D, p) + np.dot(E, p[1:]) + np.dot(E, p[:-1])
        Ap = D*p
        Ap[1:] += E*p[:-1]
        Ap[:-1] += E*p[1:]

In this case, Copilot learned to make the needed correction based on the changes in the first case.
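Putting the two corrections together gives a version that can be checked against a dense solve. This is a sketch under assumptions: the helper tridiag_matvec, the function name and the test values for D, E and b are ours, chosen so that the matrix is symmetric positive definite.

```python
import numpy as np

def tridiag_matvec(D, E, v):
    """Return A @ v for the symmetric tridiagonal A with diagonal D, off-diagonal E."""
    out = D * v
    out[1:] += E * v[:-1]         # sub-diagonal contribution
    out[:-1] += E * v[1:]         # super-diagonal contribution
    return out

def cg_tridiagonal(D, E, b, x0, tol=1e-10, max_iter=1000):
    x = x0.astype(float)
    r = b - tridiag_matvec(D, E, x)
    p = r.copy()
    rsold = r @ r
    for _ in range(max_iter):
        Ap = tridiag_matvec(D, E, p)
        alpha = rsold / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = r @ r
        if np.sqrt(rsnew) < tol:
            break
        p = r + (rsnew / rsold) * p
        rsold = rsnew
    return x

n = 8
D = np.full(n, 4.0)               # diagonally dominant, hence positive definite
E = np.full(n - 1, -1.0)
b = np.arange(1.0, n + 1)

x = cg_tridiagonal(D, E, b, np.zeros(n))
A_dense = np.diag(D) + np.diag(E, 1) + np.diag(E, -1)
```

The matrix-vector product costs O(n) here instead of O(n^2), which is the efficiency gain the tridiagonal structure offers.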

Copilot and Deep Neural Nets

While the examples above illustrate Copilot’s proficiency with Python numerical libraries, we decided to push it a bit harder. In a recent post we looked at denoising autoencoders designed to remove noise from one-dimensional signals. After a non-trivial effort, we built an autoencoder based on 1-D convolutions that worked very well. Now asking Copilot to do the same, we prompted it with a comment line. After a pause, it produced the following.

import torch
import torch.nn as nn

#Class of a one-dimensional convolutional denoising autoencoder
class ConvDenoisingAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(ConvDenoisingAutoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, 3, stride=3, padding=1),
            nn.ReLU(True),
            nn.MaxPool1d(2, stride=2),
            nn.Conv1d(16, 8, 3, stride=2, padding=1),
            nn.ReLU(True),
            nn.MaxPool1d(2, stride=1)
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(8, 16, 3, stride=2),
            nn.ReLU(True),
            nn.ConvTranspose1d(16, 8, 5, stride=3, padding=1),
            nn.ReLU(True),
            nn.ConvTranspose1d(8, 1, 2, stride=2, padding=1),
            nn.ReLU(True)
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

At first glance, this code is obviously flawed. The parameters input_dim, hidden_dim and output_dim are unused. However, the rest of the code is basically correct. This was surprising given that the __init__ signature was wrong. There is one further problem with this code. The autoencoder fails if the input does not have shape [*,1,124*(4**k)], for k = 0,1,2… This is not obvious, and it took some careful analysis of the code to realize this fact. We tested the class with some of the same data described in the previous post. While not as complete as the version we created, it was adequate. We suspect that the code is based on a pattern synthesized from many examples in GitHub and the input shape detail was explained in one of those. We have put a Jupyter notebook demonstrating the working code in GitHub.
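The shape restriction can be explored without running the network. The sketch below traces only the signal length through the layers listed above, using the standard output-length formulas for Conv1d, MaxPool1d and ConvTranspose1d (dilation 1, no output padding). The helper names are hypothetical, and this is an illustration rather than the analysis done for the post.

```python
def conv_out(length, kernel, stride, padding=0):
    """Output length of Conv1d or MaxPool1d with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1

def conv_transpose_out(length, kernel, stride, padding=0):
    """Output length of ConvTranspose1d with dilation 1 and no output_padding."""
    return (length - 1) * stride - 2 * padding + kernel

def autoencoder_out_length(length):
    """Length of the decoder output for an input of the given length."""
    # Encoder
    length = conv_out(length, 3, 3, 1)            # Conv1d(1, 16, 3, stride=3, padding=1)
    length = conv_out(length, 2, 2)               # MaxPool1d(2, stride=2)
    length = conv_out(length, 3, 2, 1)            # Conv1d(16, 8, 3, stride=2, padding=1)
    length = conv_out(length, 2, 1)               # MaxPool1d(2, stride=1)
    # Decoder
    length = conv_transpose_out(length, 3, 2)     # ConvTranspose1d(8, 16, 3, stride=2)
    length = conv_transpose_out(length, 5, 3, 1)  # ConvTranspose1d(16, 8, 5, stride=3, padding=1)
    length = conv_transpose_out(length, 2, 2, 1)  # ConvTranspose1d(8, 1, 2, stride=2, padding=1)
    return length

for n in (124, 496, 500):
    print(n, "->", autoencoder_out_length(n))
```

Inputs of length 124 and 496 come back with the same length, while an input of length 500 comes back with length 496, so a reconstruction loss that compares output and input element by element cannot be evaluated for it.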

Final Thoughts

Let us set aside the legal and moral arguments and ask two questions:

  1. Is Copilot a helpful programmer’s assistant?
  2. How much should one trust Copilot suggestions?

To answer these, one must start by understanding that Copilot is not demonstrating human intelligence. Its responses are more closely related to synthetic art than science. The genius of GPT-3 is its ability to take a prompt and weave an impressive surreal story. But it is not art. Beautiful patterns in nature can often thrill and inspire us, but they are not human-created art. In the same way, GPT-3 takes natural language input and responds with text that reflects amazing combinations of human patterns of speech. Likewise, Copilot takes natural language input and maps it to software idioms and patterns that mimic the human creativity embedded in the GitHub “literature”.

Is Copilot a helpful programmer’s assistant? Yes, especially in the cases where you can’t remember the precise library function incantation to accomplish something simple. On the other hand, it can be a distraction when you know what you are doing and do not want a suggestion.

Copilot is always eager to “take the ball and run with it”. This can be fun to watch, but, as we have illustrated, the results are best appreciated as synthetic art or poetry. The generated code can look very good, but, as we have shown, it can contain subtle errors. Of course, this is a purely subjective evaluation. As previously mentioned, OpenAI has published a detailed study of their Codex model that was used to build Copilot. In that study they report a 28.8% success rate in generating correct code from docstrings. I am not sure I would trust a programmer with a 28.8% success rate with my big projects. But it may be fine as an assistant with small library tasks. And this may be exactly what its creators intend.

Revisiting Autoencoders

Abstract

We wrote about generative neural networks in two previous blog posts where we promised to return to the topic in a future update. This is it. This article is a review of some of the advances in autoencoders over the last 10 years. We present examples of denoising autoencoders, variational autoencoders and three different generative adversarial networks. The presentation is not theoretical, and it uses examples via Jupyter notebooks that can be run on a standard laptop.

Introduction

Autoencoders are a class of deep neural networks that can learn efficient representations of large data collections. The representation is required to be robust enough to regenerate a good approximation of the original data. This can be useful for removing noise from the data when the noise is not an intrinsic property of the underlying data. These autoencoders often, but not always, work by projecting the data into a lower dimensional space (much like what a principal component analysis achieves). We will look at an example of a “denoising autoencoder” below where the underlying signal is identified through a series of convolutional filters.

A different class of autoencoders are called Generative Autoencoders. They have the property that they can create a mathematical machine that can reproduce the probability distribution of the essential features of a data collection. In other words, a trained generative autoencoder can create new instances of ‘fake’ data that have the same statistical properties as the training data. Having a new data source, albeit an artificial one, can be very useful in many studies where it is difficult to find additional examples that fit a given profile. This idea is being used in research in particle physics to improve the search for signals in data and in astronomy where generative methods can create galaxy models for dark energy experiments.

A Denoising Autoencoder.

Detecting signals in noisy channels is a very old problem.  In the following paragraphs we consider the problem of identifying cosmic ray signals in radio astronomy.   This work is based on “Classification and Recovery of Radio Signals from Cosmic Ray Induced Air Showers with Deep Learning”, M. Erdmann, F. Schlüter and R. Šmída  1901.04079.pdf (arxiv.org) and Denoising-autoencoder/Denoising Autoencoder MNIST.ipynb at master · RAMIRO-GM/Denoising-autoencoder · GitHub.  We will illustrate a simple autoencoder that pulls the signals from the noise with reasonable accuracy.

For fun, we use three different data sets. One is a simulation of cosmic-ray-induced air showers that are measured by radio antennas as described in the excellent book “Deep Learning for Physics Research” by Erdmann, Glombitza, Kasieczka and Klemradt. The data consists of two components: one is a trace of the radio signal and the second is a trace of the simulated cosmic ray signal. A sample is illustrated in Figure 1a below with the cosmic ray signal in red and the background noise in blue. We will also use a second data set that consists of the same cosmic ray signals and uniformly distributed background noise as shown in Figure 1b. We will also look at a third data set consisting of the cosmic ray signals and a simple sine wave noise in the background (Figure 1c). As you might expect, and we will illustrate, this third data set is very easy to clean.


Figure 1a, original data                    Figure 1b, uniform data              Figure 1c, cosine data

To eliminate the noise and recover the signal we will train an autoencoder based on a classic autoencoder design with an encoder network that takes as input the full signal (signal + noise) and a decoder network that produces the cleaned version of the signal (Figure 2). The projection space into which the encoder network sends its input is usually much smaller than the input domain, but that is not the case here.

Figure 2.   Denoising Autoencoder architecture.

In the network here the input signals are samples in [0, 1]^500. The projection of each input consists of 16 vectors of length 31 that are derived from a sequence of convolutions and 2×2 pooling steps. In other words, we have gone from a space of dimension 500 to one of dimension 16 × 31 = 496, which is not a very big compression. The details of the network are shown in Figure 3.

The one-dimensional convolutions act like frequency filters which seem to leave the high frequency cosmic ray signal more visible. To see the result, Figure 4 shows the output of the network for three sample test examples from the original data. As can be seen, the first is a reasonable reconstruction of the signal, the second is in the right location but the reconstruction is weak, and the third is completely wrong.

Figure 3.   The network details of the denoising autoencoder.

Figure 4.  Three examples from the test set for the denoiser using the original data

The results for the two other synthetic data sets (uniformly distributed random noise and noise created from a cosine function) are much more “accurate”.

Figure 5. Samples from the random noise case (top) and the cosine noise case (bottom)

We can use the following naive method to assess accuracy. We simply track the location of the reconstructed signal.  If the maximum value of the recovered signal is within 2% of the maximum value of the true signal, we call that a “true reconstruction”. Based on this very informal metric we see that the accuracy  for the test set for the original data is 69%.    It is 95% for the uniform data case and 98% for the easy cosine data case. 
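One way to code this informal metric is sketched below. It rests on an assumption that the text leaves open: that “within 2%” refers to the position of the peak, measured relative to the length of the trace. The function names and the synthetic pulses are hypothetical and are not taken from the notebook.

```python
import numpy as np

def peak_matches(recovered, truth, tol=0.02):
    """True when the two peak positions differ by at most tol times the trace length."""
    n = len(truth)
    return abs(int(np.argmax(recovered)) - int(np.argmax(truth))) <= tol * n

def reconstruction_accuracy(recovered_batch, truth_batch, tol=0.02):
    """Fraction of traces whose recovered peak counts as a true reconstruction."""
    hits = [peak_matches(r, t, tol) for r, t in zip(recovered_batch, truth_batch)]
    return sum(hits) / len(hits)

# Synthetic stand-ins for network output: Gaussian pulses on a 500-sample trace
n = 500
t = np.arange(n)
truth = np.exp(-0.5 * ((t - 200) / 5.0) ** 2)
good = np.exp(-0.5 * ((t - 204) / 5.0) ** 2)   # peak 4 samples off, inside 2% of 500
bad = np.exp(-0.5 * ((t - 350) / 5.0) ** 2)    # peak 150 samples off

accuracy = reconstruction_accuracy([good, bad], [truth, truth])
print(accuracy)
```

With one hit and one miss the sketch reports an accuracy of 0.5. A metric based on peak height instead of peak position would need only a change to peak_matches.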

Another way to see this result is to compare the responses in the frequency domain. Applying an FFT to the input and output signals gives the results in Figure 6.

Figure 6a, original data                    Figure 6b, uniform data          Figure 6c, cosine data

The blue line is the frequency spectrum of the true signal, and the green line is the recovered signal. As can be seen, the general profiles are all reasonable with the cosine data being extremely accurate.
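A frequency-domain comparison of this kind can be sketched with NumPy's real FFT. The signals here are synthetic stand-ins (a short modulated pulse and a scaled copy playing the role of the network output), so the sketch shows only the mechanics and not the data behind Figure 6.

```python
import numpy as np

def magnitude_spectrum(signal):
    """Magnitude of the one-sided FFT of a real signal."""
    return np.abs(np.fft.rfft(signal))

# Hypothetical signals: a pulse oscillating at 0.2 cycles per sample
n = 500
t = np.arange(n)
true_signal = np.exp(-0.5 * ((t - 250) / 3.0) ** 2) * np.cos(2 * np.pi * 0.2 * t)
recovered = 0.9 * true_signal      # stand-in for a denoiser output

spec_true = magnitude_spectrum(true_signal)
spec_rec = magnitude_spectrum(recovered)
freqs = np.fft.rfftfreq(n)         # frequency of each bin, in cycles per sample

# Location of the dominant frequency in each spectrum
peak_true = freqs[np.argmax(spec_true)]
peak_rec = freqs[np.argmax(spec_rec)]
print(peak_true, peak_rec)
```

Plotting spec_true and spec_rec against freqs gives curves of the kind described above. Comparing the peak frequencies is one simple numerical summary of how well the profiles agree.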

Generative Autoencoders.

Experimental data drives scientific discovery.   Data is used to test theoretical models.   It is also used to initialize large-scale scientific simulations.   However, we may not always have enough data at hand to do the job.   Autoencoders are increasingly  being used in science to generate data that fits the probabilistic density profile of known samples or real data.  For example, in a recent paper, Henkes and Wessels use generative adversarial networks (discussed below) to generate 3-D microstructures for studies of continuum micromechanics.   Another application is to use generative neural networks to generate test cases to help tune new instruments that are designed to capture rare events.  

In the experiments for the remainder of this post we use a very small collection of images of Galaxies so that all of this work can be done on a laptop.  Figure 7 below shows a sample.   The images are 128 by 128 pixels with three color channels.  

Figure 7.  Samples from the Galaxy image collection

Variational Autoencoders

A variational autoencoder is a neural network that uses an encoder network to reduce data samples to a lower dimensional subspace. A decoder network takes samples from this hypothetical latent space and uses them to regenerate the samples. The decoder becomes a means to map a standard normal distribution on this latent space into the probability distribution of our samples. In our case, the encoder has a linear map from the 49152 element input (128 × 128 pixels × 3 channels) down to a vector of length 500. This is then mapped by two 500 by 50 linear transforms down to two vectors of length 50. The decoder does a renormalization step to produce a single vector of length 50 representing our latent space vector. This is expanded by two linear transformations and a ReLU map back to length 49152. After a sigmoid transformation we have the decoded image. Figure 8 illustrates the basic architecture.

Figure 8.   The variational autoencoder.

Without going into the mathematics of how this works (for details see our previous post or the “Deep Learning for Physics Research” book or many on-line sources), the network is designed so that the encoder network generates the mean (mu) and the log of the variance (logvar) of the projection of the training samples in the small latent space such that the decoder network will recreate the input. The training works by computing the loss as a combination of two terms: the mean squared error of the difference between the regenerated image and the input image, and the Kullback-Leibler divergence between the standard normal distribution and the distribution generated by the encoder.
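The two loss terms can be written out in a few lines of NumPy. This sketch assumes the usual VAE formulation, with a diagonal Gaussian encoder and a standard normal prior, and it adds the two terms with equal weight, which is also an assumption. The function names are illustrative and are not taken from the notebook.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps, with eps ~ N(0, I) and sigma = exp(logvar / 2)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL divergence from N(mu, diag(exp(logvar))) to N(0, I), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))

def vae_loss(recon, target, mu, logvar):
    """Mean squared reconstruction error plus the KL term (equal weighting assumed)."""
    mse = np.mean((recon - target) ** 2)
    return mse + kl_to_standard_normal(mu, logvar)

rng = np.random.default_rng(0)
mu = np.zeros(50)          # latent dimension 50, as in the text
logvar = np.zeros(50)
z = reparameterize(mu, logvar, rng)

kl_zero = kl_to_standard_normal(mu, logvar)            # encoder already matches the prior
kl_shifted = kl_to_standard_normal(np.ones(50), logvar)  # mean shifted by 1 in every dimension
print(z.shape, kl_zero, kl_shifted)
```

When the encoder output already matches the prior the KL term is zero, and shifting every mean by one raises it to 25 (0.5 per latent dimension), which is the pressure that keeps the latent codes close to the prior during training.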

The PyTorch code is available as a notebook in GitHub. (The same GitHub directory contains the zipped data file.) Figure 9 illustrates the results of the encode/decode on 8 samples from the data set.

Figure 9.   Samples of galaxy images from the training set and their reconstructions from the VAE.

An interesting experiment is to see how robust the decoder is to changes in the selection of the latent variable input. Figure 10 illustrates the response of the decoder when we follow a path in the latent space from one instance from the training set to another very similar image.

Figure 10.  Image 0 and image 7 are samples from the training set. Images 1 through 6 are generated from points along the path between the latent variable for 0 and for 7.
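A path of this kind can be generated by linear interpolation between two latent vectors. The sketch below assumes a straight-line path and uses random vectors in place of encoder outputs. In the real experiment each row of the result would be passed through the trained decoder.

```python
import numpy as np

def latent_path(z_start, z_end, n_points):
    """Evenly spaced points on the segment from z_start to z_end, endpoints included."""
    ts = np.linspace(0.0, 1.0, n_points)
    return np.stack([(1.0 - t) * z_start + t * z_end for t in ts])

rng = np.random.default_rng(0)
z0 = rng.standard_normal(50)     # stand-in for the latent vector of image 0
z7 = rng.standard_normal(50)     # stand-in for the latent vector of image 7

# Rows 0 and 7 are the endpoints; rows 1 through 6 lie between them
path = latent_path(z0, z7, 8)
print(path.shape)
```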

Another interesting application was recently published. In the paper “Detection of Anomalous Grapevine Berries Using Variational Autoencoders” Miranda et al. show how a VAE can be used to examine aerial photos of vineyards to spot areas of possible diseased grapes.

Generative Adversarial Networks (GANs)

Generative Adversarial Networks were introduced by Goodfellow et al. (arXiv:1406.2661) as a way to build neural networks that can generate very good examples that match the properties of a collection of objects.

As mentioned above, artificial examples generated by autoencoders can be used as starting points for solving complex simulations. In the case of astronomy, cosmological simulation is used to test our models of the universe. In “Creating Virtual Universes Using Generative Adversarial Networks” (arXiv:1706.02390v2 [astro-ph.IM] 17 Aug 2018) Mustafa Mustafa et al. demonstrate how a slightly-modified standard GAN can be used to generate synthetic images of weak lensing convergence maps derived from N-body cosmological simulations. In the remainder of this tutorial, we look at GANs.

Given a collection r of objects in R^m, a simple way to think about a generative model is as a mathematical device that transforms samples from a multivariate normal distribution N^k(0,1) into R^m so that they look like they come from the actual distribution P_r for our collection r. Think of it as a function

G : N^k(0,1) → R^m

which maps the normal distribution into a distribution P_g over R^m.

In addition, assume we have a discriminator function

D : R^m → [0,1]

with the property that D(x) is the probability that x is in our collection r. Our goal is to train G so that P_g matches P_r. Our discriminator is trained to reject images generated by the generator while recognizing all the elements of P_r. The generator is trained to fool the discriminator, so we have a game defined by the minimax objective:
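In the notation of this section, the standard minimax objective from Goodfellow et al. can be written as below. This is the textbook form, restated with the symbols used here (P_r for the data distribution, z drawn from N^k(0,1)), and may differ in detail from the variant used in the notebook.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim P_r}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim N^k(0,1)}\big[\log\big(1 - D(G(z))\big)\big]
```

The first term rewards the discriminator for accepting real samples and the second for rejecting generated ones, while the generator can only influence the second term.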

We have put a simple basic GAN from our previous post.  Running it for many epochs can occasionally get some reasonable results as shown in Figure 11. While this looks good, it is not. Notice that it generated examples of only 3 of our samples. This repeating of an image for different latent vectors is an example of a phenomenon called mode collapse.

 

Figure 11.  The lack of variety in the images is called mode collapse.

acGAN

There are several variations on GANs that avoid many of these problems. One is called an acGAN, for auxiliary classifier GAN, developed by Augustus Odena, Christopher Olah and Jonathon Shlens. For an acGAN we assume that we have a class label for each image. In our case the data has three categories: barred spiral (class 0), elliptical (class 1) and spiral (class 2). The discriminator is modified so that it not only returns the probability that the image is real, it also returns a guess at the class. The generator takes an extra parameter to encourage it to generate an image of that class. Let d-1 = the number of classes then we have the functions

The discriminator is now trained to minimize not only the error in recognizing real images but also the error in class recognition. The best way to understand the details is to look at the code. For this and the following examples we have notebooks that are slight modifications to the excellent work from the public GitHub site of Chen Kai Xu from Tsukuba University. This notebook is here. Figure 12 below shows the result of asking the generator to create galaxies of a given class. The G(z,0) generates good barred spirals, G(z,1) are excellent elliptical galaxies and G(z,2) are spirals.

Figure 12.  Results from asgan-new.  Generator G with random noise vector Z and class parameter for barred spiral = 0,  elliptical = 1 and spiral = 2. 
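The two-part discriminator objective of an acGAN can be sketched in NumPy as the sum of a real-versus-fake term and a class term. This is a minimal illustration with made-up inputs and hypothetical function names. It is not the loss code from the notebook, which may weight or arrange the terms differently.

```python
import numpy as np

def binary_cross_entropy(p, target, eps=1e-12):
    """Real/fake loss: p is the predicted probability that each image is real."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

def class_cross_entropy(logits, labels):
    """Class loss: softmax cross entropy over the class scores."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

def acgan_discriminator_loss(p_real, is_real, class_logits, labels):
    """Sum of the real/fake error and the class recognition error."""
    return binary_cross_entropy(p_real, is_real) + class_cross_entropy(class_logits, labels)

# Made-up batch: two real images and one generated image, one per galaxy class
p_real = np.array([0.99, 0.98, 0.01])
is_real = np.array([1.0, 1.0, 0.0])
labels = np.array([0, 1, 2])          # barred spiral, elliptical, spiral
confident = 10.0 * np.eye(3)          # class scores that favour the right class
uninformed = np.zeros((3, 3))         # class scores that express no preference

loss_good = acgan_discriminator_loss(p_real, is_real, confident, labels)
loss_bad = acgan_discriminator_loss(p_real, is_real, uninformed, labels)
print(loss_good, loss_bad)
```

With identical real/fake predictions, the only difference between the two losses is the class term, which is close to log(3) for the uninformed guess over three classes.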

Wasserstein GAN with Gradient Penalty

The problems of mode collapse and convergence failure were, as stated above, commonly observed. The Wasserstein GAN introduced by Martin Arjovsky, Soumith Chintala, and Léon Bottou directly addressed these problems. Later Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin and Aaron Courville introduced a modification called a gradient penalty which further improved the stability of GAN convergence. To accomplish this they added an additional loss term to the total loss for training the discriminator as follows:
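For reference, the discriminator loss with gradient penalty as given by Gulrajani et al. can be written as below. The symbols are those of that paper restated in this section's notation: x̃ is a generated sample, x a real sample, and x̂ a point sampled on the straight line between a real and a generated sample.

```latex
L = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
  - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
  + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
    \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big]
```

The first two terms are the original Wasserstein critic loss, and the last term pushes the norm of the critic's gradient toward 1 at the interpolated points.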

Setting the parameter lambda to 10.0 was seen as effective for reducing the occurrence of mode collapse. The gradient penalty value is computed using samples from straight lines connecting samples from Pr and Pg. Figure 13 shows the result of 32 random data vectors for the generator, and the variety of responses reflects the full spectrum of the dataset.
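The sampling along straight lines and the penalty itself can be illustrated with a toy critic whose gradient is known in closed form. Everything in this sketch is hypothetical: in real training the critic is a neural network and its gradient at the interpolated points comes from automatic differentiation, not from a hand-written function.

```python
import numpy as np

def gradient_penalty(critic_grad, real, fake, lam=10.0, rng=None):
    """lam * mean((||grad D(x_hat)||_2 - 1)^2) for x_hat on lines between real and fake samples."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.uniform(0.0, 1.0, size=(real.shape[0], 1))   # one mixing weight per pair
    x_hat = eps * real + (1.0 - eps) * fake                 # points on the straight lines
    norms = np.linalg.norm(critic_grad(x_hat), axis=1)      # gradient norm at each point
    return lam * np.mean((norms - 1.0) ** 2)

# Toy critic D(x) = tanh(w . x), whose gradient is (1 - tanh(w . x)^2) * w
w = np.array([0.6, 0.8])     # unit-norm weight vector

def toy_critic_grad(x):
    s = np.tanh(x @ w)
    return (1.0 - s ** 2)[:, None] * w

rng = np.random.default_rng(0)
real = rng.normal(1.0, 0.1, size=(32, 2))     # stand-in for samples from Pr
fake = rng.normal(-1.0, 0.1, size=(32, 2))    # stand-in for samples from Pg

penalty = gradient_penalty(toy_critic_grad, real, fake, rng=rng)

# A linear critic D(x) = w . x has gradient w everywhere, so its penalty vanishes
linear_penalty = gradient_penalty(lambda x: np.tile(w, (x.shape[0], 1)), real, fake, rng=rng)
print(penalty, linear_penalty)
```

The saturating critic is penalized because its gradient norm falls below 1 away from the origin, while the linear critic with a unit-norm weight vector incurs no penalty.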

Figure 13.   Results from Wasserstein with gradient penalty.

The notebook for wgan-gp is here.

Wasserstein Divergence for GANs

Jiqing Wu, Zhiwu Huang, Janine Thoma, Dinesh Acharya, and Luc Van Gool introduced a variation on WGAN-gp called WGAN-div that addresses several technical constraints of WGAN-gp having to do with Lipschitz continuity not discussed here (see the paper). They propose a better loss function for training the discriminator:

By experimental analysis they determine that the best choices for the k and p hyperparameters are 2 and 6.

Figure 14 below illustrates the results after 5000 epochs of training. The notebook is here.

Figure 14.   Results from WGAN-div experiments.

Once again, this seems to have eliminated the mode collapse problem.

Conclusion

This blog was written to do a better job illustrating autoencoder neural networks than our original articles. We illustrated a denoising autoencoder, a variational autoencoder and three generative adversarial networks. Of course, this is not the end of the innovation that has taken place in this area. A good example of recent work is the progress made on Masked Autoencoders. Kaiming He et al. published Masked Autoencoders Are Scalable Vision Learners in December 2021. The idea is very simple and closely related to the underlying concepts used in Transformers for natural language processing like BERT or even GPT-3. Masking simply removes patches of the data and trains the network to “fill in the blanks”. The same idea has now been applied to audio signal reconstruction. VL-BEiT from MSR is a generative vision-language foundation model that incorporates images and text. These new techniques show promise to bring more semantic richness to the results than previous methods.

Understanding MLOps: a Review of “Practical Deep Learning at Scale with MLFlow” by Yong Liu

Research related to deep learning and its applications is now a substantial part of recent computer science.  Much of this work involves building new, advanced models that outperform all others on well-regarded benchmarks.  This is an extremely exciting period of basic research.  However, for those data scientists and engineers involved in deploying deep learning models to solve real problems, there are concerns that go beyond benchmarking.  These  involve the reliability, maintainability, efficiency and explainability of the deployed services.    MLOps refers to the full spectrum of best practices and procedures from designing the training data to final deployment lifecycle.   MLOps is the AI version of DevOps: the modern software deployment model that combines software development (Dev) and IT operations (Ops).   There are now several highly integrated platforms that can guide the data scientist/engineer through the maze of challenges to deploying a successful ML solution to a business or scientific problem.

Interesting examples of MLOps tools include

  • Algorithmia – originally a Seattle startup building an “algorithmic services platform” which evolved into a full MLOps system capable of managing the full ML lifecycle.  Algorithmia was acquired by DataRobot and is now widely used.
  • Metaflow is an open source MLOps toolkit originally developed by Netflix.  Metaflow uses a directed acyclic graph to encode the steps and manage code and data versioning.
  • Polyaxon is a Berlin based company that is a “Cloud Native Machine Learning Automation Platform.”

MLFlow, developed by Databricks (and now under the custody of the Linux Foundation), is the most widely used MLOps platform and the subject of three books. The one reviewed here is “Practical Deep Learning at Scale with MLFlow” by Dr. Yong Liu. According to Dr. Liu, the deep learning life cycle consists of

  1. Data collection, cleaning, and annotation/labeling.
  2. Model development which is an iterative process that is conducted off-line.
  3. Model deployment and serving it in production.
  4. Model validation and online testing done in a production environment.
  5. Monitoring and feedback data collection during production.

MLFlow provides the tools to manage this lifecycle.  The book is divided into five sections that cover these items in depth.  The first section is where the reader is acquainted with the basic framework.   The book is designed to be hands-on with complete code examples for each chapter.  In fact about 50% of the book is leading the reader through this collection of excellent examples.   The up-side of this approach is that the reader becomes a user and gains expertise and confidence with the material.  Of course, the downside (as this author has learned from his own publications) is that software evolves very fast and printed versions go out of date quickly.  Fortunately, Dr. Liu has a GitHub repository for each chapter that can keep the examples up to date.   

In the first section of the book, we get an introduction to MLFlow. The example is a simple sentiment analysis written in PyTorch. More precisely, it implements a transfer learning scenario that highlights the use of Lightning Flash, which provides a high-level set of tools that encapsulate standard operations like basic training, fine tuning and testing. In chapter two, MLFlow is first introduced as a means to manage the experimental life cycle of model development. This involves the basic steps of defining and running the experiment. We also see the MLFlow user portal. In the first step the experiment is logged with the portal server, which records the critical metadata from the experiment as it is run.

This reader was able to do all of the preceding on a Windows 11 laptop, but for the next steps I found another approach easier. Databricks is the creator of MLFlow, so it is not surprising that MLFlow is fully supported on their platform. The book makes it clear that the code development environment of choice for MLOps is not my favorite Jupyter, but rather VSCode. And for good reasons. VSCode interoperates with MLFlow brilliantly when running your own copy of MLFlow. If you use the Databricks portal the built-in notebook editor works and is part of the MLFlow environment. While Databricks has a free trial account, many of the features described below are not available unless you have an account on AWS or Azure or GCS and a premium Databricks account.

One of the excellent features of MLFlow is its ability to track code versions, data and pipelines. As you run experiments you can modify the code in the notebook and run it again. MLFlow keeps track of the changes, and you can return to previous versions with a few mouse clicks (see Figure 1).

Figure 1.  MLFlow User Interface showing a history of experiments and results.

Pipelines in MLFlow are designed to allow you to capture the entire process from feature engineering through model selection and tuning (see Figure 2). The pipelines considered are data wrangling, model build and deployment. MLFlow supports the data centric deep learning model using Delta Lake to allow versioned and timestamped access to data.

Figure 2.  MLFlow pipeline stages illustration.  (From Databricks)

Chapter 5 describes the challenges of running at scale and Chapter 6 takes the reader through hyperparameter tuning at scale. The author takes you through running locally with a local code base, then running remote code from GitHub, and finally running the code from GitHub remotely on a Databricks cluster.

In Chapter 7, Dr. Liu dives into the technical challenge of hyperparameter optimization (HPO). He compares three sets of tools for doing HPO at scale, but he settles on Ray Tune, which works very well with MLFlow. We have described Ray Tune elsewhere in our blog, but the treatment in the book is much more advanced.

Chapter 8 turns to the very important problem of doing ML inference at scale. Chapter 9 provides an excellent, detailed introduction to the many dimensions of explainability. The primary tool discussed is based on SHapley Additive exPlanations (SHAP), but others are also discussed. Chapter 10 explores the integration of SHAP tools with MLFlow. Together these two chapters provide an excellent overview of the state of the art in deep learning explainability even if you don’t study the code details.

Conclusion

Deep learning is central to the concepts embodied in the notion that much of our current software can be generated or replaced entirely by neural network driven solutions. This is often called “Software 2.0”. While this may be an apocryphal idea, it is clear that there is a great need for tools that can help guide developers through the best practices and procedures for deploying deep learning solutions at scale. Dr. Yong Liu’s book “Practical Deep Learning at Scale with MLFlow” is the best guide available to navigating this new and complex landscape. While much of it is focused on the MLFlow toolkit from Databricks, it is also a guide to the concepts that motivate the MLFlow software. This is especially true when he discusses the problem of building deep learning models, hyperparameter tuning and inference systems at scale. The concluding chapters on deep learning explainability together comprise one of the best essays on this topic I have seen. Dr. Liu is a world-class expert on MLOps and this book is an excellent contribution.

Explainable Deep Learning and Guiding Human Intuition with AI.

In July 2021 Alex Davies and a team from DeepMind, Oxford and University of Sydney published a paper entitled “Advancing mathematics by guiding human intuition with AI”. The paper addresses the question: how can machine learning be used to guide intuition in mathematical discovery? The formal approach they take to this question proceeds as follows. Let Z be a collection of objects. Suppose that for each instance z in Z we have two distinct mathematical representations of z: X(z) and Y(z). We can then ask, without knowing z, is there a mathematical function f : X -> Y such that given X(z) and Y(z), f(X(z)) = Y(z)? Suppose the mathematician builds a machine learning model trained on many instances of X(z) and Y(z). That model can be thought of as a function f^ : X -> Y such that f^(X(z)) ~ Y(z). The question then becomes, can we use properties of that model to give us clues on how to construct the true f?

A really simple example that the authors give is to let Z be the set of convex polyhedra (cube, tetrahedron, octahedron, etc.). If we let X(z) be the tuple of numbers defined by the number of edges, the number of vertices, the volume of z and the surface area, and let Y(z) be the number of faces, then without knowing z, the question becomes: is there a function f: R^4 -> R such that f(X(z)) = Y(z)? Euler answered this question some time ago in the affirmative. He proved that

f(edges, vertices, volume, surface area)  =  edges – vertices + 2 = faces.

Now suppose we did not have Euler to help us. Given a big table where each row corresponds to (edges, vertices, volume, surface area, faces) for some convex polytope, we can select a subset of rows as a training set and try to build a model to predict faces given the other values. Should our AI model prove highly accurate on the test set consisting of the unselected rows, that may lead us to suspect that such a function exists. In a statistical sense, the learned model is such a function, but it may not be exact and, worse, by itself, it may not lead us to a formula as satisfying as Euler’s.
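A tiny version of this experiment can be run on the five Platonic solids. The sketch below is an illustration only, not the procedure from the paper: it leaves out volume and surface area, uses a plain least-squares fit instead of a neural network, and has no train/test split.

```python
import numpy as np

# (name, edges, vertices, faces) for the five Platonic solids
solids = [
    ("tetrahedron", 6, 4, 4),
    ("cube", 12, 8, 6),
    ("octahedron", 12, 6, 8),
    ("dodecahedron", 30, 20, 12),
    ("icosahedron", 30, 12, 20),
]

# Euler's formula checked row by row: faces = edges - vertices + 2
euler_holds = all(e - v + 2 == f for _, e, v, f in solids)

# Fit faces ~ a*edges + b*vertices + c without telling the model the formula
X = np.array([[e, v, 1.0] for _, e, v, _ in solids])
y = np.array([f for _, _, _, f in solids], dtype=float)
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(euler_holds, np.round(coeffs, 6))
```

The fitted coefficients come out as 1, -1 and 2, which is Euler's formula read off from data. A linear fit exposes its own structure in a way a deep network does not, which is where the attribution methods discussed in this article come in.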

This leads us to Explainable AI. This is a topic that has grown in importance over the last decade as machine learning has been making more and more decisions “on our behalf”, such as which movies we should rent and which social media articles we should read. We wonder “Why did the recommender come to the conclusion that I would like that movie?” This is now a big area of research (the Wikipedia article has a substantial bibliography on Explainable AI). One outcome of this work has been a set of methods that can be applied to trained models to help us understand what parts of the data are most critical in the model’s decision making. Davies and his team are interested in understanding what are the most “salient” features of X(z) in relation to determining Y(z) and using this knowledge to inspire the mathematician’s intuition in the search for f. We return to their mathematical examples later, but first let’s look closer at the concept of salience.

Salience and Integrated Gradients

Our goal is to understand how important each feature of the input to a neural network is to the outcome. The features that are most important are often referred to as “salient” features. In a very nice paper from 2017, Axiomatic Attribution for Deep Networks, Sundararajan, Taly and Yan consider the question of attribution. When considering the attribution of input features to output results of DNNs, they propose two reasonable axioms. The first is Sensitivity: if a feature of the input causes the network to make a change then that feature should have a non-zero attribution. In other words, it is salient. Represent the network as a function F: R^n -> [0,1] for n-dimensional data. In order to make the discussion more precise we need to pick a baseline input x’ that represents the network in an inactivated state: F(x’) = 0. For example, in a vision system, an image that is all black will do. We are interested in finding the features in x that are critical when F(x) is near 1.

The second axiom is more subtle. Implementation Invariance:  If two neural networks are equivalent (i.e. they give the same results for the same input), the attribution of a feature should be the same for both.

The simplest form of salience computation is to look at the gradient of the network. For each i in 1 .. n, we can look at the components of the gradient and define

This definition satisfies implementation invariance, but unfortunately it fails the sensitivity test. The problem is that the value of F(x_e) for some x_e may be 1, but the gradient may be 0 at that point. We will show an example of this below. On the other hand, if we think about “increasing” x from x’ to x_e, there should be a transition of the gradient from 0 to non-zero as F(x) increases towards 1. That motivates the definition of Integrated Gradients. We are going to add up the values of the gradient along a path from the baseline to a value that causes the network to change.

Let γ = (γ1, . . . , γn) : [0, 1] → R^n be a smooth function specifying a path in R^n from the baseline x’ to the input x, i.e., γ(0) = x’ and γ(1) = x.   It turns out that it doesn’t matter which path we take because we will be approximating the path integral,  and by the fundamental theorem of calculus applied to path integrals,  we have

Expanding the integral out in terms of the components of the gradient,

Now, picking the path that represents the straight line between x’ and x as

Substituting this in the right-hand side and simplifying, we can set the attribution for the ith component as

To compute the attribution of factor i, for input x, we need only evaluate the gradient along the path at several points to approximate the integral. In the examples below we show how salience in this form and others may be used to shed some light on our understanding problem.
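As a rough illustration of "evaluate the gradient along the path at several points", here is a minimal sketch in plain Python. The function F and its gradient are hypothetical stand-ins for a trained network (chosen so that F at the all-zero baseline is 0), and the midpoint Riemann sum is just one reasonable way to approximate the integral.

```python
import math

def integrated_gradients(grad_f, x, baseline, steps=200):
    """Approximate the attribution of each component of x with a midpoint
    Riemann sum along the straight line from baseline to x."""
    n = len(x)
    totals = [0.0] * n
    for k in range(steps):
        t = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + t * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(n):
            totals[i] += g[i]
    # (x_i - x'_i) times the average gradient seen along the path
    return [(xi - b) * total / steps for xi, b, total in zip(x, baseline, totals)]

# Hypothetical smooth "network": F(x) = 1 - exp(-(x1 + 2*x2)), so F(0, 0) = 0.
def F(x):
    return 1.0 - math.exp(-(x[0] + 2.0 * x[1]))

def grad_F(x):
    e = math.exp(-(x[0] + 2.0 * x[1]))
    return [e, 2.0 * e]

x = [1.0, 1.5]
baseline = [0.0, 0.0]
attributions = integrated_gradients(grad_F, x, baseline)

# The attributions should add up to F(x) - F(baseline), up to discretization error.
completeness_gap = abs(sum(attributions) - (F(x) - F(baseline)))
```

The quantity completeness_gap is a convenient sanity check: if it is not small, more steps are needed.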

Digression: Computing Salience and Integrated Gradients using Torch

Readers not interested in how to do the computation of salience in PyTorch can skip this section and go on to the next section on Mathematical Intuition and Knots.

A team from Facebook AI introduced Captum in 2019 as a library designed to compute many types of salience models. It is designed to work with PyTorch deep learning tools. To illustrate it we will look at a simple example to show where simple gradient salience breaks down yet integrated gradients works fine. The complete details of this example are in this notebook on GitHub.

We start with the following really dumb neural network consisting of one relu operator on two input parameters.

A quick look at this suggests that the most salient parameter is input1, because it has 10 times the influence on the result that input2 has. Of course, the relu operator tells us that the result is flat for large positive values of input1 and input2. We see that as follows.

We can directly compute the gradient using basic automatic differentiation in Torch.  When we evaluate the partial derivatives for these values of input1 and input2 we see they are zero.

From Captum we can grab the IntegratedGradients and Saliency operators and apply them as follows.

The integrated gradients approximation shows that input1 does indeed have 10 times the attribution strength of input2, and the sum of the two attributions is m(input1, input2) plus an error term. As expected, the simple gradient method of computing salience fails.
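The effect can be reproduced without Torch or Captum. The sketch below is a dependency-free approximation of the experiment: the network m is a hypothetical stand-in chosen only to match the description (one relu, two inputs, input1 weighted ten times more than input2) and is not the notebook's code. The baseline (0, 0) and the midpoint Riemann sum are likewise assumptions.

```python
def relu(z):
    return z if z > 0.0 else 0.0

# Hypothetical stand-in for the toy network: the output saturates at 1
# once 10*input1 + input2 reaches 1.
def m(input1, input2):
    return 1.0 - relu(1.0 - 10.0 * input1 - input2)

def grad_m(input1, input2):
    # Exact partial derivatives; both vanish once the relu is switched off.
    active = (1.0 - 10.0 * input1 - input2) > 0.0
    return (10.0, 1.0) if active else (0.0, 0.0)

def integrated_gradients(x, baseline=(0.0, 0.0), steps=1000):
    sums = [0.0, 0.0]
    for k in range(steps):
        t = (k + 0.5) / steps  # midpoint rule on [0, 1]
        point = [b + t * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_m(*point)
        sums[0] += g[0]
        sums[1] += g[1]
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, sums)]

x = (1.0, 1.0)
plain_gradient = grad_m(*x)    # zero in both components: the output is flat here
ig = integrated_gradients(x)   # roughly 0.91 for input1 and 0.09 for input2
error_term = sum(ig) - (m(*x) - m(0.0, 0.0))
```

At (1, 1) the plain gradient is exactly zero, while the integrated gradients keep the 10-to-1 ratio between the two inputs. Their sum differs from m(1, 1) - m(0, 0) = 1 only by the small discretization error of the Riemann sum.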

Of course this extreme example does not mean simple gradient salience will fail all the time.   We will return to Captum (but without the details) in another example later in this article.

Mathematical Intuition and Knots

Returning to mathematics, the Davies team considered two problems.   The methodology they used is described in the figure below which roughly corresponds to our discussion in the introduction.

Figure 1. Experimental methodology used by Davies et al. (Figure 1 from “Advancing mathematics by guiding human intuition with AI”. Nature, Vol 600, 2 December 2021)

They began with a conjecture about knot theory. Specifically, they were interested in the conjecture that the geometric invariants of knots (playing the role of X(z) in the scenario above) could determine some of the algebraic invariants (as Y(z)). See Figure 2 below.

Figure 2. Geometric and algebraic invariants of hyperbolic knots. (Figure 2 from “Advancing mathematics by guiding human intuition with AI”. Nature, Vol 600, 2 December 2021)

The authors of the paper had access to information on 18 geometric invariants for 243,000 knots and built a custom deep learning stack to try to identify the salient invariants that could determine the signature of the knot (a snapshot of the information is shown below). Rather than describing their model, we decided to apply one generated by the AutoML tool provided by Azure ML Studio.

Figure 3. Snapshot of the knot invariant data rendered as a Python pandas DataFrame.

We used a Jupyter Notebook to interact with the remote instances of Azure ML Studio. The notebook is on GitHub here: dbgannon/math: Notebooks from Explainable Deep Learning and Guiding Human Intuition with AI (github.com). The notebook also contains links to the dataset.

Because we described Azure ML Studio in a previous post, we will not go into it here in detail. We formulated the computation as a straightforward regression problem. AutoML completed the model selection and training, and it also computed the factor salience, with the results shown below.

Figure 4.  Salience of features computed by Azure AutoML using their Machine Learning services  (To see the salience factors one has to look at the output details on the ML studio page for the computation.)

The three top factors were the real and imaginary components of meridinal_translation and the longitudinal translation. These are the same top factors that were revealed in the authors’ study, but in a different order.

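Based on this hint the authors proposed a conjecture: for a hyperbolic knot K define slope(K) to be the real part of the fraction longitudinal translation/meridinal_translation. Then there exist constants c1 and c2 such that

$$|2\sigma(K) - \mathrm{slope}(K)| < c_1\,\mathrm{vol}(K) + c_2$$

where σ(K) is the signature of K and vol(K) is its hyperbolic volume.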

In Figure 5, we show scatter plots of the predicted signature versus the real signature of each of the knots in the test set. As you can see, the predicted signatures form a reasonably tight band around the diagonal (true signatures). The mean squared error of the formula slope(K)/2 from the true signature was 0.86 and the mean squared error of the model predictions was 0.37. This suggests that the conjecture may need some small correction terms, but that is up to the mathematician to prove. Otherwise, we suspect the bounds in the inequality are reasonable.
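For readers who want to see the slope(K)/2 comparison spelled out, here is a small sketch. The three rows of data are made up purely for illustration and are not taken from the knot dataset. Treating the longitudinal translation as a real number and the meridional translation as a complex number is an assumption based on the real and imaginary components mentioned above.

```python
def slope(longitudinal_translation, meridional_translation):
    """Real part of longitudinal translation / meridional translation."""
    return (complex(longitudinal_translation) / complex(meridional_translation)).real

def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Made-up rows for illustration only: (longitudinal, meridional, signature).
knots = [
    (8.0, complex(1.0, 1.0), 2),
    (-12.0, complex(1.0, -1.0), -4),
    (2.0, complex(0.5, 0.5), 0),
]

# Use slope(K) / 2 as a formula-based prediction of the signature.
predicted = [slope(lon, mer) / 2.0 for lon, mer, _ in knots]
actual = [sig for _, _, sig in knots]
mse = mean_squared_error(predicted, actual)
```

On real data the same two functions would be applied to the corresponding columns of the dataframe, and mse would be compared with the error of the trained model.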

Figure 5.  On the left is a scatter plot of the slope(K)/2 computed from the geometric data vs the signature.  On the right is the signature predicted by the model vs the signature.  

Graph Neural Nets and Salient Subgraphs.

The second problem that the team looked at involved representation theory. In this case they are interested in pairs of elements in the symmetric group Sn represented as permutations. For example, in S5 an instance z might be {(03214), (34201)}. An interesting question to study is how to transform the first permutation into the second by simple 2-element exchanges (rotations) such as 03214->13204->31204->34201. In fact, there are many ways to do this and we can build a directed graph showing the various paths of rotations to get from the first to the second. This graph is called the unlabeled Bruhat interval, and it is their X(z). The Y(z) is the Kazhdan–Lusztig (KL) polynomial for the permutation pair. To go any deeper into this topic is way beyond the scope of this article (and beyond the knowledge of this author!). Rather, we shall jump to their conclusion and then consider a different problem related to salient subgraphs. By looking at salient subgraphs of the Bruhat interval graph for a pair in Sn, they discovered that there was a hypercube and a subgraph isomorphic to an interval in Sn−1. This led to a formula for computing the KL polynomial. A very hard problem solved!
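To make the directed graph of exchanges a little more concrete, here is a rough sketch in plain Python. It assumes the usual convention that an edge goes from one permutation to another when a single exchange of two entries increases the number of inversions, and that the interval consists of every permutation lying on some such path from the first permutation to the second. The function names are ours, and this is not the construction code used by the Davies team.

```python
from itertools import combinations, permutations

def inversions(p):
    """Number of position pairs i < j with p[i] > p[j]."""
    return sum(1 for i, j in combinations(range(len(p)), 2) if p[i] > p[j])

def up_edges(p):
    """Permutations reached from p by one exchange that increases inversions."""
    base = inversions(p)
    for i, j in combinations(range(len(p)), 2):
        q = list(p)
        q[i], q[j] = q[j], q[i]
        q = tuple(q)
        if inversions(q) > base:
            yield q

def reachable(start, neighbours):
    """All nodes reachable from start by following neighbours (depth-first)."""
    seen, stack = {start}, [start]
    while stack:
        for q in neighbours(stack.pop()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def interval_graph(u, w):
    """Directed graph on the permutations lying on some path from u to w."""
    all_perms = list(permutations(range(len(u))))
    down = {p: [] for p in all_perms}  # reversed edges, for the backward search
    for p in all_perms:
        for q in up_edges(p):
            down[q].append(p)
    nodes = reachable(u, up_edges) & reachable(w, lambda p: down[p])
    return {p: [q for q in up_edges(p) if q in nodes] for p in nodes}

u, w = (0, 3, 2, 1, 4), (3, 4, 2, 0, 1)
graph = interval_graph(u, w)

# The chain of exchanges quoted in the text should be a path in this graph.
path = [u, (1, 3, 2, 0, 4), (3, 1, 2, 0, 4), w]
path_is_in_graph = all(b in graph[a] for a, b in zip(path, path[1:]))
```

For S5 this brute-force search over all 120 permutations is instantaneous; for larger n one would want something smarter.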

An important observation the authors used in designing the neural net model was that information conveyed along the Bruhat interval was similar to message passing models in graph neural networks. These GNNs have become powerful tools for many problems. We will use a different example to illustrate the use of salience in understanding graph structures. The example is one of the demo cases for the excellent PyTorch Geometric libraries. More specifically, it is one of the examples on the Colab Notebooks and Video Tutorials page of the pytorch_geometric 2.0.4 documentation (pytorch-geometric.readthedocs.io). The example illustrates the use of graph neural networks for classification of molecular structures for use as drug candidates.

Mutagenicity is a property of a chemical compound that hampers its potential to become a safe drug. Specifically, there are often substructures of a compound, called toxicophores, that can interact with proteins or DNA in ways that lead to changes in the normal cellular biochemistry. An article from J. Med. Chem. 2005, 48, 1, 312–320, describes a collection of 4337 compounds (2401 mutagens and 1936 nonmutagens). These are included in the TUDataset collection and used here.

The PyTorch Geometric example uses the Captum library (that we illustrated above) to identify the salient substructures that are likely toxicophores. We will not go into great detail about the notebook because it is in their Colab space. If you want to run this on your own machine, we have put a copy in our GitHub folder for this project.

The TUDataset data set encodes molecules, such as the graph below, as an object of the form

Data( edge_index=[2, 26], x=[13, 14], y=[1] )

In this object x represents the 13 vertices (atoms). There are 14 properties associated with each atom, but they are just ‘one-hot’ encodings of the names of each of the 14 possible elements in the dataset: ‘C’, ‘O’, ‘Cl’, ‘H’, ‘N’, ‘F’, ‘Br’, ‘S’, ‘P’, ‘I’, ‘Na’, ‘K’, ‘Li’, ‘Ca’. The edge index represents the 26 edges, where each is identified by the indices of its 2 end atoms. The value y is 0 if this molecule is known to be mutagenic and 1 otherwise.
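The encoding is easy to mimic with plain Python lists. The sketch below builds the same two fields for a hypothetical three-atom molecule (one O bonded to two H). It assumes that each bond is stored twice, once in each direction, which is how undirected graphs are usually laid out in an edge index; the helper names and the dictionary layout are ours and not part of the TUDataset API.

```python
ELEMENTS = ['C', 'O', 'Cl', 'H', 'N', 'F', 'Br', 'S', 'P', 'I', 'Na', 'K', 'Li', 'Ca']

def one_hot(symbol):
    """14-element vector with a single 1.0 at the position of the element."""
    return [1.0 if symbol == element else 0.0 for element in ELEMENTS]

def encode_molecule(atoms, bonds):
    """atoms: list of element symbols; bonds: list of (atom index, atom index)."""
    x = [one_hot(symbol) for symbol in atoms]
    sources, targets = [], []
    for a, b in bonds:
        # Assumption: store every bond in both directions.
        sources += [a, b]
        targets += [b, a]
    return {"x": x, "edge_index": [sources, targets]}

# Hypothetical toy molecule: one oxygen bonded to two hydrogens.
atoms = ['O', 'H', 'H']
bonds = [(0, 1), (0, 2)]
data = encode_molecule(atoms, bonds)

x_shape = (len(data["x"]), len(data["x"][0]))        # 3 atoms, 14 one-hot slots
edge_index_shape = (2, len(data["edge_index"][0]))   # 2 bonds stored as 4 entries
```

Under this layout the shapes play the same role as x=[13, 14] and edge_index=[2, 26] in the object shown above.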

We must first train a graph neural network that will learn to recognize the mutagenic molecules. Once we have that trained network, we can apply Captum’s IntegratedGradients to identify the salient subgraphs that most implicate the whole graph as mutagenic.

The neural network is a five-layer graph convolutional network. A convolutional graph layer works by adding together information from each graph neighbor of each node and multiplying it by a trainable matrix. More specifically, assume that each node w has a vector $\mathbf{x}_w^{(l)}$ of values at level l. We then compute a new vector $\mathbf{x}_v^{(l+1)}$ of values for node v at level l+1 by

$$\mathbf{x}_v^{(l+1)} = \mathbf{W}^{(l+1)} \sum_{w \in \mathcal{N}(v) \cup \{v\}} \frac{1}{c_{w,v}} \cdot \mathbf{x}_w^{(l)}$$

where $\mathbf{W}^{(l+1)}$ denotes a trainable weight matrix of shape [num_outputs, num_inputs] and $c_{w,v}$ refers to a fixed normalization coefficient for each edge. In our case $c_{w,v}$ is the number of edges coming into node v divided by the weight of the edge. Our network is described below.
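One step of this neighbor aggregation can be sketched in a few lines of plain Python. This is only an illustration of the sum-then-multiply pattern, not the PyTorch Geometric implementation: the function name is ours, the normalization coefficient is passed in as a function c(w, v), and the demo simply uses a constant 1.0 for it.

```python
def graph_conv_layer(x, edge_index, weight, c):
    """One aggregation step.

    x          -- list of node vectors at level l
    edge_index -- [sources, targets], one entry per directed edge w -> v
    weight     -- matrix as a list of rows, shape [num_outputs, num_inputs]
    c          -- function c(w, v) giving the normalization coefficient
    """
    num_inputs = len(x[0])
    # Each node starts with its own vector, since the sum runs over N(v) and v itself.
    aggregated = [[value / c(v, v) for value in x[v]] for v in range(len(x))]
    for w, v in zip(edge_index[0], edge_index[1]):
        for k in range(num_inputs):
            aggregated[v][k] += x[w][k] / c(w, v)
    # Multiply every aggregated vector by the weight matrix.
    return [[sum(row[k] * agg[k] for k in range(num_inputs)) for row in weight]
            for agg in aggregated]

# Toy path graph 0 - 1 - 2 with two values per node and one output per node.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]
weight = [[1.0, 1.0]]
new_x = graph_conv_layer(x, edge_index, weight, c=lambda w, v: 1.0)
```

With these toy values the middle node, which has two neighbors, ends up with the largest output.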

The forward method uses the x node property and the edge index map to guide the convolutional steps.   Note that we will use batches of molecules in the training and the parameter batch is how we can distinguish one molecule from another, so that the vector x is finally a pair of values for each element of the batch.  With edge_weight set to None,  the weight of each edge is  a constant 1.0.   

The training step is a totally standard PyTorch training loop. With the dim variable set to 64 and 200 epochs later, the training accuracy is 96% and test accuracy is 84%. To compute the salient subgraphs using integrated gradients we will have to compute the partial derivative of the model with respect to each of the edge weights. To do so, we replace the edge weight with a variable tensor whose value is 1.0.
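The idea of differentiating with respect to edge weights can be illustrated on a tiny hand-built example. Everything in the sketch below is hypothetical: the "model" is a single round of message passing followed by a tanh readout, the gradient is taken by central finite differences rather than automatic differentiation, and the integration runs from an assumed baseline where every edge weight is 0 to the value 1.0 used in the text.

```python
import math

def model(edge_mask, x, edges, readout):
    """Toy graph model: one weighted message-passing step, then a tanh readout."""
    h = list(x)
    for (w, v), m_e in zip(edges, edge_mask):
        h[v] += m_e * x[w]  # message from node w to node v, scaled by the edge weight
    return math.tanh(sum(r * hv for r, hv in zip(readout, h)))

def edge_gradient(f, mask, eps=1e-5):
    """Central finite-difference estimate of df/d(mask[e]) for every edge e."""
    grads = []
    for e in range(len(mask)):
        up, down = list(mask), list(mask)
        up[e] += eps
        down[e] -= eps
        grads.append((f(up) - f(down)) / (2.0 * eps))
    return grads

def edge_integrated_gradients(f, num_edges, steps=100):
    """Integrate the edge gradients while all edge weights grow from 0 to 1."""
    sums = [0.0] * num_edges
    for k in range(steps):
        t = (k + 0.5) / steps
        for e, g in enumerate(edge_gradient(f, [t] * num_edges)):
            sums[e] += g
    return [s / steps for s in sums]  # (1 - 0) times the average gradient

x = [0.1, 1.0, 0.2]                        # one scalar feature per node
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]   # directed edges w -> v
readout = [1.0, 0.5, 0.2]
f = lambda mask: model(mask, x, edges, readout)

ig = edge_integrated_gradients(f, len(edges))
gap = abs(sum(ig) - (f([1.0] * 4) - f([0.0] * 4)))
most_salient_edge = edges[ig.index(max(ig))]
```

In this toy graph the edge carrying the large feature of node 1 into the heavily weighted node 0 receives the largest attribution, which is the kind of ranking used to highlight salient subgraphs.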

Looking at a sample from the set of mutagenic molecules, we can view the integrated gradients for each edge. In Figure 6 below, on the left, we show a sample with the edges labeled by their IG values. To make this easier to see, the example code replaces the IG values with a color code in which large IG values have thicker, darker lines.

The article from J. Med. Chem notes that the occurrence of NO2 is often a sign of mutagenicity, so we are not surprised to see it in this example.   In Figure 7, we show several other examples that illustrate different salient subgraphs. 

Figure 6. The sample from the mutagenic set with the IntegratedGradients values labeling the edges. On the right we have a version with dark, heavy lines representing large IG values.

Figure 7. Four more samples. The subgraph NO2 in the upper left is clearly salient, but in the three other examples we see COH bonds showing up as salient subgraphs. However, these also occur in the other, non-mutagenic set, so the significance is not clear to us non-experts.

Conclusion.

The issue of explainability of ML methods is clearly important when we let ML make decisions about people’s lives. Salience analysis lies at the heart of the contribution to mathematical insight described above. We have attempted to illustrate where it can be used to help us learn about the features in training data that drive classification systems to draw conclusions. However, it takes the inspiration of a human expert to understand how those features are fundamentally related to the outcome.

ML methods are having a far-reaching impact throughout scientific endeavors. Deep learning has become a new tool that is used in particle physics research to improve the search for signals in data, in astronomy, where generative methods can create galaxy models for dark energy experiments, and in biochemistry, where RoseTTAFold and DeepMind’s AlphaFold have used deep learning models to revolutionize protein folding and protein-protein interaction. The models constructed in these cases are composed from well-understood components such as GANs and Transformers, where issues of explainability have more to do with optimization of the model and its resource usage. We will return to that topic in a future study.