{"id":1163228,"date":"2026-03-03T13:59:12","date_gmt":"2026-03-03T21:59:12","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=1163228"},"modified":"2026-03-03T13:59:14","modified_gmt":"2026-03-03T21:59:14","slug":"actions-speak-louder-than-prompts-rethinking-how-llms-reason-over-graph-data","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/actions-speak-louder-than-prompts-rethinking-how-llms-reason-over-graph-data\/","title":{"rendered":"Actions Speak Louder Than Prompts: Rethinking How LLMs Reason Over Graph Data"},"content":{"rendered":"\n<p>By\u202f<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/benfinkelshtein.github.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">Ben Finkelshtein<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\u202f(University of Oxford),\u202f<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/silviu\/\" target=\"_blank\" rel=\"noreferrer noopener\">Silviu Cucerzan<\/a>,\u202f<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/sjauhar\/\" target=\"_blank\" rel=\"noreferrer noopener\">Sujay Kumar Jauhar<\/a>, and\u202f<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/ryenw\/\" target=\"_blank\" rel=\"noreferrer noopener\">Ryen W. White<\/a>\u202f(Microsoft)&nbsp;<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p>Think about the last time you opened a shared document at work. 
Behind that simple action lies a complex network of relationships: the colleagues who edited the file before you, the team site on which it is stored, the related documents those collaborators have touched, and the organizational structure connecting all of it. Such collaborative platforms are built on graphs \u2013 rich networks of people, content, and activity. A fundamental challenge in making them intelligent is understanding what each node in that graph represents. Should this document be flagged as sensitive? Which files should surface in a colleague&#8217;s feed? Does this sharing pattern look anomalous?<\/p>\n\n\n\n<p>These are all instances of <em>node classification<\/em>: given an entity embedded in a network of relationships, the goal is to assign it a meaningful label. It\u2019s a problem that extends far beyond human collaboration to applications such as fraud detection in financial networks, product categorization in e-commerce, and road-traffic congestion prediction. And it&#8217;s a problem where large language models (LLMs) are increasingly being applied.<\/p>\n\n\n\n<p>The appeal of LLMs is clear. 
Graph neural networks (GNNs), the traditional tool for this task, must be trained per dataset, don&#8217;t transfer across domains, and struggle with the rich textual information that real-world nodes often carry \u2013 lengthy document content, detailed product descriptions, user profiles. By contrast, LLMs offer a compelling alternative with their broad world knowledge and flexible reasoning capabilities. Yet despite a surge of interest, the field has lacked a principled understanding of <em>how<\/em> LLMs should interact with graph data, <em>when<\/em> different approaches work best, and <em>why<\/em>.<\/p>\n\n\n\n<p>Our new study, &#8220;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2509.18487\" target=\"_blank\" rel=\"noopener noreferrer\">Actions Speak Louder than Prompts<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,&#8221; which will appear as an oral presentation at the upcoming ICLR 2026 conference, aims to fill that gap. We conducted one of the largest controlled evaluations of LLMs for graph inference to date, spanning 14 datasets across four domains, multiple structural regimes, and a range of model sizes and capabilities. 
The result is a set of practical, actionable insights for&nbsp;people&nbsp;building systems that combine language models with structured data&nbsp;\u2013&nbsp;whether in collaborative platforms, social networks, e-commerce, or beyond.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading is-style-m\" id=\"it-s-not-just-what-you-ask-it-s-how-you-let-the-model-work\"><strong>It&#8217;s&nbsp;not just what you ask;&nbsp;it&#8217;s&nbsp;how you let the model work&nbsp;<\/strong><\/h2>\n\n\n\n<p>When most people think about applying LLMs to a problem, they think about\u202f<em>prompting<\/em>\u202f\u2013&nbsp;crafting the right instructions and feeding the relevant information directly into the&nbsp;model&#8217;s&nbsp;context window. This is indeed the most common approach in the LLM-for-graphs literature: serialize a node&#8217;s neighborhood into text, describe the labels, and ask the model to classify.&nbsp;<\/p>\n\n\n\n<p>However,&nbsp;prompting is only one way an LLM can interact with a graph.&nbsp;To&nbsp;paint a more complete picture of LLM&nbsp;interaction&nbsp;paradigms,&nbsp;we systematically compared three fundamentally different&nbsp;strategies:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Prompting<\/strong>, where the graph neighborhood is serialized into text and presented to the model in a single shot.&nbsp;<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>GraphTool<\/strong>, a&nbsp;ReAct-style approach where the model iteratively queries the graph through a fixed set of tools&nbsp;by&nbsp;retrieving neighbors, reading features, or checking labels one step at a time.&nbsp;<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Graph-as-Code<\/strong>, where the model writes and executes short programs against a structured API, composing arbitrary queries over the graph&#8217;s features, structure, and labels.&nbsp;<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" 
width=\"936\" height=\"339\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-1.png\" alt=\"Diagram showing progression\" class=\"wp-image-1163230\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-1.png 936w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-1-300x109.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-1-768x278.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-1-240x87.png 240w\" sizes=\"auto, (max-width: 936px) 100vw, 936px\" \/><\/figure>\n\n\n\n<p>The progression from&nbsp;prompting to&nbsp;tool use to code generation&nbsp;represents&nbsp;a spectrum of increasing\u202f<em>agency<\/em>,&nbsp;from passively consuming information to actively deciding what to look at and how to process it. Our core finding is that this agency matters. As models are given more flexibility in\u202f<em>how<\/em>\u202fthey interact with the graph, classification accuracy consistently improves.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading is-style-m\" id=\"letting-llms-write-code-over-graphs\"><strong>Letting LLMs write code over graphs&nbsp;<\/strong><\/h2>\n\n\n\n<p>The standout performer across our evaluation was Graph-as-Code. 
Rather than constraining the model to a fixed set of retrieval actions or requiring all information to be packed into a prompt, this approach lets the LLM compose targeted programs by combining structural queries, feature lookups, and label checks in whatever way it deems most useful for the node at hand. You can see these results in the table below, where performance across long-text homophilic datasets highlights the gap between Prompting and Graph-as-Code, especially on high-degree graphs like wiki-cs.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"936\" height=\"357\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-3.png\" alt=\"Results table summarizing dataset characteristics \" class=\"wp-image-1163233\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-3.png 936w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-3-300x114.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-3-768x293.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-3-240x92.png 240w\" sizes=\"auto, (max-width: 936px) 100vw, 936px\" \/><\/figure>\n\n\n\n<p>This advantage is especially pronounced in settings that mirror real-world complexity. Consider a collaborative platform where content nodes carry lengthy document text and are densely connected through sharing, co-authorship, and organizational links, or, as another example, an e-commerce network where product nodes have detailed descriptions and hundreds of connections. A prompting approach quickly hits the LLM&#8217;s context window limit because there is simply too much text from too many neighbors to fit. 
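To make the contrast concrete, here is a minimal sketch of the kind of program a Graph-as-Code model might emit over a toy graph. The API surface (the Graph class and its neighbors, text, and label accessors) is a hypothetical illustration, not the study's actual interface.

```python
# Hypothetical Graph-as-Code interaction: instead of receiving a serialized
# neighborhood in its prompt, the LLM emits a short program against a
# structured graph API. All names below are illustrative assumptions.
from collections import Counter

class Graph:
    """Toy graph exposing the kinds of queries such an API might offer."""
    def __init__(self, edges, text, labels):
        self.edges = edges      # node id -> list of neighbor ids
        self.text = text        # node id -> textual features
        self.labels = labels    # node id -> known label (unlabeled nodes absent)

    def neighbors(self, node):
        return self.edges.get(node, [])

    def label(self, node):
        return self.labels.get(node)

# A program the model might compose: fetch only the *labels* of the target's
# neighbors (not their full text), vote, and fall back to the node's own
# features when no neighbor labels are available.
def classify(g, node):
    votes = Counter(
        lbl for lbl in (g.label(n) for n in g.neighbors(node)) if lbl is not None
    )
    if votes:
        return votes.most_common(1)[0][0]
    # Label-sparse setting: lean on the node's own text instead.
    return "ml" if "neural" in g.text.get(node, "") else "other"

g = Graph(
    edges={"doc0": ["doc1", "doc2", "doc3"]},
    text={"doc0": "neural network training notes"},
    labels={"doc1": "ml", "doc2": "ml", "doc3": "systems"},
)
print(classify(g, "doc0"))  # prints "ml": majority label among labeled neighbors
```

Because the program pulls in only neighbor labels rather than full neighbor text, the context stays small even for high-degree nodes.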
Graph-as-Code sidesteps this issue entirely: the model selectively retrieves only the information it needs, keeping its context focused and efficient.<\/p>\n\n\n\n<p>In practice, the most valuable real-world graph applications tend to involve exactly this kind of dense, text-rich network. Collaborative content graphs, recommendation systems, fraud-detection networks, and social platforms aren&#8217;t small, sparse toy problems, but rather large-scale networks where nodes carry rich information and have many connections. For practitioners building intelligent features over these graphs, our findings suggest that investing in code-generation interfaces for LLMs may yield substantially better outcomes than refining prompts.<\/p>\n\n\n\n<h2 class=\"wp-block-heading is-style-m\" id=\"challenging-conventional-wisdom-on-graph-structure\"><strong>Challenging conventional wisdom on graph structure<\/strong><\/h2>\n\n\n\n<p>A common claim, cited in many publications in the LLM-for-graphs literature, is that these models struggle on <em>heterophilic<\/em> graphs \u2013 networks where connected nodes tend to have <em>different<\/em> labels rather than similar ones. The intuition is straightforward: if an LLM relies on neighborhood cues to classify a node, and those cues are misleading (because neighbors belong to different classes), performance should suffer.<\/p>\n\n\n\n<p>In collaborative platforms, people frequently work across organizational boundaries \u2013 an engineer collaborates with a designer; a finance team shares documents with marketing. The resulting graphs don&#8217;t have the neat clustering that homophily assumes. 
The same is true of networks of web-page links, interdisciplinary research, and many social networks.<\/p>\n\n\n\n<p>Our results tell a different story. Across four heterophilic datasets, all three LLM interaction strategies performed well, consistently outperforming classical baselines like label propagation. These findings challenge the assumption that LLMs are inherently limited to homophilic settings and suggest they can extract useful signal from node features and non-local patterns, rather than relying solely on neighborhood voting.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"654\" height=\"348\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-2.png\" alt=\"Results table summarizing dataset characteristics \" class=\"wp-image-1163231\" style=\"width:654px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-2.png 654w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-2-300x160.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-2-240x128.png 240w\" sizes=\"auto, (max-width: 654px) 100vw, 654px\" \/><\/figure>\n\n\n\n<p>This broadens the applicability of LLM-based graph reasoning to the messy, cross-cutting networks that real-world systems operate on.<\/p>\n\n\n\n<h2 class=\"wp-block-heading is-style-m\" id=\"understanding-what-llms-rely-on\"><strong>Understanding what LLMs rely on<\/strong><\/h2>\n\n\n\n<p>Beyond overall accuracy, we wanted to understand <em>how<\/em> these models use different types of information. Do they lean on textual features? Graph structure? Known labels? 
And does this change&nbsp;depending&nbsp;on the interaction strategy?&nbsp;<\/p>\n\n\n\n<p>To answer this, we ran a series of controlled ablations&nbsp;\u2013&nbsp;systematically removing edges, truncating text features, and&nbsp;deleting&nbsp;labels&nbsp;\u2013&nbsp;and tracked how accuracy responded. The results, visualized as 2D heatmaps, revealed a striking contrast.&nbsp;<\/p>\n\n\n\n<p>Prompting degrades predictably: remove edges or labels, and accuracy drops along both axes. The model needs both structure and labels to function, and it has no way to compensate when either is degraded.&nbsp;<\/p>\n\n\n\n<p>Graph-as-Code, by contrast, displays&nbsp;a remarkable&nbsp;adaptability. On homophilic datasets where structure is informative, it relies on edges. On heterophilic datasets where features matter more, it shifts to text. When labels are removed but features and structure remain, it&nbsp;is barely&nbsp;impacted.&nbsp;Performance&nbsp;only suffers&nbsp;when\u202f<em>multiple<\/em>\u202fsources of information are simultaneously degraded.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"936\" height=\"357\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-3.png\" alt=\"Results table summarizing dataset characteristics \" class=\"wp-image-1163232\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-3.png 936w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-3-300x114.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-3-768x293.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2026\/03\/image-3-240x92.png 240w\" sizes=\"auto, (max-width: 936px) 100vw, 936px\" \/><\/figure>\n\n\n\n<p>This adaptive behavior is a key property of the code-generation paradigm. 
Because the model can compose arbitrary queries, it naturally gravitates toward whichever signal is most informative for the task at hand&nbsp;\u2013&nbsp;a kind of emergent robustness that&nbsp;doesn&#8217;t&nbsp;need to be explicitly engineered. For systems&nbsp;operating&nbsp;over real-world data, where information is often incomplete or noisy, this resilience is especially valuable.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading is-style-m\" id=\"design-principles-for-llm-graph-systems\"><strong>Design principles for LLM-graph systems&nbsp;<\/strong><\/h2>\n\n\n\n<p>Our study yields several practical guidelines for building systems that combine LLMs with graph-structured data:&nbsp;<\/p>\n\n\n\n<p><strong>Match the interaction mode to the graph&#8217;s characteristics.<\/strong>\u202fFor small, sparse graphs with short text features, prompting may suffice. But as graphs grow denser, features grow longer, or the application demands robustness, code-generation approaches like Graph-as-Code offer clear advantages.&nbsp;<\/p>\n\n\n\n<p><strong>Don&#8217;t&nbsp;rule out LLMs for heterophilic graphs.<\/strong>\u202fPrior assumptions about LLM limitations in low-homophily settings appear to be an artifact of studying only the prompting paradigm. With the right interaction strategy, LLMs are effective across structural regimes,&nbsp;including the cross-cutting, boundary-spanning networks common in collaborative and organizational settings.&nbsp;<\/p>\n\n\n\n<p><strong>Think beyond prompt engineering.<\/strong>\u202fIn graph applications,\u202f<em>how<\/em>\u202fthe model accesses information matters at least as much as\u202f<em>what<\/em>\u202finstructions it receives. 
Investing in richer interaction interfaces \u2013 tool use, code execution, structured APIs \u2013 can unlock performance that no amount of prompt tuning will achieve.<\/p>\n\n\n\n<p>These principles reflect a broader shift in how we think about LLMs: not as static question-answering systems, but as <em>agents<\/em> that can plan, explore, and compose actions to solve complex reasoning tasks. Graphs, with their rich relational structure and diverse information types, are a natural proving ground for this agentic paradigm.<\/p>\n\n\n\n<h2 class=\"wp-block-heading is-style-m\" id=\"looking-ahead\"><strong>Looking ahead<\/strong><\/h2>\n\n\n\n<p>As LLMs continue to grow in capability, the advantages of agentic interaction modes are likely to compound. Our results already show that larger models and reasoning-enabled variants consistently improve performance across all interaction strategies. But critically, the <em>gap<\/em> between prompting and code generation persists at every model scale, suggesting that interaction design is a complementary axis of improvement to model scaling.<\/p>\n\n\n\n<p>For teams building intelligent features on collaborative platforms, knowledge graphs, or any system where entities are connected by rich relationships, this work offers a clear message: the way you let an LLM engage with your data can matter as much as the model itself. 
As the ecosystems of people, content, and activity that power modern productivity tools continue to grow in scale and complexity, principled approaches to LLM-graph interaction will only become more important.&nbsp;<\/p>\n\n\n\n<p>The title of our paper captures the core insight: when it comes to LLMs and graphs, actions truly do speak louder than prompts.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading is-style-m\" id=\"learn-more\"><strong>Learn More<\/strong><\/h2>\n\n\n\n<p><em>Read the full paper:\u202f<\/em><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2509.18487\" target=\"_blank\" rel=\"noopener noreferrer\"><em>Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By\u202fBen Finkelshtein (opens in new tab)\u202f(University of Oxford),\u202fSilviu Cucerzan,\u202fSujay Kumar Jauhar, and\u202fRyen W. White\u202f(Microsoft)&nbsp; Think about the last time you opened a shared document at work. 
Behind that simple action lies a&nbsp;complex network&nbsp;of relationships: the colleagues who edited the file before you, the team site&nbsp;on which&nbsp;it&nbsp;is stored,&nbsp;the&nbsp;related&nbsp;documents those collaborators have touched, and the organizational structure [&hellip;]<\/p>\n","protected":false},"author":43305,"featured_media":1163277,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":1160955,"msr_hide_image_in_river":0,"footnotes":""},"research-area":[13556,13555],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-1163228","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-research-area-search-information-retrieval","msr-locale-en_us"],"msr_assoc_parent":{"id":1160955,"type":"group"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1163228","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/43305"}],"version-history":[{"count":7,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1163228\/revisions"}],"predecessor-version":[{"id":1163290,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1163228\/revisions\/1163290"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1163277"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1163228"}],"wp:term":[{"taxonomy":"msr-research-area","e
mbeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1163228"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1163228"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1163228"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}