{"id":1144723,"date":"2025-07-14T18:51:28","date_gmt":"2025-07-15T01:51:28","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=1144723"},"modified":"2025-07-15T19:22:42","modified_gmt":"2025-07-16T02:22:42","slug":"feature-tree-driven-synthesis-improves-training-data-for-code-llms","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/feature-tree-driven-synthesis-improves-training-data-for-code-llms\/","title":{"rendered":"Feature tree-driven synthesis improves training data for code LLMs"},"content":{"rendered":"\n<p>As large language models (LLMs) continue to improve at writing code, a key challenge has emerged: enabling them to generate complex, high-quality training data that actually reflects real-world programming.<\/p>\n\n\n\n<p>Currently, most data synthesis methods rely on simple code snippets as starting points. While these fragments are useful for illustrating specific functions, they often fail to capture the complexity of actual software development\u2014such as elaborate architectures, module dependencies, and cross-file logic. As a result, training data lacks the structural and semantic depth needed for models to generalize and perform well on advanced tasks.<\/p>\n\n\n\n<p>To address this, Microsoft Research Asia has developed a <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/epicoder-encompassing-diversity-and-complexity-in-code-generation\/\">feature tree\u2013driven data synthesis framework<\/a> that models the hierarchical organization of code based on its semantic features. 
Instead of relying on isolated fragments, the framework extracts key elements like variable types, function structures, and control flows to construct a layered \u201cfeature tree.\u201d This tree guides the systematic generation of diverse and complex training instructions.<\/p>\n\n\n\n<p>Using this approach, the research team developed and trained a new model called <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/epicoder-encompassing-diversity-and-complexity-in-code-generation\/\">EpiCoder<\/a>, which achieves state-of-the-art results on both function- and file-level benchmarks. The <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/huggingface.co\/datasets\/microsoft\/EpiCoder-func-380k\">dataset<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/EpiCoder\">code repository<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> are publicly available.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"a-semantic-leap-in-code-generation\">A semantic leap in code generation<\/h2>\n\n\n\n<p>Unlike traditional approaches that use abstract syntax trees to model code structure, the feature tree framework focuses on the semantic relationships between code elements. This \u201cfeature-to-structure\u201d approach improves the diversity and coverage of generated data, better mirroring the logic of real-world codebases.<\/p>\n\n\n\n<p>As illustrated in Figure 1, the framework operates in three core stages:<\/p>\n\n\n\n<p><strong>1. Feature tree extraction<\/strong><\/p>\n\n\n\n<p>Starting with an initial code dataset, researchers extract semantic features and group them through clustering to build a hierarchical tree. 
A tree demo was used to teach LLMs how to extract and organize these features, helping them establish relationships among code components.<\/p>\n\n\n\n<p><strong>2. Feature tree evolution<\/strong><\/p>\n\n\n\n<p>To expand semantic diversity, the team introduced a tree \u201cevolution\u201d mechanism. By adjusting the tree\u2019s depth and breadth, they generate subtrees that represent increasingly complex code structures. This evolution reflects real development workflows\u2014progressing from functions to modules and from modules to full files.<\/p>\n\n\n\n<p><strong>3. Feature tree\u2013guided code generation<\/strong><\/p>\n\n\n\n<p>Using these subtrees, LLMs can generate targeted code samples with specific semantics, achieving on-demand synthesis. A probabilistic sampling method further improves coverage in weaker model areas, helping the model learn edge cases and rare structures.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"647\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/05\/epicoder-1-1-1024x647.png\" alt=\"diagram, timeline\" class=\"wp-image-1139170\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/05\/epicoder-1-1-1024x647.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/05\/epicoder-1-1-300x189.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/05\/epicoder-1-1-768x485.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/05\/epicoder-1-1-1536x970.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/05\/epicoder-1-1-2048x1293.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/05\/epicoder-1-1-240x152.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 1. 
The feature tree framework<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"evaluating-epicoder-across-benchmarks-and-repository-scale-tasks\">Evaluating EpiCoder across benchmarks and repository-scale tasks<\/h2>\n\n\n\n<p>Using this framework, researchers synthesized over 430,000 instruction samples covering a broad range of tasks, from individual functions to full file-level compositions. These samples were used to fine-tune two base models, Qwen2.5-Coder-7B-Base and DeepSeek-Coder-6.7B-Base, resulting in the EpiCoder models.<\/p>\n\n\n\n<p>EpiCoder achieved state-of-the-art performance across both function- and file-level benchmarks.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"988\" height=\"964\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/epicoder-2-2.png\" alt=\"chart, radar chart\" class=\"wp-image-1144724\" style=\"width:744px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/epicoder-2-2.png 988w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/epicoder-2-2-300x293.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/epicoder-2-2-768x749.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/epicoder-2-2-184x180.png 184w\" sizes=\"auto, (max-width: 988px) 100vw, 988px\" \/><figcaption class=\"wp-element-caption\">Figure 2. Benchmark performance of EpiCoder-Qwen-7B (fine-tuned on Qwen2.5-Coder-7B-Base) and its counterparts. XFileDep is a file-level code generation benchmark; all others are function-level.<\/figcaption><\/figure>\n\n\n\n<p>The team also explored how this approach could scale to repository-level code synthesis. Using the \u201cLLM-as-a-judge\u201d evaluation method, they quantitatively assessed the complexity and diversity of the generated data. 
The approach produced more complex and diverse code than baseline methods, confirming its advantage in generating realistic, high-quality training data.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"618\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/05\/epicoder-3-1024x618.png\" alt=\"Example of repository-level code generation\" class=\"wp-image-1139172\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/05\/epicoder-3-1024x618.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/05\/epicoder-3-300x181.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/05\/epicoder-3-768x463.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/05\/epicoder-3-240x145.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/05\/epicoder-3.png 1309w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 3. Example of repository-level code generation<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"toward-more-controllable-and-interpretable-code-intelligence\">Toward more controllable and interpretable code intelligence<\/h2>\n\n\n\n<p>Traditional data synthesis techniques for code often lack semantic granularity and controllability. The feature tree framework addresses this by improving the structure and interpretability of synthetic data, helping lay the foundation for more transparent and systematic code LLMs.<\/p>\n\n\n\n<p>EpiCoder\u2019s success highlights the value of feature-driven semantic modeling as a foundation for training next-generation code models. 
As the framework continues to evolve, it could support a wide range of applications, including targeted knowledge completion, structured code understanding, and automated repository refactoring.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As large language models (LLMs) continue to improve at writing code, a key challenge has emerged: enabling them to generate complex, high-quality training data that actually reflects real-world programming. Currently, most data synthesis methods rely on simple code snippets as starting points. While these fragments are useful for illustrating specific functions, they often fail to [&hellip;]<\/p>\n","protected":false},"author":34512,"featured_media":1139173,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":199560,"msr_hide_image_in_river":null,"footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-post-option":[269148,269142],"class_list":["post-1144723","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-include-in-river"],"msr_assoc_parent":{"id":199560,"type":"lab"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1144723","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/34512"}],"version-history":[{"count":1,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1144723\/revisions"}],"predecessor-version":[{"id":1144725,"href":"https:\/\/www.
microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1144723\/revisions\/1144725"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1139173"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1144723"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1144723"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1144723"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1144723"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}