{"id":171346,"date":"2014-04-25T23:43:54","date_gmt":"2014-04-26T06:43:54","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/project\/discussion-graph-tool\/"},"modified":"2020-03-13T17:22:16","modified_gmt":"2020-03-14T00:22:16","slug":"discussion-graph-tool","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/discussion-graph-tool\/","title":{"rendered":"Discussion Graph Tool"},"content":{"rendered":"<p class=\"page-content\">Discussion Graph Tool (DGT) is an easy-to-use analysis tool that provides a domain-speci\ufb01c language extracting co-occurrence relationships from social media and automates the tasks of tracking the context of relationships and other best practices. DGT provides a single-machine implementation, and also generates map-reduce-like programs for distributed, scalable analyses.<\/p>\n<p class=\"page-content\">DGT simplifies social media analysis by making it easy to extract high-level features and co-occurrence relationships from raw data.<\/p>\n<p>With just 3-4 simple lines of script, you can load your social media data, extract complex features, and generate a graph among arbitrary features. 
Throughout, DGT automates best-practices, such as tracking the context of relationships.<span id=\"1223794b-b2e2-4d9b-b81e-cfdfaccb343b\" class=\"ImageBlock fn\"><span id=\"ImageCaption1223794b-b2e2-4d9b-b81e-cfdfaccb343b\" class=\"ImageCaptionCoreCss ImageCaption\"><span id=\"21e308a9-a708-48ee-8b73-e03d861303b1\" class=\"ImageBlock fn\"><span id=\"ImageCaption21e308a9-a708-48ee-8b73-e03d861303b1\" class=\"ImageCaptionCoreCss ImageCaption\"><br \/>\n<\/span><\/span><\/span><\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2014\/04\/dgt_figure.png\" alt=\"\" width=\"927\" height=\"253\" \/><\/p>\n<div id=\"en-usprojectsdgtdefault\" class=\"page-content\">\n<h2><a class=\"btn btn-primary\" href=\"https:\/\/www.microsoft.com\/en-us\/download\/details.aspx?id=52606\">Download<\/a><\/h2>\n<p>&nbsp;<\/p>\n\t<div data-wp-context='{\"items\":[]}' data-wp-interactive=\"msr\/accordion\">\n\t\t\t\t\t<div class=\"clearfix\">\n\t\t\t\t<div\n\t\t\t\t\tclass=\"btn-group align-items-center mb-g float-sm-right\"\n\t\t\t\t\tdata-bi-aN=\"accordion-collapse-controls\"\n\t\t\t\t>\n\t\t\t\t\t<button\n\t\t\t\t\t\tclass=\"btn btn-link m-0\"\n\t\t\t\t\t\tdata-bi-cN=\"Expand all\"\n\t\t\t\t\t\tdata-wp-bind--aria-controls=\"state.ariaControls\"\n\t\t\t\t\t\tdata-wp-bind--aria-expanded=\"state.ariaExpanded\"\n\t\t\t\t\t\tdata-wp-bind--disabled=\"state.isAllExpanded\"\n\t\t\t\t\t\tdata-wp-class--inactive=\"state.isAllExpanded\"\n\t\t\t\t\t\tdata-wp-on--click=\"actions.onExpandAll\"\n\t\t\t\t\t\ttype=\"button\"\n\t\t\t\t\t>\n\t\t\t\t\t\tExpand all\t\t\t\t\t<\/button>\n\t\t\t\t\t<span aria-hidden=\"true\"> | <\/span>\n\t\t\t\t\t<button\n\t\t\t\t\t\tclass=\"btn btn-link m-0\"\n\t\t\t\t\t\tdata-bi-cN=\"Collapse 
all\"\n\t\t\t\t\t\tdata-wp-bind--aria-controls=\"state.ariaControls\"\n\t\t\t\t\t\tdata-wp-bind--aria-expanded=\"state.ariaExpanded\"\n\t\t\t\t\t\tdata-wp-bind--disabled=\"state.isAllCollapsed\"\n\t\t\t\t\t\tdata-wp-class--inactive=\"state.isAllCollapsed\"\n\t\t\t\t\t\tdata-wp-on--click=\"actions.onCollapseAll\"\n\t\t\t\t\t\ttype=\"button\"\n\t\t\t\t\t>\n\t\t\t\t\t\tCollapse all\t\t\t\t\t<\/button>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t\t\t<ul class=\"msr-accordion\">\n\t\t\t\t\t\t\t\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-2\"}' data-wp-init=\"callbacks.init\">\n\t\t<div class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-2\"\n\t\t\t\tclass=\"btn btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-1\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\tAvailable Features\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-1\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-2\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<table class=\"gray-stripe tWiz tableBorder borderRows\" cellspacing=\"3\" cellpadding=\"3\">\n<tbody>\n<tr class=\"gray-stripeTableEvenRow\">\n<td class=\"gray-stripeTableEvenCol\">\n<p align=\"left\">Out-of-the-box feature extraction for common scenarios, including mood and geo-location; as well as customizable dictionary and regular expression-based extractions.<\/p>\n<\/td>\n<td class=\"gray-stripeTableOddCol\">\n<p align=\"left\">Analyze text for signs of joviality, fatigue, sadness, guilt, hostility, fear, and serenity.\u00a0Map lat-lon coordinates to FIPS county codes. 
Recognize gender based on name.<\/p>\n<\/td>\n<\/tr>\n<tr class=\"gray-stripeTableOddRow\">\n<td class=\"gray-stripeTableEvenCol\">\n<p align=\"left\">Identify co-occurrence relationships within social media messages, user behaviors, locations or other features.<\/p>\n<\/td>\n<td class=\"gray-stripeTableOddCol\">\n<p align=\"left\">Extract planar graphs and hyper-graphs of co-occurrence relationships, and track contextual statistics for each relationship.<\/p>\n<\/td>\n<\/tr>\n<tr class=\"gray-stripeTableEvenRow\">\n<td class=\"gray-stripeTableEvenCol\">\n<p align=\"left\">Import raw social media data from existing sources.<\/p>\n<\/td>\n<td class=\"gray-stripeTableOddCol\">\n<p align=\"left\">Read delimiter-separated TSV and CSV files, line-based JSON format (including the output of common Twitter downloaders) and multi-line record formats.<\/p>\n<\/td>\n<\/tr>\n<tr class=\"gray-stripeTableOddRow\">\n<td class=\"gray-stripeTableEvenCol\">\n<p align=\"left\">Analyze results in popular tools such as R, Gephi, and Excel.<\/p>\n<\/td>\n<td class=\"gray-stripeTableOddCol\">\n<p align=\"left\">Output JSON, TSV and GEXF.<\/p>\n<\/td>\n<\/tr>\n<tr class=\"gray-stripeTableEvenRow\">\n<td class=\"gray-stripeTableEvenCol\">\n<p align=\"left\">Extend DGT with custom feature extractors.<\/p>\n<\/td>\n<td class=\"gray-stripeTableOddCol\">\n<p align=\"left\">Incorporate your own feature extractors with DGT through a simple API. 
This makes it easy for others to build on your techniques and mix-and-match with others.<\/p>\n<\/td>\n<\/tr>\n<tr class=\"gray-stripeTableOddRow\">\n<td class=\"gray-stripeTableEvenCol\" colspan=\"2\">\n<p align=\"center\"><em>More coming soon&#8230;<\/em><\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-4\"}' data-wp-init=\"callbacks.init\">\n\t\t<div class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-4\"\n\t\t\t\tclass=\"btn btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-3\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\tNews\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-3\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-4\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<p><b>Aug 13:<\/b> Some people were seeing errors trying to run the binaries because of an invalid signature on the binaries. We&#8217;ve fixed that now. Thanks for the bug reports!<\/p>\n<p><b>Aug 8:<\/b> We&#8217;ve updated the DGT release, adding support for weighting data and projection on weighted values.\u00a0 We&#8217;ve also updated and expanded our location mapping capabilities to map lat-lon coordinates and user-specified locations to countries, US states and US counties.<\/p>\n<p><b>June 19: <\/b>Our first release is available!\u00a0 Get in touch with your questions.\u00a0 We&#8217;re looking for feedback. 
Tweet @emrek or email the team at <a href=\"mailto:discussiongraph@microsoft.com\">discussiongraph@microsoft.com<\/a>.\u00a0 Thanks!<\/p>\n<p><b>June 16:<\/b>\u00a0 In preparation for our tool release, we&#8217;ve added 2 new step-by-step walkthroughs on analyzing the moods of product reviews and extracting graphs of hashtag relationships on Twitter.<\/p>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-6\"}' data-wp-init=\"callbacks.init\">\n\t\t<div class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-6\"\n\t\t\t\tclass=\"btn btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-5\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\tRead More\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-5\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-6\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<p align=\"justify\">Our step-by-step walkthroughs, and our reference guide give details about the\u00a0tool and its usage.<\/p>\n<p align=\"justify\">Read more about our tool and using it for deeper contextual analyses in our ICWSM 2014 paper, &#8220;Discussion Graphs: Putting Social Media Analysis in Context&#8221;, by K\u0131c\u0131man, Counts, Gamon, De Choudhury and Thiesson. 
[<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/kiciman-icwsm-discussiongraphs_final.pdf\" target=\"_self\" rel=\"noopener noreferrer\">PDF<\/a>]<\/p>\n<p align=\"justify\">\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-8\"}' data-wp-init=\"callbacks.init\">\n\t\t<div class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-8\"\n\t\t\t\tclass=\"btn btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-7\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\tDiscussion and Feedback\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-7\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-8\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<\/p>\n<p>Have a question about how to use DGT for an analysis? 
Have feedback or a bug report?\u00a0 Want to use your own feature extractor within DGT?<\/p>\n<p>Contact @emrek via Twitter or reach all of us via email at <a href=\"mailto:discussiongraph@microsoft.com\">discussiongraph@microsoft.com<\/a>.<\/p>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-10\"}' data-wp-init=\"callbacks.init\">\n\t\t<div class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-10\"\n\t\t\t\tclass=\"btn btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-9\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\tComing Soon\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-9\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-10\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<div id=\"en-usprojectsdgtwalkthrough2\" class=\"page-content\">\n<p>We are continuing development of the public release of DGT.\u00a0 Here is what is currently under development:<\/p>\n<ul>\n<li>Qualitative sampling of raw data that supports each extracted relationship.<\/li>\n<li>FILTER command for conditioning analyses on demographic or other feature values.<\/li>\n<li><strong><em>(Now available as of version 0.6)<\/em><\/strong> Improved support for extracting relationships among continuous or weighted feature values.<\/li>\n<li>Improved aggregation\/summarization performance.<\/li>\n<\/ul>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t\t\t\t<\/ul>\n\t<\/div>\n\t\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Discussion Graph Tool (DGT) is an easy-to-use analysis tool that provides a domain-specific language for extracting co-occurrence relationships from social media and automates the tasks of 
tracking the context of relationships and other best practices. DGT provides a single-machine implementation, and also generates map-reduce-like programs for distributed, scalable analyses. DGT simplifies social media analysis by making [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13559],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-171346","msr-project","type-msr-project","status-publish","hentry","msr-research-area-social-sciences","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"2014-04-25","related-publications":[166370],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[{"id":0,"name":"Feature Extractors","content":"<h2>Feature Extractors<\/h2>\r\nA feature extractor in DGT is responsible for analyzing the raw data of a social media message and recognizing, extracting, inferring or detecting higher level information.\u00a0 The raw data of a social media message may include the text as well available metadata about the message, the message author, and other geo-temporal or social context.\r\n\r\nDGT includes several out-of-the-box feature extractors\u00a0for common scenarios.\u00a0 These include some complex analysis tasks, such as mood inference and geo-location mapping, as well as\u00a0support for simpler analyses, such as customizable dictionary and regular expression-based feature extractors.\r\n\r\nThe reference guide lists the feature extractors included in DGT, and examples of using the customizable feature extractors.\r\n\r\nThe <a href=\"https:\/\/github.com\/lintool\/twitter-tools\">TREC 2013 Microblog track<\/a> provided a convenient set of tools for retrieving tweets, 
including a tool for sampling from the public twitter stream.\u00a0 To install this tool and begin downloading tweets, follow these instructions:\r\n<ol>\r\n \t<li>Install the prerequisite software\r\n<ol>\r\n \t<li><a href=\"http:\/\/www.oracle.com\/technetwork\/java\/javase\/downloads\/java-se-jdk-7-download-432154.html\" target=\"_new\">Java Development Kit<\/a><\/li>\r\n \t<li><a href=\"http:\/\/maven.apache.org\/\" target=\"_new\">Apache Maven<\/a><\/li>\r\n<\/ol>\r\n<\/li>\r\n \t<li>Download the twitter-tools zip file from <a href=\"https:\/\/github.com\/lintool\/twitter-tools\/\">https:\/\/github.com\/lintool\/twitter-tools\/<\/a>\u00a0and extract it on your computer<\/li>\r\n \t<li>Open a command-line to the directory where you extracted the twitter-tools zip file and run the following two commands to build the twitter-tools program<\/li>\r\n<\/ol>\r\n<div style=\"background-color: black;color: white;font-family: courier;padding: 10px\">&gt; cd twitter-tools-core\r\n&gt; mvn clean package appassembler:assemble<\/div>\r\n<p style=\"padding-left: 30px\">4. 
Follow the <a href=\"https:\/\/github.com\/lintool\/twitter-tools\/wiki\/Sampling-the-public-Twitter-stream\">instructions<\/a>\u00a0on the twitter-tools site for creating your Twitter access tokens, setting up a twitter4j.properties file, and running the GatherStatusStream.bat program to retrieve tweets from the public Twitter stream<\/p>"},{"id":1,"name":"Install","content":"<h2>Installing the Discussion Graph Tool<\/h2>\r\nThis short step-by-step walks you through installing DGT and adding it to your execution path.\r\n\r\n[accordion]\r\n\r\n[panel header=\"Discussion Graph Tool Install\"]\r\n\r\nTo install the Discussion Graph Tool, <a href=\"https:\/\/www.microsoft.com\/en-us\/download\/details.aspx?id=52606\">download<\/a> the latest DGT release as a zip file.\r\n<ol>\r\n \t<li>Install the prerequisite <a href=\"http:\/\/www.microsoft.com\/en-us\/download\/details.aspx?id=30653\" target=\"_self\">.Net Framework 4.5<\/a>.<\/li>\r\n \t<li>\r\n<div align=\"left\">Check that the downloaded DGT zip file is \"unblocked\" on your computer.\u00a0 Right-click on the downloaded zip file, click \"Properties...\", ensure that the \"Unblock\" check box is checked and click Apply, or press the \"Unblock\" button in older versions of Windows.<\/div>\r\n<div align=\"left\"><img class=\"size-full wp-image-213979 alignnone\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2014\/04\/dgt_unblock.png\" alt=\"dgt_unblock\" width=\"363\" height=\"509\" \/><\/div><\/li>\r\n \t<li>Extract the dgt-0.5.zip file to a location, such as your user directory <strong>c:\\users\\myName<\/strong>\u00a0(where myName is your login),\u00a0<strong>c:\\program files<\/strong>, or\u00a0an alternate location, such as <strong>e:\\<\/strong>.\u00a0 Wherever you decide to extract the zip file, you should find a <strong>dgt<\/strong> directory, and within it a <strong>bin<\/strong> directory.<\/li>\r\n \t<li>Edit the system environment variables to add 
the dgt-0.5\\bin directory to the execution path.\r\n<ol>\r\n \t<li>To do so, open the Control Panel, search for Environment Variables and click \"Edit the System Environment Variables\".<\/li>\r\n \t<li>In the Advanced tab, click the Environment Variables button.<\/li>\r\n \t<li>Select the PATH variable from the system variables list and click the Edit button.<\/li>\r\n \t<li>Edit the variable value (the current search paths), and append a ';' (semicolon character without the quotes) and the full path to the DGT binaries.\u00a0 (Don't forget to include the trailing <strong>dgt\\bin<\/strong> directory, e.g., <b>c:\\users\\myName\\dgt\\bin, c:\\program files\\dgt\\bin<\/b> or <b>e:\\dgt\\bin<\/b>.)<\/li>\r\n \t<li>Click OK.<\/li>\r\n<\/ol>\r\n<\/li>\r\n \t<li>Test the installation\r\n<ol>\r\n \t<li>Open a new command-line window (run cmd.exe)<\/li>\r\n \t<li>Type the command \"dgt --help\".\u00a0 You should see the following output.<\/li>\r\n<\/ol>\r\n<\/li>\r\n<\/ol>\r\n<div style=\"background-color: black;color: white;font-family: courier;padding: 10px\">&gt;dgt --help\r\nDiscussion Graph Tool Version 0.5\r\nMore info: [permalink post_id=171346]\r\nContact: discussiongraph@microsoft.com\r\nUsage: dgt.exe filename.dgt [options]\r\nOptions:\r\n--target=local|... Specify target execution environment.\r\n--config=filename.xml Specify non-default configuration file<\/div>\r\nTo learn more about the Discussion Graph Tool, read the getting started guide and the step-by-step walkthroughs.\r\n\r\n[\/panel]\r\n\r\n[\/accordion]
Read the step-by-step.\r\n\r\n<img class=\"alignnone\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2014\/04\/dgt_moods_across_review_scores.png\" alt=\"\" width=\"870\" height=\"289\" \/>\r\n\r\nIn this walkthrough, we will be working with Amazon review data for\u00a0fine food products. First, we are going to ask the question, \u201cwhat are the moods associated with positive and negative reviews?\u201d Then, we will go a little deeper into the data and see how the mood distributions differ based on the gender of the reviewer, and also suggest other explorations.\r\n\r\nThrough this example, we will introduce the basic concepts and commands of a DGT script. We\u2019ll show how to load data, extract fields and derived features from social media; and project and aggregate the results.\r\n<h1>Getting the Discussion Graph Tool<\/h1>\r\n<b>Step 1. Download the Discussion Graph Tool (DGT)<\/b>\r\n<blockquote>If you haven\u2019t already, download and install the discussion graph tool. The rest of this walkthrough will assume that you have installed the tool and added it to your executable\u00a0path.<\/blockquote>\r\n<blockquote>To double-check the installation, open a new command-line window and type the command \u201cdgt --help\u201d. You should see the following output:<\/blockquote>\r\n<div style=\"background-color: black;color: white;font-family: courier;padding: 10px\">&gt;dgt --help\r\nDiscussion Graph Tool Version 0.5\r\nMore info: [permalink post_id=171346]\r\nContact: discussiongraph@microsoft.com\r\nUsage: dgt.exe filename.dgt [options]\r\nOptions:\r\n--target=local|... Specify target execution environment.\r\n--config=filename.xml Specify non-default configuration file<\/div>\r\n<strong>Step 2. 
Create a new directory for this walkthrough.\u00a0 Here, we'll use the directory E:\\dgt-sample<\/strong>\r\n<div style=\"background-color: black;color: white;font-family: courier;padding: 10px\">&gt;mkdir e:\\dgt-sample<\/div>\r\n<h1>Getting the Data<\/h1>\r\n<blockquote>Before we start to write our first script, let\u2019s get some data to analyze. We\u2019ll be using Amazon review data collected by McAuley and Leskovec. This dataset includes over 500K reviews of 74k food-related products. Each review record includes a product id, user id, user name, review score, helpfulness rating, timestamp and both review and summary text.\u00a0 The user names are often real names, and review scores are\u00a0integers on\u00a0a scale from 1 to 5.<\/blockquote>\r\n<strong>Step 3. Download <a href=\"http:\/\/snap.stanford.edu\/data\/finefoods.txt.gz\">finefoods.txt.gz<\/a> from the <a href=\"http:\/\/snap.stanford.edu\/data\/\" target=\"_new\">Stanford Network Analysis Project\u2019s data archive<\/a>. 
Save the file to E:\\dgt-sample<\/strong>\r\n<div style=\"background-color: black;color: white;font-family: courier;padding: 10px\">&gt; e:\r\n&gt; cd e:\\dgt-sample\r\ne:\\dgt-sample&gt; dir\r\nVolume in drive E is DISK\r\nVolume Serial Number is AAAA-AAAA\r\n\r\nDirectory of E:\\dgt-sample\r\n\r\n06\/10\/2014\u00a0\u00a011:17 AM\u00a0\u00a0\u00a0\u00a0&lt;DIR&gt;\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0.\r\n06\/10\/2014\u00a0\u00a011:17 AM\u00a0\u00a0\u00a0\u00a0&lt;DIR&gt;\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0..\r\n06\/10\/2014\u00a0\u00a011:16 AM\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0122,104,202 finefoods.txt.gz\r\n1 File(s)\u00a0\u00a0\u00a0\u00a0122,104,202 bytes\r\n2 Dir(s)\u00a0\u00a045,007,622,144 bytes free<\/div>\r\n<h1>Writing the Script<\/h1>\r\n<blockquote>There are 4 basic commands we will use in our script: LOAD for loading data; EXTRACT for extracting features from the raw data; PROJECT for projecting specific relationships and context from the raw data; and OUTPUT for saving the result to a file. Let\u2019s take things step-by-step.<\/blockquote>\r\n<strong>Step 4. Create a new file <\/strong><em>mood-reviews.dgt<\/em><strong>. Use notepad.exe, emacs, vi\u00a0or your favorite text editor.<\/strong>\r\n<div style=\"background-color: black;color: white;font-family: courier;padding: 10px\">e:\\dgt-sample&gt; notepad mood-reviews.dgt<\/div>\r\n<b>Step 5. LOAD the data.<\/b>\r\n<blockquote>The first command in the script is going to be to load the data file. The reviews we downloaded are in a multi-line record format, where each line in the file represents a key-value field of a record, and records are separated by blank lines. 
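To make the format concrete, here is a minimal Python sketch of such a parser (purely illustrative; the ": " separator and field naming are assumptions about the data layout, and DGT's own loader handles all of this for you):

```python
# Purely illustrative sketch (not DGT's actual parser) of the multi-line
# record format: "key: value" lines, with blank lines separating records.
def parse_multiline(lines, ignore_errors=True):
    record = {}
    for line in lines:
        line = line.strip()
        if not line:                      # a blank line ends the current record
            if record:
                yield record
                record = {}
            continue
        key, sep, value = line.partition(": ")
        if not sep:                       # misformatted line
            if ignore_errors:             # mirrors ignoreErrors:"true"
                continue
            raise ValueError("bad line: " + line)
        record[key.replace("/", "_")] = value
    if record:                            # flush the final record
        yield record

sample = [
    "review/score: 5.0",
    "review/text: Great taffy at a great price.",
    "",
    "review/score: 1.0",
    "review/text: Not as advertised.",
]
records = list(parse_multiline(sample))   # two records
```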
The LOAD\u00a0MultiLine()\u00a0command\u00a0will parse this data file.\u00a0 Add the following line as the first command in the script file:\r\n<pre>LOAD Multiline(path:\"finefoods.txt.gz\",ignoreErrors:\"true\");<\/pre>\r\nSince the multi-line format naturally embeds the schema within the data file, we don\u2019t have to specify it in the LOAD command.\u00a0 There are some spurious newlines in the finefoods.txt.gz data, so we need to set the ignoreErrors flag to true.\u00a0 This tells DGT to ignore data that is malformed.<\/blockquote>\r\n<p dir=\"ltr\"><b>Step 6. EXTRACT higher-level features from the raw data<\/b><\/p>\r\n\r\n<blockquote>Add the following line as the second command in the script file:\r\n<pre>EXTRACT AffectDetector(field:\"review_text\"),<\/pre>\r\n<pre>        Gender(field:\"review_profileName\"),<\/pre>\r\n<pre>        review_score;<\/pre>\r\nThis EXTRACT statement generates 3 higher-level features:\r\n<ul>\r\n \t<li style=\"list-style-type: none\">\r\n<ul>\r\n \t<li>The AffectDetector() call infers the affect, or mood, of a text. The field argument tells it which of the raw fields to analyze. We\u2019ll choose the long review field but could just as easily have selected the summary field. If you don\u2019t pass a field argument, then the AffectDetector() extractor will by default look for a field named \u201ctext\u201d in the raw data.<\/li>\r\n \t<li>The Gender() call infers the gender of the author, based on the author\u2019s first name. The field argument tells it which field includes the author\u2019s name. 
If you don\u2019t pass a field argument, then the Gender() extractor will by default look for a field named \u201cusername\u201d in the raw data.<\/li>\r\n \t<li>By naming the review_score field (without parentheses), we tell the script to pass the review_score field through without modification.<\/li>\r\n<\/ul>\r\n<\/li>\r\n<\/ul>\r\n<\/blockquote>\r\n<div style=\"background-color: cornsilk;color: black;padding: 10px;border-color: black\">\r\n\r\n<b>A note on naming outputs and inputs:<\/b> By default, EXTRACT, PROJECT and OUTPUT commands operate on the results of the previous statement. You can also explicitly name the results of commands. To do so, use the \u201cvar x =\u201d notation to assign results to a variable, then add \u201cFROM x\u201d to later commands. For example:\r\n\r\n<code>\r\n<b>var finefoodsdata = <\/b>LOAD MultiLine(path:\"finefoods.txt.gz\",ignoreErrors:\"true\");\r\nEXTRACT AffectDetector(field:\"review_text\"), Gender(field:\"review_profileName\"), review_score <b>FROM finefoodsdata<\/b>;\r\n<\/code>\r\n\r\n<\/div>\r\n<b>Step 7. PROJECT the data to focus on the relationships of importance<\/b>\r\n<blockquote>Now, we tell the script which relationships we care about. Often, we\u2019ll be using DGT to extract a graph of co-occurrence relations from a set of data. In this first example, we\u2019re going to ask for a simpler result set, essentially using DGT as a simple aggregator or \u201cgroup by\u201d style function.\u00a0 Add the following line to the script:\r\n<pre>PROJECT TO review_score;<\/pre>\r\nBy projecting to \u201creview_score\u201d, we are telling DGT to build a co-occurrence graph among review scores. By default, DGT assumes the co-occurrence relationships are defined by the co-occurrence of values within the same record. Since in this dataset every record has at most one review score, that means that there are no co-occurrence relationships. 
The resulting graph is then simply the degenerate graph of 5 nodes (1 for each score from 1 to 5).\u00a0 For each of these nodes, DGT aggregates the affect and gender information that we extracted.<\/blockquote>\r\n<p dir=\"ltr\"><strong>Step 8. OUTPUT the results to disk<\/strong><\/p>\r\n\r\n<blockquote>Finally, we add the following command to the script to save the results:\r\n<pre>OUTPUT TO \"finefoods_reviewscore_context.graph\";<\/pre>\r\nIf you haven't already, now would be a good time to save your script file.\u00a0 The whole script should look like this:\r\n<pre>LOAD Multiline(path:\"finefoods.txt.gz\",ignoreErrors:\"true\");<\/pre>\r\n<pre>EXTRACT AffectDetector(field:\"review_text\"),<\/pre>\r\n<pre>        Gender(field:\"review_profileName\"),<\/pre>\r\n<pre>        review_score;<\/pre>\r\n<pre>PROJECT TO review_score;<\/pre>\r\n<pre>OUTPUT TO \"finefoods_reviewscore_context.graph\";<\/pre>\r\n<\/blockquote>\r\n<h1>Run the Script<\/h1>\r\n<b>Step 9. <\/b>From the command line, run DGT against the script mood-reviews.dgt:\r\n<div style=\"background-color: black;color: white;font-family: courier;padding: 10px\">e:\\dgt-sample&gt; dgt.exe mood-reviews.dgt<\/div>\r\nThe output file \"finefoods_reviewscore_context.graph\" should now be in the e:\\dgt-sample directory.\u00a0 Each row of the output file represents a review_score, since that is what we projected to in our script. 
Columns are tab-separated. The first column of each row is the name of the edge (or nodes) in the graph; the second column is the count of records seen with the given review score; and the third column is a JSON-formatted bag of data distributions for gender and affect observations.\r\n\r\nTo import this data into R, Excel or other tools, we\u00a0have included a\u00a0command-line utility dgt2tsv.exe that can pull out specific values.\u00a0 Use the following command to build a TSV file that summarizes the gender and mood for each review score:\r\n<div style=\"background-color: black;color: white;font-family: courier;padding: 10px\">e:\\dgt-sample&gt; dgt2tsv.exe finefoods_reviewscore_context.graph count,gender.m,gender.f,gender.u,mood.joviality,mood.fatigue,mood.hostility,mood.sadness,mood.serenity,mood.fear,mood.guilt finefoods_reviewscore_gendermood.tsv<\/div>\r\nHere\u2019s a quick graph showing how mood varies across review scores.\r\n\r\n<img class=\"alignnone\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/dgt-moods_across_review_scores.png\" alt=\"\" width=\"870\" height=\"289\" \/>\r\n\r\nWe see that joviality increases and sadness decreases with higher review scores.\u00a0 There is more hostility in lower review scores and more serenity in higher review scores.\u00a0 While most moods are monotonically increasing or decreasing with review score, guilt peaks in 2- and 3-star reviews.\r\n<h1>Further Explorations<\/h1>\r\nThe design goal of DGT is to make it easy to explore the relationships embedded in social media data and capture the context of the discussions from which the relationships were inferred.\r\n\r\n<b>Are the distributions of mood across review scores different for men and women?<\/b> Conditioning the mood distributions on gender as well as review score gives us this information.\u00a0 We can do this simply by adding the gender field to our PROJECT command, as follows 
(changes from the original script are <b>bolded<\/b>):\r\n<pre>LOAD Multiline(path:\"finefoods.txt.gz\",ignoreErrors:\"true\");<\/pre>\r\n<pre>EXTRACT AffectDetector(field:\"review_text\"),<\/pre>\r\n<pre>        Gender(field:\"review_profileName\"),<\/pre>\r\n<pre>        review_score;<\/pre>\r\n<pre>PROJECT TO review_score<strong>, gender<\/strong>;<\/pre>\r\n<pre>OUTPUT TO \"finefoods_reviewscore_gender_context.graph\";<\/pre>\r\nHere's a quick look at the results.\u00a0 Here, I've graphed the joviality (solid line) and sadness (dashed line) for men (orange) and women (green).\u00a0 We see that the general trends hold, though there are some differences that one might continue digging deeper\u00a0into...\r\n\r\n<img class=\"alignnone\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/dgt-mood_by_gender.png\" alt=\"\" width=\"480\" height=\"321\" \/>\r\n\r\n<b>How are products related to each other by reviewer?\u00a0 <\/b>For example, how many people that wrote a review of \"Brand A Popcorn\" also wrote about \"Brand X chocolate candies\"?\u00a0 We can answer this question by defining a co-occurrence relationship based on user id.\u00a0 That is, we'll say that two product ids are related if the same user reviewed both products.\u00a0 Here's how we do that in the script:\r\n<pre>LOAD Multiline(path:\"finefoods.txt.gz\",ignoreErrors:\"true\");\r\nEXTRACT product_productId, review_userId;\r\nRELATE BY review_userId;\r\nPLANAR PROJECT TO product_productId AGGREGATE();\r\nOUTPUT TO \"finefoods_products_relateby_user.graph\";<\/pre>\r\n<pre><\/pre>\r\n(We'll learn more about the RELATE BY and PLANAR PROJECT commands in the next walkthroughs.)\u00a0 This will generate a discussion graph that connects pairs of products that were reviewed by the same person.\u00a0 We can convert this into a file readable by the Gephi graph visualization tool using the dgt2gexf command:\r\n<div style=\"background-color: black;color: white;font-family: 
courier;padding: 10px\">e:\\dgt-sample&gt; dgt2gexf.exe finefoods_products_relateby_user.graph count finefoods_products_relateby_user.gexf filterbycount=1000<\/div>\r\nThe dgt2gexf command mirrors the dgt2tsv command.\u00a0 In this case, we use the filterbycount option to output only edges with at least 1000 users who co-reviewed the pair of products.\u00a0 This filter helps keep the visualization relatively manageable.\r\n\r\nHere's the resulting product graph, laid out using Gephi's Fruchterman-Reingold algorithm.\u00a0 Each cluster represents a group of Amazon food products that are frequently co-reviewed...\r\n\r\n<img class=\"alignnone\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/dgt-products_relateby_userid.png\" alt=\"\" width=\"1024\" height=\"1024\" \/>\r\n\r\n[\/panel]\r\n[panel header=\"Walkthrough #2: Analyzing Twitter Hashtags\"]\r\n<h2>Analyzing Twitter Hashtags<\/h2>\r\nThis walkthrough focuses on Twitter data and extracting a graph of related hashtags based on co-occurrences. Read the step-by-step.\r\n\r\n<img class=\"alignnone\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2014\/04\/dgt_twitter_hashtags_worldcup.png\" alt=\"\" width=\"1024\" height=\"1024\" \/>\r\n\r\n<span id=\"999920ae-79a1-4a0e-8e0f-d21b6f5de66d\" class=\"ImageBlock fn\"><span id=\"ImageCaption999920ae-79a1-4a0e-8e0f-d21b6f5de66d\" class=\"ImageCaptionCoreCss ImageCaption\">[accordion]\r\n<\/span><\/span>\r\n\r\n[panel header=\"DGT Walkthrough: Hashtags in Twitter\"]\r\n\r\nIn this walkthrough, we will be working with public stream data from Twitter. 
First, we are going to ask the question, \u201cwhich hashtags are used together?\u201d Then, we will go a little deeper into the data, visualize the resulting hashtag graph, and suggest other explorations.\r\n\r\nThrough this example, we will introduce the basic concepts and commands of a DGT script. We\u2019ll show how to load data, extract fields and derived features from social media, and project and aggregate the results.\r\n<h1>Getting the Discussion Graph Tool<\/h1>\r\n<b>Step 1. Download the Discussion Graph Tool (DGT) <\/b>\r\n\r\nIf you haven\u2019t already, download and install the discussion graph tool (see the detailed installation instructions). The rest of this walkthrough will assume that you have installed the tool and added it to your executable path.\r\n\r\nTo double-check the installation, open a new command-line window and type the command \u201cdgt --help\u201d. You should see the following output:\r\n<div style=\"background-color: black;color: white;font-family: courier;padding: 10px\">&gt;dgt --help\r\nDiscussion Graph Tool Version 1.0\r\nMore info: [permalink post_id=171346]\r\nContact: discussiongraph@microsoft.com\r\nUsage: dgt.exe filename.dgt [options]\r\nOptions:\r\n--target=local|... Specify target execution environment.\r\n--config=filename.xml Specify non-default configuration file<\/div>\r\n<b>Step 2. Create a new directory for this walkthrough. Here, we'll use the directory E:\\dgt-sample.<\/b>\r\n<div style=\"background-color: black;color: white;font-family: courier;padding: 10px\">&gt;mkdir e:\\dgt-sample<\/div>\r\n<h1>Getting Twitter Data<\/h1>\r\nFirst, let\u2019s get some data to analyze. 
We\u2019ll be using Twitter data for this walkthrough.\u00a0 Twitter doesn't allow redistribution of its data, but does have an API for retrieving a sample stream of tweets.\u00a0 There are a number of steps you'll have to complete, including registering for API keys and access tokens from Twitter.\u00a0 We've put up full instructions.\r\n\r\n<b>Step 3. Install the twitter-tools package.\u00a0 <\/b>See our instructions.\r\n\r\n<b>Step 4. Download a sample of tweets.\u00a0 <\/b>Run GatherStatusStream.bat for \"a while\" and press Ctrl-C to stop the download.\u00a0 This will generate a file (or files) called statuses.log.YYYY-MM-DD-HH, where YYYY-MM-DD-HH represents the current date and hour.\u00a0 The files may be compressed (indicated with a .gz file suffix).\r\n\r\nEach line in this file represents a tweet (*), in JSON format, that includes all available metadata about the tweet, tweet author, etc.\u00a0 (* the file also includes some other information, such as tweet deletions.\u00a0 There's no need to worry about those for this walkthrough.)\r\n<div style=\"background-color: black;color: white;font-family: courier;padding: 10px\">&gt; twitter-tools-master\\twitter-tools-core\\target\\appassembler\\bin\\GatherStatusStream.bat\r\n1000 messages received.\r\n2000 messages received.\r\n3000 messages received.\r\n4000 messages received.\r\n5000 messages received.\r\n6000 messages received.\r\n7000 messages received.\r\n8000 messages received.\r\n9000 messages received.\r\n10000 messages received.\r\nTerminate batch job (Y\/N)? 
Y\r\n&gt; dir statuses*\r\nVolume in drive C is DISK\r\nVolume Serial Number is AAAA-AAAA\r\n\r\nDirectory of E:\\dgt-sample\\twitter-tools-core\r\n\r\n06\/13\/2014\u00a0\u00a012:53 PM\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a049,665,736 statuses.log.2014-06-13-12\r\n1 File(s)\u00a0\u00a0\u00a0\u00a0\u00a049,665,736 bytes\r\n0 Dir(s)\u00a0\u00a043,039,879,168 bytes free<\/div>\r\n<div id=\"en-usprojectsdgtwalkthrough2\" class=\"page-content\">\r\n<h1>Writing the Script<\/h1>\r\nAs we saw in walkthrough #1, there are 4 basic commands we will use in our script: LOAD for loading data; EXTRACT for extracting features from the raw data; PROJECT for projecting specific relationships and context from the raw data; and OUTPUT for saving the result to a file. Let\u2019s take things step-by-step.\r\n\r\n<b>Step 5. Create a new file <\/b><em>twitter-hashtags.dgt<\/em><b>. Use notepad.exe, emacs, vi or your favorite text editor.<\/b>\r\n<div style=\"background-color: black;color: white;font-family: courier;padding: 10px\">e:\\dgt-sample&gt; notepad twitter-hashtags.dgt<\/div>\r\n<b>Step 6. LOAD the data.<\/b>\r\n\r\nThe first command in the script loads the data file. The tweets we downloaded are in a JSON-based record format, where each line in the file is a complete JSON-formatted tweet record. The LOAD Twitter() command can parse this file. Add the following line as the first command in the script file:\r\n<pre>LOAD Twitter(path:\"statuses.log.2014-06-13-12\",ignoreErrors:\"true\");<\/pre>\r\nThe Twitter data source already knows about the key fields in the Twitter JSON data file, so we don\u2019t have to specify any more information. twitter-tools adds some non-JSON lines into its output, so we'll also set the ignoreErrors flag to true. This tells DGT to ignore malformed lines in the input.\r\n\r\n<b>Step\u00a07. 
EXTRACT higher-level features from the raw data<\/b>\r\n\r\nAdd the following line as the second command in the script file:\r\n<pre>EXTRACT AffectDetector(), Gender(), hashtag;<\/pre>\r\nThis EXTRACT statement generates 3 higher-level features:\r\n<ul>\r\n \t<li style=\"list-style-type: none\">\r\n<ul>\r\n \t<li>The AffectDetector() call infers the affect, or mood, of a text.\u00a0 By default, the AffectDetector() looks for a field named \"text\" in the raw data, though we could set the \"field\" argument to make it look at other fields instead.<\/li>\r\n \t<li>The Gender() call infers the gender of the author, based on the author\u2019s first name. By default, the Gender() extractor looks for a field named \"username\" in the raw data.\u00a0 Again, we could override this using the \"field\" argument.<\/li>\r\n \t<li>By naming the hashtag field---without parentheses---we tell the script to pass the hashtag field through without modification.<\/li>\r\n<\/ul>\r\n<\/li>\r\n<\/ul>\r\n<div style=\"background-color: cornsilk;color: black;padding: 10px;border-color: black\"><b>Note:<\/b> The output of twitter-tools already includes hashtags, user mentions, urls and stock symbols as explicit fields parsed out of the raw text. We'll see in the further explorations how we can use exact phrase matching and regular expression matching to pull values out of the text ourselves.<\/div>\r\n<b>Step 8. PROJECT the data to focus on the relationships of importance<\/b>\r\n\r\nNow, we tell the script what relationships we care about. Here, we want to extract the pair-wise co-occurrence relationships among hashtags.\u00a0 That is, which hashtags are used together?\r\n<pre>PLANAR PROJECT TO hashtag;<\/pre>\r\nBy projecting to \u201chashtag\u201d, we are telling DGT to build a co-occurrence graph among hashtags. 
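To make the projection concrete, here is a small Python sketch (illustrative only, not DGT's implementation) of what a planar projection to hashtags computes: every unordered pair of hashtags appearing in the same record becomes an edge, weighted by how many records contain that pair.

```python
from collections import Counter
from itertools import combinations

# Toy stand-in for extracted records; each record lists the hashtags
# found in one tweet. (Illustrative data only.)
records = [
    ["worldcup", "ned", "esp"],
    ["worldcup", "ned"],
    ["followback"],
]

# Planar projection to hashtags: every unordered pair of hashtags
# co-occurring in a record becomes an edge; counts accumulate.
edges = Counter()
for tags in records:
    for pair in combinations(sorted(set(tags)), 2):
        edges[pair] += 1

print(edges[("ned", "worldcup")])  # -> 2
```

Note that the record with a single hashtag contributes no planar edge; a hyper-graph projection would instead retain it as a degenerate 1-edge.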
By default, DGT assumes that co-occurrence relationships are defined by the co-occurrence of values within the same record.\r\n\r\nIn this exercise, we're choosing to use a PLANAR PROJECT command because we're going to visually display the resulting hashtag graph at the end of this walkthrough, and planar graphs are simply easier to render.\u00a0 However, it's worth noting that the planar representation is incomplete.\u00a0 For example, if\u00a03 hashtags always co-occur together, that information will be lost, because\u00a0a planar graph cannot represent a three-way relationship.\u00a0 A hyper-graph can represent such complex co-occurrences, however.\u00a0 For this reason, the PROJECT command defaults to a hyper-graph, and we recommend using this representation if you are going to be computing on the result.\r\n\r\n<b>Step 9. OUTPUT the results to disk<\/b>\r\n\r\nFinally, we add the following command to the script to save the results:\r\n<pre>OUTPUT TO \"twitter_hashtags.graph\";<\/pre>\r\nIf you haven't already, now would be a good time to save your script file... The whole script should look like this:\r\n<pre>LOAD Twitter(path:\"statuses.log.2014-06-13-12\",ignoreErrors:\"true\");<\/pre>\r\n<pre>EXTRACT AffectDetector(), Gender(), hashtag;<\/pre>\r\n<pre>PLANAR PROJECT TO hashtag;<\/pre>\r\n<pre>OUTPUT TO \"twitter_hashtags.graph\";<\/pre>\r\n<h1>Run the Script<\/h1>\r\n<b>Step 10. <\/b>From the command line, run DGT against the script twitter-hashtags.dgt:\r\n<div style=\"background-color: black;color: white;font-family: courier;padding: 10px\">e:\\dgt-sample&gt; dgt.exe twitter-hashtags.dgt<\/div>\r\nThe output file \"twitter_hashtags.graph\" should now be in the e:\\dgt-sample directory. Each row of the output file represents a relationship between a pair of hashtags, since we projected to the planar relationship between co-occurring hashtags in our script. 
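Since the .graph file is plain text, you can also inspect it directly. Here is a minimal Python sketch for parsing it, assuming the three tab-separated columns (edge id, observation count, JSON context bag) described in this walkthrough:

```python
import json

# Parse one line of DGT's native .graph output: three tab-separated
# columns -- edge id, observation count, JSON context bag.
# (A sketch based on the format described in this walkthrough.)
def parse_graph_line(line):
    edge, count, context = line.rstrip("\r\n").split("\t", 2)
    return edge, int(count), json.loads(context)

# Example: list edges by descending count.
# with open("twitter_hashtags.graph", encoding="utf-8") as f:
#     rows = sorted((parse_graph_line(l) for l in f),
#                   key=lambda r: r[1], reverse=True)
```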
Columns are tab-separated and the first column of each row is the name of the edge\u00a0in the graph\u00a0(the edge name is simply the concatenation of the two node names, in this case the two hashtags); the second column is the count of tweets seen with the pair of hashtags; and the third column is a JSON-formatted bag of data distributions for gender and affect observations.\r\n\r\nTo import this data into visualization and analysis tools, we have included\u00a0two command-line utilities dgt2tsv.exe and dgt2gexf.exe that can extract specific values into a tab-separated values (TSV) file or a Graph Exchange XML Format (GEXF) file.\r\n\r\nWe'll use the dgt2gexf command and visualize the result with the Gephi graph visualization tool:\r\n<div style=\"background-color: black;color: white;font-family: courier;padding: 10px\">e:\\dgt-sample&gt; dgt2gexf.exe twitter_hashtags.graph count twitter_hashtags.gexf<\/div>\r\nIf your Twitter sample is large, you might consider adding the option \"filterbycount=N\" (without the quotes) to the command-line.\u00a0 This will include only edges that were seen at least N times in your sample.\u00a0 Use an appropriate number, from 10 to 1000 or higher, depending on the size of your sample.\r\n\r\nHere's the resulting hashtag graph.\u00a0 Each cluster represents a group of hashtags that are frequently co-mentioned in our tiny sample of Twitter data...\r\n\r\n<img class=\"alignnone\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/dgt-twitter_hashtags_all.png\" alt=\"\" width=\"1024\" height=\"1024\" \/>\r\n\r\nFor clarity and fun, we'll filter out low-frequency edges and zoom into one of the clusters of hashtags about world-cup related topics.\u00a0 We see from the thickness of the edges that #NED and #ESP are the most frequently co-occurring hashtags, and each also co-occurs relatively frequently with #WorldCup.\u00a0 We also see a number of people piggy-backing on the popular #worldcup 
hashtag\u00a0with topically unrelated hashtags (#followers, #followback, #retweet, #followme)\u00a0to solicit followers and retweets.\r\n\r\n<span id=\"887635cf-6625-43da-97f2-03b68b963f40\" class=\"ImageBlock fn\"><img class=\"alignnone\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/dgt-twitter_hashtags_worldcup.png\" alt=\"\" width=\"1024\" height=\"1024\" \/><\/span>\r\n<h1>Further Explorations<\/h1>\r\nThere are many interesting things to explore in hashtag relationships: the evolution of hashtag relationships over time (for example, use PROJECT TO hashtag,absoluteday;), hashtag relationships conditioned on gender (PROJECT TO hashtag,Gender();), and inspections of the token distributions, moods and other features associated with hashtags and their relationships.\r\n\r\n<b>What are you going to explore next? <\/b>Let us know what you do! My Twitter handle is @emrek, or you can reach the whole team by emailing us at <a href=\"mailto:discussiongraph@microsoft.com\">discussiongraph@microsoft.com<\/a>. 
Thanks!\r\n\r\n[\/panel]\r\n[\/accordion]\r\n\r\n<\/div>"},{"id":3,"name":"Reference Guide","content":"<h2>Discussion Graph Tool Reference Guides<\/h2>\r\n[accordion][panel header=\"Basic Concepts\"]\r\n\r\nIn the discussion graph tool framework, a co-occurrence analysis consists of the following key steps:\r\n<table class=\"gray-stripe tWiz tableBorder\" cellspacing=\"3\" cellpadding=\"3\">\r\n<tbody>\r\n<tr class=\"gray-stripeTableHeaderRow\">\r\n<td class=\"gray-stripeTableHeaderFirstCol\">\r\n<p align=\"right\">Step<\/p>\r\n<\/td>\r\n<td class=\"gray-stripeTableHeaderOddCol\">Task<\/td>\r\n<td class=\"gray-stripeTableHeaderEvenCol\">DGT\u00a0command<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableFirstCol\">\r\n<p align=\"right\">1<\/p>\r\n<\/td>\r\n<td class=\"gray-stripeTableOddCol\">Reading from a social media data source.<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">LOAD<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableFirstCol\">\r\n<p align=\"right\">2<\/p>\r\n<\/td>\r\n<td class=\"gray-stripeTableOddCol\">Extracting low-level features from individual messages.<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">EXTRACT<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableFirstCol\">\r\n<p align=\"right\">3<\/p>\r\n<p align=\"right\">(optional)<\/p>\r\n<\/td>\r\n<td class=\"gray-stripeTableOddCol\">Declaring the feature that defines a co-occurrence. 
What defines the fact that two or more features have co-occurred?\r\n\r\nBy default, two features are considered to co-occur if they both occur in the same social media message.<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">RELATE BY<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableOddCol\"><i>Steps 2 and 3 implicitly define an initial discussion graph. All feature values that were seen to co-occur in the raw social media data will be connected by hyper-edges to form a large, multi-dimensional hyper-graph.<\/i><\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableFirstCol\">\r\n<p align=\"right\">4<\/p>\r\n<p align=\"right\">(optional)<\/p>\r\n<\/td>\r\n<td class=\"gray-stripeTableOddCol\">By default, each social media message is weighted equally.\u00a0 We can change this so that the data is weighted by user, location, or other feature.\u00a0 For example,\u00a0we might want data from every user to count equally, regardless of how many social media messages each user sent.\u00a0 This would prevent our analyses from being dominated by users who post very frequently.<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">WEIGHT BY<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableFirstCol\">\r\n<p align=\"right\">5<\/p>\r\n<\/td>\r\n<td class=\"gray-stripeTableOddCol\">We project the initial discussion graph to focus on those relationships we care about for our analysis. 
For this step, we must specify the domains we care about.<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">PROJECT<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableFirstCol\">\r\n<p align=\"right\">6<\/p>\r\n<\/td>\r\n<td class=\"gray-stripeTableOddCol\">Output results<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">OUTPUT<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableFirstCol\">\r\n<p align=\"right\">7<\/p>\r\n<p align=\"right\">(optional)<\/p>\r\n<\/td>\r\n<td class=\"gray-stripeTableOddCol\">Often, we\u2019ll want to further analyze our results with higher-level machine learning, network analyses, and visualization techniques. This is outside the scope of DGT.<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\nFor more details on the core concepts behind discussion graphs, we recommend reading our ICWSM 2014 paper.\r\n<h2>A note on projecting weighted data<\/h2>\r\nOften, feature values are weighted. For example, the affect classifier produces a weighted feature value indicating how likely a message is to be expressing joviality, sadness, etc. (In other cases, the use of the WEIGHT BY command implicitly creates a weighted value.)\r\n\r\nWhen it encounters a weighted feature value in its target domains, the PROJECT TO command treats the weights as probabilities of a feature value having occurred. 
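Concretely, each weight is treated as an independent probability that the feature occurred, so a message expands into every present/absent combination of its weighted features. A small Python sketch of that expansion (hypothetical weights, not DGT's API):

```python
from itertools import product

# Weighted mood features, treated as independent probabilities.
# (Hypothetical example weights.)
moods = {"joviality": 0.8, "serenity": 0.4}

# Expand into every present/absent combination, each with its
# joint probability: included features contribute w, excluded 1 - w.
expanded = {}
for included in product([True, False], repeat=len(moods)):
    names = tuple(m for m, inc in zip(moods, included) if inc)
    weight = 1.0
    for (m, w), inc in zip(moods.items(), included):
        weight *= w if inc else 1.0 - w
    expanded[names] = weight

# both: 0.8*0.4 = 0.32; joviality only: 0.8*0.6 = 0.48;
# serenity only: 0.2*0.4 = 0.08; neither: 0.2*0.6 = 0.12
```

The four combinations sum to 1, and with these weights the joint probabilities match the four edge weights in the worked example that follows.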
For example, let's continue our analysis of activity and location mentions such as in the following message:\r\n<pre>\"I'm having fun hiking tiger mountain\" tweeted by Alice on a Saturday at 10am<\/pre>\r\nLet's say our mood analysis indicates that the message expresses joviality with a weight of 0.8 and serenity with a weight of 0.4, in addition to the other discrete features:\r\n<table class=\"gray-stripe tWiz\" cellspacing=\"3\" cellpadding=\"3\">\r\n<tbody>\r\n<tr class=\"gray-stripeTableHeaderRow\">\r\n<td class=\"gray-stripeTableHeaderEvenCol\"><b>Domain <\/b><\/td>\r\n<td class=\"gray-stripeTableHeaderOddCol\"><b>Feature <\/b><\/td>\r\n<td class=\"gray-stripeTableHeaderEvenCol\"><b>Weighted value<\/b><\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">Mood<\/td>\r\n<td class=\"gray-stripeTableOddCol\">Joviality<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">0.8<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">Mood<\/td>\r\n<td class=\"gray-stripeTableOddCol\">Serenity<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">0.4<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">Activity<\/td>\r\n<td class=\"gray-stripeTableOddCol\">hiking<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">1.0<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">Location<\/td>\r\n<td class=\"gray-stripeTableOddCol\">tiger mountain<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">1.0<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">Author<\/td>\r\n<td class=\"gray-stripeTableOddCol\">Alice<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">1.0<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\nThe two weighted features are interpreted as independent probabilities. 
That is, there is an 80% likelihood of this message being jovial and a 20% likelihood of not being jovial. Independently, there is a 40% likelihood of the message being serene, and a 60% chance of not being serene.\r\n\r\nIf we project this single message to the relationship between location and mood (PROJECT TO Mood, Location;), this message will expand to the following 4 projected edges:\r\n<table class=\"gray-stripe tWiz\">\r\n<tbody>\r\n<tr class=\"gray-stripeTableHeaderRow\">\r\n<td class=\"gray-stripeTableHeaderEvenCol\">Edge<\/td>\r\n<td class=\"gray-stripeTableHeaderOddCol\">Weight<\/td>\r\n<td class=\"gray-stripeTableHeaderEvenCol\">Metadata<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">Joviality and Tiger Mountain<\/td>\r\n<td class=\"gray-stripeTableOddCol\">0.48<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">hiking, Alice<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">Serenity and Tiger Mountain<\/td>\r\n<td class=\"gray-stripeTableOddCol\">0.08<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">hiking, Alice<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">Joviality and Serenity and Tiger Mountain<\/td>\r\n<td class=\"gray-stripeTableOddCol\">0.32<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">hiking, Alice<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">(No mood) and Tiger Mountain<\/td>\r\n<td class=\"gray-stripeTableOddCol\">0.12<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">hiking, Alice<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\nOf course, when analyzing a larger corpus of social messages, each message will be expanded individually and the results aggregated.\r\n\r\n[\/panel][panel header=\"Script Command Reference\"]\r\n\r\nThe discussion graph tool\u2019s scripting language currently supports the following commands.\r\n\r\nNote that square brackets 
[ ] indicate optional elements of the command. <i>Italicized terms<\/i> indicate user-specified arguments, variable names, etc. of the command.\r\n<h3>LOAD<\/h3>\r\nSyntax: LOAD <i>Datasource<\/i>([arguments]);\r\n\r\nExample: LOAD MultiLine(path:\"productreviews.txt\");\r\n\r\nThe LOAD command loads social media data from some datasource. The required arguments are datasource-specific. Generally, datasources require a path to the input file as well as schema information to interpret the file. See the Common things you\u2019ll want to do section below for examples of loading TSV, Multiline record, JSON and Twitter files.\r\n<h3>EXTRACT<\/h3>\r\nSyntax: EXTRACT [PRIMARY] <i>field<\/i>|<i>FeatureExtractor([arguments]),\u2026 [FROM varname];<\/i>\r\n\r\nExample: EXTRACT PRIMARY hashtag, Gender(), AffectDetector();\r\n\r\nThe EXTRACT command runs a series of feature extractors against the raw social media messages loaded from a data source via the LOAD command.\r\n\r\nExtracting a <i>field<\/i> will pass through a field from the raw data unmodified.\r\n\r\nExtracting a feature using a <i>FeatureExtractor<\/i>() will run the specified feature extractor against the social media message. Feature extractors may generate 0, 1 or more feature values for each message they process, and the domain of the feature need not match the name of the feature extractor. For example, the AffectDetector() generates features in several domains (Subjective, Mood and PosNegAffect), and other feature extractors, such as Phrases(), can generate features in custom domains.\r\n\r\nThe PRIMARY flag acts as a kind of filter on the raw social media data. EXTRACT must find at least one PRIMARY field or feature in a message; otherwise the message will be ignored. If no fields or features are marked as PRIMARY, then EXTRACT will not filter messages.\r\n\r\nFROM <i>varname <\/i>tells the EXTRACT command where to get its input data. 
If not specified, EXTRACT will read from the output of the previous command.\r\n<h3>WEIGHT BY<\/h3>\r\nSyntax: WEIGHT BY featureDomain[, ...] [FROM varname];\r\n\r\nExample: WEIGHT BY userid;\r\n\r\nThe WEIGHT BY command reweights the data from social media messages. By default, every social media message counts as a single observation.\u00a0 If we see a co-occurrence relationship occurring in 2 social media messages, then the co-occurrence relationship will have a weight of 2.\u00a0 We can change this using the WEIGHT BY command so that every unique user (or location or other feature value) counts as a single observation.\u00a0 So, for example, if a co-occurrence relationship is expressed by 2 unique users, then it will have a weight of 2.\u00a0 Conversely, if a single user expresses 2 distinct co-occurrence relationships, each relationship will have a weight of only 0.5.\r\n\r\nNote that\u00a0we can WEIGHT BY one feature but RELATE BY another feature.\r\n<h3>RELATE BY<\/h3>\r\nSyntax: RELATE BY <i>featureDomain <\/i>[FROM<i> varname<\/i>]<i>;<\/i>\r\n\r\nExample: RELATE BY userid;\r\n\r\nThe RELATE BY command declares the domain that defines a co-occurrence relationship. All features that co-occur with the same feature value in this domain are considered to have co-occurred.\r\n\r\nFROM <i>varname <\/i>tells the RELATE BY command where to get its input data. If not specified, RELATE BY will read from the output of the previous command.\r\n\r\nNote that we can WEIGHT BY one feature but RELATE BY another feature.\r\n<h3>PROJECT<\/h3>\r\nSyntax: PROJECT TO [<i>featureDomain, \u2026<\/i>] [FROM <i>varname<\/i>];\r\n\r\nVariant: PLANAR PROJECT TO [<i>featureDomain, \u2026<\/i>] [FROM <i>varname<\/i>];\r\n\r\nVariant: PLANAR BIPARTITE PROJECT TO [<i>featureDomain, \u2026<\/i>] [FROM <i>varname<\/i>];\r\n\r\nExample: PROJECT TO hashtag;\r\n\r\nThe PROJECT TO command will project an initial hyper-graph to focus only on relationships among the specified feature domains. 
That is, only edges which connect 1 or more nodes in the specified domains will be kept, and any nodes in other feature domains will be removed from the structure of the graph. By default, the PROJECT TO command generates a hyper-graph. This means that nodes that do not co-occur with other nodes will still be described by a degenerate 1-edge. Also, if many nodes simultaneously co-occur together, their relationship will be described by a k-edge (where k is the number of co-occurring nodes).\r\n\r\nOften, especially for ease of visualization, it is useful to restrict the discussion graph to be a planar graph (where every edge in the graph connects exactly 2 nodes). The PLANAR PROJECT TO command achieves this. All hyper-edges will be decomposed and re-aggregated into their corresponding 2-edges.\r\n\r\nFurthermore, it can be useful to restrict the graph to be bipartite, where only edges that cross domains are kept. For example, we may only care about the relationship between users and the hashtags they use, and not care about the relationship among hashtags themselves. The PLANAR BIPARTITE PROJECT TO command achieves this. Semantically, this is the equivalent of doing a planar projection and then dropping all edges that connect nodes in the same domain.\r\n<h3>MERGE<\/h3>\r\nSyntax: MERGE <i>varname1,varname2<\/i>[<i>,\u2026<\/i>];\r\n\r\nExample: MERGE MentionAndUserGraph,HashTagAndUserGraph;\r\n\r\nThe MERGE command overlays two discussion graphs atop each other. Nodes with the same feature domain and values will be merged.\r\n<h3>OUTPUT<\/h3>\r\nSyntax: OUTPUT TO \"filename.graph\" [FROM <i>varname<\/i>];\r\n\r\nExample: OUTPUT TO \"mentions.graph\";\r\n\r\nThe OUTPUT TO command saves a discussion graph to the specified file.\r\n\r\nFiles are saved in DGT\u2019s native format. This format consists of 3 tab-separated columns. The first column is the edge identifier: the comma-separated list of nodes connected by this edge. 
The second column is the count of the number of times this co-occurrence relationship was observed to occur. The third column is a JSON-formatted representation of the context of the relationship or, in other words, the distribution of feature values conditioned on the co-occurrence relationship.\r\n<h2>Naming variables<\/h2>\r\nWe can assign the result of commands to variables, and use these variables in later commands:\r\n\r\nSyntax:\r\n<pre>var x = <i>COMMAND1;<\/i><\/pre>\r\n<pre>COMMAND2 FROM x;<\/pre>\r\nExample:\r\n<pre>var reviewData = LOAD Multiline(path:\"finefoods.tar.gz\");<\/pre>\r\n<pre>var reviewFeatures = EXTRACT AffectDetector(),reviewscore FROM reviewData;<\/pre>\r\n[\/panel][panel header=\"Feature Extractor Reference\"]\r\n\r\nHere's a current list of feature extractors included in the discussion graph tool release.\r\n<table class=\"gray-stripe tWiz\" cellspacing=\"3\" cellpadding=\"3\">\r\n<tbody>\r\n<tr class=\"gray-stripeTableHeaderRow\">\r\n<td class=\"gray-stripeTableHeaderEvenCol\"><b>Feature Extractor <\/b><\/td>\r\n<td class=\"gray-stripeTableHeaderOddCol\"><b>Arguments <\/b><\/td>\r\n<td class=\"gray-stripeTableHeaderEvenCol\"><b>Output Domain<\/b><\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><b>AffectDetector()<\/b>\r\n\r\nInfers mood from text<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field:<\/b> input field to analyze (default='text')<\/td>\r\n<td class=\"gray-stripeTableEvenCol\"><b>Mood<\/b>: weights for 7 moods (joviality, sadness, guilt, fatigue, hostility, serenity, fear)\r\n\r\n<b>PosNeg<\/b>: aggregation of positive\/negative affects<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><b>Gender()<\/b>\r\n\r\nInfers gender from user names<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field<\/b>: input field to analyze (default='username')\r\n\r\n<b>discrete: <\/b>whether to output discrete or weighted 
gender values (default='true')<\/td>\r\n<td class=\"gray-stripeTableEvenCol\"><b>gender<\/b>: m=male, f=female, u=unknown<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><b>GeoPoint()<\/b>\r\n\r\nexplicit lat-lon coordinates<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field<\/b>: input field to analyze (default='geopoint')\r\n\r\n<b>rounding<\/b>: number of decimal places to include<\/td>\r\n<td class=\"gray-stripeTableEvenCol\"><b>geopoint<\/b>: lat-lon value<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><strong>GeoShapeMapping()<\/strong>\r\n\r\nMaps lat-lon points to\u00a0feature values via a user-specified GeoJSON formatted shapefile<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><strong>field: <\/strong>input field to analyze (default='geopoint').\u00a0 this field should contain both lat and lon coordinates, separated by a space or comma.\r\n\r\n<b>latfield: <\/b>input field containing latitude value.\r\n\r\n<b>lonfield: <\/b>input field containing longitude value.\r\n\r\n<b>shapefile:<\/b>\u00a0GeoJSON formatted shapefile\r\n\r\n<b>propertynames:<\/b>\u00a0comma separated list of property:domain pairs.\u00a0\u00a0The property names a property within the shapefile, and the domain specifies a custom domain name for that property.\u00a0 If a lat-lon point falls within a shape specified in the shapefile, the feature extractor will output all the specified properties in the propertynames list.\r\n\r\n<b>unknownvalue<\/b>: value to assign to a lat-lon outside of given shapes\r\n\r\n<em>Note: <\/em>Please specify either the field argument or both the latfield and lonfield arguments.<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">[custom domain name]<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><strong>Country()<\/strong>\r\n\r\nAn instance of GeoShapeMapping that maps lat-lon to country\/region 
two-letter codes and country\/region names<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field: <\/b>input field to analyze (default='geopoint'). this field should contain both lat and lon coordinates, separated by a space or comma.\r\n\r\n<b>latfield: <\/b>input field containing latitude value.\r\n\r\n<b>lonfield: <\/b>input field containing longitude value.\r\n\r\n<b>unknownvalue<\/b>: value to assign to a lat-lon outside of countries\/regions\r\n\r\n<em>Note:<\/em><b> <\/b>Please specify either the field argument or both the latfield and lonfield arguments.<\/td>\r\n<td class=\"gray-stripeTableEvenCol\"><b>fips_country:<\/b>\r\n\r\n<b>country:<\/b><\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><strong>USAState()<\/strong>\r\n\r\nAn instance of GeoShapeMapping that maps lat-lon to USA subregions and states<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field: <\/b>input field to analyze (default='geopoint'). this field should contain both lat and lon coordinates, separated by a space or comma.\r\n\r\n<b>latfield: <\/b>input field containing latitude value.\r\n\r\n<b>lonfield: <\/b>input field containing longitude value.\r\n\r\n<b>unknownvalue<\/b>: value to assign to a lat-lon outside of US states\r\n\r\n<em>Note:<\/em><b> <\/b>Please specify either the field argument or both the latfield and lonfield arguments.<\/td>\r\n<td class=\"gray-stripeTableEvenCol\"><strong>USA_subregion:<\/strong>\r\n\r\n<b>USA_state:<\/b>\r\n\r\n<b>USA_fips:<\/b><\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><b>CountyFIPS()<\/b>\r\n\r\nAn instance of GeoShapeMapping that maps lat-lon to US county names and FIPS codes<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field: <\/b>input field to analyze (default='geopoint'). 
This field should contain both lat and lon coordinates, separated by a space or comma.\r\n\r\n<b>latfield: <\/b>input field containing latitude value.\r\n\r\n<b>lonfield: <\/b>input field containing longitude value.\r\n\r\n<b>unknownvalue<\/b>: value to assign to a lat-lon outside of US counties\r\n\r\n<em>Note:<\/em><b> <\/b>Please specify either the field argument or both the latfield and lonfield arguments.<\/td>\r\n<td class=\"gray-stripeTableEvenCol\"><b>countygeoid<\/b>:\r\n\r\n<b>countyname<\/b>:<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><b>Time() <\/b>\r\n\r\nExtracts various temporal features<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field<\/b>: input field to analyze (default='creationdate')\r\n\r\n<b>options:<\/b> list of time features to extract: absoluteminute, absolutehour, absoluteday, absoluteweek, monthofyear, dayofweek, hourofday. (default is to output all fields)\r\n\r\n<b>format:<\/b> 'unix' or 'ticks' (default='unix')<\/td>\r\n<td class=\"gray-stripeTableEvenCol\"><b>absoluteminute:<\/b>\r\n\r\n<b>absolutehour:<\/b>\r\n\r\n<b>absoluteday:<\/b>\r\n\r\n<b>absoluteweek:<\/b>\r\n\r\n<b>monthofyear:<\/b>\r\n\r\n<b>dayofweek:<\/b>\r\n\r\n<b>hourofday:<\/b><\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><b>ProfileLocation()<\/b>\r\n\r\nMaps user profile locations to geographic regions using a user-specified mapping file<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field<\/b>: input field to analyze (default='userlocation')\r\n\r\n<b>domain<\/b>: set custom output domain\r\n\r\n<b>mappingfile<\/b>: model for mapping from user location names to geographic locations. 
DGT comes with a mapping file for major international metropolitan areas, and United States country regions and divisions.\r\n\r\n<b>unknownvalue<\/b>: value to assign to unrecognized profile locations<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">[custom domain name]<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><strong>ProfileLocationToCountry()<\/strong>\r\n\r\nMaps user profile locations to 2-letter country\/region FIPS codes<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><strong>field: <\/strong>input field to analyze (default='userlocation')\r\n\r\n<b>unknownvalue<\/b>: value to assign to unrecognized profile locations<\/td>\r\n<td class=\"gray-stripeTableEvenCol\"><strong>country:<\/strong><\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><strong>ProfileLocationToCountryName()<\/strong>\r\n\r\nMaps user profile locations to country\/region names<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field: <\/b>input field to analyze (default='userlocation')\r\n\r\n<b>unknownvalue<\/b>: value to assign to unrecognized profile locations<\/td>\r\n<td class=\"gray-stripeTableEvenCol\"><strong>countryname:<\/strong><\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><strong>ProfileLocationToUSASubregion()\u00a0<\/strong>\r\n\r\nMaps user profile locations to subregions of USA (e.g., Pacific, Mid-Atlantic)<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field: <\/b>input field to analyze (default='userlocation')\r\n\r\n<b>unknownvalue<\/b>: value to assign to unrecognized profile locations<\/td>\r\n<td class=\"gray-stripeTableEvenCol\"><strong>usa_subregion:<\/strong><\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><strong>ProfileLocationToUSAState()<\/strong>\r\n\r\nMaps user\u00a0profile locations to US states<\/td>\r\n<td 
class=\"gray-stripeTableOddCol\"><b>field: <\/b>input field to analyze (default='userlocation')\r\n\r\n<b>unknownvalue<\/b>: value to assign to unrecognized profile locations<\/td>\r\n<td class=\"gray-stripeTableEvenCol\"><strong>usa_state:<\/strong><\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><strong>ProfileLocationToUSACounty()\u00a0<\/strong>\r\n\r\nMaps user profile locations to US county FIPS codes<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field: <\/b>input field to analyze (default='userlocation')\r\n\r\n<b>unknownvalue<\/b>: value to assign to unrecognized profile locations<\/td>\r\n<td class=\"gray-stripeTableEvenCol\"><strong>usa_county:<\/strong><\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><strong>ProfileLocationToUSACountyName()<\/strong>\r\n\r\nMaps user profile locations to US county names<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field: <\/b>input field to analyze (default='userlocation')\r\n\r\n<b>unknownvalue<\/b>: value to assign to unrecognized profile locations<\/td>\r\n<td class=\"gray-stripeTableEvenCol\"><strong>usa_countyname:<\/strong><\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><strong>ProfileLocationToMetroArea()<\/strong>\r\n\r\nMaps user profile locations to major metropolitan areas<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field: <\/b>input field to analyze (default='userlocation')\r\n\r\n<b>unknownvalue<\/b>: value to assign to unrecognized profile locations<\/td>\r\n<td class=\"gray-stripeTableEvenCol\"><strong>metroarea:<\/strong><\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><b>ExactPhrases()<\/b>\r\n\r\nMatches specific phrases in a given list or mapping file<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field<\/b>: input field to analyze (default='text')\r\n\r\n<b>domain<\/b>: set custom output 
domain\r\n\r\n<strong>accept:<\/strong> a comma-separated list of phrases to match\r\n\r\n<b>acceptfile<\/b>: a text file listing phrases. Use a tab-separated two-column file to specify canonical forms for matched phrases<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">[custom domain name]<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><b>Regex()<\/b>\r\n\r\nMatches regular expressions<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field<\/b>: input field to analyze\r\n\r\n<b>domain<\/b>: set custom output domain\r\n\r\n<b>regex<\/b>: the regular expression to match against text<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">[custom domain name]<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\"><b>Tokens()<\/b>\r\n\r\nExtracts unigram tokens<\/td>\r\n<td class=\"gray-stripeTableOddCol\"><b>field<\/b>: input field to analyze\r\n\r\n<b>domain<\/b>: set custom output domain\r\n\r\n<strong>stopwordsfile: <\/strong>file of tokens to ignore (default=none)\r\n\r\n<strong>porter:<\/strong> use Porter stemming (default=\"false\")<\/td>\r\n<td class=\"gray-stripeTableEvenCol\">[custom domain name]<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n[\/panel]\r\n\r\n[\/accordion]"},{"id":4,"name":"FAQ","content":"<h2>Common things you\u2019ll want to do<\/h2>\r\n[accordion]\r\n[panel header=\"Load data in different formats\"]\r\n\r\nDGT can load social media data in delimiter-separated TSV and CSV files, line-based JSON format (including the output of common Twitter downloaders) and multi-line record formats.\r\n<h3>TSV and CSV data<\/h3>\r\nTo load a TSV or CSV file, use the following LOAD command. The path argument is required. 
Also, either the hasHeader flag must be set to true (indicating the first row of the file is a header line) or the schema argument must be set.\r\n<pre>LOAD TSV(path:\"filename.txt\",<\/pre>\r\n<pre>         fieldSeparator:\",\", \/\/ optional: default is tab character<\/pre>\r\n<pre>         ignoreErrors:\"true\", \/\/ optional: default is false<\/pre>\r\n<pre>         hasHeader:\"false\", \/\/ optional: default is false<\/pre>\r\n<pre>         schema:\"col1,col2,...\" \/\/ either hasHeader:\"true\" or a schema is required<\/pre>\r\n<pre>         );<\/pre>\r\n<h3>Multi-line record data<\/h3>\r\nA multi-line record file contains a single field of the record per line, with a blank line separating records. For example:\r\n<table class=\" tWiz tableBorder\" cellpadding=\"0\">\r\n<tbody>\r\n<tr>\r\n<td>name: Bob\r\n\r\ntext: hello world!\r\n\r\nmessagetime:5\/4\/2013\r\n\r\nname: Alice\r\n\r\ntext: hello back!\r\n\r\nmessagetime:5\/5\/2013\r\n\r\n...<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\nTo load a multi-line record file, use the following LOAD command. Only the path argument is required. The schema is implicit in the file itself.\r\n<pre>LOAD Multiline(path:\"filename.txt\",<\/pre>\r\n<pre>               fieldSeparator:\":\", \/\/ optional: default is : character<\/pre>\r\n<pre>               ignoreErrors:\"true\" \/\/ optional: default is false<\/pre>\r\n<pre>               );<\/pre>\r\n<h3>JSON file<\/h3>\r\nDGT can read JSON-line formatted files (where each line of a text file is a JSON object).\r\n<pre>LOAD JSON(path:\"filename.txt\",<\/pre>\r\n<pre>          ignoreErrors:\"true\",<\/pre>\r\n<pre>          schema:\"field1:jsonpath1,field2:jsonpath2,...\");<\/pre>\r\nThe schema must specify both the fields to be extracted as well as their JSON paths. 
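For instance, a hypothetical two-field schema (the field names and JSON paths here are illustrative, not required by DGT) could be written as:\r\n<pre>LOAD JSON(path:\"filename.txt\",<\/pre>\r\n<pre>          schema:\"msgtext:text,sender:user\/screen_name\");<\/pre>\r\n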
If multiple values in the JSON object match a given path, the field will be assigned multiple values.\r\n<h3>Twitter data<\/h3>\r\nDGT also includes a pre-defined data source for the output of the twitter-tools utilities. (This is a JSON-line formatted file.) To load the output of the twitter-tools utilities, use the following LOAD command. Only the path argument is required.\r\n<pre>LOAD Twitter(path:\"filename.txt\");<\/pre>\r\nThis data source includes schema definitions for most of the common Twitter fields:\r\n<table class=\"gray-stripe tWiz\" cellspacing=\"3\" cellpadding=\"3\">\r\n<tbody>\r\n<tr class=\"gray-stripeTableHeaderRow\">\r\n<td class=\"gray-stripeTableHeaderEvenCol\">Field<\/td>\r\n<td class=\"gray-stripeTableHeaderOddCol\">JSON path<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">contextid<\/td>\r\n<td class=\"gray-stripeTableOddCol\">id_str<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">createdat<\/td>\r\n<td class=\"gray-stripeTableOddCol\">created_at<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">text<\/td>\r\n<td class=\"gray-stripeTableOddCol\">text<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">inreplytostatusid<\/td>\r\n<td class=\"gray-stripeTableOddCol\">in_reply_to_status_id<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">inreplytoscreenname<\/td>\r\n<td class=\"gray-stripeTableOddCol\">in_reply_to_screen_name<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">userid<\/td>\r\n<td class=\"gray-stripeTableOddCol\">user\/id_str<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">username<\/td>\r\n<td class=\"gray-stripeTableOddCol\">user\/\/name<\/td>\r\n<\/tr>\r\n<tr 
class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">userscreenname<\/td>\r\n<td class=\"gray-stripeTableOddCol\">user\/screen_name<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">userlocation<\/td>\r\n<td class=\"gray-stripeTableOddCol\">user\/location<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">lang<\/td>\r\n<td class=\"gray-stripeTableOddCol\">lang<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">userdescription<\/td>\r\n<td class=\"gray-stripeTableOddCol\">user\/description<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">userfollowerscount<\/td>\r\n<td class=\"gray-stripeTableOddCol\">user\/followers_count<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">userfriendscount<\/td>\r\n<td class=\"gray-stripeTableOddCol\">user\/friends_count<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">userlistedcount<\/td>\r\n<td class=\"gray-stripeTableOddCol\">user\/listed_count<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">usercreatedat<\/td>\r\n<td class=\"gray-stripeTableOddCol\">user\/created_at<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">userfavouritescount<\/td>\r\n<td class=\"gray-stripeTableOddCol\">user\/favourites_count<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">userutcoffset<\/td>\r\n<td class=\"gray-stripeTableOddCol\">user\/utc_offset<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">usertimezone<\/td>\r\n<td class=\"gray-stripeTableOddCol\">user\/time_zone<\/td>\r\n<\/tr>\r\n<tr 
class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">userverified<\/td>\r\n<td class=\"gray-stripeTableOddCol\">user\/verified<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">userstatusescount<\/td>\r\n<td class=\"gray-stripeTableOddCol\">user\/statuses_count<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">retweetcreatedat<\/td>\r\n<td class=\"gray-stripeTableOddCol\">retweeted_status\/created_at<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">retweetid<\/td>\r\n<td class=\"gray-stripeTableOddCol\">retweeted_status\/id_str<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">retweettext<\/td>\r\n<td class=\"gray-stripeTableOddCol\">retweeted_status\/text<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">retweetuser<\/td>\r\n<td class=\"gray-stripeTableOddCol\">retweeted_status\/user\/id_str<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">retweetusername<\/td>\r\n<td class=\"gray-stripeTableOddCol\">retweeted_status\/user\/\/name<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">retweetuserscreenname<\/td>\r\n<td class=\"gray-stripeTableOddCol\">retweeted_status\/user\/screen_name<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">hashtag<\/td>\r\n<td class=\"gray-stripeTableOddCol\">entities\/hashtags\/text<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">symbol<\/td>\r\n<td class=\"gray-stripeTableOddCol\">entities\/symbols\/text<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">url<\/td>\r\n<td 
class=\"gray-stripeTableOddCol\">entities\/urls\/url<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">urlexpanded<\/td>\r\n<td class=\"gray-stripeTableOddCol\">entities\/urls\/expanded_url<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">mentionuserid<\/td>\r\n<td class=\"gray-stripeTableOddCol\">entities\/user_mentions\/id_str<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">mentionusername<\/td>\r\n<td class=\"gray-stripeTableOddCol\">entities\/user_mentions\/\/name<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableOddRow\">\r\n<td class=\"gray-stripeTableEvenCol\">mentionuserscreenname<\/td>\r\n<td class=\"gray-stripeTableOddCol\">entities\/user_mentions\/screen_name<\/td>\r\n<\/tr>\r\n<tr class=\"gray-stripeTableEvenRow\">\r\n<td class=\"gray-stripeTableEvenCol\">geopoint<\/td>\r\n<td class=\"gray-stripeTableOddCol\">geo\/coordinates\/$$<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n[\/panel][panel header=\"Filter out irrelevant messages\"]\r\n\r\nSometimes a social media message is simply irrelevant to a given analysis. For example, in a study about hashtag usage on Twitter, we might want to ignore messages that do not have hashtags. To do this, we can use the PRIMARY keyword of the EXTRACT command.\r\n<pre>EXTRACT PRIMARY hashtag, PRIMARY mention, AffectDetector();<\/pre>\r\nIn this example, we have marked the hashtag and mention fields as PRIMARY fields (any field or feature extractor may be marked as PRIMARY). This PRIMARY flag tells the EXTRACT command that it must find either a hashtag or a mention value in a message in order to continue processing it. If a message has either a hashtag or a mention, EXTRACT will also run the AffectDetector() and pass the values along to the rest of the script. 
If a message does not have any hashtag value and does not have any mention value, then that message will be ignored.\r\n\r\nThe PRIMARY flag can be combined with the accept and reject arguments understood by most feature extractors. If you want to analyze only social media messages by women, for example, you can use the accept argument to achieve this:\r\n<pre>EXTRACT PRIMARY Gender(accept:\"f\"), hashtag, mention;<\/pre>\r\nThe Gender feature extractor understands the accept argument, and will only output feature values that match the list. The result in this case is that only messages where the author\u2019s gender is identifiably female will be processed. (Note that the hashtag and mention fields are no longer marked as PRIMARY fields.)\r\n\r\nIf you have a long list of values you want to accept, you can use the acceptfile argument. The syntax and behavior of the accept and acceptfile arguments are the same as for the ExactPhrases() feature extractor.\r\n\r\n[\/panel][panel header=\"Detect phrases and words in tweet text\"]\r\n\r\nYou can configure the ExactPhrases() feature extractor with different arguments to detect different sets of phrases. Here is an adaptation of the \u201cpolitician detector\u201d from our simple example, this time modified to detect mentions of parents. By default, phrase detection is case-insensitive.\r\n<pre>EXTRACT ExactPhrases(domain:\"parent\",accept:\"dad,mom,father,mother\");<\/pre>\r\nIf you have a long list of phrases you want to detect, you can put them in a file and reference it in your script. Note that you also have to specify the file as a resource, so that the framework knows to include it as part of the job.\r\n<pre>EXTRACT ExactPhrases(domain:\"parent\",acceptfile:\"parentphrases.txt\");<\/pre>\r\nIn its simple form, this file is simply a list of phrases to detect. 
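For example, a one-phrase-per-line acceptfile for the parent detector above (contents illustrative) might contain:\r\n<pre>dad<\/pre>\r\n<pre>mom<\/pre>\r\n<pre>father<\/pre>\r\n<pre>mother<\/pre>\r\n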
You can also use this file to group or canonicalize detected phrases by adding a 2<sup>nd<\/sup> tab-separated column that includes the canonical form. For example, if you used the following file, it would detect the nicknames for parents and map them to their canonical name. That is, whenever the phrase extractor finds \u201cmommy\u201d or \u201cmom\u201d, the extracted feature will be emitted as \u201cmother\u201d.\r\n<table width=\"100%\" cellpadding=\"0\">\r\n<tbody>\r\n<tr>\r\n<td width=\"17%\">mom<\/td>\r\n<td width=\"81%\">mother<\/td>\r\n<\/tr>\r\n<tr>\r\n<td width=\"17%\">mother<\/td>\r\n<td width=\"81%\">mother<\/td>\r\n<\/tr>\r\n<tr>\r\n<td width=\"17%\">mommy<\/td>\r\n<td width=\"81%\">mother<\/td>\r\n<\/tr>\r\n<tr>\r\n<td width=\"17%\">dad<\/td>\r\n<td width=\"81%\">father<\/td>\r\n<\/tr>\r\n<tr>\r\n<td width=\"17%\">...<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n[\/panel][panel header=\"Import results into R, Excel, Gephi or other tools\"]\r\n\r\nOften, you will want to perform further higher-level analyses (machine learning analyses, visualizations and\/or statistical analyses) on the output of DGT. 
To do so, we provide utilities to convert from DGT\u2019s native output format to TSV and GEXF files that will let you load the data in R, Excel, Gephi and other tools.\r\n\r\nTo convert to TSV, use the dgt2tsv.exe command:\r\n<pre>dgt2tsv.exe input.graph [outputfields] outputfilename.tsv<\/pre>\r\nThe list of outputfields may include \"count\", any of the domains output by a feature extractor, a domain name followed by \".count\", or a domain name followed by a specific feature value.\r\n\r\nFor example, the following command will output a count of the number of messages seen for each edge in a discussion graph; the gender of the author; and the weight of the \"fatigue\" value in the Mood domain.\r\n<pre>dgt2tsv.exe input.graph count,gender,Mood.fatigue output.tsv<\/pre>\r\nTo output a .gexf file that can be read by Gephi for graph analyses and visualizations, use the dgt2gexf.exe command:\r\n<pre>dgt2gexf.exe input.graph [outputfields] outputfilename.gexf<\/pre>\r\n[\/panel]\r\n[\/accordion]"}],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Scott Counts","user_id":31471,"people_section":"Group 1","alias":"counts"},{"type":"user_nicename","display_name":"Emre Kiciman","user_id":31739,"people_section":"Group 
1","alias":"emrek"}],"msr_research_lab":[199565],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/171346","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":5,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/171346\/revisions"}],"predecessor-version":[{"id":604269,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/171346\/revisions\/604269"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=171346"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=171346"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=171346"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=171346"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=171346"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}