Discussion Graph Tool

Established: April 25, 2014

Discussion Graph Tool (DGT) is an easy-to-use analysis tool that provides a domain-specific language for extracting co-occurrence relationships from social media, and automates best practices such as tracking the context of relationships. DGT provides a single-machine implementation, and also generates map-reduce-like programs for distributed, scalable analyses.

DGT simplifies social media analysis by making it easy to extract high-level features and co-occurrence relationships from raw data.

With just 3-4 simple lines of script, you can load your social media data, extract complex features, and generate a graph among arbitrary features. Throughout, DGT automates best-practices, such as tracking the context of relationships.
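For example, the complete script from the product-reviews walkthrough below is just four lines (the file and field names are specific to that dataset):

LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");
EXTRACT AffectDetector(field:"review_text"), Gender(field:"review_profileName"), review_score;
PROJECT TO review_score;
OUTPUT TO "finefoods_reviewscore_context.graph";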

Download


Available Features

Out-of-the-box feature extraction for common scenarios, including mood and geo-location, as well as customizable dictionary- and regular-expression-based extraction.

Analyze text for signs of joviality, fatigue, sadness, guilt, hostility, fear, and serenity. Map lat-lon coordinates to FIPS county codes. Recognize gender based on name.

Identify co-occurrence relationships within social media messages, user behaviors, locations, or other features.

Extract planar graphs and hyper-graphs of co-occurrence relationships, and track contextual statistics for each relationship.

Import raw social media data from existing sources.

Read delimiter-separated TSV and CSV files, line-based JSON formats (including the output of common Twitter downloaders), and multi-line record formats.

Analyze results in popular tools such as R, Gephi, and Excel.

Output JSON, TSV, and GEXF.

Extend DGT with custom feature extractors

Incorporate your own feature extractors with DGT through a simple API. This makes it easy for others to build on your techniques and mix-and-match with others.

More coming soon…

News

Aug 13: Some people were seeing errors when trying to run the binaries because of an invalid signature. We’ve fixed that now. Thanks for the bug reports!

Aug 8: We’ve updated the DGT release, adding support for weighting data and projection on weighted values.  We’ve also updated and expanded our location mapping capabilities to map lat-lon coordinates and user-specified locations to countries, US states and US counties.

June 19: Our first release is available!  Get in touch with your questions.  We’re looking for feedback. Tweet @emrek or email the team at discussiongraph@microsoft.com.  Thanks!

June 16:  In preparation for our tool release, we’ve added 2 new step-by-step walkthroughs on analyzing the moods of product reviews and extracting graphs of hashtag relationships on Twitter.

Read More

Our step-by-step walkthroughs and our reference guide give details about the tool and its usage.

Read more about our tool and using it for deeper contextual analyses in our ICWSM 2014 paper, “Discussion Graphs: Putting Social Media Analysis in Context”, by Kıcıman, Counts, Gamon, De Choudhury and Thiesson. [PDF]

Discussion and Feedback

Have a question about how to use DGT for an analysis? Have feedback or a bug report? Want to use your own feature extractor within DGT?

Contact @emrek via Twitter or reach all of us via email at discussiongraph@microsoft.com.

Coming Soon

We are continuing to develop the public release of DGT.  Here is what is currently in the works:

  • Qualitative sampling of raw data that supports each extracted relationship.
  • FILTER command for conditioning analyses on demographic or other feature values.
  • (Now available as of version 0.6) Improved support for extracting relationships among continuous or weighted feature values
  • Improved aggregation/summarization performance


Feature Extractors

A feature extractor in DGT is responsible for analyzing the raw data of a social media message and recognizing, extracting, inferring or detecting higher-level information.  The raw data of a social media message may include the text as well as available metadata about the message, the message author, and other geo-temporal or social context.

DGT includes several out-of-the-box feature extractors for common scenarios.  These include some complex analysis tasks, such as mood inference and geo-location mapping, as well as support for simpler analyses, such as customizable dictionary and regular expression-based feature extractors.

The reference guide lists the feature extractors included in DGT, and examples of using the customizable feature extractors.

Downloading Tweets

The TREC 2013 Microblog track provided a convenient set of tools for retrieving tweets, including a tool for sampling from the public Twitter stream.  To install this tool and begin downloading tweets, follow these instructions:

  1. Install the prerequisite software
    1. Java Development Kit
    2. Apache Maven
  2. Download the twitter-tools zip file from https://github.com/lintool/twitter-tools/ and extract it on your computer
  3. Open a command-line to the directory where you extracted the twitter-tools zip file and run the following two commands to build the twitter-tools program
> cd twitter-tools-core
> mvn clean package appassembler:assemble

  4. Follow the instructions on the twitter-tools site for creating your Twitter access tokens, setting up a twitter4j.properties file, and running the GatherStatusStream.bat program to retrieve tweets from the public Twitter stream

Install

Installing the Discussion Graph tool

This short step-by-step walks you through installing DGT and adding it to your execution path.

Discussion Graph Tool Install

To install the Discussion Graph Tool, download the latest DGT release as a zip file.

  1. Install the prerequisite .NET Framework 4.5
  2. Check that the downloaded DGT zip file is “unblocked” on your computer.  Right-click on the downloaded zip file, click “Properties…”, ensure that the “Unblock” check box is checked, and click Apply (or press the “Unblock” button in older versions of Windows).
  3. Extract the dgt-0.5.zip file to a location, such as your user directory c:\users\myName (where myName is your login), c:\program files, or an alternate location, such as e:\.  Wherever you decide to extract the zip file, you should find a dgt directory, and within it a bin directory.
  4. Edit the system environment variables to add the dgt\bin directory to the execution path.
    1. To do so, open the Control Panel, search for Environment Variables and click “Edit the System Environment Variables”.
    2. In the Advanced tab, click the Environment Variables button.
    3. Select the PATH variable from the system variables list and click the Edit button.
    4. Edit the variable value (the current search paths) and append a ‘;’ (semicolon character, without the quotes) followed by the full path to the DGT binaries.  (Don’t forget to include the trailing dgt\bin directory, e.g., c:\users\myName\dgt\bin, c:\program files\dgt\bin or e:\dgt\bin.)
    5. Click OK.
  5. Test the installation
    1. Open a new command-line window (run cmd.exe)
    2. Type the command “dgt --help”.  You should see the following output.
> dgt --help
Discussion Graph Tool Version 0.5
Contact: discussiongraph@microsoft.com
Usage: dgt.exe filename.dgt [options]
Options:
  --target=local|...     Specify target execution environment.
  --config=filename.xml  Specify non-default configuration file

To learn more about the Discussion Graph Tool, read the getting started guide and the step-by-step walkthroughs.

Walkthroughs

Walkthrough #1: Analyzing Mood of Product Reviews

Analyzing Mood of Product Reviews

This walkthrough focuses on answering the question: How does mood (joviality, anger, guilt, …) correlate with product review score? Does this vary by gender? As a bonus, see how to extract a graph of products based on their common reviewers. Read the step-by-step.

In this walkthrough, we will be working with Amazon review data for fine food products. First, we are going to ask the question, “what are the moods associated with positive and negative reviews?” Then, we will go a little deeper into the data and see how the mood distributions differ based on the gender of the reviewer, and also suggest other explorations.

Through this example, we will introduce the basic concepts and commands of a DGT script. We’ll show how to load data, extract fields and derived features from social media; and project and aggregate the results.

Getting the Discussion Graph Tool

Step 1. Download the Discussion Graph Tool (DGT)

If you haven’t already, download and install the discussion graph tool. The rest of this walkthrough will assume that you have installed the tool and added it to your executable path.

To double-check the installation, open a new command-line window and type the command “dgt --help”. You should see the following output:

> dgt --help
Discussion Graph Tool Version 0.5
Contact: discussiongraph@microsoft.com
Usage: dgt.exe filename.dgt [options]
Options:
  --target=local|...     Specify target execution environment.
  --config=filename.xml  Specify non-default configuration file

Step 2. Create a new directory for this walkthrough.  Here, we’ll use the directory E:\dgt-sample

> mkdir e:\dgt-sample

Getting the Data

Before we start to write our first script, let’s get some data to analyze. We’ll be using Amazon review data collected by McAuley and Leskovec. This dataset includes over 500K reviews of 74K food-related products. Each review record includes a product id, user id, user name, review score, helpfulness rating, timestamp, and both review and summary text.  The user names are often real names, and review scores are integers on a scale from 1 to 5.

Step 3. Download finefoods.txt.gz from the Stanford Network Analysis Project’s data archive. Save the file to E:\dgt-sample

> e:
> cd e:\dgt-sample
e:\dgt-sample> dir
 Volume in drive E is DISK
 Volume Serial Number is AAAA-AAAA

 Directory of E:\dgt-sample

06/10/2014  11:17 AM    <DIR>          .
06/10/2014  11:17 AM    <DIR>          ..
06/10/2014  11:16 AM       122,104,202 finefoods.txt.gz
               1 File(s)    122,104,202 bytes
               2 Dir(s)  45,007,622,144 bytes free

Writing the Script

There are 4 basic commands we will use in our script: LOAD for loading data; EXTRACT for extracting features from the raw data; PROJECT for projecting specific relationships and context from the raw data; and OUTPUT for saving the result to a file. Let’s take things step-by-step.

Step 4. Create a new file mood-reviews.dgt. Use notepad.exe, emacs, vi or your favorite text editor.

e:\dgt-sample> notepad mood-reviews.dgt

Step 5. LOAD the data.

The first command in the script is going to be to load the data file. The reviews we downloaded are in a multi-line record format, where each line in the file represents a key-value field of a record; and records are separated by blank lines. The LOAD MultiLine() command will parse this data file.  Add the following line as the first command in the script file:

LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");

Since the multi-line format naturally embeds the schema within the data file, we don’t have to specify it in the LOAD command.  There are some spurious newlines in the finefoods.txt.gz data, so we need to set the ignoreErrors flag to true.  This tells DGT to ignore data that is misformatted.

Step 6. EXTRACT higher-level features from the raw data

Add the following line as the second command in the script file:

EXTRACT AffectDetector(field:"review_text"),
        Gender(field:"review_profileName"),
        review_score;

This EXTRACT statement generates 3 higher-level features:

    • The AffectDetector() call infers the affect, or mood, of a text. The field argument tells it which of the raw fields to analyze. We’ll choose the long review field but could just as easily have selected the summary field. If you don’t pass a field argument, then the AffectDetector() extractor will by default look for a field named “text” in the raw data.
    • The Gender() call infers the gender of the author, based on the author’s first name. The field argument tells it which field includes the author’s name. If you don’t pass a field argument, then the Gender() extractor will by default look for a field named “username” in the raw data.
    • By naming the review_score field (without parentheses), we tell the script to pass the review_score field through without modification.

    A note on naming outputs and inputs: By default, EXTRACT, PROJECT and OUTPUT commands operate on the results of the previous statement. You can also explicitly name the results of commands. To do so, use the “var x = ” notation to assign results to a variable, then add “FROM x” to later commands. For example:

    var finefoodsdata = LOAD MultiLine(path:"finefoods.txt.gz",ignoreErrors:"true");
    EXTRACT AffectDetector(field:"review_text"), Gender(field:"review_profileName"), review_score FROM finefoodsdata;

    Step 7. PROJECT the data to focus on the relationships of importance

    Now, we tell the script what relationships we care about. Often, we’ll be using DGT to extract a graph of co-occurrence relations from a set of data. In this first example, we’re going to ask for a simpler result set, essentially using DGT as a simple aggregator or “group by” style function.  Add the following line to the script:

    PROJECT TO review_score;

    By projecting to “review_score”, we are telling DGT to build a co-occurrence graph among review scores. By default DGT assumes the co-occurrence relationships are defined by the co-occurrence of values within the same record. Since in this dataset every record has at most one review score, that means that there are no co-occurrence relationships. The resulting graph is then simply the degenerate graph of 5 nodes (1 for each score from 1 to 5).  For each of these nodes, DGT aggregates the affect and gender information that we extracted.

    Step 8. OUTPUT the results to disk

    Finally, we add the following command to the script to save the results:

    OUTPUT TO "finefoods_reviewscore_context.graph";

    If you haven’t already, now would be a good time to save your script file…  The whole script should look like this:

    LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");
    EXTRACT AffectDetector(field:"review_text"),
            Gender(field:"review_profileName"),
            review_score;
    PROJECT TO review_score;
    OUTPUT TO "finefoods_reviewscore_context.graph";

    Run the Script

    Step 9. From the command line, run DGT against the script mood-reviews.dgt:

    e:\dgt-sample> dgt.exe mood-reviews.dgt

    The output file “finefoods_reviewscore_context.graph” should now be in the e:\dgt-sample directory.  Each row of the output file represents a review score, since that is what we projected to in our script. Columns are tab-separated: the first column of each row is the name of the edge (or node) in the graph; the second column is the count of records seen with the given review score; and the third column is a JSON-formatted bag of data distributions for gender and affect observations.
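    Schematically, a row has the following form (the values here are placeholders, not real output):

    <review_score>	<count>	{"gender":{"m":...,"f":...,"u":...},"mood":{"joviality":...,"fatigue":...,...}}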

    To import this data into R, Excel or other tools, we have included a command-line utility dgt2tsv.exe that can pull out specific values.  Use the following command to build a TSV file that summarizes the gender and mood for each review score:

    e:\dgt-sample> dgt2tsv.exe finefoods_reviewscore_context.graph count,gender.m,gender.f,gender.u,mood.joviality,mood.fatigue,mood.hostility,mood.sadness,mood.serenity,mood.fear,mood.guilt finefoods_reviewscore_gendermood.tsv

    Here’s a quick graph of how mood varies across review scores.

    We see that joviality increases and sadness decreases with higher review scores.  We see that there is more hostility in lower review scores and more serenity in higher review scores.  While most moods are monotonically increasing or decreasing with review score, we see that guilt peaks in 2- and 3-star reviews.

    Further Explorations

    The design goal of DGT is to make it easy to explore the relationships embedded in social media data and capture the context of the discussions from which the relationships were inferred.

    Are the distributions of mood across review scores different for men and women? Conditioning the mood distributions on gender as well as review score gives us this information.  We can do this simply by adding the gender field to our PROJECT command, as follows (the PROJECT and OUTPUT lines change from the original script):

    LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");
    EXTRACT AffectDetector(field:"review_text"),
            Gender(field:"review_profileName"),
            review_score;
    PROJECT TO review_score, gender;
    OUTPUT TO "finefoods_reviewscore_gender_context.graph";

    Here’s a quick look at the results.  Here, I’ve graphed the joviality (solid line) and sadness (dashed line) for men (orange) and women (green).  We see that the general trends hold, though there are some differences that one might continue digging deeper into…

    How are products related to each other by reviewer?  For example, how many people who wrote a review of “Brand A Popcorn” also wrote about “Brand X chocolate candies”?  We can answer this question by defining a co-occurrence relationship based on user id.  That is, we’ll say that two product ids are related if the same user reviewed both products.  Here’s how we do that in the script:

    LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");
    EXTRACT product_productId, review_userId;
    RELATE BY review_userId;
    PLANAR PROJECT TO product_productId AGGREGATE();
    OUTPUT TO "finefoods_products_relateby_user.graph";

    (We’ll learn more about the RELATE BY and PLANAR PROJECT commands in the next walkthroughs.)  This will generate a discussion graph that connects pairs of products that were reviewed by the same person.  We can convert this into a file readable by the Gephi graph visualization tool using the dgt2gexf command:

    e:\dgt-sample> dgt2gexf.exe finefoods_products_relateby_user.graph count finefoods_products_relateby_user.gexf filterbycount=1000

    The dgt2gexf command mirrors the dgt2tsv command.  In this case, we decided to use a filterbycount option to only output edges that have at least 1000 users who have co-reviewed the pair of products.  This filter helps keep the visualization relatively manageable.

    Here’s the resulting product graph, laid out using Gephi’s Fruchterman-Reingold algorithm. Each of the clusters represents a group of food products that are frequently co-reviewed on Amazon…

    Walkthrough #2: Analyzing Twitter Hashtags

    Analyzing Twitter Hashtags

    This walkthrough focuses on twitter data and extracting a graph of related hashtags based on co-occurrences. Read the step-by-step.


    In this walkthrough, we will be working with public stream data from Twitter. We are going to ask the question, “which hashtags are used together?” and extract a graph of co-occurring hashtags. Then, we will visualize the hashtag graph and suggest other explorations.

    Through this example, we will introduce the basic concepts and commands of a DGT script. We’ll show how to load data, extract fields and derived features from social media; and project and aggregate the results.

    Getting the Discussion Graph Tool

    Step 1. Download the Discussion Graph Tool (DGT)

    If you haven’t already, download and install the discussion graph tool (see the detailed installation instructions). The rest of this walkthrough will assume that you have installed the tool and added it to your executable path.

    To double-check the installation, open a new command-line window and type the command “dgt --help”. You should see the following output:

    > dgt --help
    Discussion Graph Tool Version 1.0
    Contact: discussiongraph@microsoft.com
    Usage: dgt.exe filename.dgt [options]
    Options:
      --target=local|...     Specify target execution environment.
      --config=filename.xml  Specify non-default configuration file

    Step 2. Create a new directory for this walkthrough. Here, we’ll use the directory E:\dgt-sample

    > mkdir e:\dgt-sample

    Getting Twitter Data

    First, let’s get some data to analyze. We’ll be using Twitter data for this walkthrough.  Twitter doesn’t allow redistribution of its data, but does have an API for retrieving a sample stream of tweets.  There are a number of steps you’ll have to complete, including registering for API keys and access tokens from Twitter.  We’ve put up full instructions.

    Step 3. Install the twitter-tools package.  See our instructions.

    Step 4. Download a sample of tweets.  Run GatherStatusStream.bat for “a while”; press Ctrl-C to stop the download.  This will generate a file (or files) called statuses.log.YYYY-MM-DD-HH, where YYYY-MM-DD-HH represents the current date and hour.  The files may be compressed (indicated with a .gz file suffix).

    Each line in this file represents a tweet (*), in JSON format, that includes all available metadata about the tweet, tweet author, etc.  (* The file also includes some other information, such as tweet deletions.  There’s no need to worry about those for this walkthrough.)

    > twitter-tools-master\twitter-tools-core\target\appassembler\bin\GatherStatusStream.bat
    1000 messages received.
    2000 messages received.
    3000 messages received.
    4000 messages received.
    5000 messages received.
    6000 messages received.
    7000 messages received.
    8000 messages received.
    9000 messages received.
    10000 messages received.
    Terminate batch job (Y/N)? Y

    > dir statuses*
     Volume in drive C is DISK
     Volume Serial Number is AAAA-AAAA

     Directory of E:\dgt-sample\twitter-tools-core

    06/13/2014  12:53 PM        49,665,736 statuses.log.2014-06-13-12
                   1 File(s)     49,665,736 bytes
                   0 Dir(s)  43,039,879,168 bytes free

    Writing the Script

    As we saw in walkthrough #1, there are 4 basic commands we will use in our script: LOAD for loading data; EXTRACT for extracting features from the raw data; PROJECT for projecting specific relationships and context from the raw data; and OUTPUT for saving the result to a file. Let’s take things step-by-step.

    Step 5. Create a new file twitter-hashtags.dgt. Use notepad.exe, emacs, vi or your favorite text editor.

    e:\dgt-sample> notepad twitter-hashtags.dgt

    Step 6. LOAD the data.

    The first command in the script is going to be to load the data file. The tweets we downloaded are in a JSON-based record format, where each line in the file is a JSON-formatted record containing key-value fields. The LOAD Twitter() command can parse this file. Add the following line as the first command in the script file:

    LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true");

    The Twitter data source already knows about the key fields in the Twitter JSON data file (see the table in the reference guide), so we don’t have to specify any more information. twitter-tools adds some non-JSON lines into its output, so we’ll also set the ignoreErrors flag to true. This will tell DGT to ignore misformatted lines in the input.

    Step 7. EXTRACT higher-level features from the raw data

    Add the following line as the second command in the script file:

    EXTRACT AffectDetector(), Gender(), hashtag;

    This EXTRACT statement generates 3 higher-level features:

      • The AffectDetector() call infers the affect, or mood, of a text.  By default, the AffectDetector() looks for a field named “text” in the raw data, though we could set the “field” argument to make it look at other fields instead.
      • The Gender() call infers the gender of the author, based on the author’s first name. By default, the Gender() extractor looks for a field named “username” in the raw data.  Again, we could override this using the “field” argument.
      • By naming the hashtag field—without parentheses—we tell the script to pass the hashtag field through without modification.
      Note: The output of twitter-tools already includes hashtags, user mentions, urls and stock symbols as explicit fields parsed out of the raw text. We’ll see in the further explorations how we can use exact phrase matching and regular expression matching to pull values out of the text ourselves.

      Step 8. PROJECT the data to focus on the relationships of importance

      Now, we tell the script what relationships we care about. Here, we want to extract the pair-wise co-occurrence relationships among hashtags.  That is, which hashtags are used together?

      PLANAR PROJECT TO hashtag;

      By projecting to “hashtag”, we are telling DGT to build a co-occurrence graph among hashtags. By default DGT assumes the co-occurrence relationships are defined by the co-occurrence of values within the same record.

      In this exercise, we’re choosing to use a PLANAR PROJECT command because we’re going to visually display the resulting hashtag graph at the end of this walkthrough, and planar graphs are simply easier to render.  However, it’s worth noting that the planar representation is incomplete.  For example, if 3 hashtags always co-occur together, that information will be lost, because a planar graph cannot represent it.  A hyper-graph can represent such complex co-occurrences, however.  For this reason, the PROJECT command defaults to a hyper-graph, and we recommend using that representation if you are going to be computing on the result.  A tiny illustration follows below.
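      As an illustration (with hypothetical hashtags), suppose the three hashtags #a, #b and #c always appear together in the same tweets:

      hyper-graph (PROJECT):         one 3-edge     {#a, #b, #c}
      planar graph (PLANAR PROJECT): three 2-edges  (#a,#b), (#a,#c), (#b,#c)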

      Step 9. OUTPUT the results to disk

      Finally, we add the following command to the script to save the results:

      OUTPUT TO "twitter_hashtags.graph";

      If you haven’t already, now would be a good time to save your script file… The whole script should look like this:

      LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true");
      EXTRACT AffectDetector(), Gender(), hashtag;
      PLANAR PROJECT TO hashtag;
      OUTPUT TO "twitter_hashtags.graph";

      Run the Script

      Step 10. From the command line, run DGT against the script twitter-hashtags.dgt:

      e:\dgt-sample> dgt.exe twitter-hashtags.dgt

      The output file “twitter_hashtags.graph” should now be in the e:\dgt-sample directory. Each row of the output file represents a relationship between a pair of hashtags, since we projected to the planar relationship between co-occurring hashtags in our script. Columns are tab-separated: the first column of each row is the name of the edge in the graph (the edge name is simply the concatenation of the two node names, in this case the two hashtags); the second column is the count of tweets seen with the pair of hashtags; and the third column is a JSON-formatted bag of data distributions for gender and affect observations.

      To import this data into visualization and analysis tools, we have included two command-line utilities dgt2tsv.exe and dgt2gexf.exe that can extract specific values into a tab-separated values (TSV) file or a Graph Exchange XML Format (GEXF) file.

      We’ll use the dgt2gexf command and visualize the result with the Gephi graph visualization tool:

      e:\dgt-sample> dgt2gexf.exe twitter_hashtags.graph count twitter_hashtags.gexf

      If your twitter sample is large, you might consider adding the option “filterbycount=N” (without the quotes) to the command-line.  This will only include edges that were seen at least N times in your sample.  Use an appropriate number, from 10 to 1000 or higher, depending on the size of your sample.

      Here’s the resulting hashtag graph.  Each of the clusters represents a group of hashtags that are frequently co-mentioned in our tiny sample of Twitter data…

      For clarity and fun, we’ll filter out low-frequency edges and zoom into one of the clusters of hashtags about world-cup related topics.  We see from the thickness of the edges that #NED and #ESP are the most frequently co-occurring hashtags, and each also co-occurs relatively frequently with #WorldCup.  We also see a number of people piggy-backing on the popular #worldcup hashtag with topically unrelated hashtags (#followers, #followback, #retweet, #followme)  to solicit followers and retweets.

      Further Explorations

      There are many interesting things to explore in hashtag relationships, such as the evolution of hashtag relationships over time — for example, use PROJECT TO hashtag,absoluteday; — hashtag relationships conditioned on gender — PROJECT TO hashtag,Gender(); — and inspections of token distributions, moods and other features associated with hashtags and their relationships.
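      For instance, a minimal sketch of the temporal variant (assuming the Time() extractor and its absoluteday output domain, as listed in the Feature Extractor Reference):

      LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true");
      EXTRACT Time(options:"absoluteday"), hashtag;
      PROJECT TO hashtag, absoluteday;
      OUTPUT TO "twitter_hashtags_by_day.graph";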

      What are you going to explore next? Let us know what you do! My twitter handle is @emrek, or you can reach the whole team by emailing us at discussiongraph@microsoft.com. Thanks!

      Reference Guide

      Discussion Graph Tool Reference Guides

      Basic Concepts

      In the discussion graph tool framework, a co-occurrence analysis consists of the following key steps:

      Step 1. Read from a social media data source. (Command: LOAD)

      Step 2. Extract low-level features from individual messages. (Command: EXTRACT)

      Step 3 (optional). Declare the feature that defines a co-occurrence: what defines the fact that two or more features have co-occurred? By default, two features are considered to co-occur if they both occur in the same social media message. (Command: RELATE BY)

      Steps 2 and 3 implicitly define an initial discussion graph. All feature values that were seen to co-occur in the raw social media data will be connected by hyper-edges to form a large, multi-dimensional hyper-graph.

      Step 4 (optional). By default, each social media message is weighted equally.  We can change this so that the data is weighted by user, location, or other feature.  For example, we might want data from every user to count equally, regardless of how many social media messages each user sent.  This prevents our analyses from being dominated by users who post too frequently. (Command: WEIGHT BY)

      Step 5. Project the initial discussion graph to focus on those relationships we care about for our analysis. For this step, the task must specify the domains we care about. (Command: PROJECT)

      Step 6. Output the results. (Command: OUTPUT)

      Step 7 (optional). Further analyze the results with higher-level machine learning, network analyses, and visualization techniques. This step is outside the scope of DGT.

      A sketch that puts the commands together follows below.
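      For example, a sketch of a script touching each step (not taken from the walkthroughs; the field names are illustrative, and the // comments follow the comment style of the LOAD examples in the FAQ):

      LOAD Twitter(path:"statuses.log",ignoreErrors:"true");
      EXTRACT hashtag, userid;
      RELATE BY userid;  // co-occurrence: hashtags used by the same user
      WEIGHT BY userid;  // each user counts once, however often they post
      PLANAR PROJECT TO hashtag;
      OUTPUT TO "hashtags_relateby_user.graph";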

      For more details on the core concepts behind discussion graphs, we recommend reading our ICWSM 2014 paper.

      A note on projecting weighted data

      Often, feature values are weighted. For example, the affect classifier produces a weighted feature value indicating how likely a message is to be expressing joviality, sadness, etc. (In other cases, the use of the WEIGHT BY command implicitly creates a weighted value).

      When it encounters a weighted feature value in its target domains, the PROJECT TO command treats the weights as probabilities of a feature value having occurred. For example, let’s continue our analysis of activity and location mentions such as in the following message:

      "I'm having fun hiking tiger mountain" tweeted by Alice on a Saturday at 10am

      Let’s say our mood analysis indicates that the message expresses joviality with a weight of 0.8 and serenity with a weight of 0.4, in addition to the other discrete features:

      Domain    Feature         Weighted value
      Mood      Joviality       0.8
      Mood      Serenity        0.4
      Activity  hiking          1.0
      Location  tiger mountain  1.0
      Author    Alice           1.0

      The two weighted features are interpreted as independent probabilities. That is, there is an 80% likelihood of this message being jovial and a 20% likelihood of not being jovial. Independently, there is a 40% likelihood of the message being serene, and 60% chance of not being serene.

      If we project this single message to the relationship between location and mood (PROJECT TO Mood, Location;), this message will expand to the following 4 projected edges:

      Edge                                       Weight  Metadata
      Joviality and Tiger Mountain               0.48    hiking, Alice
      Serenity and Tiger Mountain                0.08    hiking, Alice
      Joviality and Serenity and Tiger Mountain  0.32    hiking, Alice
      (No mood) and Tiger Mountain               0.12    hiking, Alice
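      These weights follow from treating the two mood weights as independent probabilities:

      P(joviality, not serenity) = 0.8 * (1 - 0.4) = 0.48
      P(serenity, not joviality) = (1 - 0.8) * 0.4 = 0.08
      P(joviality and serenity)  = 0.8 * 0.4       = 0.32
      P(no mood)                 = (1 - 0.8) * (1 - 0.4) = 0.12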

      Of course, when analyzing a larger corpus of social messages, each message will be expanded individually and the results aggregated.

      Script Command Reference

      The discussion graph tool’s scripting language currently supports the following commands.

      Note that square brackets [ ] indicate optional elements of the command. Italicized terms indicate user-specified arguments, variable names, etc. of the command.

      LOAD

      Syntax: LOAD Datasource([arguments]);

      Example: LOAD MultiLine(path:"productreviews.txt");

      The LOAD command loads social media data from some datasource. The required arguments are datasource-specific. Generally, datasources require a path to the input file as well as schema information to interpret the file. See the Common things you’ll want to do section below for examples of loading TSV, Multiline record, JSON and Twitter files.

      EXTRACT

      Syntax: EXTRACT [PRIMARY] field|FeatureExtractor([arguments]),… [FROM varname];

      Example: EXTRACT PRIMARY hashtag, Gender(), AffectDetector();

      The EXTRACT command runs a series of feature extractors against the raw social media messages loaded from a data source via the LOAD command.

      Extracting a field will pass through a field from the raw data unmodified.

      Extracting a feature using a FeatureExtractor() will run the specified feature extractor against the social media message. Feature extractors may generate 0, 1 or more feature values for each message they process, and the domain of the feature need not match the name of the feature extractor. For example, the AffectDetector() generates features in several domains (Subjective, Mood and PosNegAffect), and other feature extractors, such as Phrases() can generate features in custom domains.

      The PRIMARY flag acts as a kind of filter on the raw social media data. EXTRACT must find at least one PRIMARY field or feature in a message, otherwise the message will be ignored. If no fields or features are marked as PRIMARY, then EXTRACT will not filter messages.

      FROM varname tells the EXTRACT command where to get its input data. If not specified, EXTRACT will read from the output of the previous command.

      WEIGHT BY

      Syntax: WEIGHT BY featureDomain[, …] [FROM varname];

      Example: WEIGHT BY userid;

      The WEIGHT BY command reweights the data from social media messages. By default, every social media message counts as a single observation.  If we see a co-occurrence relationship occurring in 2 social media messages, then the co-occurrence relationship will have a weight of 2.  We can change this using the WEIGHT BY command so that every unique user (or location or other feature value) counts as a single observation.  So, for example, if a co-occurrence relationship is expressed by 2 unique users, then it will have a weight of 2.  Conversely, if a single user expresses 2 distinct co-occurrence relationships, each relationship will have a weight of only 0.5.

      Note that we can WEIGHT BY one feature but RELATE BY another feature.

      RELATE BY

      Syntax: RELATE BY featureDomain [FROM varname];

      Example: RELATE BY userid;

      The RELATE BY command declares the domain that defines a co-occurrence relationship. All features that co-occur with the same feature value in this domain are considered to have co-occurred.

      FROM varname tells the RELATE BY command where to get its input data. If not specified, RELATE BY will read from the output of the previous command.

      Note that we can WEIGHT BY one feature but RELATE BY another feature.

      PROJECT

      Syntax: PROJECT TO [featureDomain, …] [FROM varname];

      Variants: PLANAR PROJECT TO [featureDomain, …] [FROM varname];

      Variant: PLANAR BIPARTITE PROJECT TO [featureDomain, …] [FROM varname];

      Example: PROJECT TO hashtag;

      The PROJECT TO command will project an initial hyper-graph to focus on only relationships among the specified feature domains. That is, only edges which connect 1 or more nodes in the specified domains will be kept, and any nodes in other feature domains will be removed from the structure of the graph. By default, the PROJECT TO command generates a hyper-graph. This means that nodes that do not co-occur with other nodes will still be described by a degenerate 1-edge. Also, if many nodes simultaneously co-occur together, their relationship will be described by a k-edge (where k is the number of co-occurring nodes).

      Often, especially for ease of visualization, it is useful to restrict the discussion graph to be a planar graph (where every edge in the graph connects exactly 2 nodes). The PLANAR PROJECT TO command achieves this. All hyper-edges will be decomposed and re-aggregated into their corresponding 2-edges.

      Furthermore, it can be useful to restrict the graph to be bipartite, where only edges that cross domains are kept. For example, we may only care about the relationship between users and the hashtags they use, and not care about the relationship among hashtags themselves. The PLANAR BIPARTITE PROJECT TO command achieves this. Semantically, this is the equivalent of doing a planar projection and then dropping all edges that connect nodes in the same domain.
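      For example, a minimal sketch of a user-to-hashtag bipartite projection (assuming Twitter data and field names from the Twitter table in the FAQ):

      LOAD Twitter(path:"statuses.log",ignoreErrors:"true");
      EXTRACT userscreenname, hashtag;
      PLANAR BIPARTITE PROJECT TO userscreenname, hashtag;
      OUTPUT TO "user_hashtag.graph";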

      MERGE

      Syntax: MERGE varname1,varname2[,…];

      Example: MERGE MentionAndUserGraph,HashTagAndUserGraph;

      The MERGE command overlays two discussion graphs atop each other. Nodes with the same feature domain and values will be merged.
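      For example, a sketch of building and merging the two graphs named above, using named variables (see Naming variables below; the variable assignments are illustrative):

      var tweets = LOAD Twitter(path:"statuses.log",ignoreErrors:"true");
      var features = EXTRACT userscreenname, hashtag, mentionuserscreenname FROM tweets;
      var HashTagAndUserGraph = PLANAR PROJECT TO userscreenname, hashtag FROM features;
      var MentionAndUserGraph = PLANAR PROJECT TO userscreenname, mentionuserscreenname FROM features;
      MERGE MentionAndUserGraph, HashTagAndUserGraph;
      OUTPUT TO "mentions_and_users.graph";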

      OUTPUT

      Syntax: OUTPUT TO "filename.graph" [FROM varname];

      Example: OUTPUT TO "mentions.graph";

      The OUTPUT TO command saves a discussion graph to the specified file.

      Files are saved in DGT’s native format. This format consists of 3 tab-separated columns. The first column is the edge identifier: the comma-separated list of nodes connected by this edge. The second column is the count of the number of times this co-occurrence relationship was observed. The third column is a JSON-formatted representation of the context of the relationship or, in other words, the distribution of feature values conditioned on the co-occurrence relationship.

      Naming variables

      We can assign the result of commands to variables, and use these variables in later commands:

      Syntax:

      var x = COMMAND1;
      COMMAND2 FROM x;

      Example:

      var reviewData = LOAD Multiline(path:"finefoods.txt.gz");
      var reviewFeatures = EXTRACT AffectDetector(), review_score FROM reviewData;

      Feature Extractor Reference

      Here’s a current list of feature extractors included in the discussion graph tool release.

      Each entry below lists the feature extractor (with a short description), its arguments, and its output domains.
      AffectDetector()

      Infers mood from text

      field: input field to analyze (default=’text’)

      Mood: weights for 7 moods (joviality, sadness, guilt, fatigue, hostility, serenity, fear)

      PosNeg: aggregation of positive/negative affects

      Gender()

      Infers gender from user names

      field: input field to analyze (default=’username’)

      discrete: whether to output discrete or weighted gender values (default=’true’)

      gender: m=male, f=female, u=unknown
      GeoPoint()

      explicit lat-lon coordinates

      field: input field to analyze (default=’geopoint’)

      rounding: number of decimal places to include

      geopoint: lat-lon value
      GeoShapeMapping()

      Maps lat-lon points to feature values via a user-specified GeoJSON formatted shapefile

      field: input field to analyze (default=’geopoint’).  this field should contain both lat and lon coordinates, separated by a space or comma.

      latfield: input field containing latitude value.

      lonfield: input field containing longitude value.

      shapefile: GeoJSON formatted shapefile

      propertynames: comma separated list of property:domain pairs.  The property names a property within the shapefile, and the domain specifies a custom domain name for that property.  If a lat-lon point falls within a shape specified in the shapefile, the feature extractor will output all the specified properties in the propertynames list.

      unknownvalue: value to assign to a lat-lon outside of given shapes

      Note: Please specify either the field argument or both the latfield and lonfield arguments.

      [custom domain name]
      Country()

      An instance of GeoShapeMapping that maps lat-lon to country/region two-letter codes and country/region names

      field: input field to analyze (default=’geopoint’). this field should contain both lat and lon coordinates, separated by a space or comma.

      latfield: input field containing latitude value.

      lonfield: input field containing longitude value.

      unknownvalue: value to assign to a lat-lon outside of countries/regions

      Note: Please specify either the field argument or both the latfield and lonfield arguments.

      fips_country:

      country:

      USAState()

      An instance of GeoShapeMapping that maps lat-lon to USA subregions and states

      field: input field to analyze (default=’geopoint’). this field should contain both lat and lon coordinates, separated by a space or comma.

      latfield: input field containing latitude value.

      lonfield: input field containing longitude value.

      unknownvalue: value to assign to a lat-lon outside of US states

      Note: Please specify either the field argument or both the latfield and lonfield arguments.

      USA_subregion:

      USA_state:

      USA_fips:

      CountyFIPS()

      An instance of GeoShapeMapping that maps lat-lon to US county names and FIPS codes

      field: input field to analyze (default=’geopoint’). this field should contain both lat and lon coordinates, separated by a space or comma.

      latfield: input field containing latitude value.

      lonfield: input field containing longitude value.

      unknownvalue: value to assign to a lat-lon outside of US counties

      Note: Please specify either the field argument or both the latfield and lonfield arguments.

      countygeoid:

      countyname:

      Time()

      Extracts various temporal features

      field: input field to analyze (default=’creationdate’)

      options: list of time features to extract: absoluteminute, absolutehour, absoluteday, absoluteweek, monthofyear, dayofweek, hourofday. (default is to output all fields)

      format: ‘unix’ or ‘ticks’ (default=’unix’)

      absoluteminute:

      absolutehour:

      absoluteday:

      absoluteweek:

      monthofyear:

      dayofweek:

      hourofday:

      ProfileLocation()

      Maps geographic regions from user profile locations with a user-specified mapping file

      field: input field to analyze (default=’userlocation’)

      domain: set custom output domain

      mappingfile: model for mapping from user location names to geographic locations. DGT comes with a mapping file for major international metropolitan areas, and United States country regions and divisions.

      unknownvalue: value to assign to unrecognized profile locations

      [custom domain name]
      ProfileLocationToCountry()

      Maps user profile locations to 2-letter country/region FIPS codes

      field: input field to analyze (default=’userlocation’)

      unknownvalue: value to assign to unrecognized profile locations

      country:
      ProfileLocationToCountryName()

      Maps user profile locations to country/region names

      field: input field to analyze (default=’userlocation’)

      unknownvalue: value to assign to unrecognized profile locations

      countryname:
      ProfileLocationToUSASubregion() 

      Maps user profile locations to subregions of USA (e.g., Pacific, Mid-Atlantic)

      field: input field to analyze (default=’userlocation’)

      unknownvalue: value to assign to unrecognized profile locations

      usa_subregion:
      ProfileLocationToUSAState()

      Maps user profile locations to US states

      field: input field to analyze (default=’userlocation’)

      unknownvalue: value to assign to unrecognized profile locations

      usa_state:
      ProfileLocationToUSACounty() 

      Maps user profile locations to US county FIPS codes

      field: input field to analyze (default=’userlocation’)

      unknownvalue: value to assign to unrecognized profile locations

      usa_county:
      ProfileLocationToUSACountyName()

      Maps user profile locations to US county names

      field: input field to analyze (default=’userlocation’)

      unknownvalue: value to assign to unrecognized profile locations

      usa_countyname:
      ProfileLocationToMetroArea()

      Maps user profile locations to major metropolitan areas

      field: input field to analyze (default=’userlocation’)

      unknownvalue: value to assign to unrecognized profile locations

      metroarea:
      ExactPhrases()

      Matches specific phrases in a given list or mapping file

      field: input field to analyze (default=’text’)

      domain: set custom output domain

      accept: a comma-separated list of phrases to match

      acceptfile: a text file listing phrases. Use a tab-separated two-column file to specify canonical forms for matched phrases

      [custom domain name]
      Regex()

      Matches regular expressions

      field: input field to analyze

      domain: set custom output domain

      regex: the regular expression to match against text

      [custom domain name]
      Tokens()

      Extracts unigram tokens

      field: input field to analyze

      domain: set custom output domain

      stopwordsfile: file of tokens to ignore (default=none)

      porter: use porter stemming (default=”false”)

      [custom domain name]

      FAQ

      Common things you’ll want to do

      Load data in different formats

      DGT can load social media data in delimiter-separated TSV and CSV files, line-based JSON format (including the output of common Twitter downloaders) and multi-line record formats.

      TSV and CSV data

      To load a TSV or CSV, use the following LOAD command. The path to a file is required. Also, either the hasHeader flag must be set to true (indicating the first row of the file is a header line) or the schema argument must be set.

      LOAD TSV(path:"filename.txt",
               fieldSeparator:",", // optional: default is tab character
               ignoreErrors:"true", // optional: default is false
               hasHeader:"false", // optional: default is false
               schema:"col1,col2,..." // either hasHeader:"true" or a schema is required
               );

      Multi-line record data

      A multi-line record formatted file includes a single record field per line, with a blank line separating records. For example:

      name: Bob
      text: hello world!
      messagetime: 5/4/2013

      name: Alice
      text: hello back!
      messagetime: 5/5/2013

      To load a multiline record, use the following LOAD command. Only the path argument is required. The schema is implicit in the file itself.

      LOAD Multiline(path:"filename.txt",
                     fieldSeparator:":", // optional: default is : character
                     ignoreErrors:"true" // optional: default is false
                     );

      JSON file

      DGT can read JSON line formatted files (where each line of a text file is a JSON object).

      LOAD JSON(path:"filename.txt",
                ignoreErrors:"true",
                schema:"field1:jsonpath1,field2:jsonpath2,...");

      The schema must specify both the fields to be extracted as well as their JSON paths. If multiple values in the JSON object match a given path, the field will be assigned multiple values.
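      For example, a minimal sketch (the file name and schema here are illustrative, reusing JSON paths from the Twitter table below):

      LOAD JSON(path:"messages.json",
                ignoreErrors:"true",
                schema:"text:text,userlocation:user/location,userscreenname:user/screen_name");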

      Twitter data

      DGT also includes a pre-defined data source for Twitter output of the twitter-tools utilities. (This is a JSON-line formatted file) To load the output of the twitter-tools utilities, use the following LOAD command. Only the path argument is required.

      LOAD Twitter(path:"filename.txt");

      This data source includes schema definitions for most of the common Twitter fields:

      Field                  JSON path
      contextid              id_str
      createdat              created_at
      text                   text
      inreplytostatusid      in_reply_to_status_id
      inreplytoscreenname    in_reply_to_screen_name
      userid                 user/id_str
      username               user//name
      userscreenname         user/screen_name
      userlocation           user/location
      lang                   lang
      userdescription        user/description
      userfollowerscount     user/followers_count
      userfriendscount       user/friends_count
      userlistedcount        user/listed_count
      usercreatedat          user/created_at
      userfavouritescount    user/favourites_count
      userutcoffset          user/utc_offset
      usertimezone           user/time_zone
      userverified           user/verified
      userstatusescount      user/statuses_count
      retweetcreatedat       retweeted_status/created_at
      retweetid              retweeted_status/id_str
      retweettext            retweeted_status/text
      retweetuser            retweeted_status/user/id_str
      retweetusername        retweeted_status/user//name
      retweetuserscreenname  retweeted_status/user/screen_name
      hashtag                entities/hashtags/text
      symbol                 entities/symbols/text
      url                    entities/urls/url
      urlexpanded            entities/urls/expanded_url
      mentionuserid          entities/user_mentions/id_str
      mentionusername        entities/user_mentions//name
      mentionuserscreenname  entities/user_mentions/screenname
      geopoint               geo/coordinates/$$

      Filter out irrelevant messages

      Sometimes a specific social media message is simply irrelevant to a specific analysis. For example, in a study about hashtag usage on Twitter, we might want to ignore messages that do not have hashtags. To do this, we can use the PRIMARY keyword of the EXTRACT command.

      EXTRACT PRIMARY hashtag, PRIMARY mention, AffectDetector();

      In this example, we have marked the hashtag and mention fields as PRIMARY fields (any field or feature extractor may be marked as PRIMARY). This PRIMARY flag tells the EXTRACT command that it must find either a hashtag or a mention value in a message in order to continue processing it. If a message has either a hashtag or a mention, EXTRACT will also run the AffectDetector() and pass the values along to the rest of the script. If a message does not have any hashtag value and does not have any mention value, then that message will be ignored.

      The PRIMARY flag can be combined with the acceptfilter and rejectfilter arguments accepted by most feature extractors. If you want to only analyze social media messages by women, for example, you can use the acceptfilter argument to achieve this:

      EXTRACT PRIMARY Gender(accept:"f"), hashtag, mention;

      The Gender feature extractor understands the acceptfilter argument, and will only output feature values that match the list. The result in this case is that only messages where the author’s gender is identifiably female will be processed. (Note that the hashtag and mention fields are no longer marked as PRIMARY fields.)

      If you have a long list of values you want to accept, you can use the acceptfilterfilename argument. The syntax and behavior for the acceptfilter and acceptfilterfilename arguments are the same as for the ExactPhrases() feature extractor.

      Detect phrases and words in tweet text

      You can configure the ExactPhrases() feature extractor with a different set of arguments to detect different phrases. Here is an adaptation of our “politician detector” from our simple example, this time modified to detect mentions of parents. By default, phrase detection is case-insensitive.

      EXTRACT ExactPhrases(domain:"parent",accept:"dad,mom,father,mother");

      If you have a long list of phrases you want to detect, you can put them in a file and reference them in your processor. Note that you also have to specify the datafile as a resource, so that the framework knows to include that file as part of the job.

      EXTRACT ExactPhrases(domain:"parent",acceptfile:"parentphrases.txt");

      In its simplest form, this file is a list of phrases to detect. You can also use this file to group or canonicalize detected phrases by adding a second tab-separated column that includes the canonical form. For example, if you used the following file, it would detect the nicknames for parents and map them to their canonical name. That is, whenever the phrase extractor finds “mommy” or “mom”, the extracted feature will be emitted as “mother”.

      mom	mother
      mother	mother
      mommy	mother
      dad	father

      Import results into R, Excel, Gephi or other tools

      Often, you will want to perform further higher-level analyses (machine learning, visualizations and/or statistical analyses) on the output of DGT. To do so, we provide utilities to convert from DGT’s native output format to TSV and GEXF files that will let you load the data in R, Excel, Gephi and other tools.

      To convert to TSV, use the dgt2tsv.exe command:

      dgt2tsv.exe input.graph [outputfields] outputfilename.tsv

      The list of outputfields may include “count”, any of the domains output by a feature extractor, a domain name followed by “.count”, or a domain name followed by a specific feature value.

      For example, the following command will output a count of the number of messages seen for each edge in a discussion graph; the gender of the author; and the weight of the “fatigue” value in the Mood domain.

      dgt2tsv.exe input.graph count,gender,Mood.fatigue output.tsv

      To output a .gexf file that can be read by Gephi for graph analyses and visualizations, use the dgt2gexf.exe command:

      dgt2gexf.exe input.graph [outputfields] outputfilename.gexf