Discussion Graph Tool

Established: April 25, 2014

Discussion Graph Tool (DGT) is an easy-to-use analysis tool that provides a domain-specific language for extracting co-occurrence relationships from social media, and automates the tasks of tracking the context of relationships and other best practices. DGT provides a single-machine implementation, and also generates map-reduce-like programs for distributed, scalable analyses.

DGT simplifies social media analysis by making it easy to extract high-level features and co-occurrence relationships from raw data.

With just 3-4 simple lines of script, you can load your social media data, extract complex features, and generate a graph among arbitrary features. Throughout, DGT automates best practices, such as tracking the context of relationships.
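
For example, here is the complete script developed in walkthrough #1 below; it loads Amazon review data, extracts mood, gender and review-score features, and aggregates the mood and gender context by review score:

LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");
EXTRACT AffectDetector(field:"review_text"),
        Gender(field:"review_profileName"),
        review_score;
PROJECT TO review_score;
OUTPUT TO "finefoods_reviewscore_context.graph";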

Available Features

Out-of-the-box feature extraction for common scenarios, including mood and geo-location, as well as customizable dictionary and regular expression-based extractions.

Analyze text for signs of joviality, fatigue, sadness, guilt, hostility, fear, and serenity. Map lat-lon coordinates to FIPS county codes. Recognize gender based on name.
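
For example, a single EXTRACT statement can combine these analyses (a sketch; the lat/lon field names here are hypothetical and depend on your data's schema):

EXTRACT AffectDetector(field:"text"),
        Gender(field:"username"),
        CountyFIPS(latfield:"lat", lonfield:"lon");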

Identifies co-occurrence relationships within social media messages, user behaviors, locations or other features.

Extracts planar graphs and hyper-graphs of co-occurrence relationships, and tracks contextual statistics for each relationship.

Import raw social media data from existing sources.

Reads delimiter-separated TSV and CSV files, line-based JSON format (including the output of common Twitter downloaders) and multi-line record formats.

Analyze results in popular tools such as R, Gephi, and Excel.

Outputs JSON, TSV and GEXF.

Extend DGT with custom feature extractors

Incorporate your own feature extractors with DGT through a simple API. This makes it easy for others to build on your techniques and mix-and-match with others.

More coming soon…

News

Aug 13: Some people were seeing errors trying to run the binaries because of an invalid signature. We’ve fixed that now. Thanks for the bug reports!

Aug 8: We’ve updated the DGT release, adding support for weighting data and projection on weighted values.  We’ve also updated and expanded our location mapping capabilities to map lat-lon coordinates and user-specified locations to countries, US states and US counties.

June 19: Our first release is available!  Get in touch with your questions.  We’re looking for feedback. Tweet @emrek or email the team at discussiongraph@microsoft.com.  Thanks!

June 16:  In preparation for our tool release, we’ve added 2 new step-by-step walkthroughs on analyzing the moods of product reviews and extracting graphs of hashtag relationships on Twitter.

Read More

Our step-by-step walkthroughs and our reference guide give details about the tool and its usage.

Read more about our tool and using it for deeper contextual analyses in our ICWSM 2014 paper, “Discussion Graphs: Putting Social Media Analysis in Context”, by Kıcıman, Counts, Gamon, De Choudhury and Thiesson. [PDF]

Discussion and Feedback

Have a question about how to use DGT for an analysis? Have feedback or a bug report? Want to use your own feature extractor within DGT?

Contact @emrek via Twitter or reach all of us via email at discussiongraph@microsoft.com.

Coming Soon

We are continuing development of the public release of DGT.  Here is what is currently under development:

  • Qualitative sampling of raw data that supports each extracted relationship.
  • FILTER command for conditioning analyses on demographic or other feature values.
  • (Now available as of version 0.6) Improved support for extracting relationships among continuous or weighted feature values
  • Improved aggregation/summarization performance

Feature Extractors

A feature extractor in DGT is responsible for analyzing the raw data of a social media message and recognizing, extracting, inferring or detecting higher level information.  The raw data of a social media message may include the text as well as available metadata about the message, the message author, and other geo-temporal or social context.

DGT includes several out-of-the-box feature extractors for common scenarios.  These include some complex analysis tasks, such as mood inference and geo-location mapping, as well as support for simpler analyses, such as customizable dictionary and regular expression-based feature extractors.

The reference guide lists the feature extractors included in DGT, and examples of using the customizable feature extractors.
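
For instance, the customizable extractors can be configured inline (a sketch based on the reference guide below; the domain names, phrase list and regular expression are illustrative):

EXTRACT ExactPhrases(domain:"fruit", accept:"apple,banana,orange"),
        Regex(domain:"price", field:"text", regex:"\$[0-9]+");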

The TREC 2013 Microblog track provided a convenient set of tools for retrieving tweets, including a tool for sampling from the public Twitter stream.  To install this tool and begin downloading tweets, follow these instructions:

  1. Install the prerequisite software
    1. Java Development Kit
    2. Apache Maven
  2. Download the twitter-tools zip file from https://github.com/lintool/twitter-tools/ and extract it on your computer
  3. Open a command-line to the directory where you extracted the twitter-tools zip file and run the following two commands to build the twitter-tools program
> cd twitter-tools-core
> mvn clean package appassembler:assemble

  4. Follow the instructions on the twitter-tools site for creating your Twitter access tokens, setting up a twitter4j.properties file, and running the GatherStatusStream.bat program to retrieve tweets from the public Twitter stream.

Install

Installing the Discussion Graph tool

This short step-by-step walks you through installing DGT and adding it to your execution path.

To install the Discussion Graph Tool, download the latest DGT release as a zip file.

  1. Install the prerequisite .Net Framework 4.5
  2. Check that the downloaded DGT zip file is “unblocked” on your computer.  Right-click the downloaded zip file, click “Properties…”, ensure that the “Unblock” check box is checked, and click Apply (or press the “Unblock” button in older versions of Windows).
  3. Extract the dgt-0.5.zip file to a location, such as your user directory c:\users\myName (where myName is your login), c:\program files, or an alternate location, such as e:\.  Wherever you decide to extract the zip file, you should find a dgt directory, and within it a bin directory.
  4. Edit the system environment variables to add the dgt-0.5\bin directory to the execution path.
    1. To do so, open the Control Panel, search for Environment Variables and click “Edit the System Environment Variables”.
    2. In the Advanced tab, click the environment variables button
    3. Select the PATH variable from the system variables list and click the Edit button.
    4. Edit the variable value (the current search paths), and append a ‘;’ (semicolon character, without the quotes) and the full path to the DGT binaries.  (Don’t forget to include the trailing dgt\bin directory, e.g., c:\users\myName\dgt\bin, c:\program files\dgt\bin or e:\dgt\bin.)
    5. Click OK.
  5. Test the installation
    1. Open a new command-line window (run cmd.exe)
    2. Type the command “dgt”.  You should see the following output.
>dgt --help
Discussion Graph Tool Version 0.5
More info: https://www.microsoft.com/en-us/research/project/discussion-graph-tool/
Contact: discussiongraph@microsoft.com
Usage: dgt.exe filename.dgt [options]
Options:
  --target=local|...    Specify target execution environment.
  --config=filename.xml  Specify non-default configuration file

To learn more about the Discussion Graph Tool, read the getting started guide and the step-by-step walkthroughs.

Walkthroughs

Walkthrough #1: Analyzing Mood of Product Reviews

This walkthrough focuses on answering the question: How does mood (joviality, anger, guilt, …) correlate with product review score? Does this vary by gender? As a bonus, see how to extract a graph of products based on their common reviewers. Read the step-by-step.

In this walkthrough, we will be working with Amazon review data for fine food products. First, we are going to ask the question, “what are the moods associated with positive and negative reviews?” Then, we will go a little deeper into the data and see how the mood distributions differ based on the gender of the reviewer, and also suggest other explorations.

Through this example, we will introduce the basic concepts and commands of a DGT script. We’ll show how to load data, extract fields and derived features from social media; and project and aggregate the results.

Getting the Discussion Graph Tool

Step 1. Download the Discussion Graph Tool (DGT)

If you haven’t already, download and install the discussion graph tool. The rest of this walkthrough will assume that you have installed the tool and added it to your executable path.

To double-check the installation, open a new command-line window and type the command “dgt --help”. You should see the following output:

>dgt --help
Discussion Graph Tool Version 0.5
More info: https://www.microsoft.com/en-us/research/project/discussion-graph-tool/
Contact: discussiongraph@microsoft.com
Usage: dgt.exe filename.dgt [options]
Options:
  --target=local|...    Specify target execution environment.
  --config=filename.xml  Specify non-default configuration file

Step 2. Create a new directory for this walkthrough.  Here, we’ll use the directory E:\dgt-sample.

>mkdir e:\dgt-sample

 Getting the Data

Before we start to write our first script, let’s get some data to analyze. We’ll be using Amazon review data collected by McAuley and Leskovec. This dataset includes over 500,000 reviews of 74k food-related products. Each review record includes a product id, user id, user name, review score, helpfulness rating, timestamp and both review and summary text.  The user names are often real names, and review scores are integers on a scale from 1 to 5.

Step 3. Download finefoods.txt.gz from the Stanford Network Analysis Project’s data archive. Save the file to E:\dgt-sample.

e:\> cd e:\dgt-sample
e:\dgt-sample> dir
 Volume in drive E is DISK
 Volume Serial Number is AAAA-AAAA

 Directory of E:\dgt-sample

06/10/2014  11:17 AM    <DIR>          .
06/10/2014  11:17 AM    <DIR>          ..
06/10/2014  11:16 AM       122,104,202 finefoods.txt.gz
               1 File(s)     122,104,202 bytes
               2 Dir(s)  45,007,622,144 bytes free

Writing the Script

There are 4 basic commands we will use in our script: LOAD for loading data; EXTRACT for extracting features from the raw data; PROJECT for projecting specific relationships and context from the raw data; and OUTPUT for saving the result to a file. Let’s take things step-by-step.

Step 4. Create a new file mood-reviews.dgt. Use notepad.exe, emacs, vi or your favorite text editor.

e:\dgt-sample> notepad mood-reviews.dgt

Step 5. LOAD the data.

The first command in the script is going to be to load the data file. The reviews we downloaded are in a multi-line record format, where each line in the file represents a key-value field of a record; and records are separated by blank lines. The LOAD MultiLine() command will parse this data file.  Add the following line as the first command in the script file:

LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");

Since the multi-line format naturally embeds the schema within the data file, we don’t have to specify it in the LOAD command.  There are some spurious newlines in the finefoods.txt.gz data, so we need to set the ignoreErrors flag to true.  This will tell DGT to ignore data that is misformatted.

Step 6. EXTRACT higher-level features from the raw data

Add the following line as the second command in the script file:

EXTRACT AffectDetector(field:"review_text"),
        Gender(field:"review_profileName"),
        review_score;

This EXTRACT statement generates 3 higher-level features:

    • The AffectDetector() call infers the affect, or mood, of a text. The field argument tells it which of the raw fields to analyze. We’ll choose the long review field but could just as easily have selected the summary field. If you don’t pass a field argument, then the AffectDetector() extractor will by default look for a field named “text” in the raw data.
    • The Gender() call infers the gender of the author, based on the author’s first name. The field argument tells it which field includes the author’s name. If you don’t pass a field argument, then the Gender() extractor will by default look for a field named “username” in the raw data.
    • By naming the review_score field—without parentheses—we tell the script to pass the review_score field through without modification.

A note on naming outputs and inputs: By default, EXTRACT, PROJECT and OUTPUT commands operate on the results of the previous statement. You can also explicitly name the results of commands. To do so, use the “var x =” notation to assign results to a variable, then add “FROM x” to later commands. For example:

var finefoodsdata = LOAD MultiLine(path:"finefoods.txt.gz",ignoreErrors:"true");
EXTRACT AffectDetector(field:"review_text"), Gender(field:"review_profileName"), review_score FROM finefoodsdata;

Step 7. PROJECT the data to focus on the relationships of importance

Now, we tell the script what relationships we care about. Often, we’ll be using DGT to extract a graph of co-occurrence relations from a set of data. In this first example, we’re going to ask for a simpler result set, essentially using DGT as a simple aggregator or “group by” style function.  Add the following line to the script:

PROJECT TO review_score;

By projecting to “review_score”, we are telling DGT to build a co-occurrence graph among review scores. By default DGT assumes the co-occurrence relationships are defined by the co-occurrence of values within the same record. Since in this dataset every record has at most one review score, that means that there are no co-occurrence relationships. The resulting graph is then simply the degenerate graph of 5 nodes (1 for each score from 1 to 5).  For each of these nodes, DGT aggregates the affect and gender information that we extracted.

Step 8. OUTPUT the results to disk

Finally, we add the following command to the script to save the results:

OUTPUT TO "finefoods_reviewscore_context.graph";

If you haven’t already, now would be a good time to save your script file…  The whole script should look like this:

LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");
EXTRACT AffectDetector(field:"review_text"),
        Gender(field:"review_profileName"),
        review_score;
PROJECT TO review_score;
OUTPUT TO "finefoods_reviewscore_context.graph";

Run the Script

Step 9. From the command line, run DGT against the script mood-reviews.dgt:

e:\dgt-sample> dgt.exe mood-reviews.dgt

The output file “finefoods_reviewscore_context.graph” should now be in the e:\dgt-sample directory.  Each row of the output file represents a review score, since that is what we projected to in our script. Columns are tab-separated: the first column of each row is the name of the edge (or nodes) in the graph; the second column is the count of records seen with the given review score; and the third column is a JSON-formatted bag of data distributions for gender and affect observations.
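
For illustration, a single row of this file might look like the following (the counts, weights and JSON layout here are hypothetical, shown only to convey the three-column shape):

5	123456	{"gender":{"m":0.46,"f":0.51,"u":0.03},"mood":{"joviality":0.41,"sadness":0.08}}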

To import this data into R, Excel or other tools, we have included a command-line utility dgt2tsv.exe that can pull out specific values.  Use the following command to build a TSV file that summarizes the gender and mood for each review score:

e:\dgt-sample> dgt2tsv.exe finefoods_reviewscore_context.graph count,gender.m,gender.f,gender.u,mood.joviality,mood.fatigue,mood.hostility,mood.sadness,mood.serenity,mood.fear,mood.guilt finefoods_reviewscore_gendermood.tsv

Here’s a quick graph of how mood varies across review scores.

We see that joviality increases and sadness decreases with higher review scores.  We see that there is more hostility in lower review scores and more serenity in higher review scores.  While most moods are monotonically increasing or decreasing with review score, we see that guilt peaks in 2- and 3-star reviews.

Further Explorations

The design goal of DGT is to make it easy to explore the relationships embedded in social media data and capture the context of the discussions from which the relationships were inferred.

Are the distributions of mood across review scores different for men and women? Conditioning the mood distributions on gender as well as review score gives us this information.  We can do this simply by adding the gender field to our PROJECT command, as follows (note the new PROJECT TO line):

LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");
EXTRACT AffectDetector(field:"review_text"),
        Gender(field:"review_profileName"),
        review_score;
PROJECT TO review_score, gender;
OUTPUT TO "finefoods_reviewscore_gender_context.graph";

Here’s a quick look at the results.  Here, I’ve graphed the joviality (solid line) and sadness (dashed line) for men (orange) and women (green).  We see that the general trends hold, though there are some differences that one might continue digging deeper into…

How are products related to each other by reviewer?  For example, how many people that wrote a review of “Brand A Popcorn” also wrote about “Brand X chocolate candies”?  We can answer this question by defining a co-occurrence relationship based on user id.  That is, we’ll say that two product ids are related if the same user reviewed both products.  Here’s how we do that in the script:

LOAD Multiline(path:"finefoods.txt.gz",ignoreErrors:"true");
EXTRACT product_productId, review_userId;
RELATE BY review_userId;
PLANAR PROJECT TO product_productId AGGREGATE();
OUTPUT TO "finefoods_products_relateby_user.graph";

(We’ll learn more about the RELATE BY and PLANAR PROJECT commands in the next walkthroughs.)  This will generate a discussion graph that connects pairs of products that were reviewed by the same person.  We can convert this into a file readable by the Gephi graph visualization tool using the dgt2gexf command:

e:\dgt-sample> dgt2gexf.exe finefoods_products_relateby_user.graph count finefoods_products_relateby_user.gexf filterbycount=1000

The dgt2gexf command mirrors the dgt2tsv command.  In this case, we decided to use a filterbycount option to only output edges that have at least 1000 users who have co-reviewed the pair of products.  This filter helps keep the visualization relatively manageable.

Here’s the resulting product graph, laid out using Gephi’s Fruchterman-Reingold algorithm. Each of the clusters represents a group of food products that are frequently co-reviewed on Amazon…

Walkthrough #2: Analyzing Twitter Hashtags

This walkthrough focuses on Twitter data and extracting a graph of related hashtags based on co-occurrences. Read the step-by-step.

In this walkthrough, we will be working with public stream data from Twitter. We are going to ask the question, “which hashtags are used together?”, and extract a graph of hashtags related by co-occurrence. We will also suggest further explorations, such as conditioning hashtag relationships on gender or time.

Through this example, we will revisit the basic commands of a DGT script and introduce the PLANAR PROJECT variant for extracting pair-wise co-occurrence graphs.

Getting the Discussion Graph Tool

Step 1. Download the Discussion Graph Tool (DGT)

If you haven’t already, download and install the discussion graph tool (see the detailed installation instructions). The rest of this walkthrough will assume that you have installed the tool and added it to your executable path.

To double-check the installation, open a new command-line window and type the command “dgt --help”. You should see the following output:

>dgt --help
Discussion Graph Tool Version 1.0
More info: https://www.microsoft.com/en-us/research/project/discussion-graph-tool/
Contact: discussiongraph@microsoft.com
Usage: dgt.exe filename.dgt [options]
Options:
  --target=local|...    Specify target execution environment.
  --config=filename.xml  Specify non-default configuration file

Step 2. Create a new directory for this walkthrough. Here, we’ll use the directory E:\dgt-sample.

>mkdir e:\dgt-sample

Getting Twitter Data

First, let’s get some data to analyze. We’ll be using Twitter data for this walkthrough.  Twitter doesn’t allow redistribution of its data, but does have an API for retrieving a sample stream of tweets.  There are a number of steps you’ll have to complete, including registering for API keys and access tokens from Twitter.  We’ve put up full instructions.

Step 3. Install twitter-tools package.  See our instructions.

Step 4. Download a sample of tweets.  Run the GatherStatusStream.bat program for “a while”, then press Ctrl-C to stop the download.  This will generate a file (or files) called statuses.log.YYYY-MM-DD-HH, where YYYY-MM-DD-HH represents the current date and hour.  The files may be compressed (indicated with a .gz file suffix).

Each line in this file represents a tweet (*), in JSON format, including all available metadata about the tweet, the tweet author, etc.  (* The file also includes some other information, such as tweet deletions.  There’s no need to worry about those for this walkthrough.)

> twitter-tools-master\twitter-tools-core\target\appassembler\bin\GatherStatusStream.bat
1000 messages received.
2000 messages received.
3000 messages received.
4000 messages received.
5000 messages received.
6000 messages received.
7000 messages received.
8000 messages received.
9000 messages received.
10000 messages received.
Terminate batch job (Y/N)? Y
> dir statuses*
 Volume in drive C is DISK
 Volume Serial Number is AAAA-AAAA

 Directory of E:\dgt-sample\twitter-tools-core

06/13/2014  12:53 PM        49,665,736 statuses.log.2014-06-13-12
               1 File(s)      49,665,736 bytes
               0 Dir(s)  43,039,879,168 bytes free

Writing the Script

As we saw in walkthrough #1, there are 4 basic commands we will use in our script: LOAD for loading data; EXTRACT for extracting features from the raw data; PROJECT for projecting specific relationships and context from the raw data; and OUTPUT for saving the result to a file. Let’s take things step-by-step.

Step 5. Create a new file twitter-hashtags.dgt. Use notepad.exe, emacs, vi or your favorite text editor.

e:\dgt-sample> notepad twitter-hashtags.dgt

Step 6. LOAD the data.

The first command in the script is going to be to load the data file. The tweets we downloaded are in a JSON-based record format, where each line in the file is a JSON-formatted record containing the tweet and its metadata. The LOAD Twitter() command can parse this file. Add the following line as the first command in the script file:

LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true");

The Twitter data source already knows about the key fields in the Twitter JSON data file (see the Twitter data section of the reference guide below), so we don’t have to specify any more information. The twitter-tools downloader adds some non-JSON lines to its output, so we’ll also set the ignoreErrors flag to true. This will tell DGT to ignore misformatted lines in the input.

Step 7. EXTRACT higher-level features from the raw data

Add the following line as the second command in the script file:

EXTRACT AffectDetector(), Gender(), hashtag;

This EXTRACT statement generates 3 higher-level features:

    • The AffectDetector() call infers the affect, or mood, of a text.  By default, the AffectDetector() looks for a field named “text” in the raw data, though we could set the “field” argument to make it look at other fields instead.
    • The Gender() call infers the gender of the author, based on the author’s first name. By default, the Gender() extractor looks for a field named “username” in the raw data.  Again, we could override this using the “field” argument.
    • By naming the hashtag field—without parentheses—we tell the script to pass the hashtag field through without modification.

Note: The output of twitter-tools already includes hashtags, user mentions, urls and stock symbols as explicit fields parsed out of the raw text. We’ll see in the further explorations how we can use exact phrase matching and regular expression matching to pull values out of the text ourselves.

Step 8. PROJECT the data to focus on the relationships of importance

Now, we tell the script what relationships we care about. Here, we want to extract the pair-wise co-occurrence relationships among hashtags.  That is, which hashtags are used together?  Add the following line to the script:

PLANAR PROJECT TO hashtag;

By projecting to “hashtag”, we are telling DGT to build a co-occurrence graph among hashtags. By default DGT assumes the co-occurrence relationships are defined by the co-occurrence of values within the same record.

In this exercise, we’re choosing to use a PLANAR PROJECT command because we’re going to visually display the resulting hashtag graph at the end of this walkthrough, and planar graphs are simply easier to render.  However, it’s worth noting that the planar representation is incomplete.  For example, if 3 hashtags always co-occur together, that information will be lost, because a planar graph cannot represent a three-way relationship.  A hyper-graph can represent such complex co-occurrences, however.  For this reason, the PROJECT command defaults to a hyper-graph, and we recommend using that representation if you are going to be computing on the result.
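
To make the difference concrete, suppose a single tweet contains three (hypothetical) hashtags #a, #b and #c. The two projections record it differently:

PROJECT TO hashtag;          yields one hyper-edge:  (#a, #b, #c), with count 1
PLANAR PROJECT TO hashtag;   yields three 2-edges:   (#a, #b), (#a, #c), (#b, #c), each with count 1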

Step 9. OUTPUT the results to disk

Finally, we add the following command to the script to save the results:

OUTPUT TO "twitter_hashtags.graph";

If you haven’t already, now would be a good time to save your script file… The whole script should look like this:

LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true");
EXTRACT AffectDetector(), Gender(), hashtag;
PLANAR PROJECT TO hashtag;
OUTPUT TO "twitter_hashtags.graph";

Run the Script

Step 10. From the command line, run DGT against the script twitter-hashtags.dgt:

e:\dgt-sample> dgt.exe twitter-hashtags.dgt

The output file “twitter_hashtags.graph” should now be in the e:\dgt-sample directory. Each row of the output file represents a relationship between a pair of hashtags, since we projected to the planar relationship between co-occurring hashtags in our script. Columns are tab-separated: the first column of each row is the name of the edge in the graph (the edge name is simply the concatenation of the two node names, in this case the two hashtags); the second column is the count of tweets seen with the pair of hashtags; and the third column is a JSON-formatted bag of data distributions for gender and affect observations.

To import this data into visualization and analysis tools, we have included two command-line utilities dgt2tsv.exe and dgt2gexf.exe that can extract specific values into a tab-separated values (TSV) file or a Graph Exchange XML Format (GEXF) file.

We’ll use the dgt2gexf command and visualize the result with the Gephi graph visualization tool:

e:\dgt-sample> dgt2gexf.exe twitter_hashtags.graph count twitter_hashtags.gexf

If your twitter sample is large, you might consider adding the option “filtercount=N” (without the quotes) to the command-line.  This will only include edges that were seen at least N times in your sample.  Use an appropriate number, from 10 to 1000 or higher, depending on the size of your sample.

Here’s the resulting hashtag graph.  Each of the clusters represents a group of hashtags that are frequently co-mentioned in our tiny sample of Twitter data…

For clarity and fun, we’ll filter out low-frequency edges and zoom into one of the clusters of hashtags about world-cup related topics.  We see from the thickness of the edges that #NED and #ESP are the most frequently co-occurring hashtags, and each also co-occurs relatively frequently with #WorldCup.  We also see a number of people piggy-backing on the popular #worldcup hashtag with topically unrelated hashtags (#followers, #followback, #retweet, #followme)  to solicit followers and retweets.

Further Explorations

There are many interesting things to explore in hashtag relationships: the evolution of hashtag relationships over time (for example, use PROJECT TO hashtag,absoluteday;), hashtag relationships conditioned on gender (PROJECT TO hashtag,Gender();), and inspections of token distributions, moods and other features associated with hashtags and their relationships. A sketch of the temporal variant follows.
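
For example, here is a sketch of a variant of our script that conditions hashtag relationships on the day they were used, using the Time() extractor's absoluteday output as listed in the reference guide:

LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true");
EXTRACT Time(options:"absoluteday"), hashtag;
PROJECT TO hashtag, absoluteday;
OUTPUT TO "twitter_hashtags_by_day.graph";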

What are you going to explore next? Let us know what you do! My twitter handle is @emrek, or you can reach the whole team by emailing us at discussiongraph@microsoft.com. Thanks!

Reference Guide

Discussion Graph Tool Reference Guides

Basic Concepts

In the discussion graph tool framework, a co-occurrence analysis consists of the following key steps:

Step 1: Read from a social media data source. (Command: LOAD)

Step 2: Extract low-level features from individual messages. (Command: EXTRACT)

Step 3 (optional): Declare the feature that defines a co-occurrence. What defines the fact that two or more features have co-occurred? By default, two features are considered to co-occur if they both occur in the same social media message. (Command: RELATE BY)

Steps 2 and 3 implicitly define an initial discussion graph: all feature values that were seen to co-occur in the raw social media data will be connected by hyper-edges to form a large, multi-dimensional hyper-graph.

Step 4 (optional): Re-weight the data. By default, each social media message is weighted equally. We can change this so that the data is weighted by user, location, or other feature. For example, we might want data from every user to count equally, regardless of how many social media messages each user sent. This would prevent our analyses from being dominated by users who post too frequently. (Command: WEIGHT BY)

Step 5: Project the initial discussion graph to focus on those relationships we care about for our analysis. For this step, the task must specify the domains we care about. (Command: PROJECT)

Step 6: Output results. (Command: OUTPUT)

Step 7 (optional): Often, we’ll want to further analyze our results with higher-level machine learning, network analyses, and visualization techniques. This is outside the scope of DGT.
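
Putting the steps together, a full pipeline might look like the following sketch, using the Twitter data source from walkthrough #2 (RELATE BY and WEIGHT BY are the optional steps 3 and 4):

LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true"); // step 1: read
EXTRACT Gender(), hashtag, userid;                                   // step 2: extract features
RELATE BY userid;   // step 3: hashtags co-occur when used by the same user
WEIGHT BY userid;   // step 4: each user counts as one observation
PLANAR PROJECT TO hashtag;                                           // step 5: project
OUTPUT TO "hashtags_relateby_user.graph";                            // step 6: save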

For more details on the core concepts behind discussion graphs, we recommend reading our ICWSM 2014 paper.

A note on projecting weighted data

Often, feature values are weighted. For example, the affect classifier produces a weighted feature value indicating how likely a message is to be expressing joviality, sadness, etc. (In other cases, the use of the WEIGHT BY command implicitly creates a weighted value).

When it encounters a weighted feature value in its target domains, the PROJECT TO command treats the weights as probabilities of a feature value having occurred. For example, let’s continue our analysis of activity and location mentions such as in the following message:

"I'm having fun hiking tiger mountain" tweeted by Alice on a Saturday at 10am

Let’s say our mood analysis indicates that the message expresses joviality with a weight of 0.8 and serenity with a weight of 0.4, in addition to the other discrete features:

Domain     Feature          Weighted value
Mood       Joviality        0.8
Mood       Serenity         0.4
Activity   hiking           1.0
Location   tiger mountain   1.0
Author     Alice            1.0

The two weighted features are interpreted as independent probabilities. That is, there is an 80% likelihood of this message being jovial and a 20% likelihood of not being jovial. Independently, there is a 40% likelihood of the message being serene, and 60% chance of not being serene.

If we project this single message to the relationship between location and mood (PROJECT TO Mood, Location;), this message will expand to the following 4 projected edges:

Edge                                        Weight   Metadata
Joviality and Tiger Mountain                0.48     hiking, Alice
Serenity and Tiger Mountain                 0.08     hiking, Alice
Joviality and Serenity and Tiger Mountain   0.32     hiking, Alice
(No mood) and Tiger Mountain                0.12     hiking, Alice
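
The edge weights follow directly from treating the two mood weights as independent probabilities:

joviality only:          0.8 * (1 - 0.4)       = 0.48
serenity only:           (1 - 0.8) * 0.4       = 0.08
joviality and serenity:  0.8 * 0.4             = 0.32
no mood:                 (1 - 0.8) * (1 - 0.4) = 0.12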

Of course, when analyzing a larger corpus of social messages, each message will be expanded individually and the results aggregated.

Script Command Reference

The discussion graph tool’s scripting language currently supports the following commands.

Note that square brackets [ ] indicate optional elements of the command. Italicized terms indicate user-specified arguments, variable names, and other user-supplied elements of the command.

LOAD

Syntax: LOAD Datasource([arguments]);

Example: LOAD MultiLine(path:"productreviews.txt");

The LOAD command loads social media data from some datasource. The required arguments are datasource-specific. Generally, datasources require a path to the input file as well as schema information to interpret the file. See the Common things you’ll want to do section below for examples of loading TSV, Multiline record, JSON and Twitter files.

EXTRACT

Syntax: EXTRACT [PRIMARY] field|FeatureExtractor([arguments]),… [FROM varname];

Example: EXTRACT PRIMARY hashtag, Gender(), AffectDetector();

The EXTRACT command runs a series of feature extractors against the raw social media messages loaded from a data source via the LOAD command.

Extracting a field will pass through a field from the raw data unmodified.

Extracting a feature using a FeatureExtractor() will run the specified feature extractor against the social media message. Feature extractors may generate 0, 1 or more feature values for each message they process, and the domain of the feature need not match the name of the feature extractor. For example, the AffectDetector() generates features in several domains (Subjective, Mood and PosNegAffect), and other feature extractors, such as ExactPhrases(), can generate features in custom domains.

The PRIMARY flag acts as a kind of filter on the raw social media data. EXTRACT must find at least one PRIMARY field or feature in a message, otherwise the message will be ignored. If no fields or features are marked as PRIMARY, then EXTRACT will not filter messages.

FROM varname tells the EXTRACT command where to get its input data. If not specified, EXTRACT will read from the output of the previous command.

WEIGHT BY

Syntax: WEIGHT BY featureDomain[, …] [FROM varname];

Example: WEIGHT BY userid;

The WEIGHT BY command reweights the data from social media messages. By default, every social media message counts as a single observation.  If we see a co-occurrence relationship occurring in 2 social media messages, then the co-occurrence relationship will have a weight of 2.  We can change this using the WEIGHT BY command so that every unique user (or location or other feature value) counts as a single observation.  So, for example, if a co-occurrence relationship is expressed by 2 unique users, then it will have a weight of 2.  Conversely, if a single user expresses 2 distinct co-occurrence relationships, each relationship will have a weight of only 0.5.

Note that we can WEIGHT BY one feature but RELATE BY another feature.

RELATE BY

Syntax: RELATE BY featureDomain [FROM varname];

Example: RELATE BY userid;

The RELATE BY command declares the domain that defines a co-occurrence relationship. All features that co-occur with the same feature value in this domain are considered to have co-occurred.

FROM varname tells the RELATE BY command where to get its input data. If not specified, RELATE BY will read from the output of the previous command.

Note that we can WEIGHT BY one feature but RELATE BY another feature.

PROJECT

Syntax: PROJECT TO [featureDomain, …] [FROM varname];

Variants: PLANAR PROJECT TO [featureDomain, …] [FROM varname];

Variant: PLANAR BIPARTITE PROJECT TO [featureDomain, …] [FROM varname];

Example: PROJECT TO hashtag;

The PROJECT TO command will project an initial hyper-graph to focus on only relationships among the specified feature domains. That is, only edges which connect 1 or more nodes in the specified domains will be kept, and any nodes in other feature domains will be removed from the structure of the graph. By default, the PROJECT TO command generates a hyper-graph. This means that nodes that do not co-occur with other nodes will still be described by a degenerate 1-edge. Also, if many nodes simultaneously co-occur together, their relationship will be described by a k-edge (where k is the number of co-occurring nodes).

Often, especially for ease of visualization, it is useful to restrict the discussion graph to be a planar graph (where every edge in the graph connects exactly 2 nodes). The PLANAR PROJECT TO command achieves this. All hyper-edges will be decomposed and re-aggregated into their corresponding 2-edges.

Furthermore, it can be useful to restrict the graph to be bipartite, where only edges that cross domains are kept. For example, we may only care about the relationship between users and the hashtags they use, and not care about the relationship among hashtags themselves. The PLANAR BIPARTITE PROJECT TO command achieves this. Semantically, this is the equivalent of doing a planar projection and then dropping all edges that connect nodes in the same domain.
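
For example, here is a sketch that keeps only user-to-hashtag edges, assuming the Twitter data source described in the FAQ below (userscreenname and hashtag are fields it defines):

LOAD Twitter(path:"statuses.log.2014-06-13-12",ignoreErrors:"true");
EXTRACT userscreenname, hashtag;
PLANAR BIPARTITE PROJECT TO userscreenname, hashtag;
OUTPUT TO "user_hashtag.graph";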

MERGE

Syntax: MERGE varname1,varname2[,…];

Example: MERGE MentionAndUserGraph,HashTagAndUserGraph;

The MERGE command overlays two discussion graphs atop each other. Nodes with the same feature domain and values will be merged.

OUTPUT

Syntax: OUTPUT TO "filename.graph" [FROM varname];

Example: OUTPUT TO "mentions.graph";

The OUTPUT TO command saves a discussion graph to the specified file.

Files are saved in DGT’s native format. This format consists of 3 tab-separated columns. The first column is the edge identifier: the comma-separated list of nodes connected by this edge. The second column is the count of the number of times this co-occurrence relationship was observed to occur. The third column is a JSON-formatted representation of the context of the relationship or, in other words, the distribution of feature values conditioned on the co-occurrence relationship.

Naming variables

We can assign the result of commands to variables, and use these variables in later commands:

Syntax:

var x = COMMAND1;
COMMAND2 FROM x;

Example:

var reviewData = LOAD Multiline(path:"finefoods.txt.gz");
var reviewFeatures = EXTRACT AffectDetector(), review_score FROM reviewData;

Feature Extractor Reference

Here’s a current list of feature extractors included in the discussion graph tool release.

AffectDetector(): infers mood from text.
  Arguments:
    field: input field to analyze (default='text')
  Output domains:
    Mood: weights for 7 moods (joviality, sadness, guilt, fatigue, hostility, serenity, fear)
    PosNeg: aggregation of positive/negative affects

Gender(): infers gender from user names.
  Arguments:
    field: input field to analyze (default='username')
    discrete: whether to output discrete or weighted gender values (default='true')
  Output domains:
    gender: m=male, f=female, u=unknown

GeoPoint(): explicit lat-lon coordinates.
  Arguments:
    field: input field to analyze (default='geopoint')
    rounding: number of decimal places to include
  Output domains:
    geopoint: lat-lon value

GeoShapeMapping(): maps lat-lon points to feature values via a user-specified GeoJSON formatted shapefile.
  Arguments:
    field: input field to analyze (default='geopoint'). This field should contain both lat and lon coordinates, separated by a space or comma.
    latfield: input field containing the latitude value.
    lonfield: input field containing the longitude value.
    shapefile: GeoJSON formatted shapefile
    propertynames: comma-separated list of property:domain pairs. The property names a property within the shapefile, and the domain specifies a custom domain name for that property. If a lat-lon point falls within a shape specified in the shapefile, the feature extractor will output all the specified properties in the propertynames list.
    unknownvalue: value to assign to a lat-lon outside of the given shapes
  Note: specify either the field argument or both the latfield and lonfield arguments.
  Output domains:
    [custom domain name]

Country(): an instance of GeoShapeMapping that maps lat-lon to country/region two-letter codes and country/region names.
  Arguments:
    field, latfield, lonfield: as in GeoShapeMapping()
    unknownvalue: value to assign to a lat-lon outside of countries/regions
  Output domains:
    fips_country:
    country:

USAState(): an instance of GeoShapeMapping that maps lat-lon to USA subregions and states.
  Arguments:
    field, latfield, lonfield: as in GeoShapeMapping()
    unknownvalue: value to assign to a lat-lon outside of US states
  Output domains:
    USA_subregion:
    USA_state:
    USA_fips:

CountyFIPS(): an instance of GeoShapeMapping that maps lat-lon to US county names and FIPS codes.
  Arguments:
    field, latfield, lonfield: as in GeoShapeMapping()
    unknownvalue: value to assign to a lat-lon outside of US counties
  Output domains:
    countygeoid:
    countyname:

Time(): extracts various temporal features.
  Arguments:
    field: input field to analyze (default='creationdate')
    options: list of time features to extract: absoluteminute, absolutehour, absoluteday, absoluteweek, monthofyear, dayofweek, hourofday (default is to output all)
    format: 'unix' or 'ticks' (default='unix')
  Output domains:
    absoluteminute:, absolutehour:, absoluteday:, absoluteweek:, monthofyear:, dayofweek:, hourofday:

ProfileLocation(): maps geographic regions from user profile locations with a user-specified mapping file.
  Arguments:
    field: input field to analyze (default='userlocation')
    domain: set custom output domain
    mappingfile: model for mapping from user location names to geographic locations. DGT comes with a mapping file for major international metropolitan areas, and United States country regions and divisions.
    unknownvalue: value to assign to unrecognized profile locations
  Output domains:
    [custom domain name]

The ProfileLocationTo…() extractors below each take the same two arguments: field (input field to analyze, default='userlocation') and unknownvalue (value to assign to unrecognized profile locations).

ProfileLocationToCountry(): maps user profile locations to 2-letter country/region FIPS codes.
  Output domain: country:

ProfileLocationToCountryName(): maps user profile locations to country/region names.
  Output domain: countryname:

ProfileLocationToUSASubregion(): maps user profile locations to subregions of the USA (e.g., Pacific, Mid-Atlantic).
  Output domain: usa_subregion:

ProfileLocationToUSAState(): maps user profile locations to US states.
  Output domain: usa_state:

ProfileLocationToUSACounty(): maps user profile locations to US county FIPS codes.
  Output domain: usa_county:

ProfileLocationToUSACountyName(): maps user profile locations to US county names.
  Output domain: usa_countyname:

ProfileLocationToMetroArea(): maps user profile locations to major metropolitan areas.
  Output domain: metroarea:

ExactPhrases(): matches specific phrases in a given list or mapping file.
  Arguments:
    field: input field to analyze (default='text')
    domain: set custom output domain
    accept: a comma-separated list of phrases to match
    acceptfile: a text file listing phrases. Use a tab-separated two-column file to specify canonical forms for matched phrases.
  Output domains:
    [custom domain name]

Regex(): matches regular expressions.
  Arguments:
    field: input field to analyze
    domain: set custom output domain
    regex: the regular expression to match against text
  Output domains:
    [custom domain name]

Tokens(): extracts unigram tokens.
  Arguments:
    field: input field to analyze
    domain: set custom output domain
    stopwordsfile: file of tokens to ignore (default=none)
    porter: use porter stemming (default='false')
  Output domains:
    [custom domain name]
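
As a combined illustration, several of these extractors can run in a single EXTRACT statement (a sketch; the input field names depend on your data source, and stopwords.txt is a hypothetical file):

EXTRACT Time(field:"createdat", options:"dayofweek,hourofday"),
        USAState(field:"geopoint"),
        Tokens(domain:"words", field:"text", stopwordsfile:"stopwords.txt");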

FAQ

Common things you’ll want to do

Load data in different formats

DGT can load social media data in delimiter-separated TSV and CSV files, line-based JSON format (including the output of common Twitter downloaders) and multi-line record formats.

TSV and CSV data

To load a TSV or CSV, use the following LOAD command. The path to a file is required. Also, either the hasHeader flag must be set to true (indicating the first row of the file is a header line) or the schema argument must be set.

LOAD TSV(path:"filename.txt",
         fieldSeparator:",", // optional: default is tab character
         ignoreErrors:"true", // optional: default is false
         hasHeader:"false", // optional: default is false
         schema:"col1,col2,..." // either hasHeader:"true" or a schema is required
         );

Multi-line record data

A multi line record formatted file includes a single record field per-line, with a blank line separating records. For example:

name: Bob
text: hello world!
messagetime:5/4/2013

name: Alice
text: hello back!
messagetime:5/5/2013

To load a multiline record, use the following LOAD command. Only the path argument is required. The schema is implicit in the file itself.

LOAD Multiline(path:"filename.txt",
               fieldSeparator:":", // optional: default is : character
               ignoreErrors:"true" // optional: default is false
               );

JSON file

DGT can read JSON line formatted files (where each line of a text file is a JSON object).

LOAD JSON(path:"filename.txt",
          ignoreErrors:"true",
          schema:"field1:jsonpath1,field2:jsonpath2,...");

The schema must specify both the fields to be extracted as well as their JSON paths. If multiple values in the JSON object match a given path, the field will be assigned multiple values.
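
For example, here is a sketch for a file where each line holds a message with a nested author object (the file name, field names and paths here are hypothetical; nested paths are written with a / separator, as in the Twitter field table below):

LOAD JSON(path:"messages.json",
          ignoreErrors:"true",
          schema:"text:text,username:author/name,created:author/created_at");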

Twitter data

DGT also includes a pre-defined Twitter data source for the output of the twitter-tools utilities (a JSON-line formatted file). To load the output of the twitter-tools utilities, use the following LOAD command. Only the path argument is required.

LOAD Twitter(path:"filename.txt");

This data source includes schema definitions for most of the common Twitter fields:

Field                   JSON path
contextid               id_str
createdat               created_at
text                    text
inreplytostatusid       in_reply_to_status_id
inreplytoscreenname     in_reply_to_screen_name
userid                  user/id_str
username                user/name
userscreenname          user/screen_name
userlocation            user/location
lang                    lang
userdescription         user/description
userfollowerscount      user/followers_count
userfriendscount        user/friends_count
userlistedcount         user/listed_count
usercreatedat           user/created_at
userfavouritescount     user/favourites_count
userutcoffset           user/utc_offset
usertimezone            user/time_zone
userverified            user/verified
userstatusescount       user/statuses_count
retweetcreatedat        retweeted_status/created_at
retweetid               retweeted_status/id_str
retweettext             retweeted_status/text
retweetuser             retweeted_status/user/id_str
retweetusername         retweeted_status/user/name
retweetuserscreenname   retweeted_status/user/screen_name
hashtag                 entities/hashtags/text
symbol                  entities/symbols/text
url                     entities/urls/url
urlexpanded             entities/urls/expanded_url
mentionuserid           entities/user_mentions/id_str
mentionusername         entities/user_mentions/name
mentionuserscreenname   entities/user_mentions/screen_name
geopoint                geo/coordinates/$$

Filter out irrelevant messages

Sometimes a specific social media message is simply irrelevant to a specific analysis. For example, in a study about hashtag usage on Twitter, we might want to ignore messages that do not have hashtags. To do this, we can use the PRIMARY keyword of the EXTRACT command.

EXTRACT PRIMARY hashtag, PRIMARY mention, AffectDetector();

In this example, we have marked the hashtag and mention fields as PRIMARY fields (any field or feature extractor may be marked as PRIMARY). This PRIMARY flag tells the EXTRACT command that it must find either a hashtag or a mention value in a message in order to continue processing it. If a message has either a hashtag or a mention, EXTRACT will also run the AffectDetector() and pass the values along to the rest of the script. If a message does not have any hashtag value and does not have any mention value, then that message will be ignored.

The PRIMARY flag can be combined with the accept and reject filter arguments supported by most feature extractors. If you want to only analyze social media messages by women, for example, you can use the accept argument to achieve this:

EXTRACT PRIMARY Gender(accept:"f"), hashtag, mention;

The Gender feature extractor understands the accept argument, and will only output feature values that match the list. The result in this case is that only messages where the author’s gender is identifiably female will be processed. (Note that the hashtag and mention fields are no longer marked as PRIMARY fields.)

If you have a long list of values you want to accept, you can put them in a file and use the acceptfile argument. The syntax and behavior for accept and acceptfile are the same as for the ExactPhrases() feature extractor.

Detect phrases and words in tweet text

You can configure the ExactPhrases() feature extractor with different arguments to detect different sets of phrases. Here is an example that detects mentions of parents. By default, phrase detection is case-insensitive.

EXTRACT ExactPhrases(domain:"parent",accept:"dad,mom,father,mother");

If you have a long list of phrases you want to detect, you can put them in a file and reference it from the extractor. Note that you also have to specify the datafile as a resource, so that the framework knows to include that file as part of the job.

EXTRACT ExactPhrases(domain:"parent",acceptfile:"parentphrases.txt");

In its simplest form, this file is just a list of phrases to detect. You can also use this file to group or canonicalize detected phrases by adding a second tab-separated column that contains the canonical form. For example, the following file detects nicknames for parents and maps them to a canonical name: whenever the phrase extractor finds “mommy” or “mom”, the extracted feature will be emitted as “mother”.

mom	mother
mother	mother
mommy	mother
dad	father

Import results into R, Excel, Gephi or other tools

Often, you will want to perform further higher-level analyses (machine learning analyses, visualizations and/or statistical analyses) on the output of DGT. To do so, we provide utilities to convert from DGT’s native output format to TSV and GEXF files that will let you load the data in R, Excel, Gephi and other tools.

To convert to TSV, use the dgt2tsv.exe command:

dgt2tsv.exe input.graph [outputfields] outputfilename.tsv

The list of outputfields may include “count”, any of the domains output by a feature extractor, a domain name followed by “.count”, or a domain name followed by a specific feature value.

For example, the following command will output a count of the number of messages seen for each edge in a discussion graph; the gender of the author; and the weight of the “fatigue” value in the Mood domain.

dgt2tsv.exe input.graph count,gender,Mood.fatigue output.tsv

To output a .gexf file that can be read by Gephi for graph analyses and visualizations, use the dgt2gexf.exe command:

dgt2gexf.exe input.graph [outputfields] outputfilename.gexf
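
As in walkthrough #1, you can append a filter option to keep only frequently observed edges, for example edges seen at least 1000 times:

dgt2gexf.exe input.graph count output.gexf filterbycount=1000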