Int'l AAAI Conference on Weblogs and Social Media
Datasets made available through ICWSM are not restricted to only ICWSM 2010 or even ICWSM in general. Our long-term goal is to make weblog and social media datasets available to the research community, and while we hope that ICWSM will be a premier venue for presenting that research, we are happy to see the ICWSM datasets used far and wide.
ICWSM 2009 Spinn3r Blog Dataset
The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008. Each post includes the text as syndicated, as well as metadata such as the blog's homepage and timestamps. The data is formatted in XML and is further arranged into tiers that roughly approximate search engine ranking. The total size of the dataset is 142 GB uncompressed (27 GB compressed).
This dataset spans a number of big news events (the Olympics; both US presidential nominating conventions; the beginnings of the financial crisis; ...) as well as everything else you might expect to find posted to blogs.
To get access to the Spinn3r dataset, please download and sign the usage agreement, and email it to dataset-request (at) icwsm.org. Once your form is processed (usually within 1-3 days), you will be sent a URL and password with which you can download the collection.
Here is a sample of blog posts from the collection. The XML format is described on the Spinn3r website.
Spinn3r provides free access to researchers. If you are interested in making use of their data beyond the ICWSM collection, for example to crawl linked posts or earlier stories from certain blogs, visit their site, spinn3r.com.
When citing this dataset in a paper, please use the following reference:
K. Burton, A. Java, and I. Soboroff. The ICWSM 2009 Spinn3r Dataset. In Proceedings of the Third Annual Conference on Weblogs and Social Media (ICWSM 2009), San Jose, CA, May 2009.
JDPA Sentiment Corpus
The JDPA Corpus consists of user-generated content (blog posts) containing opinions about automobiles and digital cameras. The posts have been manually annotated for named, nominal, and pronominal mentions of entities. Entities are marked with the aggregate sentiment expressed toward them in the document. Mentions of each entity are marked as co-referential. Mentions are assigned semantic types consisting of the Automatic Content Extraction (ACE) mention types and additional domain-specific types. Meronymy (part-of and feature-of) and instance relations are also annotated. Expressions which convey sentiment toward an entity are annotated with the polarity of their prior and contextual sentiments as well as the mentions they target. The following modifiers, which may target other modifiers or sentiment expressions, are also annotated:
- negators (expressions which invert the polarity of a sentiment expression or modifier)
- neutralizers (expressions that do not commit the speaker to the truth of the target sentiment expression or modifier)
- committers (expressions which shift the commitment of the speaker toward the truth of a sentiment expression or modifier)
- intensifiers (expressions which shift the intensity of a sentiment expression or modifier)
Additionally, we have annotated when the opinion holder of a sentiment expression is someone other than the author of the blog by linking the expression to the holder. We also annotate when two entities are compared on a particular dimension.
The data, organized into training and testing sets, consists of 515 documents (blog posts) covering 330,762 tokens which make up 19,322 sentences. 87,532 mentions and 15,637 sentiment expressions are annotated.
To get access to the JDPA collection, download and sign the usage agreement, and email it to ICWSM.JDPA.Corpus@gmail.com. Once your form is processed, you will be sent a URL and password where you can download the collection.
Wikipedia User Contribution Dataset
http://nile.ics.uci.edu/events-dataset-api
Sara Javanmardi and Yasser Ganjisaffar, University of California, Irvine
This dataset has been prepared for an ongoing study on user reputation and content quality in Wikipedia at the University of California, Irvine. This research is done mainly by two PhD candidates, Sara Javanmardi and Yasser Ganjisaffar, under the supervision of Prof. Lopes, Prof. Baldi, and Prof. Grant. One of the building blocks of this study was a software component that can monitor changes in the content of wiki pages over time. We have developed the component and we are pleased to share one of our datasets on English Wikipedia, which contains user contributions. For each article we have modeled the evolution of the content through insert and delete events over time (up to September 2009). Since the dumps released after October 2007 for English Wikipedia don't contain the full text of revisions, and processing the text of revisions is a complicated and time-consuming task, we hope sharing this dataset helps to expedite research studies on Wikipedia and social media in general.
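To illustrate what modeling content evolution as insert and delete events can look like, here is a minimal sketch that diffs two revisions word by word using Python's standard difflib module. This is not the component developed by the dataset authors; the function name diff_events and the word-level granularity are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def diff_events(old_text, new_text):
    """Return (kind, words) events that turn old_text into new_text.

    Hypothetical helper: word-level diffing is an assumption, not the
    method used to build the Wikipedia User Contribution Dataset.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    events = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" is recorded as a delete followed by an insert.
        if tag in ("delete", "replace"):
            events.append(("delete", old_words[i1:i2]))
        if tag in ("insert", "replace"):
            events.append(("insert", new_words[j1:j2]))
    return events


if __name__ == "__main__":
    for kind, words in diff_events("the cat sat", "the cat sat down"):
        print(kind, " ".join(words))
```

Applying such a function to each consecutive pair of revisions, and attributing each event to the user who saved the newer revision, would give a per-user record of contributions in the spirit of the dataset described above.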
Community-Created Data Resources derived from the Spinn3r Collection
The following resources were created by ICWSM 2009 data challenge authors and are being made available for use in the 2010 data challenge.
- Spinn3r collection metadata
Meeyoung Cha, Juan Antonio Navarro Perez, and Hamed Haddadi, MPI-SWS. - "A collection containing the data that we extracted from the Spinn3r dataset. It includes data about all the posts from major blogging domains that were included in the dataset, as well as the extracted social graph among their corresponding blog users. We also include metadata that we collected from the most popular YouTube videos shared in blogs."
- Large-scale personal story corpus
Andrew Gordon and Reid Swanson, USC. - "To facilitate the distribution of large-scale story corpora, our group has identified individual blog posts that contain personal stories within existing large-scale corpora of posts. Most recently, we identified nearly one million personal stories in the ICWSM 2009 Spinn3r Blog Dataset, which we call the ICWSM 2009 Story Subset."
- Lucene index of the ICWSM 2009 collection
Dan Knights, JD Power & Assoc. - Includes:
- README (explains how to use the Lucene index and search)
- index-all.tar.gz (the Lucene index)
- jdpa.lucene.tar.gz (the JDPA Java packages that do the indexing and searching)
- parsexml.py (a quick-and-dirty Python script to parse and strip the post text from the Spinn3r XML)
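For readers who want a feel for what parsing and stripping post text from XML involves, the following is a rough sketch using only the Python standard library. It is not the parsexml.py script listed above, and the element names used here (item, description) are hypothetical placeholders; consult the format description on the Spinn3r website for the actual schema, which may also use XML namespaces.

```python
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(html_text):
    """Drop HTML tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.parts).split())


def iter_post_text(source, item_tag="item", body_tag="description"):
    """Stream posts from an XML file object, yielding plain text.

    item_tag and body_tag are assumed names, not the real Spinn3r schema.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == item_tag:
            yield strip_markup(elem.findtext(body_tag, default=""))
            elem.clear()  # release the finished post's children


# Tiny made-up sample in the assumed layout, not real dataset content.
SAMPLE = b"""<dataset>
  <item><title>First</title><description>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</description></item>
  <item><title>Second</title><description>Plain text post</description></item>
</dataset>"""

posts = list(iter_post_text(io.BytesIO(SAMPLE)))
```

Streaming with iterparse rather than loading a whole file is a deliberate choice here, since the uncompressed collection is far larger than typical memory.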
Community
We have a mailing list for discussing the datasets at http://groups.google.com/group/icwsm-data. Please join to talk about whatever you're doing with the data. In particular, if you are looking for groups to collaborate with, here's a forum for you. We also have a project at Google Code, http://code.google.com/p/icwsm-data/, where we can host tools and resources that you create to go along with the datasets.
Data Chairs
Ian Soboroff, NIST
Akshay Java, Microsoft

