3rd Int'l AAAI Conference on Weblogs and Social Media
May 17 - 20, 2009, San Jose, California
Continuing the ICWSM tradition, ICWSM 2009 is making a dataset available to researchers in the blog and social media fields. We invite you to download the dataset, explore it, learn something interesting about it, and submit a paper about it to ICWSM 2009.
Good research topics might include...
- link analysis
- social network extraction
- tracing the evolution of news
- blog search and filtering
- psychological, sociological, ethnographic, or personality-based studies
- analysis of influence among bloggers
- blog summarization and discourse analysis
But you should feel free to explore any aspect of the data that you feel would be of interest to the ICWSM community.
List of papers accepted to the Data Challenge Workshop
Identifying Personal Stories in Millions of Weblog Entries [PDF]
Andrew Gordon and Reid Swanson
SentiSearch: Exploring Mood on the Web [PDF]
Sara Sood and Lucy Vasserman
Flash Floods and Ripples: The Spread of Media Content through the Blogosphere [PDF] [Best Data workshop paper]
Meeyoung Cha, Juan Antonio Navarro Perez, and Hamed Haddadi
Event Intensity Tracking in Weblog Collections [PDF]
Viet Ha Thuc, Yelena Mejova, Christopher Harris and Padmini Srinivasan
Quantification of Topic Propagation using Percolation Theory: A study of the ICWSM Network [PDF]
Ali Azimi Bolourian, Yashar Moshfeghi and C. J. van Rijsbergen
Authors are invited to submit papers to a special data challenge workshop, to be held on the last day of ICWSM. Papers for the workshop may be submitted here. The deadline for workshop submissions is March 1st. Submissions may be up to 8 pages in length, must be in PDF format, and must follow the ICWSM formatting guidelines. The workshop itself will feature presentations by authors as well as a broader discussion of data issues and opportunities confronting the social media community.
We also welcome authors to submit papers on the dataset to the main ICWSM conference. Time permitting, we will invite authors of accepted ICWSM papers on the dataset to also briefly present their work at the workshop.
The best paper (main conference or workshop) on the dataset will be selected by the data chairs and will receive a prize at the conference.
Please note that the datasets made available through ICWSM are not restricted to ICWSM 2009 or even to ICWSM in general. Our long-term goal is to make weblog and social media datasets available to the research community, and while we hope that ICWSM will be a premier venue for presenting that research, we are happy to see the ICWSM datasets used far and wide.
ICWSM 2009 Spinn3r Blog Dataset
239 people have downloaded the dataset so far! (as of March 25th, 2009)
The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008. Each post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers that roughly approximate search engine ranking. The total size of the dataset is 142 GB uncompressed (27 GB compressed).
This dataset spans a number of big news events (the Olympics; both US presidential nominating conventions; the beginnings of the financial crisis; ...) as well as everything else you might expect to find posted to blogs.
To get access to the Spinn3r dataset, please download and sign the usage agreement, and email it to dataset-request (at) icwsm.org. Once your form is processed (usually within 1-3 days), you will be sent a URL and password with which you can download the collection.
Here is a sample of blog posts from the collection. The XML format is described on the Spinn3r website.
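Because the uncompressed collection is far too large to load into memory at once, a streaming parser is a natural way to work through the XML files. The following is only a minimal sketch using Python's standard library. The function name `count_elements` and the element names `dataset`, `item`, and `title` are placeholders chosen for illustration; they are not taken from the Spinn3r schema, so substitute the real element names from the format description.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Stream through an XML document and count elements named `tag`.

    `source` is a filename or a binary file object. `tag` is an
    assumption supplied by the caller; for namespaced documents it must
    be given in ElementTree's "{namespace}name" form.
    """
    count = 0
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the element's text and children to keep memory low.
            elem.clear()
    return count


# Tiny made-up sample; these element names are placeholders only.
sample = io.BytesIO(
    b"<dataset>"
    b"<item><title>first</title></item>"
    b"<item><title>second</title></item>"
    b"</dataset>"
)
print(count_elements(sample, "item"))
```

The same loop can be extended to pull out fields of interest before each element is cleared, which keeps memory use modest even for very large files.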
Spinn3r provides free access to researchers. If you are interested in making use of their data beyond the ICWSM collection, for example to crawl linked posts or earlier stories from certain blogs, visit their site, spinn3r.com.
When citing this dataset in a paper, please use the following reference:
K. Burton, A. Java, and I. Soboroff. The ICWSM 2009 Spinn3r Dataset. In Proceedings of the Third Annual Conference on Weblogs and Social Media (ICWSM 2009), San Jose, CA, May 2009.
Community
We have a mailing list for discussing the datasets at
http://groups.google.com/group/icwsm-data. Please join to talk about
whatever you're doing with the data. In particular, if you are
looking for groups to collaborate with, here's a forum for you. We
also have a project at Google Code,
http://code.google.com/p/icwsm-data/, where we can host tools and
resources that you create to go along with the datasets.
Data Chairs
Ian Soboroff, NIST
Akshay Java, Live Labs, Microsoft

J.D. Power and Associates Web Intelligence Division
Microsoft
Nielsen Online
Spinn3r
Videolectures.net
Visible Technologies
Sponsored by the Association for the Advancement of Artificial Intelligence. For more info: icwsm09@aaai.org