ICWSM-2007 Weblog Datashare
---------------------------

- The dataset consists of about 14M weblog posts from 3M weblogs,
  collected by Nielsen BuzzMetrics for May 2006. The data is annotated
  with 1.7M blog-blog links.

- Breakdown (of posts) by language:

    English    51%
    Chinese    14%
    Japanese   14%
    Russian     6%
    Spanish     3%
    French      2%
    Italian     2%
    Unknown     3%

- The data is in XML format; there is one XML file per day.

- Caveats:
  * Dates and times are not normalized to a uniform time zone, so
    interpret dates with care.
  * There are association errors in the data: in less than 2% of the
    posts, the permalink, title, author and/or date is incorrectly
    associated with the content of the post.
  * Only a portion of weblog and press outlinks are identified. Up to
    half of the blog outlinks are missing.

- The data is released in conjunction with the First International
  Conference on Weblogs and Social Media: http://www.icwsm.org

Here is the process for obtaining the data:

(1) download the data share agreement
(2) sign and fax the agreement as instructed in the document
(3) e-mail datashare@icwsm.org to let us know that the fax is on its way

Once we've received the fax, we will e-mail you a unique username/login
that will permit you to download the data within 24 hours.

----------------------------

Format of the feed:

- weblog url
- title of the weblog (defaults to weblog url if title not found)
- permalink for the post (defaults to weblog url if permalink not found)
- title of the post
- author of the post (may be empty or missing)
- date of publication of the post
- time of publication of the post in format HHMMSS (defaults to 000000
  if unknown)
- content of the post
- type of outlink: either "weblog" or "press"
- url in href of post content
- if type=="weblog", site is the parent weblog for the permalink url;
  if type=="press", site is the news portal hosting the news article
- tag/category associated with post
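
The default values described in the feed format (weblog url standing in
for a missing title or permalink, 000000 standing in for an unknown
time) are easy to mistake for real data. The following is a minimal
sketch in Python of how a consumer might flag them, assuming each post
has already been read into a dictionary. The key names (weblog_url,
weblog_title, permalink, author, time) and both function names are
hypothetical; the actual XML element names are not given in this
document. Note also that a post genuinely published at midnight cannot
be told apart from one with an unknown time.

```python
from datetime import time
from typing import Optional

# Placeholder the feed uses when the publication time is unknown.
UNKNOWN_TIME = "000000"


def parse_post_time(hhmmss: str) -> Optional[time]:
    """Parse an HHMMSS string; return None for the 000000 placeholder."""
    value = hhmmss.strip()
    if len(value) != 6 or not (value.isascii() and value.isdigit()):
        raise ValueError(f"expected six digits (HHMMSS), got {hhmmss!r}")
    if value == UNKNOWN_TIME:
        # Indistinguishable from a real midnight post; treated as unknown.
        return None
    # datetime.time raises ValueError for out-of-range fields (e.g. hour 25).
    return time(int(value[0:2]), int(value[2:4]), int(value[4:6]))


def flag_defaults(post: dict) -> dict:
    """Report which fields of a post appear to hold documented defaults.

    The keys used here are hypothetical names for the feed fields.
    """
    weblog_url = post["weblog_url"]
    return {
        "weblog_title_defaulted": post.get("weblog_title") == weblog_url,
        "permalink_defaulted": post.get("permalink") == weblog_url,
        "author_missing": not post.get("author"),
        "time_unknown": parse_post_time(post.get("time", UNKNOWN_TIME)) is None,
    }
```

Because dates and times are not normalized to a uniform time zone, the
parsed value is deliberately left as a naive time with no zone attached.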