This project aims to profile blogger interests and correlate them with blogger demographics. The Common Crawl corpus, hosted on Amazon S3, was used for this purpose. The crawled data corresponding to blogger profile web pages (sample page) served as the dataset for this analysis.
The selective download of the required dataset was made possible by the Common Crawl URL Index by Scott Robertson. Only about 8000 blogger profile pages (surprisingly low!) were found in the corpus using the URL index. Part of the reason for this low number is that the URL index has, at this time, been generated for only half of the 81 TB corpus.
Check out the project on GitHub.
Some findings from this exercise:
1) 25% of bloggers from Bangalore write about music (not so surprising, as it is the rock capital of India).
2) New York, San Francisco and Toronto together account for 40% of the bloggers.
3) San Francisco and Dallas, at 17% each, account for the most bloggers with an interest in politics.
4) San Francisco, Bangalore and Vancouver are among the top cities for bloggers whose interest is travel.
The technology stack of the project comprises:
- A Python script to selectively download the required chunks of ARC files from the Common Crawl corpus. It uses the JSON returned by the URL Search web application API when searching for all URLs with blogger.com/profile as the prefix.
- Jsoup to parse the downloaded blogger profile web pages.
- Maui to extract topics from raw text, used in conjunction with the AGROVOC vocabulary.
- A MySQL database to store the rough results of crawling the blogger profiles for further analysis.
- JFreeChart to display charts.
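The selective download step can be sketched roughly as follows. This is only an illustration, not the project's actual script. It assumes that each index entry gives the byte offset and compressed size of one record inside an ARC file, and that each record is stored as its own gzip member. The key names `offset` and `compressed_size`, the `arc_url` parameter and all function names are hypothetical.

```python
import gzip
import urllib.request


def range_header(offset, compressed_size):
    """Build an HTTP Range header covering one record (the end byte is inclusive)."""
    return {"Range": f"bytes={offset}-{offset + compressed_size - 1}"}


def decompress_record(chunk):
    """Decompress a single gzip member cut out of a larger ARC file."""
    return gzip.decompress(chunk)


def fetch_record(arc_url, entry):
    """Download and decompress one record.

    `arc_url` and the keys of `entry` are assumptions made for this sketch.
    """
    headers = range_header(entry["offset"], entry["compressed_size"])
    request = urllib.request.Request(arc_url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return decompress_record(response.read())
```

Requesting only the byte range of each profile page is what keeps the download small enough to process on a local machine, since the rest of each ARC file is never transferred.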
Extracted Dimensions: City, State, Country, Topic of Interest, Gender, Occupation
Extracted Measure: Number of bloggers
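To make the dimension and measure idea concrete, here is a minimal sketch of how a "share of bloggers per city interested in a topic" figure could be computed. It assumes one row per blogger with a single topic of interest. The row layout, the function name and the sample rows are made up for illustration and are not the project's schema or data.

```python
from collections import Counter


def topic_share_by_city(rows, topic):
    """Return {city: percentage of that city's bloggers interested in `topic`}."""
    total = Counter()     # measure: number of bloggers per city
    matching = Counter()  # bloggers per city whose topic matches
    for row in rows:
        total[row["city"]] += 1
        if row["topic"] == topic:
            matching[row["city"]] += 1
    return {city: 100.0 * matching[city] / total[city] for city in total}


# Made-up rows, purely for illustration; not data from the project.
rows = [
    {"city": "Bangalore", "topic": "Music"},
    {"city": "Bangalore", "topic": "Travel"},
    {"city": "Toronto", "topic": "Politics"},
    {"city": "Bangalore", "topic": "Travel"},
    {"city": "Bangalore", "topic": "Politics"},
]
print(topic_share_by_city(rows, "Music"))
```

The same counting pattern applies to any other dimension (state, country, gender, occupation) by swapping the key used for grouping.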
These dimensions were used to generate the reports. Sample reports can be found in the SnapShots folder of the GitHub project.
Thought process behind the project
Reason to use URL Index
Cost was a factor that led to using the URL index, since it allowed all the computation to be done on my local machine.
Reason to use Maui
Maui is a topic extractor that can extract topic terms from text, even terms that are not actually present in the text.
For example, it could give me “Politics” as a topic even if neither the term “politics” nor any of its root forms is actually mentioned in the text.
Maui consumes RDF-based vocabularies for topic extraction. I used the latest version of the AGROVOC vocabulary, as it contains a broad array of topics.
Reason to use blogger profile web-pages
Since the purpose was to analyze blogger interests based on demographic information, these profile pages served as a rich source of that information. It was possible to infer the topics of interest to a blogger without actually having to crawl their entire blog. As this was just a Proof of Concept (PoC) project, this minimized the computation cost, as everything was done locally without the need for EC2.
The blogger profile page also contains two sections called “My Blogs” and “Blogs I Follow”. The blog pages corresponding to these could also be downloaded and crawled for topic extraction. This job would be best done using the EC2 infrastructure, as the data size could be far greater. The Python script could be modified to use the actual URL index located here to find out whether any data was missed. But this would also require EC2 infrastructure, as the index is huge (250 GB!).