Friday, January 29, 2010 - 15:05
1 hour (actually 50 minutes)
One of the biggest hurdles in high performance computing today is the analysis of massive quantities of data. As the size of the datasets grows to petascale (and beyond), new techniques are needed to efficiently compute meaningful information from the raw data. Graph-based data (which is ubiquitous in social networks, biological interaction networks, etc) poses additional challenges due to the difficulty of parallelizing many common graph algorithms. A key component in success is the generation of "realistic" random data sets for testing and benchmarking new algorithms. The R-MAT graph generator introduced by Chakrabarti, Faloutsos, and Zhan (2004) offers a simple, fast method for generating very large directed graphs. One commonly held belief regarding graphs produced by R-MAT is that they are "scale free"; in other words, their degree distribution follows a power law as is observed in many real world networks. These properties have made R-MAT a popular choice for generating graphs for use in a variety of research disciplines including graph theoretic benchmarks, social network analysis, computational biology, and network monitoring. However, despite its wide usage and elegant, parsimonius design, our recent work provides the first rigorous mathematical analysis of the degree distributions of the generated graphs. Applying results from occupancy problems in probability theory, we derive exact expressions for the degree distributions and other parameters. We also prove that in the limit (as the number of vertices tends to infinity), graphs generated with R-MAT have degree distributions that can be expressed as a mixture of normal distributions. This talk will focus on the techniques used in solving this applied problem in terms of classical "ball and urn" results, including a minor extension of Chistyakov's theorem.