# Link Prediction Analysis in the Wikipedia Collaboration Graph


Ferenc Molnár
Department of Physics, Applied Physics, and Astronomy
Rensselaer Polytechnic Institute
Troy, New York, 12180
Email: molnaf@rpi.edu

Abstract: Using the page editing records going back to the very beginning of Wikipedia, we define a dynamic collaboration graph of editors and the social links between them. We focus on the prediction of social link formation among Wikipedia editors. We present a statistical analysis of five link prediction models, using well-defined statistical measures such as precision, accuracy, sensitivity, and specificity. Results show that the best predictor for screening purposes (identifying most link formations correctly) is a model that considers the strength of the existing links to the common neighbors of two editors, but the highest probability of correct predictions is achieved by the Adamic/Adar predictor.

I. INTRODUCTION

In the past decade we have witnessed the rapid expansion of the Internet, in both user numbers and volume of content. This expansion also involved the formation of web-based social networks, which grew to larger sizes than any social network before the era of the Internet. One of these social networks was formed by Wikipedia, due to its open editing policy. Editors of a given page share a common interest in the field that the page belongs to, which is a basis for social links to form between them. By helping each other, or even competing in perfecting a page, the social links between them become even stronger.

One of the many exciting questions about these networks is their evolution. It is unquestionably a complex process, and apart from a few general principles (e.g. triadic closure [2]) the exact dynamics may be different for each social network on the web. Can we find a model that correctly describes the growth of a social network? Can we predict future links, based on the present state of the network? This is the essence of the link prediction problem [9].

The aim of this paper is to present a thorough statistical analysis of multiple link prediction methods, and to find the one that most correctly describes the evolution of social links between the editors of Wikipedia. First, we define the collaboration graph as a dynamic graph, constructed from the changelog of Wikipedia over the first ten years of its existence. Then, we give an overview of the statistical analysis of binary predictors that we applied in this study. In Section III, the results of the prediction analysis are presented. In the last section, the results are summarized and compared to each other.

A. Related work

Link prediction belongs to the field of network evolution models, which involves the study of many different social networks, such as citation networks, communication networks, acquaintance networks, and of course, collaboration networks. All of these are strongly linked to the Internet, whose growth and scale-free degree distribution are well described by the preferential attachment model [1]. It has been shown, however, that the evolution of social networks is driven by a different process [3]. Clustering, also known as the principle of triadic closure [2], plays a very important role.

Collaboration networks have been modelled by their general properties (e.g. [8]), but the problem of precise link prediction was introduced by Liben-Nowell and Kleinberg [9], who provide a baseline of link prediction methods and their analysis. They show that numerous link prediction methods can be significantly more precise than a random guess. Their work motivated the study presented in this paper. Recently, many models and methods of link prediction have been formulated and analyzed (e.g. [6], [7]). The incorporation of time-dependent information to enhance predictions has also gained considerable attention ([5], [4]).

Here, we conduct a thorough statistical analysis of a number of link prediction models, showing the relation between sensitivity, specificity, and accuracy. In addition, the collaboration graph is defined as a dynamic graph, in recognition of the inherently dynamic nature of social networks. Wikipedia is an ideal subject of study because of the large number of its editors, and because its long editing history provides a sufficiently large dataset for sampling and evaluating predictions.

II. PRELIMINARIES

A. Dynamic Collaboration Graph

The collaboration graph, also known as the coeditors graph, is a specific kind of social network. Generally, it is defined as the graph composed of editors as nodes; the strength of a link between two editors indicates how many publications (Wikipedia pages, in our study) they edited together in a given timespan. This graph is usually defined as a static graph at a given time, accumulating editing records over some timespan, and social links are inferred from it by setting a threshold on the link strength (i.e. the minimum number of pages the editors had to edit together).
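As a minimal sketch of the static definition above, the following Python counts, for every pair of editors, how many pages they both edited, and keeps only the pairs that reach a threshold. The input layout (`page_editors`, a mapping from page to the editors who touched it) and the function name are assumptions made for illustration; they are not taken from the paper's dataset or code.

```python
from collections import Counter
from itertools import combinations


def static_collaboration_graph(page_editors, k):
    """Build a static coeditor graph.

    page_editors: dict mapping a page id to an iterable of editor ids
                  (hypothetical input layout).
    k:            link threshold, the minimum number of shared pages.
    Returns a dict mapping a sorted editor pair to its link strength.
    """
    strength = Counter()
    for editors in page_editors.values():
        # Each unordered pair of distinct editors of a page gains 1.
        for a, b in combinations(sorted(set(editors)), 2):
            strength[(a, b)] += 1
    # Keep only the pairs that count as social links.
    return {pair: s for pair, s in strength.items() if s >= k}


pages = {
    "P1": ["ann", "bob", "cy"],
    "P2": ["ann", "bob"],
    "P3": ["bob", "ann", "ann"],
}
links = static_collaboration_graph(pages, k=2)
```

Enumerating all editor pairs per page is quadratic in the number of editors of a page, so this form is only meant to illustrate the definition on small inputs.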

In the case of Wikipedia, we have data for over ten years, with snapshots accumulating the page editing records over weeks. However, instead of using these snapshots as individual static graphs, we join them into a single, large dynamic collaboration graph. The nodes are the editors who ever edited a page during the timespan of the entire dataset. The link strengths between nodes change over time, with the same time resolution as the snapshots of the input data, i.e. weeks. The dynamics are defined by the following update rules, evaluated at every time step:

• if two editors edit a page together, the strength of the link between them increases by 8;
• every link strength is reduced by 1, until it reaches zero.

Using this definition, we can maintain a finely detailed description of the social links between editors. If two editors work together only by chance, the link between them is weak and drops to zero in eight weeks. However, if they work together repeatedly, the link strength between them keeps increasing, indicating a strong social interaction. In short, we maintain time-dependent information about the past history between editors, not only weekly snapshots, which enables a more precise link prediction for the future.

The dynamics are chosen mainly on the basis of computational efficiency, because the run time of the algorithms that generate link predictions strongly depends on whether the graph is sparse, and because our understanding of how the human brain stores (and forgets) long-term memories, including those related to social links, is very limited. Exponential decay was also considered, but it would excessively prolong the existence of weak links. The graph could become dense over time, slowing down the analysis so much that it would become infeasible. A linear decay, however, keeps the graph size in check, because links decay to zero in finite time, at which point they are actually removed from the graph. The strength increment of 8 per joint edit is somewhat arbitrary. Based on preliminary computations, random links (no actual social interactions) tend to decay to zero, and true social links form regardless of the strength increment value. A higher increment would only give longer decay times for random links and higher strength values for social links, which would only shift the threshold parameters of the link predictors that we use.

B. Prediction score functions

The link prediction of a collaboration graph is generated by predictor functions for every link in the graph. These functions use the present state of the graph (which, in our case, includes history from the past through the dynamic graph), and give a prediction score for every possible link to exist in the graph in a future timespan. A higher score represents a higher chance for a social link to exist between two nodes. There are numerous models [9] which consider a wide range of possible underlying processes driving the social interactions. Here, we do not aim to debate these models, but to compare their prediction performance and see which one fits Wikipedia best. In the following subsections, we select five of these models and give a short overview of them.

Many of the predictors utilize the notion of the neighborhood of a node. This is simply the set of nodes adjacent to a given node, i.e. connected to it by a social link. The mathematical definition is the following:

$$\Gamma(x) := \{\, y : x \text{ is socially linked to } y \,\} \tag{1}$$

1) Common neighbors predictor: The simplest prediction that can be made based on the neighborhood of nodes is the number of common neighbors shared by two nodes. The underlying idea is that the more common neighbors are present, the higher the chance that the two people will find a common subject, upon which a social link can form between them. The prediction score is defined as follows:

$$\mathrm{score}(x, y) = |\Gamma(x) \cap \Gamma(y)| \tag{2}$$

2) Adamic/Adar: In their paper, L. A. Adamic and E. Adar proposed that friendship between two persons can be predicted by measuring their similarity to each other [10]. The similarity is measured by the number of shared items, but weighted such that a unique item (shared only by these two people, and not by others) is more valuable, i.e. gives a stronger prediction, than an item which is shared among many people. Items, in our case, correspond to people, specifically the friends already present at the given time. Therefore, the prediction score given by this predictor is defined as follows:

$$\mathrm{score}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log |\Gamma(z)|} \tag{3}$$

3) Jaccard's coefficient: This is a similarity metric of sample sets. Generally, it is defined as the size of the intersection of two sample sets divided by the size of their union. In the application to link prediction, the sample sets are the neighborhoods of two nodes. From a probabilistic viewpoint, the score is again based on the number of common neighbors, but it is weighted by the probability that a (uniformly) randomly selected neighbor of either node is actually a common neighbor of both nodes. The score function is defined as:

$$\mathrm{score}(x, y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|} \tag{4}$$

4) Preferential attachment: This predictor is based on the growth model of social networks; the basic idea is that the probability that a new edge is incident on a node is proportional to the current neighborhood size of that node. In the case of collaboration networks, it is suggested that new links form with probabilities proportional to the product of the neighborhood sizes of the two endpoints of a link [8]. Therefore, the score function is defined as:

$$\mathrm{score}(x, y) = |\Gamma(x)| \times |\Gamma(y)| \tag{5}$$
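To make the pieces of this section concrete, here is a minimal Python sketch of the weekly update rules of the dynamic graph, the neighborhood of equation (1), and the score functions of equations (2) to (5). The function names and the input layout are hypothetical. Two details are assumptions rather than statements from the paper: the decay is applied before the increment within a time step, and a link whose strength is at or above the link threshold counts as a social link.

```python
import math
from itertools import combinations

INCREMENT = 8  # strength gained per page edited together (value from the text)
DECAY = 1      # strength lost per weekly time step (value from the text)


def step(strength, pages_edited):
    """Advance the dynamic graph by one week, in place.

    strength:     dict mapping frozenset({a, b}) to an integer link strength.
    pages_edited: iterable of editor collections, one per page edited this
                  week (hypothetical input layout).
    """
    # Linear decay; links that reach zero are removed from the graph.
    for pair in list(strength):
        strength[pair] -= DECAY
        if strength[pair] <= 0:
            del strength[pair]
    # Assumed order: the increment is applied after the decay.
    for editors in pages_edited:
        for a, b in combinations(sorted(set(editors)), 2):
            key = frozenset((a, b))
            strength[key] = strength.get(key, 0) + INCREMENT


def neighborhoods(strength, link_threshold):
    """Equation (1): Gamma(x) for every node that has a social link."""
    gamma = {}
    for pair, s in strength.items():
        if s >= link_threshold:  # assumption: "at or above" counts as linked
            a, b = tuple(pair)
            gamma.setdefault(a, set()).add(b)
            gamma.setdefault(b, set()).add(a)
    return gamma


def common_neighbors(gamma, x, y):
    """Equation (2)."""
    return len(gamma.get(x, set()) & gamma.get(y, set()))


def adamic_adar(gamma, x, y):
    """Equation (3); a common neighbor z of two distinct nodes has |Gamma(z)| >= 2."""
    common = gamma.get(x, set()) & gamma.get(y, set())
    return sum(1.0 / math.log(len(gamma[z])) for z in common)


def jaccard(gamma, x, y):
    """Equation (4); defined as 0 here when both neighborhoods are empty."""
    gx, gy = gamma.get(x, set()), gamma.get(y, set())
    union = gx | gy
    return len(gx & gy) / len(union) if union else 0.0


def preferential_attachment(gamma, x, y):
    """Equation (5)."""
    return len(gamma.get(x, set())) * len(gamma.get(y, set()))
```

With these assumptions, a single joint edit followed by eight weeks without further collaboration removes the link again, which matches the eight-week lifetime of a chance link described in Section II-A.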

5) Weighted common neighbors: We have also added our own predictor, which is an extension of the common neighbors predictor, designed specifically for the dynamic collaboration graph. In order to utilize the present link strength information beyond whether it is above or below the link threshold, we incorporate the link strengths as weighting factors for the prediction score. The idea is that if two nodes have social links to common neighbors, then the probability of a link forming between the two nodes can be expected to be proportional to the strength of their present social links to these common neighbors. The prediction score is defined as:

$$\mathrm{score}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} S(x, z)\, S(z, y) \tag{6}$$

where S(x, y) denotes the current strength of the link between endpoints x and y.

C. Binary predictions

Although link predictor functions give numeric score values, they are treated as binary predictors, which means that they give either a positive or a negative prediction (i.e. the link will form, or not), and they are compared to binary outcomes (i.e. the link actually formed, or not). This is done by setting threshold parameters for both the predictions and the social links. If the prediction score is above the prediction threshold, it is a positive prediction (social link predicted to form); otherwise, it is a negative prediction (social link predicted not to form). If the link strength in the collaboration graph is above the link threshold, it is a positive (social) link; otherwise it is a negative link (no social link). The prediction thresholds are different for each prediction model. Since the link strength is time-dependent, the existence of social links is also time-dependent, with the same time resolution as the dynamic graph.

D. Predictor analysis

After setting the link and prediction threshold parameters, the predictor can be applied at a given time step of the dynamic collaboration graph, at any given link. A nonzero prediction score corresponds to an actual prediction, which is classified as positive or negative using the prediction threshold, and compared to the actual future state of the dynamic graph, with a given time ∆T between the present and the future. One prediction with the corresponding actual outcome is considered one sample.

The samples are collected over the time period of the first 150 weeks of Wikipedia editing records, for every possible link in each time step. For their analysis, we use statistical tools borrowed from signal detection theory and predictive analytics. Since we have a binary classification of predictions and outcomes, we can use the confusion matrix (Fig. 1) to accumulate the samples. This is a 2 × 2 matrix, showing the number of samples that have fallen into each of the four possible outcomes.

Fig. 1: Confusion matrix of a binary predictor. It contains the number of samples corresponding to each possible outcome.

From this table, we can derive a number of statistics, where TP, FP, TN, and FN denote the numbers of true positive, false positive, true negative, and false negative samples:

• sensitivity = TP / (TP + FN)
• specificity = TN / (FP + TN)
• precision = TP / (TP + FP)
• negative predictive value = TN / (TN + FN)
• accuracy = (TP + TN) / (TP + TN + FP + FN)

These quantities can also be defined using conditional probabilities, which give further insight into their meaning:

• sensitivity = Pr(positive prediction | link will form)
• specificity = Pr(negative prediction | link will not form)
• precision = Pr(link will form | positive prediction)
• negative predictive value = Pr(link will not form | negative prediction)
• accuracy = Pr(prediction = outcome)

There is always a tradeoff between sensitivity and specificity, depending on the prediction threshold that we use. The ROC curves (Receiver Operating Characteristic, [11]) show this exactly, by plotting sensitivity against (1 − specificity). The advantage of these plots is that they directly visualize the screening capability of the predictor. A random guess would give a point along the diagonal line of this plot, whereas a perfect predictor would be the point at the top left corner (at coordinates (0, 1)), having maximum sensitivity (no false negatives) and maximum specificity (no false positives). We can plot ROC curves by computing the statistics at different prediction threshold parameters, and see which predictor (at which threshold parameter) gets closest to the maximum sensitivity/specificity point.

However, this only tells us the screening capability of a predictor. We also need to know its accuracy, the actual success rate of the predictor. More specifically, we need the precision (also known as the positive predictive value) and the negative predictive value, because we can expect a very large number of correct negative predictions, which would significantly influence the accuracy, while we are more interested in positive predictions (social link formations). Together with sensitivity and specificity, however, precision and accuracy are enough for a complete description; we do not need the negative predictive value as a separate quantity.

We will also use another measure of prediction quality, the F1 score. It is defined as:

$$F_1 = 2 \times \frac{\text{precision} \times \text{sensitivity}}{\text{precision} + \text{sensitivity}} \tag{7}$$
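The statistics of this subsection follow directly from the four confusion-matrix counts. The sketch below is illustrative only: the function names and the sample layout (pairs of booleans) are assumptions, and the counts in the example are made up to show the arithmetic, not taken from the paper's data.

```python
def confusion_counts(samples):
    """Accumulate (TP, FP, TN, FN) from (predicted_positive, link_formed) pairs."""
    tp = fp = tn = fn = 0
    for predicted, actual in samples:
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and not actual:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def statistics(tp, fp, tn, fn):
    """Return the derived statistics; undefined ratios are reported as NaN."""
    def ratio(num, den):
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": ratio(tn, fp + tn),
        "precision": precision,
        "negative_predictive_value": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Made-up counts: many correct negatives inflate accuracy while F1 stays modest.
stats = statistics(tp=6, fp=4, tn=85, fn=5)
```

In this example the accuracy is 0.91 although the precision is only 0.6 and the sensitivity 6/11, which illustrates why accuracy alone is a poor guide when negative outcomes dominate.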

The F1 score values sensitivity and precision equally, so it gives an overall measure of predictor performance.

E. Prediction sampling algorithm

Three predictors, namely the common neighbors, Adamic/Adar, and Jaccard's coefficient predictors, need to enumerate all common neighbors of all pairs of nodes in order to generate all predictions at a given time step. The naive solution would be to loop over all possible node pairs and compute neighborhood intersections. If the number of nodes and the average degree of a node are denoted by N and D, respectively, then the expected run time of this solution would be $O(N^2 D)$, assuming adjacency list storage for the graph and a hash set for computing the neighborhood intersection. Since the number of nodes is over three million, this is not feasible.

A better solution is to focus on the common neighbors themselves, and take advantage of the sparseness of the graph. There are many node pairs which do not share any common neighbors; there is no prediction for them at all, so we should not include them in the enumeration. Instead, we enumerate the nodes only once, and look at their neighborhood: any pair of edges incident on a given node, each having a strength of at least the link threshold, gives a prediction for the link connecting the endpoints of those edges. Therefore, we only need to find all triangles centered on a given node. The complexity of this enumeration is only $O(N D^2)$, and assuming that the graph is sparse at any given moment, such that $D < \sqrt{N}$, this is better than $O(N^2)$. The pseudocode for this method is given in Algorithm 1, where fraction(i, x, y) is defined according to the given prediction model; it gives fractional scores based on nodes x and y and their common neighbor i. For example, fraction(i, x, y) = 1 regardless of parameters for the common neighbors method; fraction(i, x, y) = 1/Log(Length(L)) for the Adamic/Adar predictor; and fraction(i, x, y) = 1/|Neighbors(x) ∪ Neighbors(y)| for Jaccard's coefficient.

Algorithm 1: Score by common neighbors

    for all node i in graph G do
        L := LIST
        for all node j in neighbors of i do
            if strength(i, j) ≥ linkThreshold then
                add j to L
            end if
        end for
        if Length(L) ≥ 2 then
            for j := 1 to Length(L) do
                for k := j + 1 to Length(L) do
                    score(L(j), L(k)) += fraction(i, L(j), L(k))
                end for
            end for
        end if
    end for

To make the comparisons between predictions and the actual future, we do not store the entire dynamic graph of collaborations; it would require too much memory. Instead, we store two snapshots of the graph: one for the "present" time step, and one ∆T in the future. Both instances are updated simultaneously using the input data of page edits, applying the same update rules at every time step. They are both initialized from the same state (an empty graph), but the future instance is advanced by ∆T before the predictions and comparisons begin.

III. RESULTS

A. Static graph properties

As a preliminary analysis, static collaboration graphs were also analyzed. In these graphs all the page editing records are integrated over the entire input data timespan (roughly ten years). There are 3.1 million nodes in these graphs, the total number of distinct editors in the dataset. Links are present between editors if they edited a total of at least k pages together, where k is a threshold parameter. Figure 2 shows the degree distributions of these graphs. We were interested in whether we could find a scale-free degree distribution, but we found that nodes with small degrees follow a different scaling than high-degree nodes. This may suggest that the very active editors (high-degree nodes) are driven by a different social process than the rest of the editors (low-degree nodes).

[Figure 2: log-log plot of node count against degree of nodes, one series for each of k = 4, 5, 6, 7, 8; the fitted slopes marked on the plot are −0.34, −0.30, −2.53, and −2.80.]

Fig. 2: Histogram of node degrees in the static collaboration graphs, integrated over the time of the entire dataset. The degree distributions were logarithmically binned, and the bins are normalized by their size, such that the histogram is proportional to the original degree distribution. Lines are fitted to different segments of the distributions; the slopes of these lines are indicated by the numbers on the figure. Values of k are the minimum number of pages that two editors edited together.

B. Parameter space mapping

For a complete analysis of the selected predictors, we have to consider all input parameters that define the prediction. These are the time ∆T between the present, when the prediction is made, and the time in the future, for which the prediction is made; the link threshold, the edge strength above which the edge is considered a social link; and the prediction threshold of the score, above which a prediction is considered positive and below which it is considered negative. To map this three-dimensional parameter space, we use the following ranges of parameter values:

• ∆T ∈ {1, 2, 3, 4, 5, 6, 7, 8} (weeks)
• link threshold ∈ {30, 60, 90, 120, 150}

The prediction threshold is different for each predictor; its range must be selected such that the tradeoff between sensitivity and specificity is measurable.

The analysis shows that for all predictors, the link threshold and ∆T parameters have very little influence on the quality of the predictor. We use this to simplify the display of the results. To display the dependence on the ∆T and link threshold values, we use contour plots that show the achievable maximum sensitivity (also, maximum specificity), maximum accuracy, and maximum F1 values. These maxima were found by numerically scanning the range of the prediction threshold values, for the given ∆T and link thresholds. These plots are organized into a table of figures, Fig. 3.

[Figure 3: grid of contour plots, one row per predictor and one column each for maximum sensitivity, maximum accuracy, and maximum F1; horizontal axis ∆T (weeks), vertical axis link threshold. Legend ranges per panel:]

• Adamic/Adar: sensitivity 0.819–0.862, accuracy 0.925–0.952, F1 0.605–0.629
• Weighted common neighbors: sensitivity 0.922–0.978, accuracy 0.990–0.998, F1 0.576–0.648
• Common neighbors: sensitivity 0.806–0.854, accuracy 0.922–0.950, F1 0.582–0.608
• Jaccard's coefficient: sensitivity 0.651–0.742, accuracy 0.879–0.929, F1 0.343–0.417
• Preferential attachment: sensitivity 0.771–0.829, accuracy 0.908–0.937, F1 0.514–0.599

Fig. 3: Maximum achievable sensitivity, accuracy, and F1 values, as a function of the time ∆T (between prediction and actual outcome) and the link threshold parameter, for each predictor.

C. Statistical analysis

The statistical behaviour of each predictor is shown concisely on ROC-space plots. For every analysis (for every predictor) two parameters are fixed: ∆T = 3 weeks and link threshold = 90. Then, by running prediction analyses for a range of prediction thresholds (different for each predictor), the measured values of sensitivity, accuracy, precision, and F1 are plotted against (1 − specificity). These points joined together make continuous curves.

1) Prediction using common neighbors: The measured statistical quantities of the common neighbors predictor are shown in Fig. 4. The following prediction threshold values were used:

$$\text{prediction threshold} \in \{10^{0.1 i}\}_{i=0}^{30} \tag{8}$$

Higher values correspond to higher precision and lower sensitivity, but the relation is nonlinear. The curves do not extend beyond a specificity value of 0.45, because this corresponds to the prediction threshold (number of common neighbors) = 1. This is the minimum possible value, so we cannot get statistics beyond this limit.

The ROC curve shows a good level of sensitivity, but the maximum sensitivity and specificity point (the one closest to the top left corner), which is at 83% sensitivity, corresponds to very low precision. On the other hand, we can achieve very high precision, almost 100%, by using a very high threshold, but in this case the sensitivity will be very low (i.e., many false negatives). Overall, the best quality (maximum F1 value) is achieved at prediction threshold = 15.8, where the precision and sensitivity are both 60%.

Fig. 4: Statistical behaviour of the common neighbors predictor, shown on the ROC-space. (Curves: sensitivity, accuracy, precision, and F1, plotted as probability against 1 − specificity.)

2) Prediction using Adamic/Adar: The analysis of this predictor is shown in Fig. 5. The range of threshold parameters used:

$$\text{prediction threshold} \in \{10^{0.1 i}\}_{i=-20}^{20} \tag{9}$$

Higher values again correspond to higher precision and lower sensitivity. In this case, however, the range of thresholds allows us to find statistics across the entire ROC-space, so the curves on the figure fill the entire (0, 1) range of specificity.

We can see a somewhat stronger ROC curve compared to the simple common neighbors predictor. Maximum sensitivity and specificity is at 85%, but again, this corresponds to low precision. The overall best prediction is achieved at prediction threshold = 3.98, where both precision and sensitivity are at 62%.

Fig. 5: Statistical behaviour of the Adamic/Adar predictor, shown on the ROC-space. (Curves: sensitivity, accuracy, precision, and F1, plotted as probability against 1 − specificity.)

3) Prediction using Jaccard's coefficient: The ROC-space plot for this predictor is shown in Fig. 6. The range of threshold parameters used to generate the plot:

$$\text{prediction threshold} \in \{10^{0.1 i}\}_{i=-40}^{0} \tag{10}$$

Note that the maximum possible value of this coefficient is 1, which corresponds to the situation where the two nodes have only common neighbors.

The statistics show that this predictor has much worse characteristics than the previous ones. Maximum sensitivity and specificity is only 70%. The precision and accuracy values do not reach 100% as the prediction threshold increases: the maximum possible accuracy is 90% at threshold = 0.50, and the maximum precision is 48%, found at the same threshold value. The best overall performance is found at threshold = 0.25, where precision and sensitivity are 38%.

4) Prediction using preferential attachment: The analysis of this predictor is shown in Fig. 7. The range of threshold parameters:

$$\text{prediction threshold} \in \{10^{0.25 i}\}_{i=0}^{16} \tag{11}$$
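The logarithmic threshold grids used in this subsection, and the scan over them that yields the ROC points and the maximum F1, can be sketched as follows. This is an illustrative sketch, not the author's implementation: the function names and the sample layout (score, outcome) are hypothetical, a score at or above the threshold is assumed to count as a positive prediction, and undefined ratios are reported as 0.

```python
def log_threshold_grid(exponent_step, i_min, i_max):
    """Thresholds 10**(exponent_step * i) for i = i_min, ..., i_max."""
    return [10.0 ** (exponent_step * i) for i in range(i_min, i_max + 1)]


def roc_scan(samples, thresholds):
    """Compute one ROC-space row per threshold.

    samples: list of (score, link_formed) pairs (hypothetical layout).
    """
    rows = []
    for t in thresholds:
        # Assumption: score >= threshold is a positive prediction.
        tp = sum(1 for s, formed in samples if s >= t and formed)
        fp = sum(1 for s, formed in samples if s >= t and not formed)
        fn = sum(1 for s, formed in samples if s < t and formed)
        tn = sum(1 for s, formed in samples if s < t and not formed)
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + sensitivity
        rows.append({
            "threshold": t,
            "one_minus_specificity": 1.0 - specificity,
            "sensitivity": sensitivity,
            "precision": precision,
            "f1": 2 * precision * sensitivity / denom if denom else 0.0,
        })
    return rows


def best_f1(rows):
    """Row with the highest F1 (first one wins on ties)."""
    return max(rows, key=lambda row: row["f1"])


# Grid of equation (11) and a tiny made-up sample set.
grid = log_threshold_grid(0.25, 0, 16)
toy = [(1, False), (2, False), (5, True), (20, True),
       (3, False), (50, True), (8, False)]
best = best_f1(roc_scan(toy, grid))
```

The same helper reproduces the other grids, for example `log_threshold_grid(0.1, 0, 30)` for equation (8). Plotting `sensitivity` against `one_minus_specificity` for all rows gives the ROC curve of the toy data.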

Note that the lowest possible value of the preferential attachment score is 1, when both nodes have only one neighbor. Higher values correspond to higher precision and lower sensitivity. The curves do not extend beyond a specificity of 0.1, because the threshold value = 1 corresponds to this point, and it cannot be smaller.

The ROC curve is somewhat better than for Jaccard's coefficient, but worse than for common neighbors: the maximum sensitivity (and specificity) is 81%, achieved at prediction threshold = 30. Precision can reach nearly 100%, but as with the other predictors, this would result in very low sensitivity. The overall best prediction is achieved at prediction threshold = 100, where the sensitivity and precision are both 54%.

Fig. 6: Statistical behaviour of the Jaccard's coefficient predictor, shown on the ROC-space. (Curves: sensitivity, accuracy, precision, and F1, plotted as probability against 1 − specificity.)

Fig. 7: Statistical behaviour of the preferential attachment predictor, shown on the ROC-space. (Curves: sensitivity, accuracy, precision, and F1, plotted as probability against 1 − specificity.)

5) Prediction using weighted common neighbors: Finally, the analysis of our proposed predictor is shown in Fig. 8. The following range of threshold parameters was used:

$$\text{prediction threshold} \in \{10^{0.25 i}\}_{i=4}^{28} \tag{12}$$

Note that the lowest possible prediction value now depends on the link threshold: prediction value ≥ (link threshold)². It was observed that some links manage to gain strength on the order of thousands, so the highest prediction values usually range in the millions.

The analysis shows that this method has superior screening capability. Its maximum sensitivity (with maximum specificity) exceeds that of all other predictors: 97%, when prediction threshold = 150000. It also has a very good overall performance: 61% is the maximal precision and sensitivity, at prediction threshold = $1.7 \times 10^6$.

Fig. 8: Statistical behaviour of the weighted common neighbors predictor, shown on the ROC-space. (Curves: sensitivity, accuracy, precision, and F1, plotted as probability against 1 − specificity.)

D. Comparison of predictors

There are three statistical aspects for comparing the predictors. We can strive for maximum accuracy, if we want to be most correct in our predictions for the future. Alternatively, we may look for the best screening method, which is able to identify most of the positive and negative link formations correctly for the future. In other words, we can strive for maximum sensitivity and specificity. The third option is the golden mean of the two previous goals: if we need a predictor that is both highly sensitive and highly precise, then we look for the maximum achievable F1 score. According to these cases, Figs. 9, 10 and 11 compare the examined predictors to each other.

Fig. 9: The maximum achievable sensitivity and specificity values with the predictors. (Bar chart, scale 0–100; bars listed top to bottom: weighted common neighbors, Adamic/Adar, common neighbors, preferential attachment, Jaccard's coefficient.)

Fig. 10: The maximum achievable prediction accuracy with the predictors. (Bar chart, scale 0–100; bars listed top to bottom: weighted common neighbors, Adamic/Adar, common neighbors, preferential attachment, Jaccard's coefficient.)

Fig. 11: The maximum achievable overall quality (F1 score) with the predictors. (Bar chart, scale 0.0–0.6; bars listed top to bottom: Adamic/Adar, weighted common neighbors, common neighbors, preferential attachment, Jaccard's coefficient.)

For maximum sensitivity, the weighted common neighbors predictor is clearly the best method. This also means that the actual strength of existing social links is indeed very important in the similarity computation between nodes, and that this information is more precise at every given time step if information about the past is included.

In the case of maximum accuracy, we must be careful to interpret Fig. 10 correctly. The weighted common neighbors predictor is theoretically capable of achieving nearly 100% accuracy, but this is because it gives a very large number of correct negative predictions (inherently, because of the sparseness of the network), while the rate of correct positive predictions is diminished. Note, however, that the next best is the Adamic/Adar method, which has sufficiently high precision at its highest accuracy value.

When considering maximum overall performance, the best F1 score is attained by the Adamic/Adar method. However, it is notable that the weighted common neighbors predictor is also very close, and in fact it can achieve even higher scores when the link threshold is higher and ∆T is smaller; see Fig. 3.

IV. CONCLUSION

If we consider that all predictors have a predicting capability well beyond a random guess, it is clear that there is a strong social process between the editors of Wikipedia. This is, however, a complex process, which involves many human factors. The best predictors shown here can capture most of these factors by using careful assumptions about the strength of social links between editors, derived from nothing but collaborations on edited pages.

When looking for a predictor, one must always be clear about the goal one wishes to achieve. We have seen that maximum sensitivity needs different parameters than maximum accuracy. Computation of these statistical properties revealed that, among the methods examined in this paper, the best screening method is the weighted common neighbors predictor, and the most accurate predictor is the Adamic/Adar predictor. By overall quality, Adamic/Adar gives the most precise prediction with the highest sensitivity.

ACKNOWLEDGMENT

The author would like to thank Professor Malik Magdon-Ismail (Computer Science department, Rensselaer Polytechnic Institute) for providing the Wikipedia dataset, and for valuable lectures on computational analysis of social processes.

REFERENCES

[1] A.-L. Barabási, R. Albert, Emergence of scaling in random networks, Science, 286(5439), 509–512, 1999.
[2] M. Granovetter, The Strength of Weak Ties, American Journal of Sociology, 78(6), 1360–1380, 1973.
[3] E. M. Jin, M. Girvan, M. E. J. Newman, The structure of growing social networks, Physical Review E, 64, 046132, 2001.
[4] T. Tylenda, R. Angelova, S. Bedathur, Towards Time-aware Link Prediction in Evolving Social Networks, Proceedings of the 3rd Workshop on Social Network Mining and Analysis, 2009.
[5] Z. Huang, D. Lin, The Time-Series Link Prediction Problem with Applications in Communication Surveillance, INFORMS Journal on Computing, 2008.
[6] H. H. Song, T. W. Cho, V. Dave, Y. Zhang, L. Qiu, Scalable Proximity Estimation and Link Prediction in Online Social Networks, Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, 2009.
[7] D. Wang, D. Pedreschi, C. Song, F. Giannotti, A.-L. Barabási, Human Mobility, Social Ties, and Link Prediction, Proceedings of the 17th ACM SIGKDD intl. conf. on Knowledge discovery and data mining, 2011.
[8] A.-L. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, T. Vicsek, Evolution of the social network of scientific collaboration, Physica A, 311(3), 590–614, 2002.
[9] D. Liben-Nowell, J. Kleinberg, The link-prediction problem for social networks, J. Am. Soc. Inf. Sci., 58(7), 1019–1031, 2007.
[10] L. A. Adamic, E. Adar, Friends and neighbors on the web, Social Networks, 25(3), 211–230, 2003.
[11] K. H. Zou, A. J. O'Malley, L. Mauri, Receiver-operating characteristic analysis for evaluating diagnostic tests and predictive models, Circulation, 115(5), 654–657, 2007.
