by
Suman Kundu
Center for Soft Computing Research, Indian Statistical Institute,
203 B. T. Road, Kolkata - 700108
A network of relationships between social entities (actors) like people, movies and articles.
Sociologists prefer to call this representation Sociogram
Network data, sometime represented in two-way matrices, called sociomatrices
In a sociomatrix, rows $\rightarrow$ senders and column $\rightarrow$ receivers
There are several ways to model social network data viz. game theoretic modeling, statistical modeling,
agent based modeling.
However, highly overlapping neighborhoods in real life big data social networks enforce uncertainties in decision making which demands better modeling technique
One of the major research areas in the field is
It is effectively used in viral marketing and finding top stories in news networks
Were restrictive either in terms of performance or execution time, specially for large scale social networks
E.g., the centrality based heuristic methods are superior to greedy methods in terms of execution time,
whereas the reverse is the true in terms of performance
Another important task is to identify virtual groups in social networks, known as
Solution helps in optimizing the Internet infrastructure, boost sell by recommending correct products
Partitioning of the whole network $\rightarrow$ a node belongs to only one community
Overlapping community structure where a node belongs to multiple communities with equal membership
However, a node may also belong to more than one community with different degrees of associations
Addresses the target set selection and community detection in social networks
With the objective $\rightarrow$ improve the performance both in terms of quality and execution time
Concerns with formulating a new knowledge representation scheme to model a social network for better uncertainty handling
Objective $\rightarrow$ demonstrate the effectiveness of granular computing using fuzzy sets for modeling social networks; thereby providing a new avenue of analysis.
Proposed two new centrality measures (diffusion degree and maximum influence degree)
Calculated the upper bound of a node's influence
Pal, S. K., Kundu, S., & Murthy, C. A. (2014). Centrality Measures, Upper Bound, and Influence Maximization in Large Scale Directed Social Networks. Fundamenta Informaticae, 130(3), 317–342.
Kundu, S., Murthy, C. A., & Pal, S. K. (2011). A New Centrality Measure for Influence Maximization in Social Networks. In Proc. of 4th International Conference on Pattern Recognition and Machine Intelligence (PReMI’11) (pp. 242–247). Moscow: LNCS vol. 6744, Springer-Verlag.
improved the quality of the input list using a novel deprecation based greedy strategy
Kundu, S. & Pal, S. K. (September 2015). Deprecation based Greedy Strategy for Target Set Selection in Large Scale Social Networks. Information Sciences. 316, 107-122.
Proposed a novel Fuzzy Granular model to represent Social Network data
Redefined existing social network measures for FGSN
Defined "Entropy of FGSN"
Kundu, S. & Pal, S. K. (September 2015). FGSN: Fuzzy Granular Social Networks - Model and Applications. Information Sciences, 314, 100-117.
Demonstrated the effectiveness of FGSN for Target Set Selection and Community Detection problems
Analyzed the scalability of the model
Pal, S. K. & Kundu, S. Granular social network: model and applications. in Handbook of Big Data Technologies, A. Zomaya and S. Sakr (eds.), Cham: Springer International Publishing, 2017, 617–651.
Proposed a novel community detection algorithm which finds fuzzy-rough communities in networks
Formulated a new index Normalized Fuzzy Mutual Information to compare two sets of community structures
Kundu, S. & Pal, S. K. (December 2015). Fuzzy-Rough Community in Social Networks. Pattern Recognition Letters, 67 Part 2, 145-152.
Relative importance of a vertex within a graph, e.g., how important a person is within a social network
where $\sigma(K)$ returns the expected number of active nodes for a given set of initial active nodes $K$
Centrality based heuristic algorithms is used to select target set $\rightarrow$ Select top $k$ nodes based on their centrality value
Low computational cost but poor performance
Nodes | In-Degree |
---|---|
1, 2, 4, 5, 8, 11, 12, 13, 15 | 0 |
9 | 1 |
3, 6, 14 | 2 |
7 | 3 |
10 | 4 |
Highest degree node is 10
we included the contribution of neighbors as well as the parameter of the diffusion
where $\lambda_{u, v}$ is the propagation probability of link $e(u, v)$ and $\Gamma(v)$ is the set of followers of node $v$
Mathematically, the influence of a node $v$ at distance $n$ is any subset of,
$$\xi_v^n = \{u\in V|a^n_{u, v}>0, a^n_{u, v}\in A^n\}.$$
where $A$ is the adjucency matrix
So, the upper bound of influence for node $v$ at distance $n$ is $\alpha_n^\dagger(v) = |\xi_v^n|$
$$C_{MID}(v) = \sum_{n=1}^D \alpha_n^\dagger(v)$$
where $D$ is the diameter of the network.
With uniform propagation probability setup
Selection does not depend upon nodes already selected.
Hence, it may happen that the actual influence of including a node is much less than that of the others yet to consider.
the algorithm works when the influence function is
Good news is most of the influence functions for information diffusion are monotonic and sub-modular.
A granule is a clump of objects (points), in the universe of discourse, drawn together, e.g., by indistinguishability, similarity, proximity, or functionality
A social network is viewed as a collection of relations between social actors and their interactions
These actors form closely operative groups which are often indistinguishable in the process of problem solving
This resembles the concept of granules
The basic concepts of "conceptual similarities" 'between nodes', 'cluster of nodes', 'relation between nodes and their interactions' etc. do not lend themselves to precise definition
i.e., they have ill-define boundaries
So, it is appropriate and natural that a social network is represented in Fuzzy Granulation framework
FGSN is represented by a triple $\mathcal{S} = (\mathcal{C}, \mathcal{V}, \mathcal{G})$ where,
A granule $g\in \mathcal{G}$ around a node ($c\in \mathcal{C}$) is defined by fuzzy membership values
Properties of the granule shall describe node $c$'s network properties
In this work we use $$\begin{aligned} \mu_c(v) & = & 0 & \text{ for } d(c, v) > r \\ & & \dfrac{1}{1 + d(c, v)} & \text{ otherwise} \end{aligned}$$
Further, we normalized membership values
$$\tilde{\mu_c}(v) = \dfrac{\mu_c(v)}{\sum_{i\in \mathcal{C}} \mu_i(v)} \text{ such that } \sum_{i\in \mathcal{C}} \tilde{\mu_i}(v) = 1$$The degree of a node in FGSN representation, we call it Granular Degree, is defined by the cardinality of the granule centered at the node
$$\mathcal{D}(c) = |A_c| = \sum_{v\in\mathcal{V}} \tilde{\mu_c}(v)$$For a FGSN, the embeddedness of any pair defines how much a granule is embedded inside the other
$$\mathcal{E}(a, b) = |A_a\cap A_b| = \sum_{v\in\mathcal{V}} \min(\tilde{\mu_a}(v), \tilde{\mu_b}(v))$$In the new knowledge representation of FGSN we would like to find out densely connected groups, i.e., communities
Identify the granules with dense neighbourhood (whose granular degree exceeds a threshold) and
Merge them if they are nearby (merging dense regions)
For a given $\theta$, a granule $A_p$ is said to be a $\theta$-core, if the Granular Degree of $A_p$ is greater or equals to $\theta$, i.e., $\mathcal{D}(A_p) \geq \theta$
The process needs to search for $\theta$-cores belonging to the same community
we call them 'Community Reachable Cores'
Neighborhood of a granule $A_c$ is the set of all granules whose centers lie in support set of $A_c$
Granules $A_p$ and $A_q$ are directly community reachable $\theta$-cores, if $A_p$ and $A_q$ are $\theta$-cores and $A_p$ is in the neighborhood of $A_q$
For a given social network $\mathcal{S = (C, V, G)},
\theta,$ and $\epsilon$,
a community $\mathbb{C}$ is a non empty subset of granules $\mathcal{G}$ satisfying the following conditions:
here $\tilde{\mathcal{E}} $ is the normalized granular embeddedness defined as $\tilde{\mathcal{E}}(A_p, A_q) = \dfrac{|A_p\cap A_q|}{|A_p\cup A_q|}$
A node $p$ is said to be an orphan if it is not a member of any community.
$\theta$ controls the density of the detected communities
As $\theta$ increases, possibility of detecting more dense community increases
$\theta$ may be referred as the density co-efficient of community detection
$\epsilon$ is the coupling co-efficient of the community
higher value of coupling $\rightarrow$ loosely connected groups get merged into single community
lower coupling $\rightarrow$ may result in more number of communities than actual
Communities, thus identified, have fuzzy boundaries.
This communities can be viewed in terms of lower and upper approximations of rough set theory.
Nodes belong to the lower approximate region of a community $\rightarrow$ nodes definitely belong to.
And nodes in boundary (upper - lower) region $\rightarrow$ nodes possibly belong to.
It may be appropriate to assign fuzzy membership (0,1) to those in the boundary regions and unity (1) value to those in the lower approximate regions
where $H(\mathbb{C}^X)$ is the information contain in fuzzy partition matrix $\mathbb{C}^X$ and $H(\mathbb{C}^X|\mathbb{C}^Y)$ is the conditional information measure which denotes the information contain in $\mathbb{C}^X$ for given fuzzy partition matrix $\mathbb{C}^Y$.
Higher the value of NFMI larger the similarity (or relevance) between $\mathbb{C}^X$ and $\mathbb{C}^Y$
In an ideal case, i.e., when two community structures are identical NFMI will be $1$.
When two community structures are complement to each other NFMI would be $0$
NFMI can be used to measure the goodness of a community structure $\mathbb{C}^X$, given the actual one $\mathbb{C}^Y$
Network size = 1001, Min community size = 150, Max community size = 250. $\eta$: Avg ratio of external degree with total degree, $O_n$: Fraction of overlapping nodes
($O_n=0.15$)
($\eta=0.4$)
The FGSN model that we have provided is first of its kind
Dillen, N. B., & Chakraborty, A. (2016). Modularity-Based Community Detection in Fuzzy Granular Social Networks. In Proc. of the International Congress on Information and Communication Technology: ICICT 2015, Volume 1 (pp. 577–585). Singapore: Springer Singapore.
Kumar, P., Gupta, S., & Bhasker, B. (2017). An upper approximation based community detection algorithm for complex networks. Decision Support Systems, 96(April), 103–118.
As $m>k$, the worst case solution needs $O((m+k)^2)$ time, in terms of number of evolution
Used to see the general statistics of the FGSN
The granular degree shows similar pattern with degree distribution
whereas, granular betweenness shows differences in their distributions
Network size = 1001, Min community size = 150, Max community size = 250. $\eta$: Avg ratio of external degree with total degree, $O_n$: Fraction of overlapping nodes
($O_n=0.15$)
($\eta=0.4$)
Time taken by CELF as compare with slowest DGS for different data sets
Data Set | CELF | DGS |
31.89 min | 18.02 min | |
EUEmail | 7.9 hours | 7.18 min |
Slashdot | 2 hours | 10.52 min |
OCLinks | 6.58 min | 37.98 sec |
USAirport | 12.39 min | 1.14 min |
CELF need $O(k*|V|)$ time
whereas, DGS need $O((m+k)^2)$ time; $k < m < v$
DGS converges quickly leading to very low computational time as compare to CELF
Calculating Shapley values is a tedious process.
As per the authors "the direct, naive approach for computing the Shapley values of the players is not a tractable one."
Shapley values are approximated.
average marginal contributions of a node in $t$ number of randomly selected permutations of nodes
Each permutation defines a unique order of nodes
the choice of $t$
if $t > k$ then it takes more time than Greedy approach
and for smaller values of $t$ in large scale networks its correctness is not proved.
* Choosing some random permutation for $|V|$ nodes is an error prone process, specially for large value of $|V|$
Also, we proved the correctness of DGS theoretically
The work of Taurob and Tucker (2015) deals with sentiment analysis using social network data and involves natural language processing.
It uses tweeter messages as document and existing information retrieval algorithms are used to rank the product features.
The problem addressed is not related to the scope of the thesis
Journals
Book Chapter
Conference