Granular Model for Social Networks, Target Set Selection and Fuzzy-Rough Community Detection

by
Suman Kundu

Center for Soft Computing Research, Indian Statistical Institute,
203 B. T. Road, Kolkata - 700108

Outline

  • Background and Objective of the Thesis
  • Chapter Highlights and Publications
  • Centrality Measures, Upper Bound & Target Set Selection
  • Deprecation Based Greedy Strategy
  • Fuzzy Granular Social Network - Model and Applications
  • Fuzzy Rough Community Detection
  • Discussions on Examiner's Queries/Doubts
  • Publications

Background and Objective of the Thesis

Social Network

A network of relationships between social entities (actors) like people, movies and articles.

  • Represented (usually) with graphs
    • actors $\rightarrow$ nodes
    • relationships $\rightarrow$ edges
  • Relations $\rightarrow$ directed or undirected

Sociologists prefer to call this representation Sociogram

Sociomatrices

Network data, sometime represented in two-way matrices, called sociomatrices

In a sociomatrix, rows $\rightarrow$ senders and column $\rightarrow$ receivers

Several Other Ways

There are several ways to model social network data viz. game theoretic modeling, statistical modeling,
agent based modeling.

However, highly overlapping neighborhoods in real life big data social networks enforce uncertainties in decision making which demands better modeling technique

One of the major research areas in the field is

Target Set Selection (TSS)

It is effectively used in viral marketing and finding top stories in news networks

Earlier attempts to solve TSS

Were restrictive either in terms of performance or execution time, specially for large scale social networks

E.g., the centrality based heuristic methods are superior to greedy methods in terms of execution time,

whereas the reverse is the true in terms of performance

Another important task is to identify virtual groups in social networks, known as

Community Detection

Solution helps in optimizing the Internet infrastructure, boost sell by recommending correct products

For community structure in network, two views exist

Partitioning of the whole network $\rightarrow$ a node belongs to only one community

Overlapping community structure where a node belongs to multiple communities with equal membership

However, a node may also belong to more than one community with different degrees of associations

The Current Thesis

Addresses the target set selection and community detection in social networks

With the objective $\rightarrow$ improve the performance both in terms of quality and execution time

The Current Thesis

Concerns with formulating a new knowledge representation scheme to model a social network for better uncertainty handling

Objective $\rightarrow$ demonstrate the effectiveness of granular computing using fuzzy sets for modeling social networks; thereby providing a new avenue of analysis.

Chapters, Highlights and Publications

Chapter 1: Centrality Measures, Upper Bound, and Target Set Selection

Proposed two new centrality measures (diffusion degree and maximum influence degree)

Calculated the upper bound of a node's influence

Pal, S. K., Kundu, S., & Murthy, C. A. (2014). Centrality Measures, Upper Bound, and Influence Maximization in Large Scale Directed Social Networks. Fundamenta Informaticae, 130(3), 317–342.

Kundu, S., Murthy, C. A., & Pal, S. K. (2011). A New Centrality Measure for Influence Maximization in Social Networks. In Proc. of 4th International Conference on Pattern Recognition and Machine Intelligence (PReMI’11) (pp. 242–247). Moscow: LNCS vol. 6744, Springer-Verlag.

Chapter 2: Deprecation Based Greedy Strategy for Target Set Selection

improved the quality of the input list using a novel deprecation based greedy strategy

Kundu, S. & Pal, S. K. (September 2015). Deprecation based Greedy Strategy for Target Set Selection in Large Scale Social Networks. Information Sciences. 316, 107-122.

Chapter 3: Fuzzy Granular Social Networks (FGSN)

Proposed a novel Fuzzy Granular model to represent Social Network data

Redefined existing social network measures for FGSN

Defined "Entropy of FGSN"

Kundu, S. & Pal, S. K. (September 2015). FGSN: Fuzzy Granular Social Networks - Model and Applications. Information Sciences, 314, 100-117.

Chapter 4: Applications of FGSN

Demonstrated the effectiveness of FGSN for Target Set Selection and Community Detection problems

Analyzed the scalability of the model

Pal, S. K. & Kundu, S. Granular social network: model and applications. in Handbook of Big Data Technologies, A. Zomaya and S. Sakr (eds.), Cham: Springer International Publishing, 2017, 617–651.

Chapter 5: Fuzzy-Rough Communities

Proposed a novel community detection algorithm which finds fuzzy-rough communities in networks

Formulated a new index Normalized Fuzzy Mutual Information to compare two sets of community structures

Kundu, S. & Pal, S. K. (December 2015). Fuzzy-Rough Community in Social Networks. Pattern Recognition Letters, 67 Part 2, 145-152.

Centrality Measures, Upper Bound and Target Set Selection

What is a centrality measure?

Relative importance of a vertex within a graph, e.g., how important a person is within a social network

TSS: A Maximization Problem

$$ \begin{aligned} & & \;{\begin{matrix}\\ \normalsize \text{max} \\ ^{K} \end{matrix}}\; & & \sigma(K) \\ & & \text{subject to} & & |K| = k, k > 0. \end{aligned} $$

where $\sigma(K)$ returns the expected number of active nodes for a given set of initial active nodes $K$

Traditionally,

Centrality based heuristic algorithms is used to select target set $\rightarrow$ Select top $k$ nodes based on their centrality value

Low computational cost but poor performance

Example

Nodes In-Degree
1, 2, 4, 5, 8, 11, 12, 13, 15 0
9 1
3, 6, 14 2
7 3
10 4

Highest degree node is 10

But the difusion process

In the proposed Diffusion Degree

we included the contribution of neighbors as well as the parameter of the diffusion

Diffusion Degree of $v$ is defined as

$$C_{DD}(v) = \sum_{u\in{\Gamma(v)}}{\{\lambda_{u, v} + (\lambda_{u, v} \times \sum_{i\in{\Gamma(u)}}{\lambda_{i, u}})\}}$$

where $\lambda_{u, v}$ is the propagation probability of link $e(u, v)$ and $\Gamma(v)$ is the set of followers of node $v$

Upper Bound of Influence

Mathematically, the influence of a node $v$ at distance $n$ is any subset of,

$$\xi_v^n = \{u\in V|a^n_{u, v}>0, a^n_{u, v}\in A^n\}.$$

where $A$ is the adjucency matrix

So, the upper bound of influence for node $v$ at distance $n$ is $\alpha_n^\dagger(v) = |\xi_v^n|$

Maximum Influence Degree (MID) of node $v$ is defined by

$$C_{MID}(v) = \sum_{n=1}^D \alpha_n^\dagger(v)$$

where $D$ is the diameter of the network.

Results for Twitter

With uniform propagation probability setup

With non-uniform propagation probability setup

Execution Time

Deprecation based Greedy Strategy for Target Set Selection

Issues with Centrality based
Heuristic Methods

Selection does not depend upon nodes already selected.

Hence, it may happen that the actual influence of including a node is much less than that of the others yet to consider.

The Algorithm

We proved that

the algorithm works when the influence function is

  • Monotonic: $$\sigma(S \cup \{v\}) \geq \sigma(S) \quad \forall v\in V \text{ and } \forall S\in 2^V$$
  • Sub-modular: $$\sigma(S \cup \{v\}) - \sigma(S) \geq \sigma(T \cup \{v\}) - \sigma(T)$$ $$\forall S \subseteq T \subseteq V \text{ and }\forall v\in V; S, T \in 2^V$$

Good news is most of the influence functions for information diffusion are monotonic and sub-modular.

Results

Results

DGS on DiDH: 18.025 min | CELF: 31.89 min

Comparative Results with Other Algo

Fuzzy Granular Social Network

What is Granule?

A granule is a clump of objects (points), in the universe of discourse, drawn together, e.g., by indistinguishability, similarity, proximity, or functionality

Why Granules?

A social network is viewed as a collection of relations between social actors and their interactions

These actors form closely operative groups which are often indistinguishable in the process of problem solving

This resembles the concept of granules

Why Fuzzy Granules?

The basic concepts of "conceptual similarities" 'between nodes', 'cluster of nodes', 'relation between nodes and their interactions' etc. do not lend themselves to precise definition

i.e., they have ill-define boundaries

So, it is appropriate and natural that a social network is represented in Fuzzy Granulation framework

Example: Zachary Karate Club Network

FGSN: Representation

FGSN is represented by a triple $\mathcal{S} = (\mathcal{C}, \mathcal{V}, \mathcal{G})$ where,

  • $\mathcal{V}$ is a finite set of nodes of the network
  • $\mathcal{C}\subseteq \mathcal{V}$ is a finite set of granule centers
  • $\mathcal{G}$ is the finite set of all Granules

Membership Function

A granule $g\in \mathcal{G}$ around a node ($c\in \mathcal{C}$) is defined by fuzzy membership values

Properties of the granule shall describe node $c$'s network properties

In this work we use $$\begin{aligned} \mu_c(v) & = & 0 & \text{ for } d(c, v) > r \\ & & \dfrac{1}{1 + d(c, v)} & \text{ otherwise} \end{aligned}$$

  • where $d(c, v)$ is the distance from $c$ to $v\in \mathcal{V}$
  • $r$ is the radius of the granule

Membership Function

Further, we normalized membership values

$$\tilde{\mu_c}(v) = \dfrac{\mu_c(v)}{\sum_{i\in \mathcal{C}} \mu_i(v)} \text{ such that } \sum_{i\in \mathcal{C}} \tilde{\mu_i}(v) = 1$$

Granular Degree of a Node

The degree of a node in FGSN representation, we call it Granular Degree, is defined by the cardinality of the granule centered at the node

$$\mathcal{D}(c) = |A_c| = \sum_{v\in\mathcal{V}} \tilde{\mu_c}(v)$$

Granular Embeddedness of a Pair

For a FGSN, the embeddedness of any pair defines how much a granule is embedded inside the other

$$\mathcal{E}(a, b) = |A_a\cap A_b| = \sum_{v\in\mathcal{V}} \min(\tilde{\mu_a}(v), \tilde{\mu_b}(v))$$

Application: Target Set Selection

Effect of $r$

Fuzzy-Rough Community Detection

Problem Statment

In the new knowledge representation of FGSN we would like to find out densely connected groups, i.e., communities

Key Idea of Finding Communities

Identify the granules with dense neighbourhood (whose granular degree exceeds a threshold) and

Merge them if they are nearby (merging dense regions)

$\theta$-cores and Community Detection

For a given $\theta$, a granule $A_p$ is said to be a $\theta$-core, if the Granular Degree of $A_p$ is greater or equals to $\theta$, i.e., $\mathcal{D}(A_p) \geq \theta$

A community may have multiple
$\theta$-cores

The process needs to search for $\theta$-cores belonging to the same community

we call them 'Community Reachable Cores'

Three ways in determining $\theta$-cores belonging to same community

  1. Directly community reachable cores
  2. Indirectly community reachable cores
  3. $r$-connected community reachable cores

Neighborhood of a Granule

Neighborhood of a granule $A_c$ is the set of all granules whose centers lie in support set of $A_c$

Directly Community Rechable
$\theta$-cores

Granules $A_p$ and $A_q$ are directly community reachable $\theta$-cores, if $A_p$ and $A_q$ are $\theta$-cores and $A_p$ is in the neighborhood of $A_q$

Indirectly Community Reachable
$\theta$-Cores

$r$-connected community reachable
$\theta$-Cores

Community

For a given social network $\mathcal{S = (C, V, G)}, \theta,$ and $\epsilon$,
a community $\mathbb{C}$ is a non empty subset of granules $\mathcal{G}$ satisfying the following conditions:

  • $\forall A_p, A_q \in{\mathbb{C}}$, $A_p$ and $A_q$ are community reachable cores
  • $\forall A_p\in\mathbb{C}, \tilde{\mathcal{E}}(A_p, \bigcup_{A_q\in\mathbb{C}\setminus A_p} A_q) > 1/\epsilon$

here $\tilde{\mathcal{E}} $ is the normalized granular embeddedness defined as $\tilde{\mathcal{E}}(A_p, A_q) = \dfrac{|A_p\cap A_q|}{|A_p\cup A_q|}$

Orphans

A node $p$ is said to be an orphan if it is not a member of any community.

Density Co-efficient

$\theta$ controls the density of the detected communities

As $\theta$ increases, possibility of detecting more dense community increases

$\theta$ may be referred as the density co-efficient of community detection

Coupling Co-efficient

$\epsilon$ is the coupling co-efficient of the community

higher value of coupling $\rightarrow$ loosely connected groups get merged into single community

lower coupling $\rightarrow$ may result in more number of communities than actual

Fuzzy-Rough Community

Communities, thus identified, have fuzzy boundaries.

This communities can be viewed in terms of lower and upper approximations of rough set theory.

Nodes belong to the lower approximate region of a community $\rightarrow$ nodes definitely belong to.

And nodes in boundary (upper - lower) region $\rightarrow$ nodes possibly belong to.

It may be appropriate to assign fuzzy membership (0,1) to those in the boundary regions and unity (1) value to those in the lower approximate regions

Fuzzy-Rough Community: illustration

Normalized Fuzzy Mutual Information (NFMI)

$$NFMI(\mathbb{C}^X:\mathbb{C}^Y) = \frac{1}{2}[\dfrac{H(\mathbb{C}^X) - H(\mathbb{C}^X|\mathbb{C}^Y)}{H(\mathbb{C}^X)}$$ $$+ \dfrac{H(\mathbb{C}^Y) - H(\mathbb{C}^Y|\mathbb{C}^X)}{H(\mathbb{C}^Y)}]$$

where $H(\mathbb{C}^X)$ is the information contain in fuzzy partition matrix $\mathbb{C}^X$ and $H(\mathbb{C}^X|\mathbb{C}^Y)$ is the conditional information measure which denotes the information contain in $\mathbb{C}^X$ for given fuzzy partition matrix $\mathbb{C}^Y$.

Normalized Fuzzy Mutual Information

Higher the value of NFMI larger the similarity (or relevance) between $\mathbb{C}^X$ and $\mathbb{C}^Y$

In an ideal case, i.e., when two community structures are identical NFMI will be $1$.

When two community structures are complement to each other NFMI would be $0$

NFMI can be used to measure the goodness of a community structure $\mathbb{C}^X$, given the actual one $\mathbb{C}^Y$

Result: LFR Benchmark Data

Network size = 1001, Min community size = 150, Max community size = 250. $\eta$: Avg ratio of external degree with total degree, $O_n$: Fraction of overlapping nodes

($O_n=0.15$)

NFMI for different values of $O_n$

($\eta=0.4$)

Examiner's Queries

Q-1: What are the recent advances in the field of granular model for social network?

The FGSN model that we have provided is first of its kind

We are currently working on

  • Finding most diverse groups in the network
  • Rough Set based ICM simulator
    • will improve the execution time of Monte Carlo
  • Social tension analysis
    • will lead to link prediction

Others researchers working in

Dillen, N. B., & Chakraborty, A. (2016). Modularity-Based Community Detection in Fuzzy Granular Social Networks. In Proc. of the International Congress on Information and Communication Technology: ICICT 2015, Volume 1 (pp. 577–585). Singapore: Springer Singapore.

Kumar, P., Gupta, S., & Bhasker, B. (2017). An upper approximation based community detection algorithm for complex networks. Decision Support Systems, 96(April), 103–118.

Q-2: In Chapter 3, at page number 72, how you arrive at worst case time complexity of DGS algorithm

Worst case time complexity of DGS occurs when

  • In each iteration, only 1 node is marked deprecated

So, the process needs following no. of evolutions

  • $m$
  • $m-1$
  • $m-2$
  • ...
  • $m-(m-k+1)$
  • $m-(m-k)$

i.e.,

  • $m(m-k+1) - \frac{(m-k+1)(m-k)}{2}$
  • $(m-k+1)[m - \frac{m-k}{2}]$
  • $\frac{1}{2}\times (m-k+1)(2m-m+k)$
  • $\frac{1}{2}\times (m+k-2k+1)(m+k)$
  • $\frac{1}{2}\times [(m+k)^2 - (2k-1)(m+k)]$

As $m>k$, the worst case solution needs $O((m+k)^2)$ time, in terms of number of evolution

Q-3: Block diagram of DGS algorithm given in Figure 3.1, please take a simple example and explain the Marking.

Q-4: As given in Figure 4.3, kindly give the physical relevance of two curves.

Used to see the general statistics of the FGSN

The granular degree shows similar pattern with degree distribution

whereas, granular betweenness shows differences in their distributions

Q-5: In Section 6.4.2, where different graph set-theoretic algorithms are compared and in subsequent results, their performance is shown with respect to Benchmark problem. Thus in the different Figures like 6.4 and 6.5, what are the main contribution of the author here.

  • FRC-FGSN is the proposed granular computing based algorithm
  • others are graph theoretic algorithm

Network size = 1001, Min community size = 150, Max community size = 250. $\eta$: Avg ratio of external degree with total degree, $O_n$: Fraction of overlapping nodes

($O_n=0.15$)

NFMI for different values of $O_n$

($\eta=0.4$)

Q-6: The author is requested to provide three popular examples of real life where Fuzzy Granular Social Networks playing a crucial role.

  1. Viral marketing: through large scale online social networks
    • each granule hold the influence domain
  2. Job search through social networks: Whom to approach
    • No additional information in lower bound
  3. Recommender systems
    • Lower bounds are closely related can be recommended

Q-7: In Figure 3.8, why the execution time of CELF Algorithm is very high.

Time taken by CELF as compare with slowest DGS for different data sets

Data Set CELF DGS
Twitter 31.89 min 18.02 min
EUEmail 7.9 hours 7.18 min
Slashdot 2 hours 10.52 min
OCLinks 6.58 min 37.98 sec
USAirport 12.39 min 1.14 min

Theoretically,

CELF need $O(k*|V|)$ time

whereas, DGS need $O((m+k)^2)$ time; $k < m < v$

Practically,

DGS converges quickly leading to very low computational time as compare to CELF

Q-8: While discovering the influence node in Social Networks, Shapley value-based approach seems to be quite effective (please refer to reference 78, Narayanam and Narahari, 2011). Compare to this approach which method you are adopting and why, for identifying influential nodes.

Shapley value-based approach [78]

Calculating Shapley values is a tedious process.

As per the authors "the direct, naive approach for computing the Shapley values of the players is not a tractable one."

Shapley values are approximated.

Shapley values are the

average marginal contributions of a node in $t$ number of randomly selected permutations of nodes

Each permutation defines a unique order of nodes

The algorithm depends upon

the choice of $t$

if $t > k$ then it takes more time than Greedy approach

and for smaller values of $t$ in large scale networks its correctness is not proved.

* Choosing some random permutation for $|V|$ nodes is an error prone process, specially for large value of $|V|$

In our approach (DGS)

  • we take heuristic based input list instead of random selection as in [78]
  • Both algorithms calculate marginal gain similarly, however, in our algorithm we prefer to deprecate nodes instead of ranking all the nodes first, for selecting $k$
  • This approach of DGS gives two benefits, (i) converges quickly, (ii) needs less computation time with iteration.

Also, we proved the correctness of DGS theoretically

Q-9: Tuarob and Tucker suggested a method for extracting product features using large scale social media data. Please comment on what method you have adopted and in what way your method is superior to Taurob and Tucker (2015) method.

The work of Taurob and Tucker (2015) deals with sentiment analysis using social network data and involves natural language processing.

It uses tweeter messages as document and existing information retrieval algorithms are used to rank the product features.

The problem addressed is not related to the scope of the thesis

Related Publications

Journals

  • Pal, S. K., Kundu, S., & Murthy, C. A. (2014). Centrality Measures, Upper Bound, and Influence Maximization in Large Scale Directed Social Networks. Fundamenta Informaticae, 130(3), 317–342.
  • Kundu, S. & Pal, S. K. (September 2015). Deprecation based Greedy Strategy for Target Set Selection in Large Scale Social Networks. Information Sciences. 316, 107-122.
  • Kundu, S. & Pal, S. K. (September 2015). FGSN: Fuzzy Granular Social Networks - Model and Applications. Information Sciences, 314, 100-117.
  • Kundu, S. & Pal, S. K. (December 2015). Fuzzy-Rough Community in Social Networks. Pattern Recognition Letters, 67 Part 2, 145-152.

Book Chapter

  • Pal, S. K. & Kundu, S. Granular social network: model and applications, in Handbook of Big Data Technologies, A. Zomaya and S. Sakr (eds.), Springer, Heidelberg (to appear).

Conference

  • Kundu, S., Murthy, C. A., & Pal, S. K. (2011). A New Centrality Measure for Influence Maximization in Social Networks. In Proc. of 4th International Conference on Pattern Recognition and Machine Intelligence (PReMI’11) (pp. 242–247). Moscow: LNCS vol. 6744, Springer-Verlag.

Thank You