
Mapping Nodes to Vectors: An Intro to Node Embedding

In an earlier post, I stated that the recent advances in Natural Language Processing (NLP) technology can be, to a large extent, attributed to the use of very high-dimensional vectors for language representation. These high-dimensional vector representations (768 dimensions is common) are called embeddings, and they aim to capture semantic meaning and relationships between linguistic items. Given that graphs are everywhere, it is not surprising to see the ideas of word and sentence embeddings being extended to graphs in the form of node embeddings.

What are Node Embeddings?

Node embeddings are encodings of the properties and relationships of nodes in a low-dimensional vector space. This enables nodes with similar properties or connectivity patterns to have similar vector representations. Using node embeddings can improve performance on various graph analytics tasks such as node classification, link prediction, and clustering.  
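
To make "similar vector representations" concrete, similarity between two embedded nodes is typically measured with cosine similarity. Below is a minimal sketch using made-up 4-dimensional vectors (real embeddings are much higher-dimensional):

# Cosine similarity between two hypothetical node embeddings
import numpy as np

u = np.array([0.9, 0.1, 0.3, 0.2])  # made-up embedding of node A
v = np.array([0.8, 0.2, 0.4, 0.1])  # made-up embedding of node B

# Values near 1 indicate nodes with similar properties or connectivity
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(f"cosine similarity: {cos_sim:.3f}")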

Methods for Node Embeddings

There are several common techniques for learning node embeddings:

- DeepWalk and Node2Vec use random walks on the graph to generate node sequences, which are then fed into a word2vec skip-gram model to get embeddings. The random walk strategies allow capturing both local and broader network structure.

- Graph neural networks such as GCN and GAT learn embeddings by propagating and transforming node features along graph edges. The neural network architecture captures both feature information and graph structure.

- Matrix factorization methods like GraRep and HOPE factorize some matrix representation of the graph, for example the adjacency matrix or a higher-order matrix encoding paths, to produce embeddings. A minimal sketch of this idea follows the list.
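
To make the factorization idea concrete, here is a minimal sketch (not GraRep or HOPE themselves) that applies a truncated SVD to the adjacency matrix of a small graph and keeps the leading components as node embeddings:

import networkx as nx
import numpy as np

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)  # adjacency matrix, shape (34, 34)

# Keep the top-d singular vectors, scaled by the singular values,
# as d-dimensional node embeddings
d = 8
U, S, Vt = np.linalg.svd(A)
embeddings = U[:, :d] * np.sqrt(S[:d])

print(embeddings.shape)  # (34, 8): one embedding per node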

We are going to look at Node2Vec embeddings.

Node2Vec Embedding Steps

Before going over the Node2Vec embedding steps, let's first try to understand random walks on graphs. Random walks are used to convert the relationships present in the graph into a representation similar to sentences.

A random walk is a series of steps beginning from some node in the graph and moving to neighboring nodes selected at random. A random walk of length k means k random steps from an initially selected node. An example of a random walk of length 3 is shown below in Figure 1, where the green edges show an instance of a random walk from the node marked X. Each random walk results in a sequence of nodes, analogous to a sentence in a corpus.

Fig.1. An example of a random walk of length 3.
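
A uniform random walk takes only a few lines of code. Here is a minimal sketch on the karate club graph (the helper name random_walk is mine, not from any library):

import random
import networkx as nx

def random_walk(G, start, walk_length):
    """Return the node sequence of a uniform random walk of walk_length steps."""
    walk = [start]
    for _ in range(walk_length):
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(random.choice(neighbors))
    return walk

G = nx.karate_club_graph()
print(random_walk(G, start=0, walk_length=3))  # e.g. [0, 2, 32, 8]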

Using random walks, an overall pipeline of steps for generating node embeddings is shown below in Figure 2. The random walk method first generates a set of node sequences of a specified length for every node. Next, the node sequences are passed to a language embedding model, which generates a low-dimensional, continuous vector representation for each node using the SkipGram model. These vectors can then be used for any downstream task based on node similarities. It is also possible to map the embeddings to two-dimensional space for visualization using popular dimensionality reduction techniques.


Fig.2. Pipeline of steps for generating node embeddings
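
The whole pipeline of Figure 2 can be sketched in a few lines with gensim's Word2Vec: generate walks from every node, treat them as sentences, and train a skip-gram model (sg=1). This is a minimal sketch with uniform walks; the node2vec package used later wraps the same idea with biased walks.

import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()

# Step 1: generate random walks; tokens must be strings for Word2Vec
def walk(G, start, length):
    w = [start]
    for _ in range(length):
        w.append(random.choice(list(G.neighbors(w[-1]))))
    return [str(n) for n in w]

walks = [walk(G, n, 10) for n in G.nodes() for _ in range(20)]

# Step 2: train a skip-gram model on the walk "sentences"
model = Word2Vec(walks, vector_size=16, window=5, sg=1, min_count=1)

# Step 3: look up the learned embedding of node 0
print(model.wv["0"])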

The DeepWalk node embedding method uses plain random walks to gather node context information. Node2Vec instead uses biased random walks that provide better node context. A biased random walk combines breadth-first and depth-first strategies to explore the neighborhood of a node, and is controlled by two parameters, p and q. The parameter p models the transition probability of returning to the previous node, while the parameter q defines the "ratio" of BFS to DFS exploration. Figure 3 below illustrates a biased random walk. Starting with node v, it shows a walk of length 4. Blue arrows indicate the breadth-first search (BFS) directions and red arrows show the depth-first search (DFS) directions. The transition probabilities are governed by the search parameters as shown.

Fig.3. An example of a biased random walk of length 4 starting at node v.
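
In node2vec, the bias works as follows: if the walk just moved from node t to node v, each neighbor x of v receives an unnormalized weight of 1/p when x is t itself (returning), 1 when x is also a neighbor of t (a BFS-like step that stays close), and 1/q otherwise (a DFS-like step that moves away). A minimal sketch (the helper name bias_weight is mine):

import networkx as nx

def bias_weight(G, t, v, x, p, q):
    # Unnormalized node2vec weight for stepping v -> x after arriving t -> v
    if x == t:              # return to the previous node
        return 1.0 / p
    elif G.has_edge(t, x):  # x is at distance 1 from t: BFS-like step
        return 1.0
    else:                   # x is at distance 2 from t: DFS-like step
        return 1.0 / q

G = nx.karate_club_graph()
# Weights over the neighbors of v=0 when the walk arrived from t=1
weights = {x: bias_weight(G, 1, 0, x, p=1.0, q=0.5) for x in G.neighbors(0)}
print(weights)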

Example of Generating Embeddings

I will use Zachary's Karate Club dataset to illustrate the generation of node embeddings. It is a small dataset that has been widely used in studies of graph-based methods. The embeddings will be generated using the publicly available implementation of node2vec.

# Import necessary libraries
import networkx as nx
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt
from node2vec import Node2Vec

# Load the Karate Club graph and draw it, coloring nodes by club membership
G = nx.karate_club_graph()
cols = ["blue" if G.nodes[n]["club"] == "Officer" else "red" for n in G.nodes()]
nx.draw_networkx(G, node_color=cols)
df_edges = nx.to_pandas_edgelist(G)  # edge list as a data frame (for inspection)
node2vec = Node2Vec(G, dimensions=16, walk_length=30, num_walks=100, workers=4, p=1.0, q=0.5)
model = node2vec.fit(window=10, min_count=1)  # train the underlying Word2Vec model
model.wv.save_word2vec_format('./karate.emb')  # save embeddings

# Create a dataframe of embeddings for visualization
df = pd.read_csv('karate.emb', sep=' ', skiprows=[0], header=None)
df.set_index(0, inplace=True)
df.index.name = 'node_id'
df.head()
node_id        1          2          3          4          5          6          7          8          9          10         11         12         13         14         15         16
33       0.092881  -0.116476   0.109677   0.216758  -0.216537  -0.273324   0.897161   0.099800  -0.199117  -0.012935   0.095815  -0.022268  -0.246529   0.166690  -0.315830  -0.080442
0       -0.103973   0.217445  -0.321607  -0.188096   0.191852   0.158175   0.668216   0.027499   0.438449  -0.021251   0.072351   0.126618  -0.257281  -0.223262  -0.590806  -0.194275
32       0.072306  -0.285375   0.167020   0.190507  -0.130808  -0.225752   0.810336   0.078936  -0.430512   0.188738   0.072464  -0.023813  -0.209342  -0.106820  -0.501649   0.008207
2        0.189961   0.159125  -0.195857   0.293849  -0.102961  -0.130784   0.730172  -0.042656   0.085806  -0.365449  -0.013965   0.187779  -0.158719   0.019433  -0.280343  -0.301981
1        0.099712   0.355814  -0.091135   0.015275   0.107660   0.123524   0.606924   0.004724   0.201826  -0.244784   0.389210   0.387045  -0.031511  -0.156609  -0.425399  -0.469062
Now, let's cluster the embeddings with KMeans and project them to two dimensions with t-SNE to visualize them.

from sklearn.manifold import TSNE

# Cluster the 16-dimensional embeddings to color the plot
# (three clusters, matching the structure visible in the projection)
kmeans = KMeans(n_clusters=3, n_init=10)
labels = kmeans.fit_predict(df)

df.columns = df.columns.astype(str)
transform = TSNE
trans = transform(n_components=2, perplexity=10)  # small perplexity suits a 34-node graph
node_embeddings_2d = trans.fit_transform(df)

plt.figure(figsize=(7, 7))
plt.gca().set(aspect="equal")
plt.scatter(
    node_embeddings_2d[:, 0],
    node_embeddings_2d[:, 1],
    c=labels,
)
plt.title("{} visualization of node embeddings".format(transform.__name__))
plt.show()

The visualization shows three clusters of node embeddings, which suggests three possible communities in the data. This differs from the known structure of the Karate Club data: two communities, with one member whose assignment is ambiguous. However, we need to keep in mind that no node properties were used while creating the embeddings. Another point to note is that the node2vec algorithm has several parameters, and their settings do impact the results.
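
One way to quantify the mismatch is to compare the KMeans cluster labels against the club attribute stored on each node. Here is a minimal sketch using scikit-learn's adjusted Rand index, assuming G, df, and labels from the code above:

from sklearn.metrics import adjusted_rand_score

# Ground-truth club of each node, in the same row order as the embeddings
truth = [G.nodes[int(n)]["club"] for n in df.index]

# 1.0 means the clustering reproduces the two clubs exactly; 0 is chance level
print(adjusted_rand_score(truth, labels))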

The following paper provides an excellent review of the node embedding methods mentioned in this post. You may want to check it out.
