Word2vec Quick Tutorial using the Default Implementation in C


Word2vec is a technique for creating vector representations of words that preserve their meaning: words with similar meanings tend to lie close together in the vector space.

The original implementation was developed by researchers at Google.

Here is a very simple guide to downloading and building word2vec on a Linux machine and using it for basic operations.

Check out the project using SVN

~$ svn checkout http://word2vec.googlecode.com/svn/trunk/ word2vec

Build the project using make

Enter the directory created by svn and run make:

~/word2vec$ make

Run the included shell file demo-word.sh

This downloads a training corpus (about 100 MB of Wikipedia article text, compressed to about 30 MB) and trains the model on it.

This can take some time (about 10 minutes for me) because it does a lot of processing; training runs in parallel if your machine has multiple cores.

The result is an output file called vectors.bin, which holds the learned word vectors.

~/word2vec$ ./demo-word.sh

Once the model has been trained (for example, by running demo-word.sh as described in the previous step), a new file named vectors.bin will have been created. You can then start using the tools themselves.
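If you want to inspect vectors.bin outside the bundled tools, the following Python sketch reads it into a dictionary. It assumes the binary layout that word2vec writes when run with -binary 1: an ASCII header line holding the vocabulary size and the vector dimension, followed by one entry per word consisting of the word, a space, the vector as 4-byte floats, and a newline. The function name load_word2vec_bin is made up for this example, and the floats are assumed to be little-endian (as on x86 machines).

```python
import struct

def load_word2vec_bin(path):
    """Read a word2vec binary file into a dict of word -> list of floats.

    Assumes the layout described above; not part of the word2vec project.
    """
    vectors = {}
    with open(path, "rb") as f:
        # Header: "<vocab_size> <dim>\n" as ASCII text.
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            # Each word is a byte string terminated by a single space.
            chars = []
            while True:
                ch = f.read(1)
                if ch == b" " or ch == b"":
                    break
                if ch != b"\n":  # skip the newline ending the previous entry
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8", errors="replace")
            # The word is followed by `dim` 4-byte floats (assumed little-endian).
            data = f.read(4 * dim)
            vectors[word] = list(struct.unpack("<%df" % dim, data))
    return vectors
```

Calling load_word2vec_bin("vectors.bin") from the word2vec directory should then give you a dictionary you can index by word, provided the file matches the assumed layout.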

Calculating distances between word pairs

~/word2vec$ ./distance vectors.bin

Example output (the words closest to "cat"; despite the column header, the values are cosine similarities, so higher means closer)

Enter word or sentence (EXIT to break): cat

Word: cat  Position in vocabulary: 2601

                                              Word       Cosine distance
------------------------------------------------------------------------
                                              meow      0.621209
                                              cats      0.568651
                                            feline      0.550209
                                           caracal      0.542168
                                               dog      0.538465
                                            kitten      0.535119
                                          purebred      0.529065
                                             felis      0.508320
                                             eared      0.503065
                                            bobcat      0.499513
                                             tapir      0.493953
                                             tabby      0.487100
                                         oncifelis      0.482763
                                          longhair      0.476867
                                              lica      0.462677

(content suppressed)
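To make the ranking above more concrete, here is a small Python sketch of a nearest-neighbour lookup by cosine similarity. It is an illustration of the idea, not the code of the distance tool: the function names cosine and nearest and the three-dimensional toy vectors are invented for this example, and zero-length vectors are not handled.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(vectors, word, topn=5):
    """Return up to `topn` (word, similarity) pairs, most similar first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Toy vectors (made up): "cat" and "kitten" point in nearly the same direction.
toy = {
    "cat": [1.0, 0.9, 0.0],
    "kitten": [0.9, 1.0, 0.1],
    "car": [0.0, 0.1, 1.0],
}

for other, score in nearest(toy, "cat"):
    print("%10s  %.6f" % (other, score))
```

With real vectors, the same function can be applied to a dictionary that maps each vocabulary word to its vector.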

Finding analogies between words

E.g. "king" is to "man" as "queen" is to ... ?

~/word2vec$ ./word-analogy vectors.bin

Sample output for "king", "man" and "queen":

Enter three words (EXIT to break): king man queen

Word: king  Position in vocabulary: 187

Word: man  Position in vocabulary: 243

Word: queen  Position in vocabulary: 903

                                              Word              Distance
------------------------------------------------------------------------
                                             woman      0.559819
                                              girl      0.465427
                                            loving      0.460078
                                            wonder      0.434169
                                              maid      0.426818
                                         beautiful      0.422880
                                            lovely      0.418539
                                            thighs      0.413182
                                             gimme      0.410801
                                              love      0.405989
                                         gentleman      0.402662
                                            mister      0.402502
                                             loner      0.392148

(content suppressed)
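The analogy query can be understood as vector arithmetic: for the input "king man queen", the tool looks for words close to vec(man) - vec(king) + vec(queen). The Python sketch below shows that idea on made-up toy vectors. The function names normalize and analogy are invented for this example, and details such as normalizing every vector to unit length first are assumptions about how one might implement it, not a copy of the word-analogy source.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def analogy(vectors, a, b, c, topn=5):
    """Rank words by similarity to vec(b) - vec(a) + vec(c)."""
    unit = {w: normalize(v) for w, v in vectors.items()}
    target = normalize([y - x + z
                        for x, y, z in zip(unit[a], unit[b], unit[c])])
    # For unit vectors the dot product equals the cosine similarity.
    scores = [(w, sum(p * q for p, q in zip(target, v)))
              for w, v in unit.items() if w not in (a, b, c)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Toy vectors (made up): axis 0 ~ "royal", axis 1 ~ "male", axis 2 ~ "female".
toy = {
    "king": [1.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "woman": [0.0, 0.0, 1.0],
    "table": [1.0, 0.0, 0.0],
}

for word, score in analogy(toy, "king", "man", "queen"):
    print("%10s  %.6f" % (word, score))
```

The three input words are excluded from the candidates, since they would otherwise tend to dominate the top of the list.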