Saving a fastText Model in VEC Format
To obtain a VEC file containing all word vectors, you can use the following Python script, inspired by the official bin_to_vec example:
from fasttext import load_model

# Load the original BIN model
f = load_model("YOUR-BIN-MODEL-PATH")

# Get all words from the model
words = f.get_words()

# Open a file to write the VEC file
with open("YOUR-VEC-FILE-PATH", "w") as file_out:

    # Write the total number of words and the vector dimension on the first line
    file_out.write(str(len(words)) + " " + str(f.get_dimension()) + "\n")

    # Write each word and its vector to the file
    for w in words:
        v = f.get_word_vector(w)
        vstr = ""
        for vi in v:
            vstr += " " + str(vi)
        try:
            file_out.write(w + vstr + "\n")
        except Exception:
            # Skip words that cannot be written (e.g., encoding issues)
            pass
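
To sanity-check the result, you can load the file back with gensim (a quick check, assuming gensim is installed; the path placeholder matches the one above):

from gensim.models import KeyedVectors

# The VEC file is plain-text word2vec format, so gensim can read it directly
wv = KeyedVectors.load_word2vec_format("YOUR-VEC-FILE-PATH", binary=False)
print(wv.vectors.shape)  # (number of words, vector dimension)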
The resulting VEC file may be large, but you can adjust the format of the vector components to reduce its size. For example, to keep only 4 decimal digits, replace vstr += " " + str(vi) with vstr += " " + "{:.4f}".format(vi).
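
As a minimal sketch, the write loop with rounding applied would then look like this (same f, words, and file_out as in the script above):

for w in words:
    v = f.get_word_vector(w)
    vstr = ""
    for vi in v:
        vstr += " " + "{:.4f}".format(vi)  # keep 4 decimal digits per component
    file_out.write(w + vstr + "\n")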
Alternatively, you can use the gensim library, whose wv.save_word2vec_format function simplifies the process of generating .vec files:
from gensim.models import FastText

# Load the sentences; data.txt contains one sentence per line
with open('data.txt', 'r') as f:
    sentences = f.readlines()

# Tokenize the sentences (simple whitespace split; substitute your preferred tokenizer)
tokenized_sentences = [sentence.split() for sentence in sentences]

# Create and train the FastText model
model = FastText(vector_size=300, window=5, min_count=1, sentences=tokenized_sentences, epochs=10)

# Save the vectors to a .vec file
model.wv.save_word2vec_format("embeddings.vec")
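
Once saved, the vectors can be reloaded with KeyedVectors in the same way; for example (the query word "example" is a hypothetical entry that must actually appear in your training data):

from gensim.models import KeyedVectors

# Reload the saved vectors and query nearest neighbours
wv = KeyedVectors.load_word2vec_format("embeddings.vec", binary=False)
print(wv.most_similar("example", topn=5))  # "example" is a hypothetical vocabulary word

Note that the .vec format stores only full-word vectors; fastText's subword information is not preserved, so a reloaded model cannot produce vectors for out-of-vocabulary words.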
Keep in mind that both approaches are useful; choose the one that best fits the specific requirements of your task.