# Praktikum 1

Percobaan 1

Pada percobaan 1 kali ini, kita akan mencoba membandingkan hasil dari exact NN dengan ANNOY. data yang kita buat adalah data random 2D, dengan 1000 data point, mencari 3 data terdekat dari query point, Metric Similarity menggunakan Euclidean, dengan 3 tree Annoy.

Instalasi Annoy dulu untuk langkah awal.

```python
!pip install annoy
```

Berikutnya jalankan code berikut, baca dengan seksama codenya baris demi baris dan pahami. lakukan beberapa kali percobaan dan perhatikan juga hasilnya. catat hasilnya jika menggunakan jumlah tree yang berbeda.

```python
import numpy as np
import matplotlib.pyplot as plt
import time

from annoy import AnnoyIndex

# 1. Dataset 2D
np.random.seed(42)
n_points = 1000
X = np.random.rand(n_points, 2) * 100  # titik random dalam ruang 100x100

# Query point (ambil salah satu titik random)
query = X[np.random.randint(0, n_points)]

# 2. Exact NN (brute force)
start = time.time()
distances = np.linalg.norm(X - query, axis=1)
idx_exact = np.argsort(distances)[:3]  # ambil 3 terdekat
time_exact = time.time() - start

print("Exact NN index:", idx_exact)
print("Exact NN jarak:", distances[idx_exact])
print("Waktu Exact:", round(time_exact*1000, 4), "ms")

# 3. Annoy NN (3 tree)
f = 2  # dimensi
t = AnnoyIndex(f, 'euclidean')
for i, vec in enumerate(X):
    t.add_item(i, vec)

t.build(3)  # 3 trees

start = time.time()
idx_ann = t.get_nns_by_vector(query, 3)  # cari 3 NN
time_ann = time.time() - start

print("\nAnnoy NN index:", idx_ann)
print("Annoy NN jarak:", [np.linalg.norm(X[i]-query) for i in idx_ann])
print("Waktu Annoy:", round(time_ann*1000, 4), "ms")

# 4. Visualisasi hasil
plt.figure(figsize=(6,6))
plt.scatter(X[:,0], X[:,1], c="lightgray", s=20, label="Dataset")
plt.scatter(query[0], query[1], c="red", marker="x", s=100, label="Query")

# Exact NN ditandai biru
plt.scatter(X[idx_exact,0], X[idx_exact,1], c="blue", s=80, label="Exact NN")

# Annoy NN ditandai hijau
plt.scatter(X[idx_ann,0], X[idx_ann,1], c="green", s=50, marker="s", label="Annoy NN")

plt.legend()
plt.title("Exact NN vs Annoy NN (3 trees)")
plt.show()
```

<div align="left"><figure><img src="https://3041032130-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5CvtE8Xh9b75jKUaRr5Y%2Fuploads%2F7waFASfSoyjisYMt7F91%2Fimage.png?alt=media&#x26;token=eb8111b3-e454-45db-9e47-d7c78ad6a979" alt="" width="375"><figcaption></figcaption></figure></div>

<figure><img src="https://3041032130-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5CvtE8Xh9b75jKUaRr5Y%2Fuploads%2FvlgwYg59PPTi1FSO5QSI%2Fimage.png?alt=media&#x26;token=8715b401-ce63-4eff-abcb-81a57d06cf55" alt=""><figcaption></figcaption></figure>

Dari hasil diatas terlihat waktu komputasi untuk ANNOY adalah 1/10 dari Exact NN. Lakukan percobaan dan isikan hasil percobaan pada tabel berikut.

<table><thead><tr><th width="157.3636474609375">Distance Metrics</th><th width="80.81817626953125">Tree</th><th width="134">Jumlah data</th><th width="255.636474609375">Hasil Index terdekat ENN vs ANN</th><th>Waktu komputasi Vs</th></tr></thead><tbody><tr><td>Euclidean</td><td>3</td><td>1000</td><td>[219 898 593], [219 898 770]</td><td>1.2271 , 0.1264</td></tr><tr><td>Euclidean</td><td>8</td><td>1000</td><td>...</td><td>...</td></tr><tr><td>Euclidean</td><td>3</td><td>100,000</td><td>...</td><td>...</td></tr><tr><td>Angular</td><td>3</td><td>1000</td><td>...</td><td>...</td></tr><tr><td>Angular</td><td>8</td><td>1000</td><td>...</td><td>...</td></tr><tr><td>Angular</td><td>3</td><td>100,000</td><td>...</td><td>...</td></tr></tbody></table>

Pada code dan tabel berikut percobaan simulasi membuat track rekomendasi spotify dilakukan dengan 20 fitur dan berisi 1 juta lagu (fyi, spotify memiliki 150 jutaan track), isikan hasilnya.

```python
import numpy as np
import time
from sklearn.metrics.pairwise import euclidean_distances
from annoy import AnnoyIndex

# ---- 1. Buat dataset mirip Spotify ----
n_tracks = 50_000_000   # 50 juta track
n_features = 20        # contoh: danceability, energy, tempo, dll.

# dataset besar (random untuk simulasi)
X = np.random.rand(n_tracks, n_features).astype(np.float32)

# query track (misalnya lagu baru)
query = np.random.rand(1, n_features).astype(np.float32)

# ---- 2. Exact NN (brute force) ----
start = time.time()
distances = euclidean_distances(query, X)[0]   # hitung semua jarak
exact_idx = np.argsort(distances)[:5]          # ambil 5 terdekat
exact_time = time.time() - start

print("Exact NN result:", exact_idx)
print("Exact NN time:", round(exact_time, 3), "seconds")

# ---- 3. Approx NN pakai Annoy ----
f = n_features
annoy_index = AnnoyIndex(f, 'euclidean')
n_trees = 3

# build index
for i in range(n_tracks):
    annoy_index.add_item(i, X[i])
annoy_index.build(n_trees)

start = time.time()
annoy_idx = annoy_index.get_nns_by_vector(query[0], 5)  # ambil 5 lagu yang mirip
annoy_time = time.time() - start

print("Annoy result:", annoy_idx)
print("Annoy time:", round(annoy_time, 3), "seconds")

```

<table><thead><tr><th width="157.3636474609375">Distance Metrics</th><th width="80.81817626953125">Tree</th><th width="134">Jumlah data</th><th width="255.636474609375">Hasil Index terdekat ENN vs ANN</th><th>Waktu komputasi Vs</th></tr></thead><tbody><tr><td>Euclidean</td><td>8</td><td>1000000</td><td>...</td><td>...</td></tr><tr><td>Angular</td><td>8</td><td>1000000</td><td>...</td><td>...</td></tr></tbody></table>

Pertanyaannya: Kenapa code dibagian build index tidak dihitung waktunya?&#x20;
