Commit e212e5f: Add files via upload

7 files changed, +232 -0 lines

README.md

# Inference experiments with large language models
## The effect of temperature
The importance of statistical approaches to the study of large language models has recently been emphasized by researchers at [Anthropic](https://www.anthropic.com/research/statistical-approach-to-model-evals). Large language models are usually stochastic: given a fixed prompt, a model generates a different response at each inference call when the temperature is non-zero. This property helps increase creativity in open-ended generation tasks, but it is not clear how a non-zero temperature helps in solving precise mathematical problems. To understand this better, we set up simple statistical experiments in this repo. The experiments are performed with [vLLM](https://docs.vllm.ai/en/latest/) on the [MATH](https://paperswithcode.com/dataset/math) dataset.
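
For reference, repeated sampling at a fixed temperature takes only a few lines with vLLM. The sketch below is illustrative and is not the repo's actual generation script (iLLM/generate/MATH.py is not shown in this commit); the model checkpoint, prompt format, and parameter values are assumptions.

```python
# Minimal sketch: repeated sampling at a fixed temperature with vLLM.
# Model name, prompt, and parameter values are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B")  # assumed checkpoint
params = SamplingParams(
    n=16,              # number of inference samples per prompt
    temperature=0.6,   # non-zero temperature => stochastic decoding
    top_p=0.95,
    max_tokens=1024,
)

prompt = "Problem: What is the 100th term of the arithmetic sequence 6, 10, 14, 18, ...?\nSolution:"
outputs = llm.generate([prompt], params)

# generate() returns one RequestOutput per prompt, holding n completions.
for completion in outputs[0].outputs:
    print(completion.text.strip()[:80])
```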
### [Llama 8B](https://ai.meta.com/blog/meta-llama-3/)
<center>
<img alt="fig1" width="800px" src="acc-vs-samples-Llama8B.png">
</center>
The experiment is run with max_tokens = 1024, num_few_shot = 2, top_p = 0.95, and temperature = 0.6 on Llama 8B, on a randomly chosen problem (problem 5): "What is the 100th term of the arithmetic sequence 6, 10, 14, 18, ...?" We clearly see that the measured accuracy varies widely when only a small number of inference samples is used. An earlier notable study of such questions is by [Matthew Renze and Erhan Guven](https://arxiv.org/pdf/2402.05201v1); in their work, 10 inference samples are considered for each problem. Given our results above, conclusions drawn from so few samples are not reliable. To overcome this difficulty, we draw 1000 inference samples at each value of temperature and plot the mean accuracy below:
<center>
<img alt="fig2" width="800px" src="acc-vs-temp-Llama8B.png">
</center>
We clearly see that the accuracy on problem 5 drops significantly as the temperature increases. On the other hand, for the average over the MATH dataset we see an initial oscillation and then a decrease in accuracy as the temperature increases. This initial oscillation suggests that accuracy may depend in an interesting way on the difficulty of the problem: for example, problems of a certain type might initially gain accuracy as the temperature is raised. To understand this phenomenon better, we run experiments on a refined subset of the MATH dataset consisting of only algebra level 1 problems, using Gemma 7B.
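
A temperature sweep amounts to repeating the sampling shown earlier at several temperatures and averaging a 0/1 correctness score over the samples. Below is a minimal sketch under assumed names and values, with a deliberately naive grader; the repo's actual evaluation lives in iLLM/evaluate/math_datasets.py and is more careful.

```python
# Minimal sketch of a temperature sweep: mean accuracy per temperature.
# Model, prompt, grader, and sample counts are illustrative assumptions.
from vllm import LLM, SamplingParams

def is_correct(text: str, answer: str) -> bool:
    # Placeholder grader: checks whether the reference answer appears in the response.
    return answer in text

llm = LLM(model="meta-llama/Meta-Llama-3-8B")  # assumed checkpoint
prompt = "Problem: What is the 100th term of the arithmetic sequence 6, 10, 14, 18, ...?\nSolution:"
answer = "402"  # 6 + 99 * 4

for temperature in (0.2, 0.4, 0.6, 0.8, 1.0):
    params = SamplingParams(n=1000, temperature=temperature, top_p=0.95, max_tokens=1024)
    outputs = llm.generate([prompt], params)
    scores = [is_correct(o.text, answer) for o in outputs[0].outputs]
    print(f"T={temperature:.1f}  mean accuracy={sum(scores) / len(scores):.3f}")
```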
### [Gemma 7B](https://ai.google.dev/gemma)
<center>
<img alt="fig3" width="800px" src="acc-vs-temp-Gemma7B.png">
</center>
To our surprise, we find that there is a critical temperature at which the accuracy flattens to a local maximum. We would like to emphasize that we have increased the number of inference samples significantly, reducing the per-problem variance to a negligible value. However, when we look across problems, the variance also receives a contribution from the dataset size. Notably, the MATH dataset contains only 273 algebra level 1 problems, on which we experimented above. To obtain a statistically more robust prediction in the future, we would need access to a much larger dataset.
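
To make the dataset-size contribution concrete: even with the per-problem variance driven to zero by many inference samples, the mean accuracy over N = 273 problems still carries a binomial sampling error of roughly sqrt(p(1-p)/N). A quick back-of-the-envelope check, with the accuracy value assumed purely for illustration:

```python
# Rough dataset-size error bar for the algebra level 1 subset (N = 273 problems).
# The accuracy value p is illustrative, not a measured result.
import math

N = 273          # number of algebra level 1 problems in MATH
p = 0.5          # assumed mean accuracy, for illustration only
stderr = math.sqrt(p * (1 - p) / N)
print(f"standard error from dataset size alone: about {stderr:.3f}")  # ~0.030
```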

acc-vs-samples-Llama8B.png

45.4 KB

acc-vs-temp-Gemma7B.png

34.4 KB

acc-vs-temp-Llama8B.png

48.7 KB

environment.yml

name: env_vLLM # give name of the environment
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2024.9.24=h06a4308_0
  - ld_impl_linux-64=2.40=h12ee557_0
  - libffi=3.3=he6710b0_2
  - libgcc-ng=9.1.0=hdf63c60_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - libuuid=1.0.3=h7f8727e_2
  - ncurses=6.3=h7f8727e_2
  - openssl=1.1.1w=h7f8727e_0
  - python=3.10.4=h12debd9_0
  - readline=8.1.2=h7f8727e_1
  - sqlite=3.38.5=hc218d9a_0
  - tk=8.6.12=h1ccaba5_0
  - xz=5.2.5=h7f8727e_1
  - zlib=1.2.12=h7f8727e_2
  - pip:
    - absl-py==2.1.0
    - accelerate==1.0.1
    - aiohappyeyeballs==2.4.3
    - aiohttp==3.10.10
    - aiosignal==1.3.1
    - annotated-types==0.7.0
    - antlr4-python3-runtime==4.11.0
    - anyio==4.6.2.post1
    - async-timeout==4.0.3
    - attrs==24.2.0
    - certifi==2024.8.30
    - chardet==5.2.0
    - charset-normalizer==3.4.0
    - click==8.1.7
    - cloudpickle==3.1.0
    - cmake==3.30.5
    - colorama==0.4.6
    - dataproperty==1.0.1
    - datasets==2.20.0
    - dill==0.3.8
    - diskcache==5.6.3
    - distro==1.9.0
    - docker==7.1.0
    - evaluate==0.4.3
    - exceptiongroup==1.2.2
    - fastapi==0.115.4
    - filelock==3.16.1
    - frozenlist==1.5.0
    - fsspec==2024.5.0
    - h11==0.14.0
    - httpcore==1.0.6
    - httptools==0.6.4
    - httpx==0.27.2
    - huggingface-hub==0.26.1
    - idna==3.10
    - interegular==0.3.3
    - jinja2==3.1.4
    - jiter==0.6.1
    - joblib==1.4.2
    - jsonlines==4.0.0
    - jsonschema==4.23.0
    - jsonschema-specifications==2024.10.1
    - lark==1.2.2
    - llvmlite==0.43.0
    - lm-eval==0.4.3
    - lm-format-enforcer==0.10.3
    - lxml==5.3.0
    - markupsafe==3.0.2
    - mbstrdecoder==1.1.3
    - more-itertools==10.5.0
    - mpmath==1.3.0
    - msgpack==1.1.0
    - multidict==6.1.0
    - multiprocess==0.70.16
    - nest-asyncio==1.6.0
    - networkx==3.4.2
    - ninja==1.11.1.1
    - nltk==3.9.1
    - numba==0.60.0
    - numexpr==2.10.1
    - numpy==1.26.4
    - nvidia-cublas-cu12==12.1.3.1
    - nvidia-cuda-cupti-cu12==12.1.105
    - nvidia-cuda-nvrtc-cu12==12.1.105
    - nvidia-cuda-runtime-cu12==12.1.105
    - nvidia-cudnn-cu12==9.1.0.70
    - nvidia-cufft-cu12==11.0.2.54
    - nvidia-curand-cu12==10.3.2.106
    - nvidia-cusolver-cu12==11.4.5.107
    - nvidia-cusparse-cu12==12.1.0.106
    - nvidia-ml-py==12.560.30
    - nvidia-nccl-cu12==2.20.5
    - nvidia-nvjitlink-cu12==12.6.77
    - nvidia-nvtx-cu12==12.1.105
    - openai==1.52.2
    - outlines==0.0.46
    - packaging==24.1
    - pandas==2.2.3
    - pathvalidate==3.2.1
    - peft==0.13.2
    - pillow==11.0.0
    - pip==24.2
    - portalocker==2.10.1
    - prometheus-client==0.21.0
    - prometheus-fastapi-instrumentator==7.0.0
    - propcache==0.2.0
    - protobuf==5.28.3
    - psutil==6.1.0
    - py-cpuinfo==9.0.0
    - pyairports==2.1.1
    - pyarrow==18.0.0
    - pyarrow-hotfix==0.6
    - pybind11==2.13.6
    - pycountry==24.6.1
    - pydantic==2.9.2
    - pydantic-core==2.23.4
    - pydra-config==0.0.1
    - pytablewriter==1.2.0
    - python-dateutil==2.9.0.post0
    - python-dotenv==1.0.1
    - pytz==2024.2
    - pyyaml==6.0.2
    - pyzmq==26.2.0
    - ray==2.38.0
    - referencing==0.35.1
    - regex==2024.9.11
    - requests==2.32.3
    - rouge-score==0.1.2
    - rpds-py==0.20.0
    - sacrebleu==2.4.3
    - safetensors==0.4.5
    - scikit-learn==1.5.2
    - scipy==1.14.1
    - sentencepiece==0.2.0
    - setuptools==75.1.0
    - six==1.16.0
    - sniffio==1.3.1
    - sqlitedict==2.1.0
    - starlette==0.41.2
    - sympy==1.13.3
    - tabledata==1.3.3
    - tabulate==0.9.0
    - tcolorpy==0.1.6
    - threadpoolctl==3.5.0
    - tiktoken==0.8.0
    - tokenizers==0.20.1
    - torch==2.4.0
    - torchvision==0.19.0
    - tqdm==4.66.3
    - tqdm-multiprocess==0.0.11
    - transformers==4.46.0
    - triton==3.0.0
    - typepy==1.3.2
    - typing-extensions==4.12.2
    - tzdata==2024.2
    - urllib3==2.2.3
    - uvicorn==0.32.0
    - uvloop==0.21.0
    - vllm==0.5.4
    - vllm-flash-attn==2.6.1
    - watchfiles==0.24.0
    - websockets==13.1
    - wheel==0.44.0
    - word2number==1.1
    - xformers==0.0.27.post2
    - xxhash==3.5.0
    - yarl==1.16.0
    - zstandard==0.23.0
prefix: indranilhalder/env/env_vLLM # give path to the environment

evaluate_response_job.sh

#!/bin/bash
#
#SBATCH --job-name=gpt2-eval
#SBATCH --out="gpt2-eval-%A_%a.out"
#SBATCH --cpus-per-task=32
#SBATCH --mem=32G
#SBATCH --nodes=1
#SBATCH --time=1:00:00
#SBATCH --array=0
#SBATCH --gres=gpu:1
#SBATCH --partition=kempner_h100
#SBATCH --account=kempner_pehlevan_lab

export SAVE_DIR=samples/gpt2_samples1000_temp1-2_shot2

module load python/3.10.12-fasrc01 # update
source activate /n/netscratch/pehlevan_lab/Everyone/indranilhalder/env/env_vLLM # update the location of env
python iLLM/evaluate/math_datasets.py samples_dir=$SAVE_DIR/math_samples save_dir=$SAVE_DIR/math_eval dset=math
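
The evaluation internals live in iLLM/evaluate/math_datasets.py, which is not part of this commit. Purely as an illustration of a typical MATH-style check (the function names and matching rule below are assumptions, not the repo's code), one extracts the final \boxed{...} answer from a response and compares it with the reference:

```python
# Illustrative MATH-style answer check (not the repo's actual evaluation code).
import re

def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def is_correct(response: str, reference: str) -> bool:
    predicted = extract_boxed(response)
    return predicted is not None and predicted == reference.strip()

print(is_correct(r"... so the answer is \boxed{402}.", "402"))  # True
```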

generate_response_job.sh

#!/bin/bash
#
#SBATCH --job-name=gpt2-generate
#SBATCH --out="gpt2-generate-%A_%a.out"
#SBATCH --cpus-per-task=32
#SBATCH --mem=32G
#SBATCH --nodes=1
#SBATCH --time=1:00:00
#SBATCH --array=0
#SBATCH --gres=gpu:1
#SBATCH --partition=kempner_h100
#SBATCH --account=kempner_pehlevan_lab

export SAVE_DIR=samples/gpt2_samples1000_temp1-2_shot2

module load python/3.10.12-fasrc01 # update
source activate /n/netscratch/pehlevan_lab/Everyone/indranilhalder/env/env_vLLM # update the location of env
python iLLM/generate/MATH.py model=gpt2 save_dir=$SAVE_DIR/math_samples temperature=1.2 num_samples=10 num_few_shot=2 num_workers=32 --list vllm_args --disable-log-requests list-- --list stop_strings Problem: list--
