INSTALLING UNSTRUCTURED FOR LANGCHAIN DOCUMENT LOADERS

Create a new environment. Use a Python version lower than 3.11! I am using Miniconda for virtual environments:
    conda create -n unstructured python=3.10

Activate the environment:
    conda activate unstructured

Upgrade pip and setuptools:
- pip install --upgrade setuptools
- python.exe -m pip install --upgrade pip
(if you change your Python version, these tools might be downgraded)

If you get a permission error with the pip upgrade, try this:
    python.exe -m pip install --upgrade pip --user

Install LangChain and OpenAI:
    pip install langchain
    pip install openai

Read the installation instructions in these two documents carefully before beginning:
- https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/unstructured_file.html (you are going to run into errors because of Git and Torch)
- https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst

Install Git: https://git-scm.com/

The next step will take a while:
    pip install unstructured[local-inference]

If you get numpy errors, upgrade numpy for the Detectron2 git install:
    pip install numpy --upgrade

Then install Cython and Torch:
    pip install cython
    pip3 install torch torchvision torchaudio
(This install only uses the CPU; see Torch's getting-started page for installing with GPU support.)

A very useful link for installing Detectron2: https://haroonshakeel.medium.com/detectron2-setup-on-windows-10-and-linux-407e5382df1
    git clone https://github.com/facebookresearch/detectron2.git
    cd detectron2        (go to the detectron2 folder)
    pip install -e .     (with the dot included)
    cd ..
(go back to the main working directory)

    pip install opencv-python
    pip install layoutparser[layoutmodels,tesseract]

Install other dependencies: https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst
    pip install python-magic        (first)
    pip install python-magic-bin    (second)

Download Poppler:
- download 7-Zip: https://www.7-zip.org/
- unzip Poppler and place it in your working directory: https://blog.alivate.com.au/poppler-windows/
- add the Poppler bin folder to PATH

Download and install Tesseract: https://github.com/UB-Mannheim/tesseract/wiki
TAKE NOTE OF THE INSTALL PATH and add it to your environment PATH: C:\Program Files\Tesseract-OCR
    pip install pytesseract

SKIPPED: libreoffice, libxml2, libxslt
- https://ask.libreoffice.org/t/install-python-package-for-libre-office/66934
- https://pypi.org/project/unotools/

Run the following to install the NLTK dependencies (unstructured will handle this automatically soon):
    python -c "import nltk; nltk.download('punkt')"
    python -c "import nltk; nltk.download('averaged_perceptron_tagger')"

Restart VS Code and make sure your unstructured environment is active by checking the Python interpreter (press Ctrl + Shift + P and run "Python: Select Interpreter").
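
Optional sanity check: after all of the steps above, a small script like the one below can tell you what is still missing. This is only a rough sketch, not part of the official instructions. The helper names (missing_modules, missing_tools, python_version_ok) are made up for this note, the module list uses import names rather than pip names (cv2 for opencv-python, magic for python-magic), and it assumes the Poppler bin folder provides pdftoppm and the Tesseract folder provides tesseract on PATH.

```python
import importlib.util
import shutil
import sys

# Import names of the packages installed in the steps above (assumed list).
REQUIRED_MODULES = [
    "langchain", "openai", "unstructured", "torch", "detectron2",
    "cv2", "layoutparser", "magic", "pytesseract", "nltk",
]

# Executables expected on PATH: tesseract (Tesseract-OCR) and
# pdftoppm (assumed to ship in the Poppler bin folder).
REQUIRED_TOOLS = ["tesseract", "pdftoppm"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_tools(names):
    """Return the executables that are not reachable through PATH."""
    return [name for name in names if shutil.which(name) is None]


def python_version_ok(version_info=sys.version_info):
    """True when the interpreter is older than 3.11, as these notes require."""
    return tuple(version_info[:2]) < (3, 11)


if __name__ == "__main__":
    print("Python < 3.11:", python_version_ok())
    print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
    print("Missing tools on PATH:", missing_tools(REQUIRED_TOOLS) or "none")
```

Run it from a terminal with the unstructured environment active. If tesseract or pdftoppm shows up as missing, the PATH entries from the Poppler and Tesseract steps probably did not take effect yet, so open a new terminal (or restart VS Code) and try again.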