Frequently Asked Questions (FAQ) ================================ General Questions ----------------- * `Why use streamparse?`_ * `Is streamparse compatible with Python 3?`_ * `How can I contribute to streamparse?`_ * `How do I trigger some code before or after I submit my topology?`_ * `How should I install streamparse on the cluster nodes?`_ * `Should I install Clojure?`_ * `How do I deploy into a VPC?`_ * `How do I override SSH settings?`_ * `How do I dynamically generate the worker list?`_ Why use streamparse? ~~~~~~~~~~~~~~~~~~~~ To lay your Python code out in topologies which can be automatically parallelized in a Storm cluster of machines. This lets you scale your computation horizontally and avoid issues related to Python's GIL. See :ref:`parallelism`. Is streamparse compatible with Python 3? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Yes, streamparse is fully compatible with Python 3 starting with version 3.3 which we use in our `unit tests`_. .. _unit tests: https://github.com/Parsely/streamparse/blob/master/.travis.yml How can I contribute to streamparse? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Please see the `CONTRIBUTING`_ document in Github .. _CONTRIBUTING: https://github.com/Parsely/streamparse/blob/master/CONTRIBUTING.rst How do I trigger some code before or after I submit my topology? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ After you create a streamparse project using ``sparse quickstart``, you'll have a ``fabfile.py`` in that directory. In that file, you can specify two functions (``pre_submit`` and ``post_submit``) which are expected to accept four arguments: * ``topology_name``: the name of the topology being submitted * ``env_name``: the name of the environment where the topology is being submitted (e.g. ``"prod"``) * ``env_config``: the relevant config portion from the ``config.json`` file for the environment you are submitting the topology to * ``options``: the fully resolved Storm options Here is a sample ``fabfile.py`` file that sends a message to IRC after a topology is successfully submitted to prod. .. code-block:: python # my_project/fabfile.py from __future__ import absolute_import, print_function, unicode_literals from my_project import write_to_irc def post_submit(topo_name, env_name, env_config): if env_name == "prod": write_to_irc("Deployed {} to {}".format(topo_name, env_name)) How should I install streamparse on the cluster nodes? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ streamparse assumes your Storm servers have Python, pip, and virtualenv installed. After that, the installation of all required dependencies (including streamparse itself) is taken care of via the `config.json` file for the streamparse project and the ``sparse submit`` command. Should I install Clojure? ~~~~~~~~~~~~~~~~~~~~~~~~~ No, the Java requirements for streamparse are identical to that of Storm itself. Storm requires Java and `bundles Clojure as a requirement`_, so you do not need to do any separate installation of Clojure. You just need Java on all Storm servers. .. _bundles Clojure as a requirement: https://github.com/apache/storm/blob/5383ac375cb2955e3247d485e46f1f58bff62810/pom.xml#L320-L322 How do I deploy into a VPC? ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Update your ``~/.ssh/config`` to use a bastion host inside your VPC for your commands:: Host *.internal.example.com ProxyCommand ssh bastion.example.com exec nc %h %p If you don't have a common subdomain you'll have to list all of the hosts individually:: Host host1.example.com ProxyCommand ssh bastion.example.com exec nc %h %p ... Set up your streamparse config to use all of the hosts normally (without bastion host). How do I override SSH settings? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ It is highly recommended that you just modify your ``~/.ssh/config`` file if you need to tweak settings for setting up the SSH tunnel to your Nimbus server, but you can also set your SSH password or port in your ``config.json`` by setting the ``ssh_password`` or ``ssh_port`` environment settings. .. code-block:: json { "topology_specs": "topologies/", "virtualenv_specs": "virtualenvs/", "envs": { "prod": { "user": "somebody", "ssh_password": "THIS IS A REALLY BAD IDEA", "ssh_port": 52, "nimbus": "streamparse-box", "workers": [ "streamparse-box" ], "virtualenv_root": "/data/virtualenvs" } } } How do I dynamically generate the worker list? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In a small cluster it's sufficient to specify the list of workers in ``config.json``. However, if you have a large or complex environment where workers are numerous or short-lived, ``streamparse`` supports querying the nimbus server for a list of hosts. An undefined list (empty or None) of ``workers`` will trigger the lookup. Explicitly defined hosts are preferred over a lookup. Lookups are configured on a per-environment basis, so the ``prod`` environment below uses the dynamic lookup, while ``beta`` will not. .. code-block:: json { "topology_specs": "topologies/", "virtualenv_specs": "virtualenvs/", "envs": { "prod": { "nimbus": "streamparse-prod", "virtualenv_root": "/data/virtualenvs" }, "beta": { "nimbus": "streamparse-beta", "workers": [ "streamparse-beta" ], "virtualenv_root": "/data/virtualenvs" } } }