I’m writing some code to parse an nginx config file in Python. The goal is to extract all the upstream ‘pools’ and put them into a nice data structure for later use.
I’ve come up with the below solution but am unsure about my approach. This seems like something that should have already been done for some other application, I just couldn’t construct the right search terms to find what I need. I’m also wondering if there is some python trick that I am unaware of that could achieve what I want with less (perceived?) bloat.
Need:
- Extract every instance of ‘upstream’ in nginx config, make it useful.
Example data:
worker_processes 2;
pid /var/run/nginx.pid;
error_log /var/log/nginx/error_log debug;
debug_points abort;
events {
    worker_connections 1024;
    use epoll;
    debug_connection 10.10.231.159;
}
http {
    upstream pool1 {
        server 10.10.240.48:8888;
        server 10.10.231.159:8888;
    }
    upstream pool2 {
        server 10.10.240.48:8889;
        server 10.10.231.159:8889;
    }
    server {
        listen 0.0.0.0:80;
        access_log /var/log/nginx/access_log_80;
        location /nginx_status {
            stub_status on;
            access_log off;
            allow all;
        }
        location / {
            proxy_pass http://pool1;
        }
        location /blah {
            proxy_pass http://pool2;
        }
    }
}
Example code (comments explain the logic):
import pprint
import re
import sys


class NgxConfig(object):
    def __init__(self, logging, config_file):
        # Set us up the bomb
        self.logging = logging
        self.upstreams = {}
        try:
            # "with" closes the file even if reading fails
            with open(config_file) as f:
                self.config_file = [line.strip() for line in f]
        except IOError:
            self.logging.error("Cannot process nginx config file!")
            sys.exit("Cannot process nginx config file!")
        self.parse_upstreams()

    def parse_upstreams(self):
        """ Parse upstreams from config
        """
        # Setup marker: None rather than 0, so an upstream on the first line is not missed
        us_start_matched = None
        # Enumerate over config to keep track of position
        for pos, line in enumerate(self.config_file):
            # See if our line matches "upstream" at all.
            if re.search(r'^upstream\s+\S+\s*{', line):
                # Matched upstream, set our position and move on
                us_start_matched = pos
                continue
            # We have a position set for upstream, look for the end of its block
            if us_start_matched is not None and line == '}':
                # Extract the name of the upstream
                usm = re.search(r'^upstream\s+(\S+)\s*{', self.config_file[us_start_matched])
                # Get the lines in the upstream between the start and end of the block
                block = self.config_file[us_start_matched + 1:pos]
                # Keep only the server address. str.strip('server; ') removes a set of
                # characters from both ends, not a prefix, so it would mangle hostnames.
                srvs = [m.group(1) for m in
                        (re.match(r'server\s+([^\s;]+)', s) for s in block) if m]
                # Set them in the list
                self.upstreams[usm.group(1)] = srvs
                # Reset position marker and move on
                us_start_matched = None
        pprint.pprint(self.upstreams)
Result:
{'pool1': ['10.10.240.48:8888', '10.10.231.159:8888'],
'pool2': ['10.10.240.48:8889', '10.10.231.159:8889']}
Thoughts, concerns, criticisms?
Thanks!
EDIT: May have found a bug in my blog software, seems that even though I had marked this as a ‘draft’ it was still publicly viewable via tag feed!

Comments
Why not use one of the existing parser engines, like pyparsing?
I can’t seem to find anything in the standard library that does something like pyparsing (which is what I would want). The thing is that I’m not parsing the entire config, only the ‘upstream’ part of it. Seems to me that having pyparsing as a dependency would be a bit much for this.
Don’t get me wrong, I’m not all about reinventing the wheel, but I think this is one of those cases where an external lib would be overkill.
Although, based on your suggestion, I did investigate shlex and am studying up on it to see if it will fit my needs.
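A minimal sketch of how shlex might tokenize this kind of config. The extra characters added to `wordchars` are an assumption that happens to cover the sample above; this is not a complete nginx grammar, and `tokenize` is a hypothetical helper name.

```python
import shlex

SAMPLE = """
http {
    upstream pool1 {
        server 10.10.240.48:8888;   # first backend
        server 10.10.231.159:8888;
    }
}
"""

def tokenize(text):
    """Split nginx-style text into words plus the tokens '{', '}' and ';'."""
    lexer = shlex.shlex(text)
    # Assumption: these characters may appear inside a single argument
    lexer.wordchars += ".:/-="
    # shlex treats '#' as a comment starter by default and skips to end of line
    return list(lexer)

tokens = tokenize(SAMPLE)
print(tokens)
```

Because braces and semicolons are not word characters, they come back as separate one-character tokens, which is what a block-aware parser would need.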
EDIT: drewp set me straight :) and markdown doesn’t like strike…
How are you measuring whether a dependency is too much? Somehow, python wasn’t overkill for your deployment even though it’s way overpowered for what you need to do. This isn’t embedded-land, where each byte matters.
The thing to check is whether pyparsing is in PyPI, and it is. Therefore, it’s one line in your setup.py or buildout.cfg. Pretend it was part of the stdlib, just with a different release schedule and that one extra ‘activation’ line.
As to the initial problem, I intend to manage my nginx configs by generating them, which is always easier than parsing them. It’ll also be easier to debug when something goes wrong, since it’s easier for me to look at the generated nginx config than it is for you to look at your parse tree.
Meanwhile, though, it would probably be cool if someone did a library for parsing semicolon-terminated-lines-with-{}-blocks, since we never seem to run out of config formats that use them :(
You make a good point. My use of the word ‘overkill’ is a bad exaggeration of my laziness, and a generalization for I don’t want to manage another dependency down the road. Though you go on to make a good point on how to manage that below…
I come from a world without the fun and power of Python’s distribution utilities and it often slips my mind (though it shouldn’t given my position as a sysadmin). I was initially going to just tar it up and distribute it that way, but now that you mention it…
I currently manage my initial nginx configs by generating them and keeping revisions in source control. What I’ve run into is a lack of ways to dynamically remove a server from an upstream pool for the application I’m balancing, which is a proprietary software package. The idea is to go beyond nginx’s capabilities for removing an upstream server. I have ‘health checks’ against the servers in an upstream with which I can test a specific page for a status and if it doesn’t pass the check, remove the server from the config and HUP nginx. I will be keeping track of changes with source control and will have facilities to manually roll-back from previous revisions. I was going to do this with nagios, but I felt this a good exercise for my still wobbly python legs.
I agree, though I can’t complain that much. A lot of admins I have worked with puke all over themselves when they see a mess of XML. I’m sure they would much prefer a nice clear tree of options like the nginx config. Though ultimately, I’d prefer to see a YAML config, or even something like ConfigObj lets you do with nested ini style setup, though I feel like even that gets yucky the more you nest.
I think we’re using ‘generate’ differently. I’m going to have an nginx.conf that includes upstreams.conf. upstreams.conf will always be generated by a program based on a data file that I write and maybe some other information that the program can gather automatically. It will never be hand-edited; the only program to ever read it will be nginx. I won’t check upstreams.conf into any revision control just like I don’t check .o or .pyc files into revision control.
The data files that my program reads, of course, can be YAML or ConfigObj or whatever I want. The answer to that for me nowadays is always RDF/turtle, because it’s general enough to handle any data structure, you can seamlessly split the data into any number of files (or even databases), you can seamlessly merge the data from multiple files (or even entire projects), it’s never ambiguous, it includes a straightforward system for documentation, etc.
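The generation step described above could be sketched roughly like this. It assumes the pool data has already been loaded into a plain dict (whatever the source format is), and `render_upstreams` is a hypothetical name, not part of any existing tool.

```python
def render_upstreams(pools):
    """Render {pool name: [server addresses]} as nginx upstream blocks."""
    lines = []
    for name in sorted(pools):
        lines.append("upstream %s {" % name)
        for server in pools[name]:
            lines.append("    server %s;" % server)
        lines.append("}")
    return "\n".join(lines) + "\n"

# Assumed input: the same pools as in the example config from the post
pools = {
    "pool1": ["10.10.240.48:8888", "10.10.231.159:8888"],
    "pool2": ["10.10.240.48:8889", "10.10.231.159:8889"],
}
text = render_upstreams(pools)
print(text)
```

The returned string would then be written to upstreams.conf, which nginx.conf pulls in with an include directive.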
Yes, we are doing things a bit differently.
Like your approach, I have recently decided to separate my upstreams and include them from the base nginx configuration. I will be generating a simple version of the configurations based on some values from my configuration engine (think cf, or bcfg2) and those will live in version control in the event of catastrophe where a complete rebuild is needed. Beyond that, however, the upstreams will be managed via my process, which will make decisions on what should be in upstreams.conf for various ‘pools’. It will do this with basic health checks (extensible) and handlers (also extensible).
One question I’m struggling with is how I should handle it if it does get hand-edited. I’m currently checking to see if the file gets modified, and I have a property that gets set if my code is the one that edited it. If it gets hand-edited, I’m torn between reloading it completely and rewriting the file from the values that my program has.
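One way the "did my code write this?" check might look, as a sketch only: remember a digest of what the program last wrote and compare it against the file on disk. `ManagedFile` and `edited_elsewhere` are hypothetical names, and this only detects the edit; it does not settle whether to reload or rewrite.

```python
import hashlib
import os
import tempfile

def digest(path):
    """Return the SHA-256 hex digest of the file's current contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

class ManagedFile(object):
    """Hypothetical helper: remembers what this program last wrote."""

    def __init__(self, path):
        self.path = path
        self.last_written = None

    def write(self, text):
        with open(self.path, "w") as f:
            f.write(text)
        self.last_written = digest(self.path)

    def edited_elsewhere(self):
        # True when the file no longer matches what this program wrote
        return digest(self.path) != self.last_written

# Demonstration with a throwaway file
path = os.path.join(tempfile.mkdtemp(), "upstreams.conf")
managed = ManagedFile(path)
managed.write("upstream pool1 { server 10.10.240.48:8888; }\n")
before = managed.edited_elsewhere()
with open(path, "a") as f:             # simulate a hand edit
    f.write("# tweaked by hand\n")
after = managed.edited_elsewhere()
print(before, after)
```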
Neat! I like that! I’m a big fan of YAML..
Yeah, I gotta agree with redbaron that some kind of parser library would be useful. Otherwise, your approach is fine, but if you want to make it more generalized, where you get a data structure that reflects the entire file, you’ll need some way of popping and pushing stuff off of a tokenized stream.
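A rough sketch of that push/pop idea using only the standard library. The tokenizer regex assumes there are no quoted strings or comments to worry about, and `parse_block` and `find_upstreams` are hypothetical names.

```python
import re

# Assumption: no quoted strings or comments need special handling here
TOKEN_RE = re.compile(r"[{};]|[^\s{};]+")

def parse_block(tokens, pos=0):
    """Parse directives until the matching '}' or the end of the tokens.

    Returns (directives, next_pos); each directive is a tuple
    (words, children), where children is None for a 'name args;' line.
    """
    directives = []
    words = []
    while pos < len(tokens):
        tok = tokens[pos]
        pos += 1
        if tok == ";":            # end of a simple directive
            directives.append((words, None))
            words = []
        elif tok == "{":          # push: descend into a nested block
            children, pos = parse_block(tokens, pos)
            directives.append((words, children))
            words = []
        elif tok == "}":          # pop: return to the enclosing block
            return directives, pos
        else:
            words.append(tok)
    return directives, pos

def find_upstreams(directives):
    """Walk the tree and collect {pool name: [server addresses]}."""
    pools = {}
    for words, children in directives:
        if children is None:
            continue
        if words[:1] == ["upstream"] and len(words) == 2:
            pools[words[1]] = [w[1] for w, c in children
                               if c is None and w[:1] == ["server"] and len(w) > 1]
        else:
            pools.update(find_upstreams(children))
    return pools

config = """
http {
    upstream pool1 { server 10.10.240.48:8888; server 10.10.231.159:8888; }
    server { location / { proxy_pass http://pool1; } }
}
"""
tree, _ = parse_block(TOKEN_RE.findall(config))
upstreams = find_upstreams(tree)
print(upstreams)
```

Recursion stands in for an explicit stack here: each `{` pushes a new call and each `}` pops back to the enclosing block, so the resulting tree mirrors the whole file rather than just the upstream blocks.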
A recent issue of Python magazine had a great walkthrough of pyparsing, and now I’m fairly comfortable with using it.
A few other tiny remarks:
Instead of using the init function to load the file and then store the lines in a self.config_file attribute, I might write the parse_upstreams method as a freestanding function that takes a single parameter that is anything that quacks like a file. I would find it easier to work on parse_upstreams if I could feed it strings at an interactive session, like this:
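A sketch of what such a freestanding function might look like, assuming one directive per line as in the example config. It uses `io.StringIO` (the Python 3 spelling) to wrap a string; the function body is an illustration, not the code from the post.

```python
import io
import re

def parse_upstreams(fileobj):
    """Return {pool name: [server addresses]} from any iterable of lines."""
    upstreams = {}
    current = None                      # name of the open upstream block
    for line in fileobj:
        line = line.strip()
        start = re.match(r"upstream\s+(\S+)\s*{", line)
        if start:
            current = start.group(1)
            upstreams[current] = []
        elif current is not None and line == "}":
            current = None
        elif current is not None:
            server = re.match(r"server\s+([^\s;]+)", line)
            if server:
                upstreams[current].append(server.group(1))
    return upstreams

text = """upstream pool1 {
    server 10.10.240.48:8888;
    server 10.10.231.159:8888;
}
"""
# A StringIO object behaves like an open file, so no file on disk is needed
result = parse_upstreams(io.StringIO(text))
print(result)
```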
Could make it easy to write a lot of tests that way too.
Maybe I would write a version of parse_upstreams that takes a file and a different version that eats strings.
Also, how come you’re passing in a logging object as a param, rather than just using logging.getLogger to get a reference?
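For illustration, a module-level logger obtained with `logging.getLogger` might look like this. The config path is only a placeholder and is never opened, and the trimmed-down class is a sketch rather than the class from the post.

```python
import logging

# Module-level logger; the application configures handlers once, elsewhere
log = logging.getLogger(__name__)

class NgxConfig(object):
    def __init__(self, config_file):
        # No logging object is passed in; the class uses the module logger
        self.config_file = config_file
        log.debug("Using nginx config %s", config_file)

logging.basicConfig(level=logging.DEBUG)
cfg = NgxConfig("/etc/nginx/nginx.conf")   # placeholder path
```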
Unrelated: I really like the font size in this text area!
I’m going to mock something up with pyparsing today, actually, based on the replies and suggestions here.
Awesome, I’ll hunt it out (I think I have the one you’re speaking of).
I actually rewrote things a bit since this post, where parse_upstreams became get_upstreams and is a public method of the NgxConfig class, and I moved all file ops to that method. The reason for this is so I can call it directly from another part of my app. This class is only a small part of a larger app.
Neato, I’d heard of StringIO but hadn’t used it yet, I’ll play with that!
Could make it easy to write a lot of tests that way too.
Why split the work in two? Clarity?
Because I hadn’t read deep enough into the documentation to know that I could do that :)
Me too!
I’m glad to hear that pyparsing is an attractive option for your project. Here is a possible solution to your “upstream” block scanner, illustrating some pyparsing features that you might find useful. (Code assumes that the body of the config file has been stored in the string var nginx_config, perhaps using some statement like nginx_config = file("nginx.config").read().)
This prints out:
As far as ease of inclusion of pyparsing goes, you have a couple of options. As stated above, you can add it as a line to your setup.py. Or, if you like, just include the single source file pyparsing.py in the same directory as your parser code. I’ve intentionally kept pyparsing source to a single source file, so that projects can include it directly with their source code, with little package muss/fuss. (Matplotlib does this, for instance.) But either way works, according to your own preference.
Cheers, — Paul
Thanks for taking the time to point me in the right direction with pyparsing, it is much appreciated and enlightening!
Quick questions to verify that I’m understanding the code..
This defines what the parse engine should expect for a valid IP?
And this defines what to expect for the port?
This would be a generic alphanumeric string including _?
Your code is really nice and clean, no regexps, bonus!
I’ll probably go with the single source file installation and package it with my project.
Thanks again for taking the time to enlighten me on how I can benefit from pyparsing and distribute it with my project.
As you guessed, yes, yes, and yes. Send a note when you get your application done, and I’ll add you to the “Who’s Using Pyparsing” page on the wiki.