
I’m writing some code to parse an nginx config file in Python. The goal is to extract all the upstream ‘pools’ and put them into a nice data structure for later use.

I’ve come up with the solution below but am unsure about my approach. This seems like something that should already have been done for some other application; I just couldn’t construct the right search terms to find it. I’m also wondering whether there is some Python trick I’m unaware of that could achieve what I want with less (perceived?) bloat.

Need:

  • Extract every ‘upstream’ block from the nginx config and turn it into a useful data structure.

Example data:

worker_processes 2;
pid  /var/run/nginx.pid;
error_log /var/log/nginx/error_log debug;
debug_points abort;

events {
  worker_connections  1024;
  use epoll;
  debug_connection 10.10.231.159;
}
http {
  upstream pool1 {
      server 10.10.240.48:8888;
      server 10.10.231.159:8888;
  }
  upstream pool2 {
      server 10.10.240.48:8889;
      server 10.10.231.159:8889;
  }
  server {
      listen 0.0.0.0:80;
      access_log /var/log/nginx/access_log_80;
      location /nginx_status {
          stub_status on;
          access_log off;
          allow all;
      }
      location / {
          proxy_pass http://pool1;
      }
      location /blah {
          proxy_pass http://pool2;
      }
  }
}

Example code (comments explain the logic):

import pprint
import re
import sys

class NgxConfig(object):
    def __init__(self,logging,config_file):
        # Set us up the bomb
        self.logging = logging
        self.upstreams = {}
        try:
            f = open(config_file)
            self.config_file = [line.strip() for line in f.readlines()]
        except IOError:
            self.logging.error("Cannot process nginx config file!")
            sys.exit("Cannot process nginx config file!")
        f.close()

        self.parse_upstreams()

    def parse_upstreams(self):
        """ Parse upstreams from config
        """
        # Setup marker; None means we are not inside an upstream block
        # (0 is a valid line position, so it cannot serve as the sentinel)
        us_start_matched = None
        # Enumerate over config to keep track of position
        for pos,line in enumerate(self.config_file):
            # See if our line opens an "upstream" block.
            usm = re.search(r'^upstream\s+([^"\s]+)\s*\{',line)
            if usm:
                # Matched upstream, set our position and move on
                us_start_matched = pos
                continue
            # We have a position set for upstream, look for the end of its block
            if us_start_matched is not None and line == '}':
                # Got the end of the block
                us_end_matched = pos
                # Extract the name of the upstream
                usm = re.search(r'^upstream\s+([^"\s]+)\s*\{',self.config_file[us_start_matched])
                # Get the lines between the start and end of the block
                block = self.config_file[us_start_matched+1:us_end_matched]
                # Keep only the server info. Match the directive rather than
                # using strip('server; '), which strips a set of characters
                # and would mangle a host name such as "server1".
                found = [re.search(r'^server\s+([^;]+?)\s*;',s) for s in block]
                srvs = [m.group(1) for m in found if m]
                # Set them in the dict
                self.upstreams[usm.group(1)] = srvs
                # Reset position marker and move on
                us_start_matched = None
                continue

        pprint.pprint(self.upstreams)

Result:

{'pool1': ['10.10.240.48:8888', '10.10.231.159:8888'],
 'pool2': ['10.10.240.48:8889', '10.10.231.159:8889']}
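
For comparison, a more compact variant can be sketched with two regular expressions run over the whole file text instead of tracking line positions. This is only a sketch under assumptions: the names extract_upstreams, UPSTREAM_RE and SERVER_RE are made up here, and it assumes upstream blocks contain no nested braces and no comments containing braces. It is written for Python 3.

```python
import re

# Matches "upstream <name> { ... }"; assumes no nested braces inside the block.
UPSTREAM_RE = re.compile(r'\bupstream\s+(\S+)\s*\{([^}]*)\}')
# Matches "server <address> [params];" at the start of a line in a block body.
SERVER_RE = re.compile(r'^\s*server\s+([^;\s]+)[^;]*;', re.MULTILINE)

def extract_upstreams(text):
    """Return {upstream_name: [server_address, ...]} for every upstream block."""
    return {name: SERVER_RE.findall(body)
            for name, body in UPSTREAM_RE.findall(text)}
```

Calling extract_upstreams(open(config_file).read()) on the example data should give the same dictionary as the class above, though the regular expressions are less forgiving of unusual formatting than a real parser would be.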

Thoughts, concerns, criticisms?

Thanks!

EDIT: May have found a bug in my blog software, seems that even though I had marked this as a ‘draft’ it was still publicly viewable via tag feed!

11 comments | 0 pingbacks

Comments

redbaron 26.10.2008 9:44

Why don’t you use one of the existing parser engines, like pyparsing?

Benjamin Smith 26.10.2008 12:46

I can’t seem to find anything in the standard library that does something like pyparsing (which is what I would want). The thing is that I’m not parsing the entire config, only the ‘upstream’ part of it. Seems to me that having pyparsing as a dependency would be a bit much for this.

Don’t get me wrong, I’m not all about reinventing the wheel, but I think this is one of those cases where an external lib would be overkill.

Although, based on your suggestion, I did investigate shlex and am studying up on it to see if it will fit my needs.
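
As a rough idea of what shlex offers here, the sketch below splits config text into words and the punctuation marks `;`, `{` and `}`. The function name tokenize and the extra word characters are assumptions made for this illustration; by default shlex treats only letters, digits and underscores as word characters, so addresses and paths would otherwise be broken apart. Written for Python 3.

```python
import shlex

def tokenize(text):
    """Split nginx-style config text into words and the punctuation ; { }."""
    lexer = shlex.shlex(text)
    # Extend the word characters so that values such as 10.10.240.48:8888
    # or /var/run/nginx.pid come back as single tokens.
    lexer.wordchars += ".:/-"
    # shlex skips "#" comments by default, which suits nginx configs.
    return list(lexer)
```

This only produces a flat token list; grouping the tokens into blocks is still left to the caller.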

EDIT: drewp set me straight :) and markdown doesn’t like strike…

drewp 26.10.2008 18:01

How are you measuring whether a dependency is too much? Somehow, Python wasn’t overkill for your deployment even though it’s way overpowered for what you need to do. This isn’t embedded-land, where each byte matters.

The thing to check is whether pyparsing is in PyPI, and it is. Therefore, it’s one line in your setup.py or buildout.cfg. Pretend it was part of the stdlib, just with a different release schedule and that one extra ‘activation’ line.

As to the initial problem, I intend to manage my nginx configs by generating them, which is always easier than parsing them. It’ll also be easier to debug when something goes wrong, since it’s easier for me to look at the generated nginx config than it is for you to look at your parse tree.

Meanwhile, though, it would probably be cool if someone did a library for parsing semicolon-terminated-lines-with-{}-blocks, since we never seem to run out of config formats that use them :(

Benjamin Smith 27.10.2008 9:22

> How are you measuring whether a dependency is too much? Somehow, Python wasn’t overkill for your deployment even though it’s way overpowered for what you need to do. This isn’t embedded-land, where each byte matters.

You make a good point. My use of the word ‘overkill’ was an exaggeration born of laziness, and shorthand for “I don’t want to manage another dependency down the road”. Though you go on to make a good point about how to manage that below…

> The thing to check is whether pyparsing is in PyPI, and it is. Therefore, it’s one line in your setup.py or buildout.cfg. Pretend it was part of the stdlib, just with a different release schedule and that one extra ‘activation’ line.

I come from a world without the fun and power of Python’s distribution utilities, and they often slip my mind (though they shouldn’t, given my position as a sysadmin). I was initially going to just tar it up and distribute it that way, but now that you mention it…

> As to the initial problem, I intend to manage my nginx configs by generating them, which is always easier than parsing them. It’ll also be easier to debug when something goes wrong, since it’s easier for me to look at the generated nginx config than it is for you to look at your parse tree.

I currently manage my initial nginx configs by generating them and keeping revisions in source control. What I’ve run into is a lack of ways to dynamically remove a server from an upstream pool for the application I’m balancing, which is a proprietary software package. The idea is to go beyond nginx’s capabilities for removing an upstream server. I have ‘health checks’ against the servers in an upstream with which I can test a specific page for a status and if it doesn’t pass the check, remove the server from the config and HUP nginx. I will be keeping track of changes with source control and will have facilities to manually roll-back from previous revisions. I was going to do this with nagios, but I felt this a good exercise for my still wobbly python legs.
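
The health-check idea can be sketched roughly as follows. The names drop_unhealthy and reload_nginx are hypothetical, the check itself is passed in as any callable (the real test of a status page is not shown), and the pid file path is taken from the example config. nginx refuses to load an upstream block with no servers, so this sketch keeps a pool unchanged when every check fails; that is a policy choice of the sketch, not something the process described above prescribes. Written for Python 3 on a Unix-like system.

```python
import os
import signal

def drop_unhealthy(upstreams, is_healthy):
    """Return a copy of {pool: [server, ...]} without the failing servers.

    `is_healthy` is any callable taking a "host:port" string and returning
    True or False.
    """
    result = {}
    for pool, servers in upstreams.items():
        healthy = [s for s in servers if is_healthy(s)]
        # An empty upstream block is a config error in nginx, so fall back
        # to the original list when nothing passes (assumption of this sketch).
        result[pool] = healthy if healthy else list(servers)
    return result

def reload_nginx(pid_file="/var/run/nginx.pid"):
    """Send SIGHUP to the nginx master process so it re-reads its config."""
    with open(pid_file) as f:
        pid = int(f.read().strip())
    os.kill(pid, signal.SIGHUP)
```

The filtered dictionary would then be written back to the config before reload_nginx() is called.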

> Meanwhile, though, it would probably be cool if someone did a library for parsing semicolon-terminated-lines-with-{}-blocks, since we never seem to run out of config formats that use them :(

I agree, though I can’t complain that much. A lot of admins I have worked with puke all over themselves when they see a mess of XML. I’m sure they would much prefer a nice clear tree of options like the nginx config. Ultimately, though, I’d prefer to see a YAML config, or even something like what ConfigObj lets you do with a nested ini-style setup, though I feel like even that gets yucky the more you nest.

drewp 27.10.2008 13:01

I think we’re using ‘generate’ differently. I’m going to have an nginx.conf that includes upstreams.conf. upstreams.conf will always be generated by a program based on a data file that I write and maybe some other information that the program can gather automatically. It will never be hand-edited; the only program to ever read it will be nginx. I won’t check upstreams.conf into any revision control just like I don’t check .o or .pyc files into revision control.
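
The generating side of this can be sketched in a few lines. The function name render_upstreams and the dictionary layout are assumptions for illustration; the data could equally come from YAML, ConfigObj or any other source. Written for Python 3.

```python
def render_upstreams(pools):
    """Render {name: [server, ...]} as the text of an upstreams.conf file."""
    lines = []
    for name in sorted(pools):          # sorted so the output is stable
        lines.append("upstream %s {" % name)
        for server in pools[name]:
            lines.append("    server %s;" % server)
        lines.append("}")
    return "\n".join(lines) + "\n"
```

The result would be written to upstreams.conf and pulled into the http block of nginx.conf with an `include upstreams.conf;` directive.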

The data files that my program reads, of course, can be YAML or ConfigObj or whatever I want. The answer to that for me nowadays is always RDF/turtle, because it’s general enough to handle any data structure, you can seamlessly split the data into any number of files (or even databases), you can seamlessly merge the data from multiple files (or even entire projects), it’s never ambiguous, it includes a straightforward system for documentation, etc.

Benjamin Smith 12.11.2008 16:08

> I think we’re using ‘generate’ differently. I’m going to have an nginx.conf that includes upstreams.conf. upstreams.conf will always be generated by a program based on a data file that I write and maybe some other information that the program can gather automatically. It will never be hand-edited; the only program to ever read it will be nginx. I won’t check upstreams.conf into any revision control just like I don’t check .o or .pyc files into revision control.

Yes, we are doing things a bit differently.

Like your approach, I have recently decided to separate my upstreams and include them from the base nginx configuration. I will be generating a simple version of the configurations based on some values from my configuration engine (think cf, or bcfg2), and those will live in version control in the event of a catastrophe where a complete rebuild is needed. Beyond that, however, the upstreams will be managed via my process, which will make decisions on what should be in upstreams.conf for various ‘pools’. It will do this with basic health checks (extensible) and handlers (also extensible).

One question I’m struggling with is how to handle the file if it does get hand-edited. I’m currently checking whether the file has been modified, and I have a property that gets set when my code is the one that edited it. If it gets hand-edited, I’m torn between reloading it completely and rewriting the file from the values that my program has.
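
One way to detect an outside edit, sketched under assumptions: instead of a flag, remember a digest of what the program last wrote and compare it with the file on disk. The class name ManagedFile and its methods are hypothetical and not taken from the actual application. Written for Python 3.

```python
import hashlib

def digest(path):
    """Return the SHA-256 hex digest of the file's current contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

class ManagedFile(object):
    """Remembers what this program last wrote so outside edits can be spotted."""

    def __init__(self, path):
        self.path = path
        self.last_written = None        # digest of our most recent write

    def write(self, text):
        with open(self.path, "w") as f:
            f.write(text)
        self.last_written = digest(self.path)

    def edited_by_hand(self):
        # True when the file no longer matches what we last wrote
        return (self.last_written is not None
                and digest(self.path) != self.last_written)
```

This only answers whether the file changed behind the program's back; whether to reload it or overwrite it is still the open policy question.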

> The data files that my program reads, of course, can be YAML or ConfigObj or whatever I want. The answer to that for me nowadays is always RDF/turtle, because it’s general enough to handle any data structure, you can seamlessly split the data into any number of files (or even databases), you can seamlessly merge the data from multiple files (or even entire projects), it’s never ambiguous, it includes a straightforward system for documentation, etc.

Neat! I like that! I’m a big fan of YAML..

Matt 26.10.2008 14:14

Yeah, I gotta agree with redbaron that some kind of parser library would be useful. Otherwise, your approach is fine, but if you want to make it more generalized, where you get a data structure that reflects the entire file, you’ll need some way of popping and pushing stuff off of a tokenized stream.
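
A minimal sketch of that push-and-pop idea, for a structure that reflects the entire file: tokens are read one at a time and a stack tracks which block is currently being filled. The names TOKEN_RE and parse_config and the shape of the returned tree are choices made for this illustration, and comments and quoted strings are deliberately not handled. Written for Python 3.

```python
import re

# A token is one of the punctuation marks ; { } or a run of other
# non-space characters.
TOKEN_RE = re.compile(r'[;{}]|[^\s;{}]+')

def parse_config(text):
    """Parse nginx-style text into a nested list.

    A simple directive becomes a list of words: ['listen', '0.0.0.0:80'].
    A block becomes a (words, children) tuple: (['upstream', 'pool1'], [...]).
    """
    root = []
    stack = [root]      # stack[-1] is the list for the block being filled
    words = []          # words of the directive currently being read
    for token in TOKEN_RE.findall(text):
        if token == ';':
            stack[-1].append(words)
            words = []
        elif token == '{':
            children = []
            stack[-1].append((words, children))
            stack.append(children)      # push: descend into the new block
            words = []
        elif token == '}':
            if len(stack) == 1:
                raise ValueError("unbalanced '}'")
            stack.pop()                 # pop: back to the enclosing block
        else:
            words.append(token)
    return root
```

Extracting the upstreams then becomes a walk over the tree, looking for blocks whose first word is `upstream`.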

A recent issue of Python magazine had a great walkthrough of pyparsing, and now I’m fairly comfortable with using it.

A few other tiny remarks:

Instead of using the __init__ method to load the file and then store the lines in a self.config_file attribute, I might write the parse_upstreams method as a freestanding function that takes a single parameter that is anything that quacks like a file. I would find it easier to work on parse_upstreams if I could feed it strings at an interactive session, like this:

from StringIO import StringIO
f = StringIO("some tricky text")
matt.parse_upstreams(f)

Could make it easy to write a lot of tests that way too.

Maybe I would write a version of parse_upstreams that takes a file and a different version that eats strings.
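
A sketch of how the two versions could look: one function accepts anything that yields lines, and a thin wrapper feeds it a string. The function names and the exact regular expressions are assumptions for illustration, and the sketch uses io.StringIO, the Python 3 home of the StringIO class imported above.

```python
import io
import re

def parse_upstreams(fileobj):
    """Return {name: [server, ...]} from any object that yields text lines."""
    upstreams = {}
    current = None                      # name of the block we are inside
    for raw in fileobj:
        line = raw.strip()
        opened = re.match(r'upstream\s+(\S+)\s*\{', line)
        if opened:
            current = opened.group(1)
            upstreams[current] = []
        elif current is not None and line == '}':
            current = None
        elif current is not None:
            server = re.match(r'server\s+([^;\s]+)', line)
            if server:
                upstreams[current].append(server.group(1))
    return upstreams

def parse_upstreams_string(text):
    """Convenience wrapper: the same parser, fed from a string."""
    return parse_upstreams(io.StringIO(text))
```

Because the string version is a one-line wrapper, the parsing logic lives in one place and tests can pass short strings without touching the filesystem.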

Also, how come you’re passing in a logging object as a param, rather than just using logging.getLogger to get a reference?
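
For reference, the getLogger approach might look like the sketch below. The logger name "ngxconfig" and the helper load_lines are made up for this example; logging.getLogger returns the same logger object every time it is called with the same name, which is why nothing has to be passed around. Written for Python 3.

```python
import logging

# Module-level logger; any other module calling getLogger("ngxconfig")
# gets this same object back.
log = logging.getLogger("ngxconfig")

def load_lines(path):
    """Return the stripped lines of `path`, or None (after logging) on error."""
    try:
        with open(path) as f:
            return [line.strip() for line in f]
    except IOError:
        log.error("Cannot process nginx config file: %s", path)
        return None
```

Handlers and levels are then configured once by the application, not by each class that logs.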

Unrelated: I really like the font size in this text area!

Benjamin Smith 27.10.2008 9:35

> Yeah, I gotta agree with redbaron that some kind of parser library would be useful. Otherwise, your approach is fine, but if you want to make it more generalized, where you get a data structure that reflects the entire file, you’ll need some way of popping and pushing stuff off of a tokenized stream.

I’m going to mock something up with pyparsing today, actually, based on the replies and suggestions here.

> A recent issue of Python magazine had a great walkthrough of pyparsing, and now I’m fairly comfortable with using it.

Awesome, I’ll hunt it out (I think I have the one you’re speaking of).

> A few other tiny remarks:
>
> Instead of using the __init__ method to load the file and then store the lines in a self.config_file attribute, I might write the parse_upstreams method as a freestanding function that takes a single parameter that is anything that quacks like a file.

I actually rewrote things a bit since this post: parse_upstreams became get_upstreams, a public method of the NgxConfig class, and I moved all file ops into that method. The reason for this is so I can call it directly from another part of my app. This class is only a small part of a larger app.

> I would find it easier to work on parse_upstreams if I could feed it strings at an interactive session, like this:
>
> from StringIO import StringIO
> f = StringIO("some tricky text")
> matt.parse_upstreams(f)

Neato, I’d heard of StringIO but hadn’t used it yet, I’ll play with that! Could make it easy to write a lot of tests that way too.

> Maybe I would write a version of parse_upstreams that takes a file and a different version that eats strings.

Why split the work in two? Clarity?

> Also, how come you’re passing in a logging object as a param, rather than just using logging.getLogger to get a reference?

Because I hadn’t read deep enough into the documentation to know that I could do that :)

> Unrelated: I really like the font size in this text area!

Me too!

Paul McGuire 10.11.2008 18:24

I’m glad to hear that pyparsing is an attractive option for your project. Here is a possible solution to your “upstream” block scanner, illustrating some pyparsing features that you might find useful. (Code assumes that the body of the config file has been stored in the string var nginx_config, perhaps using some statement like nginx_config = file("nginx.config").read().)

from pyparsing import Word,nums,Combine,alphas,alphanums,\
    Suppress,Keyword,OneOrMore,Group,ParseException

# function to create range validation parse actions
def validInRange(lo,hi):
    def parseAction(tokens):
        if not lo <= int(tokens[0]) <= hi:
            raise ParseException("",0,
                        "integer outside range %d-%d" % (lo,hi))
    return parseAction

# define basic building blocks
integer = Word(nums)
ip_int = integer.copy().setParseAction(validInRange(0,255))
ip_addr = Combine(ip_int + ('.'+ip_int)*3)
ip_port = integer.copy().setParseAction(validInRange(1025,65535))
ip_addr_port = ip_addr("ip_addr") + ':' + ip_port("ip_port")
ident = Word(alphas, alphanums+"_")

# define punctuation needed - but use Suppress so it does
# not clutter up the output tokens
SEMI,LBRACE,RBRACE = map(Suppress,";{}")

# define a server entry that will be found in each upstream block
server_def = Keyword("server") + ip_addr_port + SEMI

# define an upstream block
upstream_block = Keyword("upstream") + ident("stream_id") + \
    LBRACE + OneOrMore(Group(server_def))("servers") + RBRACE

# now scan through the string containing the nginx config
# data, extract the upstream blocks and their corresponding
# server definitions - access tokens using results names as
# specified when defining server_def and upstream_block
for usb in upstream_block.searchString(nginx_config):
    print usb.stream_id
    for srvr in usb.servers:
        print srvr.ip_addr, srvr.ip_port
    print

This prints out:

pool1
10.10.240.48 8888
10.10.231.159 8888

pool2
10.10.240.48 8889
10.10.231.159 8889
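
For comparison only, the same range checks can be mirrored with the standard library. This sketch is not part of the pyparsing solution: the function name parse_server is made up, the default port range simply copies validInRange(1025,65535) from the grammar above, and it requires Python 3 for the ipaddress module.

```python
import ipaddress

def parse_server(value, lo=1025, hi=65535):
    """Split "a.b.c.d:port" into (ip, port); raise ValueError when invalid."""
    host, sep, port_text = value.rpartition(":")
    if not sep or not port_text.isdigit():
        raise ValueError("expected ip:port, got %r" % value)
    # IPv4Address raises AddressValueError (a ValueError subclass) when an
    # octet is missing or outside 0-255.
    ipaddress.IPv4Address(host)
    port = int(port_text)
    if not lo <= port <= hi:
        raise ValueError("port outside range %d-%d" % (lo, hi))
    return host, port
```

Unlike the grammar, this validates one value at a time and does nothing to locate server entries in the config text.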

As far as ease of inclusion of pyparsing goes, you have a couple of options. As stated above, you can add it as a line to your setup.py. Or, if you like, just include the single source file pyparsing.py in the same directory as your parser code. I’ve intentionally kept the pyparsing source to a single file, so that projects can include it directly with their source code, with little package muss/fuss. (Matplotlib does this, for instance.) But either way works, according to your own preference.

Cheers, — Paul

Benjamin Smith 12.11.2008 16:46

> I’m glad to hear that pyparsing is an attractive option for your project. Here is a possible solution to your “upstream” block scanner, illustrating some pyparsing features that you might find useful. (Code assumes that the body of the config file has been stored in the string var nginx_config, perhaps using some statement like nginx_config = file("nginx.config").read().)

Thanks for taking the time to point me in the right direction with pyparsing, it is much appreciated and enlightening!

Quick questions to verify that I’m understanding the code..

> ip_int = integer.copy().setParseAction(validInRange(0,255))
> ip_addr = Combine(ip_int + ('.'+ip_int)*3)

This defines what the parse engine should expect for a valid IP?

> ip_port = integer.copy().setParseAction(validInRange(1025,65535))
> ip_addr_port = ip_addr("ip_addr") + ':' + ip_port("ip_port")

And this defines what to expect for the port?

> ident = Word(alphas, alphanums+"_")

This would be a generic alphanumeric string, including _?

Your code is really nice and clean, no regexps, bonus!

> As far as ease of inclusion of pyparsing goes, you have a couple of options. As stated above, you can add it as a line to your setup.py. Or, if you like, just include the single source file pyparsing.py in the same directory as your parser code. I’ve intentionally kept the pyparsing source to a single file, so that projects can include it directly with their source code, with little package muss/fuss. (Matplotlib does this, for instance.) But either way works, according to your own preference.

I’ll probably go with the single source file installation and package it with my project.

Thanks again for taking the time to enlighten me on how I can benefit from pyparsing and distribute it with my project.

Paul McGuire 13.11.2008 0:29

As you guessed, yes, yes, and yes. Send a note when you get your application done, and I’ll add you to the “Who’s Using Pyparsing” page on the wiki.


Comment form for «A cleaner way of extracting block of text from file?»